Huge malware dataset gives AI-driven threat detection a major boost

Late last year, global cybersecurity vendor Sophos announced it was sharing various tools and technologies in a bid to promote wider use of AI-based cybersecurity systems. This includes what Sophos describes as the first production-scale malware research dataset available to the general public.

Here’s a closer look at these new offerings, and at how far threat detection is currently being reshaped by artificial intelligence.

What has Sophos made available?

Sophos has opened up the following items: 

  • SOREL-20M: a large dataset, including 10 million disarmed malware samples, for the purposes of malware detection research.
  • AI-powered Impersonation Protection Method: designed to shield against email spearphishing attacks.
  • Digital Epidemiology to Determine Undetected Malware: a set of statistical models for estimating the prevalence of malware infections.
  • YaraML Automatic Signature Generation Tools: a much faster, AI-based method of signature generation for the detection of malware.

According to a press release from Sophos, the company is making these assets generally available in order to “open its data science breakthroughs and make the use of AI in cybersecurity more transparent, all with the aim of better protecting organizations against all forms of cybercrime.” 

Here’s a closer view of the various assets in context, and what they tell us about the growing use of AI in cybersecurity. 


SOREL-20M

Sophos worked with threat intelligence specialist ReversingLabs to create SOREL-20M. It is a production-scale dataset covering 20 million Windows Portable Executable (PE) files, including 10 million disarmed malware samples that are available to download for research and feature extraction. 

The dataset should help organizations create and optimize their own machine learning (ML) based threat detection tools. 

ML-based tools have the ability to “learn without being explicitly programmed”. This is especially useful in the field of malware detection, where novel, previously unseen threats are appearing all the time. Through machine learning, threat detection tools are able to algorithmically ‘reason’ and then identify the properties of previously unseen malware samples. 

However, this is a heavily data-driven approach. The more sample data a solution is fed, the more effective it becomes at recognizing malicious code. 
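To make the data-driven idea concrete, here is a minimal Python sketch of learning from labelled samples. It is an illustration only: the two features (byte entropy and the share of printable bytes), the nearest-centroid rule, the toy training samples and the "benign"/"malicious" labels are all assumptions made for this example, not the SOREL-20M feature set or any Sophos model.

```python
import math
from collections import Counter


def byte_features(data: bytes) -> list[float]:
    """Return [entropy in bits per byte, fraction of printable ASCII bytes]."""
    if not data:
        return [0.0, 0.0]
    n = len(data)
    counts = Counter(data)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    printable = sum(1 for b in data if 32 <= b < 127) / n
    return [entropy, printable]


def train(samples):
    """samples: iterable of (bytes, label). Returns {label: mean feature vector}."""
    by_label = {}
    for data, label in samples:
        by_label.setdefault(label, []).append(byte_features(data))
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in by_label.items()
    }


def predict(model, data: bytes) -> str:
    """Assign the label whose centroid is closest to the sample's features."""
    feats = byte_features(data)
    return min(model, key=lambda label: math.dist(feats, model[label]))


# Toy training set (hypothetical): text-like blobs stand in for benign files,
# high-entropy blobs stand in for packed or encrypted payloads.
training = [
    (b"hello world " * 10, "benign"),
    (b"readme: install and run " * 5, "benign"),
    (bytes(range(256)), "malicious"),
    (bytes((i * 7 + 3) % 256 for i in range(512)), "malicious"),
]
model = train(training)
```

A real detector would use far richer features and millions of samples, and plenty of legitimate files are high-entropy too. The point is only that the model's behaviour comes from the examples it is given rather than from hand-written rules.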

But as Hacker News points out, suitable datasets for building ML models are hard to obtain. Real-world samples often contain protected personal information, private intellectual property and sensitive network infrastructure data. There is also the risk of putting malware into the hands of threat actors. 

SOREL-20M basically provides developers with a safe, ‘oven-ready’ dataset to work with. 

Impersonation Protection 

As part of its proprietary email filtering solution, Sophos offers an impersonation protection capability. This is designed to counter spearphishing, where influential people within an organization are impersonated (usually via a spoofed email address) to trick recipients into taking harmful action for the benefit of the attacker. 

The Sophos email tool scans incoming mail for display name variations associated with those users. Sophos has now shown its workings by opening up its AI-driven protection method, so that other developers can build similar tools. 
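As a rough illustration of what scanning for display name variations can look like, the Python sketch below flags a message when its display name closely resembles a protected user's name but comes from a different address. This is a simple string-similarity heuristic standing in for the idea, not Sophos's AI-driven method. The function names, the lookalike-character table, the 0.85 threshold and the example addresses are all assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical table of digits commonly swapped in for letters.
LOOKALIKES = str.maketrans({"0": "o", "1": "l", "3": "e", "5": "s"})


def normalize(name: str) -> str:
    """Lowercase, map lookalike digits to letters, collapse whitespace."""
    return " ".join(name.lower().translate(LOOKALIKES).split())


def looks_like_impersonation(display_name, sender_address, protected, threshold=0.85):
    """protected maps each real display name to that person's legitimate address.

    Returns True when the display name is near-identical to a protected name
    but the message was sent from some other address.
    """
    candidate = normalize(display_name)
    for real_name, real_address in protected.items():
        similarity = SequenceMatcher(None, candidate, normalize(real_name)).ratio()
        if similarity >= threshold and sender_address.lower() != real_address.lower():
            return True
    return False


protected_users = {"Jane Smith": "jane.smith@example.com"}
```

With this table, `looks_like_impersonation("Jane 5mith", "ceo@mail.example", protected_users)` is flagged, while genuine mail from `jane.smith@example.com` is not.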

Statistical modeling

As we’ve all been reminded recently, the steps you take to counter a virus depend in large part on how prevalent that virus happens to be. 

Sophos has built a set of statistical models for estimating the prevalence of malware infections. According to the company, these models also happen to be effective at identifying malicious “dark matter” that is easy to miss, as well as “future matter”; i.e. malware that is still being developed. Sophos has made this methodology publicly available. 
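To show the kind of arithmetic that sits behind a prevalence estimate, here is a small Python sketch of a standard epidemiological correction (the Rogan-Gladen estimator). It adjusts an observed detection rate for the fact that a scanner misses some infections. It is not the Sophos methodology. The function name and the example figures are assumptions chosen for illustration.

```python
def estimate_true_prevalence(detected, scanned, sensitivity, specificity=1.0):
    """Adjust an observed detection rate for imperfect detection.

    sensitivity: share of real infections the scanner catches (0 to 1).
    specificity: share of clean machines correctly reported as clean (0 to 1).
    """
    if scanned <= 0:
        raise ValueError("scanned must be positive")
    denominator = sensitivity + specificity - 1.0
    if denominator <= 0:
        raise ValueError("scanner must perform better than chance")
    apparent = detected / scanned
    adjusted = (apparent + specificity - 1.0) / denominator
    # Clamp to a valid proportion.
    return min(1.0, max(0.0, adjusted))


# Hypothetical figures: 30 detections on 1,000 machines, with a scanner that
# catches 60% of infections and raises no false alarms.
estimate = estimate_true_prevalence(30, 1000, sensitivity=0.6)
```

Under those assumed figures the observed rate of 3% corresponds to an estimated true prevalence of about 5%. The gap is the undetected share.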

Automatic signature generation 

Most anti-malware tools deploy signature detection: identifying a telltale pattern (the signature) in malicious code. However, like biological viruses, malware strains can change over time, or they may be previously unseen, which limits the effectiveness of signature detection. 
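A bare-bones Python sketch of signature detection, and of why small changes defeat it, might look like this. The signature names and byte patterns are made up for the example and do not correspond to any real malware family.

```python
# Hypothetical signature database: family name -> byte pattern.
SIGNATURES = {
    "Demo.FamilyA": bytes.fromhex("deadbeef0badf00d"),
    "Demo.FamilyB": b"EVIL_MARKER_v1",
}


def scan(data: bytes, signatures=SIGNATURES) -> list[str]:
    """Return the names of all signatures whose pattern occurs in data."""
    return sorted(name for name, pattern in signatures.items() if pattern in data)


original = b"\x00" * 16 + bytes.fromhex("deadbeef0badf00d") + b"tail"
# The same payload with a single byte altered no longer matches.
variant = b"\x00" * 16 + bytes.fromhex("deadbeef0badf00e") + b"tail"
```

Here `scan(original)` reports `Demo.FamilyA`, while `scan(variant)` comes back empty even though only one byte differs.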

One way of dealing with this limitation is to create base signatures that can pick up entire classes or ‘families’ of malware. As Sophos explains, however, the creation of these signatures is a “laborious, manual process”. In response, the company has developed a new method for automatic signature generation, dubbed YaraML. 

This AI-driven tool automatically “writes” effective signatures rapidly, without the need for manual input. YaraML has now been made open source. 
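To give a feel for what automatically writing a signature involves, the Python sketch below builds a YARA-style rule from example files. It keeps printable strings that occur in every malicious sample and in no benign sample. This is a deliberately simple frequency heuristic and not YaraML's algorithm. The function names, the sample bytes and the rule name are all hypothetical.

```python
import re
from collections import Counter


def printable_strings(data: bytes, min_len: int = 6) -> set[bytes]:
    """Extract maximal runs of printable ASCII at least min_len bytes long."""
    return set(re.findall(rb"[\x20-\x7e]{%d,}" % min_len, data))


def generate_rule(name, malicious, benign, min_len=6, max_strings=5):
    """Return YARA rule text built from strings unique to the malicious set."""
    benign_strings = set().union(*(printable_strings(b, min_len) for b in benign))
    counts = Counter()
    for sample in malicious:
        counts.update(printable_strings(sample, min_len))
    candidates = [
        s for s, c in counts.items()
        if c == len(malicious) and s not in benign_strings
    ]
    # Prefer longer strings, which are less likely to match by accident.
    chosen = sorted(candidates, key=lambda s: (-len(s), s))[:max_strings]
    if not chosen:
        raise ValueError("no distinguishing strings found")
    lines = [f"rule {name}", "{", "    strings:"]
    for i, s in enumerate(chosen):
        text = s.decode("ascii").replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'        $s{i} = "{text}"')
    lines += ["    condition:", "        all of them", "}"]
    return "\n".join(lines)


# Toy samples (hypothetical): both "malicious" blobs share a marker string,
# and one string (kernel32.dll) also appears in the benign file.
malicious_samples = [
    b"\x00\x01connect_c2_server\x00kernel32.dll\x00",
    b"\xffconnect_c2_server\x00other__\x00kernel32.dll\x00",
]
benign_samples = [b"\x00kernel32.dll\x00hello world!\x00"]
rule_text = generate_rule("Demo_Family", malicious_samples, benign_samples)
```

The generated rule keeps only `connect_c2_server`, because `kernel32.dll` also appears in the benign file. A production tool has to make the same trade-off between catching a whole family and avoiding false positives, only at far greater scale.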

So will all this mean fewer jobs for security specialists?

The robots are highly unlikely to take over in the near future. The technologies that Sophos and others are developing should certainly help solutions such as antivirus, email filtering and systems scanning to become more powerful. 

But those solutions still have to be chosen, configured, deployed and monitored. Staff still have to be educated on good hygiene. Backup and recovery plans still have to be drawn up and tested regularly. In other words, AI complements human expertise - but it doesn’t replace it.    

If you are interested in a career in cyber security, try VIP membership to the StationX Cyber Security Career Development Platform.

  • Nathan House

    Nathan House is the founder and CEO of StationX. He has over 25 years of experience in cyber security, where he has advised some of the largest companies in the world. Nathan is the author of the popular "The Complete Cyber Security Course", which has been taken by over half a million students in 195 countries. He is the winner of the AI "Cyber Security Educator of the Year 2020" award and finalist for Influencer of the year 2022.

  • Dray says:

    Personally I think this is a leap forward in cybersecurity, malicious script are been developed daily for attack purposes hence a counter measure must be developed to mitigate against it. Just as the saying goes, since the birds decided to fly without perching so also the hunters are shooting without missing.

    Thank you Nathan for your constant update.

  • Cysecon says:


    I got such a good information on this topic its very interesting one. You made a good site.
