Huge malware dataset gives AI-driven threat detection a major boost

Late last year, global cybersecurity vendor Sophos announced it was sharing various tools and technologies in a bid to promote wider use of AI-based cybersecurity systems. These include what Sophos describes as the first production-scale malware research dataset available to the general public.

Here’s a closer look at these new offerings, and at how far threat detection is currently being reshaped by artificial intelligence.

Table Of Contents

What has Sophos made available?
So will all this mean fewer jobs for security specialists?

What has Sophos made available?

Sophos has opened up the following items:

SOREL-20M: a large dataset, including 10 million disarmed malware samples, for the purposes of malware detection research.
AI-powered Impersonation Protection Method: designed to shield against email spearphishing attacks.
Digital Epidemiology to Determine Undetected Malware: a set of statistical models for estimating the prevalence of malware infections.
YaraML Automatic Signature Generation Tools: a much faster, AI-based method of signature generation for the detection of malware.

According to a press release from Sophos, the company is making these assets generally available in order to “open its data science breakthroughs and make the use of AI in cybersecurity more transparent, all with the aim of better protecting organizations against all forms of cybercrime.”

Here’s a closer view of the various assets in context, and what they tell us about the growing use of AI in cybersecurity.

SOREL-20M

Sophos worked with threat intelligence specialist ReversingLabs on the creation of the SOREL-20M project. It consists of a production-scale dataset covering 20 million Windows Portable Executable (PE) files, including 10 million disarmed malware samples that are available for download for use in research and feature extraction.

The dataset should help organizations create and optimize their own machine learning (ML) based threat detection tools.
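To give a flavor of what "feature extraction" can mean for a PE file, here is a minimal Python sketch that turns raw bytes into a normalized byte histogram plus a Shannon entropy value. Byte statistics of this kind are a common ingredient in ML-based malware detectors, but treat this purely as an illustration: the function names are hypothetical, and this is not the feature set that ships with SOREL-20M.

```python
import math
from collections import Counter


def byte_histogram(data: bytes) -> list[float]:
    """Return the relative frequency of each of the 256 possible byte values."""
    if not data:
        return [0.0] * 256
    counts = Counter(data)
    total = len(data)
    return [counts.get(value, 0) / total for value in range(256)]


def shannon_entropy(data: bytes) -> float:
    """Return the entropy of the byte distribution in bits per byte (0 to 8)."""
    return sum(-p * math.log2(p) for p in byte_histogram(data) if p > 0)


def extract_features(data: bytes) -> list[float]:
    """Concatenate histogram and entropy into one 257-value feature vector."""
    return byte_histogram(data) + [shannon_entropy(data)]


# Illustrative input only: a few bytes standing in for the start of a PE file.
features = extract_features(b"MZ\x90\x00\x03\x00\x00\x00")
print(len(features))
```

Each file becomes a fixed-length list of numbers, which is the form most ML models expect. Packed or encrypted executables often show high entropy, which is one reason a value like this can be a useful signal.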

ML-based tools have the ability to “learn without being explicitly programmed”. This is especially useful in the field of malware detection, where novel, previously unseen threats are appearing all the time. Through machine learning, threat detection tools are able to algorithmically ‘reason’ and then identify the properties of previously unseen malware samples.

However, this is a heavily data-driven approach. The more sample data a solution is fed, the more effective it becomes at recognizing malicious code.
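The idea of learning from labeled samples rather than from hand-written rules can be sketched in a few lines of Python. The example below uses a nearest-centroid classifier, one of the simplest ML methods there is; real detectors use far larger models and far more data. The feature values and labels are made up for illustration.

```python
import math


def centroid(vectors: list[list[float]]) -> list[float]:
    """Average a list of equal-length feature vectors component by component."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def train(samples: list[tuple[list[float], str]]) -> dict[str, list[float]]:
    """'Learn' one centroid per label from (features, label) training pairs."""
    by_label: dict[str, list[list[float]]] = {}
    for features, label in samples:
        by_label.setdefault(label, []).append(features)
    return {label: centroid(vectors) for label, vectors in by_label.items()}


def predict(model: dict[str, list[float]], features: list[float]) -> str:
    """Assign the label whose centroid is closest (Euclidean distance)."""
    return min(model, key=lambda label: math.dist(model[label], features))


# Made-up training data: [byte entropy, share of suspicious API imports].
training = [
    ([7.6, 0.60], "malicious"),
    ([7.2, 0.45], "malicious"),
    ([5.1, 0.05], "benign"),
    ([4.8, 0.10], "benign"),
]
model = train(training)

# A sample that never appeared in the training set is still classified.
print(predict(model, [7.4, 0.50]))  # malicious
```

Nothing in the code says what malware looks like; the decision boundary comes entirely from the training pairs, which is why the size and quality of the dataset matter so much.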

But as Hacker News points out, it is actually hard to get hold of suitable datasets in order to create ML models. This is due to the presence of protected personal information, private intellectual property and sensitive network infrastructure data. There is also the risk of putting malware into the hands of threat actors.

SOREL-20M basically provides developers with a safe, ‘oven-ready’ dataset to work with.

Impersonation Protection

As part of its proprietary email filtering solution, Sophos offers an impersonation protection capability. This is to counter the threat of spearphishing, i.e. where influential people within an organization are impersonated (usually via a spoofed email address) to trick recipients into taking harmful action for the benefit of the attacker.

The Sophos email tool scans incoming mail for display name variations associated with those users. Generously, Sophos has shown its workings by opening up its AI-driven protection method so that other developers can build similar tools.
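To make the idea of checking "display name variations" concrete, here is a rough Python sketch. It is not the Sophos method: it swaps the AI model for plain string similarity from the standard library's difflib, and the protected users, addresses and the 0.85 threshold are all hypothetical values chosen for the example.

```python
from difflib import SequenceMatcher
from email.utils import parseaddr
from typing import Optional

# Hypothetical list of protected users: display name -> legitimate address.
PROTECTED_USERS = {
    "Jane Smith": "jane.smith@example.com",
    "Raj Patel": "raj.patel@example.com",
}


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(name.lower().split())


def impersonated_user(from_header: str, threshold: float = 0.85) -> Optional[str]:
    """Return the protected user a From header appears to imitate, or None."""
    display_name, address = parseaddr(from_header)
    for user, real_address in PROTECTED_USERS.items():
        similarity = SequenceMatcher(
            None, normalize(display_name), normalize(user)
        ).ratio()
        # Flag look-alike names that do not come from the user's real address.
        if similarity >= threshold and address.lower() != real_address:
            return user
    return None


print(impersonated_user("Jane Smith <jane.smith@example.com>"))       # None
print(impersonated_user("Jane Smlth <ceo.office@mail.example.net>"))  # Jane Smith
```

A fixed similarity threshold like this is easy to evade and easy to trip by accident, which is part of the case for a trained model over a hand-tuned rule.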

Statistical modeling

As we’ve all been reminded recently, the steps you take to counter a virus depend in large part on how prevalent that virus happens to be.

Sophos has built a set of statistical models for estimating the prevalence of malware infections. According to the company, these models also happen to be effective at identifying malicious “dark matter” that is easy to miss, as well as “future matter”; i.e. malware that is still being developed. Sophos has made this methodology publicly available.
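The article does not describe the Sophos models in detail, so the Python sketch below illustrates the general idea of estimating what you cannot see with a different, classic technique: capture-recapture, in its Chapman form. It assumes two detectors that miss infections independently of each other, which is a strong assumption. The host IDs and counts are invented.

```python
def chapman_estimate(found_by_a: set[str], found_by_b: set[str]) -> float:
    """Estimate total population size from two independent, incomplete samples.

    Chapman form of the Lincoln-Petersen capture-recapture estimator:
    N ~ (n1 + 1) * (n2 + 1) / (m + 1) - 1, where m is the size of the overlap.
    """
    n1, n2 = len(found_by_a), len(found_by_b)
    overlap = len(found_by_a & found_by_b)
    return (n1 + 1) * (n2 + 1) / (overlap + 1) - 1


# Hypothetical host IDs flagged as infected by two independent detectors.
detector_a = {f"host-{i}" for i in range(0, 60)}   # 60 hosts
detector_b = {f"host-{i}" for i in range(40, 90)}  # 50 hosts, 20 shared with A

estimated_total = chapman_estimate(detector_a, detector_b)
observed_total = len(detector_a | detector_b)

print(f"observed infected hosts: {observed_total}")
print(f"estimated infected hosts: {estimated_total:.0f}")
print(f"estimated undetected: {estimated_total - observed_total:.0f}")
```

The smaller the overlap between the two detectors, the larger the estimated population that both of them missed.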

Automatic signature generation

Most anti-malware tools deploy signature detection, i.e. identifying the telltale patterns (the signature) of malicious code. However, like biological viruses, malware strains can change over time, or they may be previously unseen, which limits the effectiveness of signature detection.
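A stripped-down Python sketch shows both how signature matching works and why it is brittle. The signature names and byte patterns here are invented for the example, and real engines use much richer rule languages than a plain substring search.

```python
# Hypothetical signature database: family name -> byte pattern from known samples.
SIGNATURES = {
    "ToyDropper": b"\xde\xad\xbe\xef\x13\x37",
    "ToyKeylogger": b"GetAsyncKeyState_hook_v1",
}


def scan(data: bytes) -> list[str]:
    """Return the names of all signatures whose byte pattern occurs in the data."""
    return [name for name, pattern in SIGNATURES.items() if pattern in data]


known_sample = b"MZ" + b"\x00" * 16 + b"\xde\xad\xbe\xef\x13\x37" + b"payload"
# The same sample with a single byte of the pattern changed.
mutated_sample = known_sample.replace(b"\x13\x37", b"\x13\x38")

print(scan(known_sample))    # ['ToyDropper']
print(scan(mutated_sample))  # []
```

One changed byte is enough to slip past an exact pattern, which is the limitation that broader, family-level signatures try to address.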

One way of dealing with this limitation is through the creation of base signatures that can pick up entire classes or ‘families’ of malware. As Sophos explains, however, the creation of these signatures is a “laborious, manual process”. In response, the company has developed a new method for automatic signature generation, dubbed YaraML.

This AI-driven tool automatically “writes” effective signatures rapidly and without the need for manual input. YaraML has now been released as open source.
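To show what "writing" a signature automatically might look like, the Python sketch below builds the text of a YARA rule from labeled samples. It is not how YaraML works: in place of a trained model it uses a much cruder heuristic, keeping the printable strings that appear in every malicious sample and in no benign one. The function names and sample bytes are hypothetical.

```python
import re


def printable_strings(data: bytes, min_length: int = 6) -> set[str]:
    """Extract runs of printable ASCII at least min_length characters long."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_length
    return {match.decode("ascii") for match in re.findall(pattern, data)}


def generate_rule(name: str, malicious: list[bytes], benign: list[bytes]) -> str:
    """Build YARA rule text from strings shared by all malicious samples
    and absent from every benign sample. Needs at least one malicious sample."""
    shared = set.intersection(*(printable_strings(s) for s in malicious))
    seen_in_benign = set().union(*(printable_strings(s) for s in benign))
    candidates = sorted(shared - seen_in_benign)
    if not candidates:
        raise ValueError("no distinguishing strings found")
    lines = [f"rule {name}", "{", "    strings:"]
    for index, text in enumerate(candidates):
        escaped = text.replace("\\", "\\\\").replace('"', '\\"')
        lines.append(f'        $s{index} = "{escaped}"')
    lines += ["    condition:", "        all of them", "}"]
    return "\n".join(lines)


# Invented byte strings standing in for real files.
malicious = [
    b"\x00\x01connect_c2_server\x00kernel32.dll\x00AAAA",
    b"\x90\x90kernel32.dll\x00\xffconnect_c2_server\x00\x02",
]
benign = [b"\x00kernel32.dll\x00Hello, world!\x00"]

print(generate_rule("toy_family", malicious, benign))
```

The string kernel32.dll is dropped because it also shows up in the benign file, leaving only the string that actually separates the two groups. A model-driven approach makes that trade-off statistically over millions of samples instead of with strict set logic.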

So will all this mean fewer jobs for security specialists?

The robots are highly unlikely to take over in the near future. The type of technologies that Sophos and others are developing should certainly help solutions such as antivirus, email filtering and systems scanning to become more powerful.

But those solutions still have to be chosen, configured, deployed and monitored. Staff still have to be educated on good hygiene. Backup and recovery plans still have to be drawn up and tested regularly. In other words, AI complements human expertise - but it doesn’t replace it.

If you are interested in a career in cyber security, try VIP membership to the StationX Cyber Security Career Development Platform.
https://www.stationx.net/vip-membership


Nathan House

Nathan House is the founder and CEO of StationX. He has over 25 years of experience in cyber security, where he has advised some of the largest companies in the world. Nathan is the author of the popular "The Complete Cyber Security Course", which has been taken by over half a million students in 195 countries. He is the winner of the AI "Cyber Security Educator of the Year 2020" award and finalist for Influencer of the year 2022.

Dray says:

May 23, 2021 at 2:21 pm

Personally I think this is a leap forward in cybersecurity. Malicious scripts are being developed daily for attack purposes, hence countermeasures must be developed to mitigate them. As the saying goes, since the birds have decided to fly without perching, so the hunters are shooting without missing.

Thank you Nathan for your constant update.

Cysecon says:

February 18, 2022 at 10:07 am

hi…..

I got such good information on this topic; it's a very interesting one. You made a good site.
