Your work phone rings. You answer to be greeted by the unmistakable voice of your CEO. She asks you to wire over some cash, so you do it right away. After all, why would you second-guess the boss?
A new threat technique may soon give you pause for thought. Welcome to the world of deepfake audio, in which scammers use AI software to clone the voices of key figures within an organization. From there, the fraudsters can hoodwink a company’s minions into doing their bidding.
Here’s what we know so far, and what you can do to minimize the risks.
Is voice cloning a real threat yet?
At this stage, deepfake voice scamming is an emerging rather than an established threat. But The Wall Street Journal recently ran a story of a $200k+ scam that illustrates what we could soon be up against. Here’s what happened…
- The CEO of an unnamed UK-based energy firm received a telephone call from someone he thought was his boss: the chief executive of the firm’s German parent company.
- The caller told the UK CEO to transfer the equivalent of $243,000 to a Hungarian supplier within the hour.
- The fraudster behind this incident appears to have used AI-based software to mimic the German CEO’s voice over the phone. The UK CEO recognized his boss’s slight German accent and the distinct cadences in his voice.
- After the money was transferred to the bank account in Hungary, it was subsequently sent on to Mexico before being distributed to other locations.
Pindrop, a company specializing in detecting voice fraud, says that it has come across around a dozen instances of deepfake audio scams so far.
How does it work?
The technology behind this type of scam is advancing quickly.
For instance, in July last year, the BBC reported findings from the security firm Symantec, which at that time had seen three cases of deepfake audio being used to trick senior executives.
But as the article pointed out, mounting such a scam would require hours of good-quality audio - things like corporate videos, media appearances and conference keynote presentations. That’s because you generally need lots of speech data to train and hone a convincing voice model. The technique was doable in theory, but both expensive and time-consuming to put into practice.
Shortly after this, however, news came through of an open-source GitHub project that enables anyone to clone a voice from as little as five seconds of sample audio. You simply feed in a short voice sample, and the model can deliver text-to-speech utterances in that voice right away.
The results are reportedly impressive even with a single short sample, though the more audio you can feed into the model, the better. If your CEO stars in even a very short YouTube video, that clip alone might supply enough raw material for an entirely convincing scamming tool.
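To make the mechanics concrete, here is a toy sketch of the three-stage dataflow that few-shot voice cloners of this kind typically use (speaker encoder, synthesizer, vocoder). The function bodies below are deliberately inert stubs - real systems replace them with trained neural networks - so this illustrates the pipeline’s shape, not a working cloner, and is not the API of any particular project.

```python
import numpy as np

def speaker_encoder(reference_audio: np.ndarray) -> np.ndarray:
    # Real encoders distill ~5 seconds of audio into a fixed-length
    # "voiceprint" embedding capturing timbre, accent and cadence.
    # Stub: deterministic random features derived from the input bytes.
    seed = abs(hash(reference_audio.tobytes())) % 2**32
    return np.random.default_rng(seed).standard_normal(256)

def synthesizer(text: str, voiceprint: np.ndarray) -> np.ndarray:
    # Real synthesizers condition a text-to-speech model on the voiceprint
    # to produce a mel spectrogram of the requested sentence in that voice.
    # Stub: an empty spectrogram of a plausible shape (80 mel bins).
    frames_per_char = 5
    return np.zeros((80, frames_per_char * len(text)))

def vocoder(mel_spectrogram: np.ndarray) -> np.ndarray:
    # Real vocoders render the spectrogram into an audio waveform.
    # Stub: silence, one hop of samples per spectrogram frame.
    hop_length = 200
    return np.zeros(mel_spectrogram.shape[1] * hop_length)

# Five seconds of 16 kHz reference audio is enough to seed the pipeline...
reference = np.zeros(5 * 16_000)
voiceprint = speaker_encoder(reference)
# ...after which arbitrary text can be rendered in the cloned voice.
forged = vocoder(synthesizer("Please wire the funds within the hour.", voiceprint))
print(forged.shape)
```

The key point for defenders is the first stage: once a voiceprint embedding exists, generating new utterances is cheap, which is why even a short public clip of an executive matters.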
Voice cloning is becoming ever more accessible, so it will come as little surprise if it is used far more frequently as a means of attack.
Tips for staying safe
Voice cloning is basically a high-tech variant of ‘vishing’ - i.e. voice phishing, where fraudsters use phone calls to scam people into handing over money or revealing personal information. The difference is that in a classic vishing attempt, the fraudster usually pretends to be from another company, and relies on you having no knowledge of the voice of the person they are impersonating.
Voice cloning, by contrast, relies on familiarity to lull you into a false sense of security: you recognize the voice, so you do what it says.
Other than telling your senior managers to take a vow of online silence, there’s not a lot you can do to prevent your organization from being targeted. That said, there are certain procedures you can follow to stop any scam attempt in its tracks.
Multi-factor authorization. Scams like these rest on a single assumption: that one voice instruction from one individual is all it takes for a particular action to be carried out. For sensitive actions such as financial transactions and the handing over of information, you should have set processes in place. These processes should involve more than one step, ideally across different channels. For a big transaction, for instance, stipulate that requests made over the telephone must always be backed up by an email message. It’s much less likely that a fraudster has both cloned a key insider’s voice and hacked their email account.
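As a minimal sketch of how such a rule might be encoded in an internal payments workflow - the threshold, channel names and approve function here are hypothetical, not any particular product’s API:

```python
from dataclasses import dataclass, field

@dataclass
class TransferRequest:
    requester: str
    amount_usd: float
    # Channels on which this request has been independently confirmed.
    confirmed_channels: set[str] = field(default_factory=set)

# Policy: a voice call alone is never enough for a large transfer.
REQUIRED_CHANNELS = {"phone", "email"}
LARGE_TRANSFER_USD = 10_000  # hypothetical threshold

def approve(request: TransferRequest) -> bool:
    """Approve only when every required channel has confirmed the request."""
    if request.amount_usd >= LARGE_TRANSFER_USD:
        missing = REQUIRED_CHANNELS - request.confirmed_channels
        if missing:
            print(f"Blocked: awaiting confirmation via {sorted(missing)}")
            return False
    return True

# A convincing phone call on its own does not clear the bar...
request = TransferRequest("boss (allegedly)", 243_000, confirmed_channels={"phone"})
assert approve(request) is False
# ...but a matching request from the boss's known email account does.
request.confirmed_channels.add("email")
assert approve(request) is True
```

The design point is that the channels must be independent: a cloned voice defeats the phone check, but not the email check, and compromising both at once is a much taller order.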
The rules apply to everyone. Voice cloning fraudsters make a further important assumption: that people will always do what the boss tells them. But if the boss is following the same rules as everyone else, they won’t be getting on the phone to order a cash transfer; they’ll be following the agreed procedure. An out-of-the-blue call would immediately be flagged as suspicious, no matter how convincing it sounded.
It’s a reminder that for security policies to work, you need buy-in from everyone within an organization, no matter how important they are!
Reader comments

This scam seems to be a new challenge for business organizations, and can be seen as a downside of AI.
That means preventing voice deepfakes will favour those organisations which are less autocratic, where subordinates do not automatically jump into “obedience mode” when the big boss makes demands.
Thanks for sharing! Could this be used to steal a banking identity? Some banks use voice to authenticate users. Sounds really bad!
Looks like we’re entering an era of “zero-trust communications”, where face, voice, and natural language deepfakes will require digitally signing everything on pre-approved channels.
Imagine having to revamp the modern telephone system to achieve this - or, better yet, designing an app that generates keys for users and makes it as easy as tapping the Phone app to get verification that the other parties are authenticated and are who they say they are.
It could be a Phone app-specific, PGP-like system that needs no configuration by the user and integrates with other phone systems used in business applications. Perhaps it could also play an inaudible, high-pitched preamble tone sequence at the start of each call to exchange the key. Spitballing pretty heavily here, but there could be a market for this.
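One toy sketch of that idea - a per-call challenge-response check, using a pre-shared key and HMAC in place of full PGP key pairs. Everything here is hypothetical; a real system would also need key distribution, per-user key pairs and replay protection:

```python
import hashlib
import hmac
import os

# Key provisioned once, when the app "makes keys for users" at install time.
PRE_SHARED_KEY = os.urandom(32)

def sign_challenge(key: bytes, nonce: bytes) -> bytes:
    # The caller's phone signs the callee's fresh nonce; the resulting tag
    # could ride in an inaudible preamble tone sequence at the call's start.
    return hmac.new(key, nonce, hashlib.sha256).digest()

def verify_caller(key: bytes, nonce: bytes, tag: bytes) -> bool:
    # The callee's phone recomputes the tag. A cloned voice alone cannot
    # forge it without also stealing the key.
    return hmac.compare_digest(sign_challenge(key, nonce), tag)

nonce = os.urandom(16)                       # fresh challenge for this call
tag = sign_challenge(PRE_SHARED_KEY, nonce)  # computed on the caller's phone
print(verify_caller(PRE_SHARED_KEY, nonce, tag))  # True: caller holds the key
```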