Blog

Feb 23, 2023

Resemble AI Creates Synthetic Audio Watermark to Tag Deepfake Speech

Posted in categories: biotech/medical, robotics/AI, security

Synthetic speech and voice cloning startup Resemble AI has introduced an “audio watermark” to tag AI-generated speech without compromising sound quality. The new PerTh (Perceptual Threshold) Watermarker embeds the sonic signature of Resemble’s synthetic media engine into a recording to mark its AI origin, surviving future audio manipulation while remaining too subtle for any human to hear.

Audio Watermarking

Visual watermarking hides one image within another; in particularly high-security documents, the embedded image is invisible without a computer scanner. The same principle applies to audio watermarks, except the hidden element is a very soft sound that people won’t notice, encoded with information a computer can decipher. The concept isn’t new, but Resemble has leveraged its audio AI to make PerTh more reliable without compromising the realism of its synthetic speech.
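To make the principle concrete, here is a minimal sketch of how a data-carrying tone can be hidden in audio at an amplitude far below the foreground signal and later recovered by a detector that knows where to look. This is not Resemble’s PerTh algorithm; the sample rate, mark frequency, segment length, and amplitude below are illustrative assumptions.

```python
import numpy as np

SR = 16_000        # sample rate in Hz (assumed for illustration)
MARK_FREQ = 6_000  # frequency of the hidden tone, chosen away from the toy signal's energy
SEG = 2_000        # samples per embedded bit
LEVEL = 0.002      # watermark amplitude, far below the foreground signal

def embed_bits(audio, bits):
    """Add a very quiet tone to each segment whose payload bit is 1."""
    out = audio.copy()
    t = np.arange(SEG) / SR
    tone = LEVEL * np.sin(2 * np.pi * MARK_FREQ * t)
    for i, b in enumerate(bits):
        if b:
            out[i * SEG:(i + 1) * SEG] += tone
    return out

def decode_bits(audio, n_bits):
    """Recover the payload by measuring energy at the mark frequency per segment."""
    bin_idx = int(round(MARK_FREQ * SEG / SR))  # FFT bin holding the mark tone
    threshold = LEVEL * SEG / 4                 # halfway to the tone's expected magnitude
    bits = []
    for i in range(n_bits):
        seg = audio[i * SEG:(i + 1) * SEG]
        magnitude = np.abs(np.fft.rfft(seg))[bin_idx]
        bits.append(int(magnitude > threshold))
    return bits

# Toy stand-in for speech: low-frequency tones, quiet at the mark frequency.
t_all = np.arange(SEG * 8) / SR
speech = 0.2 * np.sin(2 * np.pi * 220 * t_all) + 0.1 * np.sin(2 * np.pi * 440 * t_all)

payload = [1, 0, 1, 1, 0, 0, 1, 0]
marked = embed_bits(speech, payload)
print(decode_bits(marked, len(payload)))   # -> [1, 0, 1, 1, 0, 0, 1, 0]
```

A real system has to survive compression, resampling, and editing, which is exactly where the simple fixed-frequency tone above would fail and where Resemble’s approach differs.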

Quiet sounds are easily obliterated in most cases, but Resemble found a way to hide its identification tones within the sounds of speech itself. Since speech is the whole point of Resemble’s services, the audio watermark is much more likely to come through an edit unscathed. Resemble takes advantage of how humans tend to focus on specific frequencies and how louder sounds can hide quieter sounds that are close in frequency. That combination both keeps listeners from noticing the watermark and makes it difficult to isolate and strip out. Resemble’s machine learning model determines where to embed the quiet sonic tag, generates the appropriate sound, and puts it in place. A diagram in Resemble’s announcement illustrates how the watermark hides in plain sight, or plain sound in this case.
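The frequency-masking idea can be sketched roughly as follows: find the loudest component in a short frame of audio and park a much quieter tone right next to it, where the louder “masker” keeps it below the threshold of hearing. This is only an illustration of the psychoacoustic principle, not Resemble’s model; the frame size, frequency offset, and amplitude ratio are assumed values.

```python
import numpy as np

SR = 16_000            # sample rate in Hz (assumed)
FRAME = 1_024          # analysis frame length in samples (assumed)
MARK_OFFSET_HZ = 60    # place the mark just above the dominant frequency (assumed)
MARK_RATIO = 0.01      # mark amplitude as a fraction of the masker's amplitude (assumed)

def embed_masked_mark(frame):
    """Hide a quiet tone right next to the frame's loudest frequency component.

    The loud component acts as a masker: a much quieter tone close to it in
    frequency tends to fall below the threshold of hearing, while a detector
    that knows where to look can still measure it.
    """
    spectrum = np.fft.rfft(frame)
    freqs = np.fft.rfftfreq(FRAME, d=1 / SR)
    k = int(np.argmax(np.abs(spectrum)))          # dominant (masking) bin
    masker_amp = 2 * np.abs(spectrum[k]) / FRAME  # approximate amplitude of the masker
    mark_freq = freqs[k] + MARK_OFFSET_HZ         # park the mark just beside it
    t = np.arange(FRAME) / SR
    return frame + MARK_RATIO * masker_amp * np.sin(2 * np.pi * mark_freq * t)

# Toy frame: a strong 500 Hz component standing in for voiced speech, plus mild noise.
rng = np.random.default_rng(1)
t = np.arange(FRAME) / SR
frame = 0.3 * np.sin(2 * np.pi * 500 * t) + 0.01 * rng.standard_normal(FRAME)
marked = embed_masked_mark(frame)
```

Because the hidden tone rides alongside the speech energy itself rather than sitting in silence, edits that preserve the speech tend to preserve the mark as well, which is the property Resemble is after.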
