Extracting Audio from Silent Images: A Surprising Possibility

Researchers have found a way to extract audio from still images and even soundless videos. And guess what? It all started with a sci-fi TV show called Fringe.

In the show, the FBI manages to extract recorded sound from a melted pane of glass. One review called the idea a “ridiculous pseudo-science technique,” and it wasn’t entirely wrong. But Professor Kevin Fu, of Northeastern University’s electrical and computer engineering and computer science departments, saw that review and took it as a challenge. He set out to show that extracting audio from images and silent videos is possible after all.

So, how does it work? It turns out that cameras, though built to capture visual information, unintentionally pick up audio information too. Most camera phones have built-in image stabilization technology. The camera lens is suspended in liquid by springs, and an electromagnet moves the lens to counteract camera shake.

Here’s where it gets interesting. When someone or something makes a noise near the camera, the sound makes the lens and its springs vibrate ever so slightly, which bends the incoming light by a tiny amount. The effect is so subtle that you wouldn’t notice it unless you were specifically looking for it.

Modern phone cameras have another feature that amplifies this effect: the rolling shutter. Instead of capturing all the pixels of an image simultaneously, the sensor reads them out row by row, many times over within a single photo. Each row is captured at a slightly different moment, so each one records where the lens was at that instant. As a result, the vibration is sampled more than a thousand times as often as it would be if each image counted as a single sample, and that is what gives the recovered audio its granularity.
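To get a feel for why reading row by row matters so much, here is a small Python sketch. It is not the researchers' code, just a toy model under stated assumptions: the frame size (1080 rows), frame rate (30 frames per second), and 500 Hz test tone are illustrative values we picked, the lens wobble is modeled as a pure sine wave, and the idle time between frames is ignored. The function names are made up for this example.

```python
import math


def row_sample_rate(rows_per_frame, frames_per_second):
    """Rows read out per second, ignoring any idle time between frames."""
    return rows_per_frame * frames_per_second


def simulate_row_shifts(tone_hz, rows_per_frame, frames_per_second):
    """Image shift seen by each row of one frame while the lens wobbles at tone_hz."""
    rate = row_sample_rate(rows_per_frame, frames_per_second)
    return [math.sin(2 * math.pi * tone_hz * row / rate)
            for row in range(rows_per_frame)]


def estimate_tone_hz(shifts, sample_rate):
    """Estimate the tone from the average spacing of upward zero crossings."""
    crossings = [i for i in range(1, len(shifts))
                 if shifts[i - 1] < 0 <= shifts[i]]
    if len(crossings) < 2:
        raise ValueError("need at least two upward zero crossings in the frame")
    rows_per_cycle = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return sample_rate / rows_per_cycle


ROWS, FPS, TONE_HZ = 1080, 30, 500  # illustrative values, not from the study

rate = row_sample_rate(ROWS, FPS)
shifts = simulate_row_shifts(TONE_HZ, ROWS, FPS)

print(f"samples per second, one per frame: {FPS}")
print(f"samples per second, one per row:   {rate}")
print(f"estimated tone: {estimate_tone_hz(shifts, rate):.0f} Hz")
```

With these made-up numbers, sampling once per frame could never follow a 500 Hz tone, while sampling once per row captures many points per cycle within a single frame. A real attack has to pull those tiny shifts out of actual image content, which is a much harder problem than this toy model suggests.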

Using this information, which is essentially a byproduct of how photographs are taken, the researchers developed a machine-learning algorithm named Side Eye. With Side Eye, they can recover audio, though fairly muffled, from almost any photo that contains light.

The team tested their system on 10 different smartphones. They were able to recognize spoken digits with 80.66 percent accuracy, identify the speakers with 91.28 percent accuracy, and determine the gender of the speakers with 99.67 percent accuracy.

Now, before you start panicking about the potential cybersecurity nightmare this could create, the team has already thought about that. They are exploring countermeasures such as stronger springs, lenses that lock in place, and randomizing the order in which the rolling shutter captures pixels.
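Why would randomizing the rolling shutter help? Here is a toy Python sketch of the idea, not the team's actual design. It assumes the sensor could read its rows in a shuffled order that an eavesdropper does not know, and it reuses illustrative numbers (1080 rows, 32,400 rows per second, a 500 Hz tone) that are ours rather than the study's. All function names are hypothetical.

```python
import math
import random


def capture_row_shifts(tone_hz, row_rate_hz, readout_order):
    """Return shifts[row]: the lens displacement at the moment each row was read."""
    shifts = [0.0] * len(readout_order)
    for time_index, row in enumerate(readout_order):
        shifts[row] = math.sin(2 * math.pi * tone_hz * time_index / row_rate_hz)
    return shifts


def roughness(signal):
    """Mean absolute jump between neighbouring rows (small means a smooth wave)."""
    jumps = [abs(b - a) for a, b in zip(signal, signal[1:])]
    return sum(jumps) / len(jumps)


ROWS, ROW_RATE_HZ, TONE_HZ = 1080, 32400, 500  # illustrative values only

sequential = list(range(ROWS))          # ordinary top-to-bottom readout
randomized = sequential[:]
random.Random(0).shuffle(randomized)    # hypothetical secret readout order

# An eavesdropper reads the rows top to bottom and hopes to see a smooth wave.
plain = capture_row_shifts(TONE_HZ, ROW_RATE_HZ, sequential)
scrambled = capture_row_shifts(TONE_HZ, ROW_RATE_HZ, randomized)

print(f"roughness, sequential readout: {roughness(plain):.3f}")
print(f"roughness, randomized readout: {roughness(scrambled):.3f}")
```

With an ordinary readout, neighbouring rows are neighbouring moments in time, so the wobble traces out a smooth wave down the image. Shuffle the order and the same samples land on rows that no longer line up in time, so anyone who doesn't know the order sees something much closer to noise.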

The team, though, is more interested in how this extracted audio could be used in legal cases. Imagine having an alibi and needing to prove in court that you were, or weren’t, somewhere at a particular time. With an authenticated video and a known timestamp, this technique could be a game-changer: if your voice can be recovered from the video, you were more than likely there.

If you’re as fascinated by this discovery as we are, you can read the full study on the preprint server arXiv. The research was also presented at the 2023 IEEE Symposium on Security and Privacy.