MIT Scientists Can Eavesdrop on Silent Video

This new technique reads tiny vibrations in a filmed object to recover the sound that caused them.

Now here’s something straight out of a sci-fi movie: scientists from the Massachusetts Institute of Technology have joined forces with researchers from Microsoft and Adobe to develop an algorithm that can analyze silent video and extract sound from it. And they developed the hell out of it!

The scientists recorded video of a plant and a bag of potato chips vibrating in response to sounds and music played through a loudspeaker in the room. Note that the camera records no sound at all, only the objects vibrating in response to it. Software then analyzes the silent video, measures vibrations too small to see with the naked eye, and extracts the sound from them.

This process works best with high-speed cameras, which can record over 2,000 frames per second and thus gather more information. (Compare that with most DSLRs and smartphones, which shoot at up to 60 fps.) High-speed cameras come with a debt-creating price tag, so this technology is not really accessible to the average consumer, but it’s still fascinating to witness what can only be described as the birth of a new technology.
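
To see why frame rate matters so much, here is a back-of-the-envelope sketch. It assumes each frame counts as exactly one sample of the vibration, so the usual sampling limit applies: a signal can only be captured up to half the sampling rate. The function name `max_recoverable_hz` is made up for illustration and does not come from the researchers.

```python
def max_recoverable_hz(frames_per_second: float) -> float:
    """Sampling (Nyquist) limit: half the sampling rate.

    Assumption: one frame = one sample of the object's vibration.
    """
    return frames_per_second / 2.0


# Compare a consumer camera with a high-speed camera.
for fps in (60, 2000):
    limit = max_recoverable_hz(fps)
    print(f"{fps} fps can capture vibrations up to about {limit:.0f} Hz")
```

Under that assumption, 60 fps tops out around 30 Hz, which is well below the pitch of a typical human voice, while 2,000 fps reaches about 1,000 Hz.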

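This is how MIT explains the process (the “Davis” mentioned below is Abe Davis, the MIT graduate student who is first author on the paper):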

That technique passes successive frames of video through a battery of image filters, which are used to measure fluctuations, such as the changing color values at boundaries, at several different orientations — say, horizontal, vertical, and diagonal — and several different scales.

The researchers developed an algorithm that combines the output of the filters to infer the motions of an object as a whole when it’s struck by sound waves. Different edges of the object may be moving in different directions, so the algorithm first aligns all the measurements so that they won’t cancel each other out. And it gives greater weight to measurements made at very distinct edges — clear boundaries between different color values.
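
The align-then-weight idea can be illustrated with a toy example. This is a simplified sketch, not the researchers’ code: it assumes each edge has already been reduced to a one-dimensional motion trace, it “aligns” traces only by flipping the sign of any trace that moves opposite to a reference, and the function name and the `edge_strengths` weights are hypothetical.

```python
import math


def combine_motion_signals(signals, edge_strengths):
    """Fuse per-edge motion traces into a single trace.

    signals: equal-length lists, one displacement trace per edge.
    edge_strengths: one non-negative weight per trace (sharper edge = larger).
    """
    total = sum(edge_strengths)
    if total <= 0:
        raise ValueError("at least one edge needs a positive weight")

    # Use the trace from the strongest edge as the reference direction.
    strongest = max(range(len(signals)), key=lambda i: edge_strengths[i])
    reference = signals[strongest]

    combined = [0.0] * len(reference)
    for trace, weight in zip(signals, edge_strengths):
        # An edge moving the opposite way correlates negatively with the
        # reference; flip it so the traces add up instead of cancelling.
        dot = sum(a * b for a, b in zip(trace, reference))
        sign = -1.0 if dot < 0 else 1.0
        for t, value in enumerate(trace):
            combined[t] += sign * weight * value

    # Weighted average: distinct edges count for more.
    return [value / total for value in combined]


# Toy data: two edges of one object, moving in opposite directions.
tone = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
left_edge = tone
right_edge = [-x for x in tone]

naive = [(a + b) / 2 for a, b in zip(left_edge, right_edge)]
fused = combine_motion_signals([left_edge, right_edge], [3.0, 1.0])

print("naive average peak:", max(abs(x) for x in naive))  # edges cancel out
print("aligned, weighted peak:", max(abs(x) for x in fused))
```

In the toy data the plain average of the two edges is silence, while the aligned and weighted version recovers the original tone.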

The researchers also produced a variation on the algorithm for analyzing conventional video. The sensor of a digital camera consists of an array of photodetectors — millions of them, even in commodity devices. As it turns out, it’s less expensive to design the sensor hardware so that it reads off the measurements of one row of photodetectors at a time. Ordinarily, that’s not a problem, but with fast-moving objects, it can lead to odd visual artifacts. An object — say, the rotor of a helicopter — may actually move detectably between the reading of one row and the reading of the next.

For Davis and his colleagues, this bug is a feature. Slight distortions of the edges of objects in conventional video, though invisible to the naked eye, contain information about the objects’ high-frequency vibration. And that information is enough to yield a murky but potentially useful audio signal.
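
Here is a rough sketch of the timing behind that trick. It assumes the sensor reads its rows at evenly spaced moments, and the numbers (1,080 rows, 60 fps) and the `readout_fraction` parameter are illustrative guesses, not figures from the researchers. The point is only that each row is captured at a slightly different time, so one frame holds many time samples instead of one.

```python
def row_sample_times(frame_index, rows, fps, readout_fraction=1.0):
    """Capture time (in seconds) of each sensor row within one frame.

    Assumption: rows are read one after another at even intervals, and the
    readout takes `readout_fraction` of the frame period.
    """
    frame_period = 1.0 / fps
    line_delay = readout_fraction * frame_period / rows
    start = frame_index * frame_period
    return [start + row * line_delay for row in range(rows)]


# Hypothetical consumer camera: 1,080 rows at 60 frames per second.
times = row_sample_times(frame_index=0, rows=1080, fps=60)
gap = times[1] - times[0]

print(f"time between neighbouring rows: {gap * 1e6:.1f} microseconds")
print(f"row samples per second during readout: {1.0 / gap:.0f}")
```

With these made-up numbers, the rows arrive tens of thousands of times per second even though the camera only delivers 60 frames. Real sensors are messier than this sketch, which fits MIT’s description of the result as “murky.”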

Law enforcement and forensics are just two fields that could benefit greatly from this technology. Pretty slick, huh? What do you think?

(via MIT)