American researchers have developed an algorithm that independently selects the soundtrack for a silent video
What’s going on
- A group of researchers from Carnegie Mellon University (Pennsylvania, USA) and the company Runway has created an algorithm that adds sound to silent video: based on what appears in the frame, a neural network automatically selects matching sounds.
- The tool is called Soundify. It works in three stages. At the first stage, the algorithm detects sound sources and classifies them; these can be specific objects or places with a characteristic background sound (a road, a cafe, and so on).
- The algorithm then searches the Epidemic Sound database, which contains about 90 sounds, for a matching sound. For each scene, Soundify selects the five most likely sound effects: one of them is enabled by default, and the user can turn on the others.
- At the second stage, the algorithm sets the time intervals during which each effect plays, based on when the object is in the frame.
- At the last stage, the neural network splits each scene into one-second segments and selects the volume for each segment so that the result sounds realistic.
- Soundify is meant to make it easier for editors to work with video that has no sound, primarily drone footage, since drones typically lack a microphone.
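To make the three stages more concrete, here is a minimal Python sketch of the logic described above. It is not Soundify's actual code: the function names, the 0.4 threshold, and all scores are hypothetical, and the scores stand in for whatever the neural network would output when matching a scene against sound labels.

```python
from typing import Dict, List, Tuple


def top_k_effects(scores: Dict[str, float], k: int = 5) -> List[str]:
    """Stage 1: return the k labels with the highest match score, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]


def active_intervals(per_second: List[float], threshold: float) -> List[Tuple[int, int]]:
    """Stage 2: return half-open (start, end) second ranges where the score meets the threshold."""
    intervals = []
    start = None
    for t, score in enumerate(per_second):
        if score >= threshold and start is None:
            start = t  # the sound source enters the frame
        elif score < threshold and start is not None:
            intervals.append((start, t))  # the sound source leaves the frame
            start = None
    if start is not None:
        intervals.append((start, len(per_second)))
    return intervals


def volume_curve(per_second: List[float], threshold: float) -> List[float]:
    """Stage 3: one volume value in [0, 1] per second; silent below the threshold."""
    peak = max(per_second, default=0.0)
    if peak <= 0:
        return [0.0] * len(per_second)
    return [s / peak if s >= threshold else 0.0 for s in per_second]


# Made-up scores for one scene (hypothetical, not real model output).
scene_scores = {"traffic": 0.81, "birds": 0.42, "wind": 0.40,
                "cafe": 0.12, "rain": 0.09, "sea": 0.03}
# Made-up per-second scores for the "traffic" label across a six-second scene.
traffic_per_second = [0.0, 0.4, 0.8, 0.8, 0.2, 0.4]

candidates = top_k_effects(scene_scores)   # five suggestions for the editor
default_effect = candidates[0]             # the one enabled by default
intervals = active_intervals(traffic_per_second, threshold=0.4)
volumes = volume_curve(traffic_per_second, threshold=0.4)

print(candidates, default_effect, intervals, volumes)
```

In this sketch the best-scoring effect plays only during the seconds when its source is judged to be in the frame, and it gets louder as the match score rises. A real system would also have to fetch and mix the audio files themselves, which is left out here.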
What does it mean
“Smart” algorithms have once again proved their effectiveness in working with massive amounts of data. In this case, the neural network can greatly simplify the painstaking and time-consuming work of selecting and editing sound for a video. Scientists from the Massachusetts Institute of Technology and the Stanford Laboratory had earlier tried to train a neural network to add sound to video. However, their artificial intelligence system could only generate sounds produced by contact with an object, and it made mistakes with fast motion.
Until recently, neural network researchers' interest in audio was largely limited to speech recognition systems: most of us are familiar with voice assistants such as Siri (Apple), Alexa (Amazon) and Alice (Yandex).
Earlier, artificial intelligence was also trained to generate images from a text description: in October 2021, Sber introduced the ruDALL-E neural network.