The act of creation is often a collaborative effort. This is especially true when creating something for TV or film. But in the times of Corona, a lot of show makers have had to resort to using their homes as studios, with mobile phones as their sole audio and video recorders. This means no green screens, no proper sets with a live audience, and no live mixing or audio post-production to add audience cheers, claps, reactions, and so on.
Let’s take The Daily Show for example. We here at Kubric are big fans of the show and of Trevor’s act! However, ever since The Daily Show studio was shut down and Trevor began broadcasting from home, the jokes just don’t land the same. Even in our darkest times we need comedy to help us walk through them. And production plays such a big role in how we perceive comedy and content, even when it's mostly hidden behind the scenes. The laughter tracks, the background music, and the audience sounds are all part of the experience of The Daily Show. A daily show without its production is just not the same. That got us thinking: is there a way to bring the studio and the audience to Trevor's home broadcasts and enhance the experience of making a show from home?
A simplified DAW
The current situation underscores the need to empower content creators with the power of a Digital Audio Workstation (DAW). The problem with DAWs, however, is that even people with a degree in music technology (like me!) get overwhelmed by the large number of features and functionalities they provide. There is a long learning curve associated with them. They work great for the target audience they cater to, which is professional music producers, electronic music makers, film music composers, etc., but for a person who just wants to put up a show on YouTube without the hassle of learning the A-Z of audio production, they just don’t work.
Someone like that needs a simplified version with only the absolutely essential ingredients. At Kubric, we are trying to develop tools that automate the repetitive and boring parts of content creation so that creators can focus on the creative aspects of the content. In this blog post, I will list those essential ingredients and talk about how Kubric automates them.
A very simple but extendable pipeline for audio editing would be -
- Separate out the audio track from video
- Noise Reduction
- Audio loudness leveling
- Dynamic range compression
- Adding background tracks and sound effects
- Final mix
I will show these transformations through a sample audio I recorded using my mobile phone -
Separate out the audio track
Like most other companies working with content, we also use ffmpeg a lot. And with ffmpeg, extracting audio from video is easy-peasy -
ffmpeg -i video_original.mp4 -vn audio_original.mp3
Noise reduction
As the name suggests, noise reduction is the process of reducing or suppressing noise in the audio signal. Noise could be present because the recording was made in a noisy environment or because faulty equipment was used. Objects like fans or machinery often produce static (stationary) noise, while wind, a TV, a barking dog, etc. produce dynamic noise, and we need to treat the two differently. We built two different types of noise filters here at Kubric -
i) Wiener filtering (pure signal processing based approach) -
Here, we pass the noisy audio signal through a voice activity detection (VAD) module, which labels the parts where a person is speaking and the parts where there is only background sound. The noise spectrum is then estimated from the non-speech regions detected by the VAD, and this estimate is subtracted from the whole audio signal in the frequency domain. Finally, the inverse Fourier transform of the frequency-domain signal gives back the denoised time-domain signal.
WebRTC has an open-source implementation of this type of noise reduction in C. It is intended for the voice-call streaming context, so it works pretty well in real time. Our audio service is in Python, so we compiled this code for Python using SWIG.
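As a rough illustration of this approach (and not the WebRTC/SWIG code we actually use), here is a minimal spectral-subtraction sketch in Python. The file names and parameters are placeholders, and a real VAD would provide the speech/non-speech labels; here the quietest frames simply stand in for the non-speech regions.

```python
import numpy as np
import soundfile as sf
from scipy.signal import stft, istft

# Load a mono recording (file name is just an example)
audio, rate = sf.read("audio_original.wav")

# Short-time Fourier transform: one magnitude/phase spectrum per frame
f, t, spec = stft(audio, fs=rate, nperseg=512)
mag, phase = np.abs(spec), np.angle(spec)

# Stand-in for VAD: treat the quietest 20% of frames as "noise only"
frame_energy = mag.mean(axis=0)
noise_frames = frame_energy < np.percentile(frame_energy, 20)

# Estimate the noise spectrum from the non-speech frames
noise_spectrum = mag[:, noise_frames].mean(axis=1, keepdims=True)

# Subtract the noise estimate in the frequency domain (floored at zero)
clean_mag = np.maximum(mag - noise_spectrum, 0.0)

# Inverse transform with the original phase to get the denoised signal
_, denoised = istft(clean_mag * np.exp(1j * phase), fs=rate, nperseg=512)
sf.write("audio_denoised.wav", denoised, rate)
```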
Denoised -
ii) Hybrid noise reduction system (Signal processing + Neural Networks based approach) -
You can replace some of the above signal processing modules, like VAD and noise spectral estimation, with neural networks. Something like this was done by RNNoise. Since noise comes in many different types and can also vary over time within an audio signal, it is better to let a neural network handle these situations and leave the more predictable parts of the algorithm to signal processing.
RNNoise takes in audio as raw bytes, so we wrote a Python wrapper on top of it to support a variety of audio formats (wav, mp3, aac, etc.). We have built an API-based system so the denoiser of choice can be selected.
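The wrapper itself is internal, but a minimal sketch of the same idea looks roughly like this: convert whatever the input format is to the 48 kHz, 16-bit mono raw PCM that RNNoise expects, run it through RNNoise, and convert the result back. The `rnnoise_demo` binary here is the example program from the RNNoise repository and is assumed to be built and on your PATH.

```python
import subprocess

def rnnoise_denoise(input_path: str, output_path: str) -> None:
    """Denoise any ffmpeg-readable audio file with RNNoise (sketch)."""
    raw_in, raw_out = "tmp_in.raw", "tmp_out.raw"

    # 1. Decode to 48 kHz, 16-bit, mono raw PCM (the format RNNoise expects)
    subprocess.run(["ffmpeg", "-y", "-i", input_path,
                    "-f", "s16le", "-ac", "1", "-ar", "48000", raw_in], check=True)

    # 2. Run the RNNoise example binary on the raw stream
    subprocess.run(["rnnoise_demo", raw_in, raw_out], check=True)

    # 3. Re-encode the denoised raw PCM into the desired output format
    subprocess.run(["ffmpeg", "-y", "-f", "s16le", "-ac", "1", "-ar", "48000",
                    "-i", raw_out, output_path], check=True)

rnnoise_denoise("audio_original.mp3", "audio_denoised.mp3")
```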
Denoised -
This does not seem to work well for this audio, so I will use the Wiener filtering method for the remaining steps of this blog.
Audio Loudness Leveling
Sometimes, while listening to different audio clips or even within a single clip, we need to adjust our system volume to compensate for varying loudness levels. In music, this may be a desired effect (the dynamics of a piece), but in broadcast/online content, the loudness levels should be more or less the same. Leveling is an important step in producing all kinds of content before releasing it for public consumption: it ensures that all your content has a consistent loudness, and it also acts as a limiter that prevents your audio from clipping at loud volumes.
We have developed two types of audio leveling APIs.
The first is based on the EBU R128 broadcast loudness standard, which was introduced because producers kept making their content louder than their competitors' so that it would stand out. The EBU R128 standard provides an algorithm to analyze sound intelligently, similar to how we actually hear it, i.e. it takes into account that we perceive frequencies between 1000 and 6000 Hz as louder than other frequency ranges.
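Our leveling API is internal, but if you want to try EBU R128-style loudness normalization yourself, the open-source pyloudnorm library is one way to do it. A small sketch, with the -23 LUFS broadcast target and the file names as examples:

```python
import soundfile as sf
import pyloudnorm as pyln

data, rate = sf.read("audio_denoised.wav")

# Measure the integrated loudness (BS.1770 / EBU R128 style measurement)
meter = pyln.Meter(rate)
loudness = meter.integrated_loudness(data)

# Normalize to the EBU R128 broadcast target of -23 LUFS
leveled = pyln.normalize.loudness(data, loudness, -23.0)
sf.write("audio_leveled.wav", leveled, rate)
```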
Volume leveled -
Here, it is clear that the system has leveled the volume for the whole audio, including the noisy parts, so it is important to reduce the noise as much as you can before this step.
Denoised and volume leveled -
In case the volume of the input audio varies significantly over time, we use a different audio normalization algorithm which divides the audio into chunks, analyzes the local loudness context of each chunk, and estimates per-chunk gain values. It then smooths the loudness transitions between chunks. This keeps loudness levels more or less the same without compromising the dynamic range.
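The chunk-based normalizer is also part of our internal API, but the core idea can be sketched with pydub: measure each chunk's loudness, compute a gain toward a target level, smooth the gains across neighboring chunks, and apply them. The chunk size, target level, and gain cap below are illustrative choices, not our production values.

```python
from pydub import AudioSegment

audio = AudioSegment.from_file("audio_denoised.wav")
chunk_ms, target_dbfs = 1000, -20.0

# Split into chunks and compute a per-chunk gain toward the target level
chunks = [audio[i:i + chunk_ms] for i in range(0, len(audio), chunk_ms)]
gains = [min(target_dbfs - c.dBFS, 15.0) for c in chunks]  # cap the boost for near-silent chunks

# Smooth each gain with its neighbors so chunk boundaries aren't audible
smoothed = []
for i in range(len(gains)):
    window = gains[max(0, i - 1):i + 2]
    smoothed.append(sum(window) / len(window))

# Apply the smoothed gains and stitch the chunks back together
leveled = chunks[0].apply_gain(smoothed[0])
for chunk, gain in zip(chunks[1:], smoothed[1:]):
    leveled += chunk.apply_gain(gain)
leveled.export("audio_chunk_leveled.wav", format="wav")
```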
Denoised and volume leveled -
Dynamic Range Compression
This restricts the volume range of an audio clip, which means that the whole clip will fit within a specified quietest and loudest level. It works by increasing the volume of the quieter parts of the audio while simultaneously reducing or limiting the peaks of the loudest sounds. Its main use is keeping loudness levels fairly constant, but a word of caution: too much compression leads to a boxed-in sound, which is not desirable. We have defined different levels of compression and provide them as presets.
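The presets live behind the same internal API, but pydub ships a simple compressor you can experiment with; the threshold and ratio below are just an example of a moderate preset.

```python
from pydub import AudioSegment
from pydub.effects import compress_dynamic_range

audio = AudioSegment.from_file("audio_leveled.wav")

# A moderate preset: compress everything above -20 dBFS at a 4:1 ratio
compressed = compress_dynamic_range(
    audio,
    threshold=-20.0,  # dBFS level where compression starts
    ratio=4.0,        # 4 dB over the threshold becomes 1 dB in the output
    attack=5.0,       # ms taken to react to a loud peak
    release=50.0,     # ms taken to recover afterwards
)
compressed.export("audio_compressed.wav", format="wav")
```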
Denoised and compressed -
Using all the steps so far, we have this audio (I have denoised the audio once more after the above steps) -
I want to add some background music to it now and also want to loop another audio track at the end of this one.
Adding sound effects and background music
Music and sound effects play a huge role in making a production come alive. Sound effects like applause, cheering, and chatter are a staple of humor-based shows. Many sounds are added simply to direct the focus to the right content. Adding the right sounds and background music makes consuming the content a much more engaging experience.
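A rough sketch of this step with pydub (file names and gain values are placeholders): loop a background-music bed under the processed speech at a lower volume, then append a looped outro at the end.

```python
from pydub import AudioSegment

speech = AudioSegment.from_file("audio_compressed.wav")
music = AudioSegment.from_file("background_music.mp3").apply_gain(-18)  # keep the bed quiet
outro = AudioSegment.from_file("outro_loop.mp3")

# Lay the music under the speech, looping it to cover the full duration
with_music = speech.overlay(music, loop=True)

# Append the outro, repeated three times as a loop section
final = with_music + (outro * 3)
final.export("audio_with_music.mp3", format="mp3")
```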
Processed track with background music (and a bonus loop section 🙂) -
Final mix
Once you have all the tracks you need for the final audio, you just need to place them in the mix with the correct relative levels and the correct relative positions. For this, you specify each track's gain and its L(eft)/R(ight) position in the stereo mix. You can also choose to equalize the audio, which means boosting or cutting certain frequencies in the separate tracks or in the overall mix. This is also where you decide whether to repeat parts of the audio, add silence in some places, or amplify other parts. This part is called mixing.
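As a minimal sketch of such a mix with pydub (the gains, pan positions, and file names are illustrative, and a simple high-pass filter stands in for proper EQ):

```python
from pydub import AudioSegment

voice = AudioSegment.from_file("audio_with_music.mp3")
applause = AudioSegment.from_file("applause.wav")

# Relative levels, pan positions, and a simple EQ on the applause track
voice = voice.pan(0.0)                                             # keep the voice centered
applause = applause.apply_gain(-10).pan(0.3).high_pass_filter(200)

# Place the applause in the mix five seconds in
mix = voice.overlay(applause, position=5000)
mix.export("final_mix.mp3", format="mp3")
```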
Now, you have your track ready for a release 😃