What is immersive audio and how does it work?

The hottest topic in audio right now is immersive and spatial audio. Just about every audio trade magazine and AES conference has featured presentations with the word “immersive” all over them. Spatial and immersive categories on streaming services like Apple Music, Tidal, Netflix, and Amazon HD have captivated listeners with the promise of a novel listening experience for old favorite songs. Luxury automotive companies now offer audio upgrades with immersive playback systems that are giving audiophiles goosebumps. Consumer delivery services are excited about this, and content creators are too.

You might expect a new and shiny thing like immersive to be incredibly complicated. But it really isn’t as difficult to grasp as most imagine. To be sure, there are new aspects to working in immersive that take a minute to digest, but successful engineers are always ready to learn new technology and adapt to fast changes in the market. Engineers who embrace immersive audio early will be the first to reap the rewards.

Why is now a good time to take the leap into immersive? Large labels like Capitol have already begun a remix of their entire stereo back catalogue into immersive formats—financial evidence that labels take immersive seriously. Many popular artists have already released immersive mixes, giving their fans a new way to enjoy their art. There are technical standards in place to help you build an immersive system with accuracy, but the creative rules haven’t yet been written. This early stage of evolution affords an opportune moment for creators, producers, and engineers to learn about immersive audio.

In this introduction, engineers who are familiar with how to mix in stereo and are curious about the world of immersive audio will gain a better understanding of what immersive and spatial audio are, and how to conceptualize the workflow. Readers are encouraged to think of this as a prep/pep-talk for a larger conversation about how to integrate immersive and spatial audio into a professional practice. It’s not as complicated as it seems, and it’s a lot of fun!

What is immersive audio?

The goal of an immersive system is to create an auditory experience that envelops the listener from every angle in three dimensions. This may be to recreate the experience of an event such as a basketball game, or an orchestra concert. It may be used in a movie to draw the viewer into a scene with ambient cues from above and behind. It can be used to imagine new sonic landscapes where instruments seem to defy the laws of physics, or present fantastic reimaginations of music that was made a long time ago. To envelop the listener, whether it be with precise realism or whimsical fantasy, one must first understand how the listener creates their own perception of the world around them.

Our hearing apparatus is already immersive. With our eyes closed we are still able to understand sound cues that arrive from any angle around our heads. We take advantage of this ability when we create immersive listening environments. Sounds that come out of speakers behind and above us are convincing because they are actually behind or above us. There is no special computer processing needed for our brains to be able to decipher directionality.

We are able to perform this trick because our ears and our brain work together as a team. The way our brains use ears to hear is similar to how we use eyes to see. Our vision is binocular, which means we have two eyes that each capture a slightly different view of the same scene. The brain compares the two views, and this comparative math is what we use to generate our sense of depth. Want a quick demonstration? Have a friend toss you a small ball from about six feet away and catch it with one hand. Not too hard. Now repeat the experiment, but cover one of your eyes. With one eye closed, it is harder to discern distance. You will likely notice that it’s difficult to gauge where the ball is in space as it comes towards you. We’ve modified your binocular vision into monocular vision. Fun!

The brain/ear team works in a similar fashion. When sound arrives at our two ears (left and right), the brain calculates differences in the time and amplitude of both signals to understand where the source of that sound is located. A sound directly in front of you, at the 12 o’clock position, will arrive at both ears at the same time and amplitude. As the sound moves to the left or right, it will arrive at one ear slightly sooner and louder. The brain does quick comparative math on the difference in time and amplitude and translates that data into an approximation of the location of the sound’s source without the help of the eyes. Vertical cues work a little differently: the folds of the outer ear color a sound differently depending on its elevation, and the brain reads that coloration as height. We call this system of hearing “binaural” (bi = two, aural = ear). Our binaural capabilities are what allow us to perceive vertical and lateral sound cues in the three-dimensional world around us. It’s a pretty cool trick, and it was critical to our survival as a species when we hadn’t yet reached the pinnacle of the food chain. Immersive audio and spatial audio were devised to take advantage of the natural way we receive and process sound.
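To put rough numbers on that comparative math, here is a minimal Python sketch of the arrival-time difference between the two ears. It uses the classic Woodworth spherical-head approximation, which is my choice for illustration and not something any particular immersive format prescribes. The head radius and speed of sound are assumed average values, and the function name is hypothetical.

```python
import math

SPEED_OF_SOUND_M_S = 343.0   # approximate speed of sound in air at room temperature
HEAD_RADIUS_M = 0.0875       # assumed average head radius; real heads vary


def interaural_time_difference(azimuth_deg: float) -> float:
    """Approximate arrival-time difference between the ears, in seconds.

    azimuth_deg is the source angle in the horizontal plane:
    0 = straight ahead (12 o'clock), 90 = directly to one side.
    The approximation is intended for angles from 0 to 90 degrees.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))


for angle in (0, 30, 60, 90):
    itd_microseconds = interaural_time_difference(angle) * 1e6
    print(f"{angle:>2} degrees -> {itd_microseconds:6.1f} microseconds")
```

Under these assumptions, a sound straight ahead produces no time difference at all, and a sound directly to one side arrives at the far ear well under a millisecond late. The brain works with differences that small.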

Understanding how we listen prepares us for understanding and engaging with the art and science of immersive audio. This review of the basics is enough to get the conversation started, but it will be beneficial for the reader to continue their own research on this topic. A search of the Journal of the Audio Engineering Society will reveal a wealth of peer-reviewed resources on the subject, and there are tons of videos and articles out there on the web that cover more advanced concepts relating to our aural cognitive perception. For now, we have reviewed what is necessary to define immersive audio, and understand its basic principles of operation.

So, what is “Immersive” audio, and how does it work?

Sometimes confused with “ambisonics” (which is one specific technique for capturing and reproducing a full-sphere sound field), immersive audio is a 3-D sound field created by a combination of lateral and overhead speakers. All immersive audio systems are by definition multi-channel, and must include speakers from above (height channels). You can visualize this as speakers pointing at you from several directions around your head at ear-level, with additional speakers pointing at you from above your ears. In immersive, all speakers are focused directly at one fixed point called the “listening position”. The exact placement of these speakers is directed by the immersive format you choose. Popular formats include Dolby ATMOS, Sony 360 Reality Audio, and Auro-3D. Each format has strict standards for where speakers should be placed in a room for optimized listening and content creation.

All immersive formats use dedicated software to emulate the 3D environment. The software consists of a virtual 3-dimensional box that represents the walls, ceiling and floor of the room you listen in. When the software is installed, the user inputs how many speakers are connected to the system, and their relative position and orientation in the room. Once calibrated, the software becomes a bridge between the virtual and physical spaces. The mix engineer can use the software to navigate and place sound elements in the virtual space. The software then figures out what combination of speakers in the room will best recreate that position in the physical listening environment.

The software is best thought of as a “renderer” that takes raw data like the position of a sound in the 3-D virtual environment, and renders it into a sonic experience for the mix engineer. The renderer also generates the master file which contains all of the audio and decoding information necessary for a consumer to play the immersive mix the way the mix engineer intended.

This concept is quite different from a traditional stereo workflow. When performing a mix in stereo, sounds are placed between two speakers: “left channel” and “right channel”. A pan knob is used to send the signal to the left, the right, or somewhere in between. When assigning the location of a sound in immersive software, the user operates controls that represent the lateral position (x), the depth position (y), and the vertical position (z) of the sound. The software then computes the speaker, or combination of speakers, that will most accurately recreate the sound in the room for the mix engineer. The placement and calibration process when the system is set up is critical because it ensures an intuitive relationship between the cubic digital space represented in the software and the physical space of the room.
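To make the idea of turning an (x, y, z) position into speaker levels concrete, here is a minimal Python sketch. It is not how Dolby, Sony, or Auro renderers actually work (their algorithms are proprietary). It uses a simple distance-based weighting that I chose for illustration: speakers closer to the virtual position get more of the signal. The layout, the names, and the `rolloff` parameter are all hypothetical.

```python
import math

# Hypothetical speaker layout in a normalized room:
# x = left/right, y = back/front, z = ear level/ceiling.
SPEAKERS = {
    "front_left":      (-1.0,  1.0, 0.0),
    "front_right":     ( 1.0,  1.0, 0.0),
    "rear_left":       (-1.0, -1.0, 0.0),
    "rear_right":      ( 1.0, -1.0, 0.0),
    "top_front_left":  (-1.0,  1.0, 1.0),
    "top_front_right": ( 1.0,  1.0, 1.0),
    "top_rear_left":   (-1.0, -1.0, 1.0),
    "top_rear_right":  ( 1.0, -1.0, 1.0),
}


def speaker_gains(position, speakers=SPEAKERS, rolloff=2.0):
    """Return a gain per speaker for a sound placed at position (x, y, z)."""
    weights = {}
    for name, location in speakers.items():
        distance = math.dist(position, location)
        # The small offset avoids dividing by zero when the sound
        # sits exactly on a speaker.
        weights[name] = 1.0 / (distance + 1e-6) ** rolloff
    # Scale so the squared gains sum to 1, keeping overall power constant
    # wherever the sound is placed.
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {name: w / norm for name, w in weights.items()}


gains = speaker_gains((-1.0, 1.0, 1.0))
print(max(gains, key=gains.get))  # top_front_left
```

A sound placed in the upper front left corner lands almost entirely in that speaker. Move it toward the middle of the room and the signal is shared across several speakers.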

Once the position and relative levels of all signals are set, the software captures the audio into a master file, and embeds positioning information as metadata. This master file is what is sent to the label, and ultimately distributed to consumers.
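One way to picture “audio plus positioning metadata” is as a list of sound objects, each pointing at its audio and carrying its coordinates. The Python sketch below is purely illustrative: the field names and the JSON layout are my own invention and do not represent the actual master file format of any immersive system.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class SoundObject:
    """One sound in the mix, with hypothetical positioning fields."""
    name: str
    audio_file: str
    x: float         # lateral position
    y: float         # depth position
    z: float         # vertical position
    level_db: float  # relative level set by the mix engineer


def build_master_metadata(objects):
    """Serialize the positioning data that would travel with the audio."""
    return json.dumps({"objects": [asdict(obj) for obj in objects]}, indent=2)


mix = [
    SoundObject("lead_vocal", "lead_vocal.wav", 0.0, 1.0, 0.0, -3.0),
    SoundObject("rain_ambience", "rain.wav", 0.0, 0.0, 1.0, -12.0),
]
print(build_master_metadata(mix))
```

The point of the sketch is that the positions stay separate from the audio itself, which is what lets playback software make its own speaker decisions later.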

There are a number of ways that consumers can experience an immersive audio file. Playback over speakers requires a receiver equipped with a decoder that matches the immersive format being streamed. When a consumer plays an immersive file, re-rendering software analyzes the metadata in the master file and manages the playback experience. The consumer’s receiver is aware of the total number of speakers connected to the system, and their relative placement. (More about speaker placements in a subsequent article.) The re-renderer then considers the speakers in the physical space and routes the sound to the closest possible position to the original coordinates. For example, if the playback system is not equipped with rear height channels, audio that had been placed there would be re-routed to the closest set of lateral speakers. If there are more height channels available, the placement of the sound will be preserved, but the division of speakers used to create that experience may change. In this way, the re-renderer is flexible enough to generate an adaptive listening experience in a variety of playback speaker configurations.
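The rear-height example can be sketched in a few lines of Python. This is a deliberately simplified stand-in for a real re-renderer: it sends a sound to the single closest speaker the consumer actually owns, where a real decoder would typically share the signal across several speakers. The layout and the function name are hypothetical.

```python
import math

# Hypothetical consumer layout with no rear height speakers.
# x = left/right, y = back/front, z = ear level/ceiling.
LIVING_ROOM = {
    "front_left":      (-1.0,  1.0, 0.0),
    "front_right":     ( 1.0,  1.0, 0.0),
    "rear_left":       (-1.0, -1.0, 0.0),
    "rear_right":      ( 1.0, -1.0, 0.0),
    "top_front_left":  (-1.0,  1.0, 1.0),
    "top_front_right": ( 1.0,  1.0, 1.0),
}


def closest_speaker(position, available_speakers):
    """Return the name of the available speaker nearest to position."""
    return min(
        available_speakers,
        key=lambda name: math.dist(position, available_speakers[name]),
    )


# The mix engineer placed this sound in the rear left height position.
intended_position = (-1.0, -1.0, 1.0)
print(closest_speaker(intended_position, LIVING_ROOM))  # rear_left
```

Because the metadata stores where the sound belongs and not which speaker played it in the studio, the same file can be adapted to whatever layout is available.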

The ability of the renderer to adapt to its host environment is remarkable. Smart speakers and sound bars equipped for immersive playback can recreate a version of an immersive mix without the operator having any knowledge of speaker placement or calibration. Consumer devices such as Amazon’s Echo Studio are a good example. The Echo Studio supports Dolby ATMOS and automatically calibrates an immersive playback experience using Alexa. Don’t let the canister shape fool you. It actually sports multiple speakers inside, including an upward-firing driver that bounces sound off the ceiling towards the listener, simulating the presence of height channels. This is pretty amazing for a product that costs only $200. If you know someone who has one, you should ask to take it for a spin!

We now know a lot more about what immersive audio is and the basics of the workflow involved for multichannel systems. But how does this work in headphones? Is there such a thing as immersive headphones?

The answer is, no, not yet. But you can experience a version of immersive audio on regular headphones in something called “spatial audio”. Spatial audio is the way most consumers are connecting with immersive content right now on streaming platforms like Apple Music, Tidal, and Amazon HD. Though closely related, spatial audio is not the same as immersive audio. We can understand the simple differences between these two related, but different, formats without getting too technical (for now).

So what is “spatial” audio?

The reader will recall that in an immersive environment, the listening position is fixed, and the listener will get binaural cues about where things are from the speakers around them. Spatial audio can be described as a simulated binaural picture of a multi-channel, immersive listening experience. More simply—spatialized audio attempts to deliver in two channels what the ears would be hearing if they were sitting in the fixed listening position of an immersive audio system. Headphones or earbuds are ideal for generating this experience because the binaural signal simulates the sound at the moment it arrives at your ears.
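As a toy illustration of squeezing a direction into two channels, the Python sketch below delays and attenuates the signal for the ear farther from the source. Real spatializers go much further and filter each ear’s signal to mimic the head and outer ear, so treat this strictly as a sketch. The constants are assumed values, the time-delay formula is the Woodworth spherical-head approximation, and the level-difference rule is a crude invention for demonstration.

```python
import math

SAMPLE_RATE = 48_000            # samples per second
SPEED_OF_SOUND_M_S = 343.0      # approximate speed of sound in air
HEAD_RADIUS_M = 0.0875          # assumed average head radius
MAX_LEVEL_DIFFERENCE_DB = 6.0   # assumed, illustrative only


def binauralize(mono, azimuth_deg):
    """Turn a mono signal into a (left, right) pair for headphones.

    azimuth_deg runs from -90 (hard left) through 0 (straight ahead)
    to 90 (hard right).
    """
    theta = math.radians(abs(azimuth_deg))
    # Time difference between the ears, converted to whole samples.
    itd_seconds = (HEAD_RADIUS_M / SPEED_OF_SOUND_M_S) * (theta + math.sin(theta))
    delay = round(itd_seconds * SAMPLE_RATE)
    # The far ear also hears the sound a little quieter.
    far_gain = 10 ** (-MAX_LEVEL_DIFFERENCE_DB * math.sin(theta) / 20)

    near_ear = list(mono) + [0.0] * delay
    far_ear = [0.0] * delay + [sample * far_gain for sample in mono]
    # A source on the right makes the right ear the near ear.
    return (far_ear, near_ear) if azimuth_deg > 0 else (near_ear, far_ear)


click = [1.0, 0.0, 0.0, 0.0]
left, right = binauralize(click, 90)
print(left.index(max(left)), right.index(max(right)))
```

With the click placed hard right, it shows up in the right channel immediately and in the left channel a few dozen samples later and quieter. Placed straight ahead, both channels are identical.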

Streaming software that offers spatialized content will generate the listening experience in one of two ways. For some platforms, like Tidal, which offers Dolby ATMOS, there is no specific “spatial” category. Any album offered in ATMOS will stream as an immersive multichannel file if connected to an immersive renderer. If streaming to headphones, the platform will use the Dolby ATMOS Renderer algorithm to generate a binaural (stereo) experience of the immersive mix.

Is spatial audio as good as immersive audio? The answer is no, and there are many reasons why. Although both binaural and spatial audio are generated in software that approximates our listening experience, spatial audio still falls short of accurately recreating the nuance of an immersive, multi-channel listening experience. Another challenge is that not all streaming services provide the same stereo interpretation of an immersive file. Consequently, a consumer will have a different experience of the same immersive mix when they switch platforms. As mentioned earlier, Tidal uses the proprietary Dolby ATMOS renderer to generate a binaural experience. Other platforms, like Apple Music, process the ATMOS master file with their own proprietary software that generates a different binaural stream. Apple has not yet shared how their special software works, nor is it integrated into the ATMOS renderer. This leaves engineers guessing as to how their immersive mixes will sound to those who choose to listen on Apple Music. The Apple spatialization process doesn’t just color the mix; it can sometimes dampen, misplace, or eliminate sounds in the immersive sound field. In fairness, the anomalies caused by the Apple algorithm can just as easily have a positive effect on a mix. To reiterate the problem, spatialized audio is inconsistent, and this is bad for artists, content creators, and consumers.

Another problem with binaural and spatial renders is the shape of our ears. Every one of us has a differently shaped outer ear, or pinna, and this influences how sounds are delivered to our brains. Because our brains get their own unique “picture”, everyone has their own specially individualized processing code. Out of necessity, software that generates binaural and spatial audio assumes a generic physical shape of the human ear. This is problematic. For one, we will all interpret this generic model differently, even when listening on the same platform. Some people may hear a sound placed in the rear clearly, where others might not hear that sound at all. There’s a diversity, equity, and inclusion issue that comes with this too. The generic model used to represent the ear is built upon a Eurocentric (Caucasian) physical model. This puts any person who isn’t white at a significant disadvantage.

A logical solution would be to scan our own ears and insert our own custom models into the re-rendering software. There are a few companies that offer this solution to consumers. Sony offers a free app that maps the physical features of your ears and develops a custom profile that you can upload to certain models of their headphones. Their goal is to enhance the experience of listening to music in their Sony 360 platform. Hopefully, this trend will continue and more consumer brands will offer a similar option. It is certainly a possibility now that new models of the iPhone come equipped with LiDAR sensors that generate 3D scans of objects.

The bottom line is that if you have tried listening to spatial mixes before and have felt underwhelmed, you are not alone. There are good reasons why your experience with spatial audio may not have lived up to the hype. Despite its limitations, consumers are eating spatial up. Apple recently released a report stating that spatial audio is the fastest growing area of Apple Music’s streaming service. Spatial may be inferior to the immersive experience, but the fact that consumers like it (without having a clue as to how it works) is an encouraging sign.

Conclusion:

Immersive is an exciting and fresh way to experience sound. It may seem complicated, but human evolution has already achieved the miraculous part. Thanks to our binaural perception, we enjoy a rich world full of engaging soundscapes. When you work in immersive audio, you can recreate these events in life-like detail, generate new experiences that spark the imagination, and bring new opportunities for artists and content creators. It’s a very rewarding format. Learning a new workflow is fun, and now is a great time to get in on the ground floor of this nascent technology. As you mull all of this over, I encourage those who haven’t done so to seek out a facility equipped with an immersive playback system and ask to take a test drive. Dive into spatial audio on your favorite streaming app and find out what’s out there.

“Believe what you see with your eyes, trust what you hear with your ears, know what you feel with your flesh” — Brian Staveley