High-end audio is not rocket science … but don’t you ever wonder what would happen if a proper rocket scientist were to apply his expertise to the field? I know I do. So I looked one up, and asked him.
Professor Edgar Choueiri is Director of Princeton University’s Program in Engineering Physics, and Director of Princeton’s Electric Propulsion and Plasma Dynamics Laboratory (EPPDyL). He is a tenured Full Professor in the Applied Physics Group at the Mechanical and Aerospace Engineering Department, and associated faculty at the Astrophysical Sciences Department/Program in Plasma Physics at Princeton University. He is also Director of Princeton’s 3D Audio and Applied Acoustics (3D3A) Laboratory and has been interested in audio, acoustics and classical music recording for many years. He has invented a new technique for producing tonally pure three-dimensional sound from two loudspeakers. The technique allows a listener to hear sounds located in 3D space, as they would be heard in real life.
RM. So you really are a rocket scientist?
EC. Yes I am! Actually, I run two laboratories at Princeton University, a plasma space propulsion laboratory, and also an applied acoustics laboratory where we specialize in the reproduction of three-dimensional audio sound fields.
RM. Good. So you’d be the right person to explain to me how it is that we perceive sound in three dimensions, and what it would take to be able to properly reproduce that full three-dimensionality using a high-end audio system. This is something that is close to the heart of most audiophiles. Perhaps you could start by explaining how we hear in 3D.
EC. We locate the source of a sound based on differences between how an individual sound is presented to our right and left ears, and in particular we rely on three types of cues. Because our two ears are located at different points in space, there will be a delay between when a given sound arrives at one ear and when it arrives at the other. We call this the Inter-aural Time Difference (ITD), and we can detect differences as short as 10μs. This lets us determine where a sound is coming from in the left-right dimension. The second cue is the Inter-aural Level Difference (ILD): the sound reaches the nearer ear at a higher level, and is attenuated, largely by the shadowing effect of the head, before it reaches the farther ear. The brain can detect ILD differences of 1dB or less.
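For readers who like to put numbers to these cues, here is a small Python sketch estimating the ITD using the classic Woodworth spherical-head approximation. The 10μs threshold comes from Choueiri’s remarks above; the head radius and the formula itself are standard textbook assumptions of mine, not his figures.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s in air at room temperature
HEAD_RADIUS = 0.0875    # m; a commonly assumed average head radius

def itd_woodworth(azimuth_deg: float) -> float:
    """Estimate the Inter-aural Time Difference (in seconds) for a source at
    the given azimuth (0 = straight ahead, 90 = directly to one side), using
    the Woodworth spherical-head approximation: ITD = (a/c)*(theta + sin(theta))."""
    theta = np.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + np.sin(theta))

for az in (0, 5, 30, 60, 90):
    print(f"azimuth {az:2d} deg -> ITD ~ {itd_woodworth(az) * 1e6:5.1f} us")

# A source directly to one side arrives roughly 660 us earlier at the near
# ear; a 10 us detection threshold therefore corresponds to an angular
# resolution on the order of one degree for sources near the front.
```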
RM. I can see how that helps us determine if something is to the left or to the right, but how about up and down? I’m pretty sure I am able to at least sense where things are located in the vertical dimension.
EC. If a sound source is directly in front of a person, the ITD and ILD will both be zero, so the ear/brain system clearly places the source as being straight in front of us. Yet the ear/brain system can also tell whether the source is at ear level, or if it is higher or lower, and neither ITD nor ILD can account for this, so clearly there is something else going on.
It turns out that we make use of spectral cues. As the sound makes its way to your ear canal, it interacts with your facial features, and in particular with the pinnae of your ears. These interactions apply a tonal coloration to the sound, and that coloration differs depending on where the sound source is located relative to your head. This is why our pinnae [the flappy bits of our ears!] have evolved to be asymmetric – the tops are not the same as the bottoms – which helps us to locate sounds in the vertical plane.
RM. OK, that covers width and height, but what about depth?
EC. There is one other cue that we make use of to assess the distance of a sound source, and that is the ratio of direct to reflected, or reverberant, sound. The direct sound falls off with the inverse square of distance – about 6dB for every doubling – whereas the reverberant sound level stays roughly constant throughout the room. So the ratio changes quite markedly with distance, and we interpret it very strongly as a depth cue.
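To see why this ratio is such a reliable depth cue, here is a sketch based on the standard statistical room-acoustics model: direct intensity falls as 1/r² while the reverberant level is roughly uniform. The room constant and source directivity below are illustrative assumptions of mine, not measured values.

```python
import math

def direct_to_reverberant_db(r_m: float, room_constant_m2: float = 50.0,
                             directivity: float = 1.0) -> float:
    """Direct-to-reverberant ratio (dB) at distance r_m, from the standard
    statistical model: direct intensity ~ Q / (4*pi*r^2), reverberant level
    ~ 4 / R, roughly uniform throughout the room. Q and R are assumptions."""
    direct = directivity / (4.0 * math.pi * r_m ** 2)
    reverberant = 4.0 / room_constant_m2
    return 10.0 * math.log10(direct / reverberant)

for r in (0.5, 1.0, 2.0, 4.0, 8.0):
    print(f"distance {r:3.1f} m -> D/R {direct_to_reverberant_db(r):6.1f} dB")

# The direct sound drops 6 dB per doubling of distance while the reverberant
# level stays put, so the ratio falls steadily with distance.
```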
RM. Can we measure any of these effects?
EC. We can easily measure all of these things with a small microphone inserted into our ear canals. And the result we get from your ears will be quite different from mine. It is like a fingerprint – we all seem to have a unique set of ears. If we measure the impulse response inside our ears using impulses located at all points in space – from points on a sphere around your head if you like – the result of that measurement is called the Head-Related Transfer Function (HRTF). There are a lot of very promising technologies in the fields of Virtual and Augmented Reality that rely heavily on HRTFs.
Measuring a person’s HRTF is time-consuming and expensive, as it requires specialized equipment and software. One of the key technological challenges today is to be able to do this as quickly and as cheaply as possible. Essentially, you sit the subject in an anechoic chamber, put microphones in their ears, and surround them with, effectively, a sphere of loudspeakers. It typically takes about two hours. But once we have a person’s HRTF it remains pretty much constant, and won’t change over time unless something happens to their pinnae. We can store an HRTF in a file using a format called SOFA [Spatially Oriented Format for Acoustics], and you can carry it around with you on a USB stick. Here at Princeton, we can now measure an HRTF in about ten minutes, and if you call in on our laboratory we’d be happy to measure yours for you! We would like to obtain a large library of HRTFs, as we believe we can use that data to further reduce the process time.
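As it happens, SOFA files are ordinary netCDF-4 containers, so a stored HRTF can be inspected with generic tools. The sketch below uses Python’s netCDF4 package to pull the impulse responses out of a hypothetical file; the variable names (Data.IR, SourcePosition, Data.SamplingRate) follow the published SOFA convention for free-field HRIRs, but the filename and the nearest-direction lookup are illustrative assumptions of mine.

```python
import numpy as np
from netCDF4 import Dataset  # SOFA files are netCDF-4 underneath

# "my_hrtf.sofa" is a placeholder name for a measured HRTF set.
with Dataset("my_hrtf.sofa", "r") as sofa:
    irs = np.asarray(sofa.variables["Data.IR"][:])               # [measurement, ear, sample]
    positions = np.asarray(sofa.variables["SourcePosition"][:])  # [measurement, (az, el, dist)]
    fs = float(np.ravel(sofa.variables["Data.SamplingRate"][:])[0])

print(f"{irs.shape[0]} measured directions, {irs.shape[2]}-tap HRIRs at {fs:.0f} Hz")

# Crude nearest-neighbour lookup of the measurement closest to a requested
# direction (azimuth/elevation in degrees; ignores azimuth wrap-around).
target = np.array([30.0, 0.0])  # 30 degrees off to one side, at ear level
idx = int(np.argmin(np.linalg.norm(positions[:, :2] - target, axis=1)))
hrir_left, hrir_right = irs[idx, 0], irs[idx, 1]
```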
RM. That’s pretty advanced stuff, then.
EC. Actually, all of the above is pretty much textbook stuff. You will find no acoustician or spatial audio scientist who will want to disagree with any of it.
RM. So where do the real challenges lie?
EC. How can we record a sound field correctly, and play it back spatially correctly, so that the listener perceives it as true 3D? Well, to do so we first have to capture all of those cues. And we should only need two channels, because we only have two ears. So a binaural recording, made with two microphones inside your two ears, or inside the ears of a dummy head, should be an excellent way to accomplish that, with the caveat that if we record it with your head we will be recording the spectral cues that are correct for you, but not for me. And if we record with my head, it will be correct for me but not for you. Even so, a binaural recording made on a well-designed dummy head will capture enough of these cues that most listeners can interpret it as a 3D image, provided those cues are delivered correctly to the listener during playback. Your ear/brain may make some errors in where specific sounds are localized – you may perceive that violin to be located at an angle of 60 degrees off to the right instead of 50 degrees – but the result will still be in 3D.
We can talk about these errors, and about the fact that it is not an absolute necessity to record binaurally, but it is certainly the most direct way to capture all of these cues. But regardless of how we capture them, once we have done so we have all the information we need to recreate the original 3D image for the listener. All we have to do is recreate those original sound pressure waves in the ear canals of the listener. If we can do that, the listener should in principle perceive the original 3D image … so long as these cues are transmitted correctly during playback.
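In practice this is exactly how binaural synthesis works: convolve a dry source signal with the left- and right-ear impulse responses measured for the desired direction, and the resulting two channels carry the ITD, ILD and spectral cues together. A minimal sketch, with toy stand-in HRIRs so it runs on its own; a measured pair, such as the one extracted in the SOFA example above, would be used in practice.

```python
import numpy as np
from scipy.signal import fftconvolve

def render_binaural(mono: np.ndarray, hrir_left: np.ndarray,
                    hrir_right: np.ndarray) -> np.ndarray:
    """Place a dry mono signal at the direction the HRIR pair was measured
    for. The HRIRs jointly encode the relative delay (ITD), relative level
    (ILD) and spectral colorations for that direction."""
    left = fftconvolve(mono, hrir_left)
    right = fftconvolve(mono, hrir_right)
    out = np.stack([left, right], axis=1)
    return out / np.max(np.abs(out))  # normalize to avoid clipping

# Toy HRIRs at 48 kHz: the far ear gets the sound ~0.6 ms later and quieter,
# roughly as it would for a source well off to one side.
hrir_left = np.zeros(64);  hrir_left[2] = 1.0
hrir_right = np.zeros(64); hrir_right[30] = 0.6

binaural = render_binaural(np.random.default_rng(0).standard_normal(48000),
                           hrir_left, hrir_right)
```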
RM. Is that a problem? Do we usually transmit the cues incorrectly?
EC. The idea is that we should recreate the original sound pressure field as close as we can to the ear canals, and the listener should then perceive a 3D image. And to an extent that actually happens. The most obvious way to transmit these cues correctly is using headphones. But if you take an ordinary binaural recording and play it back through headphones, what you find is that only about 30% of people will perceive an external 3D sound field, and 70% will not. Why is that? It is because there is a mismatch between the HRTF of the dummy head used to make the binaural recording, and the individual listener’s HRTF, and it turns out that only about 30% of people are tolerant of such a mismatch.
But a much bigger problem is that in the real world if you rotate your head, the sound field remains stationary, whereas when you listen on headphones the entire 3D soundstage rotates with you. The ear/brain system gets badly confused by this, and as a result the perceived 3D sound field collapses completely, and it can stay collapsed even if you then hold your head still. In fact the brain tends to respond by placing the 3D sound field inside your head, and this is a well-known problem for the majority of headphone users.
RM. Yes indeed. This “inside-the-head” problem is something that many headphone enthusiasts would dearly love to be able to eliminate.
EC. One of the goals for headphone users is to be able to localize the sound field outside of the head, and keep it there even as the head rotates. So-called ‘Crossfeed’ techniques attempt to address this, but they don’t do it very well, and they can’t compensate for the problems associated with rotating your head.
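For the curious, the basic idea of crossfeed is easy to sketch: each channel is mixed with a delayed, low-passed, attenuated copy of the opposite channel, crudely mimicking the inter-aural differences of loudspeaker listening. The toy Python version below is my illustration of the principle, not any particular commercial design, and, as Choueiri notes, being static it cannot respond to head rotation.

```python
import numpy as np
from scipy.signal import butter, lfilter

def simple_crossfeed(stereo: np.ndarray, fs: float = 48000.0,
                     delay_s: float = 0.0003, atten_db: float = -6.0,
                     cutoff_hz: float = 700.0) -> np.ndarray:
    """Toy crossfeed on an [N, 2] stereo buffer: add a delayed, low-passed,
    attenuated copy of each channel to the opposite ear. All parameter
    values are illustrative, not taken from any published design."""
    b, a = butter(2, cutoff_hz / (fs / 2.0))  # head shadow approximated as a low-pass
    delay = int(round(delay_s * fs))          # ~0.3 ms, on the scale of a typical ITD
    gain = 10.0 ** (atten_db / 20.0)
    bleed = lfilter(b, a, stereo, axis=0) * gain
    bleed = np.roll(bleed, delay, axis=0)
    bleed[:delay, :] = 0.0                    # discard samples wrapped by the roll
    out = stereo + bleed[:, ::-1]             # each side feeds the opposite ear
    return out / np.max(np.abs(out))
```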
RM. What happens when you play a binaural recording through loudspeakers? They at least stay put when you rotate your head.
EC. With loudspeakers you tend to get a slightly diffuse sound field which is locked in the middle between the two speakers, with only a very slight extent forwards and backwards. Moreover, what you hear is totally dependent on the position of the speakers. That itself should tell you that you have a fundamental problem with playback. If stereo were truly correct, the positions of the speakers would have nothing to do with the stereo image. If a violin was recorded 10 feet away from you, and off to the left, why should the position of the speaker determine where it appears to be located during playback? It tells you that something is wrong with stereo … and actually, that is very well understood. Unfortunately, it is not very well understood within the community of high-end manufacturers!
In fact, we can formulate a test of whether we truly have a methodology to recreate the original 3D sound field – we should be able to position the speakers wherever we want, and it shouldn’t change the sound field to any significant degree. The speaker positioning should become completely immaterial. So long as the necessary cues are transmitted to the ear/brain system – through the speakers somehow – then the listener will perceive the violin to be located exactly where it was during the recording, regardless of where the speakers are placed. Do you agree with me that this would be a good test if we had such a technique?
RM. Yes, that does make sense I suppose. Although I’ll have to think about it some more …
… which I’ll do before the next issue of Copper, when we’ll continue our conversation with Edgar Choueiri. And things will get very interesting indeed.
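A footnote for the technically inclined, formalizing the test Choueiri proposes (my notation, not his): let H(f) be the 2x2 matrix of acoustic transfer functions from the two speakers to the two ears for a given speaker placement, and let b be the pair of binaural signals we want delivered at the ear canals. Then:

```latex
\begin{pmatrix} e_L \\ e_R \end{pmatrix}
=
\begin{pmatrix} H_{LL} & H_{LR} \\ H_{RL} & H_{RR} \end{pmatrix}
\begin{pmatrix} s_L \\ s_R \end{pmatrix},
\qquad
\text{so choosing }
\begin{pmatrix} s_L \\ s_R \end{pmatrix}
= H(f)^{-1}
\begin{pmatrix} b_L \\ b_R \end{pmatrix}
\ \Rightarrow\ e = b .
```

If the speaker feeds s are computed this way, the ears receive exactly the intended cues for any (invertible) H, which is precisely why the speaker positions would become immaterial.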