Quibbles and Bits

DSD – Is It PCM, Or Isn’t It?

Issue 138

A few weeks back, a prominent person in the audio community shared with me their opinion that DSD is a different thing entirely from PCM, and was surprised to hear me insist otherwise. On the face of it, of course, they are indeed very different formats. PCM maps out the actual amplitude of the original waveform, sample by sample. DSD, on the other hand, looks like a completely different animal, and represents the waveform using a pulse-density modulation scheme, from which the original signal can be extracted by passing the digital bits themselves through a low-pass analog filter. For reasons that seem somewhat fuzzy to me, this is deemed to be inherently more akin to analog. Given that I’m a guy whose turntable still sits atop his audio stack, why would I bother to take issue with that?

For most practical purposes, the whole subject is an academic argument over little more than semantics. But DSD does have some very fundamental differences in its digital nature when compared to PCM. For example, you can’t mix multiple tracks together in DSD…or blend a stereo feed into a single mono channel. You can’t do EQ. You can’t even do something as rudimentary as volume control.

The big issue with DSD, and the reason for all these limitations, is that unlike with PCM, the signal itself cannot be extracted simply by examining the bitstream. Most people seem comfortable with the familiar arm-waving explanation that a DSD bitstream represents the signal as a pulse-density-modulation scheme. That is indeed a valid characterization, but it does little – nothing, actually – to help you understand how the signal is actually encoded within the bitstream. For example, on what basis would you determine what resolution it has – is it better or worse than CD? How low is the noise floor – is it better than 24-bit PCM? What is the bandwidth – does it go above 50 kHz? In any PCM format, these questions are unambiguously answered. But not in DSD. Where exactly in that 1-bit stream can I find my high-resolution audio data? I thought it might be instructive to return briefly to Copper and explore this issue, and hopefully frame it in terms everybody can understand…or maybe at least follow.

PCM is based on the well-known sampling theorem of Claude Shannon. Provided an analog signal contains no frequency components above a certain maximum frequency, that analog signal can be losslessly represented in its entirety by discretely sampling it using a clock with a sample rate no less than twice the maximum frequency present within the signal. The qualifier “in its entirety” is important here, because Shannon’s theorem tells us that the totality of the sampled data enables the entirety of the original signal to be faultlessly reconstructed, even at arbitrary points in time that lie in between consecutive sampled values.
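
For those who like to see that claim in action, here is a minimal sketch in Python (my own illustration, using NumPy; the frequencies and sample counts are chosen purely for the example) of Shannon reconstruction recovering a value that lies in between two samples:

```python
import numpy as np

fs = 8000.0                    # sample rate in Hz, comfortably above 2 x 1 kHz
k = np.arange(64)              # sample indices
samples = 0.6 * np.sin(2 * np.pi * 1000.0 * k / fs)   # a sampled 1 kHz sine

def reconstruct(t, samples, fs):
    """Shannon reconstruction: evaluate the signal at an arbitrary time t
    (in seconds) as a sinc-weighted sum of the stored samples."""
    n = np.arange(len(samples))
    return float(np.sum(samples * np.sinc(fs * t - n)))

# A point exactly halfway between two samples is still recovered, to within
# the truncation error of using 64 samples rather than an infinite set.
t = 10.5 / fs
print(reconstruct(t, samples, fs))            # reconstructed value
print(0.6 * np.sin(2 * np.pi * 1000.0 * t))   # the true value
```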

Let me start with a simple analog signal, a sine wave. In the diagram below, a sine wave with an amplitude of ±0.6 is shown. The red dots indicate the points at which the waveform is sampled. It represents an ideal situation in which the waveform is perfectly sampled:

Diagram of a perfectly sampled waveform.


In the next diagram, however, I introduce a crude sampling constraint. I require each sample to be evaluated to the nearest multiple of 0.2 (corresponding to a bit depth of between 3 and 4 bits), a coarse step chosen so that we can clearly observe the consequences. Here we can see that this results in a type of error known as quantization error. In effect, the sampled signal is the sum of two signals: the original intact waveform (in blue), and a sampled waveform corresponding to the quantization error (in black).

Sampled waveform with quantization error.


This is PCM in its simplest form. It represents the original waveform accurately only to the extent that the waveform corresponding to the quantization error is inaudible. If that error waveform is indeed inaudible, then you will have succeeded in recreating the original waveform.
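
Here is the same idea as a small Python sketch (again my own illustration, not anything from a real converter), showing that the stored signal is exactly the original plus the quantization error:

```python
import numpy as np

fs = 44100.0
t = np.arange(64) / fs
x = 0.6 * np.sin(2 * np.pi * 1000.0 * t)    # the original sampled waveform

q = 0.2                                     # the quantization step from the diagram
quantized = q * np.round(x / q)             # snap every sample to the nearest level
error = quantized - x                       # the quantization error waveform

assert np.allclose(x + error, quantized)    # stored signal = original + error
print(np.max(np.abs(error)))                # never more than half a step (0.1)
```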

It must be noted, however, that the question of the audibility of the quantization error waveform is not as straightforward as you might imagine. At first glance it looks like random noise, and if it were truly random noise we would be completely justified in treating it as inaudible, provided it was at a sufficiently low level. But if components of the quantization error signal are correlated with the original waveform, their audibility can be significant, even at surprisingly low levels. Correlated signals fall into the broad category of “distortion,” which is a well-known and complicated topic. Some forms of distortion are known to be less pleasant on the ear, and significantly less tolerable, than others. So it would be good if we could eliminate from the quantization error any remnants of correlated signals, and leave behind only pure random noise. It turns out that we can easily do exactly that, using a process called dithering, which I have covered previously, way back in Copper Issue 6.
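
For the curious, a minimal sketch of dithered quantization in Python (my own illustration; the step size and signal are purely for the example). The triangular-PDF dither spans one full quantization step, which is the textbook choice for decorrelating the error from the signal:

```python
import numpy as np

rng = np.random.default_rng(0)

def dither_and_quantize(x, q):
    """Quantize with TPDF dither: add triangular-PDF noise spanning one full
    quantization step before rounding. The residual error is then plain
    random noise, uncorrelated with the signal."""
    tpdf = (rng.uniform(-0.5, 0.5, x.shape) +
            rng.uniform(-0.5, 0.5, x.shape)) * q
    return q * np.round((x + tpdf) / q)

x = 0.6 * np.sin(2 * np.pi * np.arange(1000) / 64)
y = dither_and_quantize(x, 0.2)             # dithered, quantized signal
```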

That, in summary, is pure PCM. We can use it to accurately encode a mixture of the original analog signal, plus a sprinkling of added noise. And by using 24-bit encoding, that sprinkling can be waaaaay down below the residual noise floor of any analog signal that you might have available as your source.

Back to our little diagram, then. We can interpret the exact same picture in a slightly different way. We could interpret it as saying that by adding a very particular noise waveform to our original waveform, we made the amplitude of the resulting waveform correspond exactly to the nearest available quantization level at every single sampling point. In other words, we can argue that we effectively eliminated all quantization errors entirely, by the simple act of mixing in a special noise waveform of our own making.

At this point, I would really like you to go back and read the previous paragraph again. Keep reading it until you absolutely get it. It is key to what comes next.

OK, let’s move on. We’ve just established that an ideal PCM signal comprises the original audio signal, plus some added noise. We’re talking about the entirety of the original audio signal, in its pure unadulterated form. Not a 16-bit version of it, not a 24-bit version of it, but the entire, clean, original waveform in all its glory. The 16-bit (or 24-bit) version is what we get AFTER we’ve added in the noise, and stored the result in a 16-bit (or 24-bit) word.

We would find ourselves living in a perfect world if we could then just extract that added noise, because if we could, we would be left with only the pure, original, audio waveform. Unfortunately, that’s just not possible. The problem is that the noise, as described, occupies the exact same frequency space as the signal. Such noise cannot be extracted from a signal – that is the very nature of noise itself. The only practical way of separating out noise is to filter it out, and it doesn’t help all that much if we end up filtering out parts of the signal at the same time.
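
If you want to convince yourself of that, here is a short Python sketch (my own illustration) that measures the spectrum of dithered quantization noise – its power is essentially flat all the way from DC to Nyquist, sitting right on top of the audio it accompanies:

```python
import numpy as np

rng = np.random.default_rng(0)
fs = 44100.0
n = 1 << 15
x = 0.6 * np.sin(2 * np.pi * 1000.0 * np.arange(n) / fs)

q = 0.2
tpdf = (rng.uniform(-0.5, 0.5, n) + rng.uniform(-0.5, 0.5, n)) * q
noise = q * np.round((x + tpdf) / q) - x     # the total added noise

# Average the noise power in coarse bands: roughly equal in every band,
# i.e. broadband noise occupying the same frequency space as the signal.
spec = np.abs(np.fft.rfft(noise)) ** 2
freqs = np.fft.rfftfreq(n, 1.0 / fs)
for f_lo, f_hi in [(0, 5_000), (5_000, 10_000), (10_000, 15_000), (15_000, 20_000)]:
    band = spec[(freqs >= f_lo) & (freqs < f_hi)]
    print(f"{f_lo:>6}-{f_hi:<6} Hz: {band.mean():.1f}")
```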


Let’s return to our little diagram, and this time take the same concept to its extreme. In the version below we have allowed only two quantization levels, at +1 and -1, and consequently the noise signal itself has become huge. The diagram, unfortunately, may be difficult to interpret, and will require some concentration on your part. The blue line and the red dots are the original waveform as before, and the black lines are the added noise samples. The green dots represent the sum of the waveform plus the added noise. The green dots, as you can see, occupy only the positions +1 and -1.

Extreme example of sampling with only two quantization levels.


You can clearly see that the added noise is of much greater magnitude than before. In fact, the noise has peaks comfortably beyond ±1. [Note: although this noise signal does encode a noise waveform of its own, that waveform is completely irrelevant – only its specific values at the sampling points have any relevance at all.] This prompts a big question: “So what?” If the noise is truly massive, and we can’t separate it from the signal, have we actually accomplished anything?
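
As a Python sketch (my own illustration, using the crudest possible rule of snapping each sample to the nearer of the two levels – the modulator that generated the diagram makes cleverer choices, which is how its noise manages to peak beyond ±1):

```python
import numpy as np

x = 0.6 * np.sin(2 * np.pi * np.arange(32) / 32)   # the original samples (blue/red)

# Crudest two-level coder: pick whichever of +1 and -1 is nearer.
green = np.where(x >= 0.0, 1.0, -1.0)              # the green dots
noise = green - x                                  # the black lines

assert np.allclose(x + noise, green)
print(np.max(np.abs(noise)))   # up to ~1.0 here, versus 0.1 with the 0.2 step
```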

But first, we need a quick detour on the subject of implied precision. If I were to propose that you phone me at one o’clock, you would form some idea in your mind of how precise that appointment is meant to be. You might, for example, assume that five minutes before one, or five minutes after one, would be acceptable. But if I asked you to phone me at 1:13:22 – that is, 13 minutes and 22 seconds after one pm – you would get a rather different impression of how precisely I wished the appointment to be kept. In other words, the precision with which a number is stipulated is often presumed to convey something about the precision of the quantity which the number represents. The number “1” is taken to mean “roughly, approximately, 1,” whereas the number “1.000000” would be taken to mean exactly 1, to a precision of at least six decimal places. In the following paragraphs I have therefore chosen to write the numbers +1 and –1 as +1.00000[…] and –1.00000[…] respectively, to indicate that I am stipulating numbers with an extreme precision of many decimal places, which “+1” and “–1” alone might not convey.

Back to our “So what?” question. It turns out that we do in fact have a route forward. Since the sum of the signal plus the noise always comes to either +1.00000[…] or –1.00000[…] at the sampling points, all we need to represent the result of that addition is a 1-bit number. A +1.00000[…] result is represented by “1” and a –1.00000[…] result by “0.” Compared to 24-bit PCM, we would need only 1/24th of the file space to store the data. Perhaps that could open the door to a practical solution. If, for example, we were to increase the sample rate by a factor of 24, we’d end up with a file of the same size as the original, and maybe we could take advantage of that in some way. By way of illustration, consider a 24-bit, 88.2 kHz high-resolution PCM file. If we reduce the bit depth to 1 bit by adding some of that “magic noise,” and increase the sample rate by a factor of 24 to 2.1168 MHz, we end up with a 1-bit file the same size as the original 24-bit file. Can we do something useful with that?

The thing about a PCM file with a 2.1168 MHz sample rate is that, according to the Nyquist criterion, it can encode signals with a bandwidth of up to 1.0584 MHz. But we only need 20 kHz (let’s be all high-end-y and call it 30 kHz) to encode all of the audio frequencies. That means there is a region from 30 kHz all the way up to 1.0584 MHz that contains no useful audio frequencies at all. Let me write that another way, at the risk of belaboring the point: the region of 0.030–1.058 MHz contains no audio data. All the audio data lives below 0.030 MHz. That’s a tiny corner of the overall addressable frequency space.
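
A quick back-of-the-envelope check in Python (my own arithmetic, using the numbers above):

```python
pcm_rate = 88_200 * 24             # 24-bit/88.2 kHz PCM: bits per second per channel
dsd_rate = (88_200 * 24) * 1       # 1 bit at 24 x 88.2 kHz = 2,116,800 samples/s
print(pcm_rate == dsd_rate)        # True: identical file sizes

nyquist = 2_116_800 / 2            # Hz: bandwidth the 1-bit stream can encode
audio = 30_000                     # our generous audio bandwidth, Hz
print(f"{100 * audio / nyquist:.1f}% of the space carries audio")   # ~2.8%
```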

Time to go back to the last diagram and take a closer look at it. All those black lines represent added noise. Each line can take on one of two possible values. Either it can be a positive number, the exact amount needed to raise the amplitude to +1.00000[…], or it can be a negative number, the exact amount needed to reduce the amplitude to –1.00000[…]. The important thing to note is that, conceptually at least, it really doesn’t matter much to us either way.

Given that every second of music will require some 2.12 million of these noise samples, and that each and every one of them can take one of two possible values, the number of available permutations for arranging those noise samples across the entirety of an audio file is almost uncountable. The question is, can some of those permutations turn out to be useful?

Recall that the only way to separate noise from a signal is to filter it out. Suppose we were to arrange the noise so that it only occupied those frequencies above 0.03 MHz. If we could do that, then all we’d need to do is pass the signal through a low-pass filter with a cutoff of 30 kHz. All the noise would then be stripped off, and we’d be left with the original unadulterated audio signal. This would be a very workable solution indeed … if we could figure out a way to actually do it.

That way is called sigma-delta modulation (or delta-sigma modulation), but this is not the place for me to describe how it works. You’ll just have to take my word for it that it does. A sigma-delta modulator does just what we want here. Through a process called noise shaping, it creates exactly the kind of “magic” noise signal we are looking for – one that raises or depresses every sample value to +1.00000[…] or –1.00000[…], while keeping virtually the entire spectrum of the noise safely above the audio bandwidth, where it can easily be filtered out.
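
That said, a toy version fits in a few lines. Here is a first-order sigma-delta modulator in Python (my own illustrative sketch, using NumPy and SciPy; real DSD modulators are of much higher order, so this one’s noise floor is far worse than the real thing, but the principle is identical):

```python
import numpy as np
from scipy.signal import butter, filtfilt

def first_order_sdm(x):
    """Textbook first-order sigma-delta modulator: quantize to +/-1 and feed
    the error back through an integrator, which pushes the quantization
    noise up to high frequencies."""
    out = np.empty_like(x)
    acc = 0.0
    feedback = 0.0
    for i, sample in enumerate(x):
        acc += sample - feedback          # integrate the error
        feedback = 1.0 if acc >= 0.0 else -1.0
        out[i] = feedback
    return out

fs = 64 * 44100                            # a DSD64-style sample rate
t = np.arange(fs // 100) / fs              # 10 ms of signal
x = 0.6 * np.sin(2 * np.pi * 1000.0 * t)   # a 1 kHz tone

bits = first_order_sdm(x)                  # the +/-1 "DSD-like" stream

# Low-pass filter the raw bits at 30 kHz: the audio falls out the bottom.
b, a = butter(4, 30_000 / (fs / 2))
recovered = filtfilt(b, a, bits)
print(np.max(np.abs(recovered[1000:-1000] - x[1000:-1000])))
# Residual is small, though far from 24-bit clean with only a first-order loop.
```

Even this crude first-order loop recovers the 1 kHz tone from a stream containing nothing but +1s and –1s; the higher-order loops in real DSD converters simply push far more of the noise out of band.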

There are theories that analyze and quantify just how much of this noise can be pushed up into the unused high frequencies. The basic noise-shaping theory was developed by Gerzon and Craven, and I don’t have space to dig into it here, but in essence it quantifies the “no free lunch” aspects of the process, and establishes fundamental limits beyond which the magic cannot be pushed. Let’s be clear, though: those limits do not prevent us from achieving the kind of audiophile-grade performance we desire. Far from it.

The main finding from Gerzon and Craven is that we need to push the sample rate out to quite high levels. It turns out that our simple solution of increasing it by a factor of 24 in order to keep the same file size does not go far enough. Actual DSD (Direct Stream Digital) runs at a sample rate of 64 times 44.1 kHz (which comes to 2.8224 MHz), and in practice that represents the lowest sample rate at which adequate performance can be achieved. Today, it is commonplace to refer to that as DSD64. By doubling the sample rate to 128 times 44.1 kHz we get DSD128, which arguably represents the sweet spot for the technology. But DSD256 is now being used in serious high-end studio applications, and there are even people playing around with DSD512. At these colossal sample rates, at least from a theoretical perspective, any additional benefits become increasingly marginal, and the files needed to store them can become pretty unwieldy.
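
To put some numbers on those rates and file sizes (my own arithmetic, assuming uncompressed stereo):

```python
base = 44_100                                # Hz
for mult in (64, 128, 256, 512):
    rate = mult * base                       # 1-bit samples/s per channel
    mb_per_min = rate * 2 / 8 * 60 / 1e6     # stereo, megabytes per minute
    print(f"DSD{mult}: {rate / 1e6:.4f} MHz, ~{mb_per_min:.0f} MB per stereo minute")
```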

So there we are. DSD, at its core, is simply a way of representing PCM audio data. The actual audio signal is a PCM-encoded signal that has been subsumed within an avalanche of high-frequency noise, such that the only way of extracting it is to filter that noise out. The achievable resolution is determined by just how much of the “magic noise” still resides within the audio band, and current SDM technology places that at about –120 dB, corresponding to roughly 20 bits of PCM resolution. The achievable bandwidth is determined by the sample rate, and DSD64 can achieve something close to 30 kHz (depending on how you choose to define it). The fact that the “magic noise” can be filtered out as effectively in the analog domain as in the digital domain has opened the door to a class of DAC designs which today totally dominate the DAC market.

There are those who insist that DSD sounds better than PCM, but that’s not an argument that makes any sense to me. DSD and PCM don’t have “sounds” per se. The things that DO have “sounds” are the processes that encode a signal into PCM and/or DSD, and that decode them back into analog, because these processes are generally not lossless, and that’s a separate discussion entirely. By contrast, BY FAR the biggest contributors to sound quality are the original recording engineers, and the lengths they are willing to go to in order to achieve the best sound possible. For these people, recording in DSD forces them to abandon all of the über-convenient post-processing digital chicanery available at the click of a ProTools mouse. Maybe – just maybe – that’s the most important observation of all.

Header image courtesy of Wikimedia Commons/Pawel Zdziarski, cropped to fit format.

9 comments on “DSD – Is It PCM, Or Isn’t It?”

  1. A fine explanation – well done! And I fully agree with the final conclusion concerning the quality of the recording and mixing/mastering. The best sound quality I have ever encountered is found on this direct-to-disc copper (!) record: https://sfrshop.de/epages/8bc55054-7644-4582-abe9-aacc9769e6d3.sf/en_GB/?ViewObjectPath=%2FShops%2F8bc55054-7644-4582-abe9-aacc9769e6d3%2FProducts%2FSFR357.0001.1 . My impression is that the quality is even better than from noisy master tapes. However, I would like to know which high (!) frequency content in the sounds of percussion instruments (triangle) or plucked string instruments (acoustic guitar, harpsichord) characterizes the audible timbre of these instruments and isn’t captured by 16/44.1 RBCD, limited as it is to 22 kHz. And why do some audio designers claim that a Class D amp is the best solution for a subwoofer, yet will never have the finesse of a Class A single-ended vacuum tube amp best suited for tweeters, and even plasma tweeters?

    1. Lots of questions there – can’t answer them all! One of the things about PCM is that all the arguments in its favour talk only about its theoretical underpinnings, as indeed I did above. And those underpinnings are totally solid. But the practical sonic challenge lies in taking an analog signal and extracting those sample values with the degree of precision required by whatever PCM format is being used. That is not only impractical, but is strictly speaking theoretically impossible to accomplish. All we can do in practice is to make as good a stab at it as technology (and budget) allows. So when you observe limitations in the sound of RBCD the question is, how much of what you hear is due to the fundamental limitations of the format itself, and how much is down to the process actually employed to encode the signal into that format? Actually, a germ of an idea is forming about how I might be able to practically illustrate that …. 🙂

      The subwoofer question is an easier one to answer. Purely for the sake of the argument, Class D amplifiers can be viewed as a cruder and clumsier variant of DSD: a heavily noise-shaped, high-sample-rate digital system. But where DSD works at 2.8 MHz (for DSD64), typical Class D amplifiers work in the high hundreds of kHz. If they were to operate at DSD frequencies, they’d need radio broadcast licenses! Those hundreds of kHz are roughly 30X the audio bandwidth, so the noise-shaping challenge is a tough one. But they are roughly 10,000X the bandwidth of a subwoofer, which is *tons* of noise-shaping headroom, so truly exceptional performance can be achieved.

      1. Good morning Richard!
        You said something about how “there are even people playing around with DSD512.”
        I have two media players on my computer that can play DSD audio files.
        They are Foobar2000, and Resonic Beta.
        Only one of them can play DSD512 files.
        Foobar2000 is the one that can do that, without any problem.
        Resonic Beta crashes every time I try to put that kind of DSD file through it.
        Can you give me a reasonable explanation of why that is happening?
        Thanks!

        1. Sorry, but I use neither Windows nor Resonic Beta, so I’m afraid I don’t have anything constructive to offer you. DSD512 playback is fundamentally challenging on a number of levels because of the huge bandwidth required. You need a sustained, on-demand transmission rate between computer and DAC of about 45 Mbit/s for a 2-channel track in “native” DSD format (i.e. over ASIO) – that’s 22.5792 MHz × 2 channels × 1 bit – rising to over 67 Mbit/s if you are using DoP encoding.

  2. Indeed, a lovely exposition. Many thanks!

    As to whether DSD sounds better than PCM, I think there’d be a simple explanation were all PCM DACs to be R2R ladders. For them, you need to be sure that every R is (relatively) correct to less than the value of the smallest bit. And that the power supply is stable to the same degree of accuracy.

    Running a DSD bitstream through an RC – easy. You get rid of half the problems at one swell foop. Just the volts need to be accurate.

    1. I wrote about R-2R ladders in Copper 11. There are some fundamental theoretical problems with them. Having said that, some of the best ones do sound quite magical. I had a Light Harmonic Da Vinci for a while, and it was truly exceptional to listen to, although I have been told that it did not measure well.

      As to running a DSD bitstream through an RC filter: while you’re quite correct, exceptional performance demands not only extremely low noise (what you mean when you write ‘stable’) on the voltage rails, but also extremely low phase noise (what we used to call ‘jitter’) on the clock. The latter is much more challenging (read ‘expensive’) than the former.

  3. I love both 24/192 and DSD, but what I like about 24/192 is that I can mix and master with it in my affordable Sony Sound Forge. I can also burn them to DATA discs as ISO+Joliet and share them. I could buy a Tascam DA-3000, but then what?

    I have enjoyed the recordings I have made with my Tascam DR-680MK2 recorder: 2 tracks of 24/192, or 6 tracks of 24/96. Hard to believe what can be done with such a small recorder.

    What I can say is that in 24/192 my ultrasonic noise floor is below –90 dB and often at –100 dB. I have seen “professional” results with much higher noise floors, and also adjacent-channel spikes that are not time-aligned, which no one seems to be able to explain. I am glad I don’t have them to worry about.

    I do believe that DSD is slightly smoother, but I have also noticed a difference in playback quality with DSD, with the free Tascam player sounding better than JRiver Media Center 27.

    Before I would invest in the Tascam DA-3000, Sony would have to allow us to work with it in Sound Forge (now owned by Magix), but I don’t think that is possible or going to happen. So I can live with 24/192 for now: it allows me to mix and master with what I own, and it sounds great to me. I would rather spend more money on new microphones anyway.

    I also enjoy the challenge of recording and mixing in stereo 24/192, as it makes me work harder at mic placement and setup, and then just getting the overall levels right. That is kind of like the old days. At nearly 74, I don’t mind it. Most of my work was on-location anyway, and I hope now that the pandemic is over I can get back to it.
