Question: How does sample rate conversion work? I get asked this often, and the answer, like all things pertaining to digital audio, is both simple and complicated depending on how deeply you want to look into it. So here is a quick primer on the technical issues that underpin sample rate conversion (SRC).
First of all, what, exactly, is Sample Rate Conversion? Well, digital audio works by encoding a waveform using a set of numbers. Each number represents the magnitude of the waveform at a particular instant in time, so in principle, each time we measure (or ‘sample’) the waveform we need to store two numbers. One number is the magnitude of the waveform itself and the other number is the exact point in time at which that magnitude was measured. That’s a lot of numbers, but we can cut them in half if we can eliminate having to store all the timing numbers. Suppose we measure the waveform using a very specific regular timing pattern determined in advance? If we can do that, then we don’t have to store the timing information because we can simply use a very accurate clock to regenerate it during playback. This is how all digital audio is managed for consumer markets.
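To make the "half the numbers" idea concrete, here is a minimal Python sketch. The 440Hz test tone, the eight-sample length and the variable names are my own choices for illustration, not anything taken from a real audio format.

```python
import math

sample_rate = 44100  # samples per second, agreed in advance

# Explicit form: every measurement carries its own timestamp (two numbers each).
explicit = [
    (n / sample_rate, math.sin(2 * math.pi * 440 * n / sample_rate))
    for n in range(8)
]

# Implicit form: keep only the magnitudes (one number each).
magnitudes = [value for _, value in explicit]

# On playback, the timing is regenerated from the sample index and the known rate.
regenerated = [(n / sample_rate, value) for n, value in enumerate(magnitudes)]
```

The regenerated list is identical to the explicit one, which is why nothing is lost by discarding the timestamps, provided the clock is accurate.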
The “Sample Rate” is the rate at which we sample (or measure) the waveform. Provided we know exactly what the sample rate is, we can relatively easily reconstruct the original waveform using those stored numbers. The chosen sample rate imposes some very specific restrictions on the waveforms that we can encode in this manner. Most particularly we must observe the Shannon-Nyquist criterion. This states that the signal being sampled must contain no frequencies above one half of the sample rate. If any such frequencies are present in the signal, they must be filtered out very strictly before being sampled. Also, it is one of the simpler tenets of audio that human hearing is restricted to the frequency range below 20kHz. Based on those two things, we can derive a commonly-quoted requirement that in order to achieve high quality, digital audio must have a sample rate of at least 40kHz. For those reasons, the standard sample rate which has been chosen for CD audio, and widely adopted for digital audio in general, is 44.1kHz. Interestingly, for DVD and other video applications, a slightly different sample rate of 48kHz was adopted. These numbers – or rather the differences between them – end up having important consequences.
Of course, the above is not the whole story, and there are various reasons why you might want to sample your audio signal at sample rates other than 44.1kHz. As a result, audio recordings exist at all sorts of different sample rates, and for distribution or playback compatibility purposes you may well prefer to convert existing audio data from one sample rate to another. If you convert from a lower sample rate to a higher one, the process is called up-conversion. In the opposite case, conversion from a higher to a lower sample rate is called down-conversion. The alternative terminology of up-sampling and down-sampling can be used interchangeably (I tend to use both, according only to whim).
We’ll start with a simple case. Let’s say I have some music sampled at 44.1kHz and I want to convert it to a sample rate of 88.2kHz (which is a factor of exactly 2x the original sample rate). This is a very simple case because I can do that by taking the 44.1kHz samples, and inserting one additional sample exactly half way between each one. The process of inserting those additional samples is called interpolation. In effect, what I have to do is (i) figure out what the original analog waveform was, and then (ii) sample it at points in time located at the mid-points between each of the existing samples.
Obviously, the key point here is to recreate the original waveform, and I have already glibly stated that “we can relatively easily reconstruct the original waveform using the stored numbers”. However, like a lot of digital audio, once you start to look closely at it you find that what is easy from a mathematical perspective, is often mightily tedious from a practical one. For example, Claude Shannon (he of the Shannon-Nyquist sampling theorem) proved that the mathematics of a perfect recreation of the analog signal involves the convolution of the sampled data with a continuous Sinc() function – I have described this in some detail in Copper 23. However, if you were to set about performing such a convolution, and evaluating the result at the interpolation points, you would find that it involves a truly massive amount of computation, and is not something you would want to do on any sort of routine basis. Nonetheless, convolution with a Sinc() function does indeed give you a mathematically precise answer, and interpolations performed in this manner would in principle be as accurate as it is possible to make them.
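For the curious, here is roughly what that Sinc() interpolation looks like as a minimal Python sketch of the 2x case. The function names, the 1kHz test tone and the 400-sample length are my own assumptions for illustration. Note that every interpolated value requires a sum over every stored sample, which is where the computational burden comes from, and that with a finite recording the sum is necessarily truncated, so the result is close to exact rather than perfect.

```python
import math

def sinc(x):
    """Normalized sinc: sin(pi*x)/(pi*x), with sinc(0) = 1."""
    if x == 0.0:
        return 1.0
    return math.sin(math.pi * x) / (math.pi * x)

def sinc_interpolate(samples, t):
    """Estimate the waveform at fractional sample index t by summing one
    sinc per stored sample (a truncated version of the ideal infinite sum)."""
    return sum(s * sinc(t - n) for n, s in enumerate(samples))

def upsample_2x_sinc(samples):
    """Keep every original sample and insert a sinc-interpolated value
    half way between each neighbouring pair."""
    out = []
    for n in range(len(samples) - 1):
        out.append(samples[n])
        out.append(sinc_interpolate(samples, n + 0.5))
    out.append(samples[-1])
    return out

fs = 44100.0
tone_hz = 1000.0
x = [math.sin(2 * math.pi * tone_hz * n / fs) for n in range(400)]
y = upsample_2x_sinc(x)

# Compare one mid-stream interpolated value with the true waveform.
mid = 200
true_value = math.sin(2 * math.pi * tone_hz * (mid + 0.5) / fs)
error = abs(y[2 * mid + 1] - true_value)
```

Even in this toy case, 399 interpolated values each need 400 multiplications. Scale that up to a full-length track and you can see why nobody does it this way routinely.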
So if a convolution is not practical, how else can we recreate the original analog signal? What we do is make a sensible guess for what the interpolated value ought to be, and pass the result through a digital brick-wall filter to filter out any errors we may have introduced via our guesswork. If we have made a good guess, then the filter will indeed filter out all of the errors. But if our guess is not so good, then the errors can contain components which fold down into our signal band and can degrade the signal. This filtering method typically has the disadvantage (if you want to think of it that way) of introducing phase errors into the signal, and has the effect that if you look closely at the resulting data stream you will see that most of the original 44.1kHz samples will also have been modified by the filter. Up-conversion in this manner is usually performed by a specialized filter which in effect combines the job of making the good guess and doing the filtering.
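Here is a minimal Python sketch of the guess-then-filter approach for the 2x case. I am assuming the simplest possible guess (a zero for every new sample) and a Hann-windowed Sinc low-pass filter; the 101 taps and the cutoff of 0.22 of the new sample rate are arbitrary illustrative values, and a real converter will use a much more carefully designed filter.

```python
import math

def lowpass_taps(num_taps, cutoff):
    """Hann-windowed sinc low-pass FIR filter. `cutoff` is a fraction of the
    rate the filter runs at. `num_taps` should be odd."""
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        k = i - mid
        if k == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * k) / (math.pi * k)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def upsample_2x(samples, num_taps=101, cutoff=0.22):
    # Step 1: the "guess". Place a zero between each pair of original samples.
    stuffed = []
    for s in samples:
        stuffed.extend([s, 0.0])
    # Step 2: low-pass just below the original Nyquist frequency (0.25 of the
    # new rate). The gain of 2 makes up for the energy lost to the zeros.
    taps = [2 * t for t in lowpass_taps(num_taps, cutoff)]
    half = (num_taps - 1) // 2
    out = []
    for n in range(len(stuffed)):
        acc = 0.0
        for i, h in enumerate(taps):
            j = n + half - i  # centred, so the output is not delayed
            if 0 <= j < len(stuffed):
                acc += h * stuffed[j]
        out.append(acc)
    return out

fs = 44100.0
tone_hz = 1000.0
x = [math.sin(2 * math.pi * tone_hz * n / fs) for n in range(300)]
y = upsample_2x(x)

# Compare the interior of the output with the tone sampled directly at 88.2kHz.
worst = max(
    abs(y[n] - math.sin(2 * math.pi * tone_hz * n / (2 * fs)))
    for n in range(100, 500)
)
```

Because the cutoff sits a little below the original Nyquist frequency, the filter touches the original samples as well as the inserted ones, which is the effect described above.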
When up-converting by factors which are not nice round numbers (for example, when converting from 44.1kHz to 48kHz, the conversion factor is roughly 1.088x) the same process applies. However, it is further complicated by the fact that now you cannot rely on a significant fraction of the original samples being reusable as samples in the output. Consider the simple case again for comparison: if converting from 44.1kHz to 88.2kHz, only every second sample in the output stream is derived from an interpolated value. The interpolated values, which contain the errors, alternate with original 44.1kHz sample values which, by definition, contain no errors. It can be seen, therefore, that the resultant error signal will be dominated by higher frequencies that were not present in the original music signal and can therefore be easily eliminated with a filter.
On the other hand, if I am converting from 44.1kHz to 48kHz, then only one in every 160 samples of the 48kHz output stream will correspond directly to original samples from the 44.1kHz data stream (you’ll have to take my word for that). In other words, 159 out of every 160 samples in the output stream will start off life as an interpolated value. Therefore the quality of this conversion is going to be entirely dependent on the accuracy of those initial interpolation guesses, or more specifically, the accuracy of the algorithm used to make those guesses (a more complicated topic that I won’t be going into here). Again, the process of making a best guess and doing the filtering is typically combined into a specialized filter, but the principle of operation remains the same.
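If you would rather not take my word for the one-in-160 figure, a few lines of Python will check it. This is just arithmetic on the two sample rates; the variable names are mine.

```python
from fractions import Fraction

src_rate = 44100
dst_rate = 48000

ratio = Fraction(dst_rate, src_rate)  # reduces to 160/147
up, down = ratio.numerator, ratio.denominator

# Output sample k sits at input position k * 147/160. It lands exactly on an
# original sample only when that position is a whole number.
one_second = range(dst_rate)
coincident = [k for k in one_second if (k * down) % up == 0]

print(f"conversion factor = {up}/{down} = {float(ratio):.4f}")
print(f"{len(coincident)} of {dst_rate} output samples coincide with an input sample")
```

The ratio reduces to 160/147, so the two sample grids line up once every 160 output samples (equivalently, once every 147 input samples), and at no other time.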
Down-conversion is very similar, but with an additional wrinkle. Let’s start with a very simple down-conversion from 88.2kHz to 44.1kHz. It ought to be quite straightforward – just throw away every second sample, no? No!
Here is the problem: With a 44.1kHz sample rate you cannot encode any frequencies above 22.05kHz (i.e. one-half of the 44.1kHz sample rate). On the other hand, if you have a music file sampled at 88.2kHz you must assume that it has encoded frequencies all the way up to 44.1kHz. So before you can start throwing samples away you have to first put it through a brick-wall filter to remove everything above 22.05kHz. Once you’ve done that then, yes, it is just a question of throwing away every second sample (a process usually referred to as decimation).
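Here is a small Python sketch of what goes wrong if you skip the filter. I have picked a 30kHz test tone purely for illustration: it is perfectly legal at 88.2kHz but lies above the 22.05kHz limit of the lower rate.

```python
import math

fs_high = 88200.0
fs_low = fs_high / 2
tone_hz = 30000.0            # above the 22.05kHz limit of the lower rate
alias_hz = fs_low - tone_hz  # where the tone folds down to: 14.1kHz

n_samples = 64
hi_rate = [math.cos(2 * math.pi * tone_hz * n / fs_high) for n in range(n_samples)]

# Naive down-conversion: throw away every second sample with no filtering.
naive = hi_rate[::2]

# A genuine 14.1kHz tone sampled at 44.1kHz, for comparison.
alias = [math.cos(2 * math.pi * alias_hz * m / fs_low) for m in range(len(naive))]

worst = max(abs(a - b) for a, b in zip(naive, alias))
```

The two sets of samples agree to within rounding error. In other words, the inaudible 30kHz tone has turned into a very audible 14.1kHz one, and once that has happened no amount of subsequent filtering can tell it apart from real music.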
This additional wrinkle makes the process of down-sampling by non-integer factors rather more complicated. In fact, there are two specific complications. First, you can’t decimate by a non-integer fraction! Secondly, because you’re now interpolating a signal which may contain frequencies that would be eliminated by the brick-wall filter, you need to do the interpolation first, before you do the brick-wall filtering, and then the decimation last of all (I’m sorry if that’s not immediately obvious – you’ll just have to stop and think it through). In summary, to get around these two issues, the process of down-sampling by a non-integer factor will usually involve (i) interpolative up-sampling to an integer multiple of the target sample rate; (ii) applying the brick-wall filter (matched to the final desired sample rate); and finally (iii) performing decimation.
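The three steps can be sketched in Python as follows, here converting 48kHz down to 44.1kHz. This is only an illustration under my own assumptions: zero-stuffing as the interpolative step, a Hann-windowed Sinc filter, and a hypothetical `taps_per_phase` parameter that sets the filter length. It skips the multiplications by zero and only computes the samples that survive decimation, but is otherwise unoptimized.

```python
import math
from fractions import Fraction

def lowpass_taps(num_taps, cutoff):
    """Hann-windowed sinc low-pass FIR filter; `cutoff` is a fraction of the
    rate the filter runs at. `num_taps` should be odd."""
    mid = (num_taps - 1) / 2
    taps = []
    for i in range(num_taps):
        k = i - mid
        if k == 0:
            ideal = 2 * cutoff
        else:
            ideal = math.sin(2 * math.pi * cutoff * k) / (math.pi * k)
        window = 0.5 - 0.5 * math.cos(2 * math.pi * i / (num_taps - 1))
        taps.append(ideal * window)
    return taps

def resample(samples, src_rate, dst_rate, taps_per_phase=24):
    ratio = Fraction(dst_rate, src_rate)
    up, down = ratio.numerator, ratio.denominator  # 147 and 160 for 48k -> 44.1k
    # (i) Conceptually up-sample by `up`, to a rate that is an integer multiple
    #     of the target rate. The zeros are never stored, only skipped below.
    # (ii) Low-pass at the lower of the two Nyquist frequencies, expressed as a
    #      fraction of the intermediate rate. Gain `up` makes up for the zeros.
    cutoff = 0.5 / max(up, down)
    num_taps = 2 * taps_per_phase * max(up, down) + 1
    taps = [up * t for t in lowpass_taps(num_taps, cutoff)]
    half = (num_taps - 1) // 2
    # (iii) Decimate by `down`: evaluate the filter only at every `down`-th
    #       position of the intermediate stream.
    out = []
    for m in range(len(samples) * up // down):
        centre = m * down
        n_lo = max(0, -(-(centre - half) // up))  # ceiling division
        n_hi = min(len(samples) - 1, (centre + half) // up)
        acc = 0.0
        for n in range(n_lo, n_hi + 1):
            acc += taps[centre + half - n * up] * samples[n]
        out.append(acc)
    return out

tone_hz = 1000.0
x = [math.sin(2 * math.pi * tone_hz * n / 48000) for n in range(480)]
y = resample(x, 48000, 44100)

# Compare the interior of the output with the tone sampled directly at 44.1kHz.
worst = max(
    abs(y[m] - math.sin(2 * math.pi * tone_hz * m / 44100))
    for m in range(60, 380)
)
```

Notice that the intermediate rate here is 147 x 48kHz = 7.056MHz, and the filter has to be several thousand taps long to be reasonably sharp at that rate. That is a fair indication of how much more work a non-integer conversion involves.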
I hope you have followed enough of what I just wrote to at least enable you to understand why I always recommend sample rate conversions between members of the same “family” of sample rates. One family includes 44.1kHz, 88.2kHz, 176.4kHz, 352.8kHz, DSD64, DSD128, etc. The other includes 48kHz, 96kHz, 192kHz and 384kHz. If you feel the need to up- or down-sample, try to stay within the same family. In other words, convert from 44.1kHz to 88.2kHz rather than 96kHz. And convert from DSD64 to 176.4kHz rather than 192kHz. But in any case, SRC does involve a substantial manipulation of the signal, and the principle that generally guides me is that if you can avoid it you ought to be better off without it.
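The family argument is easy to see from the conversion ratios themselves. The short Python sketch below reduces each ratio to its lowest terms, taking the DSD64 rate as 64 x 44.1kHz = 2.8224MHz. It looks only at the sample rates, and ignores everything else involved in converting DSD to PCM.

```python
from fractions import Fraction

DSD64_RATE = 64 * 44100  # 2.8224MHz

conversions = {
    "44.1kHz -> 88.2kHz": Fraction(88200, 44100),
    "44.1kHz -> 96kHz": Fraction(96000, 44100),
    "DSD64 -> 176.4kHz": Fraction(176400, DSD64_RATE),
    "DSD64 -> 192kHz": Fraction(192000, DSD64_RATE),
}

for name, ratio in conversions.items():
    print(f"{name}: up by {ratio.numerator}, down by {ratio.denominator}")
```

Within a family one of the two numbers is always 1, so the conversion is either pure interpolation or pure decimation. Across families you are stuck with factors such as 320/147 or 10/147, with all the complications described above.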