
How Digital Media Works, Part 1: Digital Audio


Nowadays, we can barely imagine using a computer without access to audio and images, including moving ones in the form of video. But just how do these things work? What exactly is contained in these MP3s, PNGs and MP4s? That is what this series of articles will try to explain. It is aimed at the layperson, an end user who’s interested in what’s going on under the hood. If you’re experienced in the topic, say as an audio engineer, you probably won’t gain anything from reading this. To begin with, let’s discuss digital audio.

Let’s begin with a primer on what sound is, physically. Simply put, it is the vibration of matter (most often air). At a larger scale, that vibration shows up as density fluctuations – in other words, a sound wave is a volume of air in which some regions are denser than others. Taking inspiration from school, consider a sine wave. It has a certain wavelength (how far apart the peaks are) and a certain amplitude (how high the peaks are); for a wave travelling at a fixed speed, the wavelength determines the frequency. This has a direct representation in matter, where frequency tells you how quickly the density changes, and amplitude tells you by how much it changes.


A simple sine wave and the corresponding density of air molecules in the same volume of space.


Notice: for a wave to be a wave, it has to be moving. Sound travels through air at roughly 340 metres per second. With a given velocity and a given distance between the peaks (the areas of high density), you can calculate how long it takes for one peak to move to where its neighbour was – which is why the spacing of the peaks corresponds to frequency.
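If you want to make that relationship concrete, here is a minimal sketch – the 340 m/s figure is the approximate speed of sound in air, and the 0.77 m wavelength is just an illustrative value:

```python
# Frequency, wavelength and wave speed are tied together: frequency = speed / wavelength.
speed_of_sound = 340.0  # metres per second in air, approximately

def frequency_from_wavelength(wavelength_m: float) -> float:
    """How many peaks pass a fixed point each second, given their spacing."""
    return speed_of_sound / wavelength_m

# Peaks roughly 0.77 m apart pass by about 440 times per second -- a 440 Hz tone.
print(frequency_from_wavelength(0.77))  # ~441.6 Hz
```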

To store this phenomenon in a digital form (and play it back), one must first be able to convert air pressure into an electrical signal, which is what a microphone does. One common design uses a coil of wire attached to a membrane, with a permanent magnet sitting inside the coil. When pressure waves hit the membrane, they make it and the attached coil vibrate around the magnet, and that movement through the magnetic field induces, through the process of induction, a current in the coil – a current that is more or less a direct representation of the initial pressure waves. The reverse of this process (a current driving a coil and membrane) is what a loudspeaker uses to play sound back.

Now, a distinction has to be made between digital and analogue signals. An analogue signal is something continuous. Think of a solid chunk of rock salt; you can chop some away and take a closer look, “zoom in” as it were, but it’s still solid, it’s still salt through and through (at some point you’ll reach the level of individual molecules, but that’s beyond the scale we care about). A digital equivalent, then, would be a heap of grains of salt. Seen from far away it’s still salt, but if you zoom in far enough, you’ll see the gaps between the individual pieces. In more abstract, mathematical terms, an analogue signal would be called continuous and a digital one discrete. If you have a continuous line, no matter how far you zoom in, it’s still well defined at every scale. A discrete line, on the other hand, consists of points that, put together, resemble a line, but you cannot zoom in indefinitely, because eventually you’ll see the gaps between the points.

The electrical signal generated by a microphone is still analogue, but computers can only handle discrete values, so the analogue signal has to be turned into a digital signal. That’s what a little piece of hardware, called an analogue-to-digital converter, or an ADC, does. You can think of a grid, where the line intersections are the possible point positions, and you mark the closest ones to the original signal. You cannot represent an analogue signal perfectly, as that would require infinitely many points, but with enough points, with a dense enough grid, you can get a pretty good approximation. When you play back audio, the points you’ve stored go through a digital-to-analogue converter, DAC, which has to take the points and try to recreate the original wave based on them, so obviously the more points we have, the better the result.
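To make the idea of sampling and snapping to a grid more tangible, here is a toy sketch of what an ADC does – the tiny sample rate and bit depth are purely illustrative, far below anything real hardware would use:

```python
import math

# Measure the analogue signal at regular intervals (sampling) and snap each
# measurement to the nearest grid level (quantisation).

sample_rate = 8        # samples per second -- tiny, just for illustration
bit_depth = 3          # 2**3 = 8 possible levels
levels = 2 ** bit_depth

def analogue_signal(t: float) -> float:
    """A 1 Hz sine wave in the range -1..1, standing in for the microphone voltage."""
    return math.sin(2 * math.pi * t)

samples = []
for n in range(sample_rate):             # one second of audio
    t = n / sample_rate                  # time of this sample
    value = analogue_signal(t)           # measure the analogue signal
    # Map -1..1 onto 0..levels-1 and round to the nearest integer level.
    quantised = round((value + 1) / 2 * (levels - 1))
    samples.append(quantised)

print(samples)  # [4, 6, 7, 6, 4, 1, 0, 1] -- the stored "points"
```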



Both images showcase all the steps audio goes through, but you can hopefully see how the higher resolution in the lower one produces a more accurate signal.


Each point has a given horizontal position, marking the moment in time at which it was measured, and a vertical position, corresponding to amplitude. How many horizontal positions there are (that’s the X-axis, or time axis) is described by the sample rate, while the number of vertical positions (the Y-axis) is covered by the bit depth. The vast majority of audio you will hear uses a sample rate of 44.1 kHz and a bit depth of 16 bits. This means that, in one second of recording, 44,100 points are recorded (sampled), and there are 2^16 = 65,536 levels that each of them can sit on. This is the standard for CDs, but DVDs (for example concert recordings) will typically use a sample rate of 48 kHz. Fancy, ultra-high-quality recordings might come with a bit depth of 24 and a sample rate of 48 kHz, sometimes even 96 kHz. In professional settings such as studio recordings, where the smallest details need to be accounted for, the bit depth reaches 32 bits, and I’ve seen sample rates as high as 192 kHz. That means that, for each second, 192,000 points can sit on 2^32 ≈ 4.3 billion different levels.
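If you’d like to check those figures yourself, the arithmetic is quick – a rough sketch, nothing more:

```python
# Number of amplitude levels for the bit depths mentioned above.
for bit_depth in (8, 16, 24, 32):
    print(f"{bit_depth} bits -> {2 ** bit_depth:,} levels")
# 8 bits -> 256 levels
# 16 bits -> 65,536 levels
# 24 bits -> 16,777,216 levels
# 32 bits -> 4,294,967,296 levels (about 4.3 billion)

# And at 44.1 kHz, every second of a recording contains 44,100 such points per channel.
```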


The points that make up an actual audio file.


At this point, I would encourage you to pick a song and play around with these values. You can do so for free using a program called Audacity. If you zoom in far enough, you can see the individual points. Changing the sample rate is as simple as choosing a different value in the bottom-left corner, while choosing Export Audio… → Other Uncompressed Formats and selecting the right encoding will let you hear your song at 8 bits instead of the usual 16. Keep in mind that raising these values above what the recording started with won’t make the audio any clearer, as you cannot recover data that isn’t there. If you do play around, you’ll notice that lower sample rates make the overall sound much more muddy, while a bit depth of 8 causes some unpleasant distortion to appear, particularly during quieter moments.

When looking for music online… well, most services don’t bother specifying the quality of the music, and of those that do, you’ll usually only see sample rate and bit depth listed when a service specialises in high-quality audio or otherwise caters to audiophiles. The average listener will most often hear about something called bitrate instead. It’s expressed in kilobits per second, and it specifies how much data is actually allocated to represent the previous two values. You can see this bitrate by opening your song in VLC Media Player, clicking Tools → Media Information and looking at Input bitrate under the Statistics tab. To losslessly – that is, exactly – represent a stream with a bit depth of 16 and a sample rate of 44.1 kHz, you’d need a bitrate of roughly 1411 kbps for a stereo track. Mono would take about 706 kbps, while a 5.1 concert recording would take a little over 4200 kbps. When purchasing CDs, DVDs and Blu-rays, you’ll usually get the lossless form, but streaming services and most online music sources don’t bother with that, for two reasons. First, that is a lot of data to send – a 50-minute stereo album is roughly half a gigabyte. Second, unless you’re dealing with really high-end equipment, the difference between a lossless track and a well-done lossy track amounts mostly to placebo.
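Those numbers aren’t magic; they follow directly from the sample rate, bit depth and channel count. A quick sketch of the arithmetic:

```python
# Lossless bitrate = sample rate x bit depth x number of channels.
def lossless_bitrate_kbps(sample_rate: int, bit_depth: int, channels: int) -> float:
    return sample_rate * bit_depth * channels / 1000

print(lossless_bitrate_kbps(44100, 16, 2))  # stereo: 1411.2 kbps
print(lossless_bitrate_kbps(44100, 16, 1))  # mono: 705.6 kbps
print(lossless_bitrate_kbps(44100, 16, 6))  # 5.1, i.e. six channels: 4233.6 kbps

# A 50-minute stereo album at that rate:
print(1411.2 * 50 * 60 / 8 / 1000, "MB")    # ~529 MB, roughly half a gigabyte
```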

All audio formats carry audio, obviously, and that audio can be either lossless or lossy, and it can also be compressed or uncompressed. To demonstrate this, do another experiment – open up Audacity and, from the top bar, find the option to generate 10 minutes of noise. Export that noise in three different formats – WAV, FLAC and MP3 (Insane preset). Then remove that audio track, generate 10 minutes of silence and export it to the same formats. Now, compare the outputs. Since we’ve made a mono track, it’d take about 706 kbps to represent it perfectly. This is what WAV does. It is a format that’s both lossless and uncompressed. At 10 minutes of length, we can actually calculate how much space it will take up: divide 706 by 8 to turn kilobits into kilobytes, multiply by 600 to account for the length, and divide by 1000 to turn kilobytes into megabytes. The resulting number, 52.95 MB, is almost exactly the size of the WAV file we’ve just created (the tiny difference comes from rounding 705.6 kbps up to 706). Notice, though, that the noise track and the silent track are the exact same size – that’s the uncompressed nature of WAV. No matter the input, it uses exactly the same amount of data to represent it, which is obviously inefficient.
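You can see the uncompressed nature of WAV without even opening Audacity, using Python’s standard wave module. A minimal sketch – the file name is just an example:

```python
import os
import wave

# Write 10 minutes of 16-bit, 44.1 kHz mono silence and check the file size.
with wave.open("silence.wav", "wb") as f:
    f.setnchannels(1)                              # mono
    f.setsampwidth(2)                              # 2 bytes = 16 bits per sample
    f.setframerate(44100)                          # 44.1 kHz
    f.writeframes(b"\x00\x00" * 44100 * 600)       # 600 seconds of zero samples

print(os.path.getsize("silence.wav"))  # ~52,920,000 bytes, plus a small header
# Filling the frames with random bytes instead gives a file of exactly the same size --
# WAV spends the same space no matter what the audio contains.
```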

This is where FLAC comes in. Like WAV, it is lossless, but unlike WAV, it is compressed. The audio data in the noise track is almost completely random, which is why that track takes up about the same amount of space in FLAC as it does in WAV – you just can’t compress it. Looking at the silent track, however, you can see that FLAC reduced the original 52.9 MB to just 8.1 MB. One can wonder what those 8 megabytes are spent on exactly, but regardless, we’ve just compressed the file by about 85%. During playback the data is decompressed, so it will sound absolutely identical to the WAV track. Of course, pure silence is easy to compress, so in practice the compression rates will be much lower. An album I own, Iron Maiden’s The Book of Souls, takes up 977 MB uncompressed, while storing it in FLAC reduces the size by 36%, to 628 MB.

This compression is obviously useful both for file transfer and for storage, but how is it accomplished? The specifics of FLAC are unknown to me, but general compression principles are enough to explain the idea. Ultimately, all this data consists of 1s and 0s, and as long as the data isn’t random (like the noise we generated), there are patterns and other regularities that can help us out. For example, if your song has a clear, constant beat playing at the beginning, the differences between the individual beats are minuscule, so it makes more sense to record one of them in detail and describe the others as slight alterations of it, instead of recording each one in full. Another example: when something like a bass guitar plays on its own, the physical representation of those low notes is a low frequency. For such content the 44.1 kHz sample rate is overkill, because the difference between neighbouring samples is tiny, so the same information can be stored far more compactly without losing any detail. Because of all this, FLAC has a variable bitrate that depends on the complexity of the audio and how well it can be compressed – loud and complex parts of a song can easily reach 1000 kbps, while quieter parts with low tones may rarely go past 300 kbps.
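To get a feel for why predictable data shrinks so well, here is a toy illustration – this is not FLAC’s actual algorithm, just the general principle of storing differences and collapsing repetition:

```python
# A slowly changing (low-frequency) signal turns into long runs of tiny,
# repeated differences, which take very little space to describe.

def delta_encode(samples):
    """Replace each sample with its difference from the previous sample."""
    previous = 0
    deltas = []
    for s in samples:
        deltas.append(s - previous)
        previous = s
    return deltas

def run_length_encode(values):
    """Collapse repeated values into [value, count] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# A slow ramp, standing in for a low tone: 0, 2, 4, ..., 198 (100 samples).
samples = list(range(0, 200, 2))
print(run_length_encode(delta_encode(samples)))
# [[0, 1], [2, 99]] -- one hundred samples described by just two pairs.
```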


An example of an audio file where successive sections of the music (marked with different colours) are actually identical, and so can be compressed very efficiently.


By far the most popular format, however, is MP3. Depending on the settings you choose, it can spend a fixed or a variable amount of data on each second of audio, but one constant is that it is a lossy format. What this means is that the data is not only compressed, but some of it is thrown away as well. The specifics, like with FLAC, elude me – this stuff gets very complex, and even if I knew it, it would be far beyond the scope of this article. Still, MP3 introduces the idea of meaningful data – that is, data that has a significant influence on how people perceive the sound – and non-meaningful data. For example, the general hearing range for humans is roughly 20 Hz to 20 kHz, and the extremely low and extremely high tones within it are barely perceptible, so removing them will not make a meaningful difference for most people, while saving a fair bit of space. Going back to the files we’ve created, you’ll see that the noise takes up 52.9 MB in WAV and FLAC, while in MP3 it’s just 24 MB – the encoder decided that over half of the data is ultimately not meaningful, and if you listen to it, you’ll have a hard time disagreeing. Unless you have exquisite audio gear and equally exquisite ears, a 320 kbps MP3 (that’s the Insane preset we selected) is in most cases indistinguishable from a lossless track, despite being several times smaller.

If you look at the silent track, however, you’ll notice that it is also 24 MB in size, which might suggest that MP3 doesn’t bother with compression at all. That behaviour is really down to the preset we chose. The settings we used specified a bitrate of 320 kbps; in other words, we used a Constant Bit Rate, or CBR. Every second of the audio was stripped of irrelevant data and then, regardless of what it contains, assigned 320 kilobits. A more reasonable approach is a Variable Bit Rate, or VBR, which you can also pick in Audacity during the export process. Instead of a single value, you give the encoder a range to work within, combining the best of both worlds – lossy encoding and compression that adapts to the content.

320 kbps is the upper limit of what MP3 supports, and is the bitrate used by many streaming platforms, like Amazon Music and Spotify Premium. A more common rate, however, is 128 kbps, which is what you’ll find on YouTube, or when playing back songs on Bandcamp and Spotify for free. The lower the bitrate, the worse the quality: compare a 128 kbps track directly to a 320 kbps one and you’ll hear some sounds in the former becoming muddy and noise-like. In some cases that’s perfectly fine, though; audiobooks, for example, often use something in the region of 92 kbps, because the human voice just doesn’t need much data to represent.
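As a crude illustration of “throwing away what you can barely hear” – and only an illustration, since real MP3 encoders rely on far more sophisticated psychoacoustic models – here is a sketch that simply discards frequency content above a cutoff; the 16 kHz cutoff and the test tones are arbitrary choices, and numpy is assumed to be installed:

```python
import numpy as np

sample_rate = 44100
t = np.arange(sample_rate) / sample_rate                  # one second of time stamps

# A clearly audible 440 Hz tone plus a faint 18 kHz component most people barely hear.
signal = np.sin(2 * np.pi * 440 * t) + 0.1 * np.sin(2 * np.pi * 18000 * t)

spectrum = np.fft.rfft(signal)                             # look at the signal by frequency
freqs = np.fft.rfftfreq(len(signal), d=1 / sample_rate)
spectrum[freqs > 16000] = 0                                # drop the barely audible content
filtered = np.fft.irfft(spectrum, n=len(signal))           # back to a playable waveform

# 'filtered' still contains the 440 Hz tone essentially untouched, but the 18 kHz
# component is gone -- less information to store, with little audible difference.
```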


The output of a program called youtube-dl, showing (highlighted), among other things, the audio codecs YouTube uses, their bitrates and sample rates.


Of course, there are many more audio formats. YouTube videos, for example, use AAC (lossy and compressed). Apple has its own lossless codec, ALAC (compressed, lossless), serving much the same role as FLAC, and older games liked to store their music and sounds in the Vorbis (lossy) format with the OGG file extension. There are countless more, but while they differ in their algorithms, they can all be classified along the same axes – compressed or uncompressed, lossy or lossless – which is arguably the most important aspect of storing audio data. With that said, I hope the article was clear and concise, and that you now have a decent understanding of digital audio. In the next part of the series, we will take a look at the digital storage of images: different colour depths and spaces, compression algorithms, and more.

