Quibbles and Bits

Brainz The Size of a Planet, Part 2

In the last edition of Copper, I introduced MusicBrainz, a crowd-sourced, free-to-use database of metadata for recorded music. From a skeptical beginning, I have come to appreciate that MusicBrainz is actually a first-class resource, seriously well thought-out – one which has accomplished far more than I would have thought possible, and one which I enthusiastically endorse going forward.

MusicBrainz starts from the premise that the metadata typically associated with ripped and/or downloaded music is inadequate, and puts in place a framework to improve upon that. This is so much easier said than done. If you are going to improve upon something, you have to have a clear view of what is wrong with it, and in what specific ways it has to be improved. In doing so, it is critical that the data structures you put in place can be applied to the widest possible spectrum of music styles and formats, and also that they are compatible (to the largest extent practicable) with the norms which have hitherto become accepted as standard practice. Both of these requirements involve challenges, and those challenges exist both as fundamental issues regarding how the database is structured, and as problems regarding how the data will be used, viewed, and understood in the real world.

MusicBrainz is what is called a ‘relational database’. This type of database comprises lists of similar entities, together with tables of relationships that describe how items on one list relate to items on another list. For example, one list can be a list of people, and another can be a list of musical compositions. A Person can be related to a Work via a composer relationship. Typically a relationship is a two-way affair so that the Person is ‘composer of’ the Work, and the Work was ‘composed by’ the Person. Therefore the first things to understand about MusicBrainz are the primary lists. There are actually 15 of those lists, but only four of them form the vast bulk of the critical relationships in the database. These are Artists, Releases, Recordings, and Works, so let’s just focus on those.
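For readers who think in code, the idea of lists plus a table of relationships can be sketched in a few lines of Python. This is only a toy: the ID numbers, the `artist_work` table, and the `describe` helper are invented for illustration and are not MusicBrainz's actual schema.

```python
# Toy model of a relational database: two entity lists plus one relationship table.
# IDs, table names, and helper names are illustrative assumptions only.

artists = {1: "Ludwig van Beethoven"}
works = {10: "Symphony No. 9"}

# Each row links an entry in the Artists list to an entry in the Works list.
artist_work = [(1, "composer", 10)]

# A single stored row can be read in either direction.
PHRASES = {"composer": ("composer of", "composed by")}


def describe(row):
    """Return the two readings of one relationship row."""
    artist_id, rel_type, work_id = row
    forward, backward = PHRASES[rel_type]
    return (
        f"{artists[artist_id]} is {forward} {works[work_id]}",
        f"{works[work_id]} was {backward} {artists[artist_id]}",
    )


forward_text, backward_text = describe(artist_work[0])
print(forward_text)
print(backward_text)
```

The point of the sketch is that the relationship is stored once, and the two-way wording is simply a matter of which end you read it from.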

Artists include both people and ensembles. The Beatles constitute an Artist, as do John Lennon and Paul McCartney. A relational database also allows for relationships within a list, so that John Lennon has a relationship ‘member of’ with The Beatles, as indeed does Paul McCartney, and The Beatles have corresponding ‘has member’ relationships with both Lennon and McCartney. Orchestras, choirs, conductors, and producers all end up as part of the Artists list, as do composers, photographers, lyricists, and arrangers. Mostly, though, Artists have important relationships with entities on other lists. So, for example, the only way we know if an Artist is a composer is if that Artist has a ‘composer of’ relationship with an entity in the list of Works. This is very helpful in the big picture because, as we know, individuals can wear many different hats over the course of a career. Leonard Bernstein’s recorded oeuvre includes appearances as conductor, composer, and concert pianist. And ‘ambient music performer’ Brian Eno (much beloved of the NY Times crossword) appears in MusicBrainz as guitarist, keyboard player, percussionist, composer, lyricist, arranger, producer, engineer, vocalist, illustrator, chorus master… that list just goes on and on and on.
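A rough Python sketch may help show how an Artist's "hats" fall out of relationships instead of being stored on the Artist itself. The row layout, the relationship wording such as "conductor on", and the targets "Work A", "Recording B", and "Recording C" are placeholders made up for this illustration, not real MusicBrainz entries.

```python
# Illustrative rows only: (source Artist, relationship, target list, target).
# "Work A", "Recording B" and "Recording C" are hypothetical placeholders.
relationships = [
    ("John Lennon", "member of", "Artists", "The Beatles"),
    ("Paul McCartney", "member of", "Artists", "The Beatles"),
    ("Leonard Bernstein", "composer of", "Works", "Work A"),
    ("Leonard Bernstein", "conductor on", "Recordings", "Recording B"),
    ("Leonard Bernstein", "pianist on", "Recordings", "Recording C"),
]


def roles_of(artist):
    """An Artist's roles are whatever relationship types point away from it."""
    return sorted({rel for source, rel, _, _ in relationships if source == artist})


def members_of(group):
    """Read the 'member of' rows from the group's side ('has member')."""
    return sorted(
        source
        for source, rel, target_list, target in relationships
        if rel == "member of" and target_list == "Artists" and target == group
    )


bernstein_roles = roles_of("Leonard Bernstein")
beatles_members = members_of("The Beatles")
print(bernstein_roles)
print(beatles_members)
```

Nothing in the Artists list itself says "composer" or "conductor"; delete the relationship rows and the roles disappear with them.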

MusicBrainz is a highly structured and formalized environment, and the relationships that individual entities can have within and among each other are carefully controlled. Strict hierarchies are maintained. For example:

  • Works are individual pieces of music, and have ‘recording of’ relationships to individual Recordings
  • Recordings are specific recorded performances and have ‘performance of’ relationships to individual Works
  • Tracks are structural components of a Release (which is MusicBrainz-speak for an album). Individual Tracks contain individual Recordings;
    Releases contain one or more Tracks

This may be complicated-sounding – and in fact it gets even more complicated than this – but believe me, the complication is the necessary evil to be accommodated if the system is to apply smoothly and consistently across all the possibilities encountered in the world of recorded music.

Works can be broken down into multiple parts, each of which is in itself a Work, and has a ‘part of’ relationship with the parent Work. This is most commonly seen in classical music, where, for example, Beethoven’s 9th Symphony has four movements. In this case the symphony itself is a Work, and each of the movements exists in MusicBrainz as a separate ‘part of’ Work. These ‘part of’ relationships can be nested as deep as you need. Works typically have a ‘composed by’ relationship with someone in the Artists list, and will often also have lyricist, arranger, or even ‘revised by’ relationships. Interestingly, one of the legacy aspects that MusicBrainz has decided to live with instead of imposing its own view is that it includes composer, lyricist, librettist, and writer as separate relationships, which can be viewed as conveying a certain ambiguity.
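The nesting of ‘part of’ Works is a natural fit for a recursive structure. Here is a minimal Python sketch, assuming a hypothetical `Work` class with a `parts` list; the movement titles are generic placeholders and the class is not how MusicBrainz stores things internally.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Work:
    """A Work whose parts are themselves Works (hypothetical model)."""
    title: str
    parts: List["Work"] = field(default_factory=list)


def outline(work, depth=0):
    """Walk the 'part of' tree, indenting each level by two spaces."""
    lines = ["  " * depth + work.title]
    for part in work.parts:
        lines.extend(outline(part, depth + 1))
    return lines


# Placeholder titles: a symphony with four movements, as in the example above.
ninth = Work(
    "Symphony No. 9",
    parts=[Work(f"Movement {n}") for n in range(1, 5)],
)

ninth_outline = outline(ninth)
print("\n".join(ninth_outline))
```

Because `outline` calls itself for every part, it copes with a movement that is itself divided into sections, and so on, to whatever depth is needed.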

Releases are an important entity in MusicBrainz because music is typically released onto the market in self-identified collections, such as albums. Therefore albums, EPs, singles, and downloadable releases comprising just one item, all constitute Releases. But for most music, when we are talking about Releases, we are talking about albums. MusicBrainz allows for a lot of information to be stored in respect of albums, including release date, record label, catalog number, cover art, and a whole lot more.

Recordings in MusicBrainz are what we normally call tracks. A Release will comprise a number of Recordings, which are just the tracks on the album. Each of those Recordings will have its own set of attributes, including track number, duration, performers associated with the track, and so forth. One of the key things about MusicBrainz – which causes a lot of trouble – is the relationship between Recordings and Works, and this illustrates nicely one of the requirements I laid out at the start of this column, that the MusicBrainz database should be compatible with the widest possible spectrum of music styles and formats. In the classical music world any given piece of music may have many different recordings of it that have been released by various performers, or even by the same performers at different times. So it follows that a Recording and a Work cannot be the same thing. A Recording has to be a recording of a Work (read that sentence again in order to understand why I have been so anal with my use of capitalization and italics). In existing digital audio metadata no such distinction is made – so a track has both performers and composers, and there is no place at all for the Work, unless it is somehow (i.e. informally) captured in the track’s Title. In MusicBrainz it is only the Recording which can have performers, and the Work which can have composers – you cannot associate composers with Recordings, nor performers with Works. If you think about it, it makes perfect sense.
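To make the Recording/Work split concrete, here is a small Python sketch of the hierarchy. Everything in it is an assumption for illustration: the class names mirror the MusicBrainz terms, but the fields, the `composers_of` helper, and the names "Composer X", "Orchestra A", and "Orchestra B" are invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Work:
    title: str
    composers: List[str] = field(default_factory=list)  # composers live here only


@dataclass
class Recording:
    work: Optional[Work] = None  # None when nobody has linked a Work yet
    performers: List[str] = field(default_factory=list)  # performers live here only


@dataclass
class Track:
    number: int
    recording: Recording


@dataclass
class Release:
    title: str
    tracks: List[Track] = field(default_factory=list)


def composers_of(track):
    """A track's composers are reached through its Recording's Work, if any."""
    work = track.recording.work
    return work.composers if work is not None else []


# One Work, two different Recordings of it on two different Releases.
piece = Work("Some Symphony", composers=["Composer X"])
release_a = Release("Album A", [Track(1, Recording(piece, ["Orchestra A"]))])
release_b = Release("Album B", [Track(1, Recording(piece, ["Orchestra B"]))])

# A Recording that was entered without any Work attached.
bare_track = Track(1, Recording(performers=["Some Band"]))

print(composers_of(release_a.tracks[0]))
print(composers_of(release_b.tracks[0]))
print(composers_of(bare_track))
```

Note that `Recording` has no composer field at all, so the only route to a composer is through the Work. That is also why a Recording with no Work attached ends up with no composer information.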

The second aspect of MusicBrainz that I want to cover in this column is the crowd-sourcing aspect. Crowd-sourcing means that – like Wikipedia as a well-known example – anybody can sign in and enter data. (As an experiment, some years ago, I made a minor edit to the official Wikipedia page for the state of North Carolina. Not an overtly controversial edit, but one with mild socio-political overtones, replacing a text which had slightly less mild socio-political overtones. I was interested to see how long it would last – and who would change it (and why). But, no, it is still there!)

MusicBrainz then has a community of Editors who pore over newly crowd-sourced data and edit out any errors or any data that do not conform to the MusicBrainz ‘style guidelines’. At least that’s the theory. In practice, based on what I am seeing, there are serious limits on how much of the submitted material the Editors can actually review, and as a result huge swathes of the database are not in strict compliance with the style guidelines. But this is not surprising when you consider that over 15,000 new Releases (i.e. albums) are added to the MusicBrainz database every month, a rate which is actually slowly accelerating. (Every time I read that number I find it so incredible I have to go back and check it again in case I made a mistake!)

Adding new data to MusicBrainz can be a tedious process, particularly since the available tools are not particularly user-friendly, but also because every time you want to add a relationship to an entity which is not pre-existing in MusicBrainz, you must first create that entity from scratch, a task which gets old very quickly. With modern (pop/rock etc.) music this often means creating a new Work for each track on the album, which is doubly tedious because you have to check first to see if the Work already exists (the process for doing that isn’t as simple or convenient as you might wish, for all sorts of irritating reasons). And there are always those occasions when you know that a track you are entering is a cover of another track which is already in MusicBrainz, but you find that nobody has bothered to create the Work for it. Because of issues like these, a great many Releases in MusicBrainz have neither performers nor Works associated with their Recordings, which means it is left to editors to step in and fill in the blanks. But there aren’t remotely enough editors to be able to keep up.

The complexity, and thoroughness, of the MusicBrainz database is at the same time its greatest strength and greatest weakness. Strength, because it allows the most comprehensive metadata relationships to be unambiguously assembled. Weakness, because if nobody is motivated to enter the data in the first place, then the strengths are quite irrelevant. By far MusicBrainz’s biggest problem is how sparsely key data relationships have been entered by the community. With modern music in particular, it is surprising how few Releases in MusicBrainz actually have Works associated with their individual tracks. This means, among other things, that the composer relationships for such tracks are empty.

There are a couple of other important databases which are associated with MusicBrainz, and which form key parts of the MusicBrainz ecosystem. AcoustID is a bit like Shazam. AcoustID publishes an algorithm with which an ‘acoustic fingerprint’ of any given track can be calculated and the resultant fingerprint stored in the AcoustID database. This can then be used to identify the track: somebody can take an unknown piece of music, calculate its acoustic fingerprint, submit that to AcoustID, and, if a match is found, find the matching Recording in the MusicBrainz database. Of course, this requires that if a new album is entered into the MusicBrainz database, you need to be able to generate acoustic fingerprints for each track, register them with AcoustID, and then register the match with MusicBrainz. This is done using a free app called MusicBrainz Picard, and, naturally, can only be done if you actually have the music to hand. The other associated database is the Cover Art Archive, which is used to store cover art and other imagery associated with MusicBrainz data (because images themselves are not handled by the MusicBrainz database).
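The register-then-look-up flow can be sketched in Python, with one big caveat: the `toy_fingerprint` function below is just a cryptographic hash standing in for a real acoustic fingerprint. A genuine fingerprint is derived from how the audio sounds and is meant to survive differences in encoding, which a plain hash does not. The function names, the in-memory index, and the "recording-0001" ID are all hypothetical, and nothing here calls the actual AcoustID service.

```python
import hashlib

# In-memory stand-in for a fingerprint database: fingerprint -> Recording ID.
fingerprint_index = {}


def toy_fingerprint(audio_bytes):
    """Placeholder only: a hash of the raw bytes, NOT a real acoustic fingerprint."""
    return hashlib.sha256(audio_bytes).hexdigest()


def register(audio_bytes, recording_id):
    """Store the fingerprint of a known track against its Recording ID."""
    fingerprint_index[toy_fingerprint(audio_bytes)] = recording_id


def identify(audio_bytes):
    """Fingerprint an unknown track and return the matching Recording ID, or None."""
    return fingerprint_index.get(toy_fingerprint(audio_bytes))


# Someone with the album to hand registers a track...
register(b"pretend audio for track 1", "recording-0001")

# ...and later somebody else identifies the same audio, or fails to.
known = identify(b"pretend audio for track 1")
unknown = identify(b"some other audio")
print(known)
print(unknown)
```

The sketch shows why registration needs the music itself: without the audio there is nothing to fingerprint, and so nothing for a later lookup to match against.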

So that is a basic introduction to MusicBrainz, and believe me, there is more than ten times as much that I could have written if I had the space, and if I thought you had the patience to read it. So in the next and final MusicBrainz column, I’ll deal mostly with how you can use MusicBrainz to power a state-of-the-art music server.