Music Analysis
From BluWiki
Forecasting Top-100 Radio Music Success
Can we predict what defines a successful song in the top-100 music industry? Is success governed by a deterministic function, or is it some sort of stochastic process?
You could paraphrase this as follows: is it possible to write an algorithm that will guess how popular a song will be? It doesn't have to be entirely accurate; even 51% accuracy would be a step forward from what we currently have.
What do we mean by success? Success is the difference between a song that people hear once and ignore and a song that plays for months on end. It doesn't mean that you as an individual like the song (you very likely don't); musical success means that a song grosses a lot of money. We could go into the whole philosophical background of what makes a song that people like, but it might be easier to just look at the statistics and draw conclusions from there.
Introduction
Popular music can be thought of as having several main components:
1. Lyrics -- the words that make up the song. Textual. You can grab all of this off of sites like lyricwiki.org, or write your own script that scrapes lyric websites. Not too difficult.
2. Sounds -- the sounds you hear -- the melodies, rhythms, tempo, beats, key, etc. It's harder for computers to understand this. But there definitely is a way to derive this information; a Fast Fourier Transform is a good place to start looking. Apps like Shazam build on this kind of frequency analysis.
3. Cultural Aspects -- who sings the song? Is the artist popular? Has he/she released good music in the past? Is the song part of a movie soundtrack? Is there a story behind the song? When in the year was the song released?
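To make the Fast Fourier Transform idea from item 2 concrete, here is a minimal sketch, not a production audio pipeline. It implements a textbook recursive radix-2 FFT in pure Python and uses it to find the loudest frequency in a synthetic tone. The function names (`fft`, `dominant_frequency`) and the test tone are illustrative assumptions; real songs would need decoding, windowing, and far more careful handling.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; len(samples) must be a power of two."""
    n = len(samples)
    if n == 1:
        return [complex(samples[0])]
    even = fft(samples[0::2])
    odd = fft(samples[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin below Nyquist."""
    spectrum = fft(samples)
    half = len(samples) // 2
    peak_bin = max(range(1, half), key=lambda k: abs(spectrum[k]))
    return peak_bin * sample_rate / len(samples)


# Synthetic stand-in for audio: a pure 64 Hz sine sampled at 1024 Hz for one second.
sample_rate = 1024
tone = [math.sin(2 * math.pi * 64 * t / sample_rate) for t in range(1024)]
print(dominant_frequency(tone, sample_rate))
```

Running the same function over short consecutive windows of a track would give a rough picture of how its frequency content changes over time, which is one possible starting point for features such as key or tempo.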
Hypothesis
Hypothesis: There is a correlation between a song's contents (lyrics, sounds, cultural aspects) and how successful it is throughout its lifetime, and the squared correlation coefficient r^2 is at least 0.6.
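As a rough sketch of how the hypothesis could be checked for a single numeric feature, the code below computes the Pearson correlation coefficient r between a feature and the success scores, squares it, and compares it to the 0.6 threshold. The function names and the idea of testing one feature at a time are assumptions made for illustration; the project may well need a multi-feature model instead.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def supports_hypothesis(feature_values, success_scores, threshold=0.6):
    """True if r^2 between the feature and the scores reaches the threshold."""
    r = pearson_r(feature_values, success_scores)
    return r * r >= threshold
```

Because r is squared, a strong negative relationship counts as much as a strong positive one.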
Methodology
Measuring a song's lifetime success
A huge part of this project is calculating how successful a song actually is. It's important to have a formal definition of "success," so we'll define it here. The ranking process is based on the American Billboard Top 100 (ABT100; the chart's official name is the Billboard Hot 100), the de facto rating authority for music. Each week, the ABT100 releases a ranking of the 100 most popular songs for that week. For all intents and purposes, we will assume that these rankings measure how successful a song is, on the premise that chart rank correlates strongly with the amount of money a song grosses.
Songs normally stay on the chart for more than one week. Each week's score is 100 minus that week's rank. Here's an example of how we will calculate a song's score:
| Week # | Rank on ABT100 | Score | Comment |
| 1 | 20 | 80 | Song debuts on billboard in 20th place |
| 2 | 10 | 90 | Gains momentum, jumps to 10th place |
| 3 | 1 | 99 | Explodes, gets first place. |
| 4 | 50 | 50 | Old news. Nobody wants to hear it anymore. |
| Song is no longer ranked. |
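The table implies a weekly score of 100 minus the chart rank. The sketch below encodes that rule. Two things in it are assumptions rather than part of the definition above: weeks in which the song is unranked score 0, and the lifetime score is the plain sum of the weekly scores. The function names are made up for the example.

```python
def weekly_score(rank):
    """Score for one week: 100 minus the chart rank; 0 if unranked (assumption)."""
    if rank is None:
        return 0
    return 100 - rank


def lifetime_score(weekly_ranks):
    """Sum of weekly scores over a song's chart run (summing is an assumption)."""
    return sum(weekly_score(rank) for rank in weekly_ranks)


# The example from the table: ranks 20, 10, 1, 50, then unranked.
example_ranks = [20, 10, 1, 50, None]
print(lifetime_score(example_ranks))  # 80 + 90 + 99 + 50 + 0 = 319
```

Summing rewards both high placement and long chart runs; an average or a peak-only score would weight those differently.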
All this data is available from the Whitburn Project, which lists every popular song from 1890 onward. I'm not sure if I'm allowed to link to the actual spreadsheet from here, but here's a link: Billboard Pop ME 1980-2008.xls.
Analyzing Attributes
We mentioned three characteristics of each song: lyrics, sound, and culture. I don't want to go into analyzing culture quite yet, but it could very well hold some interesting information. If cultural factors do predict success, one could draw conclusions about how well a song will do before it's even written. Interesting!
Doing transforms on the melodies will be a huge computational challenge. It would also require downloading all of the songs and sorting through a LOT of noise to find the actual music signal. There is certainly some value here, but it would be much easier to start off by using the lyrics to describe a song.
Lyrical & Semantic Comprehension
- Latent Semantic Analysis seems like a very, very good contender. Most of the LSA tutorials I've walked through have had the following steps:
- Aggregate all documents into a document vector D: {d1, d2, ..., dn}. Each element in this vector represents a document, in our case a song.
- Normalize the text by making everything lowercase and cleaning up whitespace.
- Stem words using some stemming algorithm. Stemming algorithms simply trim off the endings of words. This turns, for example, the word "going" -> "go", "songs" -> "song", etc.
- Tf-idf: This is a brilliant step. It finds the important words in a document by comparing each word's frequency in that document to its frequency throughout all of the documents being compared. Tf-idf recognizes that the word "that" isn't very important because it appears everywhere, but "smack" is a more important word.
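The preprocessing steps above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: `crude_stem` is a deliberately naive suffix stripper standing in for a real stemmer such as Porter's, the tf-idf formula is one common variant (term frequency times log of inverse document frequency), and the three "songs" are made-up lines, not real lyrics.

```python
import math
import re
from collections import Counter


def normalize(text):
    """Lowercase the text and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def crude_stem(word):
    """Toy suffix stripper; a stand-in for a real stemming algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: -len(suffix)]
    return word


def tokenize(text):
    """Normalize, split into words, and stem each word."""
    return [crude_stem(w) for w in re.findall(r"[a-z']+", normalize(text))]


def tf_idf(documents):
    """Return one {term: weight} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents each term appears in.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Made-up example lines standing in for song lyrics.
songs = [
    "That beat is going to smack",
    "That song is all that I need",
    "That is the going rate",
]
weights = tf_idf(songs)
print(weights[0]["that"], weights[0]["smack"])
```

With this variant, "that" appears in every document, so its weight is zero, while "smack" appears in only one and gets a positive weight. In a full LSA pipeline, the resulting term-document weights would typically be fed into a singular value decomposition next.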
Source Code
I started this as a school project; I'll post the source code on GitHub.
Interested?
Email me if you have any questions; my email is omar.bohsali@gmail.com. I'm also in #musicanalysis on freenode.
Comments
Sam:
Sounds interesting. I think the "cultural" variable will have the highest correlation but will also be the least interesting, since the market (aka record labels) already takes it into consideration when deciding how much an artist's contract is worth.
I don't need to know that a song by Britney Spears is more likely to be successful than one by an unknown. I do want to know if a song by an unknown is likely to be more successful than one by an also unknown peer.
To do that I'd suggest focusing on correlating only lyrics and sounds to the success of a song.



