Data Acquisition

Data will mainly be collected from the Echo Nest, a company that provides information about and analysis of music content, based on a large database that currently contains more than 36 million songs. The data will be extracted using the Echo Nest API. Each API call used is explained in the following sections, which document the method used to download music content.

Extract all available music genres

To get an initial idea of the amount of data available, we start by extracting all available music genres. This can easily be done using Pyen, a thin client library for the Echo Nest written in Python. To each genre the Echo Nest has assigned a numerical value between 0 and 1 indicating how familiar the genre currently is to the world.

import pyen
# Set API key 
en           = pyen.Pyen("RCFLSTK3HJSSKY7XN")
# API call to retrieve all genres 
response     = en.get('genre/list',bucket=['genre_scores'],
                      format='json',results=2000)
genres_score = []

# Save genres in a list 
for g in response['genres']:
    genres_score.append([g['name'],g['scores']['coherence'],
                         g['scores']['centrality'],
                         g['scores']['familiarity']])

# sort by familiarity
genres_score.sort(key=lambda x: x[3], reverse=True)

We find that the Echo Nest returns 1373 different music genres, ranging from pop and rock to more exotic genres such as deep space rock and gypsy jazz. As we are unfamiliar with most of the genres, we sort them by the familiarity parameter in decreasing order, so that the most unfamiliar genres can be removed if needed.
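Removing the most unfamiliar genres could be sketched as follows. This is only a minimal illustration, assuming that genres_score keeps the [name, coherence, centrality, familiarity] layout built above. The helper name filter_by_familiarity, the threshold of 0.3 and the sample rows are made up for the example and are not Echo Nest data.

```python
def filter_by_familiarity(genres_score, min_familiarity):
    """Keep the rows whose familiarity (index 3) is at least min_familiarity.

    The relative order of the input rows is preserved.
    """
    return [row for row in genres_score if row[3] >= min_familiarity]


# Illustrative rows only: [name, coherence, centrality, familiarity]
sample = [
    ["genre a", 0.5, 0.5, 0.9],
    ["genre b", 0.5, 0.5, 0.4],
    ["genre c", 0.5, 0.5, 0.1],
]
kept = filter_by_familiarity(sample, min_familiarity=0.3)
kept_names = [row[0] for row in kept]
```

Because the list is already sorted by familiarity, the same result could be obtained by cutting the list at the first row that falls below the threshold. The filter above does not rely on the sort order.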

Extract the hottest artists within a music genre

To distinguish between artists within a genre, a hotttnesss parameter will be used. It is set by the Echo Nest and indicates how trending an artist currently is, based on social commentary, play counts, and editorial volume. We can pass this parameter as a sorting option to the API and in that way list the hottest artists in decreasing order. We can also collect information about each artist, such as years active and geographical location. Note that this information is not available for all artists.

import pyen

en = pyen.Pyen("RCFLSTK3HJSSKY7XN")
# list_genres is a list of the genres for which the (in this case) 100
# hottest artists are to be found
for genre in list_genres:
    artist_tmp = en.get('artist/search', genre=genre, start=0, results=100,
                        bucket=['hotttnesss', 'years_active', 'artist_location',
                                'id:spotify'], sort='hotttnesss-desc')
    information_list = []
    for artist in artist_tmp['artists']:
        # Collect the fields as text and join them once at the end, so that
        # non-ASCII artist names are only encoded in the final step
        fields = [artist['name'], artist['id'], str(artist['hotttnesss'])]
        try:
            years = artist['years_active'][0]
            fields.append(str(years['start']))
            # If no end year is given, use 2015
            fields.append(str(years.get('end', 2015)))
        except (KeyError, IndexError):
            pass  # years active are not available for this artist

        # The location is not available for all artists either
        country = (artist.get('artist_location') or {}).get('country')
        if country:
            fields.append(country)
        information_list.append(u' '.join(fields).encode('utf-8').strip())

It is now possible to create a structure with n artists in each genre and to investigate whether the artists, or the songs made by them, differ.
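One possible shape for such a structure is a dictionary that maps each genre to a list of artist records. The sketch below is only an illustration under stated assumptions: the helper names artist_record and build_genre_structure are hypothetical, the input is assumed to have the same fields as the 'artist/search' response used above, and the sample data is invented.

```python
def artist_record(artist, default_end=2015):
    """Flatten one artist entry into a dict; missing fields become None."""
    years = artist.get("years_active") or [{}]
    location = artist.get("artist_location") or {}
    start = years[0].get("start")
    return {
        "name": artist["name"],
        "id": artist["id"],
        "hotttnesss": artist.get("hotttnesss"),
        "start": start,
        # Only fall back to default_end when a start year is known
        "end": years[0].get("end", default_end) if start is not None else None,
        "country": location.get("country"),
    }


def build_genre_structure(artists_by_genre, n):
    """Map each genre to the records of its first n artists."""
    return {genre: [artist_record(a) for a in artists[:n]]
            for genre, artists in artists_by_genre.items()}


# Invented sample data in the assumed response layout
sample = {
    "example genre": [
        {"name": "Artist A", "id": "ID_A", "hotttnesss": 0.8,
         "years_active": [{"start": 1999}],
         "artist_location": {"country": "Denmark"}},
        {"name": "Artist B", "id": "ID_B", "hotttnesss": 0.7},
    ],
}
structure = build_genre_structure(sample, n=2)
```

Keeping missing values as None, rather than dropping the field, means every record has the same keys, which makes later comparisons between artists simpler.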

Extracting track information

Using the Echo Nest API it is possible to get information about a track, such as song title, hotttnesss and the geographical location of the artist. The API can also return an audio summary containing analytic content. This includes a number of acoustic attributes, each of which is supposed to give a subjective estimate between 0 and 1 of a quality of the track.

In addition, the audio summary returns a few objective measures about the track in general.

We can then retrieve the track information for up to 1000 tracks for each of the hottest artists in a genre using the following API call.

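# echo_ID is the Echo Nest ID of an artist, counter is the index of the first
# result to return and step is the number of results requested per call
track_information = en.get('song/search', artist_id=echo_ID, results=step,
                           start=counter, sort='song_hotttnesss-desc',
                           bucket=['audio_summary', 'song_currency',
                                   'song_hotttnesss', 'artist_location',
                                   'artist_hotttnesss',
                                   'artist_familiarity'])['songs']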
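The call above takes a start index (counter) and a page size (step), which suggests that the tracks are collected page by page. The loop around the call is not shown here, so the following is only a sketch of how such paging might look. The function fetch_all, its defaults of 100 results per call and 1000 results in total, and the stand-in fake_fetch are assumptions made for illustration. No real API call is made.

```python
def fetch_all(fetch_page, step=100, limit=1000):
    """Collect up to `limit` items by requesting pages of `step` items.

    fetch_page(start, results) must return a list. A short or empty page
    is taken to mean that no more items are available.
    """
    items = []
    counter = 0
    while counter < limit:
        results = min(step, limit - counter)
        page = fetch_page(counter, results)
        items.extend(page)
        if len(page) < results:
            break  # last page reached
        counter += results
    return items


# Stand-in for the API: pretends that an artist has 250 songs
def fake_fetch(start, results):
    catalogue = list(range(250))
    return catalogue[start:start + results]


songs = fetch_all(fake_fetch)
```

With Pyen, fetch_page could be a small wrapper around the 'song/search' call that passes its two arguments on as start and results and returns the 'songs' entry of the response.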