I know, I know.
This is my *second* post, in as many years. Who would’ve thought a sequence of life-changing events would put a blog – a blog! – on hold, huh.
Now that the awkward part is behind me:
In this post I want to talk about a small project I’ve been carrying in my head for a while now. It is about metal music, coding and scraping, and not least of all, visualizing data. In short, things I am very passionate about.
Let me start with whys before I get into the whats and hows.
I’ve been listening to rock and metal music since I was 14. I have been shaping my taste for music ever since, but it always had to be loud and heavy. Or ZZ Top.
This is where Brutal Assault comes to play.
It’s one of the oldest, largest and most popular metal festivals in Central and Eastern Europe. Since 2007 its venue is the area surrounding an 18th century fortress, its walls and casemates, making for an immersive 4-day metal event.
However, despite all its positives, their website has been a painful experience if you are keen to discover something new. Here are some of the reasons why:
The bands are not organized by genre or any other logic, they’re listed alphabetically.
There are multiple updates throughout the year, so you have to go through the list each time, spot the new bands and add them to your list.
After the festival, the list disappears forever, and a new line-up for the next year takes its place.
If you want to make your own list, you have to click through 100-odd images to copy the name, genre, country, and description. This is not too difficult, but it is definitely boring and error-prone. Also, things like video URL would take even more clicks to get to.
To tackle this, I decided to try and automate the process.
The plan was simple. First, scrape the website, save it to a CSV using Python and then create a simple app in Tableau.
This became possible with Tableau 2019.2 beta, introducing parameter actions. I used Jonathan Drummey’s thorough post on parameter actions as both inspiration and an opportunity for a blatant copy-paste when it comes to the functionality to add or remove bands from a list, so, thank you Jonathan!
I had to figure out a way around another issue that frustrated me before, namely working on multiple devices. GitHub is a no-brainer for that. I had experience with it before, but most of the time I used the website interface. This time I went with the command line using Git on my machine. I also decided to use virtual environments in Python, to follow best practices to manage dependencies. It was a good practice and taught me a lot.
Eventually, this is where I landed:
I have been reading about web scraping for a while now, but I struggled to organize whatever responses Python threw at me, and thus all my previous projects ended in frustration. I was very happy to be able to get this working this time. What helped was the simple layout of the website, but also the fact that things like loops and lists finally clicked for me.
However, I still struggled with some elements. I’ll cover those in detail below, including code snippets how I solved them.
The full code, including the tableau workbook is in my GitHub repo here. The app has been published to Tableau Public and is available below.
First, libraries I used in this script. Platform is used to determine which device I am on, so that the save to path is updated. Pandas creates and works with the data frames, re is to do some regex cleanup. Requests and BeautifulSoup are what this script is all about – get a URL, and parse whichever parts of the html you want. Then make the soup to start things off.
import requests from bs4 import BeautifulSoup import re import pandas as pd import platform link = 'https://www.brutalassault.cz/en/line-up' url = requests.get(link).text soup = BeautifulSoup(url, 'html.parser')
Getting a list of bands was simple, all available in the /line-up page.
bands = soup.findAll('strong',class_='band_lineup_title') bands_list =  for band in bands: bands_list.append(band.text)
Genres, however, included tab and newline characters before and after the genre string.
['\n\t\t\t\t\t\t DEATH METAL\n\t\t\t\t\t', '\n\t\t\t\t\t\tMETAL/HARDCORE/PROGRESSIVE\n\t\t\t\t\t', '\n\t\t\t\t\t\tNY/HC\n\t\t\t\t\t', '\n\t\t\t\t\t\tMAORI THRASH\n\t\t\t\t\t', '\n\t\t\t\t\t\tINDUSTRIAL BLACK GRIND\n\t\t\t\t\t', '\n\t\t\t\t\t\tALTERNATIVE ROCK\n\t\t\t\t\t', '\n\t\t\t\t\t\tPROGRESSIVE METAL \n\t\t\t\t\t', '\n\t\t\t\t\t\tBLACK METAL\n\t\t\t\t\t', '\n\t\t\t\t\t\tTHRASH METAL\n\t\t\t\t\t', '\n\t\t\t\t\t\tPOWERVIOLENCE/GRIND/SLUDGE \n\t\t\t\t\t', '\n\t\t\t\t\t\tPOST-BLACK METAL\n\t\t\t\t\t', '\n\t\t\t\t\t\tPROGRESSIVE METAL\n\t\t\t\t\t', ...]
A simple regex formula took care of that. Another option would be to get the list of genres from each band’s site.
for genre in genre_list: genre = re.sub('[^A-Za-z]+',' ',genre).strip() genre_list_clean.append(genre)
In order to get more details about each band, I had to get a list of URLs where each image pointed.
band_link = soup.findAll(class_='lineup_band_link') band_url =  for band in band_link: bands = band['href'] band_url.append(bands)
Then loop through each URL to get country, website, and video URL. Don’t mind the variable names, I am very bad at this.
band_country =  band_website =  band_videourl =  band_description =  band_image =  #open each band link and scrape country, website, video URL etc. for link in band_url: temp_url = requests.get(link).text temp_soup = BeautifulSoup(temp_url, 'html.parser') temp_band_country = temp_soup.find('h5').text clean_band_country = temp_band_country[-(len(temp_band_country)-temp_band_country.find(':')-2):] band_country.append(clean_band_country) temp_band_website = temp_soup.find('p',class_='officialWebiste').find('a')['href'] band_website.append(temp_band_website) band_text = temp_soup.find('div',class_='page_content').text #remove everything before official website, inclusive charpos = band_text.find('official website') band_clean_text = band_text[charpos+17:] band_description.append(band_clean_text) try: band_videourl.append(temp_soup.find('iframe')['src'][2:]) except: band_videourl.append('No Video') band_image_url = temp_soup.find('div',class_='band_image').img['src'] band_image.append(band_image_url)
- Country: had to parse out “Country: ” from each entry, and could not get this working with split, so used string length and *advanced* math.
- Website: nested find – first ‘p’ anchor, then ‘a’ within it. Note there’s a typo (line 7) in “officialWebiste”. If the admin fixes this i am screwed!
- Description: you get a lot of special characters before the actual text begins:
'\nCountry: Australia\nGenre: DEATHCORE\n official website\n\t\t\t\t\t\tRising from the unhallowed wastes of Western Sydney, THY ART IS MURDER have successfully ... '
I ended up looking for “official website” and removed a fixed amount of characters from the beginning of that string. Very unstable, and RegEx would definitely do a better job.
- Video: I added a try-except clause, because not every band had a video embedded to their profile. I could probably replace this with a Google search URL with the name of the band instead.
- Image: also a nested find, same as website. Fairly straightforward.
All that was left at this time was to add all these lists together into a pandas data frame and save as CSV.
table_ba = pd.DataFrame(list(zip(bands_list,genre_list_clean,band_country,band_website,band_url, band_description,band_videourl, band_image)),columns=['Band Name','Genre','Country','Band Website','BA URL','Description','Video URL','Image']) table_ba.to_csv(path+'brutal_assault_2019_kapely.csv',index=False)
I mentioned the platform library at the beginning. The “path” variable is merely looking at what computer I am working on (based on PC name) and returning a path depending on that. This to avoid always having to rewrite the path manually or be forced to use the same path string.
Once I was done I decided to wrap the script in a function and duplicate it with slight changes to scrape the Czech version of the website. I added a simple language selection menu at the beginning for that.
while True: try: user_input = int(input('Language Menu:\n 1 - English \n 2 - Czech \n Your Choice: ')) if user_input > 2: print('\n please select 1 or 2 \n') continue else: break except ValueError: print('\n please select the number 1 or 2 \n') continue else: break
The CSV is then used as a data source for the Band Selector Tableau dashboard/app. Here, you can read briefly about the festival’s history, attendance (which was a manual effort, using old online articles and with any of the numbers missing), and explore bands by genre.
I had to manually group genres, because 100+ bands were split into 70+ (!!) genres.
When you get to the bands, you can:
watch the video (or get a YouTube search result if there was none on the website),
visit the website or Facebook page – sadly too many bands have just the latter,
and finally, add the band to your list.
Now that this is all completed, I see that the website was updated with additional bands (literally today, May 30), and all I have to do in order to update the dashboard is run the script. Actually, I’ve done it before publishing this post; it took under 2 minutes to run. The only caveat is that new bands usually means new genres that need to be manually added to groups in the dashboard.
To close this, I wanted to list some ways I’ve thought of to improve the workflow.
Create a script to determine if there are new bands added to the line-up. If there are, re-run the scraping script.
Use an online storage (Google Drive, Dropbox, OneDrive, S3?) to store the CSV and point Tableau to that file to automatically refresh.
Find a clever way to automatically group genres, for example by using key words (doom, death, black, heavy, core, etc.).
I know this was a long post about what many would find an easy task in Python, however, there are aspiring beginners coming up every day and maybe someone will find this useful. And it helped me organize my thoughts and revise the concepts once more.