Scrape Metal: How I built Brutal Assault 2019 Bænd Selectør Using Python and Tableau

I know, I know.

This is my *second* post in as many years. Who would’ve thought a sequence of life-changing events would put a blog – a blog! – on hold, huh.

Now that the awkward part is behind me:

In this post I want to talk about a small project I’ve been carrying in my head for a while now. It is about metal music, coding and scraping, and not least of all, visualizing data. In short, things I am very passionate about.

Let me start with whys before I get into the whats and hows.

I’ve been listening to rock and metal music since I was 14. I have been shaping my taste for music ever since, but it always had to be loud and heavy. Or ZZ Top.

This is where Brutal Assault comes into play.

It’s one of the oldest, largest and most popular metal festivals in Central and Eastern Europe. Since 2007, its venue has been the area in and around an 18th-century fortress, its walls and casemates, making for an immersive four-day metal event.

However, despite everything the festival gets right, its website is a painful experience if you are keen to discover something new. Here are some of the reasons why:

The bands are not organized by genre or any other logic; they are simply listed alphabetically.

There are multiple updates throughout the year, so you have to go through the list each time, spot the new bands and add them to your list.

After the festival, the list disappears forever, and a new line-up for the next year takes its place.

If you want to make your own list, you have to click through 100-odd images to copy each band’s name, genre, country, and description. This is not too difficult, but it is definitely boring and error-prone. Also, things like the video URL take even more clicks to get to.

To tackle this, I decided to try and automate the process.

The plan was simple: first scrape the website and save the data to a CSV using Python, then create a simple app in Tableau.

This became possible with Tableau 2019.2 beta, introducing parameter actions. I used Jonathan Drummey’s thorough post on parameter actions as both inspiration and an opportunity for a blatant copy-paste when it comes to the functionality to add or remove bands from a list, so, thank you Jonathan!

I also had to figure out a way around another issue that had frustrated me before: working on multiple devices. GitHub is a no-brainer for that. I had experience with it, but most of the time I had used the website interface; this time I went with Git on the command line on my machine. On top of that, I decided to use Python virtual environments to manage dependencies and follow best practice. It taught me a lot.

Eventually, this is where I landed:

Solution diagram. Like I said. Simple. Created using draw.io

I have been reading about web scraping for a while now, but I struggled to organize whatever responses Python threw at me, and thus all my previous projects ended in frustration. I was very happy to be able to get this working this time. What helped was the simple layout of the website, but also the fact that things like loops and lists finally clicked for me.

However, I still struggled with some elements. I’ll cover those in detail below, including code snippets showing how I solved them.

The full code, including the Tableau workbook, is in my GitHub repo here. The app has been published to Tableau Public and is available below.

First, the libraries I used in this script: platform determines which device I am on, so that the save path is updated accordingly; pandas creates and works with the data frames; re does some regex cleanup. requests and BeautifulSoup are what this script is all about – fetch a URL and parse whichever parts of the HTML you want. Then make the soup to start things off.

import requests
from bs4 import BeautifulSoup
import re
import pandas as pd
import platform

link = 'https://www.brutalassault.cz/en/line-up'
url = requests.get(link).text
soup = BeautifulSoup(url, 'html.parser')

Getting the list of bands was simple – they are all available on the /line-up page.

bands = soup.findAll('strong',class_='band_lineup_title')

bands_list = []

for band in bands:
  bands_list.append(band.text)

Genres, however, included tab and newline characters before and after the genre string.
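
For completeness, the genre scrape itself would look much like the bands one – a minimal sketch, assuming the genre text sits in an element with a class such as 'band_lineup_genre' (that class name is my guess, so check the page markup):

#assumed class name – verify against the actual /line-up markup
genres = soup.findAll('span', class_='band_lineup_genre')
genre_list = [genre.text for genre in genres]

Printed out, genre_list looked like this: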

['\n\t\t\t\t\t\t DEATH METAL\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tMETAL/HARDCORE/PROGRESSIVE\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tNY/HC\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tMAORI THRASH\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tINDUSTRIAL BLACK GRIND\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tALTERNATIVE ROCK\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tPROGRESSIVE METAL \n\t\t\t\t\t',
 '\n\t\t\t\t\t\tBLACK METAL\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tTHRASH METAL\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tPOWERVIOLENCE/GRIND/SLUDGE \n\t\t\t\t\t',
 '\n\t\t\t\t\t\tPOST-BLACK METAL\n\t\t\t\t\t',
 '\n\t\t\t\t\t\tPROGRESSIVE METAL\n\t\t\t\t\t',
...]

A simple regex substitution took care of that. Another option would be to get the genre from each band’s own page.

genre_list_clean = []

for genre in genre_list:
  genre = re.sub('[^A-Za-z]+',' ',genre).strip() #replace runs of non-letters with a space, then trim
  genre_list_clean.append(genre)

In order to get more details about each band, I had to collect the URLs that each band image pointed to.

band_link = soup.findAll(class_='lineup_band_link')

band_url = []
for band in band_link:
  bands = band['href']
  band_url.append(bands)

Then loop through each URL to get the country, website, and video URL. Don’t mind the variable names; I am very bad at this.

band_country = []
band_website = []
band_videourl = []
band_description = []
band_image = []

#open each band link and scrape country, website, video URL etc.
for link in band_url:
  temp_url = requests.get(link).text
  temp_soup = BeautifulSoup(temp_url, 'html.parser')
  temp_band_country = temp_soup.find('h5').text
  clean_band_country = temp_band_country[-(len(temp_band_country)-temp_band_country.find(':')-2):] #keep everything after 'Country: '
  band_country.append(clean_band_country)
  temp_band_website = temp_soup.find('p',class_='officialWebiste').find('a')['href']
  band_website.append(temp_band_website)
  band_text = temp_soup.find('div',class_='page_content').text
  #remove everything before official website, inclusive
  charpos = band_text.find('official website')
  band_clean_text = band_text[charpos+17:]
  band_description.append(band_clean_text)
  try:
    band_videourl.append(temp_soup.find('iframe')['src'][2:])
  except (TypeError, KeyError):
    #no iframe on the page, or no src attribute on it
    band_videourl.append('No Video')
  band_image_url = temp_soup.find('div',class_='band_image').img['src']
  band_image.append(band_image_url)
1. Country: I had to parse out “Country: ” from each entry, and could not get this working with split, so I used string length and *advanced* math.
2. Website: a nested find – first the ‘p’ element, then the ‘a’ anchor within it. Note the typo in the site’s class name, “officialWebiste” – if the admin ever fixes it, I am screwed!
3. Description: you get a lot of special characters before the actual text begins:

   '\nCountry: Australia\nGenre: DEATHCORE\n official website\n\t\t\t\t\t\tRising from the unhallowed wastes of Western Sydney, THY ART IS MURDER have successfully ... '

   I ended up looking for “official website” and slicing off a fixed number of characters from the start of that string. Very fragile – regex would definitely do a better job.
4. Video: I added a try-except clause, because not every band had a video embedded on their profile. I could probably replace this with a Google search URL using the band’s name instead.
5. Image: also a nested find, same as the website. Fairly straightforward.
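
For what it’s worth, both workarounds have simpler alternatives – a sketch, not what the published script actually uses:

#split on the first colon and trim – equivalent to the length arithmetic above
clean_band_country = temp_band_country.split(':', 1)[1].strip()

#drop everything up to and including 'official website' without counting characters
band_clean_text = re.sub(r'^.*?official website\s*', '', band_text, flags=re.DOTALL)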

All that was left at this time was to add all these lists together into a pandas data frame and save as CSV.

table_ba = pd.DataFrame(
  list(zip(bands_list, genre_list_clean, band_country, band_website, band_url,
           band_description, band_videourl, band_image)),
  columns=['Band Name', 'Genre', 'Country', 'Band Website', 'BA URL',
           'Description', 'Video URL', 'Image'])

table_ba.to_csv(path+'brutal_assault_2019_kapely.csv',index=False)

I mentioned the platform library at the beginning. The path variable merely looks at which computer I am working on (based on the PC name) and returns a path accordingly. This avoids having to rewrite the path manually or being forced to use the same path string everywhere.
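
A minimal sketch of what that check could look like – the machine names and paths are placeholders, not the ones from my actual script:

#platform.node() returns this machine's network name
if platform.node() == 'DESKTOP-HOME':
  path = 'C:/Users/me/brutal_assault/'
else:
  path = 'D:/projects/brutal_assault/'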

Once I was done, I decided to wrap the script in a function and duplicate it, with slight changes, to scrape the Czech version of the website. For that, I added a simple language selection menu at the beginning.

while True:
  try:
    user_input = int(input('Language Menu:\n 1 - English \n 2 - Czech \n Your Choice: '))
    if user_input not in (1, 2):
      print('\n please select 1 or 2 \n')
      continue
    break
  except ValueError:
    print('\n please select the number 1 or 2 \n')
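
From there, the chosen option just decides which version of the wrapped scraper runs – roughly like this (the scrape_lineup name and the Czech URL path are my assumptions, not verbatim from the repo):

#hypothetical dispatch – function name and Czech URL are assumptions
if user_input == 1:
  scrape_lineup('https://www.brutalassault.cz/en/line-up', 'brutal_assault_2019_bands.csv')
else:
  scrape_lineup('https://www.brutalassault.cz/cs/line-up', 'brutal_assault_2019_kapely.csv')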

The CSV is then used as the data source for the Band Selector Tableau dashboard/app. Here, you can read briefly about the festival’s history and attendance (a manual effort using old online articles, with some of the numbers missing), and explore bands by genre.

I had to manually group genres, because 100+ bands were split into 70+ (!!) genres.

When you get to the bands, you can:

watch the video (or get a YouTube search result if there was none on the website),

visit the website or Facebook page – sadly too many bands have just the latter,

and finally, add the band to your list.

Now that this is all completed, I see that the website was updated with additional bands (literally today, May 30), and all I have to do to update the dashboard is run the script. In fact, I ran it before publishing this post; it took under 2 minutes. The only caveat is that new bands usually mean new genres that have to be manually added to the groups in the dashboard.

To close this, I wanted to list some ways I’ve thought of to improve the workflow.

Create a script to determine if there are new bands added to the line-up. If there are, re-run the scraping script.

Use online storage (Google Drive, Dropbox, OneDrive, S3?) for the CSV and point Tableau at that file so it refreshes automatically.

Find a clever way to automatically group genres, for example by using keywords (doom, death, black, heavy, core, etc.) – see the sketch below.
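
For that last idea, here is a minimal sketch of keyword-based grouping – the keyword-to-group mapping is my own rough cut, not the grouping used in the dashboard:

#first matching keyword wins, so the order of the dict matters
genre_groups = {
  'death': 'Death Metal',
  'black': 'Black Metal',
  'doom': 'Doom/Sludge',
  'sludge': 'Doom/Sludge',
  'thrash': 'Thrash Metal',
  'core': 'Hardcore & Co.',
}

def group_genre(genre):
  for keyword, group in genre_groups.items():
    if keyword in genre.lower():
      return group
  return 'Other'

print(group_genre('INDUSTRIAL BLACK GRIND')) #Black Metal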

I know this was a long post about what many would consider an easy task in Python; however, aspiring beginners come up every day, and maybe someone will find this useful. It also helped me organize my thoughts and revise the concepts once more.

— MtS

Building the NHL Player Stats Dashboard using Python and Alteryx

I’ve been an NHL fan ever since we could get any live NHL broadcast back home. My first memories are of watching the 1994 Stanley Cup Finals with my cousin. I was in awe of what Pavel Bure could do, and of the vintage Canucks jerseys, so Vancouver became my favorite team. Oh, and also ’cause we had to pick our favorites and my cousin was picking first 🤷‍♂️.

Some years I followed it daily, some years not so often, but I never completely stopped. So I was very happy when I saw this tweet:

click to go to GitHub/hockey_scraper

I have been interested in analyzing NHL data for a long time, but it’s not exactly easy to get good, detailed data. That is, if you’re not aware of this API (I was not).

hockey_scraper is a Python library, and I decided to run the script through Alteryx, mostly to learn how to use the Run Command tool. You could just as easily run it in a terminal, and either way you need Python on your machine. I went with Anaconda for Python and used Atom as my text editor. Jupyter Notebook was also quite handy for testing scripts.

This is the setup I ended up using the most:

The Run Command tool plus a Python script that I manually changed depending on my requirements. I also played around with a more complex script, mainly for my own amusement, since everything you need is in the basic pre-defined functions.

Settings For Run Command Tool
Python Script 1 – Custom Date Range
Python Script 2 – Entire Season
A manual way to scrape any required custom period
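
Going from the hockey_scraper README, the two scripts likely boiled down to little more than these calls – a sketch, so double-check the exact signatures against the repo:

import hockey_scraper

#Script 1: custom date range; True also scrapes shift data, CSVs are saved by default
hockey_scraper.scrape_date_range('2019-01-01', '2019-01-31', True)

#Script 2: an entire season
hockey_scraper.scrape_seasons([2018], True)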

What you get is a pair of files for each query or season – one for the game logs, one for the shift data. I needed to combine all of these – easy as 1-2-3 in Alteryx:

Unions for play-by-play and shift files

Now, the game logs record every event that happens during a game, not just goals. Since for this particular dashboard I was interested in point streaks, I had to play around with the data in Alteryx to get it into the right format. During the process it occurred to me that it might also be interesting to get the coordinates of each goal scored in that time period, so I did:

Alteryx workflow to get goal location data and game-by-game data

As you can see, there are a couple more outputs that I’m planning to use in the future.
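
The streak logic itself lived in Alteryx tools, but for anyone without Alteryx, here is a rough pandas sketch of the same idea – the column names are my assumption, not the actual schema:

import pandas as pd

#toy game log: one row per player per game
logs = pd.DataFrame({
  'player': ['Pettersson'] * 5,
  'game_date': pd.to_datetime(['2019-01-02', '2019-01-04', '2019-01-06',
                               '2019-01-08', '2019-01-10']),
  'points': [1, 2, 0, 1, 1],
})

logs = logs.sort_values(['player', 'game_date'])
scored = (logs['points'] > 0).astype(int)
#every pointless game bumps the counter, giving each streak its own id per player
streak_id = (1 - scored).groupby(logs['player']).cumsum()
#running length of the current point streak (resets to 0 on pointless games)
logs['streak_len'] = scored.groupby([logs['player'], streak_id]).cumsum()
print(logs)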

Now all that was left was to build the dashboards! Easy! That is also something I want to tackle on its own in a separate post.

Long story short (for now), I created this Tableau dashboard with three screens:

Users can select the season, team, and player and take a look at how they’ve been performing. Where is my favorite player scoring his goals from? Is he relying on slap shots or tip-ins? And which day of the week should his opponents be extra cautious playing him?

The last screen is a bit special to me. I’ve always been in awe of how much NHL teams change each season, especially compared to European leagues. Almost every season I read that a roster has to stay consistent for the team to gel and be successful. I was curious to see whether that held true in the NHL over the past decade. See for yourself how the rosters of the most recent Stanley Cup champions looked a year or two before their triumph.

I already have another blog post coming, this time about the creation process I went through.

Let me know what you think of the dashboards!

— MtS