Stacked Area Plots of Mean MLS Attendance Figures for Each Team (1996-2016)

Always looking for new ways to present data, I found this article about 7 data visualization types I "should be using more". This was the article that inspired me to try sunburst plots, and I was eager to try another of these 7 recommendations. I decided to try my hand at making a streamgraph.

I had seen streamgraphs before, and I quickly found some information online about how to make them using R. However, I was intimidated by the code examples I found as I didn't understand their structure and what was going into them. I returned to the article mentioned above and was reminded about this example of what was also touted as a streamgraph of music genres. I had actually seen this figure before, and after some rumination on the matter, I figured out a way to make it using some MLS data. I found this table of average MLS season attendances from 1996 to 2016 by team and set about trying to make a figure similar to the music genre example, at least in appearance rather than interactivity. Here is the result:

[Figure: proportional stacked area plot of mean MLS attendance by team, 1996-2016]

It took a while--some might say longer than it should have--for me to put together all of the pieces of this particular dataviz puzzle, but I'm thrilled with the result. This figure is a series of area plots using geom_area. Here are the initial lines of code I used within ggplot2:

> ggplot(data = Datt, aes(x = Season, y = value, fill = variable)) +
+   geom_area(color = "gray")

The default position argument in geom_area is "stack", so stopping at this point gives the following figure:

[Figure: stacked area plot of mean MLS attendance by team, 1996-2016]

This is a stacked area plot. It uses a fixed baseline along the x-axis and stacks the area for each variable (the MLS teams) on top of the others, so the height of each band is that team's actual mean attendance in the dataset.

I think this stacked area plot is pretty cool, but it wasn't like the music genre streamgraph. To make it look more like the music example, I changed the "position" argument within geom_area from its default to "fill". Each area now represents a team's share of total MLS attendance for each year, so the y-axis changes from number of people to a proportion of the total, while the baseline along the x-axis stays fixed.
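To make the idea of "share of the total" concrete, here is a minimal sketch of the arithmetic that position = "fill" effectively performs for each season. The data frame toy and its numbers are made up for illustration; only the column names mirror Datt.

```r
# Hypothetical long-format data shaped like Datt; all values are made up.
toy <- data.frame(
  Season   = rep(c(2015, 2016), each = 2),
  variable = rep(c("A", "B"), times = 2),
  value    = c(10000, 30000, 20000, 20000)
)

# Divide each value by its season's total to get that team's share.
# ave() returns the per-season sum repeated for every row in that season.
toy$share <- toy$value / ave(toy$value, toy$Season, FUN = sum)

print(toy)
```

The shares within each season add up to 1, which is why the top edge of a "fill" plot is flat. If percent labels are wanted on the axis, adding scale_y_continuous(labels = scales::percent) to the plot is one common way to get them.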

Naively, I really thought I'd done something novel and clever with my proportional stacked area plot, but I have since discovered other examples.

However (and realizing that I am very much a novice when it comes to these matters), I don't think that this form of data visualization is truly a streamgraph. The name and most other examples I found imply to me that there should be some kind of visual flow to the entire image, including the outer borders of the plotted data. Stacked area plots with a fixed baseline are closer to this appearance of flow than proportional stacked area plots are, but I still cannot figure out what baseline true streamgraphs are plotted around. Without that understanding, the flow of the resulting figure is very limited. Still, these are visually interesting figures, and I'm sure I'll be able to apply similar techniques to other data soon.

Speaking of techniques, a few more words are warranted on the code I used in this project. Some of the pieces of this puzzle came from the code I used to create my facet plots of F1 drivers. I gave short shrift to the melt function from the reshape2 package in that post, but it was critical both there with the F1 data and here with the MLS data. melt takes the values in multiple columns and turns them into rows, pairing each value with the header of the column it came from and with the value of whichever column one sets as the "ID variable". The result is three columns: the ID column, variable, and value. For example, here is what happened to the MLS table I used when I subjected it to melt using "Season" as the ID column:

> head(Datt)
  Season variable value
1   1996      CHI     0
2   1997      CHI     0
3   1998      CHI 17886
4   1999      CHI 16016
5   2000      CHI 13387
6   2001      CHI 16388
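As a rough sketch of that reshaping step, the snippet below applies melt to a tiny wide table. The table wide is hypothetical: the CHI values are copied from the output above, while TEAM_B and its numbers are invented purely to show a second column being folded in.

```r
library(reshape2)

# Hypothetical wide table: one row per season, one column per team.
# CHI values come from the head(Datt) output above; TEAM_B is made up.
wide <- data.frame(
  Season = c(1996, 1997, 1998),
  CHI    = c(0, 0, 17886),
  TEAM_B = c(10000, 12000, 11000)
)

# Fold the team columns into variable/value pairs, keyed by Season.
long <- melt(wide, id.vars = "Season")

print(long)
```

Each team column contributes one row per season, so a table with 3 seasons and 2 teams becomes 6 rows with the columns Season, variable, and value.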

I then used these columns in the ggplot function like so:

> ggplot(data = Datt, aes(x = Season, y = value, fill = variable)) +
+   geom_area(position = "fill", color = "gray") +
+   scale_fill_manual("Teams", values = MLSpal)

What is amazing to me about this is that even though the values in "y" come from all of the teams and all of the years, ggplot makes sense of it all once I tell it that fill is based on whatever is in the variable column for each of those values. Marvelous.
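For anyone who wants to try this without the MLS table, here is a small self-contained sketch of the same call. The data frame toy, its two "teams", and the two colors are all invented for illustration; only the structure follows the code above.

```r
library(ggplot2)

# Hypothetical long-format data shaped like Datt; all values are made up.
toy <- data.frame(
  Season   = rep(2014:2016, times = 2),
  variable = rep(c("A", "B"), each = 3),
  value    = c(10000, 15000, 20000, 30000, 25000, 20000)
)

# Mapping fill to variable also splits the rows into one group per team,
# so each team is drawn as its own area.
p <- ggplot(data = toy, aes(x = Season, y = value, fill = variable)) +
  geom_area(position = "fill", color = "gray") +
  scale_fill_manual("Teams", values = c(A = "steelblue", B = "tomato"))

print(p)
```

Switching position = "fill" back to the default "stack" in this sketch gives the plain stacked version of the same data.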

Lastly, colors. In the ggplot call above, MLSpal is the palette I created using the colorRampPalette function, which ships with base R (in grDevices) and is commonly paired with palettes from the RColorBrewer package. I added color="gray" to geom_area to draw outlines that make the divisions between the areas more visible, given the number of teams and the resulting similarity of some of the fill colors.
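A palette like this can be sketched as follows. Note the assumptions: the post does not say which base palette or how many colors went into MLSpal, so the "Set1" palette, the count of 20, and the name MLSpal_sketch are all placeholders.

```r
library(RColorBrewer)

# Assumption: 20 teams and the "Set1" palette; neither is stated in the post.
n_teams <- 20

# colorRampPalette() (base R, grDevices) returns a function that
# interpolates between the colors it is given.
make_pal <- colorRampPalette(brewer.pal(9, "Set1"))

# Ask that function for as many colors as there are teams.
MLSpal_sketch <- make_pal(n_teams)

print(MLSpal_sketch)
```

Because the extra colors are interpolated between the base colors, neighbors in the palette can look alike, which is where the gray outlines help.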

