US Youth Soccer Registration and Current MLS Players by State

After my recent map-making misadventure, I returned to my dataset in an effort to clean it up, add to it, and otherwise bend it to my will. My goal was to produce a map of the United States with gradient shading of each state. In particular, I wanted to use this approach to present data related to MLS player production by state in reference to the number of youth soccer players in each state. The supposition here is that the ratio of the number of MLS players from a state to the number of youth soccer players in that state is an indicator of the relative quality of that state's youth soccer, as MLS players from that state would presumably have learned their skills playing youth soccer in that state.

Using my 2017 MLS player dataset along with the youth player registration numbers that are available from US Youth Soccer, I produced ratios of the number of current MLS players from each state (based on the location of each player's hometown) to the number of youth soccer players registered with US Youth Soccer in that state. I was then able to use map_data from ggplot2 to produce the following map:
The more purple the state, the higher that state's ratio of current MLS players to registered US Youth Soccer players. The gray/missing states are those states without a current MLS player (i.e., a ratio of 0).
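
For anyone who wants to see the arithmetic behind the shading, here is a minimal sketch of the ratio calculation. The three states and all of their counts are made-up placeholders, not the real MLS or US Youth Soccer figures, and the column names mls_players, youth_regs, and per_100k are my own inventions for illustration (States and ratio match the names used in the code further down).

```r
# Placeholder counts for three made-up states; NOT the real figures.
counts <- data.frame(
  States      = c("State A", "State B", "State C"),
  mls_players = c(12, 3, 0),
  youth_regs  = c(150000, 60000, 40000)
)

# Ratio of current MLS players to registered youth players.
counts$ratio <- counts$mls_players / counts$youth_regs

# The raw ratios are tiny, so a per-100,000 scale is easier to read.
counts$per_100k <- counts$ratio * 1e5

# Rank the states from highest to lowest ratio.
counts <- counts[order(-counts$ratio), ]
print(counts)
```

A state with no current MLS players, like State C here, ends up with a ratio of 0, which corresponds to the gray states on the map.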

A couple of things stand out to me from this map. One is that not only does California account for the most US hometown players currently in MLS (67 of 312), California also provides the highest proportion of current MLS players relative to its US Youth Soccer registration numbers. The former finding is not surprising, given that California is by far the most populous state in the Union, but the latter finding is impressive: California not only has more players than any other state, it also produces proportionally more MLS-quality players than any other state.

The other finding that caught my attention is that Missouri is very near the top of the table when it comes to quality player production. As indicated by their purple shading, both Missouri and Oregon are very close to California's ratio of MLS players to youth players. So not only is St. Louis, Missouri, the hometown of American soccer, the quality of the players produced through youth soccer in Missouri would appear to be amongst the highest in the country.

However, there are a few issues that limit the strength of any conclusions that can be drawn from this project. One is that the US Youth Soccer registration numbers likely do not include all youth soccer players in the various states. I still can't figure out the relationship between US Youth Soccer, which is the youth affiliate of US Soccer, and AYSO, or whether the data provided by US Youth Soccer includes all registered youth soccer players regardless of affiliation. Even so, I would still expect the US Youth Soccer registration numbers to at least be proportionally representative of the total number of youth players registered in each state, even if AYSO and other organizations, or even players who only play high school soccer, are not included in the numbers I used.

One thing I do know about the US Youth Soccer numbers is that they include both boys and girls. Overall, in 2008 (the last year for which the boys:girls split is reported), 52% of players were boys and 48% of players were girls. However, the state numbers are not broken down into girls and boys, so the denominator in my ratios includes both males and females while the numerator includes only males. Again, though, I think it unlikely that states have wildly disparate proportions of girls and boys registered as players with US Youth Soccer. Nevertheless, the possibility remains that states with outlier boys:girls splits could skew the data I used to make this map.

Another concern is that the numbers of MLS players from most states are relatively small and thus more susceptible to the effects of players, such as Christian Pulisic, who ply their trade in foreign leagues of similar or higher quality than MLS. Such players are by definition not included in the MLS player totals but are clearly quality players who likely grew up playing youth soccer in their home states prior to moving abroad. States that were the homes of these players will not see this reflected in the ratios used to shade the map above, causing their ratios to be smaller relative to other states without hometown players abroad.

Yet another issue to mention is that there is a chronological discrepancy that I cannot overcome. The current MLS season is 2017, while the US Youth Soccer registration numbers are from 2014, which is the most recent year that is available on the website. But does this even matter? The real chronology problem seems to me to be the fact that virtually all of the players in MLS now are too old to play for youth soccer teams and have been for many years. Which year of US Youth Soccer would be more appropriate to compute these ratios? I don't have an answer for that, and 2014 seems as good as any for this exercise.

Enough of those caveats. What about the R code? The idea for this project came out of this R-bloggers example of using map_data to create gradient maps that shaded each state. I was intrigued by this function's capacity to produce maps of all of the Lower 48 (unlike my past struggles with get_googlemap). While the page with the example was inspiring, it was the link to the R code on GitHub that was more valuable. Even then, it was the comments section that proved the most valuable, both for the R code that actually worked for me when creating the map (from amandamasonsingh) and for the solution to the problem that turned my first attempt at map making with map_data into a shattered mess (from PReineke). In particular, it was the latter's advice to use order on the $order column in my merged dataset that did the trick, turning my shattered map into one that was correctly shaded. Here is the code I used for this part of the project:

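> library(ggplot2)  # provides map_data(); the maps package must also be installed
> # Polygon outlines for the Lower 48, one row per boundary point
> all_states <- map_data("state")
> # map_data() names states in lowercase (e.g. "missouri"), so States must match
> stateMerged$region <- stateMerged$States
> statesTotal <- merge(all_states, stateMerged, by = "region")
> statesTotal$ratio <- as.numeric(statesTotal$ratio)
> # merge() changes the row order, so restore the drawing order of the points
> statesTotal <- statesTotal[order(statesTotal$order), ]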
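
To convince myself of what that last line is doing, here is a toy sketch that needs nothing beyond base R. The two square "states" (alpha and beta) and their ratio values are invented stand-ins for the real map_data output and my real ratios. As far as I understand it, merge() makes no promise to keep rows in their input order, and a polygon drawn from out-of-order points is what produces the shattered look. This toy only shows the row order changing and being restored.

```r
# Toy stand-in for map_data("state"): two square "states", four corners each.
# The order column records the sequence in which the points should be drawn.
outlines <- data.frame(
  region = rep(c("beta", "alpha"), each = 4),
  long   = c(0, 1, 1, 0,  2, 3, 3, 2),
  lat    = c(0, 0, 1, 1,  0, 0, 1, 1),
  order  = 1:8
)

# Hypothetical per-state values to attach to the outlines.
ratios <- data.frame(region = c("alpha", "beta"), ratio = c(0.5, 1.5))

# merge() sorts its result by the key column, so rows no longer follow `order`.
merged <- merge(outlines, ratios, by = "region")
print(merged$order)

# Sorting on the order column restores the drawing sequence.
fixed <- merged[order(merged$order), ]
print(fixed$order)
```

With the rows back in sequence, a data frame like this can be handed to ggplot2's geom_polygon() with long, lat, and group as the x, y, and group aesthetics and ratio as the fill.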

I'm thrilled with how this turned out, but there is so much about the code and what exactly it is doing that I need to understand. So much to learn!
