
Sankey Diagram - High School Sports

what i did

SCHS Women's Soccer Rosters: 2014, 2015, 2016 is an interactive graph that shows how players move through a high school sports program.

why i did

My daughter's high school posted their women's soccer team rosters for the last three years and I was curious how the players flow through the system. Freshmen can play on the Freshman, JV, or Varsity teams. Sophomores and Juniors can play JV or Varsity, and Seniors can only play on Varsity.

I also wanted to see if there were any examples of what the coaches tell the girls when they are cut: "Come back next year; there are plenty of players that were cut then made it onto Varsity."

how i did

This visualization wasn't very difficult. The hardest part, because I find such things tedious, was inputting the data[1]. I created a single CSV file with the following columns: school year, name, class (Senior, Junior), and team (Varsity, JV).

$ head -2 rosters.csv 
2014,Alpha Player,Freshman_Class,Freshman_Team
2014,Beta Player,Sophomore,JV

I imported this into MySQL:

$ mysql schs -e 'select * from rosters limit 2'
| calyear | name              | schoolyear     | team          |
|    2014 | Alpha Player      | Freshman_Class | Freshman_Team |
|    2014 | Beta Player       | Freshman_Class | JV            |
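
As a side note for anyone who would rather skip the database, here is a minimal sketch of reading the same rows into JavaScript objects. It is not what I ran; the function name parseRosters is made up for illustration, the field names are borrowed from the table columns above, and it assumes the file has no quoted fields or embedded commas.

```javascript
// Hypothetical helper: turn roster CSV text into plain objects.
// Assumes four comma-separated fields per line and no quoting.
function parseRosters(csvText) {
  return csvText
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // skip blank lines
    .map((line) => {
      const [calyear, name, schoolyear, team] = line
        .split(",")
        .map((field) => field.trim());
      return { calyear: Number(calyear), name, schoolyear, team };
    });
}

// The two sample lines shown by `head -2 rosters.csv`.
const sampleCsv = [
  "2014,Alpha Player,Freshman_Class,Freshman_Team",
  "2014,Beta Player,Sophomore,JV",
].join("\n");

const rosterRows = parseRosters(sampleCsv);
console.log(rosterRows);
```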

The next step was the part that took the longest. If a player played n seasons, they needed n+2 rows. The first row is for the year before their first season, the last row is for the year after their last season, and in between are the seasons they played. The middle rows were straightforward because that's the data I had. The others I had to derive.

When players came into the program, they were either a New Player or, if I didn't have enough data to know where they came from, TBD. They leave a team by one of three paths: Graduate, No Team, or, because the season's not over yet, TBD.

This derived data needed two more columns: a row number and the total number of rows for each player.

$ mysql schs -Ee 'select * from fuller limit 1'
    season: 1
   seasons: 1
   calyear: 2014
      name: Alpha Player
schoolyear: Freshman_Class
      team: Freshman_Team
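
The n+2 expansion described above can be sketched in JavaScript roughly as follows. This is not the SQL I used. The entry and exit rules are assumptions spelled out in the comments, the names expandPlayer, rowNum and rowCount are made up for illustration, and the 2015 season for Alpha Player is invented so the example has more than one played season.

```javascript
// Hypothetical sketch: expand one player's n seasons into n+2 rows.
// Assumed rules (not taken from the original queries):
//   entry: "TBD" if the player is already past freshman year in the first
//          year of data, otherwise "New Player"
//   exit:  "TBD" if the last season is the latest year of data,
//          "Graduate" if the player was a Senior, otherwise "No Team"
function expandPlayer(seasons, firstDataYear, lastDataYear) {
  const played = [...seasons].sort((a, b) => a.calyear - b.calyear);
  const first = played[0];
  const last = played[played.length - 1];

  const entryState =
    first.calyear === firstDataYear && first.schoolyear !== "Freshman_Class"
      ? "TBD"
      : "New Player";

  let exitState;
  if (last.calyear === lastDataYear) exitState = "TBD";
  else if (last.schoolyear === "Senior") exitState = "Graduate";
  else exitState = "No Team";

  const rows = [
    { calyear: first.calyear - 1, name: first.name, schoolyear: null, team: entryState },
    ...played,
    { calyear: last.calyear + 1, name: last.name, schoolyear: null, team: exitState },
  ];

  // Add the two bookkeeping columns: row number and total rows.
  return rows.map((row, i) => ({ ...row, rowNum: i + 1, rowCount: rows.length }));
}

const alphaSeasons = [
  { calyear: 2014, name: "Alpha Player", schoolyear: "Freshman_Class", team: "Freshman_Team" },
  { calyear: 2015, name: "Alpha Player", schoolyear: "Sophomore", team: "JV" }, // invented season
];

const expandedRows = expandPlayer(alphaSeasons, 2014, 2016);
console.log(expandedRows.map((r) => `${r.rowNum}/${r.rowCount} ${r.calyear} ${r.team}`));
```

With two played seasons the result has four rows: an entry row for 2013, the two seasons, and an exit row for 2016.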

Once I had this table I exported the data into a JSON file that had a nodes array with each of the teams, and a links array with each of the players in transition from one state (team) to another. I also included a third array so I could consistently sort Freshman, Sophomore, etc.
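
Here is a rough sketch of what building that nodes/links structure could look like. It follows the shape the Sankey demo data uses (links pointing at nodes by array index, each with a value), but the details are assumptions: one node per year-and-team pair, one link of value 1 per player transition, and a classOrder array standing in for the third array. The function name buildGraph and the sample rows are made up for illustration.

```javascript
// Hypothetical sketch: turn expanded rows into { nodes, links, classOrder }.
function buildGraph(rows) {
  const nodes = [];
  const nodeIndex = new Map();

  // Return the index of the node for this row, creating it on first use.
  const indexFor = (row) => {
    const key = `${row.calyear} ${row.team}`;
    if (!nodeIndex.has(key)) {
      nodeIndex.set(key, nodes.length);
      nodes.push({ name: key, year: row.calyear, team: row.team });
    }
    return nodeIndex.get(key);
  };

  // Group rows by player so consecutive years can be linked.
  const byPlayer = new Map();
  for (const row of rows) {
    if (!byPlayer.has(row.name)) byPlayer.set(row.name, []);
    byPlayer.get(row.name).push(row);
  }

  const links = [];
  for (const [name, playerRows] of byPlayer) {
    playerRows.sort((a, b) => a.calyear - b.calyear);
    for (let i = 0; i + 1 < playerRows.length; i++) {
      links.push({
        source: indexFor(playerRows[i]),
        target: indexFor(playerRows[i + 1]),
        value: 1, // one player per link (assumption)
        player: name,
      });
    }
  }

  // Assumed contents of the sort-order array.
  const classOrder = ["Freshman_Class", "Sophomore", "Junior", "Senior"];
  return { nodes, links, classOrder };
}

// Made-up sample rows using the placeholder names from this post.
const sampleRows = [
  { calyear: 2013, name: "Alpha Player", team: "New Player" },
  { calyear: 2014, name: "Alpha Player", team: "Freshman_Team" },
  { calyear: 2015, name: "Alpha Player", team: "TBD" },
  { calyear: 2013, name: "Beta Player", team: "TBD" },
  { calyear: 2014, name: "Beta Player", team: "JV" },
  { calyear: 2015, name: "Beta Player", team: "TBD" },
];

const graph = buildGraph(sampleRows);
console.log(JSON.stringify(graph, null, 2));
```

Both players end at the same "2015 TBD" node, so six rows produce five nodes and four links.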

The presentation is based on the D3 Sankey Diagram demo block from Mike Bostock.

I altered it in a couple of ways. The first is that I hard-coded in the x-value to match the year and forced an initial sort of the y-values for the nodes. I did this because in my graph the x-axis is a time value and we are used to seeing time run from left to right in a linear fashion in graphs like this. I did the y-sort because I thought that it seemed natural to have the normal progression from bottom to top. I also wanted to minimize the amount of "mousing around" required to get the graph into a helpful state.
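
The idea behind those two changes can be sketched without touching sankey.js at all. The snippet below is a stand-alone illustration, not the patch I made: the function name positionNodes, the sortIndex field, the width of 960, the node width of 15, and the top-to-bottom team order are all assumptions.

```javascript
// Hypothetical sketch: pin each node's x to its year and give each
// year-column a fixed vertical order.
function positionNodes(nodes, width, nodeWidth, teamOrderTopToBottom) {
  const years = nodes.map((n) => n.year);
  const minYear = Math.min(...years);
  const maxYear = Math.max(...years);
  const span = Math.max(1, maxYear - minYear); // avoid dividing by zero

  // Teams missing from the order list sort to the bottom.
  const rank = (team) => {
    const i = teamOrderTopToBottom.indexOf(team);
    return i === -1 ? teamOrderTopToBottom.length : i;
  };

  const columns = new Map();
  for (const node of nodes) {
    // Earliest year at the left edge, latest year at the right edge.
    node.x = ((node.year - minYear) / span) * (width - nodeWidth);
    if (!columns.has(node.year)) columns.set(node.year, []);
    columns.get(node.year).push(node);
  }

  for (const column of columns.values()) {
    column.sort((a, b) => rank(a.team) - rank(b.team));
    column.forEach((node, i) => {
      node.sortIndex = i; // 0 is the top of the column
    });
  }
  return nodes;
}

const demoNodes = [
  { name: "2014 JV", year: 2014, team: "JV" },
  { name: "2014 Varsity", year: 2014, team: "Varsity" },
  { name: "2015 Varsity", year: 2015, team: "Varsity" },
  { name: "2016 Graduate", year: 2016, team: "Graduate" },
];

positionNodes(demoNodes, 960, 15, ["Graduate", "Varsity", "JV", "Freshman_Team"]);
console.log(demoNodes.map((n) => `${n.name}: x=${n.x}, sortIndex=${n.sortIndex}`));
```

Because SVG's y-axis grows downward, putting Varsity ahead of JV in the order list is what places the older teams higher on the page.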

I also added the color coding for the links to help show how the different classes progressed at each step. The colors of the nodes and links come from D3's built-in ordinal scales. I find it interesting that No Team came out red. I didn't plan this but I like it.
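
One plausible explanation for the red: D3's ordinal scales hand out palette colors in the order keys are first seen, so whichever key shows up fourth gets the fourth color. The sketch below mimics that behavior in plain JavaScript using the ten colors of D3's category10 palette. Which built-in scale the graph uses, and the order in which the teams reach it, are not recorded here, so the team order is invented to show the effect.

```javascript
// The ten colors of D3's category10 palette.
const CATEGORY10 = [
  "#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd",
  "#8c564b", "#e377c2", "#7f7f7f", "#bcbd22", "#17becf",
];

// Hypothetical stand-in for an ordinal scale: each new key gets the
// next palette color, and repeated keys keep the color they already have.
function makeOrdinalScale(palette) {
  const assigned = new Map();
  return (key) => {
    if (!assigned.has(key)) {
      assigned.set(key, palette[assigned.size % palette.length]);
    }
    return assigned.get(key);
  };
}

const color = makeOrdinalScale(CATEGORY10);

// Invented first-seen order; "No Team" happens to arrive fourth.
["Freshman_Team", "JV", "Varsity", "No Team"].forEach((team) => color(team));

console.log(color("No Team")); // "#d62728", the palette's red
```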

I also created the table showing the team rosters and the cohorts in each link.

my take away

I'm happy with the way this project turned out. I was hoping to see how the different groups came and went through the system and this project helped.

On the technical side, it took me a while to really understand exactly what data I needed as nodes and what was needed for links. Once I wrapped my head around that, customizing everything was trivial. The sankey.js code doesn't offer a whole lot of help in that regard or in troubleshooting. Using MySQL queries was very helpful.

On the non-technical side, things are much fuzzier. The graph doesn't tell the whole story, but it helped. They say "know your data", but when one of the data points is your kid, it's not possible to be objective.

what next

There are a few things I'd really like to add to this:

  • More years of data, hopefully from a text file! I imported this data from photographs[1].
  • It would also be neat to add different sports to see how the players go to and from the different activities.
  • A "nice to have" is a "Comment" field with details about individual players. I'd like this because the "No Team" field can be misleading. Some of the players got cut, but others chose not to play soccer anymore, some moved, ...
  • Another wish list item is to track an individual player from entry to exit. It'd be fairly trivial to put in a search box which highlights matching nodes and links. But not today.



  1. Original roster data:
    2014