Since we’re still in the initial steps of our project, currently we plan to work on data extraction and cleansing together. As we move further along, we plan to implement a division of labor that takes advantage of each team member’s passions and strengths. The exact division has yet to be determined, but for now we will tentatively assign the three main components of our project as follows:
Upon a valid query, e.g. for the Bears from 1994-01-01 to 1995-01-01, the results are listed in table format:
We used Python’s BeautifulSoup library for the scraping here; the source code itself was very short, about 40 lines. Only minimal cleaning has been done so far: the Unicode character for the bullet points still persists in our CSV. This and other unnecessary details will
simply be removed through regular expression matching or by modifying the scraping code. Either way, the cleaning task here should not prove too difficult.
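A minimal sketch of the regular-expression cleanup mentioned above. It assumes the stray character is the Unicode bullet U+2022 and that it appears at the start of a cell; the names `strip_bullet` and `clean_row` and the sample row are hypothetical, not taken from the scraping script.

```python
import re

# Assumption: the stray character is the Unicode bullet U+2022; the real
# scraped files may contain a different code point.
BULLET_PATTERN = re.compile(r"^\s*\u2022\s*")


def strip_bullet(cell: str) -> str:
    """Remove a leading bullet (and surrounding whitespace) from one CSV cell."""
    return BULLET_PATTERN.sub("", cell).strip()


def clean_row(row: list[str]) -> list[str]:
    """Apply strip_bullet to every cell in a scraped row."""
    return [strip_bullet(cell) for cell in row]


# Hypothetical row in the shape of a transaction search result.
print(clean_row(["1994-03-01", "Bears", "\u2022 John Doe", "", "signed free agent"]))
```

Doing the cleanup on the CSV rather than in the scraper means the site does not have to be scraped again if the rule changes.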
Transaction data proves useful for us especially in the context of historical analyses of player trades and drafts. The ‘Notes’ column of the search results may contain pertinent information for predictive analysis, or constructing
visualizations with the notes column as categorical variables, based on different round draft picks or free agent signings.
Our immediate next steps will involve cleaning the AV and transaction data that we have extracted from profootballreference.com and prosportstransaction.com, respectively. This is necessary because we acquired an enormous amount of data from both of these
sources and we need to make sure that it is in a consistent and legible format. Once we clean this data, we plan to move it into a SQL database so that we can query it in order to draw conclusions regarding the importance of different positions
in the NFL and analyze the value that teams have obtained from transactions that have taken place since the salary cap was implemented.
Once we finish this, we need to figure out how to obtain salary data - ideally we’d obtain the salary cap of each team for every year since 1993. As discussed above, we’ve had some trouble finding easily-extractable salary data, but
since we consider this to be an important part of our analysis, our goal is to find a data source that would allow us to obtain the aforementioned salary data through either an API, a CSV file download, or web scraping. One route we could
go, which we plan to explore before the midterm report, is scraping spotrac.com for its extensive salary data. As can be seen below, it has data for the salary cap of each team in the NFL. The one downside is that it only seems to contain
salary cap data since 2011, so we’d have to either find another source for salary cap data before that year or pivot the scope of our project. Once we are able to obtain this data, we will clean it and add it to our SQL database as well -
just as we plan to do with our AV and transaction data.
Once we are done with the aforementioned data warehousing and integration aspects of our project, we hope to begin work on the visualization and machine learning aspects of our project. Although we haven’t delved too deeply into these aspects yet, some ideas we’ve had for each aspect are listed below:
The hardest part of our project thus far has certainly been data extraction. We’ve run into issues with the availability of data and the amount of time that it would take to scrape data from profootballreference. Salary data has been particularly difficult to find. Spotrac has more complete salary data for recent years, but no players have complete salary data for the timespan that we have selected for AV data. Other difficulties arose when trying to scrape player positions and URLs (to serve as IDs) from profootballreference. Our initial approach would have taken about 20 hours to complete given the way that data is organized on the website. The alternative approach that we pursued took only 20 minutes, but required deeper exploration of the site and a collective effort from the team to make it feasible.
Challenges: Some of the greatest challenges that we’ll be facing moving forward will be how we utilize transaction and salary data to gain the most insight from relatively sparse salary data. Fortunately we’ve been able to find near complete salary data for the past 7 years. However, this may affect the type of insights that we want to make using this data.
Concrete results/initial insights: Data cleaning and integration posed plenty of challenges, so we diverted our resources toward working out how we would approach some of the initial questions we proposed in our methodology. In addition to determining the fair value of a player (according to the methodology described in the pre-proposal), one area of interest we turned our attention to is valuing draft picks.
Motivation: Drafting is the edge every NFL team has. In theory, the amount of value every team gets out of its free agency spending is the same, in that there is a salary cap and we can assume each contract is ‘fair.’ Drafting, however, is where teams can get cheap labor - i.e. salary is far less than fair salary. Cheap labor only lasts 4-5 years - the length of the rookie contract. We can assume that afterwards, the price the team pays for the player is equal to the fair salary of the player.
This is why draft picks are worth so much - they increase the talent per dollar spent for a team. In other words, a great draft, in theory, should give the team a boost for the next 4-5 years only. Example: The Patriots were docked their first round pick in 2016. They will be feeling this loss until 2021, when that player in theory would have begun their second contract.
Chandler Jones was traded for a second round pick (plus Jonathan Cooper, but let’s ignore that for simplicity), even though that second round pick would almost certainly not be as good as Chandler Jones (i.e. comparing expected Career AV for that pick with Chandler Jones’ expected Career AV). If we consider this trade from the Patriots' perspective, Chandler Jones had just one year left on his rookie deal - i.e. one more year of cheap talent. After the one year, he would have to be compensated fairly, and a second round pick would have four years left on his rookie deal - four years of cheap talent.
Chandler Jones was paid 7.79 million in 2016. Suppose his fair contract value was 13.79 million. Then Chandler Jones offered an additional 6 million dollars above fair value. Over the four years from that second round pick, would the dollars above fair value sum to over 6 million dollars? We can parlay this into a value system for draft picks.
Assumptions: For example, suppose that, using the above and not including draft picks, the Browns’ total assets are worth 170 million for 2017. Total assets include fair value for all the players, plus free salary cap space. Suppose on average, the first overall pick outperforms his rookie contract by 4 million dollars in Year 1, 5 million dollars in Year 2, 6 million dollars in Year 3, 8 million dollars in Year 4, and 4 million dollars in Year 5. Then by using the first overall pick, we expect the Browns’ total assets to increase by 4 million to 174 million in 2017, and to increase by 5, 6, 8, and 4 million dollars in 2018, 2019, 2020, and 2021, respectively. We can then value the first overall pick at 4 + 5 + 6 + 8 + 4 = 27 million dollars.
Methodology: Contract values are relatively fixed for draft picks in terms of both length and value (i.e. a player drafted Round 2, pick 35 will sign a 4 year deal worth roughly 6.38 million), due to constraints in the CBA (Collective Bargaining Agreement) signed in 2011.
For a given unused draft pick: We know the approximate contract values over each year of the rookie contract for that draft pick, and can approximate the fair salaries for each year of the rookie contract based on data for that pick in previous years. Since the salary cap affects fair contract values, and the salary cap is increasing for the NFL, we may need to adjust our fair salaries for projected salary cap increases. We sum the difference between each year’s fair salary and actual salary to determine the draft pick’s worth.
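The calculation described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `draft_pick_value`, the `cap_growth` parameter (a single compounding annual growth rate), and the sample salaries are all hypothetical. The salaries are chosen so that the yearly surpluses match the 4, 5, 6, 8 and 4 million dollar example for the first overall pick.

```python
def draft_pick_value(fair_salaries, actual_salaries, cap_growth=0.0):
    """Sum (fair - actual) over the rookie contract, in millions of dollars.

    fair_salaries: estimated fair salary per contract year, in today's cap terms.
    actual_salaries: rookie-scale salary per contract year.
    cap_growth: assumed annual salary-cap growth rate used to scale future
        fair salaries (hypothetical parameter; 0.0 means no adjustment).
    """
    if len(fair_salaries) != len(actual_salaries):
        raise ValueError("need one fair and one actual salary per contract year")
    total = 0.0
    for year_index, (fair, actual) in enumerate(zip(fair_salaries, actual_salaries)):
        # Scale the fair salary up by the projected cap growth for that year.
        adjusted_fair = fair * (1.0 + cap_growth) ** year_index
        total += adjusted_fair - actual
    return total


# Hypothetical salaries chosen so the yearly surpluses are 4, 5, 6, 8 and 4.
fair = [9.0, 10.0, 11.0, 13.0, 9.0]
actual = [5.0, 5.0, 5.0, 5.0, 5.0]
print(draft_pick_value(fair, actual))  # 27.0
```

With a non-zero `cap_growth`, later contract years contribute more surplus, which is the adjustment for a rising salary cap mentioned in the methodology.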
We also had to figure out a feasible way to assign player IDs to players. This was especially difficult considering that various players in the NFL between 1993 and now have shared the same name - and might even have played on the same team at the same time. Thus, we decided to look at when a player was drafted in order to differentiate between players with the same name. This necessitated the assumption that no two players with the same name were drafted in the same round with the same pick. Since we were able to get the round and pick a player was drafted in when scraping our AV data, this allowed us to easily create unique player IDs for all players drafted between 1993 and now and to make sure that these same IDs were assigned to the same players in the AV table and the draft table. For drafted players, we decided to simply make their player IDs the year they were drafted combined with the pick they were drafted with.
Still, this meant that our AV dataset contained various undrafted players, and players drafted before 1993, who had no player ID we could assign to them. We handled this by grabbing a unique URL for each player when scraping our AV data. This allowed us to do the following: for each player in the AV dataset that didn’t have a player ID from the draft table, we checked if the unique URL associated with him had a player ID associated with it. If it did, it meant that we had already invented a player ID for that player and we simply assigned it to him. Otherwise, we created a player ID for that player in the same format as the player IDs used for the drafted players, using 0000 as the draft year and incrementing a counter to assign a unique pick to each undrafted player ID. Finally, we had to make sure that these player IDs were assigned to the right players in our salary data.
Unfortunately, we weren’t able to find draft information associated with players in our salary data - so in order to assign the correct player IDs to these players, we had to assume that no two players with the same name played for the same team in the same year with the same position. We were then able to use a dictionary that mapped a tuple consisting of player name, team, year, and position to a player ID in order to assign that player ID to the player in the salary data.
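The ID scheme described above can be sketched roughly as follows. The zero-padded string format, the class name `PlayerIdRegistry`, and the method names are all assumptions made for illustration; the text only states that IDs combine draft year and pick, that undrafted players use 0000 plus a counter, and that salary rows are matched through a dictionary keyed on (name, team, year, position).

```python
from itertools import count


def drafted_player_id(draft_year: int, pick: int) -> str:
    """Combine draft year and pick, e.g. 2016 and 35 -> '2016035'."""
    return f"{draft_year:04d}{pick:03d}"


class PlayerIdRegistry:
    """Hands out IDs for players who have no draft record in the data."""

    def __init__(self):
        self._id_by_url = {}          # player page URL -> player ID
        self._undrafted_counter = count(1)
        self.id_by_key = {}           # (name, team, year, position) -> player ID

    def id_for_undrafted(self, url: str) -> str:
        """Reuse the ID already invented for this URL, or invent a new one."""
        if url not in self._id_by_url:
            # Draft year 0000 marks an ID that does not come from the draft table.
            self._id_by_url[url] = drafted_player_id(0, next(self._undrafted_counter))
        return self._id_by_url[url]

    def register_season(self, name, team, year, position, player_id):
        """Record one AV-table season so salary rows can be matched to it later."""
        self.id_by_key[(name, team, year, position)] = player_id

    def id_for_salary_row(self, name, team, year, position):
        """Look up a salary row; returns None when no AV season matches."""
        return self.id_by_key.get((name, team, year, position))
```

Returning `None` for an unmatched salary row makes the cases where the same-name, same-team, same-year, same-position assumption cannot resolve a player easy to list and inspect by hand.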
This brings us to the final ID we had to figure out - position IDs. There are various positions in football, and these can be broken up into even more specific positions. For instance, the Offensive Line position, which is abbreviated OL, can be broken down into Guard (G) - which can be further broken down into Right Guard (RG) and Left Guard (LG), Tackle (T) - which can be further broken down into Right Tackle (RT) and Left Tackle (LT), and Center (C). Thus, when extracting data from various sources, we had to decide on a consistent way to identify all of these different positions. The final touches to this are still being applied, and we will discuss the decisions we came to in the next blog post. Furthermore, since certain players played various positions throughout their careers, for the sake of simplicity we decided to assign the position that the player started out with in the NFL as their position for their entire career.
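One way to represent the position hierarchy is a parent map, sketched below using only the offensive-line breakdown given above. The names `PARENT_POSITION`, `generalize`, and `career_position` are hypothetical, and the final mapping may differ since those decisions are still being made.

```python
# Parent of each specific position, using only the offensive-line breakdown
# described in the text; every other position would need its own entries.
PARENT_POSITION = {
    "RG": "G",
    "LG": "G",
    "RT": "T",
    "LT": "T",
    "G": "OL",
    "T": "OL",
    "C": "OL",
}


def generalize(position: str, levels: int = 1) -> str:
    """Walk up the hierarchy `levels` steps, stopping at the top if reached."""
    for _ in range(levels):
        if position not in PARENT_POSITION:
            break
        position = PARENT_POSITION[position]
    return position


def career_position(seasons):
    """Return the position from the earliest season.

    seasons: iterable of (year, position) pairs for one player.
    """
    first_year, first_position = min(seasons, key=lambda season: season[0])
    return first_position
```

For example, `generalize("RG", 2)` walks RG to G to OL, so two sources that label the same player at different levels of detail can be compared at a common level.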
Once all of this was figured out, the rest of the database loading was fairly simple - all I had to do was follow the schema, look at the order of the columns in the CSV that we were reading the data in from, and assign the values from the CSVs to the correct columns in each table.
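A minimal sketch of that kind of loading step, using Python's built-in `csv` and `sqlite3` modules. It assumes each CSV has a header row and maps columns by name rather than by position; the helper `load_csv` and the sample rows are hypothetical.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, columns, csv_file):
    """Insert every row of csv_file into table, mapping by header name.

    table and columns are trusted identifiers supplied by the script,
    not user input, so they are interpolated into the SQL directly.
    """
    placeholders = ", ".join("?" for _ in columns)
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    reader = csv.DictReader(csv_file)
    rows = [tuple(record[column] for column in columns) for record in reader]
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE team(id text not null, name text not null, PRIMARY KEY(id))")

# Hypothetical CSV content; the column order differs from the table on purpose.
sample = io.StringIO("name,id\nChicago Bears,CHI\nNew England Patriots,NWE\n")
print(load_csv(conn, "team", ["id", "name"], sample))  # 2
```

Mapping by header name avoids having to keep track of the column order in each CSV, at the cost of requiring a header row.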
In particular, when building an NFL team, we must fundamentally understand the constraints that teams have to work with. The two key constraints are the salary cap and roster size. Teams, then, want to maximize the collective talent of their 53 players, given their budget constraint.
Teams can either pay market value for the player - i.e. through free agency - or they can pay far less than market value. Thus, the value of the draft pick can, in theory, be equal to the added value above what the draft pick is actually paid. See more precise calculations in the actual writeup (see above in "Concrete results/initial insights").
There were three problems with our previous AV Data:
Thus, we had to get a unique ID for each player, write the draft position to the CSV file as a string (easy), and get the position for each player.
For the unique ID, we noticed that every player on Pro-Football-Reference has a unique player page. So we took the URL of their player page when scraping.
Getting the position was much more difficult:
Kevin was a major help in creating functional BeautifulSoup code for this section.
For our visualization we decided to look at the distribution of AV scores across quarterbacks. The script was abstracted so that it is straightforward to replicate for a different position, and we chose to examine a histogram of quarterback AV counts. This visualization is particularly interesting because it allows us to examine the numerical basis of the notion that the quarterback is the most important player for the success of a team. One noteworthy feature of this visualization is that the distribution of QB AVs appears to be roughly trimodal, with peaks at 0, 9 and 12. When we talk about quarterbacks colloquially, we tend to sort them into two classes, ‘elite’ and ‘not elite’, focusing on a player’s ability to meet a certain threshold to move from one to the other. In reality, however, considering a quarterback’s ability in the context of 3 or 4 different classes may be more useful, and may be something that we can do with relative ease by fitting a normal curve to each peak and assessing the probability that a player’s performance is drawn from each of those distributions. We could conduct a similar analysis for other positions to assess how they are stratified in terms of performance. Additionally, we’d like to look at how these distributions change over time to assess the relative importance of each position through the 23-year period that we consider for our data.
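As a rough illustration of a per-position histogram, the sketch below counts player-seasons at each AV value for one position and flags values whose count exceeds both neighbours. The function names and the input shape are assumptions, not the plotting script itself, and this crude peak check on raw counts is only a stand-in for the curve fitting discussed above.

```python
from collections import Counter


def av_histogram(rows, position):
    """Count how many player-seasons at `position` fall on each AV value.

    rows: iterable of (position, av_value) pairs.
    """
    return Counter(av for pos, av in rows if pos == position)


def local_peaks(histogram):
    """Return AV values whose count exceeds both neighbouring AV values."""
    peaks = []
    for av in sorted(histogram):
        here = histogram[av]
        if here > histogram.get(av - 1, 0) and here > histogram.get(av + 1, 0):
            peaks.append(av)
    return peaks
```

Because `position` is an argument, the same two functions can be reused for any other position. On real data the raw counts are noisy, so the histogram would likely need smoothing before peaks are meaningful.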
We feel that we’re making good progress on our project but need to devote some time to formalizing what type of machine learning insights we’d like to make. Specifically, we need to select appropriate features and classifiers for supervised learning tasks and decide whether we would like to pursue any unsupervised learning tasks.
For the second blog post, we worked on cleaning the data, loading it into a database, and tried our hand at machine learning for the first time, creating a predictor for a player’s career AV given the team they were drafted to, their position, and their draft pick.
if first_name == "michael":
    first_name = "mike"
elif first_name == "mike":
    first_name = "michael"
CREATE TABLE player(
    id int not null,
    name text not null,
    PRIMARY KEY(id));

CREATE TABLE salary(
    year int,
    team_id text,
    player_id int not null,
    position_id int,
    base_salary int,
    signing_bonus int,
    roster_bonus int,
    option_bonus int,
    workout_bonus int,
    restructured_bonus int,
    dead_cap int,
    cap_hit int,
    cap_percentage real,
    FOREIGN KEY (player_id) REFERENCES player(id),
    FOREIGN KEY (team_id) REFERENCES team(id),
    FOREIGN KEY (position_id) REFERENCES position(id));

CREATE TABLE av(
    player_id int not null,
    year int,
    team_id text,
    position_id int,
    age int,
    av_value int,
    games_played int,
    games_started int,
    pro_bowler int,
    all_pro int,
    FOREIGN KEY (player_id) REFERENCES player(id),
    FOREIGN KEY (team_id) REFERENCES team(id),
    FOREIGN KEY (position_id) REFERENCES position(id));

CREATE TABLE team(
    id text not null,
    name text not null,
    PRIMARY KEY(id));

CREATE TABLE position(
    id int not null,
    name text not null,
    PRIMARY KEY(id));

CREATE TABLE draft(
    year int,
    round int,
    pick int,
    team_id text,
    player_id int not null,
    position_id text,
    age int,
    last_year int,
    games_started int,
    career_av int,
    draft_team_av int,
    games_played int,
    FOREIGN KEY (player_id) REFERENCES player(id),
    FOREIGN KEY (team_id) REFERENCES team(id),
    FOREIGN KEY (position_id) REFERENCES position(id));
select player.name, salary.cap_hit, av.av_value, av.year, av.position_id
from player, salary, av
where player.id = salary.player_id
  and player.id = av.player_id
  and salary.cap_hit > 20000000
  and av.av_value < 5
  and av.year = salary.year;

select player.name, salary.cap_hit, av.av_value, av.year, av.position_id
from player, salary, av
where player.id = salary.player_id
  and player.id = av.player_id
  and salary.cap_hit > 20000000
  and av.av_value < 10
  and av.year = salary.year;

select player.name, salary.cap_hit, av.av_value, av.year, av.position_id
from player, salary, av
where player.id = salary.player_id
  and player.id = av.player_id
  and salary.cap_hit < 5000000
  and av.av_value > 20
  and av.year = salary.year;
select distinct player.name, av.av_value, av.year, av.position_id
from player, av
where player.id not in (select player_id from draft)
  and player.id = av.player_id
  and av.av_value > 20;
select position_id, avg(av_value) from av group by position_id order by avg(av_value) desc;
select position_id, avg(av_sum)
from (select *, sum(av_value) as av_sum from av group by player_id)
group by position_id
order by avg(av_sum) desc;
Although most of the data cleaning and integration was finished before our previous blog post, there was still one aspect we weren't completely satisfied with: the consistency of the position data in our AV and salary tables. Since we extracted the data in those tables from two different sources, the positions associated with the players in those tables - even for the same player in the same year - could be different. After carefully examining the positions used in the two tables, we decided that we preferred the set of positions used in the salary table for most cases. Thus, when adding players to our database, we made sure that entries that existed in both tables were given the same position in both tables - oftentimes using the position listed in the salary table. However, we did find certain scenarios in which the position from the AV data was preferable. This mostly had to do with the fact that players occasionally switch positions during their careers, while the salary data tended to list only the player's most recent position for all of that player's entries. Thus, we checked whether such a switch had occurred using the position from the AV data, and if it had, we changed the position of the player for that year to match the position given by the AV data in both the AV table and the salary table.
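One reading of the reconciliation rule above, as a sketch: prefer the salary source's position unless the AV source shows the player changing position across years, in which case the AV position is used year by year. The function name and input shapes are assumptions, and the actual rule for detecting a switch may be more nuanced.

```python
def reconcile_positions(av_positions, salary_position):
    """Pick one position per year for a single player.

    av_positions: dict mapping year -> position from the AV source.
    salary_position: the single position the salary source lists for the player.
    Returns a dict mapping year -> position to store in both tables.
    """
    switched = len(set(av_positions.values())) > 1
    if switched:
        # The AV source shows a mid-career change, so trust it year by year.
        return dict(av_positions)
    # No change detected: use the salary source's naming for every year.
    return {year: salary_position for year in av_positions}
```

Writing the returned mapping to both the AV table and the salary table keeps the two tables consistent for each player and year.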
After making the positions consistent, we also wanted CSV copies of the tables in the database, which contained our cleanest and most recent data and was the only source in which player_ids matched up and positions were consistent between tables. This was easy to do from the command line with sqlite, using a command like the following for each table in our database:
sqlite3 -header -csv nfl_data.db "select * from av;" > ../data/av_data_db.csv
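The same export can be scripted for every table at once. The sketch below uses Python's built-in `sqlite3` and `csv` modules; the function name `export_tables` is hypothetical, and the output naming simply mirrors the `av_data_db.csv` pattern in the command above.

```python
import csv
import sqlite3
from pathlib import Path


def export_tables(db_path, out_dir):
    """Write every user table in the SQLite file to <out_dir>/<table>_data_db.csv."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(db_path)
    try:
        # sqlite_master lists all tables; skip SQLite's internal ones.
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master "
            "WHERE type = 'table' AND name NOT LIKE 'sqlite_%'")]
        written = []
        for table in tables:
            cursor = conn.execute(f'SELECT * FROM "{table}"')
            header = [column[0] for column in cursor.description]
            target = out_dir / f"{table}_data_db.csv"
            with open(target, "w", newline="") as handle:
                writer = csv.writer(handle)
                writer.writerow(header)
                writer.writerows(cursor)
            written.append(target)
        return written
    finally:
        conn.close()
```

A call such as `export_tables("nfl_data.db", "../data")` would then replace running the command once per table.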
One of the next steps highlighted in the previous blog posts was calculating the fair salary for a player in a given year. To complete this step, we extracted a CSV file from our database containing AV data for every player, sorted by year and AV (within year).
We used R to process this data, and calculated the fair AV value for each player as follows:
Assumptions:
Using these three assumptions, and inputting the salary cap for every year from 1994 to 2016 (ranging from 34.6 million in 1994 to 155.27 million in 2016), we were able to calculate the fair salary for each player in that given year, using the following formula:
Some of our findings included:
These findings are interesting, but undervalue backups - i.e. some backups have 0 AV but are worth over $0. Part of our next step is seeking to improve this metric.
Moving forward from the ML attempts in the last blog post, we modified our classification and validation procedures to produce more varied AV quintile predictions. We switched to a stratified 10-fold validation scheme, which ensures that the testing sets produced for validation are representative of the class distribution in the sample data. Additionally, we modified the class weights so that each class is weighted in proportion to its inverse frequency, which gives underrepresented classes more weight. This way we can still use the same number of training samples and achieve better results.
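To make the two changes concrete, here is a standard-library sketch of inverse-frequency class weights and a stratified fold assignment. It is not the project's actual training code: the weight formula n_samples / (n_classes * class_count) is one common choice assumed here, and both function names are hypothetical.

```python
from collections import Counter, defaultdict
import random


def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


def stratified_folds(labels, n_folds=10, seed=0):
    """Return a fold index for every sample, dealing each class out round-robin."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    fold_of = [None] * len(labels)
    for indices in by_class.values():
        # Shuffle within the class, then spread its samples evenly over the folds.
        rng.shuffle(indices)
        for position, index in enumerate(indices):
            fold_of[index] = position % n_folds
    return fold_of
```

With 80 samples of one quintile and 20 of another, the weights come out as 0.625 and 2.5, and every fold receives 8 and 2 samples of the two classes respectively, so each test fold mirrors the overall class distribution.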
In order to create the following visualizations, we started with the stencil code we had been given for the first part of the d3 assignment and altered it to fit the data we wanted to display. We figured it would be useful to be able to compare the AV
values and salaries of players at different positions. Since it would be visually unappealing and not useful to display every player at every position, we decided to show a more narrow dataset - specifically the top 10% of players at each
position based on their AV value in a given season in the first chart and the top 10% of players at each position based on their cap hit in a given season in the second chart. In both visualizations, the cells representing each position in
the legend can be clicked to toggle the visibility of that position in the visualization and the bottom half of the year label can be scrolled to switch to any year from 2005 to 2016.
Although the basic foundation was there thanks to the d3 assignment stencil code, we still encountered several challenges while creating this visualization. First, we had to figure out how to plot only the top 10% of players at each
position based on the parameter we were examining. After some trial and error, we realized that the most efficient way to do this would be to use a Python script that utilized a SQL query to extract only the data that we wanted to plot from
our database and write it into a CSV file that we could read with d3. The important part of this Python script looked like this for the first visualization:
data = []
for year in range(2005, 2017):
    for i in range(len(positions)):
        rows = c.execute('''
            select distinct name, av.av_value, av.position_id, salary.cap_hit,
                   av.year, av.team_id, av.age
            from av, player, salary
            where av.position_id = ?
              and av.player_id = player.id
              and salary.player_id = player.id
              and av.year = salary.year
              and av.year = ?
            order by av.position_id, av_value desc
            limit (select count(player_id) from av
                   where position_id = ? and year = ?
                   group by position_id)/10''',
            (positions[i], year, positions[i], year))
        data.extend(rows)
data = []
for year in range(2005, 2017):
    for i in range(len(positions)):
        rows = c.execute('''
            select distinct name, av.av_value, av.position_id, salary.cap_hit,
                   av.year, av.team_id, av.age
            from av, player, salary
            where av.position_id = ?
              and av.player_id = player.id
              and salary.player_id = player.id
              and av.year = salary.year
              and av.year = ?
            order by av.position_id, cap_hit desc
            limit (select count(player_id) from av
                   where position_id = ? and year = ?
                   group by position_id)/10''',
            (positions[i], year, positions[i], year))
        data.extend(rows)
Over the past few weeks, significant efforts have been spent on clarifying the specific direction we wanted to go with our project. We wanted our project to have meaningful results, but also to include sufficient machine learning and statistical elements to satisfy the course requirement.
As a result, and after consultations amongst ourselves and with Colby, we determined a new project direction. Previously, we were just interested in determining the value of a draft pick and the fair contract value of a player. We are keeping with this, but expanding the decision-making process to include current team composition and specific draft prospects. In other words, the team's current strengths and weaknesses ought to inform player and prospect value to the team, and often teams aren't trading for a general draft pick, but a specific draft prospect, and we wish to take that into account. This also allows us to give more specific draft advice - should I draft a defensive end or quarterback first overall, given my team's composition (i.e. needs)?
We envision our project as requiring three steps. First, we determine which current players are being undervalued and overvalued based on their performance - this has already been completed (see Blog Post Three). Next, we determine the career potential of draft prospects by finding past and current players that they are most similar to based on their combine metrics. This allows us to predict their success in the NFL and in turn determine their fair contract value, which we can then compare to the pre-set rookie pay scale. Finally, using a supervised learning approach that compares teams’ historic performances with their roster composition (how many of each position they had, which positions performed best, which positions performed worst), we can determine how a team is currently predicted to perform as well as what piece a team might need to become more successful.
This new direction for our project necessitated the acquisition of a lot more data. This required new scripts for scraping, cleaning, and loading the data into databases.
sqlite3 -header -csv combine_data.db "select name, sum_av*1.0/count_av, combine.* from player, combine, (select player_id, sum(av_value) as sum_av, count(player_id) as count_av from av group by player_id) as av_sum where combine.player_id = player.id and combine.player_id = av_sum.player_id;" > ../data/similar_players_data.csv
sqlite3 -header -csv combine_data.db "select name, combine.* from player, combine where year = 2017 and player.id = combine.player_id;" > ../data/combine_players_2017.csv
>py -3 similar_players.py -pos True -k 3
Similar players to ('C.J. Beathard', 'QB'):
('Thaddeus Lewis', '2', 'QB')
('Chandler Harnish', '0', 'QB')
('Kevin Hogan', '1', 'QB')
Predicted average seasonal AV: 1.0
Similar players to ('Joshua Dobbs', 'QB'):
('Tyrod Taylor', '4.833333333', 'QB')
('Marcus Mariota', '11', 'QB')
('Scott Tolzien', '0.75', 'QB')
Predicted average seasonal AV: 5.527777777666667
Similar players to ('Jerod Evans', 'QB'):
('Colin Kaepernick', '8.166666667', 'QB')
('Tom Savage', '0.5', 'QB')
('Dak Prescott', '16', 'QB')
Predicted average seasonal AV: 8.222222222333334
Similar players to ('Brad Kaaya', 'QB'):
('Zach Mettenberger', '0.5', 'QB')
('Garrett Grayson', '0', 'QB')
('Brandon Weeden', '3.75', 'QB')
Predicted average seasonal AV: 1.4166666666666667
Similar players to ('Chad Kelly', 'QB'):
('Troy Smith', '1.5', 'QB')
('Matt Barkley', '0.666666667', 'QB')
('Chris Weinke', '1.8', 'QB')
Predicted average seasonal AV: 1.3222222223333333
Similar players to ('DeShone Kizer', 'QB'):
('Tom Savage', '0.5', 'QB')
('Colin Kaepernick', '8.166666667', 'QB')
('Dak Prescott', '16', 'QB')
Predicted average seasonal AV: 8.222222222333334
Similar players to ('Mitch Leidner', 'QB'):
('Marcus Mariota', '11', 'QB')
('Pat Devlin', '0', 'QB')
('Dak Prescott', '16', 'QB')
Predicted average seasonal AV: 9.0
Similar players to ('Sefo Liufau', 'QB'):
('Tom Savage', '0.5', 'QB')
('Colin Kaepernick', '8.166666667', 'QB')
('Cody Kessler', '3', 'QB')
Predicted average seasonal AV: 3.888888889
Similar players to ('Patrick Mahomes', 'QB'):
('Dak Prescott', '16', 'QB')
('Pat Devlin', '0', 'QB')
('Chandler Harnish', '0', 'QB')
Predicted average seasonal AV: 5.333333333333333
Similar players to ('Nathan Peterman', 'QB'):
('Dak Prescott', '16', 'QB')
('Pat Devlin', '0', 'QB')
('Austin Davis', '2', 'QB')
Predicted average seasonal AV: 6.0
Similar players to ('Antonio Pipkin', 'QB'):
('Zach Mettenberger', '0.5', 'QB')
('Garrett Grayson', '0', 'QB')
('Brandon Weeden', '3.75', 'QB')
Predicted average seasonal AV: 1.4166666666666667
Similar players to ('Cooper Rush', 'QB'):
('Tom Savage', '0.5', 'QB')
('Cody Kessler', '3', 'QB')
('Austin Davis', '2', 'QB')
Predicted average seasonal AV: 1.8333333333333333
Similar players to ('Seth Russell', 'QB'):
('Zach Mettenberger', '0.5', 'QB')
('Garrett Grayson', '0', 'QB')
('Brandon Weeden', '3.75', 'QB')
Predicted average seasonal AV: 1.4166666666666667
Similar players to ('Alek Torgersen', 'QB'):
('Matt Mauck', '1', 'QB')
('Matt Barkley', '0.666666667', 'QB')
('Andrew Walter', '0.333333333', 'QB')
Predicted average seasonal AV: 0.6666666666666666
Similar players to ('Deshaun Watson', 'QB'):
('Marcus Mariota', '11', 'QB')
('Pat Devlin', '0', 'QB')
('Dak Prescott', '16', 'QB')
Predicted average seasonal AV: 9.0
Similar players to ('Davis Webb', 'QB'):
('Colin Kaepernick', '8.166666667', 'QB')
('Pat Devlin', '0', 'QB')
('Dak Prescott', '16', 'QB')
Predicted average seasonal AV: 8.055555555666666
Similar players to ('Trevor Knight', 'QB'):
('Josh McCown', '2.571428571', 'QB')
('Brett Basanez', '0', 'QB')
('Kevin Kolb', '2.166666667', 'QB')
Predicted average seasonal AV: 1.5793650793333331
Similar players to ('Mitchell Trubisky', 'QB'):
('Brett Basanez', '0', 'QB')
('Brock Berlin', '0', 'QB')
('Kevin Kolb', '2.166666667', 'QB')
Predicted average seasonal AV: 0.7222222223333333

('Mitch Leidner', 'QB', 9.0)
('Deshaun Watson', 'QB', 9.0)
('Jerod Evans', 'QB', 8.222222222333334)
('DeShone Kizer', 'QB', 8.222222222333334)
('Davis Webb', 'QB', 8.055555555666666)
('Nathan Peterman', 'QB', 6.0)
('Joshua Dobbs', 'QB', 5.527777777666667)
('Patrick Mahomes', 'QB', 5.333333333333333)
('Sefo Liufau', 'QB', 3.888888889)
('Cooper Rush', 'QB', 1.8333333333333333)
('Trevor Knight', 'QB', 1.5793650793333331)
('Brad Kaaya', 'QB', 1.4166666666666667)
('Antonio Pipkin', 'QB', 1.4166666666666667)
('Seth Russell', 'QB', 1.4166666666666667)
('Chad Kelly', 'QB', 1.3222222223333333)
('C.J. Beathard', 'QB', 1.0)
('Mitchell Trubisky', 'QB', 0.7222222223333333)
('Alek Torgersen', 'QB', 0.6666666666666666)
>py -3 similar_players.py -pos True -k 3 -zero True

Similar players to ('C.J. Beathard', 'QB'):
('Chandler Harnish', '0', 'QB')
('Kevin Hogan', '1', 'QB')
('Connor Cook', '0', 'QB')
Predicted average seasonal AV: 0.3333333333333333

Similar players to ('Joshua Dobbs', 'QB'):
('Tyrod Taylor', '4.833333333', 'QB')
('Marcus Mariota', '11', 'QB')
('Scott Tolzien', '0.75', 'QB')
Predicted average seasonal AV: 5.527777777666667

Similar players to ('Jerod Evans', 'QB'):
('Ryan Lindley', '-1.333333333', 'QB')
('Blake Bortles', '9.666666667', 'QB')
('Colin Kaepernick', '8.166666667', 'QB')
Predicted average seasonal AV: 5.5000000003333325

Similar players to ('Brad Kaaya', 'QB'):
('Kirk Cousins', '6.2', 'QB')
('Jared Goff', '-2', 'QB')
('Zac Robinson', '0', 'QB')
Predicted average seasonal AV: 1.4000000000000001

Similar players to ('Chad Kelly', 'QB'):
('Quincy Carter', '4.5', 'QB')
("J.T. O'Sullivan", '1.5', 'QB')
('Donovan McNabb', '10.61538462', 'QB')
Predicted average seasonal AV: 5.53846154

Similar players to ('DeShone Kizer', 'QB'):
('Ryan Lindley', '-1.333333333', 'QB')
('Levi Brown', '0', 'QB')
('Sean Mannion', '0', 'QB')
Predicted average seasonal AV: -0.4444444443333333

Similar players to ('Mitch Leidner', 'QB'):
('Brett Hundley', '0', 'QB')
('Bryce Petty', '2', 'QB')
('Marcus Mariota', '11', 'QB')
Predicted average seasonal AV: 4.333333333333333

Similar players to ('Sefo Liufau', 'QB'):
('Jameis Winston', '12.5', 'QB')
('Tom Savage', '0.5', 'QB')
('Ryan Lindley', '-1.333333333', 'QB')
Predicted average seasonal AV: 3.888888889

Similar players to ('Patrick Mahomes', 'QB'):
('Landry Jones', '1', 'QB')
('Dak Prescott', '16', 'QB')
('Pat Devlin', '0', 'QB')
Predicted average seasonal AV: 5.666666666666667

Similar players to ('Nathan Peterman', 'QB'):
('Jimmy Garoppolo', '1', 'QB')
('Ricky Stanzi', '0', 'QB')
('Ryan Lindley', '-1.333333333', 'QB')
Predicted average seasonal AV: -0.11111111099999997

Similar players to ('Antonio Pipkin', 'QB'):
('Dak Prescott', '16', 'QB')
('Pat Devlin', '0', 'QB')
('Jimmy Garoppolo', '1', 'QB')
Predicted average seasonal AV: 5.666666666666667

Similar players to ('Cooper Rush', 'QB'):
('Tom Savage', '0.5', 'QB')
('Ryan Nassib', '0', 'QB')
('Jameis Winston', '12.5', 'QB')
Predicted average seasonal AV: 4.333333333333333

Similar players to ('Seth Russell', 'QB'):
('Kirk Cousins', '6.2', 'QB')
('David Fales', '0', 'QB')
('Zac Robinson', '0', 'QB')
Predicted average seasonal AV: 2.066666666666667

Similar players to ('Alek Torgersen', 'QB'):
('Chad Henne', '3.625', 'QB')
('Bryce Petty', '2', 'QB')
('Brian Brohm', '0', 'QB')
Predicted average seasonal AV: 1.875

Similar players to ('Deshaun Watson', 'QB'):
('Marcus Mariota', '11', 'QB')
('Pat Devlin', '0', 'QB')
('Dak Prescott', '16', 'QB')
Predicted average seasonal AV: 9.0

Similar players to ('Davis Webb', 'QB'):
('Christian Ponder', '5.75', 'QB')
('Bryce Petty', '2', 'QB')
('Jake Locker', '4', 'QB')
Predicted average seasonal AV: 3.9166666666666665

Similar players to ('Trevor Knight', 'QB'):
('Tyrod Taylor', '4.833333333', 'QB')
('Marcus Mariota', '11', 'QB')
('Josh McCown', '2.571428571', 'QB')
Predicted average seasonal AV: 6.134920634666666

Similar players to ('Mitchell Trubisky', 'QB'):
('Landry Jones', '1', 'QB')
('Donovan McNabb', '10.61538462', 'QB')
('Jeff Rowe', '0', 'QB')
Predicted average seasonal AV: 3.8717948733333336

('Deshaun Watson', 'QB', 9.0)
('Trevor Knight', 'QB', 6.134920634666666)
('Patrick Mahomes', 'QB', 5.666666666666667)
('Antonio Pipkin', 'QB', 5.666666666666667)
('Chad Kelly', 'QB', 5.53846154)
('Joshua Dobbs', 'QB', 5.527777777666667)
('Jerod Evans', 'QB', 5.5000000003333325)
('Mitch Leidner', 'QB', 4.333333333333333)
('Cooper Rush', 'QB', 4.333333333333333)
('Davis Webb', 'QB', 3.9166666666666665)
('Sefo Liufau', 'QB', 3.888888889)
('Mitchell Trubisky', 'QB', 3.8717948733333336)
('Seth Russell', 'QB', 2.066666666666667)
('Alek Torgersen', 'QB', 1.875)
('Brad Kaaya', 'QB', 1.4000000000000001)
('C.J. Beathard', 'QB', 0.3333333333333333)
('Nathan Peterman', 'QB', -0.11111111099999997)
('DeShone Kizer', 'QB', -0.4444444443333333)
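The predicted values in the output above are consistent with taking the plain mean of the three neighbors' average seasonal AV and then sorting the prospects from highest to lowest. The following is a minimal sketch of that averaging and ranking step only. It is not the source of similar_players.py: the function names predicted_av and rank_prospects are hypothetical, and the neighbor search itself is not shown.

```python
def predicted_av(neighbors):
    """Mean AV of the neighbors.

    neighbors: list of (name, av_string, position) tuples, in the same
    shape as the tuples printed in the output above.
    """
    return sum(float(av) for _, av, _ in neighbors) / len(neighbors)


def rank_prospects(similar):
    """Return (name, position, predicted AV) tuples, highest prediction first.

    similar: dict mapping (prospect_name, position) -> list of neighbor tuples.
    """
    scored = [(name, pos, predicted_av(nbrs)) for (name, pos), nbrs in similar.items()]
    return sorted(scored, key=lambda row: row[2], reverse=True)


# Three prospects copied from the output above (k = 3 neighbors each).
similar = {
    ("C.J. Beathard", "QB"): [
        ("Chandler Harnish", "0", "QB"),
        ("Kevin Hogan", "1", "QB"),
        ("Connor Cook", "0", "QB"),
    ],
    ("Deshaun Watson", "QB"): [
        ("Marcus Mariota", "11", "QB"),
        ("Pat Devlin", "0", "QB"),
        ("Dak Prescott", "16", "QB"),
    ],
    ("Brad Kaaya", "QB"): [
        ("Kirk Cousins", "6.2", "QB"),
        ("Jared Goff", "-2", "QB"),
        ("Zac Robinson", "0", "QB"),
    ],
}

ranking = rank_prospects(similar)
for row in ranking:
    print(row)
```

With only three neighbors, a single high-AV neighbor such as Dak Prescott or Marcus Mariota moves the prediction substantially, which is worth keeping in mind when reading the ranking.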
For our fourth blog post, we built a visualization that captures several dimensions of the data. The radial stacked bar chart places player positions around the inner radius of the circle and stacks the average AV values by draft round (players with no recorded draft round or year fall under the undrafted category). As expected, undrafted players have average AVs that do not rival those of drafted players; within the rounds of the draft, however, players at most positions fare quite well. The visualization was created in D3, with data processing done in Python.
Our biggest takeaways from this project were:
Another primary takeaway from our project was that the bulk of the work in any data science project lies in obtaining clean data. This became especially evident after we completed most of our analyses in just the past two to three weeks, having spent more than a month and a half getting our databases into the schema that we wanted. With that said, we still learned a lot about the data pipeline, created some neat visualizations, and gathered some interesting insights into the valuation of NFL players from the perspective of a general manager through our salary, draft, and team composition analyses.
Future work for this project would include a more thorough look at how teams change their composition over time, which we felt was important to consider but did not have enough time to implement. The data used for the team composition stacked vertical bar chart gave an interesting look at how much emphasis certain teams placed on offense versus defense, and a compilation of the same chart for all years dating back to 1978 can be found in the blog/viz/team_comp_av folder. To view it, serve that folder from a local web server: the D3 script loads the csv files located at the same path, and browsers block those requests under cross-origin policies when the page is opened directly from disk. An interactive front-end application, focused more on the composition engine/machine learning part of our project, would also be rewarding to complete, given that we have already completed a naive version of the back-end functionality. Such a set of tools would prove useful to general managers in certain hypothetical situations, such as looking at the change in playoff probabilities if a given player were to be traded.
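For the team composition charts mentioned above, one way to start such a local server is with Python's standard library. This is a sketch under assumptions: the helper name make_server and port 8000 are our own choices for illustration, and the folder path is taken from the paragraph above.

```python
import functools
import http.server


def make_server(directory, port=8000):
    """Build an HTTP server that serves static files from `directory`.

    The server binds to localhost only. Call serve_forever() on the
    returned object to start handling requests.
    """
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


# Usage (blocks until interrupted with Ctrl+C):
#   with make_server("blog/viz/team_comp_av") as httpd:
#       httpd.serve_forever()
```

Running py -3 -m http.server 8000 from inside the folder should have the same effect; the charts would then be reachable in a browser at http://localhost:8000/.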
For future work, we can also build on the insights we discovered in the draft pick evaluation. We can fine-tune this valuation by incorporating fifth-year options for first-round picks, as well as adjusting for expected increases in the NFL salary cap over the course of a rookie contract. With these changes, we expect the value of all draft picks to increase further (especially first-round picks, given the fifth-year option).
We would like to thank our mentor TA Colby Tresness for guiding us throughout the semester and giving some semblance of structure to our project, as well as Professor Dan Potter for his remarks at the poster presentation.