Chapter 2 Data

This project required multiple data sources. First, to build the dynamic linear model (DLM) that forecasts the spread, I gathered data on every spread movement during the week leading up to the game for as many NFL games as possible. By web scraping the https://pregame.com/game-center website, I was able to gather these data for all NFL games from the past two seasons. These data needed a significant amount of manipulation and cleaning before they were in a usable format. Using string manipulation with the “stringr” package, I reduced each game to a data frame of approximately 100 to 200 observations of the variables listed in Table 2.1.

Table 2.1: Betting Statistics

| Statistic | Description |
|---|---|
| Time and Date | The time and date of the observation. The first observation nearly always occurred after both teams had finished playing their previous game, so the first observation was usually Sunday evening one week prior to the game, with the final observation seconds before game time (usually the following Sunday) |
| Spread | The spread for the away team |
| Away Cash Percentage | The percent of the money bet on the game that is bet on the away team |
| Away Cash Bet | The amount of money that is bet on the away team |
| Away Ticket Percentage | The percent of bets on this game that are on the away team |
| Away Ticket Number | The number of bets on this game that are on the away team |

With these data, it is easy to calculate the same quantities for the home team using the formulas in Equations (2.1) through (2.4), where each percentage is expressed as a proportion between 0 and 1.

\[\begin{eqnarray} \text{Home Cash Bet} =& \frac{\text{Away Cash Bet}}{\text{Away Cash Percentage}} - \text{Away Cash Bet} \tag{2.1} \\ \text{Home Cash Percentage} =& \frac{\text{Home Cash Bet}}{\text{Home Cash Bet} + \text{Away Cash Bet}} \tag{2.2} \\ \text{Home Ticket Number} =& \frac{\text{Away Ticket Number}}{\text{Away Ticket Percentage}} - \text{Away Ticket Number} \tag{2.3} \\ \text{Home Ticket Percentage} =& \frac{\text{Home Ticket Number}}{\text{Home Ticket Number} + \text{Away Ticket Number}} \tag{2.4} \end{eqnarray}\]
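As a minimal sketch of Equations (2.1) through (2.4), the function below derives the home-team figures from the away-team figures. The function name, argument names, and the example numbers are hypothetical and chosen only for illustration; the sketch assumes percentages are stored as proportions in (0, 1], not as values such as 60.

```python
def home_side(away_cash_bet, away_cash_pct, away_tickets, away_ticket_pct):
    """Derive home-team betting figures from away-team figures.

    Assumption: percentages are proportions in (0, 1], not values like 60.
    """
    for pct in (away_cash_pct, away_ticket_pct):
        if not 0 < pct <= 1:
            raise ValueError("percentages must be proportions in (0, 1]")

    # Equation (2.1): total cash minus the away cash
    home_cash_bet = away_cash_bet / away_cash_pct - away_cash_bet
    # Equation (2.2): home share of all cash
    home_cash_pct = home_cash_bet / (home_cash_bet + away_cash_bet)
    # Equation (2.3): total tickets minus the away tickets
    home_tickets = away_tickets / away_ticket_pct - away_tickets
    # Equation (2.4): home share of all tickets
    home_ticket_pct = home_tickets / (home_tickets + away_tickets)

    return {
        "home_cash_bet": home_cash_bet,
        "home_cash_pct": home_cash_pct,
        "home_tickets": home_tickets,
        "home_ticket_pct": home_ticket_pct,
    }


# Hypothetical observation: 600 of the cash (60%) and 300 tickets (75%) on the away team
example = home_side(600.0, 0.6, 300, 0.75)
```

Equations (2.2) and (2.4) reduce to one minus the corresponding away percentage, which gives a quick consistency check on scraped values.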

These variables are important because they influence the spread. A casino typically moves the spread so that roughly 50% of the money is on each team, since balanced action guarantees the casino about 4.5% of the money wagered regardless of the outcome. At other times, however, the casino is essentially gambling by allowing uneven money percentages: it takes a position on a certain outcome that, according to its models, raises its expected value.
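The 4.5% figure can be recovered with a short calculation under one assumption that the text does not state: that both sides are priced at the common -110 odds, meaning a bettor risks 110 to win 100. If an amount \(M\) is wagered on each team, the casino collects \(2M\), returns the winners' stake \(M\), and pays them winnings of \(\frac{100}{110}M\):

\[\begin{aligned} \text{Casino profit} &= 2M - \left(M + \tfrac{100}{110}M\right) = \tfrac{1}{11}M \\ \text{Share of money wagered} &= \frac{\tfrac{1}{11}M}{2M} = \frac{1}{22} \approx 4.5\% \end{aligned}\]

Under different pricing the guaranteed share would differ, so this is an illustration of the mechanism only.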

The timing of the data points in these series is irregular. To start, each week one game is played on Thursday night and one on Monday night, with the rest of the games played on Sunday. A series starts when the casinos first open the game for betting, which usually occurs on the Sunday one week prior to the game. But, since not all games are played on Sunday, some games are open for betting for shorter or longer periods of time. In addition, the casinos open the Week 1 games for betting weeks in advance. This is the first irregularity, and it produces series of different lengths: the shortest series has only 57 data points while the longest has 265. However, the interquartile range of series lengths is 130 to 170 data points, and nearly all series fall between 100 and 200 data points.

Within each series, the data are not captured at standardized time intervals. Instead, a data point is recorded whenever there is a shift in the percentage of money or the percentage of tickets bet on each team, so it is the timing of bets that triggers a data point. Some data points are spaced minutes apart while others are spaced as much as 12 hours apart. For modeling purposes, I treat these irregular intervals as evenly timed data points. When I forecast the spread, I forecast \(h\) future points, with \(h\) chosen by a separate model that predicts how many more data points the series will have based on the week of the NFL season, the hours until the game starts, and the number of bets and amount of cash bet on the game. This is discussed further in the modeling section.
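To make the even-spacing simplification concrete, the sketch below replaces each timestamp with its position in the series, so that the model sees observations 1, 2, 3, and so on regardless of the elapsed time between them. The function names and the four observations are hypothetical and are not taken from the scraped data.

```python
from datetime import datetime


def to_even_index(observations):
    """Sort (timestamp, spread) pairs by time and index them 1..n.

    The integer index discards the real elapsed time between observations.
    """
    ordered = sorted(observations, key=lambda obs: obs[0])
    return [(t, spread) for t, (_, spread) in enumerate(ordered, start=1)]


def gaps_in_minutes(observations):
    """Return the real time gaps (in minutes) that the index ignores."""
    times = sorted(ts for ts, _ in observations)
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]


# Hypothetical observations for one game: (timestamp, away spread)
obs = [
    (datetime(2018, 10, 1, 9, 0), -3.0),
    (datetime(2018, 10, 1, 9, 4), -3.5),
    (datetime(2018, 10, 1, 21, 4), -3.5),
    (datetime(2018, 10, 2, 8, 30), -3.0),
]

series = to_even_index(obs)   # index-based series passed to the model
gaps = gaps_in_minutes(obs)   # 4-minute and roughly 12-hour gaps coexist
```

Under this convention a forecast horizon of \(h\) means \(h\) further recorded observations, not \(h\) units of clock time, which is why the number of remaining data points has to be predicted separately.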

The next dataset is a CSV, updated weekly, that contains information about all NFL games dating back to 2006, such as the teams, week, year and location, in addition to the opening and closing spreads for these games. The spreads are useful for cross-referencing against the scraped data. More importantly, this dataset contains the results of all the games and therefore also serves as a list of all games, which I iterate through to fit each of the 414 separate DLMs.

Finally, to obtain team-level statistics, I needed separate data sets for each week of each NFL season. This is because, leading up to a Week 5, 2018 matchup, the only information available to bettors is the data from that season (and all previous seasons) through Week 4 of 2018. Football Outsiders publishes week-by-week webpages for a family of statistics called Defense-adjusted Value Over Average (DVOA). DVOA measures a team’s efficiency by comparing success on every single play to a league average based on situation and opponent. The site also provides “Weighted DVOA”, which weights the team’s DVOA with a preseason projection created by Football Outsiders. After only one week, a team’s DVOA will be very extreme, and weighting it with a projection ensures that the metric does not overreact to an extremely limited sample. This is similar to putting a prior on DVOA and updating the posterior with the data from the games played. These statistics are expressed as percentages, indicating how far above or below average a team is. Table 2.2 describes all these statistics.

Table 2.2: Team-specific Statistics

| Statistic | Explanation |
|---|---|
| Total DVOA | Measures a team’s efficiency by comparing success on every single play to a league average based on situation and opponent |
| Weighted DVOA | Weights the DVOA with a preseason projection |
| Offense DVOA | Measures a team’s offensive efficiency |
| Defense DVOA | Measures a team’s defensive efficiency |
| Special Teams DVOA | Measures a team’s efficiency on special teams plays (field goals, punts, kickoffs) |
| Record | A team’s record of wins, losses and ties |
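The prior-and-posterior reading of Weighted DVOA can be illustrated with a simple shrinkage estimate. This is only a sketch of the idea and not Football Outsiders' actual formula, which is not given here: the function name and the `prior_games` constant (how many games of evidence the preseason projection is treated as being worth) are hypothetical.

```python
def blended_rating(projection, observed, games_played, prior_games=4.0):
    """Blend a preseason projection with observed DVOA (both in percent).

    Assumption: the projection counts as `prior_games` games of evidence,
    so its weight shrinks as real games accumulate.
    """
    if games_played < 0 or prior_games <= 0:
        raise ValueError("games_played must be >= 0 and prior_games > 0")
    weight_on_data = games_played / (games_played + prior_games)
    return weight_on_data * observed + (1 - weight_on_data) * projection


# Hypothetical team: projected at +5%, but an extreme +45% after one game
after_week_1 = blended_rating(5.0, 45.0, games_played=1)
after_week_12 = blended_rating(5.0, 45.0, games_played=12)
```

With one game played the blend stays close to the projection, and with twelve games played it sits much closer to the observed value, which is the behavior the paragraph describing Weighted DVOA attributes to it.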

Finally, I merged these data sets. I matched the week and year of each game to the corresponding statistics data set, and then matched each team to its statistics through that week. The completed data set is used for modeling. Section @ref{appen1} of the Appendix shows one line of this data set.
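The matching step can be sketched as a keyed lookup on year, week, and team. The field names, the function name, and the example values below are hypothetical; the sketch assumes the statistics table has one row per team per week holding the figures available before that week's game.

```python
def merge_games_with_stats(games, team_stats):
    """Attach each team's pre-game statistics to every game row.

    games: list of dicts with 'year', 'week', 'home', 'away' keys.
    team_stats: dict keyed by (year, week, team) -> dict of statistics
        available before that week's game (an assumption of this sketch).
    """
    merged = []
    for game in games:
        row = dict(game)  # copy so the input rows are left untouched
        for side in ("home", "away"):
            key = (game["year"], game["week"], game[side])
            if key not in team_stats:
                raise KeyError(f"no statistics for {key}")
            for name, value in team_stats[key].items():
                row[f"{side}_{name}"] = value  # e.g. home_total_dvoa
        merged.append(row)
    return merged


# Hypothetical inputs, for illustration only
games = [{"year": 2018, "week": 5, "home": "PHI", "away": "MIN"}]
team_stats = {
    (2018, 5, "PHI"): {"total_dvoa": 8.1, "record": "2-2-0"},
    (2018, 5, "MIN"): {"total_dvoa": -3.4, "record": "1-2-1"},
}

modeling_rows = merge_games_with_stats(games, team_stats)
```

Raising an error on a missing key, instead of silently filling blanks, makes mismatched team abbreviations or week numbers visible before modeling.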