Jan 11, 2012 by Cliff DeJong
NOTE: This article was written at the start of the 2012 NASCAR season to provide a better understanding of the algorithm used in our AccuPredict - NASCAR Driver Finish Predictions tool. The basis of the formula remains as described in this article, although some tweaks to the weighting of the formulas are made by Cliff after each NASCAR season.
One metric is how well a driver has done in the last several races, counting every track.
If you use a small number of races, you will measure how a particular driver has done lately and be able to react to a driver on a hot streak, such as Tony Stewart at the end of the 2011 season (or Kyle Busch's annual collapse during the Chase).
A small number of races will better reflect how a driver has improved with time as well.
On the other hand, using a large number of past races will not be sensitive to one bad race caused, for example, by an accident, and will be a better estimate of how consistent a driver is.
I evaluated the correlation of actual finishing position to the driver's average finish over the last N races.
If a driver was only in some of the last N races, the average is taken over the races he was in. The plot in Figure 2 shows correlations for the entire database, and with the years 2010 and 2011 separated out.
The curves shown have similar shapes, and all show that averaging over the last 10 races or more gives the best results.
However, an obvious question is why the correlations for 2010 and 2011 are so much better than those for the entire span of data from 1991 to 2011.
I believe that this is due to the recent phenomenon of start-and-park, where some drivers with little or no sponsorship will attempt to qualify and then only run a few laps due to cost issues. Those drivers are almost certain to finish very poorly every race and are therefore easy to predict.
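The averaging rule and the correlation measure described above can be sketched in a few lines. This is only an illustration, not the actual AccuPredict code: the function names, the use of `None` to mark a missed race, and the choice of the Pearson correlation coefficient are all assumptions made for the example.

```python
from math import sqrt


def average_last_n(finishes, n):
    """Average finish over the last n scheduled races (n >= 1).

    `finishes` is a chronological list of finishing positions, with None
    (an assumed convention) marking a race the driver did not run.
    Missed races are skipped; returns None if he ran none of the last n.
    """
    window = [f for f in finishes[-n:] if f is not None]
    if not window:
        return None
    return sum(window) / len(window)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)
```

For example, `average_last_n([5, None, 10, 3], 3)` looks at the last three races, skips the missed one, and averages the 10 and the 3 to give 6.5. Pairing each driver's prior average with his actual finish in the next race and passing the two lists to `pearson` gives one point on a curve like those in Figure 2.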
This raises the overall correlation for those races in a misleading manner.
Since the top 35 cars in owner points are locked into each race and do not start-and-park, I repeated the calculations using only cars that finished in the top 35. This is shown in Figure 3.
It would be better to look only at drivers in the top 35 in points (those locked into the race and not likely to start-and-park), but that information is not readily available, and it would take significant effort to add it to the database.
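The top-35 restriction amounts to a simple filter applied before the correlation is computed. The sketch below assumes each record is a hypothetical `(prior_average, actual_finish)` pair; the record layout and function name are illustrative only.

```python
def keep_top_finishers(rows, cutoff=35):
    """Keep only records whose actual finish is within the cutoff.

    Each row is an assumed (prior_average, actual_finish) pair. Filtering
    on the actual finish stands in for the owner-points filter, which the
    article notes is not readily available.
    """
    return [row for row in rows if row[1] <= cutoff]
```

With invented rows `[(12.0, 8), (41.5, 43), (20.0, 35)]`, the middle record (a 43rd-place finish, typical of a start-and-park entry) is dropped and the other two are kept.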
Again, the curve shapes are similar and correlations for 2011 are somewhat above the long-term averages, but the differences are much smaller.
There is rapid improvement as the number of races averaged increases to about 12-15, with little or no improvement above that.
Trying to read differences of less than a percent is pushing the data beyond what is reasonable, so I have selected 15 as a reasonable number to average over all races. I wanted as small a number as possible to preserve any information about drivers on hot streaks.
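One way to express that selection rule is to take the smallest N whose correlation is within some tolerance of the best value. The sketch below assumes "less than a percent" means a difference of 0.01 in correlation. That reading, the function name, and the numbers in the usage note are assumptions, not values from the figures.

```python
def smallest_adequate_n(correlations, tolerance=0.01):
    """Return the smallest N whose correlation is within `tolerance` of the best.

    `correlations` maps N (number of races averaged) to the correlation
    obtained with that N.
    """
    best = max(correlations.values())
    return min(n for n, r in correlations.items() if best - r <= tolerance)
```

With made-up values `{5: 0.30, 10: 0.38, 15: 0.395, 20: 0.40}`, the function returns 15: the gain from 15 to 20 is below the tolerance, so the smaller window is preferred.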
Often a driver will excel at a particular track. Denny Hamlin, for example, has always done very well at Pocono.
Using the 1991-2011 data, I calculated, for each track, the correlation of each driver's actual finish with his average finishing position in previous races at that track. Again, only the top 35 finishers are counted.
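The per-track average works the same way as the all-track average, except that only races at the track in question are considered. A minimal sketch, assuming a driver's history is stored as a chronological list of hypothetical `(track, finish)` pairs:

```python
def track_average(history, track, n):
    """Average finish over the driver's last n races (n >= 1) at `track`.

    `history` is an assumed chronological list of (track, finish) pairs
    for one driver. Returns None if he has never raced at that track.
    """
    at_track = [finish for name, finish in history if name == track]
    recent = at_track[-n:]
    if not recent:
        return None
    return sum(recent) / len(recent)
```

For an invented history of `[("Pocono", 1), ("Daytona", 30), ("Pocono", 5), ("Pocono", 3)]`, calling `track_average(history, "Pocono", 2)` averages the last two Pocono finishes and gives 4.0.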
Figure 4 shows the results for three typical tracks: Phoenix, Atlanta and Daytona.
Phoenix and Atlanta show typical curve shapes, with the correlations rising as the number of races averaged gets larger and then flattening out. The curves peak at six or more races.
Daytona, on the other hand, has a very poor correlation, no matter how many races are averaged.
There may be a few drivers that have done well in the past and will do well at Daytona in the future, but in general, at Daytona, past performance at that track does not imply continued success.
Conversely, poor past performance at Daytona does not necessarily imply another poor finish.
In Figure 5, I have taken each track's correlation as a function of the number of races and averaged all the tracks together to give the curve in red.
The figure also shows in blue the average correlation for averages from the most recent N races at any track (from Figure 3 above).
The two curves have similar shapes: both start relatively low and then improve as the number of races averaged increases.
The correlation for averages at each individual track starts to level out above five or six races, while the correlation for averages over all tracks climbs more slowly and peaks at around 15 races.
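The red curve in Figure 5 is a pointwise average of the individual track curves. A sketch of that step, assuming each track's curve is stored as a list whose entry at index i is the correlation for averaging i + 1 races (the storage layout and function name are assumptions):

```python
def mean_curve(curves):
    """Pointwise mean of several correlation curves.

    `curves` maps a track name to a list of correlations, one per number
    of races averaged. Curves are truncated to the shortest length so
    every point averages over all tracks.
    """
    length = min(len(curve) for curve in curves.values())
    return [
        sum(curve[i] for curve in curves.values()) / len(curves)
        for i in range(length)
    ]
```

Each track contributes equally here, regardless of how many races have been run there. Whether the original calculation weighted tracks differently is not stated.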
Consideration of each individual track's correlation curve suggests that averaging eight races at the same track gives very good performance for this measure.
The table in Figure 6 gives each track's correlation for performance averaged over the last eight races. I have also included the four-race averages, since I have frequently used those in the past.
For almost all tracks, averaging over eight races improves correlation over the four-race averages, but not by much.
These correlations are also seen to be lower than the correlations over the last 15 races at all tracks (see Figure 5).
In other words, a driver's average finish over the last 15 races at all tracks is a better indicator of how well he will do at a particular track than his past finishes at that same track.
Of course, in the AccuPredict method, both performance measures will be used to estimate driver finishes.
Notice also that some tracks are not correlated well at all to past performance at that track:
One possible explanation for the most recent 15 races at all tracks being a better indicator than the most recent eight races at a specific track is the time interval covered.
The last 15 races at any track cover almost half a season, or about half a year, while the last eight races at a specific track cover the past four or eight years of data, depending on whether two races or one are run at that track each season (for example, two races a year at Martinsville or one at Chicago).
I suspect that averaging over several races is needed, even over several years at some tracks, because accidents or other problems, like flat tires, can skew a driver's results downward and misrepresent his performance.
It is interesting to note that averaging over several races matters more than capturing a hot streak over a few races; that is, driver consistency over the long haul is more important.
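The arithmetic behind that comparison is short enough to write out. The sketch assumes a 36-race points season, which is not stated in the article, and the function names are illustrative.

```python
RACES_PER_SEASON = 36  # assumption: points races in one Cup season


def seasons_covered(n_races, races_per_season=RACES_PER_SEASON):
    """Fraction of a season spanned by the last n races at any track."""
    return n_races / races_per_season


def years_at_track(n_races, visits_per_year):
    """Years of history spanned by the last n races at a single track."""
    return n_races / visits_per_year
```

Under that assumption, `seasons_covered(15)` is about 0.42 of a season, while `years_at_track(8, 2)` is 4 years and `years_at_track(8, 1)` is 8 years, so the track-specific average reaches much further back in time.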
Cliff DeJong (pronounced De Young), the man behind AccuPredict, is a research scientist who has been crunching numbers his entire life. An avid NASCAR fan, Cliff was introduced to fantasy NASCAR by his brother (who beat him at just about everything).
Cliff put his Carnegie Mellon Computer Science degree and Iowa State University Mathematics degree to use creating successful methods to predict each Cup race based on NASCAR statistics.
It is an obsession that has consumed untold hours.
Cliff would love to hear your comments, questions and suggestions at accupredict@gmail.com.