Jan 11, 2012 by Cliff DeJong
This article was written at the start of the 2012 NASCAR season to provide a better understanding of the algorithm used in our AccuPredict - Driver Finish Predictions tool. The basis of the formula remains as described in this article, although some tweaks to the weighting of the formulas are made by Cliff after each NASCAR season.
AccuPredict is an exclusive on-line fantasy NASCAR statistics tool that uses traditional and Loop Data statistics to predict NASCAR driver finish positions. It is available to OWNER level subscribers of Fantasy Racing Cheat Sheet.
The AccuPredict method is an algorithm developed by research scientist Cliff DeJong. This multi-page article is layman-friendly and aims to help fantasy NASCAR players understand the relevance of statistics, regardless of how well they understand statistical formulas and methods.
The entire article is also available in PDF format for free download for easier off-line reading.
This report examines several driver performance measures and develops a method for predicting the finishing order of NASCAR Sprint Cup races.
The plot in figure 1 shows one of the better measures that I have found for predictions. It plots the actual finish of each driver against that driver's average finish over the 18 races prior to that race, for the 2011 season (1260 data points). Only finishes of 35th or better are included. I have also shown the trendline as a summary of these data.
The spread of the data is amazing, and it is not obvious that this can be useful. Yet there are tendencies that are valuable since the data are clustered about the trendline.
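As an illustration of the measure plotted in figure 1, here is a minimal Python sketch of a trailing average over the races before each event. The function name `prior_average`, the sample finishes, and the handling of drivers with fewer prior races than the window are all assumptions made for this example, since the article does not describe those details.

```python
def prior_average(finishes, window=18):
    """Average a driver's finishes over up to `window` races before each race.

    `finishes` is ordered oldest race first. The entry for a race uses only
    earlier races, never the race itself. Where no earlier race exists the
    entry is None (an assumption; the article does not say how this is handled).
    """
    averages = []
    for i in range(len(finishes)):
        previous = finishes[max(0, i - window):i]
        averages.append(sum(previous) / len(previous) if previous else None)
    return averages


# Hypothetical finishing positions for one driver, oldest race first.
finishes = [5, 12, 3, 30, 8]
print(prior_average(finishes, window=3))
```

Pairing each value with the actual finish in the same race gives one point of the scatter plot. The article uses a window of 18 races; the window of 3 here only keeps the example short.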
The important fact is that the order of drivers in a specific race can be predicted in a meaningful way.
This report is intended to show the methods used in general terms. The weekly implementation of this method is available as ACCUPREDICT to OWNER level subscribers on this site. The AccuPredict method was used without alteration to score 22nd overall in 2011 on nascar.com NASCAR Fantasy Live out of several thousand competitors. Additionally, this site has received several great success stories from members who have been using it in their particular fantasy NASCAR games.
NASCAR data from 1991 through 2011 are used to develop performance metrics.
The key driver performance measures identified here are:
Driver Rating is the NASCAR Loop Driver Rating, a formula that combines wins, finishes, green flag passes and several other driver performance measures.
In this paper, track types are examined and a regrouping of types is suggested by statistical considerations. Restrictor plate races are scored by a subset of the key measures listed above.
Driver scores based on the above measures are correlated with the actual finishes for the 2011 season with a value of 0.554. During the 2011 season, ACCUPREDICT achieved a correlation of 0.538.
Predictions of almost anything are either historically based, assuming the past repeats itself, or based on first principles of physics, like your daily weather forecast.
Predictions based on historical databases are looking for similarities with the past. If a situation has come up before:
In other words, if a driver has done well at a particular track in the past, does this mean he will do well this weekend?
Maybe...you can also consider how well he is doing this year, and at similar tracks, and how he practiced and qualified.
For NASCAR, there is a rich dataset of past races: I have each race back to 1991 in my database with the finishing positions of each driver. This database is from the LeonardFrye web site, which is an excellent source for NASCAR statistics.
There are over 19000 data points. The database is in a computer-readable form, not scattered over various web sites, so it is relatively easy to process. Plus, each week in the season, and for past races, there are driver loop data, practice data and qualifying results, and other data such as bonus points earned, laps led, etc. My primary source for these data is FantasyRacingCheatSheet.com.
There are some very good fantasy NASCAR expert picks available on the web at no cost and some better ones that cost a subscription fee, including ACCUPREDICT, which is the result of this analysis.
Not all of these expert picks rank all the drivers; some only give a list of the top 5 or so drivers, and perhaps a dark horse or two.
Success in the fantasy leagues often depends on how well the low-ranked drivers do. These drivers are necessary picks because of fantasy salary constraints. So, I wanted to be able to rank each driver...not just get someone's opinion on who would do well at the next track.
A metric is a quantifiable measure of a driver's performance. Metrics available each week for each driver include:
Performance can be measured in two primary ways:
Lots of other data reflecting performance are also available, for example, Laps Led, Fast Laps, Green Flag Passes, Quality Passes, etc.
These latter metrics are not as easy to process, but they are available on web sites such as fantasyracingcheatsheet.com, and will be addressed for the 2011 season only.
Cheat sheets are opinions of experts, often based on unspecified statistics, and not used in this analysis. I have found that other cheat sheets often do not score middle and lower ranked drivers.
DNFs or other major problems during a race can easily move a top ranked driver from a predicted top five to a finish of 40th. I define a DNF as finishing behind anyone who does not complete the race - that is a clear indication of a major problem, not just poor performance.
Typical DNF rates are 15-20%. Since DNFs are unpredictable, there is no obvious way to include them. Their effects on finishing position are in the database that is used.
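The DNF definition above can be sketched in a few lines of Python. This is only one reading of the definition: the tuple layout, the function name `flag_dnfs`, and the choice to count the best-placed non-finisher as a DNF along with everyone behind him are assumptions of this example.

```python
def flag_dnfs(results):
    """Return the set of drivers treated as DNFs in one race.

    `results` is a list of (driver, finish_position, completed_race) tuples.
    A driver is flagged if he finished at or behind the best-placed car that
    did not complete the race (assumed interpretation of the definition).
    """
    non_finisher_positions = [pos for _, pos, completed in results if not completed]
    if not non_finisher_positions:
        return set()  # every car completed the race
    cutoff = min(non_finisher_positions)
    return {driver for driver, pos, _ in results if pos >= cutoff}


# Hypothetical race: driver C fell out, driver D was running but finished behind C.
race = [("A", 1, True), ("B", 2, True), ("C", 3, False), ("D", 4, True)]
print(sorted(flag_dnfs(race)))
```

Driver D is flagged even though he was running at the end, because finishing behind a car that dropped out points to a major problem rather than an ordinary poor run.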
Combining the various metrics is a complex subject that takes serious effort but, as will be seen, provides little gain beyond simple measures. The metrics are not independent: a driver who has done well at a particular track has generally done well at the same track types, and he is likely to practice well and qualify well.
How do you measure the effectiveness of a metric or combinations of metrics? There are two primary ways that I use:
Less frequently, I also use the likelihood that a higher-ranked driver will finish ahead of a lower-ranked driver.
Correlation is a standard statistical measure that essentially plots one variable (the actual finish, for example) as a function of the other variable (the metric, practice speed, for example), and measures how well a straight line will fit the data.
Correlation ranges between -1 and 1, with the two extremes indicating a perfect fit.
A correlation of zero indicates that the result is independent of the metric. In other words, a very low correlation indicates the metric is not a useful indicator of a driver’s finishing position. I will show some plots later to make this a lot clearer.
Correlation can also be expressed as a percentage: -100% to 100%.
Typically, in NASCAR, correlations range from 30% to 50%; that is, there is a lot of randomness in NASCAR. The data shown in the introduction have a correlation of about 0.50, or 50%.
A negative correlation means that as the metric gets larger, the actual finish gets smaller.
In this paper, I deal only with positive correlations by scaling the metrics - for example, the Driver Rating becomes a simple ranking of the drivers, with the best driver scored a one, second best a two, etc.
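To make the correlation and the rank scaling concrete, here is a small Python sketch using the standard Pearson formula. The Driver Rating values and finishes are invented for illustration, and the helper names `pearson` and `to_rank` are hypothetical, not part of AccuPredict.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def to_rank(values, higher_is_better=True):
    """Convert metric values to ranks, 1 = best. Ties keep their input order."""
    order = sorted(range(len(values)), key=lambda i: values[i],
                   reverse=higher_is_better)
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks


# Hypothetical Driver Ratings (higher is better) and actual finishes.
ratings = [110.0, 95.5, 80.2, 70.1, 60.3]
finishes = [2, 1, 5, 3, 4]

print(pearson(ratings, finishes))           # negative: high rating, low finish number
print(pearson(to_rank(ratings), finishes))  # positive after converting to ranks
```

Converting the rating to a rank flips the sign of the correlation, because rank 1 then lines up with finishing 1st. That is the sense in which scaling the metrics lets every correlation be treated as positive.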
The Standard Deviation of the predicted finish is a measure of how accurate the prediction is. In essence, it is a measure of how much you are wrong on average.
Almost 70% of the data are within plus or minus one standard deviation. The standard deviation is larger than you might think: typical numbers are 9 to 10, showing, again, a lot of variability in NASCAR. This is not at all unreasonable if you think about a DNF rate of about 20%. A driver who finishes 1st, 2nd, 3rd, 4th and 35th (due to an accident) will average only a 9th place finish for these five races, despite four outstanding races.
The important point is how the drivers' average finishing positions compare with one another.
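The arithmetic in the five-race example, and one way of measuring the spread of prediction errors, can be checked with a short Python sketch. Treating the measure as the standard deviation of (actual finish minus predicted finish) is an assumption here, as are the function name `error_spread` and the sample numbers.

```python
from statistics import mean, pstdev

# The five finishes from the example: four strong runs and one accident.
finishes = [1, 2, 3, 4, 35]
print(mean(finishes))  # average finishing position over the five races


def error_spread(predicted, actual):
    """Standard deviation of the prediction errors (actual minus predicted).

    Assumed reading of "standard deviation of the predicted finish"; the
    article does not give the exact formula.
    """
    errors = [a - p for p, a in zip(predicted, actual)]
    return pstdev(errors)


# Hypothetical predictions and results for four drivers.
predicted = [5, 5, 5, 5]
actual = [3, 7, 3, 7]
print(error_spread(predicted, actual))
```

A single 35th place finish is enough to pull the five-race average from 2.5 down to 9, which is why a spread of 9 to 10 positions is plausible even for good predictions.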
Drivers will be ranked by a score, based on the metrics selected. The likelihood that a higher ranked driver finishes ahead of a lower ranked driver is calculated by comparing each driver with every other driver ranked below him. The percentage of correct rankings is then calculated, and averages about 70%.
This percentage is higher if the difference in rankings is high, and less if differences are small. This measure is not used often, since it is related closely to correlations.
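The pairwise comparison described above can be sketched as follows. This is a minimal illustration, not the AccuPredict code: the function name `pairwise_accuracy`, the sample data, and the decision to skip pairs with equal predicted rank are assumptions.

```python
from itertools import combinations


def pairwise_accuracy(predicted_rank, actual_finish):
    """Fraction of driver pairs whose predicted order matches the actual order.

    Both lists are indexed by driver; lower numbers are better. Pairs with
    equal predicted rank are skipped (assumption). Returns None if no pair
    can be compared.
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(predicted_rank)), 2):
        if predicted_rank[i] == predicted_rank[j]:
            continue
        # Identify which driver of the pair was ranked higher (smaller number).
        higher, lower = (i, j) if predicted_rank[i] < predicted_rank[j] else (j, i)
        total += 1
        if actual_finish[higher] < actual_finish[lower]:
            correct += 1
    return correct / total if total else None


# Hypothetical race: the top two predicted drivers swapped places.
predicted = [1, 2, 3, 4]
actual = [2, 1, 3, 4]
print(pairwise_accuracy(predicted, actual))
```

In this example five of the six pairs are ordered correctly. Restricting the loop to pairs whose predicted ranks differ by at least some gap would show the effect noted above, where accuracy rises as the difference in rankings grows.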
Cliff DeJong (pronounced De Young), the man behind AccuPredict, is a research scientist who has been crunching numbers his entire life. An avid NASCAR fan, Cliff was introduced to fantasy NASCAR by his brother (who beat him at just about everything).
Cliff put his Carnegie Mellon Computer Science degree and Iowa State University Mathematics degree to use creating successful methods to predict each Cup race based on NASCAR statistics.
It is an obsession that has consumed untold hours.
Cliff would love to hear your comments, questions and suggestions at accupredict@gmail.com