Jan 11, 2012 by Cliff DeJong
NOTE: This article was written at the start of the 2012 NASCAR season to provide a better understanding of the algorithm used in our AccuPredict - NASCAR Driver Finish Predictions tool. The basis of the formula remains as described in this article, although some tweaks to the weighting of the formulas are made by Cliff after each NASCAR season.
Thus far, we have looked at performance in past races in a number of ways:
And, also made some new discoveries:
Each season is a new start for drivers and their teams, and may bring a new crew chief or even a new team for some drivers.
I have observed that each year some drivers do consistently better or consistently worse than expected based solely on their performance in previous years.
As a consequence, another measure that I have found to be useful is the current year-to-date standings of the drivers.
I do not use year-to-date statistics until after four races have been run.
I have used point standings and Driver Ratings in past years, and found that driver rankings based on Driver Ratings are a little better to use.
I have not done a formal analysis, but at the end of the 2010 season found that the Driver Rating was correlated to average finish with a value of 0.93 while the correlation of points (after taking out the Chase points adjustment) to average finish was 0.76.
These are correlations to the final values for the entire 2010 season and cannot be compared to other correlation measures in this report that are for single races.
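As an illustration of the kind of season-long correlation quoted above, here is a minimal sketch of a Pearson correlation between a driver's rank by Driver Rating and his average finish. The function name `pearson` and the five data points are assumptions made up for this example; they are not the 2010 values.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical season-end values for five drivers (not real data).
rating_rank = [1, 2, 3, 4, 5]                  # rank by Driver Rating, 1 = best
average_finish = [9.8, 12.1, 11.5, 16.0, 19.3]

r = pearson(rating_rank, average_finish)
print(round(r, 2))
```

A value near 1 means the ranking orders the drivers almost exactly as their average finishes do.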
Driver Rating, as defined by NASCAR loop data, combines several measures of driver performance, including green flag passes, green flag times passed, fast laps, laps led, and more.
I collected several of these measures during the 2011 season for assessment and will show these in the next section.
For the 2011 season, a number of performance measures were collected for each race. Figure 16 shows those measures and their correlations with finishing position.
Here are the definitions of each measure:
These metrics were developed as possibly useful in unpublished analyses of past seasons. I offer no rationale for their selection; they have evolved over time.
Some of the measures are much better than others; the average finish over the last 18 races is the best, with year-to-date Driver Rating the second best. Others, such as the green flag pass differential, carry little information and have relatively poor correlations.
Given these 15 measures of driver performance, how can they be combined to give the best estimate of finish position for each driver and race?
The answer is far from obvious, because all of these measures are heavily correlated with each other: a driver who has finished well in the last 18 races is also likely to place highly in the year-to-date Driver Ratings, and so on. If two measures are highly correlated, the second adds little new information to the first.
An additional complication is that not all measures are always available for each driver; Trevor Bayne, for example, had no prior Sprint Cup history at Daytona before the 2011 season opening race.
The desire is to find a simple method for combining selected measures.
First, all measures must be transformed mathematically so that a small number will indicate a likely good finish. The easy way to do this is to fit the various measures to the average finish and then use the curve fit data to represent the measure in question.
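The transformation step can be sketched as follows. The article does not say what form the curve fit takes, so this example assumes a straight-line least-squares fit; the helper names `fit_line` and `transform` and the sample numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares straight line y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def transform(values, slope, intercept):
    """Replace each raw measure with the finish the fitted line predicts."""
    return [slope * v + intercept for v in values]

# Hypothetical data: Driver Rating (higher is better) and average finish.
driver_rating = [110.0, 95.0, 80.0, 65.0]
finish = [5.0, 12.0, 20.0, 28.0]

slope, intercept = fit_line(driver_rating, finish)
expected_finish = transform(driver_rating, slope, intercept)
print(expected_finish)
```

After the transformation every measure is on the same scale (an expected finishing position), so a high Driver Rating becomes a small number, and the measures can be averaged directly.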
A score defined as a simple average of all the transformed measures gives a correlation of 0.538 to the actual finishes.
The standard deviation of the estimated finishes based on the simple average for 2011 is 9.45.
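A minimal sketch of the simple-average score, assuming each measure has already been transformed to an expected finish and that a missing measure is marked with `None` and left out of the average. The driver names and values are invented for illustration.

```python
def average_score(transformed):
    """Mean of the measures that exist; None marks a missing measure."""
    available = [v for v in transformed if v is not None]
    if not available:
        return None
    return sum(available) / len(available)

# Hypothetical transformed measures (expected finishes) for two drivers.
drivers = {
    "Driver A": [8.0, 10.0, 12.0],
    "Rookie B": [None, 25.0, 21.0],   # no history for the first measure
}

scores = {name: average_score(values) for name, values in drivers.items()}
ranking = sorted(scores, key=scores.get)   # lowest score = best predicted finish
print(scores, ranking)
```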
A large number of perturbations on combinations of the measures were examined, and the best approach was to average:
This gave a correlation for the score to actual finish of 0.550 and a standard deviation of 9.36 for the estimated finishes.
To determine the best possible fit of the metrics to the finishing positions, a multiple regression was calculated, using all 15 metrics as inputs. Regression, strictly speaking, is only valid for independent variables, and these are not independent. Still, in practice, regression can be very useful even here.
For data points without all 15 metrics, the simple averages of the best combinations in the previous paragraph were used. This gave a correlation of 0.559 and a standard deviation of 9.29.
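The fallback logic described here might look like the sketch below. It does not fit the regression; the coefficients and intercept are hypothetical placeholders for three metrics rather than the 15 used in the article, and the fallback is simplified to a plain average of whatever metrics exist.

```python
def predict_finish(metrics, coefficients, intercept):
    """Use the regression when every metric exists, else a simple average."""
    if all(m is not None for m in metrics):
        return intercept + sum(c * m for c, m in zip(coefficients, metrics))
    available = [m for m in metrics if m is not None]
    return sum(available) / len(available) if available else None

# Hypothetical coefficients (not fitted values from the article).
coefficients = [0.5, 0.3, 0.2]
intercept = 1.0

full = predict_finish([10.0, 20.0, 30.0], coefficients, intercept)
partial = predict_finish([10.0, None, 30.0], coefficients, intercept)
print(full, partial)
```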
The disadvantage of using this approach, however, is complexity and the fact that the regression is highly tuned to the data in the 2011 season.
The regression for 2012 data will almost certainly be different.
Plus, the 0.559 is not dramatically better than the 0.550 found by experiment.
There are other approaches to maximize the correlation of a combination of correlated variables.
For statistics geeks, I tried Principal Component Analysis and a method in a paper by Keller and Olkin. The required assumptions are only partially met, and results were a correlation of 0.550-0.551, not quite as good as the regression results.
These have also been tried in earlier seasons, with similar results, and have the same drawbacks as the regression method.
Another interesting approach that I tried was to look at each driver as ranked for all metrics. If a driver was ranked ahead of another driver on more of the metrics, then he was ranked higher in the combination.
This was only slightly different from the averages of the metrics, and performance was slightly worse.
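One way to sketch this head-to-head ranking: for every pair of drivers, count the metrics on which each is ranked ahead, credit the driver who leads on more metrics with a win, and order drivers by total wins. Ordering by win count is an assumption made for this example; the article does not say how the pairwise results were combined. The ranks below are invented.

```python
from itertools import combinations

def pairwise_wins(ranks_by_driver):
    """Count, for each driver, the rivals he beats on a majority of metrics."""
    wins = {driver: 0 for driver in ranks_by_driver}
    for a, b in combinations(ranks_by_driver, 2):
        pairs = list(zip(ranks_by_driver[a], ranks_by_driver[b]))
        a_ahead = sum(ra < rb for ra, rb in pairs)
        b_ahead = sum(rb < ra for ra, rb in pairs)
        if a_ahead > b_ahead:
            wins[a] += 1
        elif b_ahead > a_ahead:
            wins[b] += 1
    return wins

# Hypothetical ranks on three metrics (1 = best).
ranks = {
    "A": [1, 2, 1],
    "B": [2, 1, 3],
    "C": [3, 3, 2],
}

wins = pairwise_wins(ranks)
order = sorted(wins, key=wins.get, reverse=True)
print(wins, order)
```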
For all approaches, the likelihood of a driver finishing ahead of a lower ranked driver was calculated. It varied by race, but all of the best approaches averaged about 70%.
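A likelihood of this kind can be estimated for a single race by checking every pair of drivers and counting how often the one with the better (lower) score actually finished ahead. This is a sketch under that interpretation; the function name and the four-driver race are hypothetical.

```python
from itertools import combinations

def fraction_correct_pairs(scores, finishes):
    """Share of driver pairs where the better-scored driver finished ahead."""
    correct = 0
    total = 0
    for i, j in combinations(range(len(scores)), 2):
        if scores[i] == scores[j]:
            continue  # no prediction for tied scores
        total += 1
        predicted_i_ahead = scores[i] < scores[j]
        actual_i_ahead = finishes[i] < finishes[j]
        if predicted_i_ahead == actual_i_ahead:
            correct += 1
    return correct / total if total else None

# Hypothetical race: predicted scores and actual finishing positions.
scores = [5.0, 9.0, 14.0, 20.0]
finishes = [3, 15, 8, 30]

frac = fraction_correct_pairs(scores, finishes)
print(frac)
```

Averaging this fraction over all races in a season gives a single figure comparable to the roughly 70% quoted above.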
The approach of a simple average of the metrics was chosen.
With this, the performance of the best combination is very poor for the restrictor plate races, so they were split out. Best combinations for these were L18-F, YTD-DR and Practice. Correlations for the plate races improved from 0.243 to 0.318, and finish standard deviation went from 11.5 to 11.2. This is still poor performance.
When this was combined with the non-plate races and their best combinations, the final correlation of score with finish is 0.554, and the standard deviation is 9.29.
Corresponding ACCUPREDICT results were 0.535 and 9.53.
The method proposed is somewhat better than the approach used in 2011.
These six performance measures, or whatever exist for each driver, are averaged, and the resulting score gives the expected finishing position by a simple curve fit to the 2011 data.
For restrictor plate races, the average of the last 15 races, year-to-date Driver Ratings, and the practice rankings are used.
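The split between plate and non-plate races can be sketched as a choice of which metrics go into the average. The dictionary keys below are hypothetical names for the three plate-race measures just listed; the non-plate list is passed in by the caller, and `same_track` is only a placeholder.

```python
# Hypothetical keys for the three restrictor plate measures.
PLATE_METRICS = ("last15_finish", "ytd_driver_rating", "practice")

def race_score(transformed, is_plate_race, non_plate_metrics):
    """Average whichever of the selected transformed measures exist."""
    names = PLATE_METRICS if is_plate_race else non_plate_metrics
    values = [transformed[n] for n in names if transformed.get(n) is not None]
    return sum(values) / len(values) if values else None

# Hypothetical transformed measures (expected finishes) for one driver.
driver = {
    "last15_finish": 12.0,
    "ytd_driver_rating": 10.0,
    "practice": 17.0,
    "same_track": 6.0,
}

plate_score = race_score(driver, True, ())
other_score = race_score(driver, False, ("last15_finish", "same_track"))
print(plate_score, other_score)
```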
It is noted that the 2012 proposed metrics are slightly different from the 2011 season metrics used to select combinations.
The last 15 races are used, and the track groupings into related track types have been revised. The number of races averaged for 2012 for the same-type and same-track metrics is eight, rather than the variable 2 to 14 races used in 2011. The rationale is that the numbers and groupings were changed to improve correlations, and the revised metrics measure very similar information.
This new approach was applied to five sample races from 2011, one from each track type, and correlations improved on average from 0.470 to 0.502.
Cliff DeJong (pronounced De Young), the man behind AccuPredict, is a research scientist who has been crunching numbers his entire life. An avid NASCAR fan, Cliff was introduced to fantasy NASCAR by his brother (who beat him at just about everything).
Cliff put his Carnegie Mellon Computer Science degree and Iowa State University Mathematics degree to use creating successful methods to predict each Cup race based on NASCAR statistics.
It is an obsession that has consumed untold hours.
Cliff would love to hear your comments, questions and suggestions at accupredict@gmail.com