Jan 11, 2012 by Cliff DeJong

**NOTE:** This article was written at the start of the 2012 NASCAR season to provide a better understanding of the algorithm used in our AccuPredict - NASCAR Driver Finish Predictions tool. The basis of the formula remains as described in this article, although some tweaks to the weighting of the formulas are made by Cliff after each NASCAR season.

Thus far, we have looked at performance in past races in a number of ways:

- Past finish positions of all recent races.
- Races at the same track.
- Races at similar tracks.

We also made some **new discoveries**:

- A revision of track types from those with *similar physical characteristics* to those with *similar statistics* was assessed and found to be beneficial.
- Practice speeds at the *next to last* practice are useful.
- Starting position (or qualifying results) is also valuable.

Each season is a new start for drivers and their teams, and may bring a new crew chief or **even a new team** for some drivers.

I have observed that each year has drivers who seem to do **consistently better** or consistently worse than their past results would suggest.

As a consequence, **another measure** that I have found to be useful is the year-to-date performance.

I do not use year-to-date statistics until after four races have been run.

I have used point standings and Driver Ratings in past years, and found that **driver rankings based on Driver Ratings** are a little better to use.

I have not done a formal analysis, but at the end of the 2010 season found that the **Driver Rating was correlated to average finish** with a value of 0.93, while the correlation of points (after taking out the Chase points adjustment) to average finish was 0.76.

These are correlations to the **final values** for the entire 2010 season and cannot be compared to other correlation measures in this report that are for single races.

Driver Rating, **as defined by NASCAR loop data**, combines several measures of driver performance, including green flag passes, green flag times passed, fast laps, laps led, and more.
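The kind of correlation quoted above is a standard Pearson coefficient; here is a minimal sketch with made-up numbers (not the article's data), using NumPy:

```python
import numpy as np

# Hypothetical end-of-season values for five drivers (illustrative only).
driver_rating = np.array([103.2, 95.1, 88.7, 80.3, 72.5])
avg_finish    = np.array([  6.1, 10.4, 14.2, 18.9, 24.0])

# Pearson correlation between rating and average finish.
# Rating is "higher is better" while finish is "lower is better",
# so the raw coefficient is negative; the article quotes magnitudes.
r = np.corrcoef(driver_rating, avg_finish)[0, 1]
print(round(abs(r), 3))
```

The sign flip is worth remembering whenever a "higher is better" metric is compared to finishing position.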

I collected several of these measures during the 2011 season for assessment and will show these in the next section.

For the 2011 season, a number of performance measures were collected for each race. Figure 16 shows those measures and their correlations with finishing position.

Here are the definitions of each measure:

- **L18-F**: Average of the finishing position of the last 18 races
- **L4-DR**: Ranking of average Driver Rating for the last 4 races
- **YTD-DR**: Ranking of average Driver Rating for the year to date (not used in the first four races)
- **SType-F**: Average finish position for races at the same type of track (this uses the traditional track groupings, before the revisions above, and averages over 4-12 races for different tracks)
- **SType-DR**: Average Driver Rating for races at the same type of track, as above
- **SType-4F**: Average finish position for the last four races at the same type of track
- **STrack-F**: Average finish position for races at the same track, over 2-11 races
- **STrack-DR**: Ranking of average Driver Ratings at the same track, over 2-11 races
- **STrack-Pwr**: Ranking of average Driver Ratings at the same track over the past two years
- **Start**: Start position, defined as qualifying results
- **Practice**: Ranking of fast speeds in the next-to-last practice
- **Bonus Points**: Average of bonus points earned
- **Pass Dif**: Average of green flag passes, less green flag times passed, over the last 2-11 races
- **Laps Led**: Number of laps led at the same track, averaged over the last 2-11 races
- **Fast Laps**: Number of fast laps at the same track, averaged over the last 2-11 races

These metrics were developed as possibly useful in unpublished analyses of past seasons. I offer no rationale for their selection; they **have evolved** over time.

Some of the measures are much better than others; the **average finish over the last 18 races** is the best, with the strongest correlation to finishing position.

Given these 15 measures of driver performance, **how can they be combined** to give the best estimate of finish position for each driver and race?

This is far from an obvious question, because **all of these are heavily correlated to each other**; that is, a driver who has finished well in the last 18 races has likely also qualified well, practiced fast, and finished well at the same and similar tracks.

An additional complication is that **not all measures are always available** for each driver; Trevor Bayne, for example, had no prior Sprint Cup history at Daytona before the 2011 season opening race.

The desire is to find a **simple method** for combining selected measures.

First, all measures must be transformed mathematically so that a small number will indicate a likely good finish. The easy way to do this is to fit the various measures to the average finish and then use the curve fit data to represent the measure in question.
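The transformation step can be sketched as follows, assuming a simple linear fit for each measure (the article does not specify the form of the curve fit, so a straight line stands in here; the data is illustrative):

```python
import numpy as np

def transform(measure, finish):
    """Fit finish ~ a*measure + b, then return the fitted finishes.

    After this, every measure is on the same scale: a small
    transformed value predicts a good (low) finishing position.
    """
    a, b = np.polyfit(measure, finish, deg=1)
    return a * measure + b

# Illustrative data: a "higher is better" metric such as Driver Rating.
rating = np.array([100.0, 90.0, 80.0, 70.0, 60.0])
finish = np.array([5.0, 9.0, 16.0, 20.0, 25.0])

t = transform(rating, finish)
print(np.round(t, 1))  # small transformed values now mean good finishes
```

Because the fit maps each measure onto the finish-position scale, differently scaled metrics (ratings, speeds, ranks) become directly comparable.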

*A score defined as a simple average of all the transformed measures gives a correlation of 0.538 to the actual finishes.*

The standard deviation of the estimated finishes based on the simple average for 2011 is 9.45.

A large number of variations on combinations of the measures were examined, and **the best approach** was to average:

- L18-F
- YTD-DR
- STY-F
- STR-DR
- Start
- Practice

This gave a correlation for the score to actual finish of 0.550 and a standard deviation of 9.36 for the estimated finishes.
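The averaging step, including the handling of missing measures mentioned earlier, might look like the sketch below; the metric values are illustrative, not real data:

```python
import numpy as np

# Transformed metrics for one driver; np.nan marks a missing metric.
# Names follow the article's best six-measure combination.
metrics = {
    "L18-F":     12.3,
    "YTD-DR":    10.8,
    "SType-F":   14.0,
    "STrack-DR": np.nan,  # e.g. a rookie with no prior races at this track
    "Start":      9.0,
    "Practice":  11.5,
}

# Average whatever measures exist for the driver; nanmean skips the gaps.
score = np.nanmean(list(metrics.values()))
print(round(score, 2))
```

Skipping missing values, rather than imputing them, keeps the score defined even for drivers like Trevor Bayne with no track history.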

To determine the best possible fit **of the metrics to the finishing positions**, a multiple regression was calculated, using all 15 metrics as inputs. Regression, strictly speaking, is only valid for independent variables, and these are not independent. Still, in practice, regression can be very useful even here.

For data points without all 15 metrics, the simple averages of the best combinations in the previous paragraph were used. This gave a correlation of 0.559 and a standard deviation of 9.29.
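The regression itself can be sketched with ordinary least squares; the data below is randomly generated as a stand-in for the 15 metrics and the finishing positions:

```python
import numpy as np

rng = np.random.default_rng(0)

n_obs, n_metrics = 200, 15
X = rng.normal(size=(n_obs, n_metrics))        # stand-in for the 15 metrics
true_w = rng.normal(size=n_metrics)            # unknown "true" weights
finish = X @ true_w + rng.normal(scale=1.0, size=n_obs)

# Add an intercept column and solve the least-squares problem.
A = np.column_stack([np.ones(n_obs), X])
coef, *_ = np.linalg.lstsq(A, finish, rcond=None)

# Correlation of the regression's predictions with the outcome.
pred = A @ coef
r = np.corrcoef(pred, finish)[0, 1]
print(round(r, 2))
```

Note that an in-sample fit like this is exactly the "highly tuned to the 2011 data" concern raised below: the fitted coefficients will not transfer cleanly to a new season.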

The **disadvantage of this approach**, however, is its complexity and the fact that the regression is highly tuned to the data in the 2011 season.

**The regression for 2012 data will almost certainly be different.**

Plus, the 0.559 is **not dramatically better** than the 0.550 found by experiment.

There are other approaches to maximize the correlation of a combination of correlated variables.

For statistics geeks, I tried Principal Component Analysis and a method in a paper by Keller and Olkin. The required assumptions are only partially met, and results were a correlation of 0.550-0.551, not quite as good as the regression results.

These have also been tried in earlier seasons, with similar results, and have the same drawbacks as the regression method.

**Another interesting approach** that I tried was to look at each driver's rank on each metric, averaging the ranks rather than the metric values themselves.

This was **only slightly different** from the averages of the metrics, and performance was slightly worse.
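The rank-based variant can be sketched by converting each metric column to within-race ranks and averaging those; the numbers below are illustrative, not the article's bookkeeping:

```python
import numpy as np

def ranks(column):
    """Rank a column so the best (smallest) value gets rank 1."""
    order = np.argsort(column)
    r = np.empty_like(order)
    r[order] = np.arange(1, len(column) + 1)
    return r

# Rows = drivers, columns = transformed metrics (small is good).
metrics = np.array([
    [ 4.0,  6.0,  3.0],
    [10.0,  8.0, 12.0],
    [ 7.0,  5.0,  9.0],
])

# Rank each metric column, then average the ranks per driver.
rank_matrix = np.column_stack(
    [ranks(metrics[:, j]) for j in range(metrics.shape[1])]
)
score = rank_matrix.mean(axis=1)
print(score)
```

Ranking discards the size of the gaps between drivers, which is one plausible reason it performed slightly worse than averaging the transformed values.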

For all approaches, the likelihood of a driver finishing ahead of a lower ranked driver was calculated. It varied by race, but all of the best approaches averaged about 70%.
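The "likelihood of a driver finishing ahead of a lower ranked driver" can be computed as the fraction of correctly ordered driver pairs in a race; a minimal sketch with illustrative scores and finishes:

```python
import numpy as np
from itertools import combinations

# Illustrative predicted scores (small = predicted good finish)
# and actual finishing positions for six drivers.
score  = np.array([ 3.0,  5.0,  8.0, 11.0, 14.0, 20.0])
finish = np.array([ 2,    6,    4,   10,   15,   12  ])

pairs = list(combinations(range(len(score)), 2))
# A pair is "correct" when the better-scored driver also finished better.
correct = sum((score[i] < score[j]) == (finish[i] < finish[j])
              for i, j in pairs)
print(correct / len(pairs))
```

Averaged over the pairs in each race, this is the roughly 70% figure the article reports for the best approaches.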

*The approach of a simple average of the metrics was chosen.*

Even with this approach, the performance of the best combination is very poor for the restrictor plate races, so they were split out. The best combination for these was L18-F, YTD-DR, and Practice. Correlations for the plate races improved from 0.243 to 0.318, and the finish standard deviation went from 11.5 to 11.2. This is still poor performance.

When this was combined with the non-plate races and their best combinations, the final correlation of score with finish is 0.554, and the standard deviation is 9.29.

Corresponding AccuPredict results were 0.535 and 9.53.

The method proposed is somewhat better than the approach used in 2011.

- Each race, the *top 35 drivers in points* are identified.
- Their *finishes in the last 15 races* are averaged.
- Their *performance in year-to-date driver ratings* is ranked.
- Similar track types are identified, using the revised definitions in a previous section, and *finishing position is averaged over the last eight races on those tracks*.
- The *average driver ratings at the last eight races at the same track* are ranked.
- *Practice speeds* at the next to last practice are ranked.
- *Start position* is used.

These six performance measures, or whatever exist for each driver, are averaged, and the resulting score gives the expected finishing position by a simple curve fit to the 2011 data.

For restrictor plate races, the **average of the last 15 races**, the year-to-date driver rating ranking, and practice speeds are used.

It is noted that the **2012 proposed metrics are slightly different from the 2011 season metrics** used to select combinations.

The last 15 races are used, and the groupings of related track types have been revised. The number of races averaged for 2012 for same-type and same-track metrics is eight, rather than the variable 2-14 races in 2011. The rationale for this is that the **numbers and groupings have been changed to improve correlations**, and they are expected to be better predictors going forward.

This new approach was applied to five sample races from 2011, **one from each track type**, and correlations improved on average from 0.470 to 0.502.

Cliff DeJong (pronounced *De Young*), the man behind AccuPredict, is a research scientist who has been crunching numbers his entire life. An avid NASCAR fan, Cliff was introduced to fantasy NASCAR by his brother (who beat him at just about everything).

Cliff put his *Carnegie Mellon* Computer Science degree and *Iowa State University* Mathematics degree to use creating successful methods to predict each Cup race based on NASCAR statistics.

It is an obsession that has consumed untold hours.

Cliff would love to hear your comments, questions and suggestions at moc.liamg@tciderpucca

Download this article in its original PDF (1.25MB) format for free!