College Football Regression Analysis

Let me preface this by saying I've been a longtime observer of this board. I hold three degrees from Tech: a BSEng, a BSECON, and an MSECON. I do a lot of industry work in forecasting and have written several papers on time series data. I see three glaring problems with this model:

1) The poster already mentioned that he needs the current game stats to generate a prediction.

Even if you had this information, the model doesn't serve as a good predictor and here's why. (Note I can safely say this even without seeing your model specification.)

2) Overfitting - The model likely includes a multitude of game stats that have no real relationship with the dependent variable, but the sheer number of regressors will give the appearance of a good fit. You should report your t-statistics and run a joint significance test to see whether your independent variables are truly significant. Also, the degrees-of-freedom-adjusted R^2 is more useful here than the plain R^2, since it penalizes you for the df that all those regressors use up. In addition, you should report your correlation matrix; I'd be willing to bet that the model suffers badly from multicollinearity, given how highly correlated certain groups of game stats are. (A quick sketch of these checks follows this list.)

3) No out-of-sample validation - I can tell this from your prediction results. You're estimating your coefficients from the same in-sample data you then "predict," which usually shows good explanatory power but says nothing about predictive value. You should go back, re-estimate the model on one part of the sample, and report your predictions against the games you held out. I'd be willing to bet that the model has virtually no predictive value at all.
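Here is roughly what the checks in point 2 look like, assuming (purely for illustration) that the game stats sat in a pandas DataFrame called games with a points column, and using Python/statsmodels rather than whatever the model was actually built in -- the column names below are invented:

import statsmodels.api as sm

stat_cols = ["rush_yds", "pass_yds", "turnovers", "third_down_pct"]   # hypothetical names
X = sm.add_constant(games[stat_cols])             # regressors plus an intercept
fit = sm.OLS(games["points"], X).fit()

print(fit.summary())                              # t-statistics and p-values per regressor
print(fit.f_pvalue)                               # joint significance of the whole model
print(fit.rsquared, fit.rsquared_adj)             # plain vs. df-adjusted R^2
print(games[stat_cols].corr())                    # correlation matrix; large off-diagonal
                                                  # entries flag multicollinearity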

Another critique: it would be more natural to estimate the model using lags of the independent variables. That will tell you much more about the model's value as a predictor.

In any case, I admire your effort. It probably took quite a bit of time to run this analysis and get the data.
 
Not so fast guys....

All games that the top 10 teams have played are in there except for about 3 teams. I did leave out any games against FCS teams except GT vs Jack St.

Before you get too excited, remember that this says I can predict the score IF I KNOW WHAT THE OUTCOME ON THE INPUT X VARIABLES IS GOING TO BE. Of course, I would have to accurately predict the rush yards, turnovers, pass yards, TOP, and 3rd down conversion % to actually pick the score in advance rather than just looking backwards. (Insert big letdown here.) But it's still pretty neat to think that if you know those parameters, you can predict the score.

I will take the averages for Wake and Tech on the inputs and tell you what the predicted score is.

Actually, if you know those parameters then you already know the score.....not as neat. I'm just sayin', as Beej would say.

Anyway, if you wanted to do this right, and your goal is to predict scores, you could do it a few different ways. If your goal is to explain the factors that led to the scores, then you can do what you're doing - just give us the model equation rather than the predicted scores. That way we could see what factors contribute the most to the scores.

If you want to predict, you need to split the games up into a group that you use to create the model, and a group that you use to score the model for accuracy (actual vs. predicted scores). Your accuracy will go down, unfortunately.

By the way, your R^2 doesn't mean much. As you add more variables the number goes up. At work I've got a retail inflation model with an R^2 of .98, but it does a crappy job of telling me what inflation will be next month.
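A five-minute illustration of that R^2 point, with pure noise rather than anyone's actual data (Python/sklearn, made-up sample sizes): regress a random y on more and more random columns and watch R^2 climb anyway.

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = 60                                    # roughly a season's worth of games
y = rng.normal(size=n)                    # "scores" that are pure noise
for k in (1, 5, 20, 40):
    X = rng.normal(size=(n, k))           # k junk regressors, also pure noise
    r2 = LinearRegression().fit(X, y).score(X, y)
    print(k, round(r2, 2))                # in-sample R^2 keeps climbing as k grows

An adjusted R^2 or a score on held-out games would not be fooled nearly as badly.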
 
Good points, tonisama. #2 might be taken care of, and as I mentioned earlier, I would be interested in seeing the predictive value of this (per #3). But I have a question for you:

"1) The poster already mentioned that he needs the current game stats to generate a prediction. Even if you had this information, the model doesn't serve as a good predictor."

It sounds like you are saying that if he had the box score stats for the current game (yards, turnovers, etc.), he wouldn't be able to predict the final score? Am I reading you right? Because the numbers rw1 posted were predicting the final score from current game stats and other game stats.

Another question would be whether he used data from future games.
 
Wow can't believe I'm getting peer reviewed for playing with data on Stingtalk!

I am using the LINEST function in OpenOffice, which unfortunately doesn't give the test statistics and critical values the way Excel does. Same with the adjusted R^2.

But I don't understand your comments about "in-sample" and "out-of-sample" data. Can you elaborate? And TechPhi, I've done plenty of models much larger than this one, and bigger (more data points or more independent variables) does not necessarily mean a higher R^2.
 
"It sounds like you are saying that if he had the box score stats for the current game (yards, turnovers, etc.), he wouldn't be able to predict the final score? Am I reading you right? Because the numbers rw1 posted were predicting the final score from current game stats and other game stats. Another question would be whether he used data from future games."

Yes, he would be able to predict the final score using the estimated regression coefficients (which I believe he already did), and it may even be indicative of how the actual game would turn out (by sheer chance), but the more games he tried to predict, the closer the score predictions would come to an essentially random process.
 
"But I don't understand your comments about 'in-sample' and 'out-of-sample' data. Can you elaborate?"

You should divide your sample into two subsets. You should estimate your regression model using the first subset, and then use the second subset to generate your predictions. This avoids the bias of estimating your coefficients and generating your predictions from the same set of data, which often leads to very little apparent error but a model that in actuality has no predictive value.

Before you do this, though, you must change your model specification to use lagged values of your independent variables as the regressors. In real life we don't have access to the current game's data when we make our predictions (as you mentioned, we don't get the game stats until the game ends and we already know the score), so historical stats have to be the basis for the score prediction.
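A minimal sketch of both steps, assuming the game data sat in a pandas DataFrame called games (one row per team-game, with a date, team, and points column) and using Python/statsmodels instead of the original spreadsheet -- the stat column names are invented:

import pandas as pd
import statsmodels.api as sm

stat_cols = ["rush_yds", "pass_yds", "turnovers", "third_down_pct"]   # invented names
games = games.sort_values("date")

# Lag the regressors: each game is "predicted" from that team's previous game's stats.
lagged = games.groupby("team")[stat_cols].shift(1)
data = pd.concat([games[["points"]], lagged], axis=1).dropna()

# Estimate on the first subset, then predict the held-out second subset.
split = int(len(data) * 0.7)
train, test = data.iloc[:split], data.iloc[split:]
fit = sm.OLS(train["points"], sm.add_constant(train[stat_cols])).fit()
pred = fit.predict(sm.add_constant(test[stat_cols]))
print((pred - test["points"]).abs().mean())       # out-of-sample error is the number that matters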
 
"Wow can't believe I'm getting peer reviewed for playing with data on Stingtalk!"

I am not surprised at all that you are getting peer reviewed here. I already have a best paper award this year, and I am always interested in interesting paper ideas, so I would like to see the predictive value of what you have. :)
 
In order to predict, you can't use the actual values from that game, since that game hasn't happened yet. You've got to just go off of 2009 season averages/totals that do not include the game you are predicting.
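One way to build those entering-the-game averages, reusing the hypothetical games DataFrame from the earlier sketch: an expanding mean shifted back one row, so the game being predicted never feeds its own inputs.

cols = ["rush_yds", "pass_yds", "turnovers"]      # made-up column names again
pregame = (games.sort_values("date")
                .groupby("team")[cols]
                .transform(lambda s: s.expanding().mean().shift(1)))
# Row i now holds that team's averages over games 1 through i-1, nothing from game i.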

I agree with tonisama; you are probably getting a lot of interaction between your independent variables. The R^2 does not tell you what to remove. You can do a correlation analysis to figure out which ones to drop, then pick them apart until you are left with only really low p-values for each variable.
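In code form, that pruning might look like this -- same hypothetical column names and train split as the earlier sketches, and a 0.05 cutoff chosen purely for illustration:

import statsmodels.api as sm

keep = ["rush_yds", "pass_yds", "turnovers", "third_down_pct"]   # hypothetical candidates
while keep:
    fit = sm.OLS(train["points"], sm.add_constant(train[keep])).fit()
    pvals = fit.pvalues.drop("const")
    if pvals.max() < 0.05:                # everything left clears the cutoff
        break
    keep.remove(pvals.idxmax())           # drop the weakest regressor and refit
print(keep, fit.rsquared_adj)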

But you'll probably just end up with Sagarin's model, which takes in wins, losses, SOS, home/away, and margin of victory.
 
Yeah, when I had Excel I would remove the independent variables that were not statistically significant, but I don't know how to do that with just LINEST. Anyone care to fund my research and buy me a copy of Microsoft Office? Statgraphics, Minitab, or the other stat packages would do too, but I'd get more value from Excel and Access.
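(For what it's worth, unless OpenOffice's LINEST differs from Excel's more than I remember, setting the last argument - the stats flag - to TRUE returns most of the raw material: row 1 of the output array holds the coefficients, row 2 their standard errors, row 3 R^2 and the standard error of the estimate, and row 4 the F statistic and residual degrees of freedom. From there the missing pieces are just arithmetic:

t-statistic for each coefficient:  t = coefficient / its standard error
adjusted R^2 with n observations and k regressors:  adj R^2 = 1 - (1 - R^2) * (n - 1) / (n - k - 1)

A regressor whose |t| is much below about 2 is a candidate to drop.)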

"In order to predict, you can't use historical values for that game, since that game hasn't happened yet. You've got to just go off of 2009 season averages/totals that do not include the game you are predicting."

I think you can do better than just take the averages. For example, a primarily running team may lose their star tailback to injury. In that case it would be reasonable to hedge some on their rushing yards. That's where some judgment comes into play.

I understand that using a macro model to predict every team's score introduces more error. For example, it would be better to have a model based only on that team's statistics, but there simply aren't enough data points to do that in football. Wake only averages 24 ppg, while the model says that, based on their average stats, they should average 29. That's why they aren't a top 10 team! My model is based on games involving top 10 teams, so there's a representation issue there as well. But even with all these known problems, it's very interesting how closely it would have predicted GT's scores.

I will use it to predict the scores in the prediction contest, and let's see how accurate it is going forward.
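For anyone curious, the prediction step itself is just plugging the averaged inputs into the fitted equation. A sketch using the hypothetical fit from the earlier Python example, with made-up numbers that are not Wake's real averages:

import pandas as pd

wake_inputs = pd.DataFrame([{"const": 1.0,          # intercept term, matching the fit
                             "rush_yds": 150,        # invented season averages
                             "pass_yds": 230,
                             "turnovers": 1.5,
                             "third_down_pct": 0.40}])
print(fit.predict(wake_inputs))                      # predicted points for that stat line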
 
"You should divide your sample into two subsets. You should estimate your regression model using the first subset, and then use the second subset to generate your predictions."

That's what I was trying to say earlier when referencing global warming. :)
 