Statistical Modeling, Causal Inference, and Social Science: Cross-validation vs data splitting

« E.J. Dionne's column summarizes our red-blue paper | Main | Posterior distributions vs bootstrap distributions »

March 16, 2006

Cross-validation vs data splitting

In Bayesian statistics one often has to justify the prior used. While picking the prior may be seen as a subjective preference, if a different person disagrees with the prior, the conclusions would not be acceptable. For the results of the analysis to be acceptable, a large number of people would need to agree with the prior. Priors are not particularly easy to understand and evaluate, especially when the models become complex.

On the other hand, the fundamental idea of data splitting is very intuitive: we hide one part of the data, learn on the rest, and then check our "knowledge" on what was hidden. There is a hidden parameter here: how much we hide and how much we show. Traditionally, "cross-validation" meant the same as is currently meant by "leave-one-out": we hide a single case, model from all the others, and then compute the error of the model on that hidden case.

As to remove the influence of a particular choice of what to show and what to hide, we ideally average over all show/hide partitions of the data. The contemporary understanding of "k-fold cross-validation" is different from the traditional one. We split the cases at random into k groups, so that each group has approximately equal size. We then build k models, each time omitting one of the groups. We evaluate each model on the group that was omitted. For n cases, n-fold cross-validation would correspond to leave-one-out. But with a smaller k, we can perform the evaluation much more efficiently.

When doing cross-validation, there is the danger of affecting the error estimates with an arbitrary assignment to groups. In fact, one would wonder how does k-fold cross-validation compare to repeatedly splitting 1/k of the data into the hidden set and (k-1)/k of the data into the shown set.

As to compare cross-validation with random splitting, we did a small experiment, on a medical dataset with 286 cases. We built a logistic regression on the shown data and then computed the Brier score of a logistic regression model on the hidden data. The partitions were generated in two ways, using data splitting and using cross-validation. The image below shows that 10-fold cross-validation converges quite a bit faster to the same value as does repeated data splitting. Still, more than 20 replications of 10-fold cross-validation are needed for the Brier score estimate to become properly stabilized. Although it is hard to see, the resolution of the red line is 10 times lower than that of the green one, as cross-validation provides an assessment of the score after completing the 10 folds.

The implication is that the dependence of folds within a cross-validation experiment is a good thing if one tries to quickly assess the score. This dependence was seen as a problem before.

[Changes: the new picture does not show intermediate scores within each cross-validation, only the average score across all the cross-validation experiments.]

Cross-validation has been discussed earlier on this blog.

Posted by Aleks at March 16, 2006 02:21 PM

Trackback Pings

TrackBack URL for this entry:
http://www.stat.columbia.edu/~cook/movabletype/mt-tb.cgi/364

Comments

One possibly relevant paper on this is Dieterrich's Statistical Tests for Comparing Supervised Classification Learning Algorithms (1996). I don't think it's the last word, and it's not Bayesian at all, but it does have a number of experiments comparing different methodologies. More than one reviewer has suggested using the procedures suggested there.

Posted by: rif at March 16, 2006 05:53 PM

"When doing cross-validation, there is the danger of affecting the error estimates with an arbitrary assignment to groups."

Important question is whether this additional variability due to arbitrary assignment to groups is significant compared to other uncertainties. If cross-validation is used to estimate the performance of the model for predicting the future, the uncertainty due to not knowing the exact future data distribution dominates. See, e.g., figures 3 and 6 and in our paper, in which the variability due to arbitrary assignment is labeled as variability due to having different training sets. Thus, using several cross-validation replications does not significantly reduce the overall variability of the error estimate.

Posted by: Aki Vehtari at March 17, 2006 01:54 AM

Generally, doing several replications of CV will probably increase the estimated variability of the error, but reduce the variability in the estimate of the mean error. The only reason for not doing several replications would be to save time.

Dieterrich's paper has been very widely read: yet it's disappointing that people are not following his recommendations more.

Posted by: Aleks at March 17, 2006 10:32 AM

One possibly relevant paper on this is Dieterrich's Statistical Tests for Comparing Supervised Classification Learning Algorithms (1996)

Note that Dieterrich is comparing algorithms, not models. This difference is briefly discussed in section 1.3 in our paper.

Posted by: Aki Vehtari at March 19, 2006 02:59 PM

Statistical Modeling, Causal Inference, and Social Science

March 16, 2006

Cross-validation vs data splitting

Trackback Pings

Comments

Post a comment