Why use cross-validation?
I am entering several Kaggle machine learning competitions at the moment and I have a quick question. Why do we use cross-validation to assess our algorithm's effectiveness in these competitions?
Surely in these competitions your score on the public leaderboard, where your algorithm is tested against actual live data, would give you a more accurate picture of your algorithm's efficacy?
Cross-validation is a necessary step in model construction. If cross-validation gives you poor results, there is no point in even trying the model on live data. The set on which you are training and validating is also live data, isn't it? So the results should be similar. Without validating your model, you have no insight into how it performs on data it has not seen. A model that gives 100% accuracy on the training set can give essentially random results on a validation set.
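To make the last point concrete, here is a minimal sketch in plain Python (standard library only). Everything in it is an assumption chosen for illustration: the data is synthetic noise with coin-flip labels, the model is a 1-nearest-neighbour classifier, and the helper names (predict_1nn, accuracy, cross_val_accuracy) are made up for this example. Because there is nothing real to learn, the model can only memorise the training rows.

```python
import random

def predict_1nn(train, x):
    """Return the label of the training point closest to x (1-nearest neighbour)."""
    nearest = min(train, key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], x)))
    return nearest[1]

def accuracy(train, test):
    """Fraction of test rows whose label the 1-NN model predicts correctly."""
    hits = sum(predict_1nn(train, x) == y for x, y in test)
    return hits / len(test)

def cross_val_accuracy(data, k=5):
    """Plain k-fold cross-validation: each fold is held out once for scoring."""
    folds = [data[i::k] for i in range(k)]
    scores = []
    for i in range(k):
        held_out = folds[i]
        train = [row for j, fold in enumerate(folds) if j != i for row in fold]
        scores.append(accuracy(train, held_out))
    return sum(scores) / k

rng = random.Random(0)
# Synthetic data (an assumption for this sketch): features are pure noise
# and labels are coin flips, so there is no real pattern to learn.
data = [([rng.random(), rng.random()], rng.randint(0, 1)) for _ in range(200)]

train_acc = accuracy(data, data)    # scored on the same rows the model memorised
cv_acc = cross_val_accuracy(data)   # scored only on rows the model never saw

print(f"training accuracy:  {train_acc:.2f}")
print(f"5-fold CV accuracy: {cv_acc:.2f}")
```

The training accuracy comes out perfect because every point is its own nearest neighbour, while the cross-validated accuracy should land near chance (about 0.5). Only the second number tells you anything about how the model would behave on new data.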
To reiterate: cross-validation is not a replacement for a test on live data; it is part of the model construction process.