Model Selection

What is Model Selection?

Model selection is the process of choosing the best model for a particular business problem from all the available models, on the basis of different criteria such as robustness and model complexity.

Model Selection vs Variable Selection

Variable selection is the process of choosing the best subset of predictors for a given problem and predictive model, while model selection picks one specific model from the list of available predictive models for a given business problem.

The best set of variables might change when the predictive model changes, because different types of predictive modeling algorithms work differently. The same set of features might yield very different results with different predictive models.

What Should a Model be Selected By?

It is rarely possible to identify the best model for a given problem without experimenting with different models, though it is often possible to predict the best type of model for a given problem. For example, if you are modeling a natural language processing problem, it is highly likely that deep learning based predictive models will perform far better than purely statistical models. The idea here is to select a model that suits your purpose against different criteria such as performance, robustness, and complexity, rather than to search for the single best model.

Different models perform differently depending on factors such as the type of data available, the noise in the data, and the type of predictive problem. The larger contextual environment around the model also needs to be considered when selecting a model: a model that is perfect from a computational perspective but is not explainable or deployable is not necessarily "good enough." These different factors need to be balanced for a model to be successful. The nuances of when a model is "good enough" can be difficult to determine, but it is important to weigh all of these factors in order to make the best decision possible.

What are Techniques for Model Selection?

Model selection techniques can be broadly classified into probabilistic measures and resampling methods. Probabilistic measures statistically score candidate models using their performance on the training dataset. Resampling methods estimate the performance of a model using a held-out test dataset. Here are some of the commonly used techniques for model selection:

  • Random train/test split: This is a resampling method in which the model is evaluated on how well it generalizes and predicts on an unseen set of data. The data is split into a train set and a test set, with data points sampled without replacement. The model that performs best on the test set is selected as the best model. (See the sketch after this list.)
  • Cross validation: A very popular resampling method for model selection. Candidate models are trained and evaluated on multiple resampled train and test sets that are mutually exclusive, with data points sampled without replacement. Model performance is averaged across these iterations to estimate overall performance. Examples include k-fold cross validation and leave-one-out cross validation.
  • Bootstrap: This is also a resampling method, and can be performed like a random train/test split or cross validation. The only difference is that the data points are sampled with replacement.
  • AIC (Akaike Information Criterion): A probabilistic measure that estimates model performance on unseen data, computed as AIC = 2k - 2 ln(L̂), where k is the number of estimated parameters and L̂ is the model's maximized likelihood. It is not an absolute score and can only be used to compare the AIC scores of models fit to the same dataset. The model with the lowest AIC score is chosen as the best model. (See the second sketch below.)
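As a concrete illustration of the resampling methods above, here is a minimal Python sketch using scikit-learn. The synthetic dataset, the two candidate models, and the accuracy metric are placeholder choices for the example, not prescriptions.

    # Minimal sketch: comparing candidate models with three resampling methods.
    # The synthetic dataset, candidate models, and metric are illustrative assumptions.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import accuracy_score
    from sklearn.model_selection import cross_val_score, train_test_split

    X, y = make_classification(n_samples=500, random_state=0)
    candidates = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "random_forest": RandomForestClassifier(random_state=0),
    }

    # 1. Random train/test split: points assigned to each set without replacement.
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    for name, model in candidates.items():
        model.fit(X_tr, y_tr)
        print(name, "holdout accuracy:", accuracy_score(y_te, model.predict(X_te)))

    # 2. K-fold cross validation: average accuracy over 5 disjoint test folds.
    for name, model in candidates.items():
        print(name, "5-fold CV accuracy:", cross_val_score(model, X, y, cv=5).mean())

    # 3. Bootstrap: train on a sample drawn WITH replacement and evaluate on
    #    the out-of-bag points that the sample happened to leave out.
    rng = np.random.RandomState(0)
    idx = rng.choice(len(X), size=len(X), replace=True)
    oob = np.setdiff1d(np.arange(len(X)), idx)
    for name, model in candidates.items():
        model.fit(X[idx], y[idx])
        print(name, "bootstrap OOB accuracy:",
              accuracy_score(y[oob], model.predict(X[oob])))

Whichever candidate scores consistently best across such evaluations is the one selected.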
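For AIC, the sketch below uses statsmodels, which reports AIC for fitted linear models; the synthetic data-generating process and the predictor subsets being compared are made up for the illustration.

    # Minimal sketch: comparing nested linear models by AIC (lower is better).
    # The data-generating process and predictor subsets are assumptions.
    import numpy as np
    import statsmodels.api as sm

    rng = np.random.RandomState(0)
    X = rng.normal(size=(200, 3))
    y = 1.5 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(size=200)  # column 2 is pure noise

    # AIC = 2k - 2 ln(L_hat); scores are only comparable across models
    # fit to the same dataset.
    for cols in ([0], [0, 1], [0, 1, 2]):
        design = sm.add_constant(X[:, cols])
        fit = sm.OLS(y, design).fit()
        print("predictors", cols, "AIC:", round(fit.aic, 2))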

Additional Resources