Why does a model underfit?

Models can underfit if they are not trained for long enough, have poorly chosen hyperparameters, or lack a sufficient number of training samples, leading to too much uncertainty. However, the most common reason is excessive model bias, where the model is inherently too simple to capture the complexity of the data (e.g., using linear regression on a non-linear dataset).

What happens if an underfit model is used in production?

An underfit model will perform poorly in production, producing inaccurate predictions that can lead to customer dissatisfaction or misguided business decisions. Technically, it will exhibit high bias and low variance, consistently generating inaccurate predictions.

How does underfitting relate to the bias-variance tradeoff?

Underfitting and overfitting sit at opposite ends of the bias-variance tradeoff. High bias corresponds to underfitting (the model cannot capture the underlying relationship in the data), while high variance corresponds to overfitting (the model is too sensitive to noise in the training data). The goal in model building is to balance these two components to achieve the best generalization performance.

How can underfit models be addressed?

To address underfitting, consider increasing model complexity by adding more parameters or more complex features. If the underfitting is due to insufficient data, gather more training data or use techniques like transfer learning to fine-tune pre-trained models. Additionally, make sure the model is trained properly by allowing sufficient training time or performing hyperparameter sweeps. For Bayesian models, experimenting with different priors can also help.

Underfitting

What is underfitting in machine learning?

Underfitting describes a machine learning model that fails to capture the fundamental relationship within its training dataset. An example is a linear regression model applied to data with a polynomial relationship, leading to poor performance on both training and unseen data.

What is Underfitting in Machine Learning?

Underfitting describes a model that does not capture the underlying relationship in the dataset on which it’s trained. An example of underfitting would be a linear regression model trained on a dataset that exhibits a polynomial relationship between the input and output variables. Such a model will never be able to adequately capture this relationship (assuming polynomial data features are not used), so the model will underfit: it will neither perform well on the training set nor generalize well to unseen data. Underfitting is more common than you may think, especially in certain business contexts where labeled training data may be sparse.
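The linear-versus-polynomial example above can be reproduced in a few lines. The following is a minimal sketch, not a recipe from any particular library: the quadratic data generator, the noise level, and the helper names `fit_line`, `r_squared`, and `make_data` are all assumptions made for illustration. It fits a straight line to data generated from y = x² and reports R² on both the training set and a held-out set.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a*x + b (closed form)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_x

def r_squared(xs, ys, predict):
    """Coefficient of determination: 1 is a perfect fit, about 0 is no better than the mean."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - predict(x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def make_data(rng, n):
    """Hypothetical dataset: a quadratic relationship plus Gaussian noise."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [x ** 2 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(rng, 200)
x_test, y_test = make_data(rng, 200)

a, b = fit_line(x_train, y_train)

def line(x):
    return a * x + b

train_r2 = r_squared(x_train, y_train, line)
test_r2 = r_squared(x_test, y_test, line)
print(f"train R^2 = {train_r2:.3f}, test R^2 = {test_r2:.3f}")
```

The telltale sign of underfitting here is that both scores are poor. An overfit model would instead score well on the training set and badly on the held-out set.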

Why is my Model Underfitting?

Complex models such as neural networks may underfit if they are not trained for long enough or are trained with poorly chosen hyperparameters. Certain models may also underfit if they are not provided with a sufficient number of training samples. In this case, the underfitting may occur because there is too much uncertainty in the training data, leaving the model unable to discern an underlying relationship between inputs and outputs. However, by far the most common reason that models underfit is that they exhibit too much bias. For example, linear regression biases the model to learn linear relationships in data, so linear regression models will underfit non-linear datasets. Similarly, under-parameterized models, i.e. those with few parameters, are biased towards simplistic relationships that do not sufficiently model the complexity present in real-world datasets.

What Happens if Your Model Experiences Underfitting?

Specifying what will happen if you push an underfit model to production is simple. Your model will not perform well. It will produce incorrect predictions that disappoint customers or lead to unwise business decisions predicated on inaccurate information. Therefore, addressing underfitting in your models is absolutely crucial from a business perspective. From a technical standpoint, an underfit model will exhibit high bias and low variance. In layman’s terms, it will generate reliably inaccurate predictions, and while reliability is desirable, inaccuracy is certainly not. On the other hand, when addressing underfitting it’s important to not go too far in the other direction and cause your model to overfit. This leads us to the discussion of a concept called the bias-variance tradeoff.

Overfitting vs. Underfitting: The Bias-Variance Tradeoff

Overfitting is the counterpoint to underfitting; the two concepts are diametrically opposed. Tuning a model away from underfitting pushes it closer towards overfitting and vice versa. This concept is neatly encapsulated in a principle referred to as the bias-variance tradeoff. In the early days of machine learning, theorists showed that the expected squared error of a model can be broken down into a bias component and a variance component, plus an irreducible noise term that no model can remove. Bias describes the amount of underfitting in the model, the extent to which the model is inherently incapable of modeling a given dataset. Variance, on the other hand, describes the component of error due to overfitting. Overfit models are too tightly tied to a single dataset. While they may perform remarkably well on the training set, perhaps even achieving 100% predictive accuracy, they generalize poorly to unseen data. Often this happens due to too much flexibility in the model. An over-parameterized neural network may exactly fit the noise in a training set, but doing so will cause it to fail spectacularly on unseen data drawn from the same distribution. Thus, devising a model that performs well is very much a balancing act of trading off the bias and variance components of error so that neither becomes overwhelming.
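The two error components can be estimated empirically by refitting a model on many independently drawn training sets and looking at how its predictions behave at fixed test points. The sketch below is illustrative only: the quadratic target function, the noise level, and the two models (a constant predictor standing in for a rigid model, and a 1-nearest-neighbor predictor standing in for a flexible one) are assumptions chosen to make the contrast visible, not part of the original discussion.

```python
import random

def true_f(x):
    """Hypothetical ground-truth relationship."""
    return x ** 2

def sample_training_set(rng, n=30, noise_sd=1.0):
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def fit_constant(xs, ys):
    """Rigid model: always predicts the mean training target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def fit_nearest_neighbor(xs, ys):
    """Flexible model: predicts the target of the closest training point."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def bias_variance(fit, rng, test_xs, n_trials=300):
    """Estimate squared bias and variance, averaged over the test points."""
    preds = [[] for _ in test_xs]
    for _ in range(n_trials):
        model = fit(*sample_training_set(rng))
        for i, x in enumerate(test_xs):
            preds[i].append(model(x))
    bias_sq = variance = 0.0
    for x, p in zip(test_xs, preds):
        mean_p = sum(p) / len(p)
        bias_sq += (mean_p - true_f(x)) ** 2
        variance += sum((v - mean_p) ** 2 for v in p) / len(p)
    return bias_sq / len(test_xs), variance / len(test_xs)

test_xs = [i / 2 for i in range(-6, 7)]  # 13 evenly spaced points in [-3, 3]
const_bias_sq, const_var = bias_variance(fit_constant, random.Random(0), test_xs)
nn_bias_sq, nn_var = bias_variance(fit_nearest_neighbor, random.Random(0), test_xs)

print(f"constant model: bias^2 = {const_bias_sq:.2f}, variance = {const_var:.2f}")
print(f"1-NN model:     bias^2 = {nn_bias_sq:.2f}, variance = {nn_var:.2f}")
```

Under these assumptions the constant model shows the underfitting signature (large squared bias, small variance), while the nearest-neighbor model shows the reverse.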

How to Address Underfit Models

There are a few ways to address underfit models. The first step is usually to take a closer look at your training data and the modeling assumptions that you are making. Is your model complex enough to capture the underlying relationships in the data? A simple way to test this is to add more parameters to your model, or add more complex features such as polynomial combinations of existing features, and then retrain the model. Does the accuracy increase? If so, the original model was likely too limited in its predictive power, and you can proceed by either adding more features/parameters or choosing a different model entirely.
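The "add features and retrain" check described above might look like the following sketch. It assumes a hypothetical one-dimensional dataset with a quadratic relationship and uses a small hand-written least-squares solver (`solve`, `fit_polynomial`) so that it runs with the standard library alone; in practice you would more likely use your usual modeling library. The model is refit with polynomial features of degree 1, 2, and 3, and the training and validation errors are compared.

```python
import random

def solve(a, b):
    """Solve the linear system a @ w = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    w = [0.0] * n
    for r in reversed(range(n)):
        w[r] = (m[r][n] - sum(m[r][c] * w[c] for c in range(r + 1, n))) / m[r][r]
    return w

def fit_polynomial(xs, ys, degree):
    """Least-squares fit of y = w0 + w1*x + ... + wd*x^d via the normal equations."""
    feats = [[x ** p for p in range(degree + 1)] for x in xs]
    k = degree + 1
    gram = [[sum(f[i] * f[j] for f in feats) for j in range(k)] for i in range(k)]
    rhs = [sum(f[i] * y for f, y in zip(feats, ys)) for i in range(k)]
    return solve(gram, rhs)

def mse(xs, ys, w):
    """Mean squared error of the polynomial with coefficients w."""
    preds = [sum(wi * x ** p for p, wi in enumerate(w)) for x in xs]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

def make_data(rng, n):
    """Hypothetical dataset with a quadratic relationship plus noise."""
    xs = [rng.uniform(-2, 2) for _ in range(n)]
    ys = [1 + 2 * x - 1.5 * x ** 2 + rng.gauss(0, 0.5) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(rng, 150)
x_val, y_val = make_data(rng, 150)

train_mse, val_mse = {}, {}
for degree in (1, 2, 3):
    w = fit_polynomial(x_train, y_train, degree)
    train_mse[degree] = mse(x_train, y_train, w)
    val_mse[degree] = mse(x_val, y_val, w)
    print(f"degree {degree}: train MSE = {train_mse[degree]:.3f}, "
          f"validation MSE = {val_mse[degree]:.3f}")
```

A sharp drop in both training and validation error when moving from degree 1 to degree 2 suggests the simpler model was underfitting. If error stops improving at higher degrees, as it should here when going from degree 2 to 3, further complexity is not buying anything and only raises the risk of overfitting.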

If the high bias is not due to modeling assumptions, then it may be due to insufficient training data. There are a number of approaches you can take to get around this. First, you could simply gather more training data, perhaps by scraping the web, paying crowdsourced workers to label data you already have, or paying third-party data providers to license their proprietary datasets. Alternatively, you can use modeling techniques such as transfer learning to take models that have already been trained on large datasets and fine-tune them on your own data. In doing so, you’ll be able to use the prior assumptions baked into the pre-trained model while enabling it to specialize to your own data.

Finally, if neither of these approaches helps, be sure that you are training your model properly. You might need to let it train for longer or do a hyperparameter sweep to discover better hyperparameter configurations. Model training can often be as much an art as a science, and it can take some experimentation to figure out how to best fit the model to your data. If your model is Bayesian, you can also try working with a different prior, as the prior encodes your predefined assumptions about the data. Ultimately, addressing underfitting takes time and patience but is quite within reach for any capable data scientist.
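To make the "train longer or sweep hyperparameters" advice concrete, here is a small sketch of a grid sweep over learning rate and number of training steps. Everything in it is an assumption for illustration: the linear dataset, the plain batch gradient descent trainer, and the grid values. The point is only that a model with enough capacity can still underfit when it is trained too briefly or with a learning rate that is too small.

```python
import random

def train(xs, ys, learning_rate, n_steps):
    """Fit y = w*x + b with batch gradient descent on mean squared error."""
    w = b = 0.0
    n = len(xs)
    for _ in range(n_steps):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

def mse(xs, ys, w, b):
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

def make_data(rng, n):
    """Hypothetical dataset: a linear relationship plus a little noise."""
    xs = [rng.uniform(-1, 1) for _ in range(n)]
    ys = [3 * x + 2 + rng.gauss(0, 0.1) for x in xs]
    return xs, ys

rng = random.Random(0)
x_train, y_train = make_data(rng, 100)
x_val, y_val = make_data(rng, 100)

# Sweep a small grid and record validation error for each configuration.
results = {}
for learning_rate in (0.001, 0.01, 0.1):
    for n_steps in (10, 100, 1000):
        w, b = train(x_train, y_train, learning_rate, n_steps)
        results[(learning_rate, n_steps)] = mse(x_val, y_val, w, b)

best = min(results, key=results.get)
for config, error in sorted(results.items()):
    print(f"lr = {config[0]:<6} steps = {config[1]:<5} validation MSE = {error:.4f}")
print("best configuration:", best)
```

The configurations with a tiny learning rate and few steps stop far from the solution even though a linear model matches this data perfectly, which is underfitting caused by the training procedure rather than by model bias.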
