An important modeling technique that is very common in statistics is regression. For people not overly familiar with the term, Investopedia defines regression as an attempt to determine the strength and character of the relationship between one dependent variable and a series of other independent variables. For example, we might want to predict the yield of a specific crop, and we have information on rainfall and the age of the seeds. We could make a regression model with crop yield as the dependent variable and rainfall and seed age as the independent variables.

But how do we know if both independent variables in the above example really help us in predicting crop yield? If you have taken a statistics class, you might do a hypothesis test: set either or both coefficients equal to zero as the null hypothesis and run the test to see whether we reject or fail to reject it. If we reject our null hypothesis, we tend to keep that independent variable in our regression model. If we fail to reject our null hypothesis (because the p-value we obtain is larger than our significance level α), we tend not to include that variable in our regression model. With only a few variables this is easy to do. We can put all of our variables into a linear model in our software (for the purpose of this piece I will reference R, but other software does similar things) and get a summary of each independent variable (in R, using summary(model_created)). This summary gives us the coefficient values, which tell us how much weight each independent variable adds to the final prediction of our dependent variable, the standard errors of these estimates, and the t-statistics with their corresponding p-values. As a rule of thumb, unless told otherwise, you keep the variables that have p-values less than your significance level α (which is usually 0.05) and remove the rest from your model. Then you fit a new model with the kept variables and repeat this until all of your variables have p-values less than α.
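To make that concrete, here is a minimal sketch in R, assuming a hypothetical data frame crops with a yield column and candidate predictors rainfall and seed_age (these names are my own illustration, not from a real dataset):

```r
# Fit a linear model with crop yield as the dependent variable and the
# candidate predictors as independent variables (hypothetical column names).
model_created <- lm(yield ~ rainfall + seed_age, data = crops)

# The summary shows each coefficient, its standard error, its t-statistic,
# and the p-value we compare against the significance level alpha (usually 0.05).
summary(model_created)
```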

Personally, I like this procedure since it is a quick and easy way to find the variables you need for a simple model. Issues arise when the number of potential independent variables is large, as repeating this procedure over and over can take an enormous amount of time. In that case I prefer two techniques: the first being the backwards selection technique and the second being Lasso.

The backwards selection technique is similar to what I described before. It removes independent variables automatically until some criterion is met (usually that all remaining independent variables fall below some p-value), and then our final model is made. This is why I like it the most out of any model selection technique: it is very similar to what everyone has learned in a statistics class, and as data scientists it is the easiest to explain to people who are not in the field, because they most likely learned model selection in their introductory statistics class by using p-values and manually removing the variables whose p-values are too high. Backwards selection is also very similar to forwards selection (instead of removing independent variables, forwards selection adds them if they meet the criterion). In practice, backwards selection is said to be better than forwards selection. This is because forwards selection can suffer from suppressor effects, where an independent variable is evaluated while others are held constant and might not actually be significant once combined with the others. Thus, with forwards selection we might make some errors in choosing what should be included in the final model. Being easier to explain and reducing these errors are just some of the reasons why backwards selection is a method of choice. I will mention that one downside to backwards selection is overfitting, which is something we can try to prevent using my other model selection technique.
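As a rough illustration, here is a minimal sketch of p-value-based backwards elimination in R, written as a small helper function rather than a packaged routine. The function name, the alpha default, and the crops data frame with its columns are all my own hypothetical choices, and the sketch assumes numeric predictors so coefficient names match term names:

```r
# Repeatedly drop the least significant predictor until every remaining
# predictor has a p-value at or below alpha, then return the final model.
backward_select <- function(formula, data, alpha = 0.05) {
  model <- lm(formula, data = data)
  repeat {
    pvals <- summary(model)$coefficients[, "Pr(>|t|)"]
    pvals <- pvals[names(pvals) != "(Intercept)"]  # never drop the intercept
    if (length(pvals) == 0) break                  # nothing left but the intercept
    worst <- which.max(pvals)
    if (pvals[worst] <= alpha) break               # every remaining term is significant
    # Remove the weakest predictor and refit the model without it.
    model <- update(model, as.formula(paste(". ~ . -", names(pvals)[worst])))
  }
  model
}

# Usage with the hypothetical crops data:
# final_model <- backward_select(yield ~ rainfall + seed_age + soil_ph, data = crops)
# summary(final_model)
```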

Now Lasso, which stands for Least Absolute Shrinkage and Selection Operator, is a different technique we can use for model fitting. This technique introduces a penalty term that helps prevent overfitting. Sometimes we want to include a lot of independent predictor variables that we think are “important”. What Lasso does is keep only the variables that best fit the model without overdoing it, giving each one an appropriate weight that optimizes predictions while limiting error. It is also extremely easy to make a Lasso model: it is much like making a linear regression model, except that in R you use the glmnet() function from the glmnet package, which is another reason I like this modeling technique. So if Lasso has the power to do all of this, why doesn’t everyone use it? A major downside is that if you have correlated variables, Lasso will tend to choose only one of them and set the others equal to 0. This might not be ideal, especially if there is a correlated variable that you know is extremely important and should be part of your predictive model.
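Here is a minimal sketch of fitting a Lasso model with glmnet, again assuming the hypothetical crops data frame from earlier. Note that glmnet() expects a numeric predictor matrix and a response vector rather than a formula, so the sketch builds the matrix with model.matrix():

```r
library(glmnet)

# Build the predictor matrix (dropping the intercept column) and the response.
x <- model.matrix(yield ~ rainfall + seed_age + soil_ph, data = crops)[, -1]
y <- crops$yield

# alpha = 1 selects the Lasso penalty; cv.glmnet() chooses the penalty
# strength lambda by cross-validation.
cv_fit <- cv.glmnet(x, y, alpha = 1)

# Coefficients at the cross-validated lambda; any predictor shrunk all the
# way to zero has effectively been dropped from the model.
coef(cv_fit, s = "lambda.min")
```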

Those are just two of the many model selection techniques out there that we can use to optimize our models. Do you like these techniques, or is there one you like better? Feel free to connect with me on LinkedIn or contact me via email; I would be happy to hear what model selection techniques you use and how they benefit your predictive modeling.

