Why use feature selection in machine learning

Upasana | May 24, 2019 | 3 min read | 21 views

Why use feature selection? If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression? What are the confidence intervals of the coefficients?

Answer: Feature engineering is an important part of building a model because it helps us understand the data much better. When we analyse raw data we are looking for hidden patterns, and many of those patterns only become visible after feature engineering.

Let's consider a very simple example. Say we have data on how many persons board a bus at the Kashmiri Gate bus stand. We have a date column and a second column with the total number of persons.

Sample data
Date	Persons
18-5-2019	15
19-5-2019	20
20-5-2019	50
21-5-2019	55

This is the raw data. Think about it for a few minutes and try to guess what it is telling us.

We can see that the number of persons who boarded a bus at Kashmiri Gate was low on two days and consistently high on the other two. But we would never know why unless we break the data down. So we introduce a new feature/column that tells us which day of the week each date fell on.

Transformed sample data
Date	Day of the Week	Persons
18-5-2019	Saturday	15
19-5-2019	Sunday	20
20-5-2019	Monday	50
21-5-2019	Tuesday	55

After the transformation, we can clearly see why so few people boarded a bus at Kashmiri Gate on those two days: both fell on the weekend, so people may have preferred some other means of transport or may not have stepped out at all. With more data, the same pattern would show up in a line chart of persons per day.
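The transformation above can be sketched in a few lines of Python. This is only a minimal illustration using the standard library; the function name `add_day_of_week` and the extra `is_weekend` flag are my own additions, not part of the original data.

```python
from datetime import datetime

# Sample data from the table above: (date as D-M-YYYY, persons boarded)
rows = [
    ("18-5-2019", 15),
    ("19-5-2019", 20),
    ("20-5-2019", 50),
    ("21-5-2019", 55),
]

# datetime.weekday() returns 0 for Monday through 6 for Sunday
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_day_of_week(rows):
    """Return (date, day name, is_weekend, persons) for each input row.

    Hypothetical helper for illustration only.
    """
    out = []
    for date_str, persons in rows:
        weekday = datetime.strptime(date_str, "%d-%m-%Y").weekday()
        # Saturday (5) and Sunday (6) are treated as the weekend here
        out.append((date_str, DAY_NAMES[weekday], weekday >= 5, persons))
    return out


transformed = add_day_of_week(rows)
for row in transformed:
    print(*row, sep="\t")
```

A boolean flag like `is_weekend` is often easier for a model to use than the raw day name, but that is a design choice and depends on the data.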

Coming to feature selection: once we have seen a pattern in the raw data, we tend to introduce more features of a similar kind, but not all of them will correlate with our target. To avoid including unnecessary columns in the training data, we do feature selection.
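One simple way to do this is a correlation filter: score every candidate feature by its correlation with the target and keep only the ones above a threshold. The sketch below is just one of many possible approaches. The helper names, the threshold of 0.5 and the `coin_flip` column are assumptions made up for illustration, and four rows are far too few to conclude anything in practice.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom else 0.0


def select_features(features, target, threshold=0.5):
    """Keep features whose absolute correlation with the target meets the threshold.

    Hypothetical helper; the default threshold is an arbitrary choice.
    """
    scores = {name: pearson(values, target) for name, values in features.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    return kept, scores


persons = [15, 20, 50, 55]            # target, from the sample table
features = {
    "is_weekend": [1, 1, 0, 0],       # derived from the dates in the table
    "coin_flip": [1, 0, 0, 1],        # made-up noise column, not real data
}

kept, scores = select_features(features, persons)
print(kept)
print(scores)
```

Here `is_weekend` is strongly (negatively) correlated with the number of persons and is kept, while the noise column is dropped.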

The next part of the question is:

If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression?

This is a case of multicollinearity.

What is multicollinearity? Multicollinearity happens when predictor variables are highly correlated with each other, that is, when they are (nearly) collinear. It leads to unstable estimates of the regression coefficients because it becomes hard to separate the individual effects of the collinear variables on the response variable.

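Because the estimates are unstable, their standard errors are inflated. A larger standard error shrinks the z-statistic of a coefficient, so we may fail to reject the null hypothesis for a predictor that actually matters. The coefficients can also change a lot, even flip sign, when the data changes slightly.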
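A common diagnostic for this is the variance inflation factor (VIF). With only two predictors it reduces to `1 / (1 - r^2)`, where `r` is the correlation between them. The sketch below uses made-up numbers; the `sqrt(VIF)` reading as a standard error multiplier is exact for linear regression and only a rough guide for logistic regression.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length, non-constant lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def vif_two_predictors(x1, x2):
    """VIF for the two-predictor case: 1 / (1 - r^2).

    Returns infinity when the predictors are perfectly collinear.
    """
    r_squared = pearson(x1, x2) ** 2
    return float("inf") if r_squared >= 1.0 else 1.0 / (1.0 - r_squared)


# Made-up illustrative values, not real data
x1 = [1, 2, 3, 4, 5, 6]
x2_collinear = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]   # almost a copy of x1
x2_unrelated = [3, 1, 6, 2, 5, 4]               # weakly related to x1

vif_high = vif_two_predictors(x1, x2_collinear)
vif_low = vif_two_predictors(x1, x2_unrelated)

print(f"collinear pair: VIF = {vif_high:.1f}, SE inflated roughly x{sqrt(vif_high):.1f}")
print(f"unrelated pair: VIF = {vif_low:.2f}, SE inflated roughly x{sqrt(vif_low):.2f}")
```

A VIF above 5 or 10 is a commonly quoted rule of thumb for problematic collinearity, though the cut-off is a convention rather than a hard rule.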

What are the confidence intervals of the coefficients?

Since the parameter \$\beta_j\$ is estimated using Maximum Likelihood Estimation, MLE theory tells us that the estimate \$\hat{\beta}_j\$ is asymptotically normal, so we can build a confidence interval as

\$\hat{\beta}_j \pm z \cdot SE(\hat{\beta}_j)\$

where \$z\$ is the standard normal quantile for the chosen confidence level (about 1.96 for 95%). This gives a confidence interval on the log-odds ratio.

The invariance property of the MLE allows us to exponentiate the endpoints to get

\$e^{\hat{\beta}_j \pm z \cdot SE(\hat{\beta}_j)}\$

which is a confidence interval on the odds ratio.
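As a quick worked example, the two intervals can be computed like this. The coefficient 0.8 and standard error 0.25 are made-up numbers, and `coefficient_ci` is a hypothetical helper; in practice both values would come from the fitted model's output.

```python
from math import exp
from statistics import NormalDist


def coefficient_ci(beta_hat, se, level=0.95):
    """Wald confidence intervals for one logistic regression coefficient.

    Returns ((low, high) on the log-odds scale, (low, high) on the odds ratio scale).
    """
    # Two-sided standard normal quantile, about 1.96 for a 95% interval
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    low, high = beta_hat - z * se, beta_hat + z * se
    # Exponentiating the endpoints gives the interval for the odds ratio
    return (low, high), (exp(low), exp(high))


# Made-up estimate and standard error, for illustration only
log_odds_ci, odds_ratio_ci = coefficient_ci(beta_hat=0.8, se=0.25)
print("log-odds ratio CI:", log_odds_ci)
print("odds ratio CI:    ", odds_ratio_ci)
```

Note that the odds ratio interval is not symmetric around `exp(beta_hat)`, because exponentiation stretches the upper end more than the lower end.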