Why use feature selection in machine learning

Upasana | May 24, 2019 | 3 min read


Why use feature selection? If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression? What are the confidence intervals of the coefficients?

Answer: Feature engineering is important when building a model because it lets us understand the data much better. When we analyse raw data, the hidden patterns it contains can often be surfaced only through feature engineering.

Let's consider a very simple example. Say we have data on how many people board a bus from the Kashmiri Gate bus stand each day: one column for the date and another for the total number of persons.

Sample data

Date         Persons
18-5-2019    15
19-5-2019    20
20-5-2019    50
21-5-2019    55

This is the raw data. Think about it for a few minutes and try to guess what it is telling us.

We can see that the number of persons boarding the bus from Kashmiri Gate was low on two days and high and consistent on the other two. But we would never know why without breaking the data down further. So we introduce a new feature/column that tells us the day of the week for each of these dates.

Transformed sample data

Date         Day of the Week    Persons
18-5-2019    Saturday           15
19-5-2019    Sunday             20
20-5-2019    Monday             50
21-5-2019    Tuesday            55

After the transformation we can clearly see why so few people boarded the bus from Kashmiri Gate on those two days: they fell on a weekend, so people may have preferred some other means of transport, or may not have stepped out at all. With more data, this pattern would also show up clearly in a line chart.
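As a minimal sketch of this transformation, assuming pandas (the column names recreate the illustrative table above), the day-of-week feature can be derived directly from the parsed dates:

```python
import pandas as pd

# Recreate the sample data from the table above (names are illustrative)
df = pd.DataFrame({
    "Date": ["18-5-2019", "19-5-2019", "20-5-2019", "21-5-2019"],
    "Persons": [15, 20, 50, 55],
})

# Parse the day-first date strings, then derive the day-of-week feature
df["Date"] = pd.to_datetime(df["Date"], format="%d-%m-%Y")
df["DayOfWeek"] = df["Date"].dt.day_name()

print(df)
```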

Coming to feature selection: after seeing the pattern in the raw data, we will introduce more features along these lines, but not all of them will correlate with our target. To avoid including unnecessary columns in the training data, we perform feature selection.
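One simple way to do this, sketched below with scikit-learn's SelectKBest on synthetic data (the dataset and the choice k=4 are assumptions for illustration), is to keep only the features with the strongest statistical association with the target:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Synthetic stand-in for an engineered feature matrix and target
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)

# Keep the k features most associated with the target (k=4 is arbitrary here)
selector = SelectKBest(score_func=f_classif, k=4)
X_selected = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_selected.shape)
```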

The next part of the question is:

If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression?

This is a case of multicollinearity.

What is multicollinearity? Multicollinearity occurs when there are high correlations among predictor variables, i.e. they are collinear. This leads to unstable estimates of the regression coefficients, because it becomes hard to separate out the individual effect of each collinear variable on the response variable.

It makes the coefficient estimates unstable and inflates their standard errors. Since the z-statistic divides each coefficient by its standard error, the z-statistics shrink, and we may fail to reject the null hypothesis for predictors that are actually important.
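A quick way to see this effect is to fit a logistic regression on two nearly identical predictors. The sketch below uses statsmodels on simulated data (all the numbers are illustrative assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)   # almost a copy of x1
p = 1 / (1 + np.exp(-(1.0 + 2.0 * x1)))    # true model uses only x1
y = rng.binomial(1, p)

X = sm.add_constant(np.column_stack([x1, x2]))
result = sm.Logit(y, X).fit(disp=0)

# The standard errors on x1 and x2 are inflated because the model
# cannot separate their individual effects on the response
print("coefficients:   ", result.params)
print("standard errors:", result.bse)
```

Refitting on a slightly different sample will typically swing the two coefficients wildly in opposite directions while their sum stays roughly stable, which is exactly the instability described above.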

What are the confidence intervals of the coefficients?

Since the parameter $\beta_j$ is estimated via Maximum Likelihood Estimation (MLE), MLE theory tells us that the estimate $\hat{\beta}_j$ is asymptotically normal, so we can form a confidence interval of the form

$\hat{\beta}_j \pm z_{\alpha/2} \cdot SE(\hat{\beta}_j)$

which gives a confidence interval on the log-odds ratio.

Using the invariance property of the MLE, we can exponentiate to get

$e^{\hat{\beta}_j \pm z_{\alpha/2} \cdot SE(\hat{\beta}_j)}$

which is a confidence interval on the odds ratio.
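In practice these intervals come straight out of a fitted model. A sketch with statsmodels on simulated data (the data-generating numbers are assumptions):

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
p = 1 / (1 + np.exp(-(0.5 + 1.5 * x)))
y = rng.binomial(1, p)

result = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

ci_log_odds = result.conf_int(alpha=0.05)  # beta_j +/- z * SE(beta_j)
ci_odds_ratio = np.exp(ci_log_odds)        # invariance of the MLE

print("log-odds CI:\n", ci_log_odds)
print("odds-ratio CI:\n", ci_odds_ratio)
```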

