# Why use feature selection in machine learning

Upasana | May 24, 2019 | 3 min read | 21 views

## Why use feature selection? If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression? What are the confidence intervals of the coefficients?

**Answer**: Feature engineering is an important part of building a model because it helps us understand the data much better. Raw data often hides patterns that only become visible once we engineer suitable features from it.

Let's consider a very simple example. Say we have data on how many persons board a bus from the Kashmiri Gate bus stand. We have a date column and the total number of persons as another column.

| Date | Persons |
|---|---|
| 18-5-2019 | 15 |
| 19-5-2019 | 20 |
| 20-5-2019 | 50 |
| 21-5-2019 | 55 |

This is raw data. Think about it for a few minutes and try to guess what it is telling us.

We can see that the number of persons boarding the bus from Kashmiri Gate was low on two days and consistently high on the other two. But we would never know why unless we break the data down. So we introduce a new feature/column that tells us which day of the week each date was.

| Date | Day of the Week | Persons |
|---|---|---|
| 18-5-2019 | Saturday | 15 |
| 19-5-2019 | Sunday | 20 |
| 20-5-2019 | Monday | 50 |
| 21-5-2019 | Tuesday | 55 |

After the transformation, we can clearly see why so few people boarded the bus from Kashmiri Gate on those two days: they fell on a weekend, so people may have preferred some other means of transport or may not have stepped out at all. With more data, the same pattern can be found by plotting a line chart.
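The transformation can be sketched in a few lines of Python. This is only an illustration using the standard library; the function name `add_day_features` and the extra `is_weekend` flag are assumptions added here, not part of the original data.

```python
from datetime import datetime

# Rows from the raw table: (date written as D-M-YYYY, persons who boarded).
rows = [
    ("18-5-2019", 15),
    ("19-5-2019", 20),
    ("20-5-2019", 50),
    ("21-5-2019", 55),
]

# Fixed English names so the output does not depend on the system locale.
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def add_day_features(date_text, persons):
    """Return the row extended with the day name and a weekend flag."""
    parsed = datetime.strptime(date_text, "%d-%m-%Y")
    weekday_index = parsed.weekday()   # Monday is 0, Sunday is 6
    is_weekend = weekday_index >= 5    # hypothetical extra feature
    return (date_text, DAY_NAMES[weekday_index], is_weekend, persons)


enriched = [add_day_features(date_text, persons) for date_text, persons in rows]
for row in enriched:
    print(row)
```

A boolean flag such as `is_weekend` is often easier for a model to use than the day name itself, but which encoding works best depends on the data.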

Coming to feature selection: after seeing the pattern in the raw data, we will introduce more features like this one, but not all of them will be correlated with our target. To avoid including unnecessary data in the training set, we do feature selection.
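One simple way to do this is a correlation filter: keep only the features whose correlation with the target is strong enough. The sketch below applies it to the four bus rows. The two candidate features and the `THRESHOLD` value are assumptions chosen for illustration, and four data points are far too few for a real decision.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Target: persons who boarded on 18, 19, 20 and 21 May 2019.
persons = [15, 20, 50, 55]

# Candidate features derived from the date column (both are illustrative).
candidates = {
    "is_weekend": [1, 1, 0, 0],            # Saturday, Sunday, Monday, Tuesday
    "day_of_month_is_even": [1, 0, 1, 0],  # 18, 19, 20, 21
}

THRESHOLD = 0.5  # assumed cut-off on the absolute correlation

scores = {name: pearson(values, persons) for name, values in candidates.items()}
selected = [name for name, score in scores.items() if abs(score) >= THRESHOLD]

for name, score in scores.items():
    print(f"{name}: correlation with target = {score:.3f}")
print("selected features:", selected)
```

Here the weekend flag is strongly (negatively) correlated with the number of persons, while the even-day flag is not, so only the first one survives the filter.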

The next part of the question is:

*If two predictors are highly correlated, what is the effect on the coefficients in the logistic regression?*


This is a case of **multicollinearity**.

*What is multicollinearity?* **Multicollinearity** happens when predictor variables are highly correlated with each other, that is, they are collinear. This leads to unstable estimates of the regression coefficients because it becomes hard to separate the individual effects of the collinear variables on the response variable.

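It makes the estimates of the *regression coefficients unstable*, which shows up as high standard errors. Because the z-statistic is the estimate divided by its standard error, the z-statistics shrink as well, and we may fail to reject the null hypothesis even for a predictor that genuinely matters.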
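A common diagnostic for this is the variance inflation factor (VIF). With exactly two predictors it reduces to `1 / (1 - r**2)`, where `r` is their correlation. The sketch below uses made-up values for `x1` and `x2`; the rule of thumb that a VIF above 10 signals trouble is a convention, not a hard limit. The VIF is defined for linear regression, where the standard error grows by a factor of `sqrt(VIF)`; a similar inflation is seen in logistic regression, so treat the number as a rough guide.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical predictors: x2 is almost a copy of x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8]

r = pearson(x1, x2)
vif = 1.0 / (1.0 - r ** 2)  # two-predictor special case of the VIF
se_inflation = sqrt(vif)    # factor by which the standard error grows (linear case)

print(f"correlation between predictors: {r:.4f}")
print(f"variance inflation factor:      {vif:.1f}")
print(f"standard error inflated by:     {se_inflation:.1f}x")
```

Dropping one of the two predictors, or combining them into a single feature, are the usual ways to bring the VIF back down.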

*What are the confidence intervals of the coefficients?*


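Since the parameter $\beta_j$ is estimated using Maximum Likelihood Estimation, MLE theory tells us that the estimate $\hat{\beta}_j$ is asymptotically normal, so we can get an approximate $100(1-\alpha)\%$ confidence interval like

$$\hat{\beta}_j \pm z_{\alpha/2}\,\widehat{\mathrm{SE}}(\hat{\beta}_j)$$

which gives a confidence interval on the log-odds ratio.

The invariance property of the MLE allows us to exponentiate the endpoints to get

$$\exp\left(\hat{\beta}_j \pm z_{\alpha/2}\,\widehat{\mathrm{SE}}(\hat{\beta}_j)\right)$$

which is a confidence interval on the odds ratio.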
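As a small worked example, the interval can be computed as the estimate plus or minus a normal quantile times the standard error, and then exponentiated. The coefficient `0.8` and standard error `0.25` below are made-up numbers, and `wald_interval` is a helper name introduced here for illustration.

```python
from math import exp
from statistics import NormalDist


def wald_interval(beta_hat, std_error, confidence=0.95):
    """Normal-approximation interval: beta_hat +/- z * std_error."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    return beta_hat - z * std_error, beta_hat + z * std_error


# Hypothetical fitted coefficient and its standard error.
beta_hat = 0.8
std_error = 0.25

log_odds_ci = wald_interval(beta_hat, std_error)
odds_ratio_ci = tuple(exp(bound) for bound in log_odds_ci)

print(f"log-odds ratio: {beta_hat:.2f}, 95% CI = "
      f"({log_odds_ci[0]:.3f}, {log_odds_ci[1]:.3f})")
print(f"odds ratio:     {exp(beta_hat):.2f}, 95% CI = "
      f"({odds_ratio_ci[0]:.3f}, {odds_ratio_ci[1]:.3f})")
```

Note that the interval on the odds ratio is not symmetric around `exp(beta_hat)`, because exponentiation stretches the upper side more than the lower side.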
