Introduction to regression, correlation, multicollinearity and 99th percentile

Upasana | August 05, 2019 | 2 min read | 233 views


What is 99th percentile?

A percentile is a statistic used to describe where a value falls within a data set. The 99th percentile is the value below which 99% of the data falls; equivalently, it is greater than 99% of the observations.
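As a rough illustration, the definition can be sketched in plain Python. The nearest-rank method used here is only one of several percentile conventions (libraries often interpolate instead), and the function name `percentile_nearest_rank` and the latency data are hypothetical.

```python
import math


def percentile_nearest_rank(values, p):
    """Return the p-th percentile (0 < p <= 100) using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    # Nearest rank: the smallest value with at least p% of the data at or below it.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical data: 200 response times, 1 ms to 200 ms.
latencies_ms = list(range(1, 201))
p99 = percentile_nearest_rank(latencies_ms, 99)

# Fraction of observations at or below the 99th percentile.
share_at_or_below = sum(v <= p99 for v in latencies_ms) / len(latencies_ms)
print(p99, share_at_or_below)
```

With this data, 99% of the observations sit at or below the returned value, which matches the definition above.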

What is the probability that a ball chosen from a bag containing 5 red balls, 7 green balls and 2 black balls will not be green?

In practice, we need the probability of drawing a red or a black ball. The bag holds 14 balls in total and the two events are mutually exclusive, so their probabilities add:

\$P(red) + P(black) = 5/14 + 2/14 = 7/14 = 0.5\$

So there is a 50% probability of not drawing a green ball.
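The same result can be cross-checked with the complement rule, P(not green) = 1 - P(green). A minimal sketch using exact fractions (the variable names are illustrative only):

```python
from fractions import Fraction

# Contents of the bag from the question.
bag = {"red": 5, "green": 7, "black": 2}
total = sum(bag.values())  # 14 balls

# Route 1: add the probabilities of the two non-green colours.
p_not_green_sum = Fraction(bag["red"], total) + Fraction(bag["black"], total)

# Route 2: complement rule, 1 - P(green).
p_not_green_complement = 1 - Fraction(bag["green"], total)

print(p_not_green_sum, p_not_green_complement)
```

Both routes give the same fraction, one half.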

What is regression?

Regression models the relationship between one or more independent variables and a dependent variable. It is used to predict the dependent variable when the values of the independent variables are known.
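To make this concrete, here is a minimal sketch of simple linear regression with a single independent variable, fitted by ordinary least squares. The helper `fit_line` and the hours/scores data are made up for illustration; real projects would normally use a statistics library.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally sized samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and sum of cross-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: exam score as a function of hours studied.
hours = [1, 2, 3, 4, 5]            # independent variable
scores = [52, 54, 56, 58, 60]      # dependent variable

slope, intercept = fit_line(hours, scores)

# Predict the dependent variable for a new value of the independent variable.
predicted = intercept + slope * 6
print(slope, intercept, predicted)
```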

What is correlation?

Correlation measures the strength and direction of the association between two variables. The correlation coefficient always lies between -1 and 1.

Interpretation of the correlation coefficient:

  1. Close to -1: the variables are negatively correlated.

  2. Around zero: the variables are not (linearly) correlated.

  3. Close to 1: the variables are positively correlated.

How will you calculate correlation?

Correlation can be calculated with two widely used methods:

  1. Spearman’s correlation coefficient: It is based on the rank order of the values rather than on covariance and standard deviation, as Pearson’s correlation coefficient is. If the variables are not normally (Gaussian) distributed, or the relationship is monotonic but not linear, Spearman’s coefficient is the better choice.

  2. Pearson’s correlation coefficient: It is the covariance of the two variables divided by the product of their standard deviations. It measures linear association, and the usual assumption is that the variables are approximately normally (Gaussian) distributed.
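The two coefficients can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not a library API: the function names are hypothetical, and the `ranks` helper assumes there are no tied values (tied values would need averaged ranks).

```python
import math


def pearson(xs, ys):
    """Pearson's r: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # The 1/(n - 1) factors of covariance and variance cancel, so they are omitted.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """Rank values from 1 (smallest) to n (largest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson's r computed on the ranks of the data."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical data: y grows monotonically with x, but not linearly.
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(r_pearson, r_spearman)
```

Because the relationship is perfectly monotonic, Spearman’s coefficient reaches 1, while Pearson’s coefficient stays below 1 because the relationship is not a straight line.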

What do you do if there is multicollinearity in a dataset?

Multicollinearity occurs when predictor variables are highly correlated with each other, that is, they are collinear. It becomes hard to separate the individual effects of the collinear variables on the response variable.

This makes the estimates of the regression coefficients unstable and inflates their standard errors. Larger standard errors shrink the z-statistics (or t-statistics), so we may fail to reject the null hypothesis for a predictor that actually matters.

So if there is multicollinearity in the dataset, we can use forward selection or backward selection to filter out the highly correlated variables. Some algorithms, such as Random Forest, are also much less affected by multicollinearity.
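Before applying a selection method, it helps to see which predictors are collinear. The sketch below is a simple pairwise-correlation screen, not forward or backward selection itself, and it only detects collinearity between two variables at a time. The function names, the 0.9 threshold, and the data are assumptions chosen for illustration.

```python
import math
from itertools import combinations


def pearson(xs, ys):
    """Pearson's correlation coefficient of two equally sized samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def highly_correlated_pairs(columns, threshold=0.9):
    """Return (name_a, name_b, r) for every predictor pair with |r| >= threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) >= threshold:
            pairs.append((name_a, name_b, r))
    return pairs


# Hypothetical predictors: height_in is height_cm in other units, so they are collinear.
predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [41, 23, 35, 52, 30],
}

flagged = highly_correlated_pairs(predictors)
for name_a, name_b, r in flagged:
    print(f"{name_a} and {name_b} are highly correlated (r = {r:.3f})")
```

One variable from each flagged pair is a candidate for removal before fitting the regression.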

