Maths behind Correlation
Let's talk about correlation today. A variable in a dataset can be related to or dependent on some other variable. The relationship between them can be strong or weak, so it's better to check it before jumping to conclusions.
Correlation can be positive if both variables change in the same direction, negative if they move in opposite directions, or neutral if there is no relationship. There are several techniques to find it, including visual ones like the seaborn heatmap and pair plot. Let's discuss the maths behind it first, then dive into the code snippets to understand the implementation.
Note: I have provided links to sites where these formulae are explained with examples. I mostly write notes (like this blog) so that it's easy for me to get everything in one place, so typing out full examples would be a lengthy task for me. Thank you.
1. Covariance:
If two variables have a linear relationship, we can use covariance to detect it.
Formula :
Covariance(X, Y) = [ Σ (Xi - mean(X)) * (Yi - mean(Y)) ] / (n - 1)
The sign of the covariance determines the type of relationship, as defined above.
In Python, we can compute covariance by importing cov from NumPy. It returns a covariance matrix rather than a single number, and the magnitude of the result is hard to interpret because it depends on the units of X and Y.
Code snippet:
from numpy import cov
covariance = cov(X, Y)  # 2x2 matrix; covariance[0][1] is Covariance(X, Y)
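As a rough check of the formula above, the sketch below computes the sample covariance by hand and compares it with the off-diagonal entry of the matrix returned by numpy.cov. The arrays X and Y are made-up values used only for illustration.

```python
import numpy as np

# Hypothetical sample data, chosen only for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Sample covariance from the formula:
# sum of the products of deviations from the mean, divided by (n - 1).
manual_cov = np.sum((X - X.mean()) * (Y - Y.mean())) / (len(X) - 1)

# np.cov returns the 2x2 matrix [[var(X), cov(X, Y)], [cov(Y, X), var(Y)]].
cov_matrix = np.cov(X, Y)

print(manual_cov)        # covariance from the formula
print(cov_matrix[0, 1])  # the same quantity taken from the matrix
```

The positive result says that X and Y tend to increase together, but the number itself changes if either variable is rescaled, which is why the magnitude is hard to interpret.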
2. Pearson's correlation coefficient:
It normalizes the covariance by the standard deviations (SD) of the two variables, giving a coefficient between -1 and 1 for linear relationships.
Formula:
Covariance(X, Y) / (SD(X) * SD(Y))
The sign of the coefficient determines the type of relationship, as defined above. Unlike covariance, the value always lies between -1 and 1, so its magnitude shows how strong the relationship is.
pearsonr returns two values: the first is the coefficient and the second is the two-tailed p-value.
Code snippet:
from scipy.stats import pearsonr
correlation, pvalue = pearsonr(X, Y)
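To see the normalization at work, here is a minimal sketch that divides the sample covariance by the two sample standard deviations and compares the result with scipy.stats.pearsonr. The data are hypothetical, and ddof=1 is used so that the SD matches the (n - 1) in the covariance formula.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical sample data, chosen only for illustration.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance(X, Y) / (SD(X) * SD(Y)), using sample statistics throughout.
cov_xy = np.cov(X, Y)[0, 1]
manual_r = cov_xy / (np.std(X, ddof=1) * np.std(Y, ddof=1))

# pearsonr returns the coefficient and the two-tailed p-value.
r, p_value = pearsonr(X, Y)

print(manual_r)
print(r, p_value)
```

Both approaches should agree up to floating-point error, and the result stays within -1 and 1 no matter what units X and Y are measured in.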
3. Spearman's Correlation :
Spearman's correlation is used for finding a monotonic relationship, which may be nonlinear. The coefficient lies between -1 and 1.
Formula:
Covariance(rank(X), rank(Y)) / (SD(rank(X)) * SD(rank(Y)))
where rank means replacing each value with its position when its own column is sorted in increasing order, so X and Y are ranked separately. See more info at "Wikipedia Spearman's rank correlation coefficient". Please see the example there to understand more.
spearmanr also returns two values: the first is the coefficient and the second is the p-value.
Code snippet:
from scipy.stats import spearmanr
correlation, pvalue = spearmanr(X, Y)
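The rank-based formula can be sketched as Pearson's coefficient applied to the ranks. The example below uses made-up data where Y is the cube of X, a relationship that is monotonic but not linear, so Spearman's coefficient is expected to be higher than Pearson's.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Hypothetical data: Y always increases with X, but not along a straight line.
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
Y = X ** 3

# Rank each column separately, then apply Pearson's formula to the ranks.
rank_x = rankdata(X)
rank_y = rankdata(Y)
manual_rho, _ = pearsonr(rank_x, rank_y)

# spearmanr performs the ranking internally.
rho, p_value = spearmanr(X, Y)

# Pearson on the raw values only measures the linear part of the relationship.
linear_r, _ = pearsonr(X, Y)

print(manual_rho, rho)
print(linear_r)
```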
4. Kendall rank correlation coefficient:
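It is calculated by using concordant and discordant pairs. A pair of rows is concordant when X and Y order the two rows the same way, and discordant when they order them in opposite ways. After sorting the rows by X in increasing order, the pairs can be counted from the Y column:
- Concordant: For each element, we count the number of higher value elements below it.
- Discordant: For each element, we count the number of lower value elements below it.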
Code snippet:
from scipy import stats
correlation, pvalue = stats.kendalltau(X, Y)
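The counting can be sketched directly. For data without ties, the coefficient is tau = (C - D) / (C + D), where C and D are the numbers of concordant and discordant pairs. The lists below are hypothetical and contain no ties. By default scipy's kendalltau computes the tau-b variant, which gives the same value when there are no ties.

```python
from itertools import combinations

from scipy import stats

# Hypothetical data without ties, chosen only for illustration.
X = [1, 2, 3, 4, 5]
Y = [3, 1, 2, 5, 4]

concordant = 0
discordant = 0
# Compare every pair of rows once.
for (x1, y1), (x2, y2) in combinations(zip(X, Y), 2):
    direction = (x1 - x2) * (y1 - y2)
    if direction > 0:
        concordant += 1  # X and Y order this pair the same way
    elif direction < 0:
        discordant += 1  # X and Y order this pair in opposite ways

manual_tau = (concordant - discordant) / (concordant + discordant)

tau, p_value = stats.kendalltau(X, Y)

print(concordant, discordant, manual_tau)
print(tau, p_value)
```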
5. Pandas corr( ) function
It uses Pearson by default, but the method parameter lets us select 'pearson', 'spearman' or 'kendall'.
Code snippet:
data.corr()
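As a small illustration of the method parameter, the sketch below builds a hypothetical DataFrame and computes all three correlation matrices. The column names are borrowed from the insurance dataset, but the values are invented.

```python
import pandas as pd

# Hypothetical values; only the column names follow the insurance dataset.
data = pd.DataFrame({
    "age": [19, 25, 33, 41, 52, 60],
    "bmi": [27.9, 22.1, 30.5, 28.3, 31.0, 25.8],
    "charges": [1700.0, 2700.0, 4400.0, 6200.0, 9800.0, 12600.0],
})

pearson_corr = data.corr()                    # default: method="pearson"
spearman_corr = data.corr(method="spearman")  # rank-based
kendall_corr = data.corr(method="kendall")    # concordant/discordant pairs

print(pearson_corr)
print(spearman_corr)
print(kendall_corr)
```

Each call returns a square DataFrame with one row and one column per variable, so a single coefficient can be read with, for example, spearman_corr.loc["age", "charges"].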
Seaborn heatmap and pair plot
The easiest way to draw a heatmap is to compute the correlation matrix first and then pass it to the heatmap( ) function defined in seaborn.
Code snippet:
import pandas as pd
import seaborn as sns
correl = data.corr()
ax = sns.heatmap(correl)
This should produce a heatmap of the correlation matrix.
Various types of heatmap can be found in the documentation of seaborn heatmap.
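A slightly fuller sketch of the heatmap step is shown below. It assumes a synthetic DataFrame in place of the real dataset, keeps only the numeric columns before calling corr( ), and uses the annot, vmin, vmax and cmap options of seaborn's heatmap to print the coefficients and fix the colour scale to the range -1 to 1. The Agg backend line is only there so the sketch runs without a display and can be dropped in a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; not needed in a notebook

import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic stand-in for the real dataset (values are randomly generated).
rng = np.random.default_rng(0)
age = rng.integers(18, 65, size=100)
data = pd.DataFrame({
    "age": age,
    "bmi": rng.normal(30, 6, size=100),
    "charges": 250 * age + rng.normal(0, 2000, size=100),
    "region": rng.choice(["north", "south"], size=100),
})

# Correlation is only defined for numeric columns, so drop the text column.
numeric = data.select_dtypes(include="number")
correl = numeric.corr()

# annot=True writes each coefficient into its cell.
ax = sns.heatmap(correl, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation heatmap")
```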
Next, we will use pairplot to plot pairwise graphs of the variables.
This should produce a grid of histograms (on the diagonal) and scatter plots, which can be used to spot relationships between the variables.
More variations can be found at "Seaborn pairplot documentation".
The graphs are plotted for the Kaggle dataset "Medical Cost Personal Datasets".
References:
- Numpy Cov documentation: https://docs.scipy.org/doc/numpy/reference/generated/numpy.cov.html
- Scipy Pearsonr documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.pearsonr.html
- Scipy Spearman documentation: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.spearmanr.html
- Scipy Kendall Tau coefficient documentation: https://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.kendalltau.html
- Wikipedia Spearman's rank correlation coefficient: https://en.wikipedia.org/wiki/Spearman%27s_rank_correlation_coefficient
- How to Calculate Correlation Between Variables in Python : Jason Brownlee https://machinelearningmastery.com/how-to-use-correlation-to-understand-the-relationship-between-variables/
- Seaborn heatmap types: https://seaborn.pydata.org/generated/seaborn.heatmap.html
- Statistics How To Kendall's tau : https://www.statisticshowto.com/kendalls-tau/
- Seaborn pairplot documentation : https://seaborn.pydata.org/generated/seaborn.pairplot.html
- Kaggle Medical Cost Personal Datasets : https://www.kaggle.com/mirichoi0218/insurance/kernels