The Math Behind Correlation

Let's talk about correlation today. A variable in a dataset can be related to, or dependent on, some other variable. The relationship between them can be strong or weak, so it's better to check it before jumping to conclusions.

Correlation is positive if both variables change in the same direction, negative if they move in opposite directions, and neutral (zero) if there is no relationship. There are several ways to inspect it visually, such as the seaborn heatmap and pair plot. Let's discuss the maths behind it first, then dive into the code snippets to understand the implementation.

Note: I have provided links to sites where you can understand these formulae better with examples. I mostly create notes (like this blog) so that it's easy for me to find everything in one place, so typing out examples would be a lengthy task for me. Thank you.

1. Covariance:


If two variables have a linear relationship, we can use covariance to measure it.

Formula :

Covariance(X, Y) = [ Σ (Xi - mean(X)) * (Yi - mean(Y)) ] / (n - 1)

The sign of the covariance gives the type of relationship, as defined above: positive, negative, or close to zero for none.
In Python, we can compute covariance by importing cov from NumPy. It returns a 2x2 covariance matrix rather than a single number, and the magnitude depends on the units of X and Y, so it's difficult to tell how strong the relationship is from the values alone.

Code snippet:

from numpy import cov
# Returns the 2x2 covariance matrix; covariance[0, 1] is Covariance(X, Y)
covariance = cov(X, Y)
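
To see how the formula lines up with NumPy's result, here is a minimal sketch on made-up numbers (the arrays X and Y below are illustrative only, not taken from any dataset). It computes the sample covariance by hand and compares it with the off-diagonal entry of the matrix returned by cov.

```python
import numpy as np

# Illustrative data only (assumed for this example)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

n = len(X)

# Sample covariance straight from the formula, dividing by n - 1
manual_cov = np.sum((X - X.mean()) * (Y - Y.mean())) / (n - 1)

# np.cov returns [[Var(X), Cov(X, Y)], [Cov(X, Y), Var(Y)]]
cov_matrix = np.cov(X, Y)

print(manual_cov)        # covariance from the formula
print(cov_matrix[0, 1])  # same quantity read from NumPy's matrix
```

For these numbers both values come out to 1.5. The sign says the relationship is positive, but 1.5 on its own says little about how strong it is.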


2. Pearson's correlation coefficient:

It gives a coefficient between -1 and 1 by normalizing the covariance with the standard deviations (SD) of the two variables. It measures linear relationships.

Formula:

Covariance(X, Y)/ ( SD(X)*SD(Y))

The sign of the coefficient determines the type of relationship, as defined above. Unlike covariance, the value is bounded between -1 and 1, so its magnitude tells us how strong the linear relationship is.
pearsonr returns two values: the first is the coefficient and the second is the two-tailed p-value.

Code snippet:

from scipy.stats import pearsonr
correlation , pvalue = pearsonr( X , Y )
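
As a quick check of the formula, the sketch below divides the covariance by the two sample standard deviations and compares the result with pearsonr. The data is made up for illustration, and ddof=1 is used so that the SD matches the n - 1 in the covariance formula.

```python
import numpy as np
from scipy.stats import pearsonr

# Illustrative data only (assumed for this example)
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Covariance(X, Y) / (SD(X) * SD(Y)), using sample statistics (ddof=1)
cov_xy = np.cov(X, Y)[0, 1]
manual_r = cov_xy / (np.std(X, ddof=1) * np.std(Y, ddof=1))

# SciPy's implementation returns the coefficient and a two-tailed p-value
r, pvalue = pearsonr(X, Y)

print(manual_r)  # coefficient from the formula
print(r)         # coefficient from scipy
print(pvalue)    # two-tailed p-value
```

Both coefficients should agree up to floating-point rounding (about 0.77 for these numbers).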


3. Spearman's correlation:

Spearman's correlation is used for finding monotonic relationships, which may be nonlinear. The coefficient lies between -1 and 1.

Formula :

Covariance(rank(X), rank(Y)) / (SD(rank(X)) * SD(rank(Y)))

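where rank(X) replaces each value in column X with its position when the column is sorted in increasing order, and rank(Y) does the same for column Y, keeping the rows paired. In other words, it is Pearson's coefficient computed on the ranks. See more info at "Wikipedia Spearman's rank coefficient". Please see the example there to understand more.
spearmanr also returns two values: the first is the coefficient and the second is the p-value.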

Code snippet:

from scipy.stats import spearmanr
correlation , pvalue = spearmanr( X , Y )
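
The sketch below applies the rank formula by hand and compares it with spearmanr. The data is made up: Y is simply X cubed, so the relationship is perfectly monotonic but not linear. rankdata from scipy.stats is used here to do the ranking.

```python
import numpy as np
from scipy.stats import pearsonr, rankdata, spearmanr

# Illustrative data only: Y grows with X, but not in a straight line
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = X ** 3

# Replace each value by its rank within its own column
rank_x = rankdata(X)
rank_y = rankdata(Y)

# Covariance(rank(X), rank(Y)) / (SD(rank(X)) * SD(rank(Y)))
manual_rho = np.cov(rank_x, rank_y)[0, 1] / (
    np.std(rank_x, ddof=1) * np.std(rank_y, ddof=1)
)

rho, pvalue = spearmanr(X, Y)

# Pearson on the raw values, for comparison
linear_r, _ = pearsonr(X, Y)

print(manual_rho)  # Spearman from the formula
print(rho)         # Spearman from scipy
print(linear_r)    # Pearson on the raw values
```

Spearman's coefficient reaches 1 here because the ordering of X and Y matches exactly, while Pearson's stays below 1 because the points do not lie on a straight line.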


4. Kendall rank correlation coefficient:

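It is calculated from concordant and discordant pairs. A pair of rows is concordant if the row with the larger X also has the larger Y, and discordant if it has the smaller Y. After sorting the rows by X in increasing order, the pairs can be counted in the Y column:
  • Concordant: for each element, count the number of larger values below it in the column.
  • Discordant: for each element, count the number of smaller values below it in the column.
With C concordant and D discordant pairs and no ties, the coefficient is (C - D) / (C + D), which lies between -1 and 1.
Check this: "Statistics How To" for understanding it with an example.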

Code snippet:

import scipy.stats as stats
correlation, pvalue = stats.kendalltau( X , Y )
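
To make the pair counting concrete, here is a small sketch on made-up data with no ties. It goes through every pair of rows, counts concordant and discordant pairs, and compares (C - D) / (C + D) with kendalltau. With tied values the two would not necessarily match, since kendalltau applies a tie correction by default.

```python
from itertools import combinations

from scipy.stats import kendalltau

# Illustrative data only, chosen so that there are no ties
X = [1, 2, 3, 4, 5]
Y = [2, 1, 4, 3, 5]

concordant = 0
discordant = 0

# Compare every pair of rows once
for (x1, y1), (x2, y2) in combinations(zip(X, Y), 2):
    direction = (x1 - x2) * (y1 - y2)
    if direction > 0:
        concordant += 1   # X and Y move in the same direction
    elif direction < 0:
        discordant += 1   # X and Y move in opposite directions

manual_tau = (concordant - discordant) / (concordant + discordant)

tau, pvalue = kendalltau(X, Y)

print(concordant, discordant)  # 8 2
print(manual_tau)              # 0.6
print(tau)                     # should match manual_tau up to rounding
```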


5. Pandas corr( ) function

It uses Pearson by default, but we can select Pearson, Spearman or Kendall through the method parameter.

Code snippet:

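import pandas as pd
# data is a pandas DataFrame; method can be "pearson" (default), "spearman" or "kendall"
# numeric_only=True (pandas 1.5+) skips text columns, which newer pandas will not ignore on its own
data.corr(method="pearson", numeric_only=True)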
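
Here is a small sketch of switching between the three methods. The DataFrame and its column names (x, y, z) are made up for illustration. Note that pandas relies on SciPy for the Kendall method.

```python
import pandas as pd

# Illustrative DataFrame only; z is x squared, so it is monotonic but not linear in x
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 1, 4, 3, 6, 5],
    "z": [1, 4, 9, 16, 25, 36],
})

pearson = data.corr()                     # default: method="pearson"
spearman = data.corr(method="spearman")   # rank-based
kendall = data.corr(method="kendall")     # concordant/discordant pairs

print(pearson.loc["x", "z"])   # below 1: the relationship is not a straight line
print(spearman.loc["x", "z"])  # 1 (up to rounding): the ordering matches exactly
print(kendall.loc["x", "z"])   # 1 (up to rounding): every pair is concordant
```

Each call returns a square DataFrame with one row and one column per variable, and 1.0 on the diagonal.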


Seaborn heatmap and pair plot

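The easiest way to draw a heatmap is to compute the correlation matrix first and then pass it to the heatmap( ) function defined in seaborn.

Code snippet:

import seaborn as sns
import pandas as pd

# data is the pandas DataFrame loaded from the dataset
# numeric_only=True (pandas 1.5+) skips text columns
correl = data.corr(numeric_only=True)
ax = sns.heatmap(correl)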


This should produce a heatmap of the correlation matrix, with one colored cell per pair of variables.


Various types of heatmap can be found in the seaborn heatmap documentation.
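
As one possible variation, the sketch below writes the coefficient inside each cell and fixes the color scale to the full -1 to 1 range so that colors stay comparable between plots. The DataFrame is made up for illustration, and the colormap choice is just a preference.

```python
import pandas as pd
import seaborn as sns

# Illustrative DataFrame only (not the dataset used for the plots in this post)
data = pd.DataFrame({
    "x": [1, 2, 3, 4, 5, 6],
    "y": [2, 1, 4, 3, 6, 5],
    "z": [6, 5, 4, 3, 2, 1],
})

correl = data.corr()

# annot=True prints each coefficient in its cell;
# vmin/vmax pin the color scale to the range of a correlation coefficient
ax = sns.heatmap(correl, annot=True, vmin=-1, vmax=1, cmap="coolwarm")
ax.set_title("Correlation matrix")
```

In a script, call plt.show() from matplotlib.pyplot afterwards to display the figure.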
Next, we will use pairplot to plot the variables against each other in pairs.

Code snippet:

sns.pairplot(data)


This should produce a grid of plots, with a histogram for each variable on the diagonal and a scatter plot for each pair of variables, which can be used to spot any relationship between the variables.


More variations can be found in the "Seaborn pairplot documentation".
The graphs are plotted for the Kaggle dataset "Medical Cost Personal Datasets".

