Use .corr to get the correlation between two columns

Cleb 2019-07-09 00:05

Without the actual data it is hard to answer the question, but I guess you are looking for something like this:

Top15['Citable docs per Capita'].corr(Top15['Energy Supply per Capita'])

That calculates the correlation between your two columns 'Citable docs per Capita' and 'Energy Supply per Capita'.

To give an example:

import pandas as pd

df = pd.DataFrame({'A': range(4), 'B': [2*i for i in range(4)]})

   A  B
0  0  0
1  1  2
2  2  4
3  3  6

Then

df['A'].corr(df['B'])

gives 1, as expected, because B is exactly 2 * A.

Now, if you change a value, e.g.

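df['B'] = df['B'].astype(float)  # B is integer-typed; newer pandas warns on writing a float into it
df.loc[2, 'B'] = 4.5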

   A    B
0  0  0.0
1  1  2.0
2  2  4.5
3  3  6.0

the command

df['A'].corr(df['B'])

returns

0.99586

which is still close to 1, as expected.
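Here is the whole example as one self-contained sketch, in case you want to run it yourself. The variable names (r_linear, r_perturbed, r_numpy) are just for illustration, B is created as float from the start so that 4.5 can be assigned without a dtype change, and the NumPy call is only a cross-check of the Pearson coefficient:

```python
import numpy as np
import pandas as pd

# Same toy data as above; B is created as float so 4.5 can be assigned later.
df = pd.DataFrame({'A': range(4), 'B': [2.0 * i for i in range(4)]})

r_linear = df['A'].corr(df['B'])      # B is exactly 2 * A

df.loc[2, 'B'] = 4.5                  # break the exact linear relation
r_perturbed = df['A'].corr(df['B'])

# Cross-check against NumPy's Pearson correlation matrix.
r_numpy = np.corrcoef(df['A'], df['B'])[0, 1]

print(r_linear, r_perturbed, r_numpy)
```

Both libraries should agree up to floating-point noise, since Series.corr computes the Pearson coefficient by default.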

If you apply .corr directly to your dataframe, it will return all pairwise correlations between your columns; that's why you observe 1s on the diagonal of your matrix (each column is perfectly correlated with itself).

df.corr()

will therefore return

          A         B
A  1.000000  0.995862
B  0.995862  1.000000
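To connect the two ways of calling .corr, here is a small sketch that pulls a single coefficient out of the matrix and compares it with the Series version. The names corr_matrix, pairwise and from_matrix are made up for this example:

```python
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3], 'B': [0.0, 2.0, 4.5, 6.0]})

corr_matrix = df.corr()                    # all pairwise Pearson correlations

# The off-diagonal entry is the same number that Series.corr returns.
pairwise = df['A'].corr(df['B'])
from_matrix = corr_matrix.loc['A', 'B']

# The matrix is symmetric, so the order of the labels does not matter.
mirrored = corr_matrix.loc['B', 'A']

print(pairwise, from_matrix, mirrored)
```

So if you only need one pair, the Series call is enough; the matrix is convenient when you want every pair at once.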

In the graphic you show, only the upper left corner of the correlation matrix is represented (I assume).

There can be cases where you get NaNs in your result; check this post for an example.
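One way to end up with a NaN (not necessarily the one discussed in the linked post) is a column that never changes: its variance is zero, so the Pearson coefficient is undefined. A minimal sketch, with a made-up column name 'constant':

```python
import warnings

import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3], 'constant': [5, 5, 5, 5]})

with warnings.catch_warnings():
    # NumPy may emit a RuntimeWarning for the 0/0 division; silence it here.
    warnings.simplefilter('ignore')
    r = df['A'].corr(df['constant'])
    matrix = df.corr()

print(pd.isna(r))                              # is the pairwise result NaN?
print(pd.isna(matrix.loc['A', 'constant']))    # and the matrix entry?
```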

If you want to filter entries above or below a certain threshold, you can check this question. If you want to plot a heatmap of the correlation coefficients, you can check this answer, and if you then run into the issue of overlapping axis labels, check the following post.
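For the threshold case, a rough sketch of one possible approach (not necessarily what the linked question does) is to walk over each pair of columns once and keep the pairs whose absolute correlation exceeds a cutoff. The data, the extra column 'C' and the 0.9 threshold are invented for illustration:

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({
    'A': [0, 1, 2, 3, 4],
    'B': [0.0, 2.0, 4.5, 6.0, 8.0],
    'C': [3, 1, 4, 1, 5],
})

threshold = 0.9                      # arbitrary cutoff for this example
corr_matrix = df.corr()

# combinations() visits each unordered pair once and skips the diagonal.
strong = {
    (a, b): corr_matrix.loc[a, b]
    for a, b in combinations(corr_matrix.columns, 2)
    if abs(corr_matrix.loc[a, b]) > threshold
}

print(strong)
```

Here only the (A, B) pair should survive the filter, since C is only weakly correlated with the other two columns.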

Dr.DOOM 2018-04-03 22:28:56

can this be applied by row?

Cleb 2018-04-03 22:40:42

@Dr.DOOM: Yes, it just takes series, so e.g. df.loc[1, :].corr(df.loc[2, :]) will work fine, too. For the entire dataframe, you can simply transpose: df.T.corr().
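A small sketch of the row-wise variant, assuming a dataframe with a third, made-up column 'C' so that each row has more than two values. Note that the two-column example from the answer is a poor test for this: each row then consists of only two points, and the Pearson correlation of two points is always 1 or -1 (or NaN if a row is constant), no matter which value you change.

```python
import pandas as pd

# Three columns, so every row is a series of three values.
wide = pd.DataFrame({
    'A': [0, 1, 2, 3],
    'B': [0.0, 2.0, 4.5, 6.0],
    'C': [1.0, 0.0, 5.0, 2.0],
})

r_rows = wide.loc[1, :].corr(wide.loc[2, :])   # correlation of row 1 and row 2
row_matrix = wide.T.corr()                     # all row-by-row correlations

# With only two columns, each row has just two points.
narrow = wide[['A', 'B']]
r_two_points = narrow.loc[1, :].corr(narrow.loc[2, :])

print(r_rows, row_matrix.loc[1, 2], r_two_points)
```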

Dr.DOOM 2018-04-03 22:58:18

I tried your suggestion; however, the computation still returns 1 even after changing a value in column B using df.loc[2, 'B'] = 4.5. Maybe I'm just confused about the computation.

Cleb 2018-04-03 23:04:43

@Dr.DOOM: Difficult to help as I don't know your code. Do I understand correctly that my example from above returns 1 in your case instead of 0.99586?

Adrian Keister 2019-08-16 01:48:11

@Cleb: Well, in the context in which I am working, every higher-level multi-column index has identical sub-layers. See this question for what I am trying to do: stackoverflow.com/questions/57513002/…
