I am trying to use two data frames for a simple lookup with pandas. I have a main master data frame (left) and a lookup data frame (right). I want to left join them on the matching integer code and return the item title from item_df.
I found a solution using a key:value pair idea, but it seems cumbersome. My idea is to merge the data frames using col3 and name as the key columns and keep the value I want from the right frame, which is title. I then drop the key column I joined on so that all I have left is the value. Now let's say I want to do this several times with my own manual naming conventions. For this I use rename to rename the value column that I merged in, then repeat the merge and rename the next joined column to something like second_title (see example below).
Is there a less cumbersome way to perform this repeated operation without constantly dropping the extra columns that are merged in and renaming the new column between each merge step?
Example code below:
import pandas as pd
master_dict: dict = {'col1': [3,4,8,10], 'col2': [5,6,9,10], 'col3': [50,55,59,60]}
master_df: pd.DataFrame = pd.DataFrame(master_dict)
item_dict: dict = {'name': [55,59,50,5,6,7], 'title': ['p1','p2','p3','p4','p5','p6']}
item_df: pd.DataFrame = pd.DataFrame(item_dict)
print(master_df.head())
col1 col2 col3
0 3 5 50
1 4 6 55
2 8 9 59
3 10 10 60
print(item_df.head())
name title
0 55 p1
1 59 p2
2 50 p3
3 5 p4
4 6 p5
# merge on col3 and name
combined_df = pd.merge(master_df, item_df, how = 'left', left_on = 'col3', right_on = 'name')
# rename title to "first_title"
combined_df.rename(columns = {'title':'first_title'}, inplace = True)
combined_df.drop(columns = ['name'], inplace = True) # remove 'name' column that was joined in from right frame
# repeat operation for "second_title"
combined_df = pd.merge(combined_df, item_df, how = 'left', left_on = 'col2', right_on = 'name')
combined_df.rename(columns = {'title': 'second_title'}, inplace = True)
combined_df.drop(columns = ['name'], inplace = True)
print(combined_df.head())
col1 col2 col3 first_title second_title
0 3 5 50 p3 p4
1 4 6 55 p1 p5
2 8 9 59 p2 NaN
3 10 10 60 NaN NaN
We could use your key:value mapping with the map function:
The code below builds, for col3 and col2 respectively, a dictionary from the item_df rows whose name value appears in that master_df column, mapping name to title:
# keep the item_df rows whose name appears in master_df.col3, then
# zip the name and title columns together into a lookup dictionary
col3 = dict(zip(*(value for _, value in
                  item_df[item_df.name.isin(master_df.col3)].items())))

# same idea for the values that appear in master_df.col2
col2 = dict(zip(*(value for _, value in
                  item_df[item_df.name.isin(master_df.col2)].items())))
col3
{55: 'p1', 59: 'p2', 50: 'p3'}
col2
{5: 'p4', 6: 'p5'}
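If the zip(*( ... )) construction looks opaque, an equivalent and more explicit way to build the same dictionary would be something like the sketch below (the filtered variable name is just for illustration, not part of the original code):
# keep the item_df rows whose name appears in master_df.col3
filtered = item_df[item_df['name'].isin(master_df['col3'])]
# pair each remaining name with its matching title
col3 = dict(zip(filtered['name'], filtered['title']))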
Next, use assign to create the first_title and second_title columns:
master_df.assign(
first_title=master_df.col3.map(col3),
second_title=master_df.col2.map(col2)
)
col1 col2 col3 first_title second_title
0 3 5 50 p3 p4
1 4 6 55 p1 p5
2 8 9 59 p2 NaN
3 10 10 60 NaN NaN
UPDATE
I thought about your comment asking for a single dictionary, and it is achievable with a Series. This greatly reduces the bloated code I shared earlier. In this case, we convert item_df to a Series and map it against each relevant column:
item_df = item_df.set_index("name").loc[:, "title"]
item_df
name
55 p1
59 p2
50 p3
5 p4
6 p5
7 p6
Name: title, dtype: object
Now create your specific columns using the assign function:
master_df.assign(first_title=master_df.col3.map(item_df),
second_title=master_df.col2.map(item_df)
)
col1 col2 col3 first_title second_title
0 3 5 50 p3 p4
1 4 6 55 p1 p5
2 8 9 59 p2 NaN
3 10 10 60 NaN NaN
Much simpler and straight to the point.
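If you end up with more than two lookups, one possible generalization (a sketch under my own naming assumptions, not part of the original answer) is to drive assign from a small mapping of new column names to the master_df columns they should be matched on, reusing the item_df Series built above:
# map each new column name to the master_df column it is looked up from
lookups = {'first_title': 'col3', 'second_title': 'col2'}
combined_df = master_df.assign(
    **{new_col: master_df[src_col].map(item_df)
       for new_col, src_col in lookups.items()}
)
Adding another lookup is then just one more entry in lookups.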
So to make sure I understand: the idea is to create "lookup tables" from item_df where the keys exist in the original frame, and then use those dictionaries explicitly to look up the matching values and map them to new columns in the final dataframe? Is it efficient to make all those lookup dictionaries prior to mapping? Or is there a way to make one super dictionary that works for all key lookups?
Your values are in different columns, so you have to make individual dictionaries. As for efficiency, you could run the code and compare with your present solution to see if it is better. As for one super dictionary for all key lookups, I'm not so sure. How would you go about it?
@Coldchain9, made an update to my answer, with a much simpler code.
Excellent! Just tried it out and I much prefer this over the multi-dictionary method. Thank you a ton for your help! :) Just to confirm that I understand: you use set_index() to make name the new index of the resulting pd.Series. So when .map() is used later, it maps the value from col3 to the proper index value of item_df and returns the single value that matches (since it is a single-column Series)? Is that how .map() works in this case?
Yes, that is how map works. Had to read the documentation again :)
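A minimal sketch illustrating that index alignment (the values below are made up for the example):
import pandas as pd

codes = pd.Series([50, 55, 99])                   # values to look up
titles = pd.Series(['p3', 'p1'], index=[50, 55])  # the index acts as the key
print(codes.map(titles))
# 0     p3
# 1     p1
# 2    NaN    <- 99 has no matching index label, so map returns NaN
# dtype: object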