This is my first time asking a question on here, so apologies if I am doing something wrong.
I am looking to create some sort of dataframe/dict/list where I can check if the ID in one column has seen a specific value in another column before.
For example for one pandas dataframe like this (90 million rows):
ID Another_ID
1 10
1 20
2 50
3 10
3 20
4 30
And another like this(10 million rows):
ID Another_ID
1 30
2 30
2 50
2 20
4 30
5 70
I want to end up with a third column that is like this:
ID Another_ID seen_before
1 30 0
2 30 0
2 50 1
2 20 0
4 30 1
5 20 0
I am looking for a memory efficient but quick way to do this, any ideas? Thanks!
Merge is a good idea, here, you want to merge on both columns:
df1['seen_before'] = 1
df2.merge(df1, on=['ID', 'Another_ID'], how='left')
Output:
ID Another_ID seen_before
0 1 30 NaN
1 2 30 NaN
2 2 50 1.0
3 2 20 NaN
4 4 30 1.0
5 5 70 NaN
Note: this assumes that df1
has no duplicates. If you are not sure about this, replace df1
with df1.drop_duplicates()
in merge
.