I want to make relationship between values by their Name based on below rules:
1- I have a CSV file (with more than 100000 rows) that consists of lots of values, I shared some examples as below:
Name:
A02-father
A03-father
A04-father
A05-father
A07-father
A08-father
A09-father
A17-father
A18-father
A20-father
A02-SA-A03-SA
A02-SA-A04-SA
A03-SA-A02-SA
A03-SA-A05-SA
A03-SA-A17-SA
A04-SA-A02-SA
A04-SA-A09-SA
A05-SA-A03-SA
A09-SA-A04-SA
A09-SA-A20-SA
A17-SA-A03-SA
A17-SA-A18-SA
A18-SA-A17-SA
A20-SA-A09-SA
A05-NA
B02-Father
B04-Father
B06-Father
B02-SA-B04-SA
B04-SA-BO2-SA
B04-SA-B06-SA
B06-SA-B04-SA
B06-NA
2- Now I have another CSV file which let me know from which value I should start? in this case the value is A03-father & B02-father & ... which dont have any influence on each other and they all have seperate path to go, so for each path we will start from mentioned start point. father.csv A03-father B02-father ....
3- Based on the naming I want to make the relationships, As A03-Father has been determined as Father I should check for any value which has been started with A03.(All of them are A0's babies.) Also as B02 is father, we will check for any value which starts with B02. (B02-SA-B04-SA)
4- Now If I find out A03-SA-A02-SA , this is A03's baby. I find out A03-SA-A05-SA , this is A03's baby. I find out A03-SA-A17-SA , this is A03's baby.
and after that I must check any node which starts with A02 & A05 & A17: As you see A02-Father exists so it is Father and now we will search for any string which starts with A02 and doesn't have A03 which has been detected as Father(It must be ignored)
This must be checked till end of values which exist in the CSV file. As you see I should check the path based on name (REGEX) and should go forward till end of path.
The expected result:
Father Baby
A03-father A03-SA-A02-SA
A03-father A03-SA-A05-SA
A03-father A03-SA-A17-SA
A02-father A02-SA-A04-SA
A05-father A05-NA
A17-father A17-SA-A18-SA
A04-father A04-SA-A09-SA
A02-father A02-SA-A04-SA
A09-father A09-SA-A20-SA
B02-father B02-SA-B04-SA
B04-father B04-SA-B06-SA
B06-father B06-NA
I have coded it as below with pandas:
import pandas as pd
import numpy as np
import re
#Read the file which consists of all Values
df = pd.read_csv("C:\\total.csv")
#Read the file which let me know who is father
Fa = pd.read_csv("C:\\Father.csv")
#Get the first part of Father which is A0
Fa['sub'] = Fa['Name'].str.extract(r'(\w+\s*)', expand=False)
r2 = []
#check in all the csv file and find anything which starts with A0 and is not Father
for f in Fa['sub']:
baby=(df[df['Name'].str.startswith(f) & ~df['Name'].str.contains('Father')])
baby['sub'] = bay['Name'].str.extract(r'(\w+\s*)', expand=False)
r1= pd.merge(Fa, baby, left_on='sub', right_on='sub',suffixes=('_f', '_c'))
r2.append(result1)
out_df = pd.concat(result2)
out_df= out_df.replace(np.nan, '', regex=True)
#find A0-N-A2-M and A0-N-A4-M
out_df.to_csv('C:\\child1.csv')
#check in all the csv file and find anything which starts with the second part of child1 which is A2 and A4
out_df["baby2"] = out_df['Name_baby'].str.extract(r'^(?:[^-]*-){2}\s*([^-]+)', expand=False)
baby3= out_df["baby2"]
r4 = []
for f in out_df["baby2"]:
#I want to exclude A0 which has been detected.
l = ['A0']
regstr = '|'.join(l)
baby1=(df[df['Name'].str.startswith(f) & ~df['Name'].str.contains(regstr)])
baby1['sub'] = baby1['Name'].str.extract(r'(\w+\s*)', expand=False)
r3= pd.merge(baby3, baby1, left_on='baby2', right_on='sub',suffixes=('_f', '_c'))
r4.append(r3)
out2_df = pd.concat(r4)
out2_df.to_csv('C:\\child2.csv')
I want to put below code in a loop and go through the file and check it, based on naming process and detect other fathers and babies till it finished. however this code is not customized and doesn't have the exact result as i expected. my question is about how to make the loop?
I should go through the path and also consider regstr
value for any string.
#check in all the csv file and find anything which starts with the second part of child1 which is A2 and A4
out_df["baby2"] = out_df['Name_baby'].str.extract(r'^(?:[^-]*-){2}\s*([^-]+)', expand=False)
baby3= out_df["baby2"]
r4 = []
for f in out_df["baby2"]:
#I want to exclude A0 which has been detected.
l = ['A0']
regstr = '|'.join(l)
baby1=(df[df['Name'].str.startswith(f) & ~df['Name'].str.contains(regstr)])
baby1['sub'] = baby1['Name'].str.extract(r'(\w+\s*)', expand=False)
r3= pd.merge(baby3, baby1, left_on='baby2', right_on='sub',suffixes=('_f', '_c'))
r4.append(r3)
out2_df = pd.concat(r4)
out2_df.to_csv('C:\\child2.csv')
Start with import collections
(will be needed soon).
I assume that you have already read df and Fa DataFrames.
The first part of my code is to create children Series (index - parent, value - child):
isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]
key = []; val = []
for fath in df[isFather].Name:
prefix = fath.split('-')[0]
for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
key.append(prefix)
val.append(child)
children = pd.Series(val, index=key)
Print children to see the result.
The second part is to create the actual result, starting from each starting points in Fa:
nodes = collections.deque()
father = []; baby = [] # Containers for source data
# Loop for each starting point
for startNode in Fa.Name.str.split('-', expand=True)[0]:
nodes.append(startNode)
while nodes:
node = nodes.popleft() # Take node name from the queue
# Children of this node
myChildren = children[children.index == node]
# Process children (ind - father, val - child)
for ind, val in myChildren.items():
parts = val.split('-') # Parts of child name
# Child "actual" name (if exists)
val_2 = parts[2] if len(parts) >= 3 else ''
if val_2 not in father: # val_2 not "visited" before
# Add father / child name to containers
father.append(ind)
baby.append(val)
if len(val_2) > 0:
nodes.append(val_2) # Add to the queue, to be processe later
# Drop rows for "node" from "children" (if any exists)
if (children.index == node).sum() > 0:
children.drop(node, inplace=True)
# Convert to a DataFrame
result = pd.DataFrame({'Father': father, 'Baby': baby})
result.Father += '-father' # Add "-father" to "bare" names
I added -father with lower case "f", but I think this is not much significant detail.
The result, for your data sample, is:
Father Baby
0 A03-father A03-SA-A02-SA
1 A03-father A03-SA-A05-SA
2 A03-father A03-SA-A17-SA
3 A02-father A02-SA-A04-SA
4 A05-father A05-NA
5 A17-father A17-SA-A18-SA
6 A04-father A04-SA-A09-SA
7 A09-father A09-SA-A20-SA
8 B02-father B02-SA-B04-SA
9 B04-father B04-SA-B06-SA
10 B06-father B06-NA
And two remarks concerning your data sample:
A02-father A02-SA-A04-SA
in your expected result is doubled.
I assume it should occur only once.
This works pretty fine for one record in Father.csv, I have about 200 records in Father.csv and the code is just checking the first father.
Your input data sample (Fa) contains just one row. Please indicate what you expect when it contains more rows. The first assumption can be to run my code for each starting point and concatenate results. But maybe what has been found before (for previous starting points) should have some influence on how the current cycle works? Provide an example of Fa with e.g. 2 starting points, the expected result and some explanation.
Hopefully they don't have any influence on each other. I updated my examples and needful explanation. I think Fa.iloc[0,0] must be changed and include all the values which exist in fathers.csv
This works pretty fine. I dont know how to say thanks. I really appreciate if you just add comment to code. so i can understand it better. Thanks
I added some comments. You can also add some "trace" printouts and run the code on some limited source data.