Finding relationships between values based on their name in Python with Pandas

Valdi_Bo 2020-12-01 19:15:10

Start with import collections (it will be needed in the second part). The code also assumes the usual import pandas as pd.

I assume that you have already read both DataFrames, df and Fa.

The first part of my code creates the children Series (index: parent, value: child):

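# Rows whose Name contains "-father" (case-insensitive) are fathers
isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]  # All remaining rows are children
key = []; val = []
for fath in df[isFather].Name:
    prefix = fath.split('-')[0]  # "Bare" father name (text before the first "-")
    # Children of this father: names starting with its prefix
    for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
        key.append(prefix)
        val.append(child)
children = pd.Series(val, index=key)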

Print children to see the result.
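As a quick check of this first part, here is a minimal sketch on a tiny made-up sample. The five names below are an assumption chosen for illustration, not your real data. It shows that the index holds the bare parent prefix and the values hold the full child names.

```python
import pandas as pd

# Hypothetical mini-sample: two father rows and three child rows
df = pd.DataFrame({'Name': ['A03-Father', 'A02-Father',
                            'A03-SA-A02-SA', 'A03-SA-A05-SA', 'A02-SA-A04-SA']})

# case=False makes the test match both "-Father" and "-father"
isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]

key = []; val = []
for fath in df[isFather].Name:
    prefix = fath.split('-')[0]  # "A03-Father" -> "A03"
    for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
        key.append(prefix)
        val.append(child)
children = pd.Series(val, index=key)

print(children)
# All children of one parent (the index label "A03" occurs twice)
print(children[children.index == 'A03'].tolist())
```

Because a parent label can repeat in the index, selecting with a boolean mask always gives back a Series, even when the parent has only one child.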

The second part creates the actual result, starting from each starting point in Fa:

nodes = collections.deque()
father = []; baby = []  # Containers for source data
# Loop for each starting point
for startNode in Fa.Name.str.split('-', expand=True)[0]:
    nodes.append(startNode)
    while nodes:
        node = nodes.popleft()  # Take node name from the queue
        # Children of this node
        myChildren = children[children.index == node]
        # Process children (ind - father, val - child)
        for ind, val in myChildren.items():
            parts = val.split('-')  # Parts of child name
            # Child "actual" name (if exists)
            val_2 = parts[2] if len(parts) >= 3 else ''
            if val_2 not in father:  # val_2 not "visited" before
                # Add father / child name to containers
                father.append(ind)
                baby.append(val)
                if len(val_2) > 0:
                    nodes.append(val_2)  # Add to the queue, to be processed later
        # Drop rows for "node" from "children" (if any exists)
        if (children.index == node).sum() > 0:
            children.drop(node, inplace=True)
# Convert to a DataFrame
result = pd.DataFrame({'Father': father, 'Baby': baby})
result.Father += '-father'    # Add "-father" to "bare" names

I added -father with a lower-case "f", but I don't think this detail matters much.

The result, for your data sample, is:

        Father           Baby
0   A03-father  A03-SA-A02-SA
1   A03-father  A03-SA-A05-SA
2   A03-father  A03-SA-A17-SA
3   A02-father  A02-SA-A04-SA
4   A05-father         A05-NA
5   A17-father  A17-SA-A18-SA
6   A04-father  A04-SA-A09-SA
7   A09-father  A09-SA-A20-SA
8   B02-father  B02-SA-B04-SA
9   B04-father  B04-SA-B06-SA
10  B06-father         B06-NA
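To reproduce the table above without the original files, both parts can be wrapped in one function and run on in-memory data. This is only a sketch: the function name build_result is hypothetical, and the contents of df and Fa are an assumption reconstructed from the table (including the two starting points A03 and B02 and the capital "F" in "-Father"). Instead of children.drop, processed rows are filtered out with a boolean mask, which has the same effect here.

```python
import collections
import pandas as pd

def build_result(df, Fa):
    """Return Father/Baby pairs reachable from the starting points in Fa."""
    # Part 1: Series of children (index: father prefix, value: child name)
    isFather = df.Name.str.contains('-father', case=False)
    dfChildren = df[~isFather]
    key, val = [], []
    for fath in df[isFather].Name:
        prefix = fath.split('-')[0]
        for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
            key.append(prefix)
            val.append(child)
    children = pd.Series(val, index=key, dtype=object)

    # Part 2: breadth-first walk from each starting point
    nodes = collections.deque()
    father, baby = [], []
    for startNode in Fa.Name.str.split('-', expand=True)[0]:
        nodes.append(startNode)
        while nodes:
            node = nodes.popleft()
            for ind, v in children[children.index == node].items():
                parts = v.split('-')
                val_2 = parts[2] if len(parts) >= 3 else ''
                if val_2 not in father:  # val_2 not "visited" before
                    father.append(ind)
                    baby.append(v)
                    if val_2:
                        nodes.append(val_2)
            # Forget rows already processed for this node
            children = children[children.index != node]
    result = pd.DataFrame({'Father': father, 'Baby': baby})
    result['Father'] = result['Father'] + '-father'
    return result

# Assumed sample data, reconstructed from the result table
df = pd.DataFrame({'Name': [
    'A03-Father', 'A02-Father', 'A05-Father', 'A17-Father', 'A04-Father',
    'A09-Father', 'B02-Father', 'B04-Father', 'B06-Father',
    'A03-SA-A02-SA', 'A03-SA-A05-SA', 'A03-SA-A17-SA', 'A02-SA-A04-SA',
    'A05-NA', 'A17-SA-A18-SA', 'A04-SA-A09-SA', 'A09-SA-A20-SA',
    'B02-SA-B04-SA', 'B04-SA-B06-SA', 'B06-NA']})
Fa = pd.DataFrame({'Name': ['A03-Father', 'B02-Father']})

result = build_result(df, Fa)
print(result)
```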

And two remarks concerning your data sample:

You wrote B04-SA-B02-SA with capital O (a letter) instead of 0 (zero). I corrected it in my source data.
Row A02-father A02-SA-A04-SA in your expected result is doubled. I assume it should occur only once.

Sara Daniel 2020-11-30 13:36:41

This works fine for one record in Father.csv, but I have about 200 records in Father.csv and the code only checks the first father.

Valdi_Bo 2020-11-30 16:41:16

Your input data sample (Fa) contains just one row. Please indicate what you expect when it contains more rows. The first assumption can be to run my code for each starting point and concatenate results. But maybe what has been found before (for previous starting points) should have some influence on how the current cycle works? Provide an example of Fa with e.g. 2 starting points, the expected result and some explanation.
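A minimal sketch of that first assumption, in which each starting point is walked independently and the partial results are concatenated with pd.concat. The helper name walk_from and the small children Series are hypothetical. Unlike the code in the answer, this variant records every father/child pair it meets and uses a per-run set only to avoid expanding the same node twice.

```python
import collections
import pandas as pd

def walk_from(start, children):
    """Breadth-first walk from one starting point; nothing is shared between runs."""
    seen = {start}
    queue = collections.deque([start])
    rows = []
    while queue:
        node = queue.popleft()
        for child in children[children.index == node]:
            rows.append({'Father': node + '-father', 'Baby': child})
            parts = child.split('-')
            nxt = parts[2] if len(parts) >= 3 else ''
            if nxt and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return pd.DataFrame(rows, columns=['Father', 'Baby'])

# Hypothetical children Series (index: parent, value: child)
children = pd.Series(
    ['A03-SA-A02-SA', 'A02-SA-A04-SA', 'B02-SA-B04-SA', 'B04-NA'],
    index=['A03', 'A02', 'B02', 'B04'])

starts = ['A03', 'B02']
result = pd.concat([walk_from(s, children) for s in starts], ignore_index=True)
print(result)
```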

Sara Daniel 2020-11-30 17:50:19

Hopefully they don't have any influence on each other. I updated my examples and added the needed explanation. I think Fa.iloc[0,0] must be changed to include all the values that exist in fathers.csv.

Sara Daniel 2020-12-01 08:57:50

This works fine. I don't know how to say thanks. I would really appreciate it if you could add comments to the code, so I can understand it better. Thanks.

Valdi_Bo 2020-12-01 11:17:28

I added some comments. You can also add some "trace" printouts and run the code on some limited source data.
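For example, a trace on limited data could look like the sketch below. The three-name chain is made up for illustration, and the loop is a simplified version of the queue processing (it only follows the chain and prints what it sees), not the full code from the answer.

```python
import collections
import pandas as pd

# Hypothetical limited data: one chain of three names
children = pd.Series(['A03-SA-A02-SA', 'A02-SA-A04-SA', 'A04-NA'],
                     index=['A03', 'A02', 'A04'])

nodes = collections.deque(['A03'])
visited = {'A03'}  # Guards against endless loops on cyclic data
trace = []         # Keep the messages, so they can be inspected later
while nodes:
    node = nodes.popleft()
    myChildren = children[children.index == node]
    msg = f'node={node} children={myChildren.tolist()} queue={list(nodes)}'
    print(msg)     # "Trace" printout for the current step
    trace.append(msg)
    for val in myChildren:
        parts = val.split('-')
        if len(parts) >= 3 and parts[2] not in visited:
            visited.add(parts[2])
            nodes.append(parts[2])
```

Each printed line shows the node taken from the queue, its children and what is still waiting in the queue, which makes it easier to follow the order of processing.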
