Warm tip: This article is reproduced from serverfault.com, please click

其他-使用Panda在Python中根据值的名称查找值之间的关系

(其他 - Finding relationships between values based on their name in Python with Panda)

发布于 2020-11-28 06:17:39

我想根据以下规则通过值的名称建立值之间的关系:

1-我有一个包含许多值的CSV文件(具有100000多行),我共享了一些示例,如下所示:

    Name:
A02-father
A03-father
A04-father
A05-father
A07-father
A08-father
A09-father
A17-father
A18-father
A20-father
A02-SA-A03-SA
A02-SA-A04-SA
A03-SA-A02-SA
A03-SA-A05-SA
A03-SA-A17-SA
A04-SA-A02-SA
A04-SA-A09-SA
A05-SA-A03-SA
A09-SA-A04-SA
A09-SA-A20-SA
A17-SA-A03-SA
A17-SA-A18-SA
A18-SA-A17-SA
A20-SA-A09-SA
A05-NA
B02-Father
B04-Father
B06-Father
B02-SA-B04-SA
B04-SA-BO2-SA
B04-SA-B06-SA
B06-SA-B04-SA
B06-NA

2-现在我还有另一个CSV文件,可以让我知道应该从哪个值开始?在这种情况下,值是A03-father和B02-father和...,它们彼此之间没有任何影响,并且它们都有各自独立的路径,因此对于每种路径,我们将从提到的起点开始。父亲.csv A03父亲B02父亲....

3-基于命名,我想建立关系,因为确定A03-父亲为父亲,所以我应该检查任何以A03开头的值(它们都是A0的婴儿。)另外,由于B02是父亲,我们将检查以B02开头的任何值。(B02-SA-B04-SA)

4-现在,如果我找到A03-SA- A02 -SA,这是A03的孩子。我发现A03-SA- A05 -SA,这是A03的孩子。我发现A03-SA- A17 -SA,这是A03的孩子。

之后,我必须检查以A02&A05&A17开头的任何节点:如你所见,A02-父亲存在,所以它是父亲,现在我们将搜索以A02开头且没有A03的任何字符串作为父亲(必须忽略)

必须检查它直到CSV文件中存在的值的结尾。如你所见,我应该检查基于名称(REGEX)的路径,并且应该前进到路径结尾。

预期结果:

    Father      Baby
A03-father   A03-SA-A02-SA
A03-father   A03-SA-A05-SA
A03-father   A03-SA-A17-SA
A02-father   A02-SA-A04-SA
A05-father   A05-NA
A17-father   A17-SA-A18-SA
A04-father   A04-SA-A09-SA
A02-father   A02-SA-A04-SA
A09-father   A09-SA-A20-SA
B02-father   B02-SA-B04-SA
B04-father   B04-SA-B06-SA
B06-father   B06-NA

我已经用大 pandas 对它进行了如下编码:

import pandas as pd
import numpy as np
import re

#Read the file which consists of all Values
df = pd.read_csv("C:\\total.csv")


#Read the file which let me know who is father
Fa = pd.read_csv("C:\\Father.csv")

#Get the first part of Father which is A0
Fa['sub'] = Fa['Name'].str.extract(r'(\w+\s*)', expand=False)
r2 = []

#check in all the csv file and find anything which starts with A0 and is not Father
for f in Fa['sub']:
     baby=(df[df['Name'].str.startswith(f) & ~df['Name'].str.contains('Father')])
     baby['sub'] = bay['Name'].str.extract(r'(\w+\s*)', expand=False)
     r1= pd.merge(Fa, baby, left_on='sub', right_on='sub',suffixes=('_f', '_c'))
     r2.append(result1)
out_df = pd.concat(result2)
out_df= out_df.replace(np.nan, '', regex=True)
#find A0-N-A2-M and A0-N-A4-M
out_df.to_csv('C:\\child1.csv')



#check in all the csv file and find anything which starts with the second part of child1 which is A2 and A4
out_df["baby2"] = out_df['Name_baby'].str.extract(r'^(?:[^-]*-){2}\s*([^-]+)', expand=False)
baby3= out_df["baby2"]
r4 = []
for f in out_df["baby2"]:
    #I want to exclude A0 which has been detected.
     l = ['A0']  
     regstr = '|'.join(l)
     baby1=(df[df['Name'].str.startswith(f) & ~df['Name'].str.contains(regstr)])
     baby1['sub'] = baby1['Name'].str.extract(r'(\w+\s*)', expand=False)

     r3= pd.merge(baby3, baby1, left_on='baby2', right_on='sub',suffixes=('_f', '_c'))
     r4.append(r3)
out2_df = pd.concat(r4)
out2_df.to_csv('C:\\child2.csv')

我想将以下代码放入循环中,并根据命名过程检查文件并进行检查,并检测其他父亲和婴儿,直到完成为止。但是,此代码不是自定义的,并且没有我期望的确切结果。我的问题是关于如何进行循环?

我应该通过路径,还要考虑regstr任何字符串的值。

#check in all the csv file and find anything which starts with the second part of child1 which is A2 and A4

out_df["baby2"] = out_df['Name_baby'].str.extract(r'^(?:[^-]*-){2}\s*([^-]+)', expand=False)
baby3= out_df["baby2"]
r4 = []
for f in out_df["baby2"]:
    #I want to exclude A0 which has been detected. 
     l = ['A0']  
     regstr = '|'.join(l)
     baby1=(df[df['Name'].str.startswith(f) & ~df['Name'].str.contains(regstr)])
     baby1['sub'] = baby1['Name'].str.extract(r'(\w+\s*)', expand=False)

     r3= pd.merge(baby3, baby1, left_on='baby2', right_on='sub',suffixes=('_f', '_c'))
     r4.append(r3)
out2_df = pd.concat(r4)
out2_df.to_csv('C:\\child2.csv')
Questioner
Sara Daniel
Viewed
11
Valdi_Bo 2020-12-01 19:15:10

开始import collections(很快将需要)。

我假设你已经阅读了dfFa DataFrames。

我的代码的第一部分是创建级数(索引-父级,值-子级):

isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]
key = []; val = []
for fath in df[isFather].Name:
    prefix = fath.split('-')[0]
    for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
        key.append(prefix)
        val.append(child)
children = pd.Series(val, index=key)

打印孩子以查看结果。

第二部分是从Fa中的每个起点开始创建实际结果

nodes = collections.deque()
father = []; baby = []  # Containers for source data
# Loop for each starting point
for startNode in Fa.Name.str.split('-', expand=True)[0]:
    nodes.append(startNode)
    while nodes:
        node = nodes.popleft()  # Take node name from the queue
        # Children of this node
        myChildren = children[children.index == node]
        # Process children (ind - father, val - child)
        for ind, val in myChildren.items():
            parts = val.split('-')  # Parts of child name
            # Child "actual" name (if exists)
            val_2 = parts[2] if len(parts) >= 3 else ''
            if val_2 not in father:  # val_2 not "visited" before
                # Add father / child name to containers
                father.append(ind)
                baby.append(val)
                if len(val_2) > 0:
                    nodes.append(val_2)  # Add to the queue, to be processe later
        # Drop rows for "node" from "children" (if any exists)
        if (children.index == node).sum() > 0:
            children.drop(node, inplace=True)
# Convert to a DataFrame
result = pd.DataFrame({'Father': father, 'Baby': baby})
result.Father += '-father'    # Add "-father" to "bare" names

用小写的“ f”添加了-father,但是我认为这并不是很多重要的细节。

对于你的数据样本,结果是:

        Father           Baby
0   A03-father  A03-SA-A02-SA
1   A03-father  A03-SA-A05-SA
2   A03-father  A03-SA-A17-SA
3   A02-father  A02-SA-A04-SA
4   A05-father         A05-NA
5   A17-father  A17-SA-A18-SA
6   A04-father  A04-SA-A09-SA
7   A09-father  A09-SA-A20-SA
8   B02-father  B02-SA-B04-SA
9   B04-father  B04-SA-B06-SA
10  B06-father         B06-NA

关于数据样本的两点评论:

  • 用大写的O(一个字母)而不是0 (零)写了B04-SA-B02-SA我在源数据中对其进行了更正。
  • A02-father A02-SA-A04-SA你的预期结果中的增加了一倍。我认为它应该只发生一次。