Finding relationships between values based on their name in Python with Pandas

Valdi_Bo 2020-12-01 19:15:10

Start with import collections (it will be needed soon).

I assume that you have already read the df and Fa DataFrames.

The first part of my code creates the children Series (index - father, value - child):

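# Rows whose name marks a father, e.g. "A03-father"
isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]  # All remaining rows are children
key = []; val = []  # Containers for the index (father) and the values (child)
for fath in df[isFather].Name:
    prefix = fath.split('-')[0]  # "Bare" father name, e.g. "A03"
    # Children are the non-father rows whose name starts with this prefix
    for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
        key.append(prefix)
        val.append(child)
children = pd.Series(val, index=key)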

Print children to see the result.
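As a quick check, here is a minimal sketch that runs this first part on a tiny sample. The seven names below are made up for illustration (an assumption, not the full source data); the logic is the same as in the code above.

```python
import pandas as pd

# Hypothetical, deliberately small sample (not the asker's full data)
df = pd.DataFrame({'Name': ['A03-father', 'A03-SA-A02-SA', 'A03-SA-A05-SA',
                            'A02-father', 'A02-SA-A04-SA',
                            'A05-father', 'A05-NA']})

isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]
key = []; val = []
for fath in df[isFather].Name:
    prefix = fath.split('-')[0]
    for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
        key.append(prefix)
        val.append(child)
children = pd.Series(val, index=key)

print(children.index.tolist())  # ['A03', 'A03', 'A02', 'A05']
print(children.tolist())
# ['A03-SA-A02-SA', 'A03-SA-A05-SA', 'A02-SA-A04-SA', 'A05-NA']
```

Note that the index contains repeated labels: a father with several children appears once per child, which is what the lookup children[children.index == node] in the second part relies on.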

The second part creates the actual result, starting from each starting point in Fa:

nodes = collections.deque()
father = []; baby = []  # Containers for source data
# Loop for each starting point
for startNode in Fa.Name.str.split('-', expand=True)[0]:
    nodes.append(startNode)
    while nodes:
        node = nodes.popleft()  # Take node name from the queue
        # Children of this node
        myChildren = children[children.index == node]
        # Process children (ind - father, val - child)
        for ind, val in myChildren.items():
            parts = val.split('-')  # Parts of child name
            # Child "actual" name (if exists)
            val_2 = parts[2] if len(parts) >= 3 else ''
            if val_2 not in father:  # val_2 not "visited" before
                # Add father / child name to containers
                father.append(ind)
                baby.append(val)
                if len(val_2) > 0:
                    nodes.append(val_2)  # Add to the queue, to be processed later
        # Drop rows for "node" from "children" (if any exists)
        if (children.index == node).sum() > 0:
            children.drop(node, inplace=True)
# Convert to a DataFrame
result = pd.DataFrame({'Father': father, 'Baby': baby})
result.Father += '-father'    # Add "-father" to "bare" names

I added -father with a lowercase "f", but I think this is not a very important detail.

For your data sample, the result is:

        Father           Baby
0   A03-father  A03-SA-A02-SA
1   A03-father  A03-SA-A05-SA
2   A03-father  A03-SA-A17-SA
3   A02-father  A02-SA-A04-SA
4   A05-father         A05-NA
5   A17-father  A17-SA-A18-SA
6   A04-father  A04-SA-A09-SA
7   A09-father  A09-SA-A20-SA
8   B02-father  B02-SA-B04-SA
9   B04-father  B04-SA-B06-SA
10  B06-father         B06-NA

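Two remarks concerning the data sample:

You wrote B04-SA-B02-SA with a capital O (a letter) instead of 0 (zero). I corrected it in my source data.
The row A02-father A02-SA-A04-SA occurs twice in your expected result. I think it should occur only once.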

Sara Daniel 2020-11-30 13:36:41

This works fine for a single record in Fathers.csv, but I have about 200 records in Fathers.csv, and the code only checks the first father.

Valdi_Bo 2020-11-30 16:41:16

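Your input data sample (Fa) contains only one row. Please state what you expect when it contains more rows. A first assumption could be to run my code for each starting point and concatenate the results. But maybe what was found earlier (for previous starting points) should have some influence on how the current loop works? Provide an example of Fa with 2 starting points, the expected result, and some explanation.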

Sara Daniel 2020-11-30 17:50:19

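Hopefully they have no influence on each other. I updated the example and added the necessary explanation. I think Fa.iloc[0,0] has to be changed so that all values present in Fathers.csv are included.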
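To illustrate the difference, here is a small sketch, assuming a hypothetical Fa with two starting points (the names first_only and all_starts are introduced only for this example). Fa.iloc[0, 0] yields a single start node, while splitting the whole Name column yields one start node per row, which is what the loop in the answer iterates over.

```python
import pandas as pd

# Hypothetical Fa with two starting points
Fa = pd.DataFrame({'Name': ['A03-father', 'B02-father']})

# Only the first row: a single start node
first_only = Fa.iloc[0, 0].split('-')[0]

# Every row: split each name on '-' and keep the first part
all_starts = Fa.Name.str.split('-', expand=True)[0].tolist()

print(first_only)   # A03
print(all_starts)   # ['A03', 'B02']
```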

Sara Daniel 2020-12-01 08:57:50

This works fine. I don't know what to say. I would appreciate it if you could add comments to the code, so that I can understand it better. Thanks.

Valdi_Bo 2020-12-01 11:17:28

I added some comments. You can also add some "trace" printouts and run the code on some limited source data.
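One possible way to do that is sketched below. The small df and the one-row Fa are made-up data for illustration (an assumption, not the real CSV files), and the print calls are the only additions to the code from the answer.

```python
import collections
import pandas as pd

# Hypothetical, deliberately limited source data
df = pd.DataFrame({'Name': ['A03-father', 'A03-SA-A02-SA', 'A03-SA-A05-SA',
                            'A02-father', 'A02-SA-A04-SA',
                            'A05-father', 'A05-NA']})
Fa = pd.DataFrame({'Name': ['A03-father']})

# Part 1: children Series (index - father, value - child)
isFather = df.Name.str.contains('-father', case=False)
dfChildren = df[~isFather]
key = []; val = []
for fath in df[isFather].Name:
    prefix = fath.split('-')[0]
    for child in dfChildren[dfChildren.Name.str.startswith(prefix)].Name:
        key.append(prefix)
        val.append(child)
children = pd.Series(val, index=key)

# Part 2: the same traversal as in the answer, with trace printouts
nodes = collections.deque()
father = []; baby = []
for startNode in Fa.Name.str.split('-', expand=True)[0]:
    print('start:', startNode)
    nodes.append(startNode)
    while nodes:
        node = nodes.popleft()
        myChildren = children[children.index == node]
        print('  node:', node, '-> children:', myChildren.tolist())
        for ind, val in myChildren.items():
            parts = val.split('-')
            val_2 = parts[2] if len(parts) >= 3 else ''
            if val_2 not in father:
                father.append(ind)
                baby.append(val)
                if len(val_2) > 0:
                    nodes.append(val_2)
                    print('    queued:', val_2)
        if (children.index == node).sum() > 0:
            children.drop(node, inplace=True)

result = pd.DataFrame({'Father': father, 'Baby': baby})
result.Father += '-father'
print(result)
```

With this sample the queue is processed in the order A03, A02, A05, A04. The last node, A04, has no children in the data, so it adds nothing to the result.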
