I am new to pandas.

I have a dataset that looks like this:

Date_1       Hour_1    id_1    Date_2       Hour_2    id_2    Date_3       Hour_3    id_3    
2019-12-04   00        ABC     2019-12-04   01        ABC     2019-12-04   02        ABC
2019-12-04   00        ABCD    2019-12-04   01        ABCD    2019-12-04   02        ABCD
2019-12-04   00        ABCDEF  2019-12-04   01        ABCDE   2019-12-04   02        ABCDEF
2019-12-04   03        ABCDEFG 2019-12-04   01        ABCDEFG 2019-12-04   02        ABCDEF
...

My goal is to check whether id_1 exists in id_2 and in id_3, and to create a new DataFrame with the following structure:

Date_1       Hour_1    id_1    Date_2       Hour_2    Exists   Date_3       Hour_3    Exists    
2019-12-04   00        ABC     2019-12-04   01        True     2019-12-04   02        True
2019-12-04   00        ABCD    2019-12-04   01        True     2019-12-04   02        True
2019-12-04   00        ABCDEF                         False    2019-12-04   02        True
2019-12-04   03        ABCDEFG 2019-12-04   01        True                            False

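The problem I have now is that I don't know how to include or exclude Date_2, Hour_2, Date_3, and Hour_3 depending on whether the checks against id_2 and id_3 are True or False.

When I build my DataFrame, I simply add every source of information (date, hour, id), so I end up with one large DataFrame containing Date_1-10, Hour_1-10, and id_1-10.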

final_export['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_export['Date from id_2'] = final_data['Date from id_2 other source']
final_export['Hour from id_2'] = final_data['Hour from id_2 other source']

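When I use the .isin() method, it flags the data correctly, but the hour and date on the same row stay the same whether or not there is a match. What I want is this: if id_1 exists in id_3, I get True along with its date and hour; if it does not, I get False and the date and hour are empty.

At the moment, when I use .isin(), the date and hour are not linked to the id_ value.

Please let me know if the problem is explained clearly.

Thanks for your advice.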

5 Answers

Something like this should work:

mask_id2 = df.id_1 == df.id_2
mask_id3 = df.id_1 == df.id_3

df.id_2 = mask_id2
df.id_3 = mask_id3

df.loc[~mask_id2, ['Date_2', 'Hour_2']] = ""
df.loc[~mask_id3, ['Date_3', 'Hour_3']] = ""

Output:

       Date_1  Hour_1     id_1      Date_2 Hour_2   id_2      Date_3 Hour_3   id_3
0  2019-12-04       0      ABC  2019-12-04      1   True  2019-12-04      2   True
1  2019-12-04       0     ABCD  2019-12-04      1   True  2019-12-04      2   True
2  2019-12-04       0   ABCDEF                     False  2019-12-04      2   True
3  2019-12-04       3  ABCDEFG  2019-12-04      1   True                     False
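
The question mentions columns running up to Date_10, Hour_10, and id_10, so the same masking idea could be applied in a loop. The following is only a sketch under assumptions: it rebuilds the three-source sample from the question, assumes every id_n column has matching Date_n and Hour_n columns, and uses the hypothetical names `suffixes` and `same_row_match`. Like the code above, it compares values on the same row.

```python
import pandas as pd

# Sample data from the question (three sources only)
df = pd.DataFrame({
    "Date_1": ["2019-12-04"] * 4,
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04"] * 4,
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04"] * 4,
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

# Collect the suffix of every id_n column except the reference column id_1
suffixes = [c.split("_", 1)[1] for c in df.columns
            if c.startswith("id_") and c != "id_1"]

for n in suffixes:
    # True where id_1 equals id_n on the same row
    same_row_match = df["id_1"] == df[f"id_{n}"]
    # Blank the date and hour of source n where there is no match
    df.loc[~same_row_match, [f"Date_{n}", f"Hour_{n}"]] = ""
    # Replace the id_n column with the boolean flag
    df[f"id_{n}"] = same_row_match

print(df)
```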
Answered 2019-12-04T10:38:04.723

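I suggest splitting the DataFrame into three DataFrames, each with an id, date, and hour, and then using the merge function to join them on the id value, assigning null values where an id does not exist.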
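
One possible reading of this suggestion, as a rough sketch rather than the answerer's exact method: split the sample data from the question into one frame per source, then left-merge the second and third frames onto the first using id_1 as the key. The names `first`, `second`, `third`, `Exists_2`, and `Exists_3` are made up for illustration, and dropping duplicate ids before merging is an added assumption so that the left frame keeps one row per id_1. Note that this matches ids anywhere in the column (like isin), not only on the same row.

```python
import pandas as pd

# Sample data from the question
df = pd.DataFrame({
    "Date_1": ["2019-12-04"] * 4,
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04"] * 4,
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04"] * 4,
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

# One frame per source; duplicates are dropped (assumption) so the
# left merge cannot multiply rows of the first frame
first = df[["Date_1", "Hour_1", "id_1"]]
second = df[["Date_2", "Hour_2", "id_2"]].drop_duplicates(subset="id_2")
third = df[["Date_3", "Hour_3", "id_3"]].drop_duplicates(subset="id_3")

# Left merges keep every id_1 row; unmatched ids get NaN date/hour
merged = first.merge(second, how="left", left_on="id_1", right_on="id_2")
merged = merged.merge(third, how="left", left_on="id_1", right_on="id_3")

# A non-null id_n after the merge means id_1 was found in that source
merged["Exists_2"] = merged["id_2"].notna()
merged["Exists_3"] = merged["id_3"].notna()
result = merged.drop(columns=["id_2", "id_3"])

print(result)
```

Here the unmatched dates and hours come out as NaN rather than empty strings; `result.fillna("")` would blank them if that is preferred.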

Answered 2019-12-04T10:30:45.323

Try:

import pandas as pd

df = pd.DataFrame({
    "Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_1": ["00", "00", "00", "03"],
    "id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_2": ["01", "01", "01", "01"],
    "id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],
    "Hour_3": ["02", "02", "02", "02"],
    "id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

ids = df["id_1"]
# You can choose whichever columns you want
df_1 = df.loc[df["id_1"].isin(ids), ["Date_1", "Hour_1", "id_1"]]
df_2 = df.loc[df["id_2"].isin(ids), ["Date_2", "Hour_2", "id_2"]]
df_3 = df.loc[df["id_3"].isin(ids), ["Date_3", "Hour_3", "id_3"]]

df_concat = pd.concat([df_1, df_2, df_3], axis=1)

Output:

       Date_1 Hour_1     id_1      Date_2 Hour_2     id_2      Date_3 Hour_3    id_3
0  2019-12-04     00      ABC  2019-12-04     01      ABC  2019-12-04     02     ABC
1  2019-12-04     00     ABCD  2019-12-04     01     ABCD  2019-12-04     02    ABCD
2  2019-12-04     00   ABCDEF         NaN    NaN      NaN  2019-12-04     02  ABCDEF
3  2019-12-04     03  ABCDEFG  2019-12-04     01  ABCDEFG  2019-12-04     02  ABCDEF
Answered 2019-12-04T10:32:17.510

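If I understand your question correctly, isin() is the wrong function to use: it checks whether the id_1 value appears anywhere in the id_2 (or id_3) column. It does not check whether id_1 is a substring of the id_2 value from the same row. Try the following code: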

import pandas as pd
testdf = pd.DataFrame({
    "hour_1": ["00", "01"],
    "id_1":["ABC", "ABC"], 
    "id_2":["ABCD", "AB"], 
})
testdf["exists_in_2"] = testdf['id_1'].isin(testdf['id_2'])
testdf

Fix that bit first:

eltwise_contains =  lambda frag, text: frag in text
testdf["exists_in_2"] = testdf[['id_1', 'id_2']].apply(lambda x : eltwise_contains(*x), axis = 1)

testdf

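Next, your actual question: set the date and hour to an empty string when the id_1 value does not exist in the id_2 (or id_3) value of the same row. We can use the same pattern as above: define a lambda that takes two inputs, then, on the next line, pull the two columns out of the DataFrame and apply a second lambda to that sub-DataFrame, which unpacks each row and passes its values to the original lambda.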

empty_string_if_false = lambda a_bool, val: val if a_bool else ""
testdf["hour_1"] = testdf[['exists_in_2', 'hour_1']].apply(lambda x : empty_string_if_false(*x), axis = 1)

testdf
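
As a possible alternative to the two apply calls, the same result could be sketched without row-wise apply, using a list comprehension over the zipped columns and `Series.where`. This is a swapped-in technique, not what the code above does internally, and it assumes both id columns contain only strings.

```python
import pandas as pd

testdf = pd.DataFrame({
    "hour_1": ["00", "01"],
    "id_1": ["ABC", "ABC"],
    "id_2": ["ABCD", "AB"],
})

# Row-wise substring test: is id_1 contained in the id_2 value of the same row?
testdf["exists_in_2"] = [
    frag in text for frag, text in zip(testdf["id_1"], testdf["id_2"])
]

# Keep hour_1 where the flag is True, otherwise use an empty string
testdf["hour_1"] = testdf["hour_1"].where(testdf["exists_in_2"], "")

print(testdf)
```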
Answered 2019-12-04T11:14:34.340

If Iron Hand's answer is not what you are after, this gives you the df in the format you have:

import pandas as pd

final_data = pd.DataFrame({
    "Date_1": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_1": ["00", "00", "00", "03"],"id_1": ["ABC", "ABCD", "ABCDEF", "ABCDEFG"],
    "Date_2": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_2": ["01", "01", "01", "01"],"id_2": ["ABC", "ABCD", "ABCDE", "ABCDEFG"],
    "Date_3": ["2019-12-04", "2019-12-04", "2019-12-04", "2019-12-04"],"Hour_3": ["02", "02", "02", "02"],"id_3": ["ABC", "ABCD", "ABCDEF", "ABCDEF"],
})

# Column-wide membership test: does each id_1 value appear anywhere in id_2 / id_3?
final_data['Exists in id_2'] = final_data['id_1'].isin(final_data['id_2'])
final_data['Exists in id_3'] = final_data['id_1'].isin(final_data['id_3'])

# Blank out the date and hour wherever the corresponding flag is False
final_data['Date_2'] = final_data.apply(lambda r: r['Date_2'] if r['Exists in id_2'] else '', axis=1)
final_data['Hour_2'] = final_data.apply(lambda r: r['Hour_2'] if r['Exists in id_2'] else '', axis=1)
final_data['Date_3'] = final_data.apply(lambda r: r['Date_3'] if r['Exists in id_3'] else '', axis=1)
final_data['Hour_3'] = final_data.apply(lambda r: r['Hour_3'] if r['Exists in id_3'] else '', axis=1)
print(final_data[['id_1', 'id_2', 'id_3', 'Hour_2', 'Date_2', 'Hour_3', 'Date_3']])

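This gives a df with all of the original information, except that Hour_2 and Date_2 are blanked when id_1 is not found in id_2, and likewise for id_3. The selected columns look like this: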

      id_1     id_2    id_3 Hour_2      Date_2 Hour_3      Date_3
0      ABC      ABC     ABC     01  2019-12-04     02  2019-12-04
1     ABCD     ABCD    ABCD     01  2019-12-04     02  2019-12-04
2   ABCDEF    ABCDE  ABCDEF                        02  2019-12-04
3  ABCDEFG  ABCDEFG  ABCDEF     01  2019-12-04              
Answered 2019-12-04T11:14:41.077