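Regarding pandas `df.merge()`: is there a convenient way to get summary statistics for a merge (number of matched keys, number of unmatched keys, and so on)? I know these statistics depend on the `how` argument (e.g. `how='inner'`), but with an inner join it would be handy to know how many rows were dropped. I could simply use: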

df = df_left.merge(df_right, on='common_column', how='inner')
set1 = set(df_left['common_column'].unique())
set2 = set(df_right['common_column'].unique())
set1.issubset(set2)   # True: no further analysis required
set2.issubset(set1)   # False
num_shared = len(set2.intersection(set1))
num_diff = len(set2.difference(set1))
# And so on ...

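But I suspect this may already be implemented. Have I missed it (i.e. something like `report=True`, which would return the merged DataFrame together with a report Series or DataFrame)?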

2 Answers

This is what I have been using so far. It is part of a function that reconciles data from one coding system to another.

import numpy as np
import pandas as pd

# Fragment of a larger function: data, concord, match_on and report come from the enclosing scope.
if report:
    # describe() of the key column on each side, joined on the statistic name
    report_df = data[match_on].describe().to_frame('left')
    report_df = report_df.merge(concord[match_on].describe().to_frame('right'), left_index=True, right_index=True)
    set_left = set(data[match_on])
    set_right = set(concord[match_on])
    # Is each side's key set contained in the other's?
    set_info = pd.DataFrame({'left': set_left.issubset(set_right), 'right': set_right.issubset(set_left)}, index=['subset'])
    # DataFrame.append was removed in pandas 2.0, so rows are added with pd.concat
    report_df = pd.concat([report_df, set_info])
    # Number of keys present on one side only
    set_info = pd.DataFrame({'left': len(set_left.difference(set_right)), 'right': len(set_right.difference(set_left))}, index=['differences'])
    report_df = pd.concat([report_df, set_info])
    # Up to 5 example differences per side (the first 5 in set order, not a random sample), padded with NaN
    left_diff = (list(set_left.difference(set_right)) + [np.nan] * 5)[:5]
    right_diff = (list(set_right.difference(set_left)) + [np.nan] * 5)[:5]
    set_info = pd.DataFrame({'left': left_diff, 'right': right_diff}, index=['diff1', 'diff2', 'diff3', 'diff4', 'diff5'])
    report_df = pd.concat([report_df, set_info])

[Image: sample report]

answered 2013-06-17T01:14:30.847

Try this function. You can then pass your arguments to it like this:

df = merge_like_stata(df1, df2, mergevars)

Function definition:

import pandas as pd

def merge_like_stata(master, using, mergevars):
    # Tag each side on a copy so the caller's frames are not modified
    master = master.assign(_master_merge_='master')
    using = using.assign(_using_merge_='using')
    df = pd.merge(master, using, on=mergevars, how='outer')
    in_master = df['_master_merge_'].notna()
    in_using = df['_using_merge_'].notna()
    # Stata's numbering: 1 = master only, 2 = using only, 3 = matched
    df['_merge'] = '3 - Matched'
    df.loc[~in_using, '_merge'] = '1 - Master Only'
    df.loc[~in_master, '_merge'] = '2 - Using Only'
    # Print the row count per category (with a total), then return the merged frame
    df['column'] = 'Count'
    print(pd.crosstab(df['_merge'], df['column'], margins=True))
    return df.drop(['_master_merge_', '_using_merge_', 'column'], axis=1)
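
For what it's worth, newer pandas versions cover most of this out of the box: `merge` accepts `indicator=True` (available since pandas 0.17, as far as I know), which adds a categorical `_merge` column with the values `left_only`, `right_only` and `both`. A minimal sketch, where the two frames and the `key` column are made-up illustration data, not anything from the question:

```python
import pandas as pd

# Hypothetical example data; only the 'key' column matters here
df_left = pd.DataFrame({'key': [1, 2, 3, 4], 'a': list('wxyz')})
df_right = pd.DataFrame({'key': [3, 4, 5], 'b': [30, 40, 50]})

# The outer join keeps every row; indicator=True records which side each row came from
merged = df_left.merge(df_right, on='key', how='outer', indicator=True)

# Row counts per category: this is the merge "report"
report = merged['_merge'].value_counts()
print(report)

# Recover the inner join by keeping only the matched rows
inner = merged[merged['_merge'] == 'both'].drop(columns='_merge')
```

With `how='inner'` the indicator is always `both`, so the outer join is the one to use if you want to count what an inner join would drop.
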
answered 2018-04-17T22:47:22.937