python - How do I compare two different CSV files using Python?

Question

I want to write code that compares two CSV files.

    import pandas as pd
    import numpy as np

    # Raw strings keep the backslashes in the Windows paths literal
    df = pd.read_csv(r"E:\Dupfile.csv")
    df1 = pd.read_csv(r"E:\file.csv")

    df['Correct'] = None

    def Result(x):
        if ....:  # condition still to be written
            return int(1)
        else:
            return int(0)


    df.loc[:, "Correct"] = df.apply(Result, axis=1)

    print(df["Correct"])

    df.to_csv(r"E:\file.csv")
    print(df.head(20))

For example, file.csv looks like this:

   round     date  first  second  third  fourth  fifth  sixth
0      1  2021.04      1      14     15      24     40     41
1      2  2021.04      2       9     10      16     35     37
2      3  2021.04      4      15     24      35     36     40
3      4  2021.03     10      11     20      21     25     41
4      5  2021.03      4       9     23      26     29     33
5      6  2021.03      1       9     26      28     30     41

Dupfile.csv looks like this:

   round     date  first  second  third  fourth  fifth  sixth
0      1  2021.04      1      14     15      24     40     41
0      1  2021.04      1       2      3       4      5      6
1      2  2021.04      2       9     10      16     35     37
1      2  2021.04      1       2      3       4      5      6
2      3  2021.04      4      15     24      35     36     40
2      3  2021.04      1       2      3       4      5      6
3      4  2021.03     10      11     20      21     25     41
3      4  2021.03      1       2      3       4      5      6
4      5  2021.03      4       9     23      26     29     33
4      5  2021.03      1       2      3       4      5      6

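It lists each round more than once, with different values.

For each row in Dupfile, I want to look up the same round in file.csv. If the first through sixth values are all equal, a new "Correct" column in Dupfile should get 1; if they are not, "Correct" should get 0.

I tried to compare the two CSV files, but I don't know how to do it. Can anyone help?

My expected result: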

   round     date  first  second  third  fourth  fifth  sixth  Correct
0      1  2021.04      1      14     15      24     40     41        1
0      1  2021.04      1       2      3       4      5      6        0
1      2  2021.04      2       9     10      16     35     37        1
1      2  2021.04      1       2      3       4      5      6        0
2      3  2021.04      4      15     24      35     36     40        1
2      3  2021.04      1       2      3       4      5      6        0
3      4  2021.03     10      11     20      21     25     41        1
3      4  2021.03      1       2      3       4      5      6        0
4      5  2021.03      4       9     23      26     29     33        1
4      5  2021.03      1       2      3       4      5      6        0

score 2 · Accepted Answer

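If you are using the pandas module, it is best to rely on the methods the module already provides. I suggest using merge to compare the two DataFrames. I rewrote your code as follows.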

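import pandas as pd

# Raw strings keep the backslashes in the Windows paths literal
# (in a normal string, "\f" is read as a form feed).
df = pd.read_csv(r"E:\Dupfile.csv")
df1 = pd.read_csv(r"E:\file.csv")

# Every row of file.csv is a known-correct row.
df1['Correct'] = 1

# Left join on all shared columns: rows of Dupfile.csv without an exact
# match in file.csv get NaN in 'Correct'.
df = df.merge(
        df1,
        how='left',
        on=['round',
            'date',
            'first',
            'second',
            'third',
            'fourth',
            'fifth',
            'sixth'])

# Replace NaN with 0 and cast to int (the NaNs made the column float).
df['Correct'] = df['Correct'].fillna(0).astype(int)

print(df)

print(df['Correct'])

# Note: as in the question, this overwrites file.csv with the merged result.
df.to_csv(r"E:\file.csv")
print(df.head(20))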

How does it work?

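The merge method matches rows on the columns listed in the on array, which must exist under the same names in both df and df1. With left as the how parameter, no rows from the left side of the merge (df) are dropped (a left join). In other words, the Correct column we created for file.csv is appended to the Dupfile.csv data, and rows without a match are assigned nan. The fillna(0) method then replaces those nan values with 0.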
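
The mechanism described above can be sketched on two small in-memory frames instead of the CSV files. The names `left` and `right` and their values are made up for illustration; only `merge`, `isna`, `fillna`, and `astype` are pandas calls.

```python
import pandas as pd

# Hypothetical stand-ins for Dupfile.csv (left) and file.csv (right).
left = pd.DataFrame({"round": [1, 1, 2], "first": [1, 7, 2]})
right = pd.DataFrame({"round": [1, 2], "first": [1, 2]})

# Flag every reference row; the flag survives only where all keys match.
right["Correct"] = 1

# Left join: every row of `left` is kept.
merged = left.merge(right, how="left", on=["round", "first"])

# Unmatched rows hold NaN in "Correct" at this point, which makes the column float.
has_nan = bool(merged["Correct"].isna().any())

# Replace NaN with 0 and restore an integer dtype.
merged["Correct"] = merged["Correct"].fillna(0).astype(int)

print(merged)
```

The row `(1, 7)` has no counterpart in `right`, so it is the one that passes through NaN before ending up as 0.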

pandas.DataFrame.merge API reference

score 0 · Accepted Answer

You can do this in pure pandas with df.merge.

See the example:

import pandas as pd


# file.csv
file_df = pd.DataFrame(
    columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
    data=[
        ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
        ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
        ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
        ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
        ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
        ("6", "2021.03", "1", "9", "26", "28", "30", "41"),
    ],
)

# adding control column (we already know that those are the right values)
file_df["correct"] = 1

# Dupfile.csv
dup_file_df = pd.DataFrame(
    columns=["round", "date", "first", "second", "third", "fourth", "fifth", "sixth"],
    data=[
        ("1", "2021.04", "1", "14", "15", "24", "40", "41"),
        ("1", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("2", "2021.04", "2", "9", "10", "16", "35", "37"),
        ("2", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("3", "2021.04", "4", "15", "24", "35", "36", "40"),
        ("3", "2021.04", "1", "2", "3", "4", "5", "6"),
        ("4", "2021.03", "10", "11", "20", "21", "25", "41"),
        ("4", "2021.03", "1", "2", "3", "4", "5", "6"),
        ("5", "2021.03", "4", "9", "23", "26", "29", "33"),
        ("5", "2021.03", "1", "2", "3", "4", "5", "6"),
    ],
)

# We extract the column names to use in the merging process
cols = list(dup_file_df.columns)

# We merge the 2 dataframes.
# The data frames are to match on every column (round, date and first to sixth).
# A left join keeps exactly the rows of Dupfile.csv; the "correct" column
# will be populated only if all the columns are matching
merged = dup_file_df.merge(file_df, how="left", on=cols)

# We put 0 where correct is NaN and cast to integer (it was float)
merged["correct"] = merged["correct"].fillna(0).astype(int)

# Done!
print(merged)
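
As a possible variation (not part of the answer above), the control column can be left out and the `indicator` parameter of `merge` used instead. The sketch below uses cut-down, made-up versions of the two tables, and the `drop_duplicates()` call is an added assumption: it guards against file.csv containing repeated rows, which would otherwise multiply rows in the result.

```python
import pandas as pd

# Hypothetical miniature versions of file.csv and Dupfile.csv.
file_df = pd.DataFrame({"round": [1, 2, 3], "first": [1, 2, 4], "second": [14, 9, 15]})
dup_file_df = pd.DataFrame(
    {"round": [1, 1, 2, 2], "first": [1, 1, 2, 1], "second": [14, 2, 9, 2]}
)

cols = list(dup_file_df.columns)

# indicator=True adds a "_merge" column: "both" for rows found in both frames,
# "left_only" for rows that exist only in dup_file_df.
merged = dup_file_df.merge(
    file_df.drop_duplicates(), how="left", on=cols, indicator=True
)

# Turn the indicator into the 0/1 flag and drop the helper column.
merged["correct"] = (merged["_merge"] == "both").astype(int)
merged = merged.drop(columns="_merge")

print(merged)
```

Because the flag comes from a comparison rather than from `fillna`, the column never passes through NaN and needs no float-to-int repair.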
