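I am trying to compare two CSV files (unsorted) and want a report similar to SAS PROC COMPARE. I am using datacompy and sorting the DataFrames before comparing, but the datacompy report says there are "no rows in common".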

Please let me know what I am missing in the code snippet below.

I have already tried sorting and reindexing, leaving out join_columns, and also on_index=True.

from io import StringIO
import pandas as pd
import datacompy

data1 = """name,age,loc
ABC,123,LON
EFG,456,MAA
"""

data2 = """name,age,loc
EFG,457,MAA
ABC,124,LON
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

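# sort_values returns a new DataFrame, so assign the result back;
# reset_index(drop=True) renumbers the rows after sorting
df1 = df1.sort_values(by=['name','age','loc']).reset_index(drop=True)
df2 = df2.sort_values(by=['name','age','loc']).reset_index(drop=True)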

compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['name','age','loc'],  #You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')
compare.matches()

print(compare.report())

The expected result is:

data1

name,age,loc
ABC,123,LON
EFG,456,MAA

data2

name,age,loc
ABC,123,LON
EFG,457,MAA

The report should show a maximum difference of 1 in the age column, with everything else matching.

1 Answer

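You are joining on all three columns, when you should join only on name. Because age differs between the two files, no row matches on the full (name, age, loc) key, which is why the report shows no rows in common. Change the comparison to the following: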

compare = datacompy.Compare(
    df1,
    df2,
    join_columns=['name'],  #You can also specify a list of columns
    abs_tol=0.0001,
    rel_tol=0,
    df1_name='original',
    df2_name='new')
compare.matches()

print(compare.report())

This produces the following output:

DataFrame Summary
-----------------

  DataFrame  Columns  Rows
0  original        3     2
1       new        3     2

Column Summary
--------------

Number of columns in common: 3
Number of columns in original but not in new: 0
Number of columns in new but not in original: 0

Row Summary
-----------

Matched on: name
Any duplicates on match values: No
Absolute Tolerance: 0.0001
Relative Tolerance: 0
Number of rows in common: 2
Number of rows in original but not in new: 0
Number of rows in new but not in original: 0

Number of rows with some compared columns unequal: 2
Number of rows with all compared columns equal: 0

Column Comparison
-----------------

Number of columns compared with some values unequal: 1
Number of columns compared with all values equal: 2
Total number of values which compare unequal: 2

Columns with Unequal Values or Types
------------------------------------

  Column original dtype new dtype  # Unequal  Max Diff  # Null Diff
0    age          int64     int64          2       1.0            0

Sample Rows with Unequal Values
-------------------------------

  name  age (original)  age (new)
1  EFG             456        457
0  ABC             123        124
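
To see why the choice of join columns matters, here is a minimal pandas-only sketch. It assumes that datacompy's join_columns behaves like the key of a pandas merge, which is consistent with the report above but is not taken from the datacompy source. The names all_cols, by_name and max_age_diff are illustrative only.

```python
from io import StringIO

import pandas as pd

data1 = """name,age,loc
ABC,123,LON
EFG,456,MAA
"""

data2 = """name,age,loc
EFG,457,MAA
ABC,124,LON
"""

df1 = pd.read_csv(StringIO(data1))
df2 = pd.read_csv(StringIO(data2))

# Joining on every column: a row pairs up only if name, age and loc are
# all equal. age differs in both rows, so nothing pairs up.
all_cols = df1.merge(df2, on=["name", "age", "loc"], how="inner")
print(len(all_cols))

# Joining on the identifying column only: rows pair up by name, and the
# remaining columns are left over to be compared.
by_name = df1.merge(df2, on="name", how="inner",
                    suffixes=("_original", "_new"))
print(len(by_name))

# Largest absolute difference in age across the paired rows.
max_age_diff = (by_name["age_original"] - by_name["age_new"]).abs().max()
print(max_age_diff)
```

Row order does not affect the result here because the merge pairs rows by key, so sorting the DataFrames beforehand is not needed for this kind of comparison.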

answered 2019-08-21T15:11:22.567