python - pd.read_feather: decimal/thousands separator problems and float rounding issues

Question

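I want to use .ftr files to quickly analyze hundreds of tables. Unfortunately, I am running into problems with decimal and thousands separators, similar to that post, except that read_feather does not accept the decimal=',', thousands='.' options. I tried the following:

df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.str.replace(".","", regex=True)
                      .str.replace(",",".", regex=True))
)

which results in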

AttributeError: 'str' object has no attribute 'str'
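
The error arises because Series.apply calls the lambda once per element, so x is already a plain str and has no .str accessor. A minimal sketch of a vectorised alternative, assuming the column holds European-formatted strings (the sample values are made up for illustration):

```python
import pandas as pd

# Hypothetical sample data: "." as thousands separator, "," as decimal mark.
df = pd.DataFrame({"numberofx": ["1.234,56", "2.236", "0,5"]})

cleaned = (
    df["numberofx"]
    .str.replace(".", "", regex=False)   # drop thousands separators (literal dot)
    .str.replace(",", ".", regex=False)  # turn the decimal comma into a dot
)
# Convert the cleaned text to numbers; unparseable entries become NaN.
df["numberofx"] = pd.to_numeric(cleaned, errors="coerce")
print(df["numberofx"].tolist())
```

Using regex=False keeps "." a literal dot, which sidesteps any regex interpretation of the pattern.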

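When I change it to

df['numberofx'] = (
    df['numberofx']
    .apply(lambda x: x.replace(".","").replace(",","."))
)

I get some strange (rounding) errors in the result, e.g. 22359999999999998 instead of 2236 for some numbers above 1k. Everything below 1k is 10 times the true result, probably because removing the "." from a float turns that number into an integer.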
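
One plausible reading of these numbers, assuming the column had already been parsed as floats and was converted to text before the replacements: the "." being removed is the real decimal point. The sketch below uses made-up values and a hypothetical helper name to illustrate this.

```python
# Assumption: the values were already floats whose text form uses "." as
# the decimal point, so deleting "." deletes the decimal point itself.
def strip_separators(text: str) -> str:
    return text.replace(".", "").replace(",", ".")

long_repr = "2.2359999999999998"   # 17-digit text form of a float near 2.236
short_repr = "223.6"               # a value below 1000

print(strip_separators(long_repr))   # 22359999999999998
print(strip_separators(short_repr))  # 2236, ten times 223.6
```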

Trying

df['numberofx'] = df['numberofx'].str.replace('.', '', regex=True)

also leads to strange behavior in the result: some numbers end up on the order of 10^12 while others stay on the order of 10^3.
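
A side note on regex=True: as a regular expression, an unescaped "." matches any character, not just a literal dot. Whether pandas treats a single-character pattern as a literal when regex=True has differed between pandas versions, so this sketch uses the standard library's re module to show the underlying regex behavior:

```python
import re

text = "1.234"

# Unescaped "." matches every character, so everything is removed.
print(repr(re.sub(".", "", text)))   # ''

# Escaping the dot, or using plain str.replace, removes only the dots.
print(re.sub(r"\.", "", text))       # 1234
print(text.replace(".", ""))         # 1234
```

In pandas, passing regex=False to Series.str.replace gives the literal behavior of the last line.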

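Below is how I create the .ftr files from multiple Excel files. I know I could simply create the DataFrames from the Excel files directly, but that would slow down my daily calculations considerably.

How can I solve this?

Edit: The problem seems to come from reading the Excel files into a DataFrame when they use non-US conventions for decimal and thousands separators, not from saving them as feather. Reading the Excel files with pd.read_excel(f, encoding='utf-8', decimal=',', thousands='.') solved my problem. This leads to the next question: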

Why does saving floats in a feather file lead to strange rounding errors, such as 2.236 turning into 2.2359999999999998?
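
A likely explanation is ordinary binary floating point rather than anything specific to feather: 2.236 has no exact float64 representation, and the extra digits only become visible when the value is printed with 17 significant digits. The sketch below assumes feather stores float64 values as raw 8-byte IEEE 754 numbers and imitates that with struct:

```python
import struct
from decimal import Decimal

x = 2.236
print(repr(x))            # shortest text that round-trips: 2.236
print(Decimal(x))         # the exact value actually stored (not exactly 2.236)
print(format(x, ".17g"))  # the same float shown with 17 significant digits

# Writing the 8 raw bytes and reading them back changes nothing.
raw = struct.pack("<d", x)
restored = struct.unpack("<d", raw)[0]
print(restored == x)      # True
```

Under that assumption the stored value is identical before and after the round trip; only the number of digits displayed differs.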

score 2 · Accepted Answer

The problem in your code is the following.

When you check the column's type in your (pandas) DataFrame, you will find:

df.dtypes['numberofx']

Result: dtype object

So the suggested solution is to try:

df['numberofx'] = df['numberofx'].apply(pd.to_numeric, errors='coerce')

Another way to solve this is to convert your values to float:

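def coerce_to_float(val):
    # Return val as a float when possible, otherwise leave it unchanged.
    try:
        return float(val)
    except (ValueError, TypeError):
        return val

# A Series has no applymap; apply calls the function on each element.
df['numberofx'] = df['numberofx'].apply(coerce_to_float)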

To avoid floats displayed like '4.806105e+12', here is a sample:

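import numpy as np
import pandas as pd

df = pd.DataFrame({'numberofx':['4806105017087','4806105017087','CN414149']})
print (df)
       numberofx
0  4806105017087
1  4806105017087
2       CN414149

# Non-numeric entries become NaN, which forces the column to float64.
print (pd.to_numeric(df['numberofx'], errors='coerce'))
0    4.806105e+12
1    4.806105e+12
2             NaN
Name: numberofx, dtype: float64

# Replace NaN with 0 and cast to a 64-bit integer to keep all digits.
df['numberofx'] = pd.to_numeric(df['numberofx'], errors='coerce').fillna(0).astype(np.int64)
print (df['numberofx'])
0    4806105017087
1    4806105017087
2                0
Name: numberofx, dtype: int64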
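
If filling unparseable entries with 0 is undesirable, pandas' nullable Int64 dtype is one alternative that keeps them as missing values. A minimal sketch reusing the sample values above (the variable names are illustrative):

```python
import pandas as pd

s = pd.Series(["4806105017087", "4806105017087", "CN414149"], name="numberofx")

# Coerce to numbers, then cast to the nullable integer dtype so the
# unparseable entry stays missing (<NA>) instead of becoming 0.
as_int = pd.to_numeric(s, errors="coerce").astype("Int64")
print(as_int)
```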

score 0 · Accepted Answer

As I mentioned in my edit, here is what solved my original problem:

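import glob

import pandas as pd

path = r"pathname\*_somename*.xlsx"
file_list = glob.glob(path)
for f in file_list:
    # Parse European-style numbers while reading. encoding='utf-8' is
    # dropped here because newer pandas versions reject it in read_excel.
    df = pd.read_excel(f, decimal=',', thousands='.')
    for col in df.columns:
        # Flag rows whose value type differs from the first row's type.
        w = (df[[col]].applymap(type) != df[[col]].iloc[0].apply(type)).any(axis=1)
        if len(df[w]) > 0:
            # Mixed-type column: store it as strings so it can be written.
            df[col] = df[col].astype(str)

        if df[col].dtype == list:
            df[col] = df[col].astype(str)
    # Swap the "xlsx" extension for "ftr".
    pathname = f[:-4] + "ftr"
    df.to_feather(pathname)
df.head()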

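I had to add the decimal=',', thousands='.' options when reading the Excel files, which I then saved as feather. So the problem did not occur when working with the .ftr files, but before that. The rounding issue seems to come from saving numbers with different decimal and thousands separators as .ftr files.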
