
I read a couple of emails about EU-to-US decimal-mark conversion, and they helped a lot, but I still feel the need for some expert help.. My data comes from an ERP system with numbers formatted like "1'000' 000,32", and I would like to simply convert that into something like "1000000.32" for further processing in Pandas.

My actual working solution to get from the EU format to the US format is along these lines:

... 
 # read_csv and merge, clean .. different CSV files
 # result = merge (some_DataFrame_EU_format, ...)
...
result.to_csv(path, sep=';')
result = read_csv(path, sep=';', converters={'column_name': lambda x: float(x.replace('.', '').replace(',', '.'))})
....
result.to_csv(path, sep=';')

I feel this is a slow way of changing ',' to '.' because of the read_csv and to_csv round trip (and the disk ..), so I was willing to try the .replace method directly on the DataFrame to save some processing time.

My initial attempt was something like this (which I read somewhere else on the forum..):

result['column_name'] = result['column_name'].replace('.', '')
result['column_name'] = result['column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

This did not work and led to an "invalid literal for float" error.
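For what it's worth, the likely cause is that Series.replace matches whole cell values rather than substrings, so the strings are left untouched and astype(float) then fails on values like "1000,32". In current pandas the vectorized .str.replace accessor does substring replacement; a minimal sketch (the two-value Series is made up for illustration, and regex=False is a newer-pandas flag that treats '.' as a literal dot):

```python
import pandas as pd

s = pd.Series(["1.000,25", "2.500,50"])

# Series.replace matches entire cell values, so nothing changes here:
untouched = s.replace('.', '')

# .str.replace works on substrings; regex=False keeps '.' literal:
cleaned = (s.str.replace('.', '', regex=False)
            .str.replace(',', '.', regex=False)
            .astype(float))
```

After this, `cleaned` holds proper float64 values and can be used in arithmetic directly.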

I then moved to:

for i in range(0, len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '')
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

The above works.. but surprisingly it seems to be about 3 times slower than the read_csv/converters solution. Using the following helped somewhat:

for i in range(0, len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '').replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

I read the fine manual.. and know that read_csv is optimized.. but did not really expect the read/write/read/write cycle to be three times faster than the for loop!

Do you think it is worth working more on this? Any suggestions? Or is it better to stick with the repeated write/read/write approach?

My file is about 30k rows x 150 columns; the read/write/read(convert)/write takes about 18 seconds, versus more than 52 seconds with the first kind of .ix loop (and 32 seconds with the grouped .replace).

What is your experience converting DataFrames from EU to US format? Any suggested ways to improve? What about 'mapping' or 'locale'? Would they be faster?
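On the locale question: Python's locale.atof can parse comma decimals once a suitable locale is set, but locale availability is system-dependent and setlocale is process-global, so a plain string-normalizing helper is often simpler. A hypothetical sketch (the helper name eu_to_us is mine, not from the thread):

```python
def eu_to_us(s):
    """Turn an EU-formatted number string such as "1'000' 000,32" into a float.

    Strips apostrophe/dot/space thousands separators, then swaps the
    decimal comma for a dot.
    """
    for sep in ("'", ".", " "):
        s = s.replace(sep, "")
    return float(s.replace(",", "."))

eu_to_us("1'000' 000,32")  # -> 1000000.32
```

A per-element helper like this would still be applied row by row (e.g. via map), so it answers the parsing question rather than the speed one.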

Thank you so much, Fabio.

PS I realize I was "verbose" and not "pythonic" enough.. sorry sorry.. I am still learning... :-)


3 Answers


Thank you so much for your great suggestions and help, Andy and Jeff! You helped a lot :-)

I first went back to the original data with an editor. In some of it I saw that the system had probably applied some kind of automatic conversion, so I downloaded the same dataset again with the 'unconverted' option and avoided using e.g. Excel or other programs to open/save the files; I used text editors only. At this point I made the read_csv lighter with no converters and grouped the replaces as Jeff suggested.

The real case is a bit longer than the provided example and includes some stripping (spaces), column deletion, string concatenation, renaming/replacing .... The decimal marks are replaced for three columns: USD sales, qty, USD_EUR exchange rate. Based on them, EUR sales and EUR unit prices are calculated. In the initial file we also have, for some other reason, a '-' before the exchange rate to be fixed ("-", ""). The result is:

result = pd.read_csv(path, sep=';', thousands='.')
col = ['qty', 'sales', 'rate']
result[col] = result[col].apply(lambda x: x.str.replace('.', '', regex=False).str.replace(',', '.', regex=False))
result['sales_localcurrency'] = abs(result['sales'].astype(float) / result['rate'].astype(float))
result['sales_localcurrency_unit'] = result['sales_localcurrency'] / result['qty'].astype(float)
result.to_csv(path, sep=';')

The 30'000 x 150 DataFrame is processed in less than 15 seconds :-) :-) including all the other things I did not detail here (stripping, del, concat, ..). All the read/write/read/write has been removed from the code by skipping the 'converters' during read_csv.

Thank you for your help :-) !

Bye Bye. Fabio.

answered 2013-07-09T14:24:48.720

There are actually thousands and decimal parameters in read_csv (see the pandas read_csv docs), but unfortunately the two don't yet work together (see the issue: github issue).
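That limitation applied to the pandas of the time; in later pandas releases the two parameters can be passed together. A small sketch, assuming a recent pandas (the column names and values are made up):

```python
import io
import pandas as pd

csv_text = "qty;sales\n1.000;2.000,50\n"

# thousands='.' strips the separators, decimal=',' handles the comma,
# so the columns come back numeric without any converters:
df = pd.read_csv(io.StringIO(csv_text), sep=';', thousands='.', decimal=',')
```

With both options honored, no post-read .str.replace pass is needed at all.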

answered 2013-07-19T19:18:27.157

Create a frame with the value you specified and write it to csv:

In [2]: df = DataFrame("100'100,32",index=range(30000),columns=range(150))

In [3]: df.iloc[0:5,0:5]
Out[3]: 
            0           1           2           3           4
0  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
1  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
2  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
3  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
4  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32

In [4]: df.to_csv('test.csv')

Read it back, with no converters:

In [5]: df = read_csv('../test.csv',index_col=0)

In [6]: %timeit read_csv('../test.csv',index_col=0)
1 loops, best of 3: 1e+03 ms per loop

In [7]: df
Out[7]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 0 to 29999
Columns: 150 entries, 0 to 149
dtypes: object(150)

In [8]: %timeit read_csv('../test.csv',index_col=0)
1 loops, best of 3: 1e+03 ms per loop

Do the string replacement column-by-column. Here you can restrict it to only the columns you need, via df[[ list of columns ]].apply(.....)

In [9]: df.apply(lambda x: x.str.replace("'","").str.replace(",",".")).astype(float)
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 0 to 29999
Columns: 150 entries, 0 to 149
dtypes: float64(150)

In [10]: %timeit df.apply(lambda x: x.str.replace("'","").str.replace(",",".")).astype(float)
1 loops, best of 3: 4.77 s per loop

Total time is under 6s.
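To illustrate the df[[ list of columns ]] variant mentioned above, a minimal sketch restricting the replacement to one column (the column names are made up; regex=False is a newer-pandas flag that makes the dot and apostrophe literal):

```python
import pandas as pd

df = pd.DataFrame({'amount': ["100'100,32", "1'000,00"],
                   'label': ['a', 'b']})

# Only the listed columns are touched; 'label' stays as-is:
cols = ['amount']
df[cols] = (df[cols]
            .apply(lambda x: x.str.replace("'", "", regex=False)
                              .str.replace(",", ".", regex=False))
            .astype(float))
```

This keeps the non-numeric columns out of the apply, which also avoids .str errors on columns that are not strings of digits.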

FYI, there is a separate thousands option, but not a decimal one .... hmm, that would be much faster ....

answered 2013-07-08T23:58:27.653