
I read a couple of emails about EU-to-US decimal-mark conversion, and they helped a lot, but I still feel the need for some expert help.. My data comes from an ERP system with numbers formatted like "1'000' 000,32", and I would like to simply convert that into something like "1000000.32" for further processing in Pandas.

My actual working solution to get from the EU format to the US format is along these lines:

... 
 # read_csv and merge, clean .. different CSV files
 # result = merge (some_DataFrame_EU_format, ...)
...
result.to_csv(path, sep=';')
result = read_csv(path, sep=';', converters={'column_name': lambda x: float(x.replace('.', '').replace(',', '.'))})
....
result.to_csv(path, sep=';')

I feel this is a slow way of changing ',' to '.' because of the read_csv and to_csv round trip (and the disk ..), so I was willing to try the .replace method directly on the DataFrame to save some processing time.

My initial attempt was something like this (which I read somewhere else on the forum..):

result['column_name'] = result['column_name'].replace('.', '')
result['column_name'] = result['column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

This did not work and led to an "invalid literal for float" error.
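For what it's worth, the likely cause is that Series.replace matches whole cell values rather than substrings, so the strings are left untouched and astype(float) then fails on values like "1000,32". In current pandas the vectorized .str.replace accessor does substring replacement; a minimal sketch (the two-value Series is made up for illustration, and regex=False is a newer-pandas flag that treats '.' as a literal dot):

```python
import pandas as pd

s = pd.Series(["1.000,25", "2.500,50"])

# Series.replace matches entire cell values, so nothing changes here:
untouched = s.replace('.', '')

# .str.replace works on substrings; regex=False keeps '.' literal:
cleaned = (s.str.replace('.', '', regex=False)
            .str.replace(',', '.', regex=False)
            .astype(float))
```

After this, `cleaned` holds proper float64 values and can be used in arithmetic directly.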

I then moved to:

for i in range(0, len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '')
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

The above works.. but surprisingly it seems to be about 3 times slower than the read_csv/converters solution. Using the following helped somewhat:

for i in range(0, len(result)):
    result.ix[i, 'column_name'] = result.ix[i, 'column_name'].replace('.', '').replace(',', '.')
result['column_name'] = result['column_name'].astype(float)

I read the fine manual.. and know that read_csv is optimized.. but did not really expect the read/write/read/write cycle to be three times faster than the for loop!

Do you think it is worth working more on this? Any suggestions? Or is it better to stick with the repeated write/read/write approach?

My file is about 30k rows x 150 columns; the read/write/read(convert)/write takes about 18 seconds, versus more than 52 seconds with the first kind of .ix loop (and 32 seconds with the grouped .replace).

What is your experience converting DataFrames from EU to US format? Any suggested ways to improve? What about 'mapping' or 'locale'? Would they be faster?
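On the locale question: Python's locale.atof can parse comma decimals once a suitable locale is set, but locale availability is system-dependent and setlocale is process-global, so a plain string-normalizing helper is often simpler. A hypothetical sketch (the helper name eu_to_us is mine, not from the thread):

```python
def eu_to_us(s):
    """Turn an EU-formatted number string such as "1'000' 000,32" into a float.

    Strips apostrophe/dot/space thousands separators, then swaps the
    decimal comma for a dot.
    """
    for sep in ("'", ".", " "):
        s = s.replace(sep, "")
    return float(s.replace(",", "."))

eu_to_us("1'000' 000,32")  # -> 1000000.32
```

A per-element helper like this would still be applied row by row (e.g. via map), so it answers the parsing question rather than the speed one.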

Thank you so much, Fabio.

PS I realize I was "verbose" and not "pythonic" enough.. sorry sorry.. I am still learning... :-)


3 Answers


Thank you so much for your great suggestions and help, Andy and Jeff! You helped a lot :-)

I first went back to the original data with an editor. In some of it I saw that the system had probably applied some kind of automatic conversion, so I downloaded the same dataset again with the 'unconverted' option and avoided using e.g. Excel or other programs to open/save the files; I used text editors only. At this point I made the read_csv lighter with no converters and grouped the replaces as Jeff suggested.

The real case is a bit longer than the provided example and includes some stripping (spaces), column deletion, string concatenation, renaming/replacing .... The decimal marks are replaced for three columns: USD sales, qty, USD_EUR exchange rate. Based on them, EUR sales and EUR unit prices are calculated. In the initial file we also have, for some other reason, a '-' before the exchange rate to be fixed ("-", ""). The result is:

result = pd.read_csv(path, sep=';', thousands='.')
col = ['qty', 'sales', 'rate']
result[col] = result[col].apply(lambda x: x.str.replace('.', '', regex=False).str.replace(',', '.', regex=False))
result['sales_localcurrency'] = abs(result['sales'].astype(float) / result['rate'].astype(float))
result['sales_localcurrency_unit'] = result['sales_localcurrency'] / result['qty'].astype(float)
result.to_csv(path, sep=';')

The 30'000 x 150 DataFrame is processed in less than 15 seconds :-) :-) including all the other things I did not detail here (stripping, del, concat, ..). All the read/write/read/write has been removed from the code by skipping the 'converters' during read_csv.

Thank you for your help :-) !

Bye Bye. Fabio.

answered 2013-07-09T14:24:48.720

There are actually thousands and decimal parameters in read_csv (see the pandas read_csv docs), but unfortunately the two don't yet work together (see the issue: github issue).
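That limitation applied to the pandas of the time; in later pandas releases the two parameters can be passed together. A small sketch, assuming a recent pandas (the column names and values are made up):

```python
import io
import pandas as pd

csv_text = "qty;sales\n1.000;2.000,50\n"

# thousands='.' strips the separators, decimal=',' handles the comma,
# so the columns come back numeric without any converters:
df = pd.read_csv(io.StringIO(csv_text), sep=';', thousands='.', decimal=',')
```

With both options honored, no post-read .str.replace pass is needed at all.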

answered 2013-07-19T19:18:27.157

Create a frame with the value you specified and write it to csv:

In [2]: df = DataFrame("100'100,32",index=range(30000),columns=range(150))

In [3]: df.iloc[0:5,0:5]
Out[3]: 
            0           1           2           3           4
0  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
1  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
2  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
3  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32
4  100'100,32  100'100,32  100'100,32  100'100,32  100'100,32

In [4]: df.to_csv('test.csv')

Read it back, with no converters:

In [5]: df = read_csv('../test.csv',index_col=0)

In [6]: %timeit read_csv('../test.csv',index_col=0)
1 loops, best of 3: 1e+03 ms per loop

In [7]: df
Out[7]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 0 to 29999
Columns: 150 entries, 0 to 149
dtypes: object(150)

In [8]: %timeit read_csv('../test.csv',index_col=0)
1 loops, best of 3: 1e+03 ms per loop

Do the string replacement column-by-column. Here you can restrict it to only the columns you need, via df[[ list of columns ]].apply(.....)

In [9]: df.apply(lambda x: x.str.replace("'","").str.replace(",",".")).astype(float)
Out[9]: 
<class 'pandas.core.frame.DataFrame'>
Int64Index: 30000 entries, 0 to 29999
Columns: 150 entries, 0 to 149
dtypes: float64(150)

In [10]: %timeit df.apply(lambda x: x.str.replace("'","").str.replace(",",".")).astype(float)
1 loops, best of 3: 4.77 s per loop

Total time is under 6s.
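To illustrate the df[[ list of columns ]] variant mentioned above, a minimal sketch restricting the replacement to one column (the column names are made up; regex=False is a newer-pandas flag that makes the dot and apostrophe literal):

```python
import pandas as pd

df = pd.DataFrame({'amount': ["100'100,32", "1'000,00"],
                   'label': ['a', 'b']})

# Only the listed columns are touched; 'label' stays as-is:
cols = ['amount']
df[cols] = (df[cols]
            .apply(lambda x: x.str.replace("'", "", regex=False)
                              .str.replace(",", ".", regex=False))
            .astype(float))
```

This keeps the non-numeric columns out of the apply, which also avoids .str errors on columns that are not strings of digits.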

FYI, there is a separate thousands option, but not a decimal one .... hmm, that would be much faster ....

answered 2013-07-08T23:58:27.653