python - Reading a csv with a delimiter in python dask

Question

I am trying to create a DataFrame by reading a csv file whose fields are separated by '#####' (5 hashes).

The code is:

import dask.dataframe as dd
df = dd.read_csv(r'D:\temp.csv', sep='#####', engine='python')
res = df.compute()

The error is:

dask.async.ValueError:
Dask dataframe inspected the first 1,000 rows of your csv file to guess the
data types of your columns.  These first 1,000 rows led us to an incorrect
guess.

For example a column may have had integers in the first 1000
rows followed by a float or missing value in the 1,001-st row.

You will need to specify some dtype information explicitly using the
``dtype=`` keyword argument for the right column names and dtypes.

    df = dd.read_csv(..., dtype={'my-column': float})

Pandas has given us the following error when trying to parse the file:

  "The 'dtype' option is not supported with the 'python' engine"

Traceback
 ---------
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 263, in execute_task
result = _execute_task(task, data)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/async.py", line 245, in _execute_task
return func(*args2)
File "/home/ec2-user/anaconda3/lib/python3.4/site-packages/dask/dataframe/io.py", line 69, in _read_csv
raise ValueError(msg)

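So how do I get rid of this error?

If I follow the error message, I have to provide a dtype for every column, but with 100+ columns that is not practical.

If I read the file without a separator, everything works, but then ##### appears everywhere. So after computing to a pandas DataFrame, is there a way to get rid of it?

Please help me with this.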

score 7 · Accepted Answer

Read the entire file in as dtype=object, meaning all columns will be interpreted as type object. This should read in correctly, getting rid of the ##### in each row. From there you can turn it into a pandas frame using the compute() method. Once the data is in a pandas frame, you can use the pandas infer_objects method to update the types without having to hard code them.

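import dask.dataframe as dd
# Raw string so '\t' in the Windows path is not read as a tab; every column is read as object
df = dd.read_csv(r'D:\temp.csv', sep='#####', dtype='object').compute()
# Soft-convert the object columns of the resulting pandas DataFrame
res = df.infer_objects()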
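
One caveat worth checking on your own data: infer_objects only performs a soft conversion, so a column that holds numeric text such as '1' or '2.5' (which is what dtype=object produces when reading a csv) is not turned into numbers. Below is a minimal sketch of one way to convert such columns with pd.to_numeric. The in-memory sample data and the helper name to_numeric_where_possible are made up for illustration, and plain pandas stands in for the computed dask result.

```python
import io

import pandas as pd

# Hypothetical sample data using the same '#####' separator as the question.
raw = (
    "id#####price#####name\n"
    "1#####2.5#####apple\n"
    "2#####3#####pear\n"
)

# A multi-character separator is treated as a regex and needs the python engine.
df = pd.read_csv(io.StringIO(raw), sep="#####", engine="python", dtype=object)

# infer_objects alone leaves the numeric text as it is.
soft = df.infer_objects()


def to_numeric_where_possible(frame):
    """Return a copy with every fully numeric column converted (hypothetical helper)."""
    out = frame.copy()
    for col in out.columns:
        try:
            out[col] = pd.to_numeric(out[col])
        except (ValueError, TypeError):
            # Column contains non-numeric text; keep it unchanged.
            pass
    return out


typed = to_numeric_where_possible(df)
print(typed.dtypes)
```

Columns that fail to parse are left untouched, so text columns survive as they were read.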

score 1 · Accepted Answer

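If you want to keep the entire file as a dask dataframe, I got this working simply by increasing the number of bytes sampled in read_csv.

For example: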

import dask.dataframe as dd
# Raw string keeps '\t' in the Windows path from becoming a tab; sample is given in bytes
df = dd.read_csv(r'D:\temp.csv', sep='#####', sample=1000000)  # increase to 1e6 bytes
df.head()

This can resolve some type inference problems, although unlike Benjamin Cohen's answer, you need to find the right value to choose for sample.
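
To see why a larger sample can help, here is a toy illustration in plain pandas. It is not dask's actual code: nrows merely stands in for the idea of guessing dtypes from the start of the file, and the data is invented. A column that looks like integers in the first rows turns out to be float once a later row is seen.

```python
import io

import pandas as pd

# Hypothetical data: column 'b' is integer-like until the last row.
rows = ["a#####b"] + [f"{i}#####{i}" for i in range(5)] + ["5#####2.5"]
raw = "\n".join(rows) + "\n"

# Guess dtypes from the first 5 data rows only (stand-in for a small sample).
head = pd.read_csv(io.StringIO(raw), sep="#####", engine="python", nrows=5)

# Guess dtypes from the whole file (stand-in for a sample that covers the odd row).
full = pd.read_csv(io.StringIO(raw), sep="#####", engine="python")

print(head.dtypes)
print(full.dtypes)
```

A sample that happens to include the unusual rows produces a guess that matches the rest of the file, which is why the right value for sample depends on where those rows sit.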
