I came across the modin library, which is supposed to speed up some pandas operations, and started testing it.

While loading data with read_csv is noticeably faster, simple conditional assignments that work perfectly in plain pandas, such as:

    df.loc[df['Score'] > 8,'Score_T2B'] = 1
    df.loc[df['Score'] < 9,'Score_T2B'] = 0

throw a long chain of errors:

Traceback (most recent call last):

  File "<ipython-input-21-0b842942ffac>", line 1, in <module>
    df.loc[df['Score'] > 8,'Score_T2B'] = 1

  File "C:\ProgramData\Anaconda3\lib\site-packages\modin\pandas\indexing.py", line 251, in __setitem__
    new_col[row_loc] = item

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 1244, in __setitem__
    setitem(key, value)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 1221, in setitem
    self.loc[key] = value

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 204, in __setitem__
    indexer = self._get_setitem_indexer(key)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 191, in _get_setitem_indexer
    return self._convert_to_indexer(key, axis=axis, is_setter=True)

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1285, in _convert_to_indexer
    return self._get_listlike_indexer(obj, axis, **kwargs)[1]

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1092, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing

  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1177, in _validate_read_indexer
    key=key, axis=self.obj._get_axis_name(axis)

KeyError: "None of [Index([False, False, False, False, False, False, False, False, False, False,\n       ...\n       False, False, False, False, False, False, False, False, False, False],\n      dtype='object', length=169815)] are in the [index]"

This should be a simple operation. Is there a workaround, or am I missing a step beyond loading the data with:

  import modin.pandas as pm  
  df = pm.read_csv(input_file, sep='\t', encoding='utf-8', low_memory=False)

Thanks a lot!


1 Answer


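I found that modin's read_csv function is not quite a drop-in replacement for pandas.read_csv. In particular, it does not handle errors the same way, and it raises exceptions on calls that would work with pandas.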

Perhaps it is best to import both versions and fall back on pandas when modin fails?

Here is my import block:

try:
    import modin.pandas as pd # claims to be 4X faster loading csvs, using all processor cores
    print('modin.pandas active')
except ImportError:
    import pandas as pd
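
The import above only falls back to pandas when modin is not installed. To also fall back when a call fails at runtime, a small helper along these lines is one option. This is a minimal sketch: `read_with_fallback` and its `readers` parameter are hypothetical names, not part of modin or pandas, and it assumes every reader accepts the same keyword arguments.

```python
def read_with_fallback(path, readers, **kwargs):
    """Try each reader in order and return the first result that succeeds.

    `readers` is a sequence of callables with a read_csv-like signature,
    for example [modin.pandas.read_csv, pandas.read_csv].
    """
    if not readers:
        raise ValueError("at least one reader is required")

    last_error = None
    for reader in readers:
        try:
            return reader(path, **kwargs)
        except Exception as exc:
            # Deliberately broad: the two libraries may raise different
            # exception types for the same bad input.
            last_error = exc

    # Every reader failed, so surface the error from the last one tried.
    raise last_error


# Example call (assumes both libraries are installed):
#   import pandas
#   import modin.pandas as mpd
#   sample = read_with_fallback(part, [mpd.read_csv, pandas.read_csv])
```

Note that when the fallback is taken, the result is a plain pandas DataFrame rather than a modin one, so later code should not rely on modin-only behaviour.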

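Here is an example where read_csv failed for me.

The following raises a TypeError with modin but runs without error under pandas. The file being loaded does not contain a column named 'IlmnID'.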

try:
    sample = pd.read_csv(part, index_col='IlmnID')
except ValueError:
    sample = pd.read_csv(part)

This version works with modin's pd.read_csv, because it does not rely on a try/except wrapper:

sample = pd.read_csv(part)
if 'IlmnID' in sample.columns:
    sample.set_index('IlmnID', inplace=True)
elif 'illumina_id' in sample.columns:
    # rename the column first so the index ends up named 'IlmnID'
    sample.rename(columns={'illumina_id': 'IlmnID'}, inplace=True)
    sample.set_index('IlmnID', inplace=True)
else:
    # assume the first column holds the IDs
    guess_index = sample.columns[0]
    sample.rename(columns={guess_index: 'IlmnID'}, inplace=True)
    sample.set_index('IlmnID', inplace=True)
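
As for the conditional assignment in the question, one possible workaround is to build the column in a single vectorized expression instead of two boolean `.loc` assignments, which avoids the `.loc` setter that appears in the traceback. The sketch below uses plain pandas so that it runs anywhere; swapping the import for `modin.pandas` is an assumption to check, since this has not been verified against the modin version in the question. It also assumes `Score` holds integers with no missing values (a missing score would become 0 here instead of staying NaN).

```python
import pandas as pd  # assumption: `import modin.pandas as pd` behaves the same here

# Small stand-in for the real data; only the 'Score' column matters.
df = pd.DataFrame({"Score": [7, 8, 9, 10]})

# The comparison yields a boolean Series; casting to int gives 1 for
# scores above 8 and 0 otherwise, in one assignment.
df["Score_T2B"] = (df["Score"] > 8).astype(int)
```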

Answered 2020-02-27T19:19:46.263