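I came across the modin library, which is supposed to speed up some pandas operations, and started testing it.
While loading data with read_csv is noticeably faster, simple conditional assignments that work perfectly in plain pandas, for example: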
df.loc[df['Score'] > 8,'Score_T2B'] = 1
df.loc[df['Score'] < 9,'Score_T2B'] = 0
throw a long chain of errors:
Traceback (most recent call last):
  File "<ipython-input-21-0b842942ffac>", line 1, in <module>
    df.loc[df['Score'] > 8,'Score_T2B'] = 1
  File "C:\ProgramData\Anaconda3\lib\site-packages\modin\pandas\indexing.py", line 251, in __setitem__
    new_col[row_loc] = item
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 1244, in __setitem__
    setitem(key, value)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\series.py", line 1221, in setitem
    self.loc[key] = value
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 204, in __setitem__
    indexer = self._get_setitem_indexer(key)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 191, in _get_setitem_indexer
    return self._convert_to_indexer(key, axis=axis, is_setter=True)
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1285, in _convert_to_indexer
    return self._get_listlike_indexer(obj, axis, **kwargs)[1]
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1092, in _get_listlike_indexer
    keyarr, indexer, o._get_axis_number(axis), raise_missing=raise_missing
  File "C:\ProgramData\Anaconda3\lib\site-packages\pandas\core\indexing.py", line 1177, in _validate_read_indexer
    key=key, axis=self.obj._get_axis_name(axis)
KeyError: "None of [Index([False, False, False, False, False, False, False, False, False, False,\n ...\n False, False, False, False, False, False, False, False, False, False],\n dtype='object', length=169815)] are in the [index]"
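This should be a simple operation. Is there a workaround, or am I missing some setup step beyond importing modin and loading the data like this: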
import modin.pandas as pm
df = pm.read_csv(input_file, sep='\t', encoding='utf-8', low_memory=False)
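For reference, the only alternative I can think of is to build the column in one vectorized step, which avoids the boolean-mask assignment through .loc that appears in the traceback. This is a minimal sketch in plain pandas with made-up sample data; I am assuming the same expression behaves identically after switching the import to modin.pandas, which I have not confirmed on this modin version:

```python
import pandas as pd  # assumption: `import modin.pandas as pd` would behave the same here

# Hypothetical sample data standing in for the real tab-separated file
df = pd.DataFrame({"Score": [10, 9, 8, 3]})

# Score > 8 becomes 1, everything else becomes 0, without going through .loc
df["Score_T2B"] = (df["Score"] > 8).astype(int)

print(df)
```

Unlike the two .loc assignments, this also writes 0 for rows where Score is missing and yields an integer column, so it is only equivalent when every row has a score.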
Thanks a lot!