I am trying to speed up my code using the parallel processing provided by the modin library.
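I first tried this on my Windows 10 machine with the dask engine, but it did not work, which I think is because that engine is still under development. I read that the ray engine cannot be used on Windows, so I am trying to run a simple example on a free AWS Ubuntu server to check how the library behaves.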
When I try to install the modin package, after successfully installing the ray and pandas packages, I get the following error:
ERROR: Could not find a version that satisfies the requirement pandas==1.0.3 (from versions: 0.1, 0.2b0, 0.2b1, 0.2, 0.3.0b0, 0.3.0b2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0rc1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0rc1, 0.8.0rc2, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0rc1, 0.19.0, 0.19.1, 0.19.2, 0.20.0rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0rc1, 0.21.0, 0.21.1, 0.22.0, 0.23.0rc2, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0rc1, 0.24.0, 0.24.1, 0.24.2)
ERROR: No matching distribution found for pandas==1.0.3
If I type pip3 install -vvv modin in the terminal to get the log, I get:
Exception information:
Traceback (most recent call last):
File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/cli/base_command.py", line 188, in _main
status = self.run(options, args)
File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/cli/req_command.py", line 185, in wrapper
return func(self, options, args)
File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/commands/install.py", line 333, in run
reqs, check_supported_wheels=not options.target_dir
File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 179, in resolve
discovered_reqs.extend(self._resolve_one(requirement_set, req))
File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 362, in _resolve_one
abstract_dist = self._get_abstract_dist_for(req_to_install)
File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 313, in _get_abstract_dist_for
self._populate_link(req)
File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 279, in _populate_link
req.link = self.finder.find_requirement(req, upgrade)
File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/index/package_finder.py", line 930, in find_requirement
req)
pip._internal.exceptions.DistributionNotFound: No matching distribution found for pandas==1.0.3 (from modin)
Removed build tracker: '/tmp/pip-req-tracker-oklngevc'
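One thing that may be relevant: the paths in the traceback point to python3.5, and as far as I know the pandas 1.0.x releases are only published for Python 3.6.1 or newer, which would explain why pip only offers versions up to 0.24.2. A minimal sketch to check the interpreter, where the helper name supports_pandas_1_0 and the constant are made up for illustration:

```python
import sys

# Assumption: pandas 1.0.x declares Python >= 3.6.1 as its minimum version.
MIN_PYTHON_FOR_PANDAS_1_0 = (3, 6, 1)


def supports_pandas_1_0(version_info=sys.version_info):
    """Hypothetical helper: True if the interpreter is new enough for pandas 1.0.x."""
    return tuple(version_info[:3]) >= MIN_PYTHON_FOR_PANDAS_1_0


if __name__ == "__main__":
    # Print the running interpreter version and whether it meets the assumed minimum.
    print("Python", sys.version.split()[0])
    print("new enough for pandas 1.0.x:", supports_pandas_1_0())
```

If this prints False on the server, installing modin under a newer Python 3 interpreter would be the first thing to try.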
How can I solve this problem?
The script I want to run to check how it works is:
import os
os.environ["MODIN_ENGINE"] = "ray" # Modin will use Ray
import modin.pandas as pd
import time
import pandas as pn
start_time = time.time()
datos = pd.read_csv('datospruebaAWS.csv', header=None, index_col=0)
end_time = time.time()
print("time read csv parallel=", end_time - start_time)
start_time = time.time()
datos = pn.read_csv('datospruebaAWS.csv', header=None, index_col=0)
end_time = time.time()
print("time read csv=", end_time - start_time)
One of the scripts I want to speed up, just by changing import pandas as pd to import modin.pandas as pd, is:
import pandas as pd
import glob
import time

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

start_time = time.time()
cookies = []
for file in all_filenames:
    # The first column of each file is the cookie ID and becomes the index
    datos = pd.read_csv(file, header=None, index_col=0)
    datos.index.name = 'CookieID'
    print('leido')
    # Check every row and keep the IDs whose column 2 contains 'golf'
    for i in range(len(datos)):
        if datos[2].iloc[i].find('golf') != -1:
            cookies.append(datos.index[i])
    print('cookies')
    print(len(cookies))
    del datos
end_time = time.time()
print("time=", end_time - start_time)

# Remove duplicate IDs and write them out with a fixed label
cookies = pd.Series(cookies)
cookies = cookies.unique()
cookies = pd.DataFrame(cookies)
cookies['Owner ID'] = ['Les gusta el golf']*len(cookies)
cookies.to_csv('DMP_golf.txt', header=False, index=False, sep='\t')
Because the folder contains many large csv files, the script takes several hours to produce the result.
Also, is there any other way to speed up this code?
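For example, I suspect the row-by-row loop with .iloc is the slow part, so one idea would be to replace it with a vectorized string match. This is only a sketch under that assumption: the helper name golf_cookie_ids is made up, and the small DataFrame is invented data standing in for one of my csv files.

```python
import pandas as pd


def golf_cookie_ids(datos, column=2, keyword='golf'):
    """Hypothetical helper: index values of the rows whose `column` contains `keyword`."""
    # astype(str) turns missing values into the text 'nan', so they simply do not match
    mask = datos[column].astype(str).str.contains(keyword, regex=False)
    return datos.index[mask.values]


# Made-up data with the same layout as read_csv(..., header=None, index_col=0):
# the index holds the cookie IDs and the remaining columns are labelled 1, 2, ...
datos = pd.DataFrame(
    {1: ['a', 'b', 'c'], 2: ['golf club', 'tennis', 'minigolf']},
    index=pd.Index(['id1', 'id2', 'id3'], name='CookieID'),
)

found = golf_cookie_ids(datos).unique()
print(list(found))
```

Inside the loop over files, the result of golf_cookie_ids(datos) could be added to the cookies list with cookies.extend(...) in place of the inner for loop, but I have not measured how much time this saves on the real files.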