
I am trying to speed up my code using the parallel processing of the modin library.

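I tried to do this on my Windows 10 machine with the dask engine, but it did not work; I think that is because it is still under development. I read that you cannot use the ray engine on Windows, so I ran a simple example on a free AWS Ubuntu server to check how the library works.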

When I try to install the modin package, after successfully installing the ray and pandas packages, I get the following error:

ERROR: Could not find a version that satisfies the requirement pandas==1.0.3 (from versions: 0.1, 0.2b0, 0.2b1, 0.2, 0.3.0b0, 0.3.0b2, 0.3.0, 0.4.0, 0.4.1, 0.4.2, 0.4.3, 0.5.0, 0.6.0, 0.6.1, 0.7.0rc1, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0rc1, 0.8.0rc2, 0.8.0, 0.8.1, 0.9.0, 0.9.1, 0.10.0, 0.10.1, 0.11.0, 0.12.0, 0.13.0, 0.13.1, 0.14.0, 0.14.1, 0.15.0, 0.15.1, 0.15.2, 0.16.0, 0.16.1, 0.16.2, 0.17.0, 0.17.1, 0.18.0, 0.18.1, 0.19.0rc1, 0.19.0, 0.19.1, 0.19.2, 0.20.0rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.21.0rc1, 0.21.0, 0.21.1, 0.22.0, 0.23.0rc2, 0.23.0, 0.23.1, 0.23.2, 0.23.3, 0.23.4, 0.24.0rc1, 0.24.0, 0.24.1, 0.24.2)
ERROR: No matching distribution found for pandas==1.0.3

If I run pip3 install -vvv modin in the terminal to get the log, I get:

Exception information:
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/cli/base_command.py", line 188, in _main
    status = self.run(options, args)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/cli/req_command.py", line 185, in wrapper
    return func(self, options, args)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/commands/install.py", line 333, in run
    reqs, check_supported_wheels=not options.target_dir
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 179, in resolve
    discovered_reqs.extend(self._resolve_one(requirement_set, req))
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 362, in _resolve_one
    abstract_dist = self._get_abstract_dist_for(req_to_install)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 313, in _get_abstract_dist_for
    self._populate_link(req)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/resolution/legacy/resolver.py", line 279, in _populate_link
    req.link = self.finder.find_requirement(req, upgrade)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/pip/_internal/index/package_finder.py", line 930, in find_requirement
    req)
pip._internal.exceptions.DistributionNotFound: No matching distribution found for pandas==1.0.3 (from modin)
Removed build tracker: '/tmp/pip-req-tracker-oklngevc'

How can I fix this?

The script I want to run to check how it works is:

import os
os.environ["MODIN_ENGINE"] = "ray"  # Modin will use Ray
import modin.pandas as pd
import time
import pandas as pn

start_time = time.time()
datos = pd.read_csv('datospruebaAWS.csv', header=None, index_col=0)
end_time = time.time()
print("time read csv parallel=", end_time - start_time)

start_time = time.time()
datos = pn.read_csv('datospruebaAWS.csv', header=None, index_col=0)
end_time = time.time()
print("time read csv=", end_time - start_time)

One of the scripts I want to speed up, just by changing import pandas as pd to import modin.pandas as pd, is:

import pandas as pd
import glob
import time

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]

start_time = time.time()
cookies = []
for file in all_filenames:
    datos = pd.read_csv(file, header=None, index_col=0)
    datos.index.name = 'CookieID'
    print('leido')
    for i in range(len(datos)):
        if datos[2].iloc[i].find('golf') != -1:
            cookies.append(datos.index[i])
    print('cookies')
    print(len(cookies))
    del datos

end_time = time.time()
print("time=", end_time - start_time)

cookies = pd.Series(cookies)
cookies = cookies.unique()
cookies = pd.DataFrame(cookies)
cookies['Owner ID'] = ['Les gusta el golf']*len(cookies)
cookies.to_csv('DMP_golf.txt', header=False, index=False, sep='\t')

Because the folder contains many large csv files, the script takes several hours to finish.

Also, is there any other way to speed up this code?


1 Answer


It looks like Pandas 1.0.3 does not support Python 3.5, which is what you are using. See the "Version" column at https://pypi.org/project/pandas/1.0.3/#files.
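
The paths in the traceback (.../python3.5/site-packages/pip/...) show that pip is running under Python 3.5, and pip only offers releases whose declared Python requirement matches the running interpreter. A quick way to confirm this on the server is sketched below. It assumes that pandas 1.0.x declares a minimum of Python 3.6.1 (worth double-checking on the PyPI page), and the helper name supports_pandas_1_0 is made up for illustration. It avoids f-strings so that it still runs on Python 3.5.

```python
import sys

# Assumption: pandas 1.0.x declares "Requires: Python >=3.6.1" on PyPI.
MIN_PYTHON_FOR_PANDAS_1_0 = (3, 6, 1)


def supports_pandas_1_0(version_info=sys.version_info):
    """Return True if the given interpreter version is new enough for pandas 1.0.x.

    Hypothetical helper for illustration; not part of pip, pandas or modin.
    """
    return tuple(version_info[:3]) >= MIN_PYTHON_FOR_PANDAS_1_0


# Report the interpreter that is actually running this script.
print("Python {}.{}.{}".format(*sys.version_info[:3]))
print("New enough for pandas 1.0.x:", supports_pandas_1_0())
```

If this reports False, installing modin under a newer interpreter (Python 3.6.1 or later, under the assumption above) should let pip resolve pandas==1.0.3.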

Answered 2020-07-16T16:55:05.353
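
On the second question (other ways to speed up the script): independently of modin, the inner loop over range(len(datos)) calls .iloc once per row, which tends to be slow in pandas. A possible alternative is a vectorized substring test on the whole column. The sketch below is not the asker's code; the function names matching_ids and collect_unique_ids are hypothetical, and it assumes column 2 holds text (possibly with missing values).

```python
import pandas as pd


def matching_ids(datos, keyword="golf", text_col=2):
    """Return the index labels of rows whose text column contains `keyword`.

    Hypothetical helper; assumes `text_col` holds strings (missing values allowed).
    """
    # regex=False: plain substring match, equivalent to str.find(...) != -1.
    # na=False: rows with a missing value are treated as "no match".
    mask = datos[text_col].str.contains(keyword, regex=False, na=False)
    return datos.index[mask.values]


def collect_unique_ids(paths, keyword="golf", text_col=2):
    """Read each CSV in `paths` and return the unique matching index labels."""
    found = []
    for path in paths:
        # Same read_csv arguments as in the question's script.
        datos = pd.read_csv(path, header=None, index_col=0)
        found.extend(matching_ids(datos, keyword, text_col))
    # unique() keeps the order of first appearance.
    return pd.Series(found).unique()
```

It could be called as collect_unique_ids(glob.glob('*.csv')), with the result written out the same way as in the original script. How much time this saves depends on the data, so it is worth timing on one file first.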