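I have a function that takes a list of URLs and posts to each URL with a set of headers. url_list can be around 25,000 items long, so I want to use multiprocessing. I have tried two approaches, and both failed: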

First approach: url_list is not passed correctly. The function only receives "h", the first character of a URL in url_list:

import multiprocessing
import requests

headers = {}
header_token = {}

def do_it(url_list):
    # Iterates over whatever it is given, one item at a time.
    for i in url_list:
        print "adding header to: \n" + i
        requests.post(i, headers=headers)
    print "done!"

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(do_it, url_list)
    pool.close()
    pool.join()

Traceback (most recent call last):
  File "head.py", line 95, in <module>
    pool.map(do_it, url_list)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 250, in map
    return self.map_async(func, iterable, chunksize).get()
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 554, in get
    raise self._value
requests.exceptions.MissingSchema: Invalid URL u'h': No schema supplied

Second approach: this is the one I would prefer, because I don't have to make the headers dict global. But I get a pickle error:

def wrapper(headers):
    def do_it(url_list):
        for i in url_list:
            print "adding header to: \n" + i
            requests.post(i, headers=headers)
        print "done!"
    return do_it

value = raw_input("Proceed? Enter [Y] for yes: ")
if value == "Y":
    pool = multiprocessing.Pool(processes=8)
    pool.map(wrapper(headers), url_list)
    pool.close()
    pool.join()

Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 761, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Traceback (most recent call last):
  File "/usr/lib64/python2.7/threading.py", line 808, in __bootstrap_inner
    self.run()
  File "/usr/lib64/python2.7/threading.py", line 761, in run
    self.__target(*self.__args, **self.__kwargs)
  File "/usr/lib64/python2.7/multiprocessing/pool.py", line 342, in _handle_tasks
    put(task)
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
2 Answers

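If you want to keep your second implementation, I think you should be able to use dill to serialize your wrapper function. Dill can serialize almost anything in Python. It also provides some good tools for working out what is causing pickling to fail when your code does fail. Dill has the same interface as Python's pickle, but provides some additional methods. If you want to serialize with dill in multiprocessing, all you have to do is: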

>>> import dill
>>> # your code goes here (as above)

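And if for some reason that doesn't work, you can swap multiprocessing for pathos.multiprocessing, which is built for multiprocessing with dill and provides a multi-*args map function (exactly like standard Python's map).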
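
If adding a dependency is not an option, there is also a standard-library route. This is a different technique from the dill approach above, offered only as a minimal sketch. It relies on two facts: Pool.map calls the function once per element of the iterable (so the worker should take a single URL, which is also why iterating inside do_it walks a URL string character by character and ends up at "h"), and Pool has to pickle the callable it sends to the workers. A nested function cannot be pickled by the standard pickle module, but a functools.partial around a module-level function can. The names add_header, make_closure, can_pickle and post_all, and the "X-Token" header used to try it out, are hypothetical, and add_header only returns what would be sent instead of calling requests.post.

```python
import functools
import multiprocessing
import pickle


def add_header(headers, url):
    # Stand-in for requests.post(url, headers=headers), so the sketch
    # runs without network access; it returns what would be sent.
    return (url, sorted(headers.items()))


def make_closure(headers):
    # Same shape as wrapper(headers) in the question: a nested function.
    def do_it(url):
        return add_header(headers, url)
    return do_it


def can_pickle(obj):
    # Pool.map must pickle the callable; this checks that step in isolation.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def post_all(url_list, headers, processes=2):
    # partial binds headers without a global and without a closure.
    worker = functools.partial(add_header, headers)
    pool = multiprocessing.Pool(processes=processes)
    try:
        # map() hands each worker call one URL from url_list.
        return pool.map(worker, url_list)
    finally:
        pool.close()
        pool.join()
```

Here can_pickle(make_closure(headers)) reports the same kind of failure as the traceback in the question, while can_pickle(functools.partial(add_header, headers)) succeeds. Calling post_all(url_list, headers) from under an `if __name__ == "__main__":` guard keeps it safe on platforms that start workers by spawning rather than forking.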

answered 2013-10-14T15:00:00.430

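You need to use a Queue from the multiprocessing package. The data type you pull from or add to needs to be both thread-safe and process-safe; a Queue is both.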

http://docs.python.org/2/library/queue.html

http://docs.python.org/2/library/multiprocessing.html#exchanging-objects-between-processes
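
A rough sketch of what the Queue approach might look like, under some assumptions: the names worker, run and STOP are hypothetical, None is used as a stop sentinel by convention only, and the worker records what it would send rather than calling requests.post. Each worker process pulls URLs from one queue and reports back on a second one.

```python
import multiprocessing

STOP = None  # hypothetical sentinel: tells a worker to exit


def worker(task_queue, result_queue, headers):
    # Pull URLs off task_queue until the sentinel arrives.
    while True:
        url = task_queue.get()
        if url is STOP:
            break
        # A real worker would call requests.post(url, headers=headers) here.
        result_queue.put((url, sorted(headers)))


def run(url_list, headers, processes=2):
    task_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()
    workers = [
        multiprocessing.Process(
            target=worker, args=(task_queue, result_queue, headers)
        )
        for _ in range(processes)
    ]
    for w in workers:
        w.start()
    for url in url_list:
        task_queue.put(url)
    # One sentinel per worker so every process sees a stop signal.
    for _ in workers:
        task_queue.put(STOP)
    # Collect results before join() so no worker blocks on a full pipe.
    results = [result_queue.get() for _ in url_list]
    for w in workers:
        w.join()
    return results
```

Results arrive in completion order, not input order, so each result carries its URL. As with Pool, run(url_list, headers) is best called from under an `if __name__ == "__main__":` guard.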

answered 2013-08-27T22:26:19.710