I know this question is really old, but it was the first link that came up when I googled how to do multithreaded downloads with newspaper. While Kyle's answer was helpful, it is incomplete and I think it has a few typos...
import newspaper

urls = [
    'http://www.baltimorenews.net/index.php/sid/234363921',
    'http://www.baltimorenews.net/index.php/sid/234323971',
    'http://www.atlantanews.net/index.php/sid/234323891',
    'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',
]

class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        super(SingleSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=articleURL)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()
I changed StubSource to SingleSource and changed one of the urls to articleURL. Of course this only downloads the web pages; you still need to parse them to get the text.
multi = []
for s in sources:
    try:
        # the articles were already downloaded by news_pool; parse to get the text
        s.articles[0].parse()
        txt = s.articles[0].text
        multi.append(txt)
    except Exception:
        pass
On my sample of 100 urls, this took half the time compared to just processing each url sequentially. (Edit: after increasing the sample size to 2000, the reduction is about a quarter.)
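For anyone who wants to reproduce the comparison, here is a minimal timing sketch. The sequential baseline is my own assumption of what "processing each url sequentially" means, not the original benchmark code; it reuses urls and SingleSource from above:

import time
import newspaper

# sequential baseline: download and parse one article at a time
start = time.time()
sequential = []
for u in urls:
    a = newspaper.Article(url=u)
    try:
        a.download()
        a.parse()
        sequential.append(a.text)
    except Exception:
        pass
print("sequential: %.1f s" % (time.time() - start))

# pooled version: the SingleSource trick from above
start = time.time()
sources = [SingleSource(articleURL=u) for u in urls]
newspaper.news_pool.set(sources)
newspaper.news_pool.join()
print("news_pool:  %.1f s" % (time.time() - start))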
(Edit: got the whole thing working with multithreading!) I used this really good explanation for my implementation. With a sample size of 100 urls, using 4 threads takes a comparable time to the code above, but increasing the thread count to 10 reduces it by about half again. Larger sample sizes need more threads to produce a comparable difference.
from newspaper import Article
from multiprocessing.dummy import Pool as ThreadPool

def getTxt(url):
    article = Article(url)
    try:
        article.download()
        article.parse()
        txt = article.text
        return txt
    except Exception:
        return ""

pool = ThreadPool(10)

# open the urls in their own threads
# and return the results
results = pool.map(getTxt, urls)

# close the pool and wait for the work to finish
pool.close()
pool.join()
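If you would rather avoid multiprocessing.dummy, the same pattern can be written with the standard library's concurrent.futures. This is an equivalent sketch under that assumption, not part of the original answer:

from concurrent.futures import ThreadPoolExecutor
from newspaper import Article

def get_txt(url):
    # download and parse one article, returning "" on failure
    article = Article(url)
    try:
        article.download()
        article.parse()
        return article.text
    except Exception:
        return ""

# map the urls across 10 worker threads; results keep the input order
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(get_txt, urls))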