
I'm using the Newspaper module for Python, found here.

The tutorials describe how you can pool the building of different newspapers so that they are generated at the same time (see "Multi-threading article downloads" in the link above).

Is there any way to do this for pulling articles straight from a LIST of URLs? That is, can I feed multiple URLs into the following set-up and have them downloaded and parsed concurrently?

from newspaper import Article
url = 'http://www.bbc.co.uk/zhongwen/simp/chinese_news/2012/12/121210_hongkong_politics.shtml'
a = Article(url, language='zh') # Chinese
a.download()
a.parse()
print(a.text[:150])

4 Answers


I know this question is really old, but it was the first link that came up when I googled how to do multithreaded newspaper downloads. While Kyle's answer (below) was very helpful, it is not complete and I think it has a few typos...

import newspaper

urls = [
'http://www.baltimorenews.net/index.php/sid/234363921',
'http://www.baltimorenews.net/index.php/sid/234323971',
'http://www.atlantanews.net/index.php/sid/234323891',
'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
]

class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        super(SingleSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=articleURL)]

sources = [SingleSource(articleURL=u) for u in urls]

# download every single-article source concurrently
newspaper.news_pool.set(sources)
newspaper.news_pool.join()

I changed StubSource to SingleSource and changed one of the URLs to articleURL. Of course this only downloads the web pages; you still need to parse them to get the text.

multi = []
for s in sources:
    try:
        s.articles[0].parse()
        multi.append(s.articles[0].text)
    except Exception:
        # skip articles that failed to download or parse
        pass

In my sample of 100 URLs, this took half the time compared with processing each URL sequentially. (Edit: after increasing the sample size to 2000, the reduction was about a quarter.)

(Edit: got the whole thing working with multithreading!) I used this very good explanation for my implementation. With a sample size of 100 URLs, using 4 threads takes a time comparable to the code above, but increasing the thread count to 10 cuts it roughly in half again. Larger sample sizes need more threads to produce a comparable difference.

import newspaper
from newspaper import Article
from multiprocessing.dummy import Pool as ThreadPool

def getTxt(url):
    article = Article(url)
    try:
        article.download()
        article.parse()
        return article.text
    except Exception:
        # return an empty string for articles that fail to download or parse
        return ""

pool = ThreadPool(10)

# open the urls in their own threads
# and return the results
results = pool.map(getTxt, urls)

# close the pool and wait for the work to finish 
pool.close() 
pool.join()
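
The same fan-out can also be written with the standard-library concurrent.futures module instead of multiprocessing.dummy; the sketch below is not part of the original answer and assumes the same urls list and newspaper Article API used above.

from concurrent.futures import ThreadPoolExecutor
from newspaper import Article

def get_txt(url):
    article = Article(url)
    try:
        article.download()
        article.parse()
        return article.text
    except Exception:
        return ""

# executor.map keeps results in the same order as the input urls
with ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(get_txt, urls))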
Answered 2018-08-07T16:10:30.183

I was able to do this by creating a Source for each article URL. (Disclaimer: not a Python developer.)

import newspaper

urls = [
  'http://www.baltimorenews.net/index.php/sid/234363921',
  'http://www.baltimorenews.net/index.php/sid/234323971',
  'http://www.atlantanews.net/index.php/sid/234323891',
  'http://www.wpbf.com/news/funeral-held-for-gabby-desouza/33874572',  
]

class SingleSource(newspaper.Source):
    def __init__(self, articleURL):
        super(SingleSource, self).__init__("http://localhost")
        self.articles = [newspaper.Article(url=articleURL)]

sources = [SingleSource(articleURL=u) for u in urls]

newspaper.news_pool.set(sources)
newspaper.news_pool.join()

for s in sources:
  print(s.articles[0].html)
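
As the newer answer above points out, this only downloads the HTML; getting the article text still requires a parse step. A minimal follow-up sketch, assuming the same objects:

for s in sources:
  s.articles[0].parse()
  print(s.articles[0].text[:150])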
Answered 2016-07-06T19:13:38.413

Building on Joseph Valls' answer. I assume the original poster wanted to use multithreading to extract a bunch of data and store it somewhere properly. After many attempts I think I have found a solution; it may not be the most efficient, but it works. I tried to make it better, but I think the newspaper3k plugin may be a little buggy. However, this works for extracting the desired elements into a DataFrame.

import newspaper
from newspaper import Article
from newspaper import Source
from newspaper import news_pool
import pandas as pd

gamespot_paper = newspaper.build('https://www.gamespot.com/news/', memoize_articles=False)
bbc_paper = newspaper.build("https://www.bbc.com/news", memoize_articles=False)
papers = [gamespot_paper, bbc_paper]

# download both sources concurrently, 4 threads per source
news_pool.set(papers, threads_per_source=4)
news_pool.join()

#Create our final dataframe
df_articles = pd.DataFrame()

#Limit the number of articles parsed per source
limit = 100

for source in papers:
    #temporary lists to store each element we want to extract
    list_title = []
    list_text = []
    list_source = []

    count = 0

    for article_extract in source.articles:
        #stop once the per-source limit is reached (checked before parsing to avoid extra work)
        if count >= limit:
            break

        article_extract.parse()

        #appending the elements we want to extract
        list_title.append(article_extract.title)
        list_text.append(article_extract.text)
        list_source.append(article_extract.source_url)

        #Update count
        count += 1

    df_temp = pd.DataFrame({'Title': list_title, 'Text': list_text, 'Source': list_source})
    #Append to the final DataFrame (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    df_articles = pd.concat([df_articles, df_temp], ignore_index=True)
    print('source extracted')
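
If the goal is to store the extracted data somewhere, the finished DataFrame can be written straight to disk; the filename below is just an example, not part of the original answer.

# persist the combined results (the path is arbitrary)
df_articles.to_csv('articles.csv', index=False)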

Please suggest any improvements!

Answered 2020-04-09T00:54:40.593

I'm not familiar with the Newspaper module, but the following code uses a list of URLs and should be equivalent to the one provided in the linked page:

import newspaper
from newspaper import news_pool

urls = ['http://slate.com', 'http://techcrunch.com', 'http://espn.com']
papers = [newspaper.build(i) for i in urls]

# download every built source concurrently, two threads per source
news_pool.set(papers, threads_per_source=2)
news_pool.join()
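
Note that the pool only performs the downloads; to get at the text afterwards you still iterate over each built source and parse its articles. A rough sketch, not from the original answer, assuming the same newspaper API:

for paper in papers:
    for article in paper.articles[:5]:  # first few articles per source, as an example
        article.parse()
        print(article.title)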
Answered 2016-05-25T04:11:20.227