python - How to parse only a specific category of a website with the newspaper library?

Question

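I am using Python 3 and the newspaper library. The documentation says this library can create a Source object, which is an abstraction of a news website. But what if I only need the abstraction of one category?

For example, when I use this URL, I want to get all the articles in the 'technology' category. Instead, I get articles from 'politics'.

I think that when it creates a Source object, newspaper only uses the domain name (in my case, www.kyivpost.com).

Is there a way to make it work with URLs like http://www.kyivpost.com/technology/?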

score 0 · Accepted Answer

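newspaper will use a website's RSS feed when one is available; KyivPost has only one RSS feed and publishes mostly articles about politics, which is why your result set is mostly politics.

You may have better luck using BeautifulSoup to pull the article URLs specifically from the technology page and feeding them directly to newspaper.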
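A rough sketch of that idea follows; it is not a tested scraper for KyivPost. It assumes article URLs live under `/technology/` on the same site, and the names `LinkCollector`, `category_links` and `fetch_articles` are made up for illustration. To stay dependency-free, the link extraction uses the standard library's `html.parser` where BeautifulSoup's `soup.find_all('a')` would do the same job; only the last step needs newspaper itself.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects the href attribute of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def category_links(html, page_url, category):
    """Return absolute, de-duplicated links whose path lies below /<category>/.

    Assumption: article URLs look like /<category>/<slug>. Adjust the
    filter if the site uses a different URL scheme.
    """
    collector = LinkCollector()
    collector.feed(html)
    prefix = f"/{category}/"
    seen, result = set(), []
    for href in collector.hrefs:
        url = urljoin(page_url, href)  # resolve relative links
        path = urlparse(url).path
        # Require something after the prefix so the listing page itself is skipped.
        if path.startswith(prefix) and len(path) > len(prefix) and url not in seen:
            seen.add(url)
            result.append(url)
    return result


def fetch_articles(urls):
    """Download and parse each URL with newspaper's Article class."""
    from newspaper import Article  # third-party: pip install newspaper3k

    articles = []
    for url in urls:
        article = Article(url)
        article.download()
        article.parse()
        articles.append(article)
    return articles
```

To use it, download the HTML of http://www.kyivpost.com/technology/ (for example with `requests.get(...).text`), pass it to `category_links(html, page_url, "technology")`, and hand the resulting list to `fetch_articles`; each returned object then exposes `title` and `text`.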

score 0 · Accepted Answer

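I know this is a bit old. But in case someone is still looking for something like this: you can first get all the anchor tag elements and filter the links with a regex, then request each link to get the article plus the data you need. I am pasting some sample code; change the soup elements as needed for your page: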
'''

"""
Created on Tue Jan 21 10:10:02 2020

@author: prakh
"""

import re

import pandas as pd
import requests
from bs4 import BeautifulSoup

final_url = 'https://www.kyivpost.com/technology'

news_data = []
# Defined up front so the loop below still runs if the listing request fails.
selected_files = []

try:
    page = requests.get(final_url)
    soup = BeautifulSoup(page.text, 'html.parser')

    # Container that holds the article listing on the category page.
    last_links = soup.find(class_='filter-results-archive')

    # Collect every href in the listing, skipping anchors that have none.
    links = [a.get('href') for a in last_links.find_all('a')]
    links = [link for link in links if link is not None]

    # search() rather than match(): match() only tests the start of the
    # string, so it would reject absolute URLs that begin with "https://".
    regex = re.compile(r'technology')
    selected_files = [link for link in links if regex.search(link)]
except Exception as e:
    print(e)
    print("continuing....")

for url in selected_files:
    # Assumes URLs of the form .../<category>/<slug>.
    news_category = url.split('/')[-2]
    try:
        data = requests.get(url)
        soup = BeautifulSoup(data.content, 'html.parser')

        # Paragraphs of the article body.
        body = soup.find(id='printableAreaContent')
        paragraphs = body.find_all('p')
        headline = soup.find('h1', attrs={'class': 'post-title'})

        # Optional metadata, if the page provides these meta tags:
        # metadate = soup.find('meta', attrs={'name': 'publish-date'})['content']
        # metaauthor = soup.find('meta', attrs={'name': 'twitter:creator'})['content']
        news_data.append({
            'news_headline': headline.get_text(strip=True),
            # Store plain text instead of a list of Tag objects.
            'news_article': '\n'.join(p.get_text() for p in paragraphs),
            # 'news_author': metaauthor,
            # 'news_date': metadate,
            'news_category': news_category,
        })
    except Exception as e:
        print(e)
        print("continuing....")
        continue

df = pd.DataFrame(news_data)
'''
