1

如何将报纸库用于需要身份验证的网站?我正在使用报纸 3k库从不同的新闻网站下载几篇文章的 html(到目前为止工作得很好)。但是,由于我需要完整的内容,我需要在请求 html 之前进行身份验证(用户名、密码)。我将不胜感激任何正确方向的指示!

我认为这必须在我使用报纸.build() 之前发生?

(此时我只想说,这是我第一次用 python 编码(或者只是一般地编码任何东西)所以任何帮助都会很棒)

import newspaper #import newspaper library
from newspaper import news_pool

guardian = newspaper.build('https://www.theguardian.com/uk-news/all', language='en', memoize_articles=True)
telegraph = newspaper.build('https://www.telegraph.co.uk/news/uk/', language='en', memoize_articles=True)
dagbladet = newspaper.build('https://www.svd.se/sverige', language='sv', memoize_articles=True)
dagensnyheter = newspaper.build('https://www.dn.se/nyheter/sverige/', language='sv', memoize_articles=True)

allpapers = [guardian, telegraph, dagbladet, dagensnyheter]

for papers in allpapers:
    newpathpaper = r'/Users/articles/' + today + "/" + naming #naming is just a variable from further up that gives the name of each newspaper 
    if not os.path.exists(newpathpaper):
        os.makedirs(newpathpaper)

    #parsing, downloading and creating files for articles
    pointer = 0
    while(papers.size() > pointer):
        papers_article = papers.articles[pointer]
        papers_article.download()
        if papers_article.download_state == 2: #checking if article has been downloaded
            time.sleep(2)
            papers_article.parse()
            print(papers_article.url)

            #receiving publishing date so it is comparable
            published_today = papers_article.publish_date #newspaper extractor
            published = str(published_today)[0:10] 

            #writing html
            if published == today: #today was declared earlier
                 f = open('articles/%s/%s/%s_article_%s.html' %(today, naming, naming, pointer), 'w+') #writing html file
                f.write(papers_article.html)
                print("written successfully")
                count_writes +=1
            else:
                print("not from today")

        else: 
            print("article %s" %pointer)
            print(papers_article.url)
            print("Has not downloaded!")
        pointer += 1
4

0 回答 0