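I need to scrape the author and publication date from news articles, but I can't access some of the information stored in the meta tags.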
import requests, random, re, os
from bs4 import BeautifulSoup as bs
import urllib.parse
import time
from newspaper import Article

url = [
    'https://www.wsj.com/articles/covid-19-is-dividing-the-american-worker-11598068859?mod=hp_lead_pos7',
]

##WALL STREET JOURNAL
for link in url:
    #Try 1
    #Get the published date -- this is where I have problems.
    webpage = requests.get(link)
    soup = bs(webpage.text, "html.parser")
    date = soup.find("meta", {"name": "article.published"})
    print(date)

    #Try 2
    #Access date from the <time> tag instead
    for tag in soup.find_all('time', {"class": "timestamp article__timestamp flexbox__flex--1"}):
        date = tag.text
        print(date)

    #Get the author name -- this part works
    article = Article(link, language='en')
    article.download()
    article.parse()
    # print(article.html)
    author = article.authors
    date = article.publish_date
    author = author[0]
    day_month = str("Check Date")
    print(day_month + "," + "," + "," + str(author))
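When I print the soup, I can see the meta tags in the output, so I know they are there, but I can't seem to access them with either method.
This is the output I've gotten so far: None Check Date,,,Christopher Mims
Any ideas?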
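
One way to narrow this down might be to list every meta tag in the HTML that was actually downloaded, independently of BeautifulSoup, and see which attribute (name, property, or itemprop) the date is stored under. The following is only a rough sketch using the standard library's html.parser as a stand-in. MetaCollector, collect_meta, and SAMPLE_HTML are hypothetical names introduced for illustration, and the sample markup and its values are assumptions, not a copy of the real page.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect <meta> tags as {key: content}, keyed by name/property/itemprop."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this.
        if tag != "meta":
            return
        attrs = dict(attrs)
        key = attrs.get("name") or attrs.get("property") or attrs.get("itemprop")
        if key and attrs.get("content") is not None:
            self.meta[key] = attrs["content"]


def collect_meta(html):
    """Return a dict of all keyed meta tags found in an HTML string."""
    parser = MetaCollector()
    parser.feed(html)
    parser.close()
    return parser.meta


# Hypothetical sample markup; the real page's attributes and values may differ.
SAMPLE_HTML = """
<html><head>
<meta name="article.published" content="2020-08-22T04:00:00.000Z">
<meta property="og:title" content="Sample headline">
<meta name="author" content="Sample Author">
</head><body></body></html>
"""

meta = collect_meta(SAMPLE_HTML)
print(meta.get("article.published"))
```

In the script above, calling collect_meta(webpage.text) and printing the resulting dict should show whether the article.published tag is present in the response that requests received, and under which attribute. If the key is missing there, the problem is probably in the downloaded HTML rather than in the lookup itself.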