python - Extracting comments from news articles

Question

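My question is similar to the one asked here: https://stackoverflow.com/questions/14599485/news-website-comment-analysis I am trying to extract the comments from any news article. For example, I have a news URL here: http://www.cnn.com/2013/09/24/politics/un-obama-foreign-policy/ I am trying to use BeautifulSoup in Python to extract the comments. However, the comment section seems to be either embedded in an iframe or loaded through JavaScript. Viewing the source through Firebug does not reveal where the comment section comes from, but the browser's view-source feature does show the source of the comments. How can I extract the comments, especially when they come from a different URL embedded in the news page?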

This is what I have done so far, although it is not much:

    import urllib2
    from bs4 import BeautifulSoup

    opener = urllib2.build_opener()
    url = 'http://www.cnn.com/2013/08/28/health/stem-cell-brain/index.html'

    # Fetch the article page and parse it
    urlContent = opener.open(url).read()
    soup = BeautifulSoup(urlContent)
    title = soup.title.text
    print title

    # Write the text of the <body> element to a file
    body = soup.findAll('body')
    outfile = open("brain.txt", "w+")
    for i in body:
        i = i.text.encode('ascii', 'ignore')
        outfile.write(i + '\n')
    outfile.close()
Any help with what I need to do, or how to go about it, would be greatly appreciated.

score 0 · Accepted Answer

It is in an iframe. Look for the one with id="dsq2".

The iframe has a src attribute, which is a link to the actual site that holds the comments.

So in Beautiful Soup: css_soup.select("#dsq2"), then take the URL from the src attribute. It will lead you to a page that contains only the comments.

To get the actual comments, once you have fetched the page from src, you can use this CSS selector: .post-message p
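As a rough illustration of those two steps, here is a minimal sketch written for Python 3 with bs4. The helper names and the inline HTML are made up for the example and only stand in for the real pages. It also assumes the iframe is actually present in the HTML you hand to Beautiful Soup; if the page injects it with JavaScript, a plain HTTP fetch may not contain it.

```python
from bs4 import BeautifulSoup


def find_comment_iframe_src(article_html):
    """Return the src of the iframe with id="dsq2", or None if it is absent."""
    soup = BeautifulSoup(article_html, "html.parser")
    iframe = soup.select_one("#dsq2")
    if iframe is None:
        return None
    return iframe.get("src")


def extract_comments(comments_html):
    """Return the text of every paragraph matched by ".post-message p"."""
    soup = BeautifulSoup(comments_html, "html.parser")
    return [p.get_text(strip=True) for p in soup.select(".post-message p")]


# Hypothetical markup standing in for the article page and the comments page.
ARTICLE_HTML = (
    '<html><body>'
    '<iframe id="dsq2" src="http://example.com/embed/comments"></iframe>'
    '</body></html>'
)
COMMENTS_HTML = """
<div class="post-message"><p>First comment</p></div>
<div class="post-message"><p>Second comment</p></div>
"""

iframe_src = find_comment_iframe_src(ARTICLE_HTML)
comments = extract_comments(COMMENTS_HTML)
print(iframe_src)
print(comments)
```

In real use you would fetch the URL returned by find_comment_iframe_src and pass that page's HTML to extract_comments.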

If you want to load more comments, clicking the "more comments" button appears to send this request:

http://disqus.com/api/3.0/threads/listPostsThreaded?limit=50&thread=1660715220&forum=cnn&order=popular&cursor=2%3A0%3A0&api_key=E8Uh5l5fHZ6gD8U3KycjAIAk46f68Zw7C6eW8WSjZvCLXebZ7p0r1yrYDrLilk2F
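To page through further comments you would presumably repeat that request with a different cursor value. The sketch below (Python 3, standard library only) just takes the query string apart and rebuilds it with a new cursor. The with_cursor helper, the YOUR_API_KEY placeholder, and the "3:0:0" cursor are assumptions for illustration; the real next cursor would have to come from the API's response, which is not shown here.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

# Same request as above, with the API key replaced by a placeholder.
MORE_COMMENTS_URL = (
    "http://disqus.com/api/3.0/threads/listPostsThreaded"
    "?limit=50&thread=1660715220&forum=cnn&order=popular"
    "&cursor=2%3A0%3A0&api_key=YOUR_API_KEY"
)


def with_cursor(url, cursor):
    """Return url with its cursor query parameter replaced by cursor."""
    parts = urlsplit(url)
    params = parse_qs(parts.query)
    params["cursor"] = [cursor]
    new_query = urlencode(params, doseq=True)
    return urlunsplit(parts._replace(query=new_query))


# Inspect the parameters the "more comments" button sends.
params = parse_qs(urlsplit(MORE_COMMENTS_URL).query)
print(params["thread"], params["forum"], params["cursor"])

# Hypothetical next cursor; a real one would come from the previous response.
next_url = with_cursor(MORE_COMMENTS_URL, "3:0:0")
print(next_url)
```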
