python - Web crawler for Facebook in Python

Question

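I am trying to use a web crawler in Python to print the number of Facebook recommendations for a page. For example, this Sky News article (http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine) has about 60 Facebook recommendations. I want to print that number from a Python program using a web crawler. I tried the following, but it prints nothing: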

import requests
from bs4 import BeautifulSoup

def get_single_item_data(item_url):
    source_code = requests.get(item_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, 'html.parser')  # name the parser explicitly
    # print the text of every span that should hold the recommendation count
    for item_name in soup.findAll('span', {'class': 'pluginCountTextDisconnected'}):
        try:
            print(item_name.string)
        except Exception:
            print("error")

get_single_item_data("http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine")

score 3 · Accepted Answer

The Facebook recommendation count is loaded inside an iframe. You can follow the iframe's src attribute to that page and then read the text of span.pluginCountTextDisconnected:

import requests
from bs4 import BeautifulSoup

url = 'http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine'
r = requests.get(url) # get the page through requests
soup = BeautifulSoup(r.text, 'html.parser') # create a BeautifulSoup object from the page's HTML

url = soup('iframe')[0]['src'] # take the first iframe element and get its src attribute
r = requests.get('http://' + url[2:]) # request the iframe page; src starts with //, so swap that for http://
soup = BeautifulSoup(r.text, 'html.parser') # create another BeautifulSoup object from the iframe's HTML

print(soup.find('span', class_='pluginCountTextDisconnected').string) # get the directed information

The second requests.get is written that way because the src attribute comes back as //www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnews.sky.com%2Fstory%2F1330046&send=false&layout=button_count&width=120&show_faces=false&action=recommend&colorscheme=light&font=arial&height=21. I prepended http:// and dropped the leading //.
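As an alternative to slicing the string by hand, a protocol-relative src can be resolved against the page URL with the standard library's urljoin. This is only a sketch: the helper name resolve_iframe_src is made up for illustration, and the src value is shortened from the one quoted above.

```python
from urllib.parse import urljoin

def resolve_iframe_src(page_url, src):
    # urljoin fills in whatever the src is missing (here, the scheme)
    # using the URL of the page that contained the iframe.
    return urljoin(page_url, src)

page_url = "http://news.sky.com/story/1330046/are-putins-little-green-men-back-in-ukraine"
# Shortened version of the src attribute quoted above (assumption: extra query parameters omitted).
src = "//www.facebook.com/plugins/like.php?href=http%3A%2F%2Fnews.sky.com%2Fstory%2F1330046"

print(resolve_iframe_src(page_url, src))
```

The same call also leaves an already absolute src untouched, so it should be safe to apply to every iframe URL before passing it to requests.get.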

BeautifulSoup documentation
Requests documentation

score 2 · Accepted Answer

Facebook recommendations are loaded dynamically by JavaScript, so your HTML parser never sees them. You will need to use the Graph API and FQL to get the answer directly from Facebook.
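To illustrate why a parser misses script-generated markup, here is a small sketch using only the standard library's html.parser. The two HTML snippets are invented for the demonstration and are not taken from the Sky News page: one contains the span directly, the other only contains a script that would create it in a browser.

```python
from html.parser import HTMLParser

class SpanClassCollector(HTMLParser):
    """Collect the class attribute of every <span> found in static markup."""

    def __init__(self):
        super().__init__()
        self.span_classes = []

    def handle_starttag(self, tag, attrs):
        if tag == "span":
            self.span_classes.append(dict(attrs).get("class"))

def span_classes(html):
    collector = SpanClassCollector()
    collector.feed(html)
    collector.close()
    return collector.span_classes

# Hypothetical page where the span is already in the HTML.
STATIC_PAGE = '<body><span class="pluginCountTextDisconnected">60</span></body>'

# Hypothetical page where a script would add the span in a browser.
SCRIPTED_PAGE = (
    "<body><script>"
    "var s = document.createElement('span');"
    "s.className = 'pluginCountTextDisconnected';"
    "document.body.appendChild(s);"
    "</script></body>"
)

print(span_classes(STATIC_PAGE))    # the span is found
print(span_classes(SCRIPTED_PAGE))  # nothing is found: the script is never executed
```

The parser treats the script body as plain text, so an element that only exists after the script runs cannot be matched by a search over the downloaded HTML.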

Here is a web console where you can explore queries once you have generated an access token.
