python - Scraping dynamic/JavaScript-generated websites with Python/Selenium

Question

I am trying to scrape this website:

http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210

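using Python and Selenium (see the code below). The content is generated dynamically, and apparently data that is not visible in the browser is not loaded. I tried making the browser window larger and scrolling to the bottom of the page. Enlarging the window gets me all the data I want in the horizontal direction, but there is still a lot of data left to scrape in the vertical direction. Scrolling does not seem to work at all.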

Does anyone have a good idea of how to do this?

Thanks!

import csv
import re
import time

from bs4 import BeautifulSoup
from selenium import webdriver

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
driver = webdriver.Firefox()
driver.get(url)
driver.set_window_position(0, 0)
driver.set_window_size(100000, 200000)
driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")

time.sleep(5)  # wait for the page to load

soup = BeautifulSoup(driver.page_source, "html.parser")

table = soup.find("table", {"id": "DataTable"})

# Collect the text of every cell (td or th) in each body row
tbody = table.find("tbody")
rows = []
for row in tbody.find_all("tr"):
    rows.append([val.text.encode("ascii", "ignore").decode("ascii")
                 for val in row.find_all(re.compile("td|th"))])

# Write the rows out as CSV
with open("body.csv", "w", newline="") as test_file:
    file_writer = csv.writer(test_file)
    for row in rows:
        file_writer.writerow(row)

score 5 · Accepted Answer

This lets you save the entire CSV to disk automatically, but I have not found a robust way to determine when the download has finished:

import os
import contextlib
import selenium.webdriver as webdriver
import csv
import time

url = "http://stats.uis.unesco.org/unesco/TableViewer/tableView.aspx?ReportId=210"
download_dir = '/tmp'
fp = webdriver.FirefoxProfile()
fp.set_preference("browser.download.dir", download_dir)
# 2 means "use the custom directory given in browser.download.dir"
fp.set_preference("browser.download.folderList", 2)
fp.set_preference("browser.download.manager.showWhenStarting", False)
fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

# driver = webdriver.Firefox(firefox_profile=fp)
with contextlib.closing(webdriver.Firefox(firefox_profile=fp)) as driver:
    driver.get(url)
    driver.execute_script("onDownload(2);")
    csvfile = os.path.join(download_dir, 'download.csv')

    # Wait for the download to complete
    time.sleep(10)
    with open(csvfile, newline='') as f:
        for line in csv.reader(f, delimiter=','):
            print(line)
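
Regarding the open problem of knowing when the download has finished, one possible approach is to poll the file system instead of sleeping for a fixed time. The sketch below is not part of the original answer: wait_for_download is a hypothetical helper, and it assumes Firefox keeps a "<name>.part" file next to the target while the download is in progress, which may not hold for every Firefox version or configuration.

```python
import os
import time


def wait_for_download(path, timeout=60.0, poll=0.5):
    """Return True once `path` looks fully downloaded, False on timeout.

    Assumption: an in-progress download leaves a `<path>.part` sibling.
    The file is treated as complete when it exists, is non-empty, has no
    `.part` sibling, and its size is unchanged between two polls.
    """
    deadline = time.time() + timeout
    last_size = -1
    while time.time() < deadline:
        if os.path.exists(path) and not os.path.exists(path + ".part"):
            size = os.path.getsize(path)
            if size > 0 and size == last_size:
                return True
            last_size = size
        else:
            # Missing file or download still in progress: start over
            last_size = -1
        time.sleep(poll)
    return False
```

In the script above, a call such as wait_for_download(csvfile) could take the place of time.sleep(10), with the CSV being read only if it returns True.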

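Explanation:

Point your browser at the url. You will see an Actions menu containing an option "Download report data..." with a sub-option named "Comma-delimited ASCII format (*.csv)". If you inspect the HTML for these words, you will find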

"Comma-delimited ASCII format (*.csv)","","javascript:onDownload(2);"

So naturally you might try having webdriver execute the JavaScript function call onDownload(2). We can do that with

driver.execute_script("onDownload(2);")

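but normally another window then pops up asking whether you want to save the file. To save to disk automatically, I used the method described in this FAQ. The tricky part is finding the correct MIME type to specify on this line: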

fp.set_preference("browser.helperApps.neverAsk.saveToDisk", "application/x-csv")

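The curl method described in the FAQ does not work here, because we do not have a URL for the CSV file. However, this page describes another way to find the MIME type: use a Firefox browser to open the save dialog and check the box "Do this automatically for files like this". Then inspect the last few lines of ~/.mozilla/firefox/*/mimeTypes.rdf for the most recently added description: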

  <RDF:Description RDF:about="urn:mimetype:handler:application/x-csv"
                   NC:alwaysAsk="false"
                   NC:saveToDisk="true">
    <NC:externalApplication RDF:resource="urn:mimetype:externalApplication:application/x-csv"/>
  </RDF:Description>

This tells us the MIME type is "application/x-csv". Bingo, we are in business.
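
Reading the RDF file by eye works, but the lookup can also be scripted. The following is a minimal sketch, not part of the original answer: handler_mime_types is a hypothetical helper that pulls the MIME type out of every "urn:mimetype:handler:" entry with a regular expression, assuming the entries look like the excerpt above.

```python
import re

# Matches handler entries such as
# RDF:about="urn:mimetype:handler:application/x-csv"
HANDLER_RE = re.compile(r'RDF:about="urn:mimetype:handler:([^"]+)"')


def handler_mime_types(rdf_text):
    """Return the MIME types of all handler entries, in file order."""
    return HANDLER_RE.findall(rdf_text)


# The excerpt from mimeTypes.rdf shown above
sample = '''
  <RDF:Description RDF:about="urn:mimetype:handler:application/x-csv"
                   NC:alwaysAsk="false"
                   NC:saveToDisk="true">
    <NC:externalApplication RDF:resource="urn:mimetype:externalApplication:application/x-csv"/>
  </RDF:Description>
'''
print(handler_mime_types(sample))  # ['application/x-csv']
```

To apply it to a real profile, read the text of the mimeTypes.rdf file and pass it to the function; the last element of the returned list corresponds to the entry nearest the end of the file.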

score 0 · Answer

You can do the scrolling with

self.driver.find_element_by_css_selector("html body.TVTableBody table#pageTable tbody tr td#cell4 table#MainTable tbody tr td#vScrollTD img[onmousedown='imgClick(this.sbar.visible,this,event);']").click()

It seems that once you can scroll, the scraping should be fairly standard, unless I am missing something.
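
As a rough illustration of how scrolling and scraping might be combined, the sketch below repeats "read the visible rows, then scroll" until a scroll reveals nothing new. It is an assumption-laden outline rather than tested code for this site: collect_rows, read_rows and scroll_once are hypothetical names, and rows are de-duplicated by content, so two genuinely identical rows would be merged.

```python
def collect_rows(read_rows, scroll_once, max_steps=1000):
    """Accumulate table rows across repeated scroll steps.

    read_rows:   hypothetical callable returning the currently visible
                 rows as a list of lists of strings
    scroll_once: hypothetical callable that performs one scroll step
    Stops when the visible rows contain nothing unseen, or after
    max_steps iterations.
    """
    seen = set()
    collected = []
    for _ in range(max_steps):
        new_rows = 0
        for row in read_rows():
            key = tuple(row)
            if key not in seen:
                seen.add(key)
                collected.append(row)
                new_rows += 1
        if new_rows == 0:
            break  # nothing new became visible, assume the end was reached
        scroll_once()
    return collected
```

With Selenium, scroll_once could wrap the click() call on the scroll image shown above, and read_rows could parse driver.page_source the same way the question's code does.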
