javascript - Screen Scraping from a web page with a lot of Javascript

Question

I have been asked to write an app which screen scrapes info from an intranet web page and presents certain info from it in a nice, easy-to-view format. The web page is a real mess and requires the user to click on half a dozen icons to discover whether an ordered item has arrived or has been receipted. As you can imagine, users find this irritating to say the least, and it would be nice to have an app anyone can use that lists the state of their orders on a single screen.

Yes, I know a better solution would be to rewrite the web app, but that would involve calling in the vendor and would cost us a small fortune.

Anyway, while looking into this I discovered that the web page I want to scrape is mostly JavaScript (although it doesn't use any AJAX techniques). Does anyone know if a library or program exists which I could feed with the JavaScript and which would then spit out the DOM for my app to parse?

I can write the app in pretty much any language, but my preference would be JavaFX, just so I could have a play with it.

Thanks for your time.

Ian

score 8 · Accepted Answer

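You could consider HtmlUnit. It is a Java library for automating browsing without controlling a real browser, and it integrates the Mozilla Rhino JavaScript engine to process the JavaScript on the pages it loads. There is also a JRuby wrapper called Celerity. Its JavaScript support is not perfect yet, but if your pages don't use many hacks, performance should be much better than controlling a browser. Furthermore, you don't have to worry about cookies persisting after your scraping is over, or about all the other nasty things that come with controlling a browser (history, autocomplete, temp files, etc.).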

score 5 · Accepted Answer

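Since you say that no AJAX is used, all the information is present in the HTML source. The JavaScript just renders it in response to user clicks. So you need to reverse engineer the way the application works, parse the HTML and JavaScript code, and extract the useful information. This is strictly a text-parsing job: you shouldn't be running the JavaScript and producing a new DOM, which would be much harder to do.

If AJAX were used, your job would be easier. You could easily work out how the AJAX services work (probably by receiving JSON or XML) and extract the information.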
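For the non-AJAX case, the text-parsing idea might look roughly like the sketch below. Everything about the page is an assumption made for illustration: the variable name `orders`, its fields, and the fact that the data is written as a JSON-compatible array literal are hypothetical, and the real page will need its own pattern once you have read its source.

```python
import json
import re

# Hypothetical page source: the order data sits in an inline script as a
# JSON-compatible array literal assigned to a variable named `orders`.
SAMPLE_HTML = """
<html><body>
<script type="text/javascript">
var orders = [
  {"id": "PO-1001", "item": "Toner", "arrived": true, "receipted": false},
  {"id": "PO-1002", "item": "Paper", "arrived": false, "receipted": false}
];
function showOrder(i) { /* renders one order on click */ }
</script>
</body></html>
"""

# Capture the array literal between "var orders =" and the first "];".
ORDERS_PATTERN = re.compile(r"var\s+orders\s*=\s*(\[.*?\])\s*;", re.DOTALL)


def extract_orders(html):
    """Return the list of order dicts embedded in the page's inline script."""
    match = ORDERS_PATTERN.search(html)
    if match is None:
        raise ValueError("no 'var orders = [...]' assignment found")
    return json.loads(match.group(1))


def summarize(orders):
    """Build one status line per order for a single-screen overview."""
    lines = []
    for order in orders:
        if order["receipted"]:
            state = "receipted"
        elif order["arrived"]:
            state = "arrived"
        else:
            state = "pending"
        lines.append(f'{order["id"]} {order["item"]}: {state}')
    return lines


if __name__ == "__main__":
    for line in summarize(extract_orders(SAMPLE_HTML)):
        print(line)
```

The regular expression stops at the first `];`, so it would break on data that contains that sequence inside a string, and `json.loads` only works if the literal is valid JSON (double-quoted keys, no trailing commas). If the page builds its data differently, the extraction step needs to be adapted.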

score 4 · Accepted Answer

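You could consider Greasemonkey. Greasemonkey is a very powerful Firefox add-on that lets you run your own scripts alongside those of specific websites. This allows you to modify how a site is displayed and to add or remove content. You can even use it to do AJAX-style lookups and add dynamic content.

If your tool is for in-house use and the users are all happy to use Firefox, then this could be a winner.

Regards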

score 2 · Accepted Answer

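I suggest the IRobotSoft web scraper. It is free, dedicated screen-scraping software with the best JavaScript support. You can create and test a robot using its visual interface. You can also embed it into your own application using its ActiveX control and hide the browser window.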

score 1 · Accepted Answer

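I agree with kgiannakakis' answer. I'd be surprised if you couldn't reverse engineer the JavaScript to identify where the information comes from and then write some simple Python scripts using the urllib2 and Beautiful Soup libraries to scrape the same information.

If Python and scraping are new to you, there are some excellent tutorials on how to get started.

[Edit] It looks like there is a Python version of mechanize too. Time to rewrite some of the scrapers I developed a while back! :-)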
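As a rough starting point for the Python route, here is a minimal sketch. It is not the urllib2 and Beautiful Soup code itself: it uses `urllib.request` (the Python 3 successor to urllib2) and swaps in the standard library's `html.parser` for Beautiful Soup so that it runs without extra installs. It also assumes, hypothetically, that the order information ends up in ordinary HTML table rows.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TableCellParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows, each a list of cell strings
        self._row = None   # cells of the row currently being read
        self._cell = None  # text fragments of the cell currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None  # drop any cell left open by malformed markup


def parse_rows(html):
    """Return all table rows in the HTML as lists of stripped cell text."""
    parser = TableCellParser()
    parser.feed(html)
    parser.close()
    return parser.rows


def fetch_rows(url):
    """Download a page and return its table rows (the URL is supplied by you)."""
    with urlopen(url) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return parse_rows(response.read().decode(charset, errors="replace"))
```

Calling `parse_rows` on a saved copy of the page is a convenient way to experiment before pointing `fetch_rows` at the intranet. An intranet site that requires a login would also need cookie or authentication handling, which this sketch leaves out.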

score 1 · Accepted Answer

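I created a project, site2archive, which uses PhantomJS to render pages, including their JS content, and wget to scrape them. PhantomJS is based on WebKit, which provides a browsing environment similar to Safari and Google Chrome.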

score 1 · Accepted Answer

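I would go with Perl's Win32::IE::Mechanize, which lets you automate Internet Explorer. You should be able to click on the icons and extract the text while letting MSIE take care of the annoying task of processing all the JS.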
