
I'm trying to parse a series of web pages one after another with PHP, but when I fopen the first page, the links to the following pages are hidden in JavaScript.

Is there any way I can continue on to parse the next pages? If the URL had a variable like "page=2" embedded in it I would step through the pages that way, but the URLs are encrypted.

-LPG


2 Answers

1

Basically, you have two options:

  1. Mimic their logic
  2. Simulate a real client

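If you want to go with #1, you have to read their JavaScript code and figure out how it works. I can't really explain it better than that, because it depends heavily on their code. You just need to know JavaScript and understand what their code does. Then have your code perform the same logic to generate the "next page" URLs.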
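To make option #1 concrete, here is a minimal sketch of porting a site's URL-building logic. The JavaScript function quoted in the comment, the `/list/` path, and the name `next_page_url` are all hypothetical; the real logic depends entirely on the site you are reading. It is written in Python for brevity, and the same steps map onto PHP's `base64_encode`.

```python
import base64

# Hypothetical site logic, as it might appear in their JavaScript:
#   function nextUrl(page) { return "/list/" + btoa("page=" + page); }
# The port below repeats the same steps so the scraper can build URLs itself.


def next_page_url(page: int) -> str:
    """Rebuild the 'next page' URL the way the (assumed) site script does."""
    token = base64.b64encode(f"page={page}".encode("ascii")).decode("ascii")
    return "/list/" + token


if __name__ == "__main__":
    for page in range(1, 4):
        print(next_page_url(page))
```

Once the port produces the same URLs you see in the browser, you can loop over page numbers just as you would with a plain "page=2" parameter.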

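If their system uses AJAX, you can still mimic it (contrary to what click-upvote said). To do so, use a tool like the Firebug extension for Firefox, so you can watch "behind the scenes" what the browser sends to their server. Then have your code send a fake HTTP request that mimics their AJAX request. You can actually do this even without a tool like Firebug: just infer what your browser would send by reading the JavaScript code. Using something like Firebug makes it much easier, though, because instead of inferring you can see exactly what is being sent.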
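A rough sketch of replaying an AJAX call, assuming the page requests its next batch with a JSON POST. The endpoint `https://example.com/ajax/list`, the `page` parameter, and the header values are placeholders; copy the real ones from the request you observe in the browser. It uses Python's standard library, and the same request can be built in PHP with cURL.

```python
import json
import urllib.request

# Hypothetical endpoint: replace it with the URL seen in the network inspector.
AJAX_URL = "https://example.com/ajax/list"


def build_ajax_request(page: int) -> urllib.request.Request:
    """Build a POST request that imitates the page's background call."""
    body = json.dumps({"page": page}).encode("utf-8")  # assumed payload shape
    return urllib.request.Request(
        AJAX_URL,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Many JS libraries add this header to AJAX calls; some servers check it.
            "X-Requested-With": "XMLHttpRequest",
            "User-Agent": "Mozilla/5.0",
        },
        method="POST",
    )


def fetch_page(page: int) -> str:
    """Send the imitated request and return the response body as text."""
    with urllib.request.urlopen(build_ajax_request(page), timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Keeping the request construction separate from the network call makes it easy to compare your headers and body against what the browser sends before you hit the server.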

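If you want to go with #2 instead, you need to either use an actual browser (and control it programmatically with something like Selenium) or use something like Rhino to run the JavaScript. Using an actual browser with a control system like Selenium is probably the easiest approach. It will be slow, though, because it is bound by the time the browser takes to render pages and so on. Using Rhino or a similar solution would be faster, but it also involves more work (you have to parse the HTML, include all the relevant JS files, etc.), so I'd recommend it only as a last resort.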

answered 2009-03-10T23:49:51.297
0

One way would be to write a regular expression that parses out the JavaScript links and follows them. This would probably only work if the URL of the page appears literally in the JavaScript code, e.g.:

<a href="javascript:open('something/some_page.html');">Something</a>

instead of just

<a href="javascript:open(someField.value);">Something</a>

With the second example, you would actually have to evaluate the JavaScript from PHP to find out where the link goes, which can be very challenging.
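As an illustration of the regex approach, here is a small sketch that pulls literal URLs out of `javascript:open('...')` links like the first example and skips links like the second. The function name `extract_js_links` and the base URL are made up for the example; in PHP the same pattern would go through `preg_match_all`.

```python
import re
from urllib.parse import urljoin

# Matches href="javascript:open('...')" where the argument is a quoted literal.
# A non-literal argument such as someField.value does not match.
JS_LINK = re.compile(
    r"""href\s*=\s*["']javascript:open\(\s*(['"])(?P<url>.*?)\1\s*\)""",
    re.IGNORECASE,
)


def extract_js_links(html: str, base_url: str) -> list[str]:
    """Return absolute URLs for every literal javascript:open() link."""
    return [urljoin(base_url, m.group("url")) for m in JS_LINK.finditer(html)]


if __name__ == "__main__":
    sample = (
        "<a href=\"javascript:open('something/some_page.html');\">Something</a>\n"
        "<a href=\"javascript:open(someField.value);\">Something</a>"
    )
    print(extract_js_links(sample, "http://example.com/news/index.html"))
```

Only the first link is returned, because the second one has no URL in the markup to extract.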

Keep in mind also that you would have to write site-specific regular expressions, because each site formats its URLs differently. For example, cnn.com might format its URLs differently than reddit.com.
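One possible way to organize site-specific expressions is a lookup table keyed by hostname. The hosts and patterns below are invented placeholders (they do not describe any real site's markup), and `links_for` is a hypothetical helper name.

```python
import re
from urllib.parse import urlparse

# Hypothetical per-site patterns; each one must define a group named "url".
SITE_PATTERNS = {
    "example.com": re.compile(r"javascript:open\('(?P<url>[^']+)'\)"),
    "example.org": re.compile(r"goTo\('(?P<url>[^']+)'\)"),
}


def links_for(page_url: str, html: str) -> list[str]:
    """Pick the pattern for the page's host and return the URLs it captures."""
    host = (urlparse(page_url).hostname or "").removeprefix("www.")
    pattern = SITE_PATTERNS.get(host)
    if pattern is None:
        return []  # no pattern written for this site yet
    return [m.group("url") for m in pattern.finditer(html)]
```

Adding support for a new site then means adding one entry to the table rather than touching the crawling code.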

answered 2009-03-10T02:04:09.267