
I am trying to come up with a strategy for detecting the main content div of a site. By "main content div" I mean the div that contains the site's header, body, and footer.

Detecting it is difficult and slow.

For instance, on http://www.goo.ne.jp/, I would detect id="bodyWrapper" or "minWidthInbox" because these divs contain the main content on the site.

I have tried many algorithms, but because sites have odd and inconsistent structures, no single algorithm works for all of them.

Table layout is especially hard to detect. :-(

How should I approach this problem?


1 Answer


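You should take a look at Readability (http://www.readability.com/). They developed an algorithm that extracts the content of a web page and removes everything else, such as headers, footers, and ads.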

Unfortunately, their algorithm is no longer public. They do have an API here: http://www.readability.com/developers/api.

There are also several implementations of their original algorithm. I have used a Python library and a NodeJS library (https://github.com/arrix/node-readability), and both work well.
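To give a rough idea of how this kind of extractor can work, here is a minimal Python sketch. It is not Readability's algorithm, only a simplified text-density heuristic that I am assuming for illustration: each div is credited with the text that sits directly inside it (not inside a nested div), link text is subtracted, and the div with the most remaining text wins. The names `DivScorer` and `best_content_div` are made up for this example.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code, not readable content.
SKIP_TAGS = {"script", "style"}


class DivScorer(HTMLParser):
    """Hypothetical helper: counts text and link text per <div>."""

    def __init__(self):
        super().__init__()
        self.stack = []      # open elements as (tag, div_record or None)
        self.divs = []       # one record per <div>, in document order
        self.link_depth = 0  # > 0 while inside an <a>
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        record = None
        if tag == "div":
            record = {"id": dict(attrs).get("id"), "text": 0, "link_text": 0}
            self.divs.append(record)
        if tag == "a":
            self.link_depth += 1
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        self.stack.append((tag, record))

    def handle_endtag(self, tag):
        # Ignore stray closing tags; otherwise pop back to the matching open tag.
        if not any(open_tag == tag for open_tag, _ in self.stack):
            return
        while self.stack:
            open_tag, _ = self.stack.pop()
            if open_tag == "a":
                self.link_depth -= 1
            if open_tag in SKIP_TAGS:
                self.skip_depth -= 1
            if open_tag == tag:
                break

    def handle_data(self, data):
        length = len(data.strip())
        if self.skip_depth or not length:
            return
        # Credit the text to the innermost enclosing <div> only.
        for _, record in reversed(self.stack):
            if record is not None:
                record["text"] += length
                if self.link_depth:
                    record["link_text"] += length
                break


def best_content_div(html):
    """Return the record of the div with the most non-link text, or None."""
    scorer = DivScorer()
    scorer.feed(html)
    scorer.close()
    candidates = [d for d in scorer.divs if d["text"] > 0]
    if not candidates:
        return None
    return max(candidates, key=lambda d: d["text"] - d["link_text"])
```

Calling `best_content_div(page_html)` returns a small dict with the winning div's `id` and its character counts. Navigation bars and footers tend to lose because they are short or consist mostly of links. Real pages need much more care (tables, nested wrappers, malformed markup), which is why I would reach for one of the existing libraries first.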

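Regarding your question about the main div: unless you are scraping one specific website, I would not recommend searching for such a specific piece of markup. It seems to me that what you are really after is the content, and a site's HTML can of course contain almost anything, not just a main div.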

Answered 2012-10-27T08:29:42.213