javascript - Need to know main DIV of the page

Question

I am trying to come up with a strategy to detect the main content DIV of a site. By main content div I mean the div that contains the header, body, and footer of the site.

Detecting it is a very difficult and slow process.

For instance, on http://www.goo.ne.jp/, I would detect id="bodyWrapper" or "minWidthInbox" because these divs contain the main content on the site.

I have tried many algorithms for this, but because sites have weird and inconsistent structures, no single algorithm works on all of them.

Table layout is especially hard to detect. :-(

How should I approach this problem?

score 3 · Accepted Answer

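You should take a look at Readability (http://www.readability.com/). They developed an algorithm that extracts the content of a web page and removes everything else, such as headers, footers, and ads.

Unfortunately, their algorithm is no longer public. They do have an API here: http://www.readability.com/developers/api.

There are also several implementations of their original algorithm. I have used a Python library and a NodeJS library (https://github.com/arrix/node-readability), and they work quite well.

As for your question about the main div: unless you are scraping one specific website, I would not recommend searching for such a specific piece of markup. It seems to me that what you are really after is the content, and a site's HTML can contain almost anything, not just a main div.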
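
If you still want a generic starting point, here is a minimal sketch of a text-share heuristic. It is not Readability's algorithm, just an illustration of the idea of following the content instead of looking for a particular id. The names `findMainContainer`, `textLength`, and the `minShare` threshold are made up for this example. It only relies on `tagName`, `children`, and `textContent`, so it does not care whether the layout uses divs or tables.

```javascript
// Hypothetical heuristic, not Readability's actual algorithm.
// Walk down from a root element, always following the child that holds
// most of the text, and stop once the text is split across several children.

const SKIP_TAGS = new Set(["SCRIPT", "STYLE", "NOSCRIPT"]);

// Length of the text inside a node, ignoring surrounding whitespace.
function textLength(node) {
  return (node.textContent || "").trim().length;
}

// root:     a DOM element (or any object with tagName/children/textContent)
// minShare: fraction of the current element's text that one child must hold
//           for the walk to continue into it (an assumed, tunable threshold)
function findMainContainer(root, minShare = 0.9) {
  let current = root;
  for (;;) {
    const total = textLength(current);
    if (total === 0) return current;

    // Find the non-script child with the most text.
    let best = null;
    for (const child of Array.from(current.children || [])) {
      if (SKIP_TAGS.has(child.tagName)) continue;
      if (best === null || textLength(child) > textLength(best)) best = child;
    }

    // Stop when there is no child, or no single child dominates the text.
    if (best === null || textLength(best) / total < minShare) return current;
    current = best;
  }
}
```

In a browser you would call it as `findMainContainer(document.body)`. With a high `minShare` it tends to stop at the outer wrapper (the element whose children are the header, body, and footer); with a lower value such as `0.6` it keeps descending toward the content block itself. Expect to tune the threshold per site, and expect it to fail on pages where the text is spread evenly across many columns.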
