
There are standard ways, such as using the DOM, to selectively parse an HTML page, but I wonder how crawlers (from small to large) detect where the main text to be analyzed is.

The body text that should be analyzed for its keywords is mixed in with menus, sidebars, footers, and so on. How does a crawler know to skip the keywords that come from the menus and side sections?

I'm working on a small PHP project to extract keywords from various HTML pages, but I don't know how to avoid picking up keywords from the side content. Can anyone describe, or at least give me a hint about, how to distinguish the main content of an HTML page from the other sections?


2 Answers


Sidebars, menus, and footers are usually repeated on every page of a site, while the actual content of each page is normally unique. You can use that as a guide for telling the real content apart.
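As a rough sketch of that idea (the data and helper here are hypothetical, and pages are assumed to have already been split into text blocks, e.g. one per top-level DOM element): a block that appears on most pages of a site is probably boilerplate, and a block unique to one page is probably content.

```python
# Sketch: drop text blocks that repeat across pages of the same site.
# Assumes each crawled page was already split into text blocks.
from collections import Counter

def unique_blocks(pages):
    """Return, per page, only the blocks NOT shared by most other pages."""
    counts = Counter(block for page in pages for block in set(page))
    threshold = len(pages) / 2  # call it boilerplate if on > half the pages
    return [[b for b in page if counts[b] <= threshold] for page in pages]

# Three crawled pages: the menu and footer repeat, article text is unique.
pages = [
    ["Home | About | Contact", "Article A body text", "(c) 2012 Example"],
    ["Home | About | Contact", "Article B body text", "(c) 2012 Example"],
    ["Home | About | Contact", "Article C body text", "(c) 2012 Example"],
]
print(unique_blocks(pages)[0])  # ['Article A body text']
```

In practice you would hash normalized DOM subtrees rather than compare raw strings, so that blocks with small per-page variations (e.g. a highlighted menu item) still match.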

Crawlers also use sophisticated algorithms to analyze the text on a page and determine its weight as content, and they tend not to share their secrets.

There is no quick and easy way: crawler developers have to come up with their own innovative methods and use them in combination to build an overall picture of a page's content.

answered 2012-05-12T22:14:56.813

If the content is textual, you can assume that the main content of the page is wherever the word density is relatively high.

In other words, the main content of the page -- the part relevant to search engines -- sits inside DOM elements, mostly divs, where the number of literals (together with text-formatting tags like p, em, b, etc.) is higher, or above some threshold.

I would start off with the following logic:

1. Get all the tags used in the web page.

2. Note down the DOM elements whose content consists only of literals, formatting tags like p, em, b, li, and ul, and anchor tags.

3. Skip divs containing only anchor tags, assuming they exist for navigation.

4. Out of the remaining elements, choose the ones whose literal count is above a particular threshold.

This threshold varies from website to website; you could take it as the average number of literals found in the div with the most literals, measured across all pages of the site that share a particular URL structure.

The algorithm has to learn as it runs.
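The steps above can be sketched with only the standard library (the sample HTML and the cutoff values 20 and 0.5 are illustrative assumptions, not tuned numbers): score each div by how much plain text it holds and what share of that text sits inside anchor tags, then keep divs with lots of text and few links.

```python
# Sketch of the density heuristic: per <div>, count characters of text
# and characters of text inside <a> tags, then filter by both numbers.
from html.parser import HTMLParser

class DensityParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.stack = []    # one [text_len, link_text_len] frame per open div
        self.in_link = 0
        self.scores = []   # (text_len, link_ratio) for each closed div

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append([0, 0])
        elif tag == "a":
            self.in_link += 1

    def handle_endtag(self, tag):
        if tag == "a" and self.in_link:
            self.in_link -= 1
        elif tag == "div" and self.stack:
            text, link = self.stack.pop()
            self.scores.append((text, link / text if text else 1.0))

    def handle_data(self, data):
        n = len(data.strip())
        for frame in self.stack:       # text counts toward every enclosing div
            frame[0] += n
            if self.in_link:
                frame[1] += n

html = """<div><a href="/">Home</a> <a href="/about">About</a></div>
<div><p>The article body, with <b>plenty</b> of real text.</p></div>"""
p = DensityParser()
p.feed(html)
# Keep divs with enough text and a low share of it inside links;
# the nav div (all link text) is dropped, the article div survives.
content = [s for s in p.scores if s[0] > 20 and s[1] < 0.5]
print(content)
```

A learning version of this would adjust the two cutoffs per site, e.g. by recomputing them from the distribution of scores over pages already crawled.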

answered 2012-05-15T17:40:17.070