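I want to parse an HTML page selectively, but how do crawlers (from small ones to large ones) detect where the main text to analyze is?
I am developing a small PHP project that captures keywords from various HTML pages, but I don't know how to avoid capturing keywords from side content. Can anyone describe, or at least give me a hint on, how to distinguish the main content of an HTML page from everything else?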
If the content is textual, you can assume that the main content of the page is where the word density is relatively high.
In other words, the content that matters to search engines sits inside DOM elements, mostly divs, in which the amount of literal text (including text wrapped in formatting tags such as p, em and b) is above a threshold.
I would start with the following logic:
Get all the tags used in the web page.
Note down the DOM elements whose content consists only of literal text and formatting tags such as p, em, b, li and ul, plus anchor tags.
Leave out divs that contain only anchor tags, on the assumption that they are there for navigation.
Out of the remaining elements, choose the ones whose literal count is above a particular threshold.
The threshold varies from website to website. One way to set it: for each page of a site that shares a particular URL structure, take the literal count of the div with the most literals, then average those counts across the pages.
The algorithm has to learn this value as it runs.
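The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not a tested crawler component. It is written in Python with only the standard library; the same walk could be done in PHP with DOMDocument. The names DivTextCounter, candidate_blocks, main_blocks and learn_threshold are made up for this example. "Literals" are counted as whitespace-separated words, and text is attributed to the innermost enclosing div.

```python
from html.parser import HTMLParser
from statistics import mean


class DivTextCounter(HTMLParser):
    """Collects words per <div>, attributing text to the innermost open div."""

    def __init__(self):
        super().__init__()
        self.stack = []        # currently open divs, innermost last
        self.blocks = []       # finished divs, in closing order
        self.anchor_depth = 0  # > 0 while inside <a>
        self.skip_depth = 0    # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "div":
            self.stack.append(
                {"id": dict(attrs).get("id"), "words": [], "link_words": 0}
            )
        elif tag == "a":
            self.anchor_depth += 1
        elif tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            self.blocks.append(self.stack.pop())
        elif tag == "a" and self.anchor_depth:
            self.anchor_depth -= 1
        elif tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Ignore script/style bodies and text that is outside every div.
        if self.skip_depth or not self.stack:
            return
        words = data.split()
        block = self.stack[-1]
        block["words"].extend(words)
        if self.anchor_depth:
            block["link_words"] += len(words)


def candidate_blocks(html):
    """Divs that contain text and are not made up of link text only."""
    parser = DivTextCounter()
    parser.feed(html)
    parser.close()
    return [
        b for b in parser.blocks
        if b["words"] and b["link_words"] < len(b["words"])
    ]


def main_blocks(html, threshold):
    """Candidate divs whose word count reaches the threshold."""
    return [b for b in candidate_blocks(html) if len(b["words"]) >= threshold]


def learn_threshold(pages):
    """Average, over pages, of the largest candidate word count on each page."""
    peaks = [
        max((len(b["words"]) for b in candidate_blocks(p)), default=0)
        for p in pages
    ]
    return mean(peaks) if peaks else 0
```

Usage would be something like threshold = learn_threshold(pages_of_one_site) followed by main_blocks(page, threshold), with keywords then taken from each returned block's "words". Note that an average of per-page maxima will reject the main block on any page that is shorter than average, so scaling the learned value down by some factor may be needed; that factor is a guess to tune, not something the logic above prescribes.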