For a while now, I have been trying to find a way to intelligently extract the "relevant" text from a URL by eliminating ad-related text and all the other clutter. After several months of research, I gave it up as a problem that cannot be solved accurately. (I tried different approaches, but none of them were reliable.)
A week ago, I stumbled across Readability, a plugin that converts any URL into readable text. It looks pretty accurate to me. My guess is that they somehow have an algorithm smart enough to extract the relevant text.
Does anyone know how they do it? Or how I could do it reliably?
Readability mainly consists of heuristics that "just somehow work well" in many cases.
I have written some research papers about this topic and I would like to explain the background of why it is easy to come up with a solution that works well and when it gets hard to get close to 100% accuracy.
There seems to be a linguistic law underlying human language that is also (but not exclusively) manifest in Web page content, and it already quite clearly separates two types of text (full text vs. non-full text or, roughly, "main content" vs. "boilerplate").
To get the main content from HTML, it is in many cases sufficient to keep only the HTML text elements (i.e. blocks of text that are not interrupted by markup) which have more than about 10 words. It appears that humans choose from two types of text ("short" and "long", measured by the number of words they emit) for two different motivations of writing text. I would call them "navigational" and "informational" motivations.
If an author wants you to quickly get what is written, he/she uses "navigational" text, i.e., few words (like "STOP", "Read this", "Click here"). This is the most prominent type of text in navigational elements (menus, etc.).
If an author wants you to deeply understand what he/she means, he/she uses many words. This way, ambiguity is removed at the cost of an increase in redundancy. Article-like content usually falls into this class as it has more than only a few words.
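A minimal sketch of that word-count heuristic (an illustration only, not the actual boilerpipe or Readability code; the ~10-word threshold and the choice of block elements are assumptions):

// Sketch: keep only uninterrupted text blocks with more than ~10 words.
// The threshold and element selection are illustrative assumptions.
function extractLongTextBlocks(doc, minWords = 10) {
    const blocks = [];
    for (const el of doc.querySelectorAll('p, div, td, li')) {
        // Only look at leaf blocks, so each run of text is counted once.
        if (el.querySelector('p, div, td, li')) continue;
        const text = el.textContent.trim();
        const words = text.split(/\s+/).filter(Boolean).length;
        // "Informational" text passes; short "navigational" text is dropped.
        if (words > minWords) blocks.push(text);
    }
    return blocks.join('\n\n');
}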
While this separation seems to work in a plethora of cases, it gets tricky with headlines, short sentences, disclaimers, copyright footers, etc.
There are more sophisticated strategies and features that help separate main content from boilerplate: for example, the link density (the number of linked words in a block versus the total number of words in the block), the features of the previous/next blocks, the frequency of a particular block's text on the "whole" Web, the DOM structure of the HTML document, the visual rendering of the page, and so on.
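As an illustration, the link-density feature could be computed like this (a sketch only; the 0.33 cut-off below is an assumed value, not a number taken from any of these implementations):

// Sketch: link density = linked words / all words in a block.
// High link density usually indicates boilerplate (menus, tag clouds).
function linkDensity(block) {
    const countWords = (s) => s.trim().split(/\s+/).filter(Boolean).length;
    const total = countWords(block.textContent);
    if (total === 0) return 0;
    let linked = 0;
    for (const a of block.querySelectorAll('a')) {
        linked += countWords(a.textContent);
    }
    return linked / total;
}
// The 0.33 cut-off is an assumed value for illustration.
const isLikelyBoilerplate = (block) => linkDensity(block) > 0.33;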
You can read my latest article "Boilerplate Detection using Shallow Text Features" to get some insight from a theoretical perspective. You may also watch the video of my paper presentation on VideoLectures.net.
"Readability" uses some of these features. If you carefully watch the SVN changelog, you will see that the number of strategies varied over time, and so did Readability's extraction quality. For example, the introduction of link density in December 2009 helped improve it considerably.
In my opinion, it therefore makes no sense to say "Readability does it like that" without mentioning the exact version number.
I have published an Open Source HTML content extraction library called boilerpipe, which provides several different extraction strategies. Depending on the use case, one or the other extractor works better. You can try these extractors on pages of your choice using the companion boilerpipe-web app on Google AppEngine.
To let numbers speak, see the "Benchmarks" page on the boilerpipe wiki which compares some extraction strategies, including boilerpipe, Readability and Apple Safari.
I should mention that these algorithms assume that the main content is actually full text. There are cases where the "main content" is something else, e.g. an image, a table, a video etc. The algorithms won't work well for such cases.
Cheers,
Christian
Readability is a JavaScript bookmarklet, i.e., client-side code that manipulates the DOM. Look at the JavaScript and you should be able to see what is going on.
Readability's workflow and code:
/*
* 1. Prep the document by removing script tags, css, etc.
* 2. Build readability's DOM tree.
* 3. Grab the article content from the current dom tree.
* 4. Replace the current DOM tree with the new one.
* 5. Read peacefully.
*/
javascript: (function () {
    readConvertLinksToFootnotes = false;
    readStyle = 'style-newspaper';
    readSize = 'size-medium';
    readMargin = 'margin-wide';
    // Load the main readability.js script (the random query string busts the cache).
    _readability_script = document.createElement('script');
    _readability_script.type = 'text/javascript';
    _readability_script.src = 'http://lab.arc90.com/experiments/readability/js/readability.js?x=' + (Math.random());
    document.documentElement.appendChild(_readability_script);
    // Attach the screen stylesheet.
    _readability_css = document.createElement('link');
    _readability_css.rel = 'stylesheet';
    _readability_css.href = 'http://lab.arc90.com/experiments/readability/css/readability.css';
    _readability_css.type = 'text/css';
    _readability_css.media = 'all';
    document.documentElement.appendChild(_readability_css);
    // Attach the print stylesheet.
    _readability_print_css = document.createElement('link');
    _readability_print_css.rel = 'stylesheet';
    _readability_print_css.href = 'http://lab.arc90.com/experiments/readability/css/readability-print.css';
    _readability_print_css.media = 'print';
    _readability_print_css.type = 'text/css';
    document.getElementsByTagName('head')[0].appendChild(_readability_print_css);
})();
If you follow the JS and CSS files pulled in by the code above, you will get the full picture:
http://lab.arc90.com/experiments/readability/js/readability.js (well commented and an interesting read)
http://lab.arc90.com/experiments/readability/css/readability.css
Of course, there is no 100% reliable way to do this. You can take a look at the Readability source code here.
Basically, what they do is try to identify positive and negative blocks of text. Positive identifiers (i.e., div IDs) are things like: article, body, content, blog, story.
Negative identifiers would be: comment, discuss.
Then they have unlikely candidates. What they do is determine what is most likely to be the main content of the site; see line 678 in the Readability source. This is done mostly by analyzing the length of paragraphs, their identifiers (see above), the DOM tree (i.e., whether a paragraph is a last child node), stripping out everything unnecessary, removing formatting, and so on.
The code is 1792 lines. It does not appear to be a trivial problem, so perhaps you can draw some inspiration from there.
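To make that concrete, a hypothetical scoring pass over candidate containers might look like the sketch below; the regexes and weights are illustrative only, and the real Readability regexes and scoring are considerably more elaborate (see its source):

// Hypothetical sketch of the positive/negative identifier scoring idea.
// The regexes and weights are illustrative assumptions, not Readability's.
const POSITIVE = /article|body|content|blog|story/i;
const NEGATIVE = /comment|discuss|footer|sidebar|sponsor/i;

function scoreBlock(el) {
    let score = 0;
    const hint = el.id + ' ' + el.className;
    if (POSITIVE.test(hint)) score += 25;  // likely main content
    if (NEGATIVE.test(hint)) score -= 25;  // likely boilerplate
    // Longer paragraphs raise the score, as described above.
    score += Math.min(el.textContent.split(/\s+/).filter(Boolean).length / 10, 3);
    return score;
}

// Pick the highest-scoring candidate container as the main content.
function mainContent(doc) {
    let best = null, bestScore = -Infinity;
    for (const el of doc.querySelectorAll('div, article, section')) {
        const s = scoreBlock(el);
        if (s > bestScore) { bestScore = s; best = el; }
    }
    return best;
}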
Interesting. I developed a similar PHP script. It basically scans an article and attaches part-of-speech tags to all of the text (Brill tagger). Grammatically invalid sentences are eliminated immediately. Then sudden shifts in pronouns or past tense indicate that the article is over, or has not started yet. Repeated phrases are searched for and removed, such as "Yahoo news sports finance" appearing ten times on the page (a sketch of this step follows below). You can also gather statistics on tone using countless word banks relating to various emotions. Sudden changes in tone, from active/negative/financial to passive/positive/political, indicate a boundary. It is really endless, however deep you want to dig.
The main problems are links, embedded anomalies, scripting styles, and updates.
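A rough sketch of the repeated-phrase step described above, in JavaScript rather than PHP (the n-gram size and repetition threshold are assumptions):

// Sketch: drop lines dominated by n-grams that repeat across the page,
// e.g. "Yahoo news sports finance" occurring ten times. The n-gram size
// and repetition threshold are illustrative assumptions.
function removeRepeatedPhrases(lines, n = 4, maxRepeats = 3) {
    const ngrams = (line) => {
        const words = line.toLowerCase().split(/\s+/).filter(Boolean);
        const out = [];
        for (let i = 0; i + n <= words.length; i++) {
            out.push(words.slice(i, i + n).join(' '));
        }
        return out;
    };
    // First pass: count each n-gram across the whole page.
    const counts = new Map();
    for (const line of lines) {
        for (const g of ngrams(line)) counts.set(g, (counts.get(g) || 0) + 1);
    }
    // Second pass: keep a line only if most of its n-grams are not repeated.
    return lines.filter((line) => {
        const grams = ngrams(line);
        if (grams.length === 0) return true;
        const repeated = grams.filter((g) => counts.get(g) > maxRepeats).length;
        return repeated / grams.length < 0.5;
    });
}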