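For the past few days I have been working on a project that includes a task I don't actually know how to do: analyzing web pages to find the tags that characterize each page.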

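What do I mean by tags? I mean keywords that summarize the content of a web page. For example, on this site you write your own tags so that people can find your question more easily. What I'm talking about is building an algorithm that analyzes a web page and finds its tags from the text on the page.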

I started by extracting the text from the page -> done

In general, I'm looking for a way to find keywords that summarize the content of a web page.

However, I really don't know what to do next. Does anyone have a suggestion?

4 Answers

For a really basic approach, you could use the TF-IDF algorithm to find the most important words in your page.

A quick overview from Wikipedia:

The tf–idf weight (term frequency–inverse document frequency) is a weight often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus. Variations of the tf–idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query. tf–idf can be successfully used for stop-words filtering in various subject fields including text summarization and classification

Once you have found the most important words in your page, you can use them as tags.
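
As a rough illustration of the idea, here is a minimal TF-IDF sketch using only the standard library. The function names, the tiny `corpus`, and the sample `page` text are all made up for this example, and the weighting used (raw term frequency times `log(N / df)`) is just one common variant, not the only definition.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_keywords(page_text, corpus_texts, top_n=3):
    """Return the top_n (word, score) pairs for page_text.

    corpus_texts is a hypothetical reference collection; page_text is
    counted as one more document of that collection.
    """
    documents = [tokenize(t) for t in corpus_texts] + [tokenize(page_text)]
    page_tokens = documents[-1]
    if not page_tokens:
        return []
    n_docs = len(documents)

    # Document frequency: in how many documents does each word occur?
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = {}
    for word, count in Counter(page_tokens).items():
        tf = count / len(page_tokens)
        idf = math.log(n_docs / doc_freq[word])
        scores[word] = tf * idf

    # Highest score first; ties are broken alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]


# Hypothetical sample data, only for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the weather is nice today",
]
page = "the python tutorial explains python lists and python dictionaries"
print(tfidf_keywords(page, corpus, top_n=3))
```

Words that occur in every document (like "the" here) get a score of zero, which is why TF-IDF also works as a crude stop-word filter.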


If you want to improve your tags and make them more relevant, there are many ways to proceed. One option is the following:

  • Collect a set of texts for which you already know the main tags.
  • Run TF-IDF on each of these texts and build a vector from the terms with the highest scores.
  • Try to find a main direction across all these vectors (by running a PCA, for example, or any other machine learning tool).
  • Use the known tag to represent the set of words along that main direction (the PCA component with the largest variance).
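
To make the "main direction" step a bit more concrete, here is a small sketch that approximates the first principal component of a few TF-IDF vectors with power iteration, in plain Python. The vocabulary and the vectors are hand-made, hypothetical values, and in practice you would more likely use a library PCA; this is only meant to show what the step computes.

```python
import math


def first_principal_direction(vectors, iterations=200):
    """Approximate the first principal component with power iteration."""
    n_rows = len(vectors)
    dim = len(vectors[0])

    # Center each column so the direction describes variance, not the mean.
    means = [sum(row[j] for row in vectors) / n_rows for j in range(dim)]
    centered = [[row[j] - means[j] for j in range(dim)] for row in vectors]

    # Non-uniform start vector; this assumes it is not orthogonal
    # to the leading direction.
    direction = [1.0 / (j + 1) for j in range(dim)]
    for _ in range(iterations):
        # Apply X^T X to the direction without building the matrix.
        projections = [
            sum(row[j] * direction[j] for j in range(dim)) for row in centered
        ]
        updated = [
            sum(centered[i][j] * projections[i] for i in range(n_rows))
            for j in range(dim)
        ]
        norm = math.sqrt(sum(x * x for x in updated))
        if norm == 0.0:
            break  # all vectors identical: there is no direction of variance
        direction = [x / norm for x in updated]
    return direction


# Hypothetical vocabulary and hand-made TF-IDF vectors for four texts.
vocabulary = ["python", "code", "cat", "dog"]
tfidf_vectors = [
    [0.9, 0.8, 0.0, 0.0],
    [0.8, 0.9, 0.1, 0.0],
    [0.0, 0.1, 0.9, 0.8],
    [0.1, 0.0, 0.8, 0.9],
]

direction = first_principal_direction(tfidf_vectors)
for word, weight in sorted(zip(vocabulary, direction), key=lambda p: -abs(p[1])):
    print(f"{word}: {weight:+.3f}")
```

In this toy data the direction puts "python" and "code" on one side and "cat" and "dog" on the other, so words with the same sign form a group that a known tag could be attached to.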

Hope that's understandable and that it helps.

answered 2011-10-20T16:28:56.903

You can implement a number of heuristics:

  • Acronyms and words in all uppercase
  • Words that are not frequent in general, i.e. discard words that appear in all or most documents and favour those that appear relatively frequently only in this one.
  • Sequences of words that always appear in the same order in this document and possibly in others as well
  • etc.
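
Two of these heuristics could be sketched roughly like this (the function names and the sample text are made up; the "sequence" heuristic is simplified to word pairs that repeat within one text, and it ignores sentence boundaries):

```python
import re
from collections import Counter


def find_acronyms(text):
    """Return the distinct all-uppercase words of two or more letters, sorted."""
    return sorted(set(re.findall(r"\b[A-Z]{2,}\b", text)))


def repeated_bigrams(text, min_count=2):
    """Return word pairs that occur at least min_count times, sorted."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(zip(words, words[1:]))
    return sorted(
        " ".join(pair) for pair, count in counts.items() if count >= min_count
    )


# Hypothetical sample text, only for illustration.
sample = (
    "The HTML parser reads each web page. "
    "A web page may embed CSS, and the HTML parser ignores it."
)
print(find_acronyms(sample))
print(repeated_bigrams(sample))
```

Pairs such as "the html" also show up, so in practice you would combine this with the frequency heuristic to drop common words.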
answered 2011-10-20T16:34:47.570

Typically, you look for specific words surrounded by specific HTML. For example, headings are usually in H tags such as <h1>.

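If you parse the page for all H1 tags, the content inside each of those tags is relevant. This very page is an example: it has an H1 tag around the question title. That gives Google a hint that the page is about "algorithm", "analyze", "web page", and so on.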
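
A minimal sketch of pulling out heading text with Python's built-in `html.parser` might look like this. The class name, the choice of `h1` to `h3`, and the sample HTML are assumptions for the example; real pages with malformed markup may need a more forgiving parser.

```python
from html.parser import HTMLParser


class HeadingExtractor(HTMLParser):
    """Collect the text found inside h1, h2 and h3 elements."""

    HEADING_TAGS = {"h1", "h2", "h3"}

    def __init__(self):
        super().__init__()
        self.headings = []        # list of (tag, text) pairs
        self._current_tag = None  # heading tag we are currently inside
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HEADING_TAGS and self._current_tag is None:
            self._current_tag = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current_tag is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current_tag:
            # Join the pieces and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.headings.append((tag, text))
            self._current_tag = None


def extract_headings(html):
    parser = HeadingExtractor()
    parser.feed(html)
    parser.close()
    return parser.headings


# Hypothetical sample page, only for illustration.
sample_html = (
    "<html><body>"
    "<h1>Algorithm to <em>analyze</em> a web page</h1>"
    "<p>Body text</p>"
    "<h2>Answers</h2>"
    "</body></html>"
)
print(extract_headings(sample_html))
```

The heading words can then be given extra weight when you rank candidate tags.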

The hard part is determining context.

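In our example here, the term "page" is very generic and could relate to anything, whereas "web page" is a bit more specific. You can handle this with an internal dictionary, built up over time from word frequencies after analyzing a large number of documents to find commonalities. When determining the top X "tags" for a given page, the frequency should provide a weighting value.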

answered 2011-10-20T16:31:01.433

This is more of an information retrieval and data mining problem. Reviewing some of Rao's lectures may help.

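When you crawl web pages, you are essentially trying to build an index. You can do this by building a global word-frequency dictionary in which each word in the language (usually stemmed, to account for plurals and other variations) is stored as a key and the number of times it appears in the documents is stored as the value.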
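
A rough sketch of such a dictionary is below. The `normalize` helper is a deliberately crude stand-in for a real stemmer (it only strips some plural endings, and it will mangle words like "this"), and the sample documents are made up.

```python
import re
from collections import Counter


def normalize(word):
    """Very crude plural stripping; a real system would use a proper stemmer."""
    word = word.lower()
    if len(word) > 4 and word.endswith("ies"):
        return word[:-3] + "y"
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def build_frequency_index(documents):
    """Map each normalized word to its total count across all documents."""
    index = Counter()
    for text in documents:
        index.update(normalize(w) for w in re.findall(r"[A-Za-z]+", text))
    return index


# Hypothetical crawled texts, only for illustration.
documents = [
    "Web pages link to other pages",
    "A page has tags",
    "Tags summarize a page",
]
index = build_frequency_index(documents)
print(index.most_common(3))
```

Words with very high global counts are the ones you would down-weight when picking tags for a single page.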

From there, you can use algorithms such as PageRank or Authorities and Hubs to analyze the data.
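
For reference, a simplified PageRank can be sketched as a power iteration over a link graph. The tiny graph, the damping factor of 0.85, and the fixed iteration count are assumptions for this example, and this is a textbook-style sketch rather than how any search engine actually implements it.

```python
def pagerank(links, damping=0.85, iterations=100):
    """Return a dict of page -> rank for a graph given as page -> [targets]."""
    pages = sorted(
        set(links) | {t for targets in links.values() for t in targets}
    )
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}

    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical link graph: a and b both link to c, and c links back to a.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
for page, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(f"{page}: {score:.3f}")
```

Note that this ranks pages against each other using links, so it complements the per-page keyword scoring rather than replacing it.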

answered 2011-10-20T16:31:16.130