
I've been working on a project to data-mine a large number of short texts and categorize them against a large pre-existing list of category names. To do this I first had to build a good text corpus from the data, so that each category has reference documents, and then bring the quality of the categorization up to an acceptable level. That part is finished (luckily, text categorization is something a lot of people have researched extensively).

My next problem is finding a good way to link the categories to each other computationally, that is, to recognize that "cars" and "chevrolet" are related in some way. So far I've tried the N-gram categorization method described by, among others, Cavnar and Trenkle to compare the reference documents I've created for each category. Unfortunately, the best I've been able to get out of that method is approximately 50-55% correct relations between categories, and that is only for the strongest relations; overall it's around 30-35%, which is miserably low.
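For reference, the rank-order ("out-of-place") comparison in the style of Cavnar and Trenkle can be sketched roughly as follows. This is a minimal illustration, not the exact code used in the project: the function names, the underscore padding, and the `max_n` and `top_k` defaults are all assumptions made for the example.

```python
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Map each of the top_k most frequent character n-grams to its rank.

    Tokens are padded with underscores (an assumption of this sketch)
    so that word boundaries show up in the n-grams.
    """
    counts = Counter()
    for token in text.lower().split():
        padded = f"_{token}_"
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    ranked = [gram for gram, _ in counts.most_common(top_k)]
    return {gram: rank for rank, gram in enumerate(ranked)}


def out_of_place(profile_a, profile_b):
    """Sum of rank differences; n-grams missing from profile_b get a fixed penalty."""
    max_penalty = len(profile_b)
    return sum(
        abs(rank - profile_b[gram]) if gram in profile_b else max_penalty
        for gram, rank in profile_a.items()
    )


# Hypothetical reference texts for two categories.
trucks = ngram_profile("chevrolet trucks and pickup trucks")
makeup = ngram_profile("lipstick mascara and eyeliner")
distance = out_of_place(trucks, makeup)
```

A lower distance means more similar profiles. Because the profiles are built from characters rather than words, two categories can look close simply because they share common letter sequences, which may be one reason this comparison yields junk relations.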

I've tried a couple of other approaches as well, but I haven't been able to get much higher than 40% relevant links. An example of a non-relevant relation would be the category "trucks" being strongly related to "makeup" or "diapers" while being only weakly (or not at all) related to "chevy".

I've looked for better methods for doing this but can't seem to find any, yet I know others have done better than I have. Does anyone have any experience with this? Any tips on usable methods for creating relations between categories? Right now the methods I've tried either produce too few relations or contain far too high a percentage of junk relations.


1 Answer


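Obviously, the best way to do this kind of matching depends heavily on your taxonomy, the nature of your "reference documents", and the kind of relations you expect to create.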

However, based on the information provided, I would suggest the following:

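  1. Start by building a word-based (rather than letter-based) unigram or bigram model for each category from its reference documents. If you have only a few documents per category (it seems you may have only one), you could use a semi-supervised approach and also feed in the automatically categorized documents for each category. A relatively simple tool for building the models might be the CMU SLM toolkit.
  2. Compute the mutual information (information gain) of each term or phrase in the model with respect to the other categories. If your categories are similar, you may need to use only neighboring categories to get meaningful results. This step gives the best separating terms higher scores.
  3. Relate the categories to each other based on the most informative terms or phrases. This can be done using the Euclidean or cosine distance between the category models, or with more elaborate techniques such as graph-based algorithms or hierarchical clustering.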
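The three steps above could be sketched roughly as follows, using only the Python standard library. This is an illustration under stated assumptions rather than a definitive implementation: it uses word unigrams only, it swaps in positive pointwise mutual information (PPMI) as a simple stand-in for the mutual information / information gain scoring, and the function names and toy documents are hypothetical.

```python
import math
from collections import Counter


def unigram_counts(docs):
    """Step 1: word-based unigram counts for one category's reference documents."""
    counts = Counter()
    for doc in docs:
        counts.update(doc.lower().split())
    return counts


def ppmi_weights(models):
    """Step 2 (stand-in): keep terms with positive PMI against the whole corpus."""
    total = sum(sum(counts.values()) for counts in models.values())
    term_totals = Counter()
    for counts in models.values():
        term_totals.update(counts)
    weights = {}
    for cat, counts in models.items():
        cat_total = sum(counts.values())
        scored = {}
        for term, n in counts.items():
            pmi = math.log((n * total) / (cat_total * term_totals[term]))
            if pmi > 0:
                scored[term] = pmi
        weights[cat] = scored
    return weights


def cosine(a, b):
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(v * b[t] for t, v in a.items() if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_related(weights, cat):
    """Step 3: other categories sorted by similarity to `cat`, most similar first."""
    scores = [(other, cosine(weights[cat], w))
              for other, w in weights.items() if other != cat]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical toy reference documents, one per category.
reference_docs = {
    "cars": ["chevrolet engine truck wheels"],
    "chevy": ["chevrolet engine dealer"],
    "makeup": ["lipstick mascara brush"],
}
models = {cat: unigram_counts(docs) for cat, docs in reference_docs.items()}
weights = ppmi_weights(models)
related = rank_related(weights, "cars")
```

In this toy data "cars" and "chevy" share weighted terms while "makeup" shares none, so "chevy" ranks first for "cars". On real data you would probably want bigrams, stop-word handling, and a threshold on the similarity score before declaring two categories related.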
answered 2011-09-19T06:22:12.457