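I am currently working with 'lucene' and 'elasticsearch' and have the following problem. I need to get the stemmed form or lemma of a diminutive word. For example:

  • doggy -> dog
  • kitty -> cat

and so on.

But I get the following results:

  • doggy -> doggy
  • kitty -> kitty

Is there any way (a ready-to-use library, an algorithm, an approach, anything) to get the root/original word form from a diminutive word form?

Target language: Russian. For example: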

  • собачка -> собака
  • кошечка -> кошка

Thanks in advance!

1 Answer

Firstly, as a side note: what you're trying to do isn't typically called stemming or lemmatization.

Your first issue is mapping the observed token (e.g. собачка) to its normalised form (e.g. собака). Naively, this could be done by creating a SynonymFilter that uses a SynonymMap mapping diminutive forms to their canonical forms. However, you'll likely run into problems with any natural language because not all derivations are unambiguous. For example, in German, Mädel ('girl'/'lass') could be a diminutive form of Magd (an archaic word meaning 'young woman'/'maid') or of Made ('maggot').
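To illustrate the dictionary idea independently of Lucene, here is a minimal Python sketch. It is not the SynonymFilter/SynonymMap API, only the same lookup logic in plain code. The table `DIMINUTIVES` and the function `normalize` are hypothetical names, and the entries are illustrative rather than a real lexicon. Ambiguous entries are deliberately left untouched, since a plain lookup cannot choose between them.

```python
# Hypothetical diminutive -> canonical-form table (illustrative entries only).
# A token with more than one candidate is ambiguous.
DIMINUTIVES = {
    "собачка": ["собака"],
    "кошечка": ["кошка"],
    "mädel": ["magd", "made"],  # ambiguous, as in the German example
}


def normalize(tokens):
    """Replace unambiguous diminutives with their canonical form.

    Unknown tokens and ambiguous diminutives are returned unchanged.
    """
    result = []
    for token in tokens:
        candidates = DIMINUTIVES.get(token.lower(), [])
        if len(candidates) == 1:
            result.append(candidates[0])
        else:
            result.append(token)
    return result


if __name__ == "__main__":
    print(normalize(["собачка", "и", "кошечка"]))
    print(normalize(["Mädel"]))
```

In an actual Lucene or Elasticsearch setup, the same table would presumably be supplied as synonym rules to the analyzer rather than applied by hand.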

One way of disambiguating these two forms would be to calculate the probability of each canonical form appearing in the given context (e.g. the history of the preceding n tokens) and then replace the diminutive form with the most probable canonical form, using a custom TokenFilter to do so. See e.g. the Wikipedia entry on word-sense disambiguation for different approaches.
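As a rough sketch of the context-based idea, the Python snippet below picks the canonical form with the highest smoothed bigram probability given only the single preceding token (n = 1). Everything here is an assumption for illustration: the counts in `BIGRAM_COUNTS` are made up, `CANDIDATES` and `disambiguate` are hypothetical names, and a real system would estimate the counts from a corpus and wrap the logic in a custom TokenFilter.

```python
from collections import Counter

# Hypothetical counts of (previous token, canonical form) pairs,
# as if collected from a corpus. The numbers are invented.
BIGRAM_COUNTS = Counter({
    ("junge", "magd"): 8,
    ("junge", "made"): 1,
    ("fette", "made"): 5,
})

# Hypothetical table of ambiguous diminutives and their candidate forms.
CANDIDATES = {"mädel": ["magd", "made"]}


def disambiguate(prev_token, token):
    """Return the most probable canonical form of `token` given `prev_token`.

    Tokens without an entry in CANDIDATES are returned unchanged.
    """
    candidates = CANDIDATES.get(token.lower())
    if not candidates:
        return token

    def score(candidate):
        # Add-one smoothing so unseen pairs still get a non-zero score.
        return BIGRAM_COUNTS[(prev_token.lower(), candidate)] + 1

    total = sum(score(c) for c in candidates)
    # Normalising by `total` turns the scores into probabilities; the
    # ranking is the same, but the values are easier to inspect.
    return max(candidates, key=lambda c: score(c) / total)


if __name__ == "__main__":
    print(disambiguate("junge", "Mädel"))
    print(disambiguate("fette", "Mädel"))
```

A longer history or a proper word-sense disambiguation model would replace the single-token context used here.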

answered 2014-12-04T13:05:31.330