stemming - Is there an implementation of a stemming algorithm for Croatian?

Question

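I'm looking for an implementation of a stemming algorithm for Croatian. Ideally in Java, but I would accept any other language.

Is there a community of English-speaking developers who are building search applications for Croatian?

Thanks,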

score 6 · Accepted Answer

Slavic languages are highly inflected. The most accurate and fastest approach is to combine rules with a large mapping/dictionary.
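To make the "rules plus dictionary" idea concrete, here is a minimal sketch of how the two could be combined: look the word up in an exception mapping first, and fall back to suffix stripping only when the lookup misses. The word list, the suffix list and the names `EXCEPTIONS`, `SUFFIXES`, `MIN_STEM` and `stem` are all made up for illustration; a real stemmer would need a proper lexicon and a far more careful rule set.

```python
# Toy data, invented for illustration only; this is not a real Croatian lexicon.
EXCEPTIONS = {"ljudi": "čovjek", "psa": "pas"}

# Hypothetical suffix rules, tried longest first so "ama" wins over "a".
SUFFIXES = sorted(["ama", "ima", "om", "a", "e", "i", "u"], key=len, reverse=True)

# Never strip a suffix if fewer than this many characters would remain.
MIN_STEM = 3


def stem(word):
    """Dictionary lookup first, suffix stripping as the fallback."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[:-len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["ljudi", "knjigama", "knjige", "oko"]:
        print(w, "->", stem(w))
```

The mapping handles irregular forms exactly and in constant time, while the rules cover words the mapping has never seen, which is why the combination tends to be both fast and reasonably accurate.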

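Work has been done, but it has stalled. The Croatian Morphological Lexicon would help, but its API is slow. More work can be found that covers Bosnian, Serbian and Croatian together than covers Croatian alone.

A large mapping isn't always convenient (and a better rule-based transformer can be built efficiently from a mapping, dictionary or corpus).

Implementing it with Hunspell and an affix file is probably a good way to get community and Java support. For example, search Google for: hr_hr.aff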

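Untested: it should be possible to reverse all the words, build a trie of the ending characters, traverse it using some rules (e.g. LCS), and build an accurate statistical transformer from corpus text.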
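A rough sketch of that idea, under some simplifying assumptions: instead of LCS it uses the longest common prefix of a word form and its lemma to derive a "strip this ending, append that ending" rule, and it counts those rules in a trie keyed on the word's characters read from the end. The training pairs and the names `Node`, `train`, `lemmatize` and `TOY_PAIRS` are hypothetical; real training data would come from a lemmatized corpus.

```python
from collections import Counter


class Node:
    """Trie node keyed on word endings; counts the rewrite rules seen below it."""

    def __init__(self):
        self.children = {}
        self.rules = Counter()  # (ending_to_strip, ending_to_append) -> count


def common_prefix_len(a, b):
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def train(pairs):
    """Build the suffix trie from (word form, lemma) pairs."""
    root = Node()
    for form, lemma in pairs:
        k = common_prefix_len(form, lemma)
        rule = (form[k:], lemma[k:])
        node = root
        for ch in reversed(form):  # walk the word from its last character
            node = node.children.setdefault(ch, Node())
            node.rules[rule] += 1
    return root


def lemmatize(root, word):
    """Apply the most frequent applicable rule at the deepest matching node."""
    node, best = root, None
    for ch in reversed(word):
        node = node.children.get(ch)
        if node is None:
            break
        for (strip, append), _count in node.rules.most_common():
            if word.endswith(strip):
                best = (strip, append)
                break
    if best is None:
        return word  # unseen ending: leave the word unchanged
    strip, append = best
    return word[:len(word) - len(strip)] + append


# Toy training pairs, invented for illustration only.
TOY_PAIRS = [("knjige", "knjiga"), ("knjigu", "knjiga"),
             ("žene", "žena"), ("ženu", "žena"), ("ženama", "žena")]

if __name__ == "__main__":
    trie = train(TOY_PAIRS)
    for w in ["ribe", "ribama", "grad"]:
        print(w, "->", lemmatize(trie, w))
```

Because the rules are keyed on endings rather than whole words, the trie generalizes to words that never appeared in training, and the counts give a simple statistical tie-breaker when several rules share an ending.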

The best I can do is some Python:

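# Requires the pyhunspell bindings and a Croatian Hunspell dictionary;
# adjust the paths to wherever hr_HR.dic / hr_HR.aff live on your system.
import hunspell

hs = hunspell.HunSpell(
    '/usr/share/myspell/hr_HR.dic',
    '/usr/share/myspell/hr_HR.aff')

# The following should return ['hrvatska']:
print(hs.stem('hrvatski'))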

score 0 · Accepted Answer

Here you can find a recent implementation done at FFZG in Python: a stemmer for Croatian.

We performed basic evaluation of the stemmer on a lemmatized newspaper corpus as gold standard with a precision of 0.986 and recall of 0.961 (F1 0.973) for adjectives and nouns. On all parts of speech we obtained precision of 0.98 and recall of 0.92 (F1 0.947).

It is released under a GNU licence, but feel free to contact the author for further help (I only know the original author, Nikola, but not his student).
