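I am parsing the Dutch Wikipedia, which contains category markup like this:
[[Categorie:Nederlands beeldhouwer]]
The English Wikipedia, however, uses this markup:
[[Category:Japanese diplomats]]
So the markup (Categorie/Category) depends on the language. Is it possible to use the Lucene WikipediaTokenizer for non-English wikis? If so, how?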
I think Wikipedia markup is language dependent, and API results also differ by language.
Following http://www.mediawiki.org/wiki/API, I ran a quick experiment with the same query and got different results from http://en.wikipedia.org/w/api.php and http://nl.wikipedia.org/w/api.php.
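To make the difference concrete, here is a rough Python sketch of how the localized category prefix could be read from the API's `siteinfo` query. The helper names `siteinfo_url` and `category_prefix` are made up for illustration, and the parsing assumes the legacy JSON layout, where the local namespace name sits under the `"*"` key. Namespace id 14 is the Category namespace on every MediaWiki site.

```python
import json
from urllib.parse import urlencode

# MediaWiki uses the same numeric id for the Category namespace on every wiki.
CATEGORY_NAMESPACE_ID = 14


def siteinfo_url(lang):
    """Build the siteinfo query URL for one language edition (hypothetical helper)."""
    params = {
        "action": "query",
        "meta": "siteinfo",
        "siprop": "namespaces",
        "format": "json",
    }
    return "https://%s.wikipedia.org/w/api.php?%s" % (lang, urlencode(params))


def category_prefix(siteinfo_json):
    """Return the localized Category namespace name from a siteinfo response.

    Assumes the legacy JSON format, where the local name is stored under "*".
    """
    data = json.loads(siteinfo_json)
    namespace = data["query"]["namespaces"][str(CATEGORY_NAMESPACE_ID)]
    return namespace["*"]


# Hand-written, heavily abridged sample of what nl.wikipedia.org might return.
sample = '{"query": {"namespaces": {"14": {"id": 14, "*": "Categorie", "canonical": "Category"}}}}'
print(siteinfo_url("nl"))
print(category_prefix(sample))
```

Fetching `siteinfo_url("nl")` with any HTTP client and passing the body to `category_prefix` should give the Dutch prefix without hardcoding it.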
The Lucene WikipediaTokenizer is an extension of StandardTokenizer, so it should support and index all languages.
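If the tokenizer's category rule turns out to match only the English `Category:` prefix (an assumption on my part, I have not checked it against the Lucene source), one possible workaround is to rewrite the localized prefix before the text reaches the tokenizer. A minimal sketch in Python, where `normalize_category_links` is a hypothetical helper and not part of Lucene:

```python
import re


def normalize_category_links(wikitext, localized_prefix):
    """Rewrite [[<localized_prefix>:...]] links to the English [[Category:...]] form.

    Only the opening of the link is touched; the category title is left as-is.
    """
    pattern = re.compile(
        r"\[\[\s*" + re.escape(localized_prefix) + r"\s*:",
        re.IGNORECASE,
    )
    return pattern.sub("[[Category:", wikitext)


dutch = "Tekst [[Categorie:Nederlands beeldhouwer]]"
print(normalize_category_links(dutch, "Categorie"))
```

The same substitution could be done on the Java side before handing the text to the tokenizer. Links that already use the English prefix are left unchanged.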