I am trying to use UIMA Ruta to tag abbreviations in some documents. I wrote the simple script below, but it does not work for some abbreviations.
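My algorithm is as follows:
1. Split the abbreviation into letters/digits (ATM -> A,T,M; IC3 -> I,C,3).
2. Expand digits into letters (I,C,3 -> I,C,C,C).
3. Read the current sentence and match the letters against the words (stop words may or may not be included).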
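To make the three steps concrete, here is a minimal sketch of what I have in mind, written in plain Python rather than Ruta. The function names (`expand_abbreviation`, `find_long_form`) and the tiny `STOP_WORDS` set are placeholders made up for illustration; in my pipeline the stop words come from the `EnglishStopWord` annotation. I am also assuming that a digit means "the previous letter occurs that many times in total".

```python
import re

# Assumption: a tiny illustrative list, standing in for the EnglishStopWord annotation.
STOP_WORDS = {"of", "and", "the", "in", "for"}


def expand_abbreviation(abbr):
    """Steps 1 and 2: split into characters and expand digits (IC3 -> I, C, C, C)."""
    letters = []
    for ch in abbr:
        if ch.isdecimal() and letters:
            # Assumption: digit n means the previous letter occurs n times in total.
            letters.extend(letters[-1] * (int(ch) - 1))
        elif ch.isalpha():
            letters.append(ch.upper())
        # Anything else (quotes, brackets) is ignored.
    return letters


def find_long_form(abbr, sentence):
    """Step 3: find a run of words whose initials spell the abbreviation.

    A stop word inside the run may either supply a letter or be skipped.
    Returns the matched words joined by spaces, or None if nothing matches.
    """
    letters = expand_abbreviation(abbr)
    if not letters:
        return None
    words = re.findall(r"[A-Za-z]+", sentence)

    def match_from(w, l):
        # Returns the index one past the last matched word, or None.
        if l == len(letters):
            return w
        if w == len(words):
            return None
        word = words[w]
        if word[0].upper() == letters[l]:
            end = match_from(w + 1, l + 1)
            if end is not None:
                return end
        # Skip a stop word, but never at the very start of the long form.
        if l > 0 and word.lower() in STOP_WORDS:
            return match_from(w + 1, l)
        return None

    for start in range(len(words)):
        end = match_from(start, 0)
        if end is not None:
            return " ".join(words[start:end])
    return None


print(find_long_form("IC3", "Internet Crime Complaint Center"))
print(find_long_form("NAS", "The National Academies of Science, Engineering, and Medicine"))
```

The sentence passed in is the text before the bracket, without the abbreviation itself. This sketch covers cases like SIETAR, NAS and IC3, but not CKiD, where the letters do not follow the word order.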
But I don't know how to achieve the same thing in Ruta. Where can I find loops and control structures like these?
Sample input:
The National Academies of Science, Engineering, and Medicine (NAS)
registered nurses (RNs)
Licensed practical nurses (LPNs)
Asian/Pacific Islander Americans (APIAs)
Crime&Investigation Network (CI)
Internet Crime Complaint Center (“IC3”)
Practice Management <PM>
Script:
CW (EnglishStopWord?|SPECIAL?)? CW (EnglishStopWord?|SPECIAL?)? CW (EnglishStopWord?|SPECIAL?)? CW (EnglishStopWord?|SPECIAL?)? CW LParen CAP RParen{-> MARK(DZC_ABBREVIATIONS, 1, 12)};
CW (EnglishStopWord?|SPECIAL?)? CW (EnglishStopWord?|SPECIAL?)? CW (EnglishStopWord?|SPECIAL?)? CW (EnglishStopWord?|SPECIAL?)? CW{-PARTOF(DZC_ABBREVIATIONS)} LParen CAP RParen{-PARTOF(DZC_ABBREVIATIONS) -> MARK(DZC_ABBREVIATIONS, 1, 12)};
CW (EnglishStopWord?|SPECIAL?)? CW (EnglishStopWord?|SPECIAL?)? CW (EnglishStopWord?|SPECIAL?)? CW (LParen CAP SW? RParen){-PARTOF(DZC_ABBREVIATIONS) -> MARK(DZC_ABBREVIATIONS, 1, 11)};
Abbreviations that are not tagged:
Chronic Kidney Disease in Children (CKiD)
Society of Intercultural Education, Training, and Research (SIETAR)
The National Academies of Science, Engineering, and Medicine (NAS)
Internet Crime Complaint Center (“IC3”)