Is it possible to do sorting in UIMA Ruta? For example:
Input file:
some text
Fig 1.1
Table 1.1
Fig 1.2
some text
Pic 1.2
Table 1.2
some text
Table 1.3
Pic 1.3
some text
Fig 1.4
some text
Table 1.4
some text
Table 1.5
Fig 1.6
Box 1.1
Fig 1.5
How can I find the missing figure (Fig 1.3)?
Here is an example of how this can be done with UIMA Ruta 2.5.0.
Input text:
some text
Fig 1.1
some text
Pic 1.2
some text
Pic 1.3
some text
Fig 1.4
some text
Rule script:
DECLARE FigureInd;
DECLARE FigureMention (INT chapter, INT section);
ACTION FM(INT chap, INT sect) = CREATE(FigureMention, "chapter" = chap, "section" = sect);

// Annotate explicit "Fig c.s" mentions.
"Fig"-> FigureInd;
INT c, s;
(FigureInd NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){-> FM(c,s)};

// Mark the span between two mentions of the same chapter whose section numbers are not consecutive.
DECLARE FigMissing;
f1:FigureMention #{-> FigMissing} f2:FigureMention
    {f1.chapter == f2.chapter, f1.section < (f2.section - 1)};

// Inside each gap, accept any token followed by the next expected number.
INT pc, ps;
f:FigureMention{-> pc=f.chapter, ps=f.section}
    FigMissing->{
        (ANY @NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){c==pc,s==ps+1-> FM(c,s), pc=c, ps=s};
    };
Created FigureMention annotations:
Fig 1.1
Pic 1.2
Pic 1.3
Fig 1.4
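The two-step logic of the script (mark the gap between two non-consecutive Fig mentions, then accept any numbered token inside the gap that continues the sequence) can also be sketched outside Ruta. The following is only a plain Python illustration under assumptions: the regular expressions and the function name find_mentions are hypothetical stand-ins for the Ruta token types and rules, not part of UIMA Ruta.

```python
import re

# The input text from the answer.
TEXT = """some text
Fig 1.1
some text
Pic 1.2
some text
Pic 1.3
some text
Fig 1.4
some text
"""

# Assumed stand-ins for the Ruta patterns: "Fig" NUM PERIOD NUM and ANY NUM PERIOD NUM.
FIG = re.compile(r"Fig\s+(\d+)\.(\d+)")
ANY_NUMBERED = re.compile(r"(\w+)\s+(\d+)\.(\d+)")


def find_mentions(text):
    """Return (covered text, chapter, section) for all mentions in document order."""
    figs = list(FIG.finditer(text))
    # Step 1: every explicit "Fig c.s" is a mention.
    found = {m.start(): (m.group(0), int(m.group(1)), int(m.group(2))) for m in figs}
    for f1, f2 in zip(figs, figs[1:]):
        c1, s1 = int(f1.group(1)), int(f1.group(2))
        c2, s2 = int(f2.group(1)), int(f2.group(2))
        # Step 2: only look into gaps where a section number is skipped.
        if c1 != c2 or not s1 < s2 - 1:
            continue
        pc, ps = c1, s1
        # Step 3: inside the gap, accept tokens carrying the next expected number.
        for m in ANY_NUMBERED.finditer(text, f1.end(), f2.start()):
            c, s = int(m.group(2)), int(m.group(3))
            if c == pc and s == ps + 1:
                found[m.start()] = (m.group(0), c, s)
                ps = s
    return [found[pos] for pos in sorted(found)]


for covered, chapter, section in find_mentions(TEXT):
    print(covered, chapter, section)
```

Tracking the previously accepted number (pc, ps) mirrors the variables of the same name in the script, so that Pic 1.3 is only accepted after Pic 1.2 has been seen.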
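The solution for UIMA Ruta 2.4.0 is quite similar, but that version does not allow features of annotation label expressions to be used directly. The feature values need to be stored in additional variables, and the boolean checks need to be applied after the variables have been set. Here is the solution for UIMA Ruta 2.4.0: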
DECLARE FigureInd;
DECLARE FigureMention (INT chapter, INT section);
ACTION FM(INT chap, INT sect) = CREATE(FigureMention, "chapter" = chap, "section" = sect);

"Fig"-> FigureInd;
INT c, s;
(FigureInd NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){-> FM(c,s)};

DECLARE FigMissing;
INT c1,c2,s1,s2;
(FigureMention<-{FigureMention{-> ASSIGN(c1, FigureMention.chapter), ASSIGN(s1, FigureMention.section)};}
    #{-> FigMissing}
    FigureMention<-{FigureMention{-> ASSIGN(c2, FigureMention.chapter), ASSIGN(s2, FigureMention.section)};})
    {c1 == (c2), s1 < (s2 - 1)};

INT pc, ps;
f:FigureMention{-> pc=FigureMention.chapter, ps=FigureMention.section}
    FigMissing->{
        (ANY @NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){c==(pc),s==(ps+1)-> FM(c,s), pc=c, ps=s};
    };
(Disclaimer: I am a developer of UIMA Ruta)
The following script for UIMA Ruta 2.4.0 creates annotations that hold the minimum and maximum values of the missing numbers:
DECLARE FigureInd;
DECLARE FigureMention (INT chapter, INT section);
DECLARE FigureMissing (INT minChapter, INT minSection, INT maxChapter, INT maxSection);
ACTION Mention(INT chap, INT sect) = CREATE(FigureMention, "chapter" = chap, "section" = sect);
ACTION Missing(INT minc, INT mins, INT maxc, INT maxs) = CREATE(FigureMissing, "minChapter" = minc, "minSection" = mins, "maxChapter" = maxc, "maxSection" = maxs);

"Fig"-> FigureInd;
INT c, s;
(FigureInd NUM{PARSE(c)} PERIOD NUM{PARSE(s)}){-> Mention(c,s)};

DECLARE FigMissing;
INT c1,c2,s1,s2;
(FigureMention<-{FigureMention{-> ASSIGN(c1, FigureMention.chapter), ASSIGN(s1, FigureMention.section)};}
    #{-> Missing(c1,s1+1,c2,s2-1)}
    FigureMention<-{FigureMention{-> ASSIGN(c2, FigureMention.chapter), ASSIGN(s2, FigureMention.section)};})
    {c1 == (c2), s1 < (s2 - 1)};
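In UIMA Ruta, there are no loops over boolean expressions (such as while), only over existing annotations. This makes it more complicated to create a separate annotation for each missing Fig on the same offsets. However, it can be done with a recursive BLOCK. The script of this answer instead creates one annotation that defines a range of missing numbers.
For the example text of the question, two FigureMissing annotations are created: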
FigureMissing
- begin: 41
- end: 112
- minChapter: 1
- minSection: 3
- maxChapter: 1
- maxSection: 3
FigureMissing
- begin: 123
- end: 165
- minChapter: 1
- minSection: 5
- maxChapter: 1
- maxSection: 5
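As a cross-check of these two results, the range rule (same chapter, s1 < s2 - 1, range from s1 + 1 to s2 - 1) can be replayed in plain Python on the question's input. This is a sketch under assumptions, not Ruta code: the regular expression and the function name missing_ranges are hypothetical, and the begin/end offsets are not reproduced.

```python
import re

# The input file from the question.
QUESTION_TEXT = """some text
Fig 1.1
Table 1.1
Fig 1.2
some text
Pic 1.2
Table 1.2
some text
Table 1.3
Pic 1.3
some text
Fig 1.4
some text
Table 1.4
some text
Table 1.5
Fig 1.6
Box 1.1
Fig 1.5
"""

# Assumed stand-in for the Ruta pattern "Fig" NUM PERIOD NUM.
FIG = re.compile(r"Fig\s+(\d+)\.(\d+)")


def missing_ranges(text):
    """Return (minChapter, minSection, maxChapter, maxSection) per detected gap."""
    figs = [(int(m.group(1)), int(m.group(2))) for m in FIG.finditer(text)]
    ranges = []
    # Compare each Fig mention with the next one in document order.
    for (c1, s1), (c2, s2) in zip(figs, figs[1:]):
        if c1 == c2 and s1 < s2 - 1:
            ranges.append((c1, s1 + 1, c2, s2 - 1))
    return ranges


print(missing_ranges(QUESTION_TEXT))
```

The second range (1.5) is reported because only neighbouring mentions are compared: Fig 1.5 appears after Fig 1.6 in the text, so the pair Fig 1.4 and Fig 1.6 still looks like a gap.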
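If the second FigureMissing should not be created, an additional rule can remove it again based on the existing FigureMention annotations. This would of course be much simpler if separate FigureMissing annotations had been created for each missing number, e.g., using a BLOCK.
Disclaimer: I am a developer of UIMA Ruta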
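The clean-up step can be sketched in plain Python as well: expand each range into single numbers, then drop every number that is mentioned somewhere else in the document. This is an illustration only; expand and drop_existing are hypothetical helper names, and the sketch assumes that each range stays within one chapter, which holds here because the rule requires c1 == c2.

```python
def expand(ranges):
    """Turn (minChapter, minSection, maxChapter, maxSection) ranges into single numbers."""
    numbers = []
    for min_c, min_s, _max_c, max_s in ranges:
        # Assumption: minChapter == maxChapter, so only the section varies.
        numbers.extend((min_c, s) for s in range(min_s, max_s + 1))
    return numbers


def drop_existing(missing, mentions):
    """Remove numbers that occur as a mention anywhere in the document."""
    existing = set(mentions)
    return [number for number in missing if number not in existing]


# The two FigureMissing ranges and the Fig mentions of the question's text.
ranges = [(1, 3, 1, 3), (1, 5, 1, 5)]
mentions = [(1, 1), (1, 2), (1, 4), (1, 6), (1, 5)]

print(drop_existing(expand(ranges), mentions))
```

With the question's data, only Fig 1.3 remains, because Fig 1.5 occurs later in the text.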