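I'm building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of about 30,000 hand-graded articles with their corresponding quality grades). As part of this, I'm trying to figure out a way to algorithmically count the number of references that appear on a page.
As a quick example, here is an excerpt from a raw Wiki page: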
'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>
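So far, I've concluded that I can find the number of images by counting the occurrences of [[Image: in the text. I was hoping I could do something similar for references. In fact, after comparing raw Wiki pages with their corresponding live pages, I think I've determined that </ref> corresponds to the closing tag of a reference on a Wiki page. For example, in the excerpt above you can see that the author makes a statement at the end of the paragraph and cites Hammond, 58–9 within <ref>{text}</ref>.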
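To make the idea concrete, here is a minimal sketch of the counting approach in Python, using only the standard `re` module. The function names `count_images` and `count_refs` are hypothetical, made up for this illustration. It rests on two assumptions: the excerpt above also uses `[[File:` for one of its images, so the image pattern accepts both prefixes, and a page may reuse a named reference with a self-closing tag such as `<ref name="x" />`, which has no `</ref>` and is therefore only counted when `include_reused=True`.

```python
import re

# Opening of an image link: [[Image:...]] or [[File:...]] (both appear in the excerpt).
IMAGE_RE = re.compile(r"\[\[\s*(?:Image|File)\s*:", re.IGNORECASE)
# Closing tag of a reference: one </ref> per reference body.
REF_CLOSE_RE = re.compile(r"</ref\s*>", re.IGNORECASE)
# Self-closing tag, assumed to mark reuse of a named reference: <ref name="x" />
REF_SELF_CLOSING_RE = re.compile(r"<ref\b[^>]*/>", re.IGNORECASE)


def count_images(wikitext: str) -> int:
    """Count image links by their opening marker."""
    return len(IMAGE_RE.findall(wikitext))


def count_refs(wikitext: str, include_reused: bool = False) -> int:
    """Count references by their closing </ref> tag.

    With include_reused=True, self-closing <ref ... /> tags are added too.
    """
    total = len(REF_CLOSE_RE.findall(wikitext))
    if include_reused:
        total += len(REF_SELF_CLOSING_RE.findall(wikitext))
    return total


if __name__ == "__main__":
    # Tiny made-up sample, not taken from a real article.
    sample = (
        "[[Image:a.jpg|thumb|x]]\n"
        "[[File:b.jpg|thumb|y]]\n"
        'A claim.<ref name="h">Hammond, 58-9</ref> '
        'Another claim.<ref name="h" /><references />'
    )
    print(count_images(sample))
    print(count_refs(sample))
    print(count_refs(sample, include_reused=True))
```

Whether reused references should count toward the total probably depends on what the quality model is meant to capture, which is why it is left as a switch here rather than decided either way.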
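If anyone is familiar with raw Wiki data and can shed some light on this, please let me know! Also, if you know of a better way to do this, I'd like to hear it.
Thanks in advance!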