regex - How to count the number of citations/references in raw Wikipedia text?

Question

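I am building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of about 30,000 hand-graded articles with their corresponding quality grades). For this, I am trying to find a way to algorithmically count the number of references that appear on a page.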

As a quick example, here is an excerpt from a raw Wiki page:

'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>

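So far, I have concluded that I can find the number of images by counting the occurrences of [[Image:. I was hoping I could do something similar for references. In fact, after comparing raw Wiki pages with their corresponding live pages, I think I was able to determine that </ref> corresponds to the closing marker of a reference on a Wiki page. For example: in the excerpt above, the author makes a statement at the end of the paragraph and cites Hammond, 58–9 inside <ref>{text}</ref>.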
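The counting idea described above could be sketched roughly as follows. This is only an illustration, assuming the raw wikitext is already available as a Python string; the function name `count_markers` and the sample text are hypothetical, and matching `[[File:` in addition to `[[Image:` is an assumption based on the excerpt, which uses both prefixes.

```python
import re

def count_markers(wikitext: str) -> dict:
    """Count image links and closing </ref> tags in raw wikitext (hypothetical helper)."""
    # Assumption: images may be introduced by either [[Image: or [[File:
    images = len(re.findall(r"\[\[(?:Image|File):", wikitext, flags=re.IGNORECASE))
    # Every full <ref>...</ref> footnote ends with exactly one </ref>
    closing_refs = len(re.findall(r"</ref\s*>", wikitext, flags=re.IGNORECASE))
    return {"images": images, "closing_refs": closing_refs}

# Made-up sample text, not taken from the dataset
sample = (
    "[[Image:A.jpg|thumb|Caption]]\n"
    "[[File:B.jpg|thumb|Caption]]\n"
    "Some claim.<ref>Hammond, 58–9</ref>"
)
counts = count_markers(sample)
print(counts)
```

Counting only `</ref>` misses references that are reused through a self-closing tag, so this gives a rough count of footnote bodies rather than of every citation marker on the page.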

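If anyone is familiar with raw Wiki data and can shed some light on this, please let me know! Also, if you know of a better approach, please let me know that too!

Thanks in advance!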

score 1 · Accepted Answer

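A ref does not always contain a link to a source. Sometimes it contains an explanatory note or similar.
You have to count not only <ref>...</ref> tags but also footnote templates.
If you need to count unique references, you must exclude grouped references (refs with a name="xxx" parameter, or auto-grouping footnote templates with the same content).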
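The points above might be sketched like this. It is a rough illustration under stated assumptions, not a complete wikitext parser: reused references are assumed to appear as `<ref name="xxx">` or self-closing `<ref name="xxx" />` tags, and `{{sfn|...}}` stands in as the only footnote template handled. The function name `count_references` and the sample text are hypothetical.

```python
import re

# Opening <ref ...> tags, including self-closing ones; </ref> and <references /> do not match.
REF_OPEN = re.compile(r"<ref\b([^>]*?)(/?)>", re.IGNORECASE)
# name="x", name='x' or name=x inside the tag attributes
NAME_ATTR = re.compile(r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""", re.IGNORECASE)
# Assumption: {{sfn|...}} is the only footnote template considered here.
SFN = re.compile(r"\{\{\s*sfn\s*\|([^{}]*)\}\}", re.IGNORECASE)

def count_references(wikitext: str) -> dict:
    """Return total citation markers and an estimate of unique references."""
    total = 0
    names = set()   # each distinct name="..." counts once
    unnamed = 0     # every unnamed <ref>...</ref> counts separately
    for match in REF_OPEN.finditer(wikitext):
        total += 1
        name = NAME_ATTR.search(match.group(1))
        if name:
            names.add(next(g for g in name.groups() if g is not None).strip())
        elif not match.group(2):
            unnamed += 1
    sfn_keys = set()  # identical template arguments are treated as one reference
    for match in SFN.finditer(wikitext):
        total += 1
        sfn_keys.add("|".join(part.strip() for part in match.group(1).split("|")))
    return {"total_uses": total, "unique": len(names) + unnamed + len(sfn_keys)}

# Made-up sample text
sample = (
    'Claim one.<ref name="hammond">Hammond, 58–9</ref> '
    'Claim two.<ref name="hammond" /> '
    'Claim three.<ref>An explanatory note.</ref> '
    'Claim four.{{sfn|Smith|2001|p=4}} '
    'Claim five.{{sfn|Smith|2001|p=4}} '
    '<references />'
)
result = count_references(sample)
print(result)
```

Note that the unnamed ref in the sample is an explanatory note rather than a source; telling those apart needs inspection of the ref content, which this sketch does not attempt.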

Sorry for my bad English.

score 0 · Accepted Answer

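Counting ref tags in the wiki markup is not necessarily accurate, because references can be reused, so two </ref> tags may show up as a single entry in the final reference list. There is an API that is supposed to provide that list for an article, but for some reason it has been deactivated. BeautifulSoup makes this quite simple, though. I have not tested this to check that it counts correctly for all articles, but it works: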

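from bs4 import BeautifulSoup
import requests

# Fetch the rendered article HTML rather than the raw wiki markup.
page = requests.get('https://en.wikipedia.org/wiki/Stack_Overflow')
page.raise_for_status()
soup = BeautifulSoup(page.content, 'html.parser')

# Each entry in the rendered reference list is a <span class="reference-text">.
count = len(soup.find_all('span', attrs={'class': 'reference-text'}))

print(count)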
