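I'm building a model to classify raw Wikipedia text by article quality (Wikipedia has a dataset of about 30,000 hand-graded articles and their corresponding quality grades). That said, I'm trying to figure out a way to algorithmically count the number of references that appear on a page.

As a quick example, here is an excerpt from a raw Wiki page: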

'[[Image:GD-FR-Paris-Louvre-Sculptures034.JPG|320px|thumb|Tomb of Philippe Pot, governor of [[Burgundy (region)|Burgundy]] under [[Louis XI]]|alt=A large sculpture of six life-sized black-cloaked men, their faces obscured by their hoods, carrying a slab upon which lies the supine effigy of a knight, with hands folded together in prayer. His head rests on a pillow, and his feet on a small reclining lion.]]\n[[File:Sejong tomb 1.jpg|thumb|320px|Korean tomb mound of King [[Sejong the Great]], d. 1450]]\n[[Image:Istanbul - Süleymaniye camii - Türbe di Roxellana - Foto G. Dall\'Orto 28-5-2006.jpg|thumb|320px|[[Türbe]] of [[Roxelana]] (d. 1558), [[Süleymaniye Mosque]], [[Istanbul]]]]\n\'\'\'Funerary art\'\'\' is any work of [[art]] forming, or placed in, a repository for the remains of the [[death|dead]]. [[Tomb]] is a general term for the repository, while [[grave goods]] are objects—other than the primary human remains—which have been placed inside.<ref>Hammond, 58–9 characterizes [[Dismemberment|disarticulated]] human skeletal remains packed in body bags and incorporated into [[Formative stage|Pre-Classic]] [[Mesoamerica]]n [[mass burial]]s (along with a set of primary remains) at Cuello, [[Belize]] as "human grave goods".</ref>

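So far, I have concluded that I can find the number of images by counting the occurrences of [[Image:. I was hoping I could do something similar for references. In fact, after comparing raw Wiki pages with their corresponding live pages, I think I was able to determine that </ref> marks the end of a reference on a Wiki page. For example, in the excerpt above you can see that the author makes a statement at the end of the paragraph and cites Hammond, 58–9 inside <ref>{text}</ref>.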
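The counting idea described here could be sketched roughly as follows. This is only a minimal illustration, not a tested solution: the helper name `count_markup` and the shortened sample string are made up for the example, and it assumes that `[[File:` links should be counted alongside `[[Image:` (both appear in the excerpt) and that self-closing `<ref ... />` tags count as references too.

```python
import re

# Start of an image link: [[Image:... or [[File:... (case-insensitive).
IMAGE_RE = re.compile(r"\[\[\s*(?:Image|File)\s*:", re.IGNORECASE)
# A closing </ref> tag, or a self-closing <ref name="..." /> tag.
REF_RE = re.compile(r"</ref\s*>|<ref\b[^>]*/\s*>", re.IGNORECASE)


def count_markup(wikitext: str) -> dict:
    """Return naive counts of image links and reference tags in raw wikitext."""
    return {
        "images": len(IMAGE_RE.findall(wikitext)),
        "refs": len(REF_RE.findall(wikitext)),
    }


# Shortened, hypothetical stand-in for the excerpt above.
sample = (
    "[[Image:A.JPG|thumb|Tomb]]\n"
    "[[File:B.jpg|thumb|Mound]]\n"
    "Grave goods are objects.<ref>Hammond, 58–9</ref>"
)
print(count_markup(sample))
```

Note that this counts every occurrence of a reference tag, so a source cited several times on the page is counted several times.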

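If anyone is familiar with Wiki's raw data and can shed some light on this, please let me know! Also, if you know of a better way to do this, please let me know as well!

Thanks in advance!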


2 Answers

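  1. A ref does not always contain a link to a source. Sometimes it contains an explanatory note or similar.
  2. You have to count not only <ref>...</ref> tags but also footnote templates.
  3. If you need to count unique references, you must count grouped references only once (refs with a name="xxx" parameter, or automatically grouped footnote templates with identical content).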
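The last two points could be handled along these lines. This is a rough sketch under stated assumptions, not a complete parser: `count_unique_refs` is a hypothetical helper, `{{sfn}}` is assumed here as one example of a footnote template (extend `FOOTNOTE_TEMPLATES` as needed), and the Smith and Jones citations in the sample are invented. It does not try to separate explanatory notes from real sources (point 1).

```python
import re

# Assumption: {{sfn|...}} is one footnote template worth counting; add others as needed.
FOOTNOTE_TEMPLATES = ("sfn",)

# <ref ...>body</ref> or a self-closing <ref ... />.
REF_RE = re.compile(
    r"<ref\b(?P<attrs>[^>]*?)(?:/\s*>|>(?P<body>.*?)</ref\s*>)",
    re.IGNORECASE | re.DOTALL,
)
# name="x", name='x' or name=x inside the tag's attributes.
NAME_RE = re.compile(
    r"""\bname\s*=\s*(?:"([^"]*)"|'([^']*)'|([^\s/>]+))""", re.IGNORECASE
)
TEMPLATE_RE = re.compile(
    r"\{\{\s*(?:%s)\s*\|([^{}]*)\}\}" % "|".join(map(re.escape, FOOTNOTE_TEMPLATES)),
    re.IGNORECASE,
)


def count_unique_refs(wikitext: str) -> int:
    """Count references, collapsing reused named refs and identical templates."""
    names = set()      # named refs: one entry per distinct name
    templates = set()  # footnote templates: one entry per distinct argument list
    unnamed = 0        # unnamed <ref>...</ref> tags are each listed separately
    for match in REF_RE.finditer(wikitext):
        name = NAME_RE.search(match.group("attrs"))
        if name:
            names.add(next(g for g in name.groups() if g is not None).strip())
        elif match.group("body") is not None:
            unnamed += 1
    for match in TEMPLATE_RE.finditer(wikitext):
        templates.add(" ".join(match.group(1).split()))
    return len(names) + unnamed + len(templates)


sample = (
    'A.<ref name="hammond">Hammond, 58–9</ref> '
    'B.<ref name="hammond" /> '
    "C.<ref>Smith 2001, p. 4</ref> "
    "D.{{sfn|Jones|1999|p=12}} E.{{sfn|Jones|1999|p=12}}"
)
print(count_unique_refs(sample))
```

Here the two `hammond` tags collapse into one reference and the two identical `{{sfn}}` calls collapse into another, while the unnamed ref is counted on its own.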

Sorry for my bad English.

answered 2018-08-20T18:41:04.323

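Counting reference tags in the wiki markup is not necessarily accurate, because references can be reused, so two </ref> tags may show up as a single reference in the final list. There is an API that is supposed to provide that list for an article, but for some reason it has been deactivated. BeautifulSoup makes this pretty simple, though. I have not tested this to check that it counts correctly for every article, but it works: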

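from bs4 import BeautifulSoup
import requests

# Fetch the rendered article HTML rather than the raw wiki markup.
page = requests.get('https://en.wikipedia.org/wiki/Stack_Overflow')
page.raise_for_status()
soup = BeautifulSoup(page.content, 'html.parser')

# Each entry in the rendered reference list is wrapped in
# <span class="reference-text">, so a reused reference is counted once.
count = len(soup.find_all('span', class_='reference-text'))

print(count)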
answered 2018-08-20T22:29:32.567