I am trying to build a topic hierarchy using the following two DBpedia properties.

  1. the skos:broader property
  2. the dcterms:subject property

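My goal is to identify the topics of a given term. For example, given the term 'support vector machine', I want to identify topics such as classification algorithms, machine learning, etc.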

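However, I am sometimes unsure how to build the topic hierarchy, because I get more than 5 URIs for the subject property and many URIs for the broader property. Is there a way to measure strength (or something similar) so that I can discard the extra URIs I get from DBpedia and assign only the most probable ones?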

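There seem to be two questions here.

  1. How to limit the number of DBpedia Spotlight results.
  2. How to limit the number of subjects and categories for a particular result.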

My current code is as follows.

from SPARQLWrapper import SPARQLWrapper, JSON
import requests
import urllib.parse

## initial consts
BASE_URL = 'http://api.dbpedia-spotlight.org/en/annotate?text={text}&confidence={confidence}&support={support}'
TEXT = 'First documented in the 13th century, Berlin was the capital of the Kingdom of Prussia (1701–1918), the German Empire (1871–1918), the Weimar Republic (1919–33) and the Third Reich (1933–45). Berlin in the 1920s was the third largest municipality in the world. After World War II, the city became divided into East Berlin -- the capital of East Germany -- and West Berlin, a West German exclave surrounded by the Berlin Wall from 1961–89. Following German reunification in 1990, the city regained its status as the capital of Germany, hosting 147 foreign embassies.'
CONFIDENCE = '0.5'
SUPPORT = '120'
REQUEST = BASE_URL.format(
    text=urllib.parse.quote_plus(TEXT), 
    confidence=CONFIDENCE, 
    support=SUPPORT
)
HEADERS = {'Accept': 'application/json'}
sparql = SPARQLWrapper("http://dbpedia.org/sparql")
all_urls = []

r = requests.get(url=REQUEST, headers=HEADERS)
response = r.json()
resources = response['Resources']

for res in resources:
    all_urls.append(res['@URI'])

for url in all_urls:
    sparql.setQuery("""
        SELECT * WHERE {<"""
             +url+
            """>skos:broader|dct:subject ?resource 
            }
    """)

    sparql.setReturnFormat(JSON)
    results = sparql.query().convert()

    for result in results["results"]["bindings"]:
        print('resource ---- ', result['resource']['value'])

I am happy to provide more examples if needed.

1 Answer

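It looks like you are trying to retrieve the Wikipedia categories relevant to a given paragraph.

Minor suggestions

First, I suggest issuing a single request, collecting the DBpedia Spotlight results into a VALUES clause, for example like this: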

values = '(<{0}>)'.format('>) (<'.join(all_urls))
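As a minimal sketch of the same idea, the formatting can be wrapped in a small helper. The function name `build_values_rows`, the `dedupe` flag, and the character check are assumptions added for illustration; they are not part of the original snippet. Note that dropping duplicates changes the counts in any query that relies on repeated mentions.

```python
def build_values_rows(uris, dedupe=False):
    """Format URIs as one-column VALUES rows: (<u1>) (<u2>) ..."""
    if dedupe:
        # dict.fromkeys keeps the first occurrence of each URI, in order.
        uris = list(dict.fromkeys(uris))
    for uri in uris:
        # Reject characters that would close or break the <...> IRI token.
        if any(ch in uri for ch in '<>" {}|\\^`'):
            raise ValueError("not a safe IRI: {0!r}".format(uri))
    return " ".join("(<{0}>)".format(uri) for uri in uris)


# Spotlight returns one entry per mention, so URIs can repeat.
all_urls = [
    "http://dbpedia.org/resource/Berlin",
    "http://dbpedia.org/resource/Germany",
    "http://dbpedia.org/resource/Berlin",
]
values = build_values_rows(all_urls, dedupe=True)
```

With `dedupe=False` the output has the same shape as the one-liner above, so it can be dropped into the queries unchanged.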

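Second, if you are talking about a topic hierarchy, you should use SPARQL 1.1 property paths.

These two suggestions are slightly incompatible. Virtuoso is very inefficient when a query contains both multiple starting points (i.e. VALUES) and arbitrary-length paths (i.e. the * and + operators).

Below I use the dct:subject/skos:broader property path, i.e. I retrieve the 'next-level' categories.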
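One way to stay with fixed-length paths, which avoids the arbitrary-length operators, is to generate the sequence path for a chosen number of levels. This is only a sketch: `category_path`, `category_query`, and the `levels_up` parameter are hypothetical names, and the explicit PREFIX lines are added so that the query does not depend on the endpoint's predefined prefixes.

```python
PREFIXES = (
    "PREFIX dct: <http://purl.org/dc/terms/>\n"
    "PREFIX skos: <http://www.w3.org/2004/02/skos/core#>\n"
)


def category_path(levels_up):
    """Build dct:subject followed by `levels_up` skos:broader steps."""
    if levels_up < 0:
        raise ValueError("levels_up must be >= 0")
    return "/".join(["dct:subject"] + ["skos:broader"] * levels_up)


def category_query(values_rows, levels_up=1):
    """Build a SELECT query over pre-formatted VALUES rows such as '(<uri>)'."""
    return (
        PREFIXES
        + "SELECT DISTINCT ?s ?resource WHERE {\n"
        + "  VALUES (?s) { " + values_rows + " }\n"
        + "  ?s " + category_path(levels_up) + " ?resource .\n"
        + "}"
    )


query = category_query("(<http://dbpedia.org/resource/Berlin>)", levels_up=1)
```

With `levels_up=0` the path is plain `dct:subject`; each extra level adds one `skos:broader` step, so the cost of the query grows in a controlled way.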

Approach 1

The first approach is to order the resources by their general popularity, e.g. their PageRank:

values = '(<{0}>)'.format('>) (<'.join(all_urls))

sparql.setQuery(
    """PREFIX vrank:<http://purl.org/voc/vrank#>
       SELECT DISTINCT ?resource ?rank
       FROM <http://dbpedia.org> 
       FROM <http://people.aifb.kit.edu/ath/#DBpedia_PageRank>
       WHERE {
           VALUES (?s) {""" + values + 
    """    }
       ?s dct:subject/skos:broader ?resource .
       ?resource vrank:hasRank/vrank:rankValue ?rank.
       } ORDER BY DESC(?rank)
         LIMIT 10
    """)

The results are:

dbc:Member_states_of_the_United_Nations
dbc:Country_subdivisions_of_Europe
dbc:Republics
dbc:Demography
dbc:Population
dbc:Countries_in_Europe
dbc:Third-level_administrative_country_subdivisions
dbc:International_law
dbc:Former_countries_in_Europe
dbc:History_of_the_Soviet_Union_and_Soviet_Russia
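To work with the ranked categories in Python, the bindings returned by `sparql.query().convert()` can be turned into sorted pairs. The sketch below assumes the standard SPARQL JSON results layout and the `?resource` and `?rank` variable names from the query; the helper name `ranked_resources` is hypothetical, and the rank numbers in `sample` are made-up placeholders, not real PageRank values.

```python
def ranked_resources(results):
    """Return (category URI, rank) pairs, highest rank first."""
    rows = []
    for binding in results["results"]["bindings"]:
        uri = binding["resource"]["value"]
        # Literal values arrive as strings in SPARQL JSON results.
        rank = float(binding["rank"]["value"])
        rows.append((uri, rank))
    rows.sort(key=lambda row: row[1], reverse=True)
    return rows


# Placeholder data in the shape of a SPARQL JSON result (values are invented).
sample = {
    "results": {
        "bindings": [
            {
                "resource": {"type": "uri", "value": "http://dbpedia.org/resource/Category:Republics"},
                "rank": {"type": "literal", "value": "2.5"},
            },
            {
                "resource": {"type": "uri", "value": "http://dbpedia.org/resource/Category:Population"},
                "rank": {"type": "literal", "value": "7.0"},
            },
        ]
    }
}

top = ranked_resources(sample)
```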

Approach 2

The second approach is to compute the frequency of categories for the given text:

values = '(<{0}>)'.format('>) (<'.join(all_urls))

sparql.setQuery(
    """SELECT ?resource (count(?resource) AS ?count) WHERE {
           VALUES (?s) {""" + values + 
    """    }
       ?s dct:subject ?resource
       } GROUP BY ?resource
         # https://github.com/openlink/virtuoso-opensource/issues/254
         HAVING (count(?resource) > 1)
         ORDER BY DESC(count(?resource))
         LIMIT 10
    """)

The results are:

dbc:Wars_by_country
dbc:Wars_involving_the_states_and_peoples_of_Europe
dbc:Wars_involving_the_states_and_peoples_of_Asia
dbc:Wars_involving_the_states_and_peoples_of_North_America
dbc:20th_century_in_Germany
dbc:Modern_history_of_Germany
dbc:Wars_involving_the_Balkans
dbc:Decades_in_Germany
dbc:Modern_Europe
dbc:Wars_involving_the_states_and_peoples_of_South_America

With dct:subject instead of dct:subject/skos:broader, the results are better:

dbc:Former_polities_of_the_Cold_War
dbc:Former_republics
dbc:States_and_territories_established_in_1949
dbc:20th_century_in_Germany_by_period
dbc:1930s_in_Germany
dbc:Modern_history_of_Germany
dbc:1990_disestablishments_in_West_Germany
dbc:1933_disestablishments_in_Germany
dbc:1949_establishments_in_West_Germany
dbc:1949_establishments_in_Germany
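The same frequency count can also be done on the client side once the categories of each resource are known. This is a sketch under assumptions: `frequent_categories` is a hypothetical helper, the input is a plain dict from resource to its category list, and the entries in `example` are illustrative placeholders rather than actual DBpedia data. Unlike the SPARQL query, it counts each resource once per category, regardless of how often the resource is mentioned in the text.

```python
from collections import Counter


def frequent_categories(categories_by_resource, min_count=2, limit=10):
    """Count in how many resources each category occurs; keep the frequent ones."""
    counts = Counter()
    for categories in categories_by_resource.values():
        # set() so a category listed twice for one resource counts once.
        counts.update(set(categories))
    # min_count=2 mirrors HAVING (count(?resource) > 1) in the query.
    frequent = [(cat, n) for cat, n in counts.items() if n >= min_count]
    # Highest count first; ties broken alphabetically for stable output.
    frequent.sort(key=lambda item: (-item[1], item[0]))
    return frequent[:limit]


# Illustrative placeholder input, not real query output.
example = {
    "resource_1": ["category_a", "category_b"],
    "resource_2": ["category_a"],
    "resource_3": ["category_a", "category_b", "category_c"],
}
top_categories = frequent_categories(example)
```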

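Conclusion

The results are not very good. I see two reasons: DBpedia categories are quite random, and the tools are quite primitive. Perhaps better results could be achieved by combining Approach 1 and Approach 2. In any case, experiments with a large corpus are needed.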
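One possible way to combine a PageRank ordering with a frequency count is a weighted sum of the two scores after scaling each to the range 0..1. This is an untested idea rather than a method from this answer: `combine_scores`, the `weight` parameter, and the max-scaling are all assumptions, and both inputs would need to refer to the same level of categories.

```python
def combine_scores(rank_by_category, count_by_category, weight=0.5):
    """Blend two score dicts; `weight` is the share given to the rank score."""

    def normalise(scores):
        # Scale by the largest value so both signals lie in 0..1.
        top = max(scores.values(), default=0.0)
        return {k: (v / top if top > 0 else 0.0) for k, v in scores.items()}

    ranks = normalise(rank_by_category)
    counts = normalise(count_by_category)
    combined = {}
    for category in set(ranks) | set(counts):
        # A category missing from one signal contributes 0 for that signal.
        combined[category] = (
            weight * ranks.get(category, 0.0)
            + (1.0 - weight) * counts.get(category, 0.0)
        )
    # Best score first; ties broken alphabetically.
    return sorted(combined.items(), key=lambda item: (-item[1], item[0]))


# Placeholder scores for illustration only.
scored = combine_scores({"a": 10.0, "b": 5.0}, {"b": 4, "c": 2})
```

The weight and the scaling would have to be tuned against a larger corpus before drawing any conclusions.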

answered 2018-04-16T20:57:36.737