
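I am trying to query a knowledge graph locally. The knowledge graph is CORD-19 Annotated By SemRep, downloaded as a tar.gz file. I want to check whether a given entity or relation exists in the knowledge graph.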

Expected result

Something like:

{ "head": { "link": [], "vars": ["entity_text", "preferred_name", "semantic_type", "semantic_type_name", "count"] },
  "results": { "distinct": false, "ordered": true, "bindings": [
    { "entity_text": { "type": "literal", "value": "COVID-19" } , "preferred_name": { "type": "literal", "value": "COVID-19" }  , "semantic_type": { "type": "literal", "value": "dsyn" }   , "semantic_type_name": { "type": "literal", "value": "Disease or Syndrome" }   , "count": { "type": "typed-literal", "datatype": "http://www.w3.org/2001/XMLSchema#integer", "value": "602476" }} ] } }

SPARQL

###########
# CSQ4: Top Entities of CORD-19 Dataset Annotated by SemRep
# Author: Chuming Chen (chenc@udel.edu)
###########
PREFIX cord19: <https://semrep.nlm.nih.gov/covid19/cord19#>
PREFIX semrep: <https://semrep.nlm.nih.gov/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT  ?entity_text ?preferred_name ?semantic_type ?semantic_type_name  (COUNT(concat(?entity_text,?preferred_name, ?semantic_type, ?semantic_type_name )) AS ?count)
FROM <https://semrep.nlm.nih.gov/covid19/cord19>
WHERE
{
    [
        cord19:_id ?doc_id ;
        cord19:annotations
        [
             cord19:entity
            [
                a semrep:Entity ;
                cord19:semantic_type ?Semantic_type ;
                cord19:entity_text  ?entity_text ;
                cord19:preferred_name  ?preferred_name ;
            ]
        ]
    ] .
    ?Semantic_type a semrep:SemanticType .
    ?Semantic_type rdfs:label ?semantic_type.
    ?Semantic_type rdfs:comment ?semantic_type_name.  
}
GROUP BY ?entity_text ?semantic_type  ?semantic_type_name ?preferred_name
ORDER BY DESC(?count)
LIMIT 50

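What I want to achieve, using the API

Currently I use the provided API with SPARQL to search for entities in the KG. I do get results, but it is too slow. That is why I want to try it locally.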

import requests

headers = {
    'Accept': 'application/sparql-results+json',
}

query = ''  # use the same query
data = {'query': query}

response = requests.post('https://sparql.proconsortium.org/virtuoso4/sparql', headers=headers, data=data)

What I have tried

I have extracted the compressed file. The folder contains many ttl files. The file was downloaded from here, row 7.

import tarfile
file = tarfile.open('cord19.rdf.tar.gz', 'r:gz')
file.extractall('foldername')
file.close()
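
Extracting writes every member to disk. As an alternative sketch, `tarfile` can iterate over the archive and hand the bytes of each `.ttl` member to a parser without extracting anything. The helper names `iter_ttl_members` and `build_demo_archive` and the `limit` parameter are hypothetical, and the demo builds a tiny throwaway archive instead of using the real download.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def iter_ttl_members(archive_path, limit=None):
    """Yield (member_name, raw_bytes) for .ttl files inside a tar.gz archive."""
    yielded = 0
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a Turtle file.
            if not (member.isfile() and member.name.endswith(".ttl")):
                continue
            handle = archive.extractfile(member)
            if handle is None:
                continue
            yield member.name, handle.read()
            yielded += 1
            if limit is not None and yielded >= limit:
                return


def build_demo_archive(directory):
    """Create a small stand-in archive (not the real CORD-19 download)."""
    path = Path(directory) / "demo.rdf.tar.gz"
    with tarfile.open(path, "w:gz") as archive:
        for name in ("a.ttl", "b.ttl", "notes.txt", "c.ttl"):
            payload = f"# {name}\n".encode("utf-8")
            info = tarfile.TarInfo(name=f"folder/{name}")
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return path


with tempfile.TemporaryDirectory() as tmp:
    demo_path = build_demo_archive(tmp)
    first_two = list(iter_ttl_members(demo_path, limit=2))
    all_ttl = [name for name, _ in iter_ttl_members(demo_path)]

print(first_two)
print(all_ttl)
```

Each yielded payload could then be passed to rdflib with `g.parse(data=raw_bytes, format="turtle")`, which avoids creating hundreds of thousands of small files on disk. Whether this is faster than extracting first is untested here.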

I tried to parse the ttl files with rdflib:

from rdflib import Graph
from glob import glob
filenames = glob("folder/*.ttl") # len(filenames) = 593241
g = Graph()
for filename in filenames[:2]:
  g.parse(filename, format = 'ttl')

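I only tried parsing two files at first. The query I used is almost the same as the previous one; the only changes are that the FROM line is commented out and a FILTER on ?entity_text is added.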

###########
# CSQ4: Top Entities of CORD-19 Dataset Annotated by SemRep
# Author: Chuming Chen (chenc@udel.edu)
###########
PREFIX cord19: <https://semrep.nlm.nih.gov/covid19/cord19#>
PREFIX semrep: <https://semrep.nlm.nih.gov/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>

SELECT DISTINCT  ?entity_text ?preferred_name ?semantic_type ?semantic_type_name  (COUNT(concat(?entity_text,?preferred_name, ?semantic_type, ?semantic_type_name )) AS ?count)
#FROM <https://semrep.nlm.nih.gov/covid19/cord19>
WHERE
{
    [
        cord19:_id ?doc_id ;
        cord19:annotations
        [
             cord19:entity
            [
                a semrep:Entity ;
                cord19:semantic_type ?Semantic_type ;
                cord19:entity_text  ?entity_text ;
                cord19:preferred_name  ?preferred_name ;
            ]
        ]
    ] .
    ?Semantic_type a semrep:SemanticType .
    ?Semantic_type rdfs:label ?semantic_type.
    ?Semantic_type rdfs:comment ?semantic_type_name.
  FILTER regex(?entity_text, "covid-19", "i").
}
GROUP BY ?entity_text ?semantic_type  ?semantic_type_name ?preferred_name
ORDER BY DESC(?count)
LIMIT 50

results = g.query(query)

There are no results in results.bindings.

More attempts

I tried to parse the other links given in the PREFIX lines:

from rdflib import Graph
from glob import glob
filenames = glob("folder/*.ttl")
g = Graph()
for filename in filenames[:2]:
  g.parse(filename, format = 'ttl')
g.parse("http://www.w3.org/2000/01/rdf-schema#")
g.query(query) # empty result

from rdflib import Graph
from glob import glob
filenames = glob("folder/*.ttl")
g = Graph()
for filename in filenames[:2]:
  g.parse(filename, format = 'ttl')
g.parse("http://www.w3.org/2000/01/rdf-schema#")
g.parse("https://semrep.nlm.nih.gov/") # raises: No plugin registered for (text/html, <class 'rdflib.parser.Parser'>)

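To repeat my question

So far I can parse multiple files with the Graph / ConjunctiveGraph functions, but I fail to get any results. How can I build a local graph that returns non-empty results for the query?

Also, if parsing all the ttl files is the right approach, the number of files is huge (593241) and parsing takes a very long time. How can this be solved? (This is a follow-up question that only applies if parsing all ttl files is necessary.)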
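
Parsing 593241 files into one in-memory `Graph` in a single loop leaves no checkpoint if something fails partway. Below is a small sketch of splitting the file list into fixed-size batches, so that each batch can be parsed and written out (or loaded into a persistent store) separately. The helper `batched`, the stand-in file names and the batch size are illustrative choices, not something taken from the dataset's documentation.

```python
from itertools import islice


def batched(items, batch_size):
    """Yield successive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(items)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


# Stand-in file names; in practice this would be sorted(glob("folder/*.ttl")).
filenames = [f"folder/doc_{i}.ttl" for i in range(7)]
batches = list(batched(filenames, 3))
for index, batch in enumerate(batches):
    # Each batch could get its own Graph, be parsed, then serialized or discarded.
    print(index, len(batch))
```

Batches are independent of each other, so they could also be handed to separate worker processes; whether that helps depends on how much of the time is spent in parsing rather than disk access.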
