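I am trying to query a knowledge graph locally. The knowledge graph is CORD-19 Annotated By SemRep, which is distributed as a tar.gz download. I want to check whether a given entity or relation exists in the knowledge graph.
Expected result
It looks like this: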
{ "head": { "link": [], "vars": ["entity_text", "preferred_name", "semantic_type", "semantic_type_name", "count"] },
"results": { "distinct": false, "ordered": true, "bindings": [
{ "entity_text": { "type": "literal", "value": "COVID-19" } , "preferred_name": { "type": "literal", "value": "COVID-19" } , "semantic_type": { "type": "literal", "value": "dsyn" } , "semantic_type_name": { "type": "literal", "value": "Disease or Syndrome" } , "count": { "type": "typed-literal", "datatype": "http://www.w3.org/2001/XMLSchema#integer", "value": "602476" }} ] } }
SPARQL
###########
# CSQ4: Top Entities of CORD-19 Dataset Annotated by SemRep
# Author: Chuming Chen (chenc@udel.edu)
###########
PREFIX cord19: <https://semrep.nlm.nih.gov/covid19/cord19#>
PREFIX semrep: <https://semrep.nlm.nih.gov/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?entity_text ?preferred_name ?semantic_type ?semantic_type_name (COUNT(concat(?entity_text,?preferred_name, ?semantic_type, ?semantic_type_name )) AS ?count)
FROM <https://semrep.nlm.nih.gov/covid19/cord19>
WHERE
{
[
cord19:_id ?doc_id ;
cord19:annotations
[
cord19:entity
[
a semrep:Entity ;
cord19:semantic_type ?Semantic_type ;
cord19:entity_text ?entity_text ;
cord19:preferred_name ?preferred_name ;
]
]
] .
?Semantic_type a semrep:SemanticType .
?Semantic_type rdfs:label ?semantic_type.
?Semantic_type rdfs:comment ?semantic_type_name.
}
GROUP BY ?entity_text ?semantic_type ?semantic_type_name ?preferred_name
ORDER BY DESC(?count)
LIMIT 50
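What I want to achieve using the API
Right now I use the provided API and SPARQL to search for entities in the KG. I do get results, but it is too slow, which is why I want to try this locally.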
import requests
headers = {
'Accept': 'application/sparql-results+json',
}
query = '' # use the same query
data = {'query': query}
response = requests.post('https://sparql.proconsortium.org/virtuoso4/sparql', headers=headers, data=data)
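For reference, a minimal sketch of how the JSON returned by the endpoint can be flattened into plain rows. The helper name `bindings_to_rows` is made up for illustration, and the sample data simply mirrors the expected result shown above.

```python
def bindings_to_rows(sparql_json):
    """Flatten a SPARQL JSON result into a list of {variable: value} dicts."""
    variables = sparql_json["head"]["vars"]
    rows = []
    for binding in sparql_json["results"]["bindings"]:
        # A variable can be unbound in a given row, so fall back to None.
        rows.append({var: binding.get(var, {}).get("value") for var in variables})
    return rows


# Sample shaped like the expected result above (a single binding).
sample = {
    "head": {"link": [], "vars": ["entity_text", "preferred_name", "semantic_type",
                                  "semantic_type_name", "count"]},
    "results": {"distinct": False, "ordered": True, "bindings": [{
        "entity_text": {"type": "literal", "value": "COVID-19"},
        "preferred_name": {"type": "literal", "value": "COVID-19"},
        "semantic_type": {"type": "literal", "value": "dsyn"},
        "semantic_type_name": {"type": "literal", "value": "Disease or Syndrome"},
        "count": {"type": "typed-literal",
                  "datatype": "http://www.w3.org/2001/XMLSchema#integer",
                  "value": "602476"},
    }]},
}

rows = bindings_to_rows(sample)
print(rows[0]["entity_text"], rows[0]["count"])
```

With the request above, the same helper would be called as `bindings_to_rows(response.json())`. Note that the values stay strings; the count is not converted to an integer here.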
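What I've tried
I have extracted the compressed file. The resulting folder contains a large number of ttl files. The file was downloaded from here (row 7).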
import tarfile
file = tarfile.open('cord19.rdf.tar.gz', 'r:gz')
file.extractall('foldername')
file.close()
I tried to parse the ttl files with rdflib:
from rdflib import Graph
from glob import glob
filenames = glob("folder/*.ttl") # len(filenames) = 593241
g = Graph()
for filename in filenames[:2]:
    g.parse(filename, format = 'ttl')
To start with, I parsed only two files. The query I used is almost the same as the previous one, except that the FROM line is commented out.
###########
# CSQ4: Top Entities of CORD-19 Dataset Annotated by SemRep
# Author: Chuming Chen (chenc@udel.edu)
###########
PREFIX cord19: <https://semrep.nlm.nih.gov/covid19/cord19#>
PREFIX semrep: <https://semrep.nlm.nih.gov/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?entity_text ?preferred_name ?semantic_type ?semantic_type_name (COUNT(concat(?entity_text,?preferred_name, ?semantic_type, ?semantic_type_name )) AS ?count)
#FROM <https://semrep.nlm.nih.gov/covid19/cord19>
WHERE
{
[
cord19:_id ?doc_id ;
cord19:annotations
[
cord19:entity
[
a semrep:Entity ;
cord19:semantic_type ?Semantic_type ;
cord19:entity_text ?entity_text ;
cord19:preferred_name ?preferred_name ;
]
]
] .
?Semantic_type a semrep:SemanticType .
?Semantic_type rdfs:label ?semantic_type.
?Semantic_type rdfs:comment ?semantic_type_name.
FILTER regex(?entity_text, "covid-19", "i").
}
GROUP BY ?entity_text ?semantic_type ?semantic_type_name ?preferred_name
ORDER BY DESC(?count)
LIMIT 50
results = g.query(query)
There are no results in results.bindings.
More attempts
I also tried parsing the other links from the PREFIX declarations:
from rdflib import Graph
from glob import glob
filenames = glob("folder/*.ttl")
g = Graph()
for filename in filenames[:2]:
    g.parse(filename, format = 'ttl')
g.parse("http://www.w3.org/2000/01/rdf-schema#")
g.query(query) # empty result
from rdflib import Graph
from glob import glob
filenames = glob("folder/*.ttl")
g = Graph()
for filename in filenames[:2]:
    g.parse(filename, format = 'ttl')
g.parse("http://www.w3.org/2000/01/rdf-schema#")
g.parse("https://semrep.nlm.nih.gov/") # No plugin registered for (text/html, <class 'rdflib.parser.Parser'>
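My question, restated
So far I can parse multiple files into a Graph or ConjunctiveGraph, but I fail to get any results. How can I build a local graph that returns non-empty results for this query?
Also, if parsing all the ttl files is the right approach, the list of filenames is huge (593241 entries) and parsing takes a very long time. How can this be handled? (This is a follow-up question that only applies if all ttl files really need to be parsed.)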
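One thing I could try for the file count, sketched under assumptions: reading the ttl members straight from the archive instead of extracting 593241 files to disk first. The helper name `iter_ttl_members` is made up, and UTF-8 encoding of the files is assumed. This does not reduce the parsing cost itself; it only avoids the extraction step and the huge glob.

```python
import tarfile


def iter_ttl_members(archive_path):
    """Yield (member_name, text) for every .ttl file inside a tar.gz archive."""
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            # Skip directories and anything that is not a Turtle file.
            if member.isfile() and member.name.endswith(".ttl"):
                with archive.extractfile(member) as handle:
                    # Assumption: the files are UTF-8 encoded.
                    yield member.name, handle.read().decode("utf-8")
```

Each yielded text could then be passed to rdflib with `g.parse(data=text, format='ttl')`, for example in a loop over `iter_ttl_members('cord19.rdf.tar.gz')`.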