Using Apache Jena Fuseki, I am trying to load the latest-truthy.nt dataset from Wikidata, but the import fails with the error below. I was inspired by a report from Bitplan, who did get this to work.
Error log:
14:36:16 INFO loader :: Add: 198.500.000 latest-truthy.nt (Batch: 453.309 / Avg: 213.382)
14:36:17 ERROR riot :: [line: 198884173, col: 87] Bad IRI: <https://abertillerymuseum@btconnect.com> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
org.apache.jena.riot.RiotException: [line: 198884173, col: 87] Bad IRI: <https://abertillerymuseum@btconnect.com> Code: 58/PROHIBITED_COMPONENT_PRESENT in USER: A component that is prohibited by the scheme is present.
at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:146)
at org.apache.jena.riot.system.ParserProfileStd.internalMakeIRI(ParserProfileStd.java:112)
at org.apache.jena.riot.system.ParserProfileStd.resolveIRI(ParserProfileStd.java:85)
at org.apache.jena.riot.system.ParserProfileStd.createURI(ParserProfileStd.java:187)
at org.apache.jena.riot.system.ParserProfileStd.create(ParserProfileStd.java:259)
at org.apache.jena.riot.lang.LangNTriples.tokenAsNode(LangNTriples.java:70)
at org.apache.jena.riot.lang.LangNTuple.parseTriple(LangNTuple.java:109)
at org.apache.jena.riot.lang.LangNTriples.parseOne(LangNTriples.java:61)
at org.apache.jena.riot.lang.LangNTriples.runParser(LangNTriples.java:53)
at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:184)
at org.apache.jena.riot.RDFParser.read(RDFParser.java:357)
at org.apache.jena.riot.RDFParser.parseURI(RDFParser.java:323)
at org.apache.jena.riot.RDFParser.parse(RDFParser.java:298)
at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:550)
at org.apache.jena.tdb2.loader.base.LoaderOps.inputFile(LoaderOps.java:107)
at org.apache.jena.tdb2.loader.base.LoaderBase.loadOne(LoaderBase.java:125)
at org.apache.jena.tdb2.loader.base.LoaderBase.lambda$load$0(LoaderBase.java:102)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
at org.apache.jena.tdb2.loader.base.LoaderBase.load(LoaderBase.java:99)
at tdb2.tdbloader.lambda$execBulkLoad$4(tdbloader.java:196)
at org.apache.jena.atlas.lib.Timer.time(Timer.java:85)
at tdb2.tdbloader.execBulkLoad(tdbloader.java:194)
at tdb2.tdbloader.loadQuads(tdbloader.java:175)
at tdb2.tdbloader.exec(tdbloader.java:136)
at org.apache.jena.cmd.CmdMain.mainMethod(CmdMain.java:92)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at org.apache.jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at tdb2.tdbloader.main(tdbloader.java:64)
Import script:
@ECHO off
cd apache-jena-4.0.0
echo start import on %DATE% %TIME%
tdb2_tdbloader --loader=parallel --loc "C:\fuseki\data" "F:\latest-truthy.nt" > tdb2-out.log 2> tdb2-err.log
echo finish import on %DATE% %TIME%
pause
File structure:
- C:/fuseki/
-- apache-jena-4.0.0/
-- apache-jena-fuseki-4.0.0/
-- data/
-- startfusekidb.bat
-- wikidata2fuseki.bat
- F:/
-- latest-truthy.nt
Is this a Fuseki problem? I can't open the .nt file myself to fix the offending line. Is there a flag I can pass to tdbloader so that it skips validation for this import?
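Even though the file cannot be opened in an editor, the single line named in the error log can be read by streaming the file. The following is a minimal sketch, assuming Python 3 is available; the function name `read_line` is made up for illustration, and it assumes that counting newline-terminated lines matches the line number the parser reports.

```python
import itertools
import sys


def read_line(path, line_number):
    """Return line `line_number` (1-based) of `path`, or None if the file is shorter."""
    if line_number < 1:
        raise ValueError("line_number must be 1 or greater")
    # Read bytes so that an encoding problem elsewhere cannot abort the scan.
    with open(path, "rb") as handle:
        # islice skips the preceding lines without keeping them in memory.
        for raw in itertools.islice(handle, line_number - 1, line_number):
            return raw.decode("utf-8", errors="replace").rstrip("\r\n")
    return None


# Only runs when both arguments are given, e.g.:
#   python read_line.py F:\latest-truthy.nt 198884173
if __name__ == "__main__" and len(sys.argv) == 3:
    print(read_line(sys.argv[1], int(sys.argv[2])))
```

This still has to read through every earlier line, so on a file of this size it takes a while, but it needs very little memory.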
I have also asked about this in Wikidata's IRC channel to see whether they can help.
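Update: someone on IRC answered and told me that the Wikidata dataset contains many errors, so I now know I need a way to skip the lines with errors and continue loading. However, the help for the Fuseki TDB2 command does not show anything for this.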
I also tried --help, which prints the following and suggests that no skip option exists:
c:\fuseki\apache-jena-4.0.0\bin>tdb2_tdbloader -h
tdbloader--loader= [--desc DATASET | --loc DIR] FILE ...
Location
--loc=DIR Location (a directory)
--tdb= Assembler description file
--graph=IRI Act on a named graph
--loader= Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or 'light'
--syntax=LANG Syntax of data from stdin
Symbol definition
--set Set a configuration symbol to a value
--mem=FILE Execute on an in-memory TDB database (for testing)
--desc= Assembler description file
General
-v --verbose Verbose
-q --quiet Run with minimal output
--debug Output information for debugging
--help
--version Version information
--strict Operate in strict SPARQL mode (no extensions of any kind)
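Since the help output lists no skip option, one possible workaround is to filter the file before loading it. The following is a rough sketch under stated assumptions, not a replacement for the parser's own checks: it only catches the one error class seen in the log (an http or https IRI with a user part such as `user@host`), it assumes one triple per line as in N-Triples, and the names `has_userinfo`, `iri_terms` and `filter_file` are invented for illustration. Other kinds of bad IRIs would still stop the loader.

```python
import re
import sys


def has_userinfo(iri):
    """True if an http(s) IRI has a user part such as 'user@host' in its authority."""
    scheme, sep, rest = iri.partition("://")
    if not sep or scheme.lower() not in ("http", "https"):
        return False
    # The authority ends at the first '/', '?' or '#'.
    authority = re.split(r"[/?#]", rest, maxsplit=1)[0]
    return "@" in authority


def iri_terms(line):
    """Yield the IRIs in subject, predicate and object position of one line."""
    parts = line.split(None, 2)
    if len(parts) < 3:
        return
    for term in parts[:2]:
        if term.startswith("<") and term.endswith(">"):
            yield term[1:-1]
    obj = parts[2]
    # Literals start with a quote and are skipped, so '<' inside text is ignored.
    if obj.startswith("<"):
        end = obj.find(">")
        if end != -1:
            yield obj[1:end]


def filter_file(source, clean, rejects):
    """Copy lines to `clean`, diverting lines with a flagged IRI to `rejects`."""
    kept = dropped = 0
    with open(source, "rb") as src, open(clean, "wb") as good, open(rejects, "wb") as bad:
        for raw in src:
            line = raw.decode("utf-8", errors="replace")
            if any(has_userinfo(iri) for iri in iri_terms(line)):
                bad.write(raw)  # keep the original bytes for later inspection
                dropped += 1
            else:
                good.write(raw)
                kept += 1
    return kept, dropped


# Only runs when all three paths are given on the command line.
if __name__ == "__main__" and len(sys.argv) == 4:
    print(filter_file(sys.argv[1], sys.argv[2], sys.argv[3]))
```

The filtered output (for example a hypothetical `F:\latest-truthy.clean.nt`) could then be passed to the same `tdb2_tdbloader --loader=parallel --loc "C:\fuseki\data"` command in place of the original file, while the rejects file records what was dropped.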