java - Extracting URIs from an RDF web page in Java using the Jena library

Question

I have written the following code to extract URIs from linked-data web pages whose content type is application/rdf-xml.

public static void test(String url) {
    try {
        Model read = ModelFactory.createDefaultModel().read(url);
        System.out.println("to go");
        StmtIterator si;
        si = read.listStatements();
        System.out.println("to go");
        while(si.hasNext()) {
            Statement s=si.nextStatement();
            Resource r=s.getSubject();
            Property p=s.getPredicate();
            RDFNode o=s.getObject();
            System.out.println(r.getURI());
            System.out.println(p.getURI());
            System.out.println(o.asResource().getURI());
        }
    }
    catch(JenaException | NoSuchElementException c) {}
}

But for the input

<?xml version="1.0"?>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:ex="http://example.org/stuff/1.0/">
    <rdf:Description rdf:about="http://www.w3.org/TR/rdf-syntax-grammar"
        dc:title="RDF/XML Syntax Specification (Revised)">
        <ex:editor>
            <rdf:Description ex:fullName="Dave Beckett">
                <ex:homePage rdf:resource="http://purl.org/net/dajobe/" />
            </rdf:Description>
        </ex:editor>
    </rdf:Description>
</rdf:RDF>

the output is:

Subject URI is http://www.w3.org/TR/rdf-syntax-grammar
Predicate  URI is http://example.org/stuff/1.0/editor
Object URI is null
Subject URI is http://www.w3.org/TR/rdf-syntax-grammar
Predicate  URI is http://purl.org/dc/elements/1.1/title
Website is read

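I need the output to include every URI present on the page, so that I can build a web crawler for RDF pages. I need all of the following links in the output: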

       http://www.w3.org/TR/rdf-syntax-grammar
       http://example.org/stuff/1.0/editor
       http://purl.org/net/dajobe
       http://example.org/stuff/1.0/fullName
       http://www.w3.org/TR/rdf-syntax-grammar
       http://purl.org/dc/elements/1.1/title

score 2 · Accepted Answer

Minor nitpick: you mean application/rdf+xml (note the plus sign).

Anyway, your problem is simple:

catch(JenaException | NoSuchElementException c) {}

Bad! You are missing the error that is thrown here, which is why the output is cut short:

System.out.println(o.asResource().getURI());

o is not always a resource, so this breaks on the triple

<http://www.w3.org/TR/rdf-syntax-grammar> dc:title "RDF/XML Syntax ..."

So you need to guard against that:

if (o.isResource()) System.out.println(o.asResource().getURI());

Or, even more specifically:

if (o.isURIResource()) System.out.println(o.asResource().getURI());

This will also skip the null you see in the output for ex:editor (its object is a blank node, which has no URI).

Now write out a thousand times: "I will not swallow exceptions" :-)
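Putting the two fixes together, the method could look roughly like the sketch below. It is only one way to do it, under a few assumptions: the `org.apache.jena` package names are those of Jena 3 and later (older releases use `com.hp.hpl.jena`), and the class name `UriExtractor` and the helpers `addIfUri` and `collectUris` are made up for this illustration. It collects the URIs into a set rather than printing them one by one, and it reports a `JenaException` instead of discarding it.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.rdf.model.StmtIterator;
import org.apache.jena.shared.JenaException;

import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical class name, used only for this sketch.
class UriExtractor {

    /** Adds the node's URI to the set; literals and blank nodes are skipped. */
    private static void addIfUri(RDFNode node, Set<String> uris) {
        if (node.isURIResource()) {
            uris.add(node.asResource().getURI());
        }
    }

    /** Collects every URI used as subject, predicate or object in the model. */
    static Set<String> collectUris(Model model) {
        Set<String> uris = new LinkedHashSet<>();
        StmtIterator it = model.listStatements();
        try {
            while (it.hasNext()) {
                Statement s = it.nextStatement();
                addIfUri(s.getSubject(), uris);   // may be a blank node
                addIfUri(s.getPredicate(), uris); // always a URI
                addIfUri(s.getObject(), uris);    // may be a literal or a blank node
            }
        } finally {
            it.close();
        }
        return uris;
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: UriExtractor <url>");
            return;
        }
        try {
            Model model = ModelFactory.createDefaultModel().read(args[0]);
            collectUris(model).forEach(System.out::println);
        } catch (JenaException e) {
            // Report the failure instead of swallowing it.
            e.printStackTrace();
        }
    }
}
```

For the RDF/XML in the question this should yield six distinct URIs, including http://purl.org/net/dajobe/ and the ex:fullName and ex:homePage predicates. The blank node for the editor and the two literals are skipped.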

score 1 · Accepted Answer

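No, you have misunderstood what RDF is for. A crawler is a program designed to retrieve online content and index it. A simple crawler can be fed an HTML document, and it will download (possibly recursively) every document mentioned in the href attribute of its <a> elements.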

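RDF is full of URLs, so you might think it is perfect for feeding a crawler, but unfortunately the URLs in an RDF document are not meant for retrieving other documents. For example: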

http://example.org/stuff/1.0/editor 404 Not Found
http://purl.org/net/dajobe 302 Moved Temporarily
http://example.org/stuff/1.0/fullName 404 Not Found
http://www.w3.org/TR/rdf-syntax-grammar 301 Moved Permanently
http://purl.org/dc/elements/1.1/title 302 Moved Temporarily

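Could it be a coincidence? I don't think so. In fact, RDF is meant to describe the real world. It happens that it can be serialized as XML, but XML is not the only serialization available.

So what are the URLs in the document for? They are used to name things. How many Johns do you know? Probably a dozen or so, yet there are thousands of Johns out there... However, if I own the domain example.com, I can use the URL http://example.com/friends/John to refer to my friend John. RDF can be used to state that your friend John works at 123, Abc avenue, using two URLs and a string: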

"http://me.com/John"   "http://me.com/works_at"   "123, Abc avenue"

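This is called a triple, and the URLs in it do not imply that anything can be retrieved through a TCP socket by a client that speaks HTTP. Note that both your friend (John) and the predicate (works at) are referenced in the triple by URLs. If you try those URLs in a browser, however, you will get nothing.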
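To make the "URLs as names" point concrete, here is a small sketch in plain Java, independent of Jena. The `TripleSketch` and `Triple` types are hypothetical and exist only for this illustration. The point is that building the triple only parses the strings: `java.net.URI.create` never opens a connection, so nothing has to exist at those addresses.

```java
import java.net.URI;

// Hypothetical types, for illustration only; this is not how Jena models triples.
class TripleSketch {

    /** A bare-bones triple: two URI names and a plain string value. */
    static final class Triple {
        final URI subject;
        final URI predicate;
        final String object;

        Triple(URI subject, URI predicate, String object) {
            this.subject = subject;
            this.predicate = predicate;
            this.object = object;
        }
    }

    /** Builds the example triple from the answer above. */
    static Triple johnWorksAt() {
        // URI.create only parses the text; no network access takes place.
        return new Triple(
                URI.create("http://me.com/John"),
                URI.create("http://me.com/works_at"),
                "123, Abc avenue");
    }

    public static void main(String[] args) {
        Triple t = johnWorksAt();
        System.out.println(t.subject + " " + t.predicate + " \"" + t.object + "\"");
    }
}
```

Two triples that use the same subject URI are talking about the same John, whether or not a server ever answers at that address.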

I don't know why you want to build your crawler or what it is supposed to do, but RDF is certainly not what you need to get the job done.
