
I'm using Java 11 (AdoptOpenJDK 11.0.5 2019-10-15) on Windows 10. I have some old XHTML 1.1 files I want to process. They take the following general form:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
  <title>XHTML 1.1 Skeleton</title>
</head>
<body>
</body>
</html>

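To avoid having the parser wait on connections to the Internet, I've installed a custom EntityResolver that loads known entities (identified by their public IDs, e.g. -//W3C//ELEMENTS XHTML Inline Style 1.0//EN) from copies stored in the program's resources. This DefaultEntityResolver class also prints debug messages indicating which entities the parser is loading.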
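The source of DefaultEntityResolver is not shown here, so the following is only a minimal sketch of the general idea: an EntityResolver that looks entities up by public ID and logs each lookup. The class name PublicIdEntityResolver and the in-memory map (standing in for files in program resources) are assumptions made for illustration.

```java
import java.io.StringReader;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical stand-in for the DefaultEntityResolver described above.
final class PublicIdEntityResolver implements EntityResolver {

  // Assumption: entity text is held in memory, keyed by public ID.
  // A real resolver would more likely open a classpath resource.
  private final Map<String, String> entityTextByPublicId;

  PublicIdEntityResolver(Map<String, String> entityTextByPublicId) {
    this.entityTextByPublicId = entityTextByPublicId;
  }

  @Override
  public InputSource resolveEntity(String publicId, String systemId) {
    String text = publicId != null ? entityTextByPublicId.get(publicId) : null;
    if (text == null) {
      // Returning null asks the parser to fall back to its default
      // behavior, which usually means opening the system ID URL.
      return null;
    }
    System.err.println("Resolving " + publicId + " (" + systemId + ")");
    InputSource source = new InputSource(new StringReader(text));
    source.setPublicId(publicId);
    source.setSystemId(systemId); // keeps relative references resolvable
    return source;
  }
}
```

Returning null for unknown public IDs is a design choice in this sketch; a stricter resolver could return an empty InputSource instead so that the parser never touches the network.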

Here is the basic form of my parsing:

DocumentBuilderFactory documentBuilderFactory = DocumentBuilderFactory.newInstance();
documentBuilderFactory.setNamespaceAware(true);
DocumentBuilder documentBuilder = documentBuilderFactory.newDocumentBuilder();
documentBuilder.setEntityResolver(DefaultEntityResolver.getInstance());
final Document document;
try (InputStream inputStream = new BufferedInputStream(getClass().getResourceAsStream("xhtml-1.1-test.xhtml"))) {
  document = documentBuilder.parse(inputStream);
}

Thanks to the debug messages in DefaultEntityResolver, I can see that the parser loads the following entities, in this order:

  • -//W3C//DTD XHTML 1.1//EN( http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd)
  • -//W3C//ELEMENTS XHTML Inline Style 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod)
  • -//W3C//ENTITIES XHTML Datatypes 1.0//EN( http://www.w3.org/TR/xhtml11/DTD/xhtml-datatypes-1.mod)
  • -//W3C//ENTITIES XHTML Modular Framework 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod)
  • -//W3C//ENTITIES XHTML Datatypes 1.0//EN( http://www.w3.org/TR/xhtml11/DTD/xhtml-datatypes-1.mod)
  • -//W3C//ENTITIES XHTML Qualified Names 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod)
  • -//W3C//ENTITIES XHTML Intrinsic Events 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod)
  • -//W3C//ENTITIES XHTML Common Attributes 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod)
  • -//W3C//ENTITIES XHTML 1.1 Document Model 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod)
  • -//W3C//ENTITIES XHTML Character Entities 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod)
  • -//W3C//ENTITIES Latin 1 for XHTML//EN( http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent)
  • -//W3C//ENTITIES Symbols for XHTML//EN( http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent)
  • -//W3C//ENTITIES Special for XHTML//EN( http://www.w3.org/MarkUp/DTD/xhtml-special.ent)
  • -//W3C//ELEMENTS XHTML Text 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod)
  • -//W3C//ELEMENTS XHTML Inline Structural 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod)
  • -//W3C//ELEMENTS XHTML Inline Phrasal 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod)
  • -//W3C//ELEMENTS XHTML Block Structural 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod)
  • -//W3C//ELEMENTS XHTML Block Phrasal 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod)
  • -//W3C//ELEMENTS XHTML Hypertext 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-hypertext-1.mod)
  • -//W3C//ELEMENTS XHTML Lists 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-list-1.mod)
  • -//W3C//ELEMENTS XHTML Editing Elements 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-edit-1.mod)
  • -//W3C//ELEMENTS XHTML BIDI Override Element 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-bdo-1.mod)
  • -//W3C//ELEMENTS XHTML Ruby 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-ruby-1.mod)
  • -//W3C//ELEMENTS XHTML Presentation 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-pres-1.mod)
  • -//W3C//ELEMENTS XHTML Inline Presentation 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-inlpres-1.mod)
  • -//W3C//ELEMENTS XHTML Block Presentation 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-blkpres-1.mod)
  • -//W3C//ELEMENTS XHTML Link Element 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-link-1.mod)
  • -//W3C//ELEMENTS XHTML Metainformation 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-meta-1.mod)
  • -//W3C//ELEMENTS XHTML Base Element 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-base-1.mod)
  • -//W3C//ELEMENTS XHTML Scripting 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-script-1.mod)
  • -//W3C//ELEMENTS XHTML Style Sheets 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-style-1.mod)
  • -//W3C//ELEMENTS XHTML Images 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-image-1.mod)
  • -//W3C//ELEMENTS XHTML Client-side Image Maps 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-csismap-1.mod)
  • -//W3C//ELEMENTS XHTML Server-side Image Maps 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-ssismap-1.mod)
  • -//W3C//ELEMENTS XHTML Param Element 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-param-1.mod)
  • -//W3C//ELEMENTS XHTML Embedded Object 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-object-1.mod)
  • -//W3C//ELEMENTS XHTML Tables 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-table-1.mod)
  • -//W3C//ELEMENTS XHTML Forms 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-form-1.mod)
  • -//W3C//ELEMENTS XHTML Document Structure 1.0//EN( http://www.w3.org/MarkUp/DTD/xhtml-struct-1.mod)

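Note that some of these entities no longer exist at the indicated URLs; nevertheless, my DefaultEntityResolver has these entities stored and keyed by their public IDs, so it still provides them to the parser.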

So far, so good. But when I call document.normalizeDocument() immediately afterwards, the program pauses and then prints:

[Error] xhtml11.dtd:129:43: The entity "LanguageCode.datatype" was referenced, but not declared.
[Error] xhtml11.dtd:130:44: The entity "LanguageCode.datatype" was referenced, but not declared.
[Error] xhtml11.dtd:194:47: The entity "Common.attrib" was referenced, but not declared.

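Note that it is not my program printing these errors; apparently it is something inside document.normalizeDocument(). There are also two other curiosities:

  • If I run my application from within Eclipse, this does not happen.
  • If I disable my network connection, this does not happen.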

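My best guess is that document.normalizeDocument() is not using the custom EntityResolver I installed on the document builder. Because some entities no longer exist at their expected URLs (e.g. http://www.w3.org/TR/xhtml11/DTD/xhtml-datatypes-1.mod), they cannot be loaded, so the referenced entities named in the errors are never defined. In addition, the web server takes a long time to respond that an entity is missing (as you can test manually), which makes the program appear to pause. That would also explain why the error messages don't appear when my network connection is disabled: I'm guessing that no external entities can be loaded at all, the load fails immediately, and this isn't considered an error. (None of this explains why there is no pause or error message in Eclipse, though.)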

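Indeed, the DOMConfiguration documentation hints that I need to set some sort of resource-resolver parameter, although I'm not sure why DOMConfiguration wouldn't default to the entity resolver I set on the original document builder used to parse the XML document.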
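As a rough sketch of what setting that parameter might look like: the DOM Level 3 "resource-resolver" parameter takes an LSResourceResolver, so an existing EntityResolver has to be adapted. The class and method names below (NormalizeWithResolver, normalize) are hypothetical, and this has not been verified to eliminate the pause or the errors described above.

```java
import java.io.IOException;
import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSInput;
import org.w3c.dom.ls.LSResourceResolver;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper; not part of any library.
final class NormalizeWithResolver {

  static void normalize(Document document, EntityResolver entityResolver) {
    // The LS feature supplies the factory for LSInput objects.
    Object feature = document.getImplementation().getFeature("LS", "3.0");
    if (!(feature instanceof DOMImplementationLS)) {
      throw new IllegalStateException("DOM implementation lacks Load/Save support");
    }
    DOMImplementationLS ls = (DOMImplementationLS) feature;

    // Adapt the SAX EntityResolver to the DOM LSResourceResolver interface.
    LSResourceResolver adapter = (type, namespaceURI, publicId, systemId, baseURI) -> {
      try {
        InputSource source = entityResolver.resolveEntity(publicId, systemId);
        if (source == null) {
          return null; // let the implementation use its default lookup
        }
        LSInput input = ls.createLSInput();
        input.setPublicId(publicId);
        input.setSystemId(systemId);
        input.setBaseURI(baseURI);
        input.setByteStream(source.getByteStream());
        input.setCharacterStream(source.getCharacterStream());
        input.setEncoding(source.getEncoding());
        return input;
      } catch (SAXException | IOException e) {
        throw new IllegalStateException(e);
      }
    };

    DOMConfiguration config = document.getDomConfig();
    if (config.canSetParameter("resource-resolver", adapter)) {
      config.setParameter("resource-resolver", adapter);
    }
    document.normalizeDocument();
  }
}
```

With the code shown earlier, this would be called as NormalizeWithResolver.normalize(document, DefaultEntityResolver.getInstance()) in place of the bare document.normalizeDocument() call.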

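To make things a little stranger, I put the XHTML 1.1 skeleton document above in my resources and created a unit test with exactly the same code as above, followed by document.normalizeDocument(), and the test passes with no pause and no errors, even from the command line.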

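However, if I put a for(int i = 0; i < 100; i++) loop in the unit test, so that it loads, parses, and normalizes the document 100 times (but using the same DocumentBuilderFactory), my unit test completely crashes the forked unit test JVM!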

org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.22.1:test (default-test) on project [...]: There are test failures.

Please refer to [...]\xml\target\surefire-reports for the individual test results.
Please refer to dump files (if any exist) [date].dump, [date]-jvmRun[N].dump and [date].dumpstream.
The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was cmd.exe /X /C [...]
Process Exit Code: 0
Crashed tests:
[...].XmlDomTest
org.apache.maven.surefire.booter.SurefireBooterForkException: The forked VM terminated without properly saying goodbye. VM crash or System.exit called?
Command was cmd.exe /X /C [...]
Process Exit Code: 0
Crashed tests:
com.globalmentor.xml.XmlDomTest
        at org.apache.maven.plugin.surefire.booterclient.ForkStarter.fork(ForkStarter.java:669)
        at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:282)
        at org.apache.maven.plugin.surefire.booterclient.ForkStarter.run(ForkStarter.java:245)
        at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeProvider(AbstractSurefireMojo.java:1183)
        at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:1011)
        at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:857)
        at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:137)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:210)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:156)
        at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:148)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:117)
        at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:81)
        at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:56)
        at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:305)
        at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:192)
        at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:105)
        at org.apache.maven.cli.MavenCli.execute(MavenCli.java:957)
        at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:289)
        at org.apache.maven.cli.MavenCli.main(MavenCli.java:193)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:282)
        at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:225)
        at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:406)
        at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:347)

    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:215)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:156)
    at org.apache.maven.lifecycle.internal.MojoExecutor.execute (MojoExecutor.java:148)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:117)
    at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject (LifecycleModuleBuilder.java:81)
    at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build (SingleThreadedBuilder.java:56)
    at org.apache.maven.lifecycle.internal.LifecycleStarter.execute (LifecycleStarter.java:128)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:305)
    at org.apache.maven.DefaultMaven.doExecute (DefaultMaven.java:192)
    at org.apache.maven.DefaultMaven.execute (DefaultMaven.java:105)
    at org.apache.maven.cli.MavenCli.execute (MavenCli.java:957)
    at org.apache.maven.cli.MavenCli.doMain (MavenCli.java:289)
    at org.apache.maven.cli.MavenCli.main (MavenCli.java:193)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0 (Native Method)
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke (NativeMethodAccessorImpl.java:62)
    at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke (DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke (Method.java:566)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced (Launcher.java:282)
    at org.codehaus.plexus.classworlds.launcher.Launcher.launch (Launcher.java:225)
    at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode (Launcher.java:406)
    at org.codehaus.plexus.classworlds.launcher.Launcher.main (Launcher.java:347)
Caused by: org.apache.maven.plugin.MojoExecutionException: There are test failures.
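For reference, the loop described before the trace above could be sketched roughly as follows. The class and method names are hypothetical, and the document bytes and resolver are passed in as parameters because the actual test resource and DefaultEntityResolver are not shown; with a document that has no external DTD, this sketch is not expected to reproduce the crash.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;

// Hypothetical reproduction harness; names are illustrative only.
final class RepeatedNormalizeSketch {

  // Parses and normalizes the same document repeatedly, reusing one factory.
  // Returns the number of iterations that completed.
  static int parseAndNormalize(byte[] xml, EntityResolver resolver, int iterations)
      throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    int completed = 0;
    for (int i = 0; i < iterations; i++) {
      // A fresh builder per iteration, but the same factory throughout.
      DocumentBuilder builder = factory.newDocumentBuilder();
      builder.setEntityResolver(resolver);
      try (InputStream in = new ByteArrayInputStream(xml)) {
        Document document = builder.parse(in);
        document.normalizeDocument();
      }
      completed++;
    }
    return completed;
  }
}
```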

So I guess I want to avoid document.normalizeDocument(), but I would welcome any clarification of this behavior.


1 Answer


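Not really an answer, but information you might find useful: Saxon has built-in copies of the relevant DTD files and uses its own EntityResolver, so I thought I would try it. It parsed the document as follows: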

Using parser org.apache.xerces.jaxp.SAXParserImpl$JAXPSAXParser
Building tree for file:/Users/mike/Desktop/temp/test.xhtml using class net.sf.saxon.tree.tiny.TinyBuilder
Fetching Saxon copy of w3c/xhtml11/xhtml11.dtd
Fetching Saxon copy of w3c/xhtml11/xhtml-inlstyle-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-framework-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-datatypes-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-qname-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-events-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-attribs-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml11-model-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-charent-1.mod
Fetching Saxon copy of w3c/xhtml-lat1.ent
Fetching Saxon copy of w3c/xhtml-symbol.ent
Fetching Saxon copy of w3c/xhtml-special.ent
Fetching Saxon copy of w3c/xhtml11/xhtml-text-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-inlstruct-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-inlphras-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-blkstruct-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-blkphras-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-hypertext-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-list-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-edit-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-bdo-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-ruby-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-pres-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-inlpres-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-blkpres-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-link-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-meta-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-base-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-script-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-style-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-image-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-csismap-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-ssismap-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-param-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-object-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-table-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-form-1.mod
Fetching Saxon copy of w3c/xhtml11/xhtml-struct-1.mod
Tree built in 88.44306ms

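I haven't tried using that EntityResolver to build a DOM, but in principle it must certainly be possible. I also haven't compared this list of entities with the list you reported.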

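More information: searching the DTD entities that Saxon has local copies of, I find the entity LanguageCode.datatype declared in a number of places, including xhtml-math11-f.dtd, xhtml-math11-f-a.dtd, svg-datatypes.mod, svg11-flat.dtd, xhtml-datatypes-1.mod (which is in your list), and several others.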

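The list of entities held in Saxon was built up over a period of years, with a great deal of trial and error. There is no definitive list from W3C. There are also many inconsistencies in the W3C collection: modules with no public ID, modules with multiple public IDs or system IDs, multiple modules sharing the same public ID, and so on. The Saxon list has been stable for a few years, so hopefully it is now usable, but there is no real way of knowing.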

Answered 2020-03-09T08:15:11.593