html - Strange errors from the R XML package when parsing XML and HTML files

Question

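I am using R's XML package to extract as much data as possible from a variety of HTML and XML files. These files are mostly documentation, build properties, or readme files.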

<?xml version='1.0' encoding='utf-8'?>
<!DOCTYPE chapter PUBLIC '-//OASIS//DTD DocBook XML V4.1.2//EN'
                      'http://www.oasis-open.org/docbook/xml/4.0 docbookx.dtd'>

<chapter lang="en">
<chapterinfo>
<author>
<firstname>Jirka</firstname>
<surname>Kosek</surname>
</author>
<copyright>
<year>2001</year>
<holder>Ji&rcaron;&iacute; Kosek</holder>
</copyright>
<releaseinfo>$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp $</releaseinfo>
</chapterinfo>
<title>Using XSL stylesheets to generate HTML Help</title>
<?dbhtml filename="htmlhelp.html"?>

<para>HTML Help (HH) is help-format used in newer versions of MS
Windows and applications written for this platform. This format allows
to pack several HTML files together with images, table of contents and
index into single file. Windows contains browser for this file-format
and full-text search is also supported on HH files. If you want know
more about HH and its capabilities look at <ulink
url="http://msdn.microsoft.com/library/tools/htmlhelp/chm/HH1Start.htm">HTML
Help pages</ulink>.</para>

<section>
<title>How to generate first HTML Help file from DocBook sources</title>

<para>Working with HH stylesheets is same as with other XSL DocBook
stylesheets. Simply run your favorite XSLT processor on your document
with stylesheet suited for HH:</para>

</section>

</chapter>

My goal is to parse the tree with htmlTreeParse or xmlTreeParse and then use something like this (for XML files):

Text = xmlValue(xmlRoot(xmlTreeParse(XMLFileName)))

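However, when I do this for both XML and HTML files, something goes wrong: if there are child nodes two or more levels deep, their text fields are pasted together with no space between them.

For example, in the sample above,

xmlValue(chapterInfo) is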

JirkaKosek2001JiKosek$Id: htmlhelp.xml,v 1.1 2002/05/15 17:22:31 isberg Exp

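The xmlValue of every child node (recursively) is pasted together with no space added in between. How can I make xmlValue insert whitespace when extracting this data?

Thanks in advance for your help,

Shivani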

score 3 · Accepted Answer

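According to the documentation, xmlValue only works on single text nodes, or on "XML nodes that contain a single text node". Whitespace in non-text nodes is evidently not preserved.

However, even in the single-text-node case, your code strips whitespace.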

library(XML)
doc <- xmlTreeParse("<a> </a>")
xmlValue(xmlRoot(doc))
# [1] ""

You can pass the ignoreBlanks = FALSE and useInternalNodes = TRUE arguments to xmlTreeParse to preserve the whitespace inside text nodes.

doc <- xmlTreeParse(
  "<a> </a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] " "

# Spaces inside text nodes are preserved
doc <- xmlTreeParse(
  "<a>foo <b>bar</b></a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foo bar"

# Spaces between text nodes (inside non-text nodes) are not preserved
doc <- xmlTreeParse(
  "<a><b>foo</b> <b>bar</b></a>", 
  ignoreBlanks = FALSE, 
  useInternalNodes = TRUE
)
xmlValue(xmlRoot(doc))
# [1] "foobar"
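
If the goal is simply to get the text pieces separated by spaces, one possible workaround is to skip xmlValue on the parent node, collect the individual text nodes with XPath, and join them yourself. The sketch below assumes the document is parsed into internal nodes (via xmlParse). The helper name text_with_spaces is hypothetical and not part of the XML package.

```r
library(XML)

# Hypothetical helper (not part of the XML package): collect every text node
# below `node`, trim each piece, drop the empty ones, and join with `sep`.
text_with_spaces <- function(node, sep = " ") {
  pieces <- xpathSApply(node, ".//text()", xmlValue)
  pieces <- trimws(unlist(pieces))
  paste(pieces[nzchar(pieces)], collapse = sep)
}

doc <- xmlParse("<a><b>foo</b> <b>bar</b></a>", asText = TRUE)
result <- text_with_spaces(xmlRoot(doc))
result
# [1] "foo bar"

# A simplified fragment of the question's sample (entities left out)
author <- xmlParse(
  "<author><firstname>Jirka</firstname><surname>Kosek</surname></author>",
  asText = TRUE
)
text_with_spaces(xmlRoot(author))
# [1] "Jirka Kosek"
```

Because empty pieces are dropped before joining, the result does not depend on whether the parser kept the whitespace-only node between the two b elements. Note that this also collapses any leading or trailing whitespace inside each text node, which may or may not be what you want.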
