html - HTML character entity replacement in R

Question

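I have a large set of HTML files that contain text from a magazine in span nodes. My PDF-to-HTML converter inserted the &nbsp; character entity throughout the HTML. The problem is that in R, I use the xmlValue function (from the XML package) to extract the text, but wherever there is a &nbsp; the space between the words is eliminated. For example: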

<span class="ft6">kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>

comes out of the xmlValue function as:

"kids,and kids in your community,in DIYprojects."

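I think the easiest way to fix this would be to find every &nbsp; before running xmlValue on the span nodes and replace it with " " (a space). How would I go about that?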

score 1 · Accepted Answer

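I have rewritten the answer to reflect the original poster's problem of not being able to get the text out of xmlValue. There may be different ways to approach this, but one way is to open, replace, and write the HTML files directly. Processing XML/HTML with regular expressions is generally a bad idea, but here the problem is narrow (unwanted non-breaking spaces), so it is probably not much of an issue. The following code is an example of how to build a list of matching files and run a gsub on their contents. It should be easy to modify or extend as needed.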

setwd("c:/test/")
# Create 'html' file to use with test
txt <- "<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>
<span class=ft6>kids,&nbsp;and kids in your community,&nbsp;in                                   DIY&nbsp;projects.&nbsp;</span>"
writeLines(txt, "file1.html")

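# Now list the files to process - in this case only one.
# 'pattern' is a regular expression, so escape the dot and anchor the end.
html.files <- list.files(pattern = "\\.html$")
# Skip output from earlier runs so it is not processed again
html.files <- html.files[!startsWith(html.files, "new_")]
html.files

# Loop through the list of files
retval <- lapply(html.files, function(x) {
  in.lines <- readLines(x, n = -1)
  # Replace the non-breaking space entity with a plain space
  # (fixed = TRUE matches "&nbsp;" literally rather than as a regex)
  out.lines <- gsub("&nbsp;", " ", in.lines, fixed = TRUE)
  # Write the corrected lines to a new file prefixed with "new_"
  writeLines(out.lines, paste("new_", x, sep = ""))
})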
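
If writing new files is not needed, the same replacement can be done in memory before parsing. The following is only a minimal sketch of that idea, not part of the original answer: the helper name replace_nbsp is hypothetical, and the parsing step assumes the XML package is installed (it is skipped otherwise).

```r
# Hypothetical helper (not from the original answer): replace the
# &nbsp; entity with a plain space in a character vector of HTML lines.
replace_nbsp <- function(lines) {
  # fixed = TRUE treats "&nbsp;" as a literal string, not a regex
  gsub("&nbsp;", " ", lines, fixed = TRUE)
}

html <- '<span class="ft6">kids,&nbsp;and kids in your community,&nbsp;in DIY&nbsp;projects.&nbsp;</span>'
cleaned <- replace_nbsp(html)
print(cleaned)

# Assumption: the XML package is available. The cleaned string is
# parsed directly from memory, so no temporary file is required.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(cleaned, asText = TRUE)
  spans <- XML::xpathSApply(doc, "//span[@class='ft6']", XML::xmlValue)
  print(spans)
}
```

For real files, the character vector returned by readLines can be passed to replace_nbsp and the result collapsed with paste(..., collapse = "\n") before parsing.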
