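I have rewritten the answer to reflect that the original poster could not solve the problem through xmlValue. There may be different ways to solve this, but one way is to open the HTML files directly, replace the offending text, and write the result back out. Processing XML/HTML with regular expressions is usually a bad idea, but here the problem is narrow (unwanted non-breaking spaces), so it is probably not too much of an issue. The following code is an example of how to build a list of matching files and run gsub on their contents. It should be easy to modify or extend as needed.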
setwd("c:/test/")
# Create 'html' file to use with test
txt <- "<span class=ft6>kids, and kids in your community, in DIY projects.&nbsp;</span>
<span class=ft6>kids, and kids in your community, in DIY projects.&nbsp;</span>
<span class=ft6>kids, and kids in your community, in DIY projects.&nbsp;</span>"
writeLines(txt, "file1.html")
# Now list the files - in this case only one.
# The pattern is a regular expression, so escape the dot and anchor the end.
html.files <- list.files(pattern = "\\.html$")
html.files
# Loop through the list of files
retval <- lapply(html.files, function(x) {
  in.lines <- readLines(x, n = -1)
  # Replace the non-breaking space entity with an ordinary space
  out.lines <- gsub("&nbsp;", " ", in.lines, fixed = TRUE)
  # Write out the corrected lines to a new file
  writeLines(out.lines, paste("new_", x, sep = ""))
})
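
As one possible extension, a non-breaking space can appear in an HTML file in several spellings: the named entity, a numeric entity, or the literal character U+00A0. The sketch below is only an illustration under that assumption. The helper name replace_nbsp and the sample lines are hypothetical and not part of the original answer, and the literal-character case assumes the lines are UTF-8 encoded.

```r
# Hypothetical helper (an assumption, not from the original answer):
# replace the common spellings of a non-breaking space with a plain space.
replace_nbsp <- function(lines) {
  # &nbsp;   named entity
  # &#160;   decimal numeric entity
  # &#xA0;   hexadecimal numeric entity (either case)
  # \u00a0   the literal non-breaking space character
  gsub("&nbsp;|&#160;|&#[xX][aA]0;|\u00a0", " ", lines)
}

# Made-up sample lines covering three of the spellings
sample.lines <- c("<span>a&nbsp;b</span>",
                  "<span>c&#160;d</span>",
                  "<span>e\u00a0f</span>")
fixed.lines <- replace_nbsp(sample.lines)
print(fixed.lines)
```

Inside the lapply loop, the call to gsub could be swapped for replace_nbsp(in.lines) if the files mix these forms. Longer hexadecimal spellings such as zero-padded ones are not covered by this pattern.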