regex - Extracting data from an HTML file (R and regular expressions)

Question

I want to extract data from an HTML file in R. I have a large file with this structure:

a <-  "</span>Cabildo \t456\t386\t70\t21\t4\t101\t36\t12\t88\t48\t84\t62\t-</p></td></tr><tr><td colspan=\"14\" bgcolor=\"#CCDDE7\"><p class=\"s3\" style=\"padding-top: 1pt;padding-left: 5pt;text-indent: 0pt;text-align: left;\"><span style=\" color: black; font-style: normal; font-weight: normal;\"></span>Sierra Gorda\t106 \t89 \t17 \t-\t-\t26 \t9 \t8 \t15 \t10 \t18 \t20 \t-</p>"

Here is a sample file: http://dl.getdropbox.com/u/18116710/file.htm

I want to extract every line that matches this pattern:

</span>Cabildo \t456\t386\t70\t21\t4\t101\t36\t12\t88\t48\t84\t62\t-</p>

so that I end up with a dataset such as:

Cabildo      456 386 70 21  4 101 36 12 88 48 62 -
Sierra Gorda 106  89 17  -  -  26  9  8 15 10 20 -
...

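"-" means missing (NA). I have been playing with the str_extract function without any result (I am new to regular expressions).

My idea is to grab the content between </span> and </p>, and then read the lines with read.csv (using a tab separator). Maybe that is not the best approach, though, because other things could appear between those tags.

Any suggestions?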

score 2 · Accepted Answer

This should give you an idea of what to do:

# break the string at each occurrence of </span> or </p>
b <- unlist(strsplit(a,"</span>|</p>"))
# removing the first element, which is just a blank
b <- b[-1]

# remove unneeded elements by looking for the </td> tag and filtering them out, this logic can be changed depending on how the complete dataset looks
c <- grep(x = b, pattern =  "</td>", invert = TRUE, value = TRUE)

# breaking each string b/w </span> and </p> into individual columns, split by '\t'
d <- (strsplit(c,"\t"))

# appending all rows together to get one dataset
e <- data.frame(do.call(rbind,d))

Output:

> e
            X1   X2  X3  X4 X5 X6  X7 X8 X9 X10 X11 X12 X13 X14
1     Cabildo   456 386  70 21  4 101 36 12  88  48  84  62   -
2 Sierra Gorda 106  89  17   -  - 26  9  8  15  10  18  20    -
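
In the output above, every column is still text, some fields carry trailing spaces, and "-" is a literal string rather than NA. The following is a minimal sketch of one way to clean that up, not part of the original answer. It rebuilds a shortened copy of the sample string (the HTML attributes are trimmed for brevity), and the names `pieces`, `rows`, `fields` and the column name `name` are made up for illustration.

```r
# Shortened copy of the sample string from the question (attributes trimmed)
a <- paste0(
  "</span>Cabildo \t456\t386\t70\t21\t4\t101\t36\t12\t88\t48\t84\t62\t-</p></td></tr>",
  "<tr><td colspan=\"14\"><p class=\"s3\"><span style=\"color: black;\"></span>",
  "Sierra Gorda\t106 \t89 \t17 \t-\t-\t26 \t9 \t8 \t15 \t10 \t18 \t20 \t-</p>"
)

# Split on the closing tags, drop the empty first piece,
# and discard the pieces that still contain table markup
pieces <- unlist(strsplit(a, "</span>|</p>"))[-1]
rows <- grep("</td>", pieces, invert = TRUE, value = TRUE)

# Split each row on tabs and strip stray whitespace from every field
fields <- lapply(strsplit(rows, "\t"), trimws)

# Bind the rows into a character data frame
e <- as.data.frame(do.call(rbind, fields), stringsAsFactors = FALSE)
names(e)[1] <- "name"  # hypothetical column name for the label column

# Turn "-" into NA, then convert every column except the label to numeric
e[e == "-"] <- NA
e[-1] <- lapply(e[-1], as.numeric)

str(e)
```

Replacing "-" with NA before calling as.numeric avoids the coercion warnings that as.numeric("-") would otherwise raise. As in the answer, the "</td>" filter is an assumption about how the full file looks and may need adjusting.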

score 1 · Accepted Answer

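If you have a lot of HTML files like this, you might want to take a look at this package: http://www.rexamine.com/resources/stringi/

It provides faster implementations of the regular expression functions than the stringr package. To install it, just run: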

source('http://static.rexamine.com/packages/stringi_install.R')

Example:

library(stringi)
stri_split_regex(a, "</span>|</p>")
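
To go from that split to a table, the remaining steps can also be done with stringi functions. This is only a sketch, not part of the original answer. It assumes stringi is already installed and reuses a shortened copy of the sample string; `pieces`, `rows`, `fields` and `m` are illustrative names.

```r
library(stringi)  # assumes the package is already installed

# Shortened copy of the sample string from the question (attributes trimmed)
a <- paste0(
  "</span>Cabildo \t456\t386\t70\t21\t4\t101\t36\t12\t88\t48\t84\t62\t-</p></td></tr>",
  "<tr><td colspan=\"14\"><p class=\"s3\"><span style=\"color: black;\"></span>",
  "Sierra Gorda\t106 \t89 \t17 \t-\t-\t26 \t9 \t8 \t15 \t10 \t18 \t20 \t-</p>"
)

# stri_split_regex returns a list with one element per input string
pieces <- stri_split_regex(a, "</span>|</p>")[[1]]

# Drop empty pieces and the pieces that still contain table markup
rows <- pieces[pieces != "" & !stri_detect_fixed(pieces, "</td>")]

# Split each row on tabs, trim whitespace, and bind into a character matrix
fields <- stri_split_fixed(rows, "\t")
m <- do.call(rbind, lapply(fields, stri_trim_both))

m
```

The result is a character matrix with one row per record. Converting "-" to NA and the value columns to numeric still has to be done afterwards.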
