regex - retrieve text inside tags in R

Question

I'm trying to retrieve information from a text file that contains tags, e.g.:

<name> Joe </name>

The text file consists of multiple lines, some with more of these tags (e.g. for height and weight) and some with just other text. I refer to the text file as "sheet" (see code below).

I would like to retrieve the text between the tags. I have come up with the following solution to do so:

m1 <- regexpr("<name> [a-zA-Z]+ </name>", sheet)
m2 <- regmatches(sheet,m1)
m3 <- gsub("<name> ", "", gsub(" </name>", "", m2))
m3

I have not worked with regular expressions before, and I wonder whether I am taking a detour with 'regmatches'. Is there a more direct way to retrieve the text inside tags?

Thanks,

Richard

score 4 · Accepted Answer

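You can do this with a single gsub call. To do so, create a group by putting ( and ) around the part of the pattern you want to keep. The group can then be accessed by its number, \\1 (a backreference), e.g.: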

sheet <- "<name>foobar</name>"
gsub(pattern="<name>([a-zA-Z]+)</name>", replacement="\\1", x=sheet)
# [1] "foobar"
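
The example above assumes a single string with no spaces inside the tags. For a file like the one in the question, where the value is padded with spaces and many lines have no <name> tag at all, gsub would return the non-matching lines unchanged. A minimal base R sketch that keeps only the captured values could look like this; the contents of sheet are made up for illustration, and it assumes sheet is a character vector with one element per line:

```r
# Hypothetical input: one element per line of the text file.
sheet <- c("<name> Joe </name>",
           "some other text",
           "<height> 180 </height>",
           "<name> Ann </name>")

# \\s* absorbs optional spaces around the value; the parentheses capture it.
pattern <- "<name>\\s*([a-zA-Z]+)\\s*</name>"

# regexec() records the full match and each capture group per element;
# lines without a match yield character(0).
matches <- regmatches(sheet, regexec(pattern, sheet))

# Drop the non-matching lines, then keep the first capture group
# (element 2; element 1 is the full match).
name_values <- vapply(Filter(length, matches), function(m) m[2], character(1))
name_values
# [1] "Joe" "Ann"
```

Swapping "name" for "height" or "weight" in the pattern (and widening the character class to allow digits) should work the same way for the other tags.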

But as @DieterMenne suggested, you should try the XML package for HTML (it supports XPath):

library("XML")
doc <- xmlParse("<html><name>foobar</name></html>")
xpathSApply(doc, "//name", xmlValue)
# [1] "foobar"
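
The file described in the question is not a complete XML document, so it would need a single root element before xmlParse accepts it. The following is only a sketch under that assumption: the <sheet> wrapper and the sample lines are invented here, the XML package must be installed, and it will fail if the free text contains characters such as & or < that are not valid in XML.

```r
library("XML")

# Hypothetical input: one element per line of the text file.
lines <- c("<name> Joe </name>",
           "some other text",
           "<height> 180 </height>",
           "<name> Ann </name>")

# Wrap all lines in one made-up root element so the text is well-formed XML.
xml_text <- paste0("<sheet>", paste(lines, collapse = "\n"), "</sheet>")
doc <- xmlParse(xml_text, asText = TRUE)

# xmlValue keeps the padding spaces inside the tags, so trim them afterwards.
values <- trimws(xpathSApply(doc, "//name", xmlValue))
values
# [1] "Joe" "Ann"
```

Changing the XPath expression to "//height" or "//weight" retrieves the other fields without writing a new regular expression.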
