regex - Remove everything except HTML tags from a corpus

Question

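I am using the tm package. I have a corpus full of HTML documents, and I want to remove everything except the HTML tags. I have been trying to do this for several days, but I can't seem to find a good solution.

For example, suppose I have a document like this: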

<html>
<body>

<h1>hello</h1>

</body>
</html>

I would like the document to become:

<html> <body> <h1>

(Or with the closing tags; I really don't mind.)

My goal is to count how many times each tag is used in a document.

score 2 · Accepted Answer

I'm not familiar with tm, but here is a way to do it with regular expressions.

(Assumption: your string starts and ends with an HTML tag.)

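str <- "<html><body><p>test<p>test2</body></html>"
# Replace the text between tags with a single space, leaving only the tags (opening and closing).
# [^<>] means "any character that is not an angle bracket"; a second ^ inside the class would just add a literal caret.
str <- gsub(">[^<>]+<", "> <", str)
# Remove all closing tags.
str <- gsub("</[^<>]+>", "", str)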

That will leave you with the string you want.

If you're new to regular expressions, check out this site for additional info on getting started. Basically, the first gsub replaces every run of non-bracket characters that sits between a > and the next < (i.e. all non-tag text) with a single space. The second gsub replaces everything that starts with </ and ends with > with nothing, which removes the closing tags from the string.
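To make the effect of each step concrete, here is a minimal sketch that applies the two substitutions to the sample string and then tabulates the tag names that remain. The variable names (html, tags_only, open_only, tag_counts) are only illustrative, and the final extraction with regmatches and gregexpr is an addition in base R, not part of the answer above.

```r
html <- "<html><body><p>test<p>test2</body></html>"

# Step 1: collapse the text between tags into a single space.
tags_only <- gsub(">[^<>]+<", "> <", html)
# tags_only is "<html><body><p> <p> </body></html>"

# Step 2: drop the closing tags.
open_only <- gsub("</[^<>]+>", "", tags_only)
# open_only is "<html><body><p> <p> "

# Extract "<name" for every remaining tag, strip the "<", and count.
tag_names <- regmatches(open_only, gregexpr("<[^<> ]+", open_only))[[1]]
tag_names <- sub("^<", "", tag_names)
tag_counts <- table(tag_names)
print(tag_counts)
```

For this input the table reports one body, one html and two p tags.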

score 0 · Accepted Answer

You should look into something like http://rss.acs.unt.edu/Rdoc/library/XML/html/xmlTreeParse.html

In the link above, look at the example code. There is a section that shows how to print the entities. I haven't used this package so I can't vouch for it directly.

score 0 · Accepted Answer

(1) gsubfn

Assuming s is the input string (it may contain newlines), this matches < followed by one or more characters that are not /, > or a space, and extracts those characters into tags. Closing tags are skipped because / cannot follow the <. The table function then tabulates the occurrences:

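library(gsubfn)
# A plain "<" is used here: "\\<" is a start-of-word assertion in R's own regex engine,
# so the escaped form only matches a literal "<" when gsubfn uses its tcl engine.
tags <- strapply(tolower(s), "<([^/> ]+)", c, simplify = unlist)
table(tags)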

For example,

s <- "<html>
<body>

<h1>hello</h1>

</body>
</html>"
# Plain "<" rather than "\\<", which R's own regex engine reads as a start-of-word assertion.
tags <- strapply(tolower(s), "<([^/> ]+)", c, simplify = unlist)
table(tags)

gives this:

tags
body   h1 html 
   1    1    1

If your file is very large, the development version of gsubfn has a faster variant called strapplyc.

(2) XML

The approach above could get confused if there are < and > symbols inside quoted strings, or in other edge cases. Your input may not contain any such instances, but just in case, this second approach parses the HTML properly and should not have that problem:

library(XML)
doc <- htmlTreeParse(tolower(s), asText = TRUE, useInternalNodes = TRUE)
tags <- xpathSApply(doc, "//*", xmlName)
table(tags)
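Since the question is about a whole corpus, either approach can be applied to one document at a time. The following is a minimal sketch under stated assumptions: the corpus is represented as a named character vector called docs (how the text is pulled out of a tm corpus is not shown), count_tags is a hypothetical helper, and the extraction uses base R regular expressions rather than gsubfn or XML, so it shares the edge-case limitations of the regex approach.

```r
# Hypothetical helper: tabulate opening-tag names in one HTML string.
count_tags <- function(html) {
  # "<" followed by a letter and then letters or digits; closing tags ("</...") do not match.
  m <- regmatches(html, gregexpr("<[A-Za-z][A-Za-z0-9]*", html))[[1]]
  table(tolower(sub("^<", "", m)))
}

# Assumed stand-in for the corpus: one HTML string per document.
docs <- c(
  doc1 = "<html>\n<body>\n\n<h1>hello</h1>\n\n</body>\n</html>",
  doc2 = "<html><body><p>test<p>test2</body></html>"
)

# One table of tag counts per document, keyed by document name.
per_doc <- lapply(docs, count_tags)
print(per_doc)
```

Here per_doc$doc1 counts one each of body, h1 and html, and per_doc$doc2 counts one body, one html and two p tags.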
