r - Using substr with start and stop words instead of integers

Question

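I want to extract information from downloaded HTML code. The HTML code is given as a string. The information I need is stored between specific HTML tags. For example, if I want every headline in the string, I have to search for "H1>" and "/H1>" and take the text between them.

So far I have used substr(), but I first have to calculate the positions of "H1>" and "/H1>".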

htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
startposition = c(21,55) # calculated with gregexpr
stopposition = c(28, 63) # calculated with gregexpr
substr(htmlcode, startposition[1], stopposition[1])
substr(htmlcode, startposition[2], stopposition[2])

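The output is correct, but calculating every single start and stop position is a lot of work. Instead, I am looking for a function similar to substr() that accepts a start word and a stop word instead of positions. For example, something like: function(htmlcode, startword = "H1>", stopword = "/H1>")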

score 0 · Accepted Answer

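I agree that a package built for HTML processing is probably the best way to handle the example you gave. However, one potential way to substring a string based on character values is the following.

Step 1: Define a simple function that returns the positions of a character sequence within a string. In this example I only use fixed (literal) strings.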

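# Return the start position of every literal occurrence of `char` in `string`
strpos_fixed <- function(string, char) {
  a <- gregexpr(char, string, fixed = TRUE)
  # gregexpr() returns a list; keep the positions and drop the match attributes
  as.integer(a[[1]])
}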
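
As a quick check of what Step 1 returns, the sketch below declares an equivalent helper (so the snippet runs on its own) and applies it to the example string from the question. The results are the 1-based positions of the first character of each match.

```r
# Equivalent to the helper above, repeated so this snippet is self-contained
strpos_fixed <- function(string, char) {
  a <- gregexpr(char, string, fixed = TRUE)
  as.integer(a[[1]])
}

htmlcode <- " some html code <H1>headline</H1> some other code <H1>headline2</H1> "

starts <- strpos_fixed(htmlcode, "<H1>")   # where each opening tag begins
stops  <- strpos_fixed(htmlcode, "</H1>")  # where each closing tag begins
print(starts)  # [1] 17 51
print(stops)   # [1] 29 64
```

Adding nchar("<H1>") to the start positions and subtracting 1 from the stop positions gives 21 to 28 and 55 to 63, the same values the question computed by hand.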

Step 2: Define the new substring function using the strpos_fixed() function just defined.

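# Extract the text between every literal `start` and `stop` marker
char_substr <- function(string, start, stop) {
  x <- strpos_fixed(string, start) + nchar(start)  # first character after each start marker
  y <- strpos_fixed(string, stop) - 1              # last character before each stop marker
  z <- cbind(x, y)                                 # one row per start/stop pair
  apply(z, 1, function(pos) substr(string, pos[1], pos[2]))
}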

Step 3: Test

htmlcode = " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
htmlcode2 = " some html code <H1>baa dee ya</H1> some other code <H1>say do you remember?</H1>"
htmlcode3<- "<x>baa dee ya</x> skdjalhgfjafha <x>dancing in september</x>"
char_substr(htmlcode,"<H1>","</H1>")
char_substr(htmlcode2,"<H1>","</H1>")
char_substr(htmlcode3,"<x>","</x>")
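
One caveat with the functions above: gregexpr() returns -1 when a marker is not found, and cbind() recycles values when the number of start and stop markers differs. A minimal sketch of a more defensive variant follows. The name extract_between() and the pairing rule (each start marker is paired with the first stop marker at or after its content) are assumptions made for illustration, not part of the original answer.

```r
# Hypothetical helper: text between each literal `start` and the next `stop`
extract_between <- function(string, start, stop) {
  s <- as.integer(gregexpr(start, string, fixed = TRUE)[[1]])
  e <- as.integer(gregexpr(stop, string, fixed = TRUE)[[1]])
  # gregexpr() signals "no match" with -1
  if (s[1] == -1L || e[1] == -1L) return(character(0))
  from <- s + nchar(start)  # first character after each start marker
  out <- character(0)
  for (f in from) {
    later <- e[e >= f]      # stop markers at or after the content start
    if (length(later) == 0) break  # unclosed start marker: nothing more to extract
    out <- c(out, substr(string, f, later[1] - 1))
  }
  out
}

htmlcode <- " some html code <H1>headline</H1> some other code <H1>headline2</H1> "
res_all      <- extract_between(htmlcode, "<H1>", "</H1>")               # "headline" "headline2"
res_none     <- extract_between(htmlcode, "<H2>", "</H2>")               # character(0)
res_unclosed <- extract_between("<x>a</x> <x>unclosed", "<x>", "</x>")  # "a"
```

Returning character(0) for missing markers keeps the result type stable, so callers can check length() instead of handling a special value.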

score 0 · Accepted Answer

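You have two options here. First, use a package developed specifically for parsing HTML structures, such as rvest. There are plenty of tutorials online.

Second, for edge cases where you need to extract from strings that are not necessarily well-formed HTML, you should use regular expressions. One of the simpler implementations is stringr::str_match: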
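
For the first option, a minimal sketch with rvest might look like the following. It assumes the rvest package is installed in version 1.0 or later, which provides html_elements(). The HTML parser lowercases tag names, so the CSS selector "h1" also matches <H1>.

```r
library(rvest)

htmlcode <- " some html code <H1>headline</H1> some other code <H1>headline2</H1> "

page <- read_html(htmlcode)              # parse the string as an HTML document
headline_nodes <- html_elements(page, "h1")  # select every <h1> node
headlines <- html_text(headline_nodes)   # extract the text content of each node
headlines
```

Because the parser builds a document tree, this also copes with attributes on the tag (for example <H1 class="title">), which a fixed-string search for "<H1>" would miss.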

library(stringr)

# 1. the parentheses define regex groups
# 2. ".*?" means any characters, non-greedy
# 3. so together we match the expression <H1>some text or characters of any length</H1>

str_match(htmlcode, "(<H1>)(.*?)(</H1>)")

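This produces a matrix whose columns are, in order, the full matched string followed by each of the individual regex groups we specified. In this case, if you want the text between the <H1> tags, you only need to pull out the second group (the third column).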
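
Note that str_match() reports only the first match in each input string. To collect every headline, a sketch using stringr::str_match_all() could look like the following. It assumes stringr is installed and uses a single capture group, so the text between the tags ends up in column 2 rather than column 3.

```r
library(stringr)

htmlcode <- " some html code <H1>headline</H1> some other code <H1>headline2</H1> "

# str_match_all() returns a list with one matrix per input string;
# each row is a match, column 1 is the full match, column 2 the capture group
m <- str_match_all(htmlcode, "<H1>(.*?)</H1>")[[1]]
headlines <- m[, 2]
headlines
```

By default "." does not match line breaks, so a headline that spans several lines would need the pattern wrapped in regex(..., dotall = TRUE).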
