regex - Extract URLs with a regular expression into a new data frame column

Question

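I would like to use a regular expression to extract all URLs from the text in my data frame into a new column. I have some older code that I used to extract keywords, so I am hoping to adapt it to work with a regex. I want to save the regular expression as a string variable and apply it here: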

data$ContentURL <- apply(sapply(regex, grepl, data$Content, fixed=FALSE), 1, function(x) paste(selection[x], collapse=','))

It seems that fixed=FALSE should tell grepl that the pattern is a regular expression, but R doesn't like how I am trying to save the regex:

regex <- "http.*?1-\\d+,\\d+"

My data is organized in a data frame like this:

data <- read.table(text='"Content"     "date"   
 1     "a house a home https://www.foo.com"     "12/31/2013"
 2     "cabin ideas https://www.example.com in the woods"     "5/4/2013"
 3     "motel is a hotel"   "1/4/2013"', header=TRUE)

And I would like the result to look like this:

                                           Content       date              ContentURL
1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
3                                 motel is a hotel   1/4/2013

score 24 · Accepted Answer

A Hadleyverse solution (the stringr package) with a decent URL pattern:

library(stringr)

url_pattern <- "http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+"

data$ContentURL <- str_extract(data$Content, url_pattern)

data

##                                            Content       date              ContentURL
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

You can use str_extract_all if there are multiple URLs in Content, but that will involve some extra processing on your end afterwards.
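As a rough illustration of that extra processing, here is a minimal sketch. The data frame `df`, its contents, and the simplified pattern `simple_url` are assumptions made for the example (the pattern is not the one from the answer); the idea is just to collapse each row's matches into a single comma-separated string.

```r
library(stringr)

# Hypothetical sample data: the first row holds two URLs
df <- data.frame(
  Content = c("see https://www.foo.com and http://bar.org now",
              "cabin ideas https://www.example.com in the woods",
              "motel is a hotel"),
  stringsAsFactors = FALSE
)

# Simplified pattern (an assumption for this sketch)
simple_url <- "https?://\\S+"

# str_extract_all() returns a list with one character vector per row
url_list <- str_extract_all(df$Content, simple_url)

# Collapse each vector into one string; rows without a match
# come back as character(0), which is mapped to NA here
df$ContentURL <- vapply(
  url_list,
  function(u) if (length(u) == 0) NA_character_ else paste(u, collapse = ","),
  character(1)
)

df$ContentURL
```

Whether to join the URLs with a comma or keep them as a list column depends on what you plan to do with them afterwards.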

score 3 · Accepted Answer

Here is one approach using the qdapRegex library:

library(qdapRegex)
data[["url"]] <- unlist(rm_url(data[["Content"]], extract=TRUE))
data

##                                            Content       date                     url
## 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
## 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
## 3                                 motel is a hotel   1/4/2013                    <NA>

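To see the regular expression the function uses (qdapRegex aims to help you analyze and learn about regexes), you can use the grab function with the function name prefixed by @: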

grab("@rm_url")

## [1] "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

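grepl gives you a logical output telling you whether each string contains the pattern or not. grep gives you the indices or the values, but the values are the whole strings, not the substrings you want.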
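The difference can be seen in a small base R sketch. The vector `x` and the pattern `pat` are made up for the example; `regmatches` with `regexpr` is shown as one way to get the substring itself.

```r
# Hypothetical input: one string with a URL, one without
x <- c("a house a home https://www.foo.com", "motel is a hotel")
pat <- "http[^ ]*"

has_url  <- grepl(pat, x)                  # logical vector, one value per string
idx      <- grep(pat, x)                   # indices of the strings that match
whole    <- grep(pat, x, value = TRUE)     # the entire matching strings
sub_only <- regmatches(x, regexpr(pat, x)) # only the matched substrings

has_url
idx
whole
sub_only
```

Note that `regmatches` with `regexpr` silently drops the strings that have no match, so its result can be shorter than `x`.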

So, to pass this regex to base R or to the stringi package (qdapRegex wraps stringi for extraction), you could do:

regmatches(data[["Content"]], gregexpr(grab("@rm_url"), data[["Content"]], perl = TRUE))

library(stringi)
stri_extract(data[["Content"]], regex=grab("@rm_url"))

I'm sure there is a stringr approach too, but I'm not familiar with that package.
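The regmatches/gregexpr call returns a list rather than a column, so one more step is needed to attach it to the data frame. A minimal base R sketch, assuming the pattern string printed by grab("@rm_url") above is copied in by hand (so qdapRegex is not required) and that multiple matches should be joined with commas:

```r
# Sample data rebuilt from the question
data <- data.frame(
  Content = c("a house a home https://www.foo.com",
              "cabin ideas https://www.example.com in the woods",
              "motel is a hotel"),
  date = c("12/31/2013", "5/4/2013", "1/4/2013"),
  stringsAsFactors = FALSE
)

# Pattern copied from the grab("@rm_url") output shown earlier
rm_url_pattern <- "(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)"

# gregexpr() finds every match; regmatches() returns one vector per row
matches <- regmatches(data$Content,
                      gregexpr(rm_url_pattern, data$Content, perl = TRUE))

# One value per row: joined matches, or NA when the row has none
data$url <- vapply(
  matches,
  function(m) if (length(m) > 0) paste(m, collapse = ",") else NA_character_,
  character(1)
)

data
```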

score 0 · Accepted Answer

Split on spaces, then find "http":

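data$ContentURL <- unlist(sapply(strsplit(as.character(data$Content), split = " "),
                                 function(i){
                                   # keep only the space-separated tokens that contain "http"
                                   x <- i[ grepl("http", i)]
                                   # no URL in this row: return NA so every row yields one value
                                   if(length(x) == 0) return(NA_character_)
                                   # several URLs in one row: join them so the lengths still line up
                                   paste(x, collapse = ",")
                                 }))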


data
#                                            Content       date              ContentURL
# 1               a house a home https://www.foo.com 12/31/2013     https://www.foo.com
# 2 cabin ideas https://www.example.com in the woods   5/4/2013 https://www.example.com
# 3                                 motel is a hotel   1/4/2013                    <NA>

score 0 · Accepted Answer

You can use the unglue package:

library(unglue)
unglue_unnest(data,Content, "{=.*?}{url=http[^ ]*}{=.*?}",remove = FALSE)
#>                                            Content       date                       url
#> 1               a house a home https://www.f00.com 12/31/2013 1     https://www.f00.com
#> 2 cabin ideas https://www.example.com in the woods   5/4/2013 2 https://www.example.com
#> 3                                 motel is a hotel   1/4/2013 3                    <NA>

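{=.*?} matches anything and is not assigned to an extracted column, which is why the lhs of = is empty
{url=http[^ ]*} matches whatever starts with http followed by non-space characters; since the lhs is url, it is extracted into the url column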
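For comparison, roughly the same matching idea can be written as a plain anchored regex with one capture group in base R. This is only an approximation of what the pattern expresses, not a description of how unglue works internally; the vector `x` and the names `full_pattern` and `url` are made up for the sketch.

```r
# Hypothetical input mirroring the Content column
x <- c("a house a home https://www.f00.com",
       "cabin ideas https://www.example.com in the woods",
       "motel is a hotel")

# Lazy "anything", then the captured URL, then lazy "anything" again
full_pattern <- "^.*?(http[^ ]*).*?$"

# regexec() reports the full match plus each capture group
m <- regmatches(x, regexec(full_pattern, x, perl = TRUE))

# Element 2 is the capture group; strings without a match give character(0)
url <- vapply(
  m,
  function(g) if (length(g) == 2) g[2] else NA_character_,
  character(1)
)

url
```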

PS: due to SO restrictions, I manually changed foo to f00 in the answer.
