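I've been working with a canned set of regular expressions for common tasks like this, which I've put into a package, qdapRegex, on GitHub that will eventually go to CRAN. It can extract the matched pieces as well as sub them out. Feedback on the package from anyone who takes a look is welcome.
Here it is: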
# Install the development version from GitHub (requires devtools)
library(devtools)
install_github("trinker/qdapRegex")
library(qdapRegex)
x <- c("download file from http://example.com",
       "this is the link to my website http://example.com",
       "go to http://example.com from more info.",
       "Another url ftp://www.example.com",
       "And https://www.example.net",
       "twitter type: t.co/N1kq0F26tG",
       "still another one https://t.co/N1kq0F26tG :-)")
rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))
## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"
## [5] "And"                            "twitter type:"
## [7] "still another one :-)"
rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)
## [[1]]
## [1] "http://example.com"
##
## [[2]]
## [1] "http://example.com"
##
## [[3]]
## [1] "http://example.com"
##
## [[4]]
## [1] "ftp://www.example.com"
##
## [[5]]
## [1] "https://www.example.net"
##
## [[6]]
## [1] "t.co/N1kq0F26tG"
##
## [[7]]
## [1] "https://t.co/N1kq0F26tG"
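EDIT: I saw that Twitter links were not removed. I will not be adding this to the regex specific to the rm_url function; instead I have added it to the dictionary in qdapRegex. So there is no specific function that removes both standard URLs and Twitter URLs, but pastex (paste regular expression) lets you easily grab regexes from the dictionary and paste them together (using the pipe operator, |). Since all rm_XXX style functions work essentially the same way, you can pass the pastex output to the pattern argument of any rm_XXX function, or create your own function as I show below: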
rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
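For anyone curious about what joining patterns with | does, here is a minimal base R sketch of the same idea. The two patterns below are simplified stand-ins I made up for illustration; they are not the regexes stored in the qdapRegex dictionary, and the variable names are hypothetical.

```r
# Simplified stand-in patterns (assumptions, NOT the qdapRegex dictionary entries)
twitter_url  <- "(https?://)?t\\.co/[A-Za-z0-9]+"
standard_url <- "(https?|ftp)://[^[:space:]]+"

# Join the alternatives with the regex OR operator, the idea behind pastex()
combined <- paste(c(twitter_url, standard_url), collapse = "|")

y <- c("go to http://example.com from more info.",
       "twitter type: t.co/N1kq0F26tG")

# Remove every match, then collapse and trim the leftover whitespace
removed <- trimws(gsub("[[:space:]]+", " ", gsub(combined, "", y, perl = TRUE)))

# Extract the matches instead of removing them (one list element per string)
extracted <- regmatches(y, gregexpr(combined, y, perl = TRUE))
```

The Twitter pattern is listed first so that a shortened link is matched with or without its scheme. The real dictionary regexes are more thorough, so treat this only as a picture of the mechanism.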