
I have a vector of strings in R, `myStrings`, that looks like this:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

where `another url` is a valid http url, but stackoverflow won't let me insert more than one url, which is why I wrote `another url` instead. I want to remove all of the urls from `myStrings` so that it looks like this:

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I have tried many of the functions in the `stringr` package, but nothing has worked.


4 Answers


You can use `gsub` with a regular expression that matches URLs.

Set up a vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"   

Update: Ideally you would post a few of the different URLs so we know what we are working with, but I think this regex will work for the URL you mentioned in the comments:

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

An explanation of the expression above:

  • ` ?` an optional space
  • `(f|ht)` matches "f" or "ht"
  • `tp` matches "tp"
  • `(s?)` optionally matches "s" if it is present
  • `(://)` matches "://"
  • `(.*)` matches every character (everything) up to
  • `[.|/]` a period or a forward slash
  • `(.*)` everything after that

I'm not an expert with regex, but I think I explained that correctly.
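
As a quick check, here is that updated pattern applied with `gsub` to the same vector `x` from above. The output below is what I would expect from the pattern; note that the trailing `(.*)` also consumes any text that follows the URL (see the third element):

gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to"                          "Another url"                   
# [5] "And"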

Note: URL shorteners are no longer allowed in SO answers, so I had to remove one section when making a recent edit. See the edit history for that section.

Answered 2014-08-17T18:54:56.733

I've been working on a set of canned regular expressions for common tasks like this one, and I've put them into a package, qdapRegex, on GitHub; it will eventually go to CRAN. It can extract the pieces as well as strip them out. Any feedback on the package is welcome.

Here it is:

library(devtools)
install_github("trinker/qdapRegex")
library(qdapRegex)

x <- c("download file from http://example.com", 
         "this is the link to my website http://example.com", 
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))

## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"         

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)

## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

Edit: I see that the twitter link wasn't being removed. Rather than adding this to the regex that is specific to the `rm_url` function, I added it to the `qdapRegex` dictionary. So there is no single function that removes both standard urls and twitter urls, but `pastex` (paste regex) lets you easily grab regexes from the dictionary and paste them together (with the pipe operator, `|`). Since all of the `rm_XXX`-style functions work in basically the same way, you can pass the `pastex` output to the `pattern` argument of any `rm_XXX` function, or create your own function, as shown here:

rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
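## Assuming qdapRegex behaves as described above, these two calls should return
## the same output as the earlier rm_url(x, pattern = pastex(...)) examples.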
Answered 2014-08-17T19:39:29.107
 str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info")

 gsub('http\\S+\\s*',"", str1)
 #[1] "download file from "                         
 #[2] "this is the link to my website for more info"

 library(stringr)
 str_trim(gsub('http\\S+\\s*',"", str1)) #removes trailing/leading spaces
 #[1] "download file from"                          
 #[2] "this is the link to my website for more info"

Update

To match `ftp` urls as well, I would use the same idea as in @Richard Scriven's post:

  str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info",
  "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info")


  gsub('(f|ht)tp\\S+\\s*',"", str1)
  #[1] "download file from "                         
  #[2] "this is the link to my website for more info"
  #[3] "this link to gives more info"     
Answered 2014-08-17T19:05:31.913

Some of the previous answers remove text beyond the end of the URL; adding `\b` (a word boundary) helps with that. It also covers `sftp://` urls.

For regular urls:

gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)

For tiny urls:

gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", x)
Answered 2018-03-23T22:33:49.497