
I have a vector of strings in R, `myStrings`, that looks like this:

[1] download file from `http://example.com`
[2] this is the link to my website `another url`
[3] go to `another url` from more info.

where `another url` is a valid http url, but stackoverflow won't let me insert more than one url, which is why I wrote `another url` instead. I want to remove all of the urls from `myStrings` so that it looks like this:

[1] download file from
[2] this is the link to my website
[3] go to from more info.

I have tried many of the functions in the `stringr` package, but nothing has worked.


4 Answers


You can use `gsub` with a regular expression that matches URLs.

Set up a vector:

x <- c(
    "download file from http://example.com", 
    "this is the link to my website http://example.com", 
    "go to http://example.com from more info.",
    "Another url ftp://www.example.com",
    "And https://www.example.net"
)

Remove all the URLs from each string:

gsub(" ?(f|ht)tp(s?)://(.*)[.][a-z]+", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to from more info."          "Another url"                   
# [5] "And"   

Update: Ideally you would post a few of the different URLs so we know what we are working with, but I think this regex will work for the URL you mentioned in the comments:

" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)"

An explanation of the expression above:

  • ` ?` an optional space
  • `(f|ht)` matches "f" or "ht"
  • `tp` matches "tp"
  • `(s?)` optionally matches "s" if it is present
  • `(://)` matches "://"
  • `(.*)` matches every character (everything) up to
  • `[.|/]` a period or a forward slash
  • `(.*)` everything after that

I'm not an expert with regex, but I think I explained that correctly.
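
As a quick check, here is that updated pattern applied with `gsub` to the same vector `x` from above. The output below is what I would expect from the pattern; note that the trailing `(.*)` also consumes any text that follows the URL (see the third element):

gsub(" ?(f|ht)(tp)(s?)(://)(.*)[.|/](.*)", "", x)
# [1] "download file from"             "this is the link to my website"
# [3] "go to"                          "Another url"                   
# [5] "And"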

Note: URL shorteners are no longer allowed in SO answers, so I had to remove one section when making a recent edit. See the edit history for that section.

Answered 2014-08-17T18:54:56.733

I've been working on a set of canned regular expressions for common tasks like this one, and I've put them into a package, qdapRegex, on GitHub; it will eventually go to CRAN. It can extract the pieces as well as strip them out. Any feedback on the package is welcome.

Here it is:

library(devtools)
install_github("trinker/qdapRegex")
library(qdapRegex)

x <- c("download file from http://example.com", 
         "this is the link to my website http://example.com", 
         "go to http://example.com from more info.",
         "Another url ftp://www.example.com",
         "And https://www.example.net",
         "twitter type: t.co/N1kq0F26tG",
         "still another one https://t.co/N1kq0F26tG :-)")

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"))

## [1] "download file from"             "this is the link to my website"
## [3] "go to from more info."          "Another url"                   
## [5] "And"                            "twitter type:"                 
## [7] "still another one :-)"         

rm_url(x, pattern=pastex("@rm_twitter_url", "@rm_url"), extract=TRUE)

## [[1]]
## [1] "http://example.com"
## 
## [[2]]
## [1] "http://example.com"
## 
## [[3]]
## [1] "http://example.com"
## 
## [[4]]
## [1] "ftp://www.example.com"
## 
## [[5]]
## [1] "https://www.example.net"
## 
## [[6]]
## [1] "t.co/N1kq0F26tG"
## 
## [[7]]
## [1] "https://t.co/N1kq0F26tG"

Edit: I see that the twitter link wasn't being removed. Rather than adding this to the regex that is specific to the `rm_url` function, I added it to the `qdapRegex` dictionary. So there is no single function that removes both standard urls and twitter urls, but `pastex` (paste regex) lets you easily grab regexes from the dictionary and paste them together (with the pipe operator, `|`). Since all of the `rm_XXX`-style functions work in basically the same way, you can pass the `pastex` output to the `pattern` argument of any `rm_XXX` function, or create your own function, as shown here:

rm_twitter_url <- rm_(pattern=pastex("@rm_twitter_url", "@rm_url"))
rm_twitter_url(x)
rm_twitter_url(x, extract=TRUE)
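## Assuming qdapRegex behaves as described above, these two calls should return
## the same output as the earlier rm_url(x, pattern = pastex(...)) examples.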
Answered 2014-08-17T19:39:29.107
 str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info")

 gsub('http\\S+\\s*',"", str1)
 #[1] "download file from "                         
 #[2] "this is the link to my website for more info"

 library(stringr)
 str_trim(gsub('http\\S+\\s*',"", str1)) #removes trailing/leading spaces
 #[1] "download file from"                          
 #[2] "this is the link to my website for more info"

Update

To match `ftp` urls as well, I would use the same idea as in @Richard Scriven's post:

  str1 <- c("download file from http://example.com", "this is the link to my website https://www.google.com/ for more info",
  "this link to ftp://www.example.org/community/mail/view.php?f=db/6463 gives more info")


  gsub('(f|ht)tp\\S+\\s*',"", str1)
  #[1] "download file from "                         
  #[2] "this is the link to my website for more info"
  #[3] "this link to gives more info"     
Answered 2014-08-17T19:05:31.913

Some of the previous answers remove text beyond the end of the URL; adding `\b` (a word boundary) helps with that. It also covers `sftp://` urls.

For regular urls:

gsub("(s?)(f|ht)tp(s?)://\\S+\\b", "", x)

For tiny urls:

gsub("[A-Za-z]{1,5}[.][A-Za-z]{2,3}/[A-Za-z0-9]+\\b", "", x)
Answered 2018-03-23T22:33:49.497