regex - Replace a URL with its domain (R)

Question

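I want to replace the URL in a string ("Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example") with its domain ("Hello world stackoverflow.com").

So far, I am able to identify the URL and replace it with some constant value, but not with the URL's domain: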

x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "URL", x)

Any help is highly appreciated.

score 2 · Accepted Answer

Depending on how much validating the URL itself matters, you might get away with something like:

gsub("(https?://[^/\\s]+)[^\\s]*", "\\1", x)

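http:// with an optional s, followed by one or more characters that are neither whitespace nor /, is captured as back-reference group 1; zero or more non-whitespace characters are then consumed (greedily). The entire match is then replaced by the captured group (the domain, here still prefixed by its scheme).

Note: this assumes the URL does not contain any whitespace.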
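
As written, group 1 in the pattern above also captures the scheme, so the replacement keeps the http:// prefix. Below is a minimal sketch of a variant that leaves only the domain. It assumes PCRE is acceptable (perl = TRUE, so that \\s inside the bracket expression is read as whitespace) and moves the scheme outside the capture group. The variable name domain_only is only for illustration.

```r
x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

# The scheme is matched but not captured; group 1 holds only the host part.
# perl = TRUE selects PCRE, where \\s inside [...] means whitespace.
domain_only <- gsub("https?://([^/\\s]+)\\S*", "\\1", x, perl = TRUE)
domain_only
```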

score 1 · Accepted Answer

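You need to use a back-reference.

Let me preface this by saying that I don't know R, but I assume the back-reference syntax is \N, where N is the number of the matching group.

So if you replace the pattern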

https?://([^/\s]++)\S*+

with the string

\1

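you should end up replacing the matched pattern with the captured group.

I don't know what the escaping conventions are, but you may need to escape the backslashes with another backslash.

Broken down, the pattern is

https? matches "http" followed by an optional "s"
:// matches the literal "://"
([^/\s]++) matches and captures everything up to the next slash or whitespace (the domain)
\S*+ matches the rest of the URL, up to the next whitespace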
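
In R, the pattern above might be written as follows. This is a sketch under two assumptions: backslashes are doubled inside R string literals, and perl = TRUE is passed, since the possessive quantifiers ++ and *+ are PCRE syntax. The variable name result is only for illustration.

```r
x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

# Group 1 captures the domain; \\S*+ consumes the rest of the URL.
result <- gsub("https?://([^/\\s]++)\\S*+", "\\1", x, perl = TRUE)
result
## [1] "Hello world stackoverflow.com"
```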

score 0 · Accepted Answer

You can use grep to scan the string and extract everything between http:// and the next /: grep -Po 'http://\K.*?(?=/)'. See http://rfunction.com/archives/1481 and the regex guide here: http://www.regular-expressions.info/
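
The same \K and lookahead pattern can also be tried from within R rather than at the shell. Below is a minimal sketch, assuming perl = TRUE and base R's regexpr and regmatches. Note that it extracts the domain rather than replacing the URL in place, and the names m and domain are only for illustration.

```r
x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

# \\K drops "http://" from the reported match; (?=/) stops before the next slash.
m <- regexpr("http://\\K.*?(?=/)", x, perl = TRUE)
domain <- regmatches(x, m)
domain
```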

score 0 · Accepted Answer

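The catch here (compared with earlier questions on Stack Overflow) is that the non-URL part of the string should be kept, while the URL is shortened to its domain.

Based on the post mentioned in my question, I now use the following solution: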

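x <- "Hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example"

# Remove the URL, keeping the surrounding text
y.1 <- gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", "", x)
# Split on "//" or "/", take the second piece (the host), and drop a leading "www."
y.2 <- gsub("^www\\.", "", sapply(strsplit(x, "//|/"), "[", 2))

# Append the domain to the remaining text
z <- paste(y.1, y.2, sep = "")

z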

It's not the most elegant solution, but it works.

score 0 · Accepted Answer

    library(httr)
    txt <- "hello world http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible"
    l <- lapply(unlist(strsplit(txt," ",fixed=TRUE)),function(w){
           hostname <- parse_url(w)$hostname
           if(is.null(hostname) ) hostname <- w
           hostname
          })
    paste(l,collapse=" ")
    ## hello world stackoverflow.com
