r - Extract string elements that possibly appear multiple times, or not at all

Question

Start with a character vector of URLs. The goal is to end up with only the name of the company, meaning a column with only "test", "example" and "sample" in the example below.

urls <- c("http://grand.test.com/", "https://example.com/", 
          "http://.big.time.sample.com/")

Remove the ".com" and whatever might follow it and keep the first part:

urls <- sapply(strsplit(urls, split="(?<=.)(?=\\.com)", perl=T), "[", 1) 

urls
# [1] "http://grand.test"    "https://example"      "http://.big.time.sample"

My next step is to remove the http:// and https:// portions with a chained gsub() call:

urls <- gsub("^http://", "",  gsub("^https://", "", urls))

urls
# [1] "grand.test"       "example"          ".big.time.sample"

But here is where I need help. How do I handle the multiple periods (dots) before the company name in the first and third strings of urls? For example, the call below returns NA for the second string, since the "example" string has no period remaining. Or if I retain only the first part, I lose a company name.

urls  <- sapply(strsplit(urls, split = "\\."), "[", 2)
urls
# [1] "test" NA     "big"

urls  <- sapply(strsplit(urls, split = "\\."), "[", 1)
urls
# [1] "grand"   "example" ""

Perhaps an ifelse() call that counts the number of periods remaining and only uses strsplit if there is more than one period? Also note that it is possible there are two or more periods before the company name. I don't know how to do lookarounds, which might solve my problem. But this didn't work:

strsplit(urls, split="(?=\\.)", perl=T)

Thank you for any suggestions.

score 3 · Accepted Answer

I think there should be a simpler way, but this works:

sub('.*[.]','',sub('https?:[/]+[.]?(.*)[.]com[/]','\\1',urls))
# [1] "test"    "example" "sample"

Here urls is your original vector of URLs.
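To see what each sub() call contributes, the one-liner can be unpacked into two steps. This is only a sketch of the same idea; the intermediate names hosts and company are introduced here for illustration and are not part of the answer.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Step 1: drop the scheme, an optional leading dot, and the trailing ".com/",
# keeping only what sits between them (capture group 1).
hosts <- sub("https?:[/]+[.]?(.*)[.]com[/]", "\\1", urls)
hosts
# [1] "grand.test"      "example"         "big.time.sample"

# Step 2: the greedy ".*[.]" removes everything up to the last remaining dot;
# strings that contain no dot are returned unchanged.
company <- sub(".*[.]", "", hosts)
company
# [1] "test"    "example" "sample"
```

Because the inner pattern requires a literal ".com/" at the end, a URL that does not end that way would pass through step 1 unchanged.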

score 3 · Accepted Answer

I think there should be a way to extract the word right before ".com" directly, but this might give you an idea:

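# Escape the dot so it matches a literal "." rather than any character,
# and unlist() the matches so sub() receives a plain character vector.
sub("\\.com", "", unlist(regmatches(urls, gregexpr("\\w+\\.com", urls))))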

score 3 · Accepted Answer

Here is an approach that may be easier to understand and generalize than some of the others:

pat = "(.*?)(\\w+)(\\.com.*)"
gsub(pat, "\\2", urls)

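It works by splitting each string into three capture groups that together match the entire string, and then substituting back only the capture group you want, (2).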

pat = "(.*?)(\\w+)(\\.com.*)"
#        ^    ^       ^
#        |    |       |
#       (1)  (2)     (3)

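Edit (adding an explanation of the ? modifier):

Note that capture group (1) needs the "non-greedy" or "minimal" quantifier ? (sometimes also called "lazy" or "reluctant"). It tells the regex engine to match as few characters as possible, so that group (1) does not use up any characters that could instead be part of the following capture group (2).

Without the trailing ?, repetition quantifiers are greedy by default. Here a greedy capture group (.*), which matches any number of characters of any kind, would "eat" as much of the string as it can and leave the other two groups only the minimum they need to match. Group (2) would then capture just the last character before ".com" ("t" instead of "test"), which is not the behavior we want!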
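The difference between the lazy and greedy forms can be checked side by side. The sketch below makes two assumptions that are not in the answer: it uses sub() instead of gsub(), and it passes perl = TRUE so that the lazy quantifier is handled by PCRE. The names lazy and greedy are illustrative only.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Lazy group (1): it stops as early as possible, so group (2) receives
# the whole word in front of ".com".
lazy <- sub("(.*?)(\\w+)(\\.com.*)", "\\2", urls, perl = TRUE)
lazy
# [1] "test"    "example" "sample"

# Greedy group (1): it gives back only as much as it must, so group (2)
# is left with a single character.
greedy <- sub("(.*)(\\w+)(\\.com.*)", "\\2", urls, perl = TRUE)
greedy
# [1] "t" "e" "e"
```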

score 2 · Accepted Answer

Answered 2014-06-20T17:39:32.140

score 2 · Accepted Answer

Using strsplit may also be worth a try:

sapply(strsplit(urls,"/|\\."),function(x) tail(x,2)[1])
#[1] "test"    "example" "sample"
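It can help to look at the pieces strsplit() produces before tail() picks one. A minimal sketch, with pieces and company as illustrative names that do not appear in the answer:

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# Split on every "/" or "."; the empty string comes from the "//" after the
# scheme, and strsplit() drops the empty piece after the final "/".
pieces <- strsplit(urls, "/|\\.")
pieces[[1]]
# [1] "http:" ""      "grand" "test"  "com"

# The company name is the second-to-last piece of each URL.
company <- sapply(pieces, function(x) tail(x, 2)[1])
company
# [1] "test"    "example" "sample"
```

This relies on the URL ending right after ".com/". With a path such as "http://grand.test.com/about", the second-to-last piece would be "com", so the approach would need adjusting for such input.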

score 1 · Accepted Answer

You can use stringr::word() together with basename().

basename() is handy when working with URLs.

library(stringr)
word(basename(urls), start = -2, sep = "\\.")
# [1] "test"    "example" "sample"

basename(urls) gives

# [1] "grand.test.com"       "example.com"          ".big.time.sample.com"

Then, in word(), with the separator set to a literal period (sep = "\\."), we take the second word from the end (start = -2).

score 1 · Accepted Answer

Because you can never have enough regex options, here is one that uses the regcapturedmatches.R function:

regcapturedmatches(urls, regexpr("([^.\\/]+)\\.com", urls, perl=T))

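If you only want a simple vector as the return value, you can unlist() the result. The idea of the pattern is to grab everything before ".com" that is not a dot or a "/".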
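Since regcapturedmatches() is a user-written helper rather than part of base R, the same pattern idea can also be sketched with base R's regexec() and regmatches(). This is an alternative to the answer's function, not a reimplementation of it, and the names m and company are illustrative.

```r
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# regexec() records the position of the full match and of each capture group;
# regmatches() then returns, per URL, c(full match, group 1).
m <- regmatches(urls, regexec("([^./]+)\\.com", urls))
m[[1]]
# [1] "test.com" "test"

# Keep only the captured group from each element.
company <- sapply(m, `[`, 2)
company
# [1] "test"    "example" "sample"
```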
