Start with a character vector of URLs. The goal is to end up with only the name of the company, meaning a column with only "test", "example", and "sample" in the example below.
urls <- c("http://grand.test.com/", "https://example.com/",
"http://.big.time.sample.com/")
Remove the ".com" and whatever might follow it, keeping only the first part:
urls <- sapply(strsplit(urls, split="(?<=.)(?=\\.com)", perl=TRUE), "[", 1)
urls
# [1] "http://grand.test" "https://example" "http://.big.time.sample"
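For comparison, the same stripping step can probably be done with one sub() call instead of strsplit() plus indexing. This is only a sketch: the name `stripped` is made up for illustration, and the pattern assumes ".com" first appears at the end of the host name (a host such as "x.community.org" would be cut in the wrong place).

```r
# Sketch: drop the first ".com" and everything after it in one sub() call.
urls <- c("http://grand.test.com/", "https://example.com/",
          "http://.big.time.sample.com/")

# "\\.com.*$" matches a literal ".com" followed by the rest of the string.
stripped <- sub("\\.com.*$", "", urls)
stripped
# [1] "http://grand.test" "https://example" "http://.big.time.sample"
```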
My next step is to remove the http:// and https:// portions with a chained gsub() call:
urls <- gsub("^http://", "", gsub("^https://", "", urls))
urls
# [1] "grand.test" "example" ".big.time.sample"
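The two nested gsub() calls could likely be collapsed into one, since "s?" makes the "s" optional. A minimal sketch, assuming the strings look like the output of the previous step (`stripped` and `hosts` are illustrative names, not part of my code above):

```r
# Sketch: remove either scheme prefix with a single anchored pattern.
stripped <- c("http://grand.test", "https://example", "http://.big.time.sample")

# "^https?://" matches "http://" or "https://" only at the start of the string,
# so sub() (first match only) is enough; gsub() is not needed.
hosts <- sub("^https?://", "", stripped)
hosts
# [1] "grand.test" "example" ".big.time.sample"
```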
But here is where I need help. How do I handle the multiple periods (dots) before the company name in the first and third strings of urls? For example, the first call below returns NA for the second string, since the "example" string has no period remaining. Or, if I retain only the first part, I lose a company name.
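# Taking the second piece: NA where there is no period
sapply(strsplit(urls, split = "\\."), "[", 2)
# [1] "test" NA "big"
# Taking the first piece: loses the company name when periods come first
sapply(strsplit(urls, split = "\\."), "[", 1)
# [1] "grand" "example" ""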
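Both attempts fail because the position of the company name varies from the front, while it is always the last piece after the earlier steps. One possible sketch therefore takes the last element of each split instead of a fixed index. The names `hosts`, `parts`, and `company` are hypothetical, and an empty input string would make the vapply() call fail, so this assumes no element is "".

```r
# Sketch: split on literal dots and keep the last piece of each string.
hosts <- c("grand.test", "example", ".big.time.sample")

# fixed = TRUE treats "." as a literal character, so no escaping is needed.
parts <- strsplit(hosts, split = ".", fixed = TRUE)

# p[length(p)] is the last element, however many dots came before it.
company <- vapply(parts, function(p) p[length(p)], character(1))
company
# [1] "test" "example" "sample"
```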
Perhaps an ifelse() call that counts the number of periods remaining and only uses strsplit() if there is more than one period? Also note that it is possible there are two or more periods before the company name. I don't know how to use lookarounds, which might solve my problem, but this attempt didn't work:
strsplit(urls, split="(?=\\.)", perl=TRUE)
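It may be that neither ifelse() nor a lookaround is needed. As a sketch under the same assumptions as above (scheme and ".com" already removed, `hosts` and `company` being illustrative names), a greedy pattern can delete everything up to and including the last period, and strings with no period are simply left unchanged:

```r
# Sketch: strip everything through the last dot with a greedy match.
hosts <- c("grand.test", "example", ".big.time.sample")

# "^.*\\." is greedy, so it extends to the final "." in each string;
# when a string has no ".", the pattern does not match and sub() returns it as is.
company <- sub("^.*\\.", "", hosts)
company
# [1] "test" "example" "sample"
```

Because the no-period case falls through untouched, the number of leading periods does not have to be counted.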
Thank you for any suggestions.