regex - Using R, when string extraction creates list elements in a data frame, how do I add a row for each item in the list?

Question

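I have hundreds of addresses in a data frame variable and need to extract the zip codes from them. Some addresses contain several cities, each with its own zip code. Here is a mock example of the data frame and the R code that extracts the zip codes.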

require(qdapRegex)
require(stringr)

df <- data.frame(address = c("Walnut; 94596, Ontario, 91761, Beach, CA 90071", "Irvine Cal 92164"), var2 = "text")
df$zip.Rinker <- sapply(df$address, FUN = rm_zip, extract=TRUE)

The rm_zip function in Tyler Rinker's qdapRegex package extracts all the zip codes and, when an address has more than one, puts them in a list.

> df
                                         address var2          zip.Rinker
1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596, 91761, 90071
2                               Irvine Cal 92164 text               92164
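
To see why the column behaves this way, it can help to inspect its structure. The following is a minimal sketch in base R only: it assumes a plain five-digit pattern as a stand-in for rm_zip (so no extra packages are needed), and the column name zip is just illustrative.

```r
# Mock data from the question
df <- data.frame(
  address = c("Walnut; 94596, Ontario, 91761, Beach, CA 90071", "Irvine Cal 92164"),
  var2 = "text",
  stringsAsFactors = FALSE
)

# Assumption: a simple five-digit regex stands in for rm_zip(extract = TRUE)
df$zip <- regmatches(df$address, gregexpr("\\d{5}", df$address))

is.list(df$zip)   # the new column is a list with one element per row
lengths(df$zip)   # how many zip codes were found in each address
```

Each row holds a character vector of variable length, which is why the data frame prints several zip codes in a single cell.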

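How can R create a new row for each zip code in row 1 of zip.Rinker? Something like the following would be ideal. Note that dozens of addresses will have multiple zip codes, so I am hoping for a solution that needs no manual steps.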

                                         address var2          zip.Rinker
1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               94596
2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               91761
3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text               90071
4                               Irvine Cal 92164 text               92164

Thank you for your time.

PS Using stringr, this code extracts the zip codes and presents the same challenge.

df$zip.stringr <- str_extract_all(string = df$address, pattern = "\\d{5}")

score 2 · Accepted Answer

You could do:

data.frame(rep(df$address, sapply(df$zip.Rinker, length)), unlist(df$zip.Rinker))

##   rep.df.address..sapply.df.zip.Rinker..length.. unlist.df.zip.Rinker.
## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 94596
## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 91761
## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071                 90071
## 4                               Irvine Cal 92164                 92164

But note that rm_zip is already vectorized and pretty speedy as it wraps the stringi package. So no need for sapply. Here's an approach that makes the code much more condensed using qdapTools's list2df that takes a named list of vectors and turns them into a data.frame.

library(qdapTools)
list2df(setNames(rm_zip(df$address, extract=TRUE), df$address), "zip", "address")[, 2:1]

##                                          address   zip
## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 94596
## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 91761
## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 90071
## 4                               Irvine Cal 92164 92164
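
If qdapTools is not available, base R's stack() does something similar with a named list. This is only a sketch under assumptions: a five-digit regex replaces rm_zip, and the names long, address, and zip are chosen here for illustration.

```r
df <- data.frame(
  address = c("Walnut; 94596, Ontario, 91761, Beach, CA 90071", "Irvine Cal 92164"),
  var2 = "text",
  stringsAsFactors = FALSE
)

# Assumption: a five-digit regex stands in for rm_zip(df$address, extract = TRUE)
zips <- regmatches(df$address, gregexpr("\\d{5}", df$address))

# stack() turns a named list of vectors into two columns: values and ind
long <- stack(setNames(zips, df$address))[, 2:1]
names(long) <- c("address", "zip")
long$address <- as.character(long$address)  # ind comes back as a factor
long
```

As with list2df, the list names (here the addresses) become the grouping column, so this relies on the addresses being unique.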

And I like the magrittr framework for nested functions so here's that:

library(qdapTools)
library(magrittr)

df$address %>%
    rm_zip(extract=TRUE) %>%
    setNames(df$address) %>%
    list2df("zip", "address") %>%
    `[`(, 2:1)

score 1 · Accepted Answer

Here is an approach using "data.table" and gregexpr/regmatches:

library(data.table)
as.data.table(df)[, c(.SD, list(
  Zips = unlist(regmatches(address, gregexpr("\\d{5}", address))))),
  by = 1:nrow(df)]
#    nrow                                        address var2  Zips
# 1:    1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596
# 2:    1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 91761
# 3:    1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 90071
# 4:    2                               Irvine Cal 92164 text 92164

score 0 · Accepted Answer

Here is an approach that uses only the stringi package:

library(stringi)
zip <- stri_extract_all_regex(df$address, "\\d{5}") 
data.frame(address=rep(df$address, sapply(zip, length)), zip=unlist(zip))

##                                          address   zip
## 1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 94596
## 2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 91761
## 3 Walnut; 94596, Ontario, 91761, Beach, CA 90071 90071
## 4                               Irvine Cal 92164 92164

score 0 · Accepted Answer

Another approach, this one using only base R, with hwnd's regex for extracting zip codes from the question "Remove US zip codes from a string: Regex":

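# PCRE pattern (perl = TRUE is needed for the lookbehind); also matches ZIP+4
match <- gregexpr('(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b', df$address, perl = TRUE)
zips <- regmatches(df$address, match)
# repeat each row index once per zip code found in that row
nn <- rep(seq_along(zips), lengths(zips))
data.frame(df[nn, ], zip = unlist(zips))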

                                          address var2   zip
1   Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 94596
1.1 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 91761
1.2 Walnut; 94596, Ontario, 91761, Beach, CA 90071 text 90071
2                                 Irvine Cal 92164 text 92164
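
The pattern above is stricter than a bare \\d{5}. A small sketch of the difference, using two made-up strings (x is hypothetical test data, not from the question):

```r
# Hypothetical inputs: one ZIP+4 code, and one longer digit run next to a real zip
x <- c("Austin TX 78701-1234", "id 1234567 Irvine 92164")

strict <- '(?<!\\d)\\d{5}(?:[ -]\\d{4})?\\b'
loose  <- '\\d{5}'

strict_hits <- regmatches(x, gregexpr(strict, x, perl = TRUE))
loose_hits  <- regmatches(x, gregexpr(loose, x))

strict_hits  # keeps the ZIP+4 suffix and skips the 7-digit run
loose_hits   # also picks up the first five digits of "1234567"
```

The lookbehind and the trailing word boundary keep the pattern from matching five digits inside a longer number, which may matter if the addresses contain phone numbers or IDs.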
