r - Extract text between brackets in a character column of an R data frame (creating a new column)

Question

Apologies if the title is a bit wordy; hopefully this example helps. I have the following data set:

my_df
                                     Description thisYVal thisPts
1                     (12:00)   Start Period        0       0
2        (12:00)   Jump Ball Thomas vs Grant        0       0
3      (11:48) [MIA 3-] Wade Layup Shot: Missed     0       2
4  (11:46) [PHL] Thomas Rebound (Off: Def:1)        0       0
6     (11:02) [MIA] Haslem Jump Shot: Missed      -19       2
7  (11:00) [MIA] Haslem Rebound (Off:1 Def:)        0       0
8    (10:57) [MIA] Haslem Layup Shot: Missed        0       2
9 (10:56) [PHL] Coleman Rebound (Off: Def:1)        0       0

dput(my_df)
structure(list(Description = c("(12:00)   Start Period", "(12:00)   Jump Ball Thomas vs Grant", 
"(11:48) [MIA 3-] Wade Layup Shot: Missed", "(11:46) [PHL] Thomas Rebound (Off: Def:1)", 
"(11:02) [MIA] Haslem Jump Shot: Missed", "(11:00) [MIA] Haslem Rebound (Off:1 Def:)", 
"(10:57) [MIA] Haslem Layup Shot: Missed", "(10:56) [PHL] Coleman Rebound (Off: Def:1)"
), thisYVal = c(0L, 0L, 0L, 0L, -19L, 0L, 0L, 0L), thisPts = c(0L, 
0L, 2L, 0L, 2L, 0L, 2L, 0L)), row.names = c(1L, 2L, 3L, 4L, 6L, 
7L, 8L, 9L), class = "data.frame")

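...and I would like to extract the 3-letter team abbreviation that appears in the Description column of the data frame.

The 3-letter abbreviation always comes right after the first opening square bracket [, although it is not always immediately followed by the closing bracket ] (as you can see in the 3rd row of the data frame).

I have been trying to do this with the substr() function, but no luck so far. Any help is appreciated!

EDIT: some extra context - some rows (1 and 2 in this case) have no [] and no team abbreviation. For those rows the result can be an empty string, NA, or anything else.

EDIT-2: just in case, since it was not stated explicitly - a fourth column c("", "", "MIA", "PHL", "MIA", "MIA", "MIA", "PHL") is what I am trying to get.

EDIT-3: the following gets me close, but not quite there: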

my_df %>% 
  dplyr::mutate(teamAbb = unlist(stringr::str_extract(Description, "\\[(.*)\\]")))

score 2 · Accepted Answer

R recently added strcapture to its standard utils package (dat below stands for the question's data frame):

strcapture("(?<=\\[)(.{3})", dat$Description, proto=list(out=character()), perl=TRUE)
#   out
#1 <NA>
#2 <NA>
#3  MIA
#4  PHL
#5  MIA
#6  MIA
#7  MIA
#8  PHL
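
To get from that one-column result to the new column the question asks for, something like the following might work. It is only a sketch: it rebuilds the first four rows of the question's data under the name my_df, the column name teamAbb is taken from the asker's own attempt, and the final NA-to-"" step is optional (the question said either would do).

```r
# First four rows of the question's data (Description column only, for brevity)
my_df <- data.frame(
  Description = c(
    "(12:00)   Start Period",
    "(12:00)   Jump Ball Thomas vs Grant",
    "(11:48) [MIA 3-] Wade Layup Shot: Missed",
    "(11:46) [PHL] Thomas Rebound (Off: Def:1)"
  ),
  stringsAsFactors = FALSE
)

# strcapture() returns a data frame with one column per capture group,
# named and typed after the entries of `proto`
caps <- utils::strcapture(
  "(?<=\\[)(.{3})",          # three characters right after the first "["
  my_df$Description,
  proto = list(teamAbb = character()),
  perl = TRUE                 # lookbehind needs the PCRE engine
)

# Attach the captured column to the original data frame
my_df$teamAbb <- caps$teamAbb

# Optional: use "" instead of NA for rows without a team
my_df$teamAbb[is.na(my_df$teamAbb)] <- ""
```
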

score 1 · Accepted Answer

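You can use str_match from the stringr package. Specifically, you need to look for three capital letters right after an opening square bracket (assuming all team abbreviations are three letters).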

> str_match(df$Description, '\\[([A-Z]{3})')
     [,1]   [,2] 
[1,] NA     NA   
[2,] NA     NA   
[3,] "[MIA" "MIA"
[4,] "[PHL" "PHL"
[5,] "[MIA" "MIA"
[6,] "[MIA" "MIA"
[7,] "[MIA" "MIA"
[8,] "[PHL" "PHL"

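You will notice that the team abbreviation part of the pattern is wrapped in parentheses; that is because it is a capture group within the pattern we are matching. str_match therefore returns (1) the whole match and (2) the group specified in parentheses. So in this case we want the second column, which contains the match for the first group.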

df$Team <- str_match(df$Description, '\\[([A-Z]{3})')[,2]

This gives us the desired result:

                                 Description Team
1                     (12:00)   Start Period <NA>
2        (12:00)   Jump Ball Thomas vs Grant <NA>
3  (11:48) [MIA 3-] Wade Layup Shot: Missed   MIA
4  (11:46) [PHL] Thomas Rebound (Off: Def:1)  PHL
5     (11:02) [MIA] Haslem Jump Shot: Missed  MIA
6  (11:00) [MIA] Haslem Rebound (Off:1 Def:)  MIA
7    (10:57) [MIA] Haslem Layup Shot: Missed  MIA
8 (10:56) [PHL] Coleman Rebound (Off: Def:1)  PHL
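
The difference between the whole match and the capture group can be seen side by side in a small sketch. It uses three of the question's strings in a stand-alone vector x (a name chosen here only for illustration) and assumes stringr is installed.

```r
library(stringr)

x <- c("(12:00)   Start Period",
       "(11:48) [MIA 3-] Wade Layup Shot: Missed",
       "(11:46) [PHL] Thomas Rebound (Off: Def:1)")

# str_extract() returns only the whole match, so the "[" is still there
full <- str_extract(x, "\\[[A-Z]{3}")

# str_match() returns a character matrix:
# column 1 is the whole match, column 2 is the first capture group
m <- str_match(x, "\\[([A-Z]{3})")
team <- m[, 2]

print(full)
print(team)
```

This is also why the attempt in the question came close: str_extract with "\\[(.*)\\]" finds the right place but hands back the full bracketed text rather than just the group.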

score 1 · Accepted Answer

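Here is another option: it looks for 3 non-digit characters right after the opening square bracket and puts them in a new column called Team: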

library(tidyverse)

df %>% mutate(Team = str_extract(Description, "(?<=\\[)\\D{3}"))
#>                                  Description thisYVal thisPts Team
#> 1                     (12:00)   Start Period        0       0 <NA>
#> 2        (12:00)   Jump Ball Thomas vs Grant        0       0 <NA>
#> 3   (11:48) [MIA 3-] Wade Layup Shot: Missed        0       2  MIA
#> 4  (11:46) [PHL] Thomas Rebound (Off: Def:1)        0       0  PHL
#> 5     (11:02) [MIA] Haslem Jump Shot: Missed      -19       2  MIA
#> 6  (11:00) [MIA] Haslem Rebound (Off:1 Def:)        0       0  MIA
#> 7    (10:57) [MIA] Haslem Layup Shot: Missed        0       2  MIA
#> 8 (10:56) [PHL] Coleman Rebound (Off: Def:1)        0       0  PHL

Created on 2018-09-09 by the reprex package (v0.2.0).
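
One thing to keep in mind is that \D matches any non-digit, including lowercase letters, spaces and punctuation. The sketch below compares it with a stricter [A-Z]{3} class. The second string is a hypothetical row invented for illustration; nothing like it appears in the question's data, so this only matters if such rows can occur.

```r
library(stringr)

x <- c("(11:48) [MIA 3-] Wade Layup Shot: Missed",
       "(9:30) [Timeout] Official")  # hypothetical row, not from the question

# Any three non-digits after "[": also picks up "Tim"
loose <- str_extract(x, "(?<=\\[)\\D{3}")

# Exactly three capital letters after "[": the hypothetical row gives NA
strict <- str_extract(x, "(?<=\\[)[A-Z]{3}")

print(loose)
print(strict)
```
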
