r - Filter a tibble column to include only values found in a separate tibble

Question


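I have a tibble containing basic stock symbol information (available as a .csv file here: https://www.nasdaq.com/market-activity/stocks/screener).

How can I filter this tibble (call it symbolData) so that it contains only the companies listed in a second, much smaller tibble (call it DowJones)? Note that company names are not exactly the same between the two datasets (e.g. "Apple Inc. - Common Stock" in symbolData vs. "Apple Inc." in DowJones).

Reprex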

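#packages
library(dplyr)
library(tibble)
library(httr)
library(utils)
library(reshape2)
library(xml2)
library(rvest)
library(stringr)  # str_detect() / str_remove_all() used further down
library(purrr)    # map2_lgl() used in one of the attempts

# strip the whitespace-plus-arrow marker from a company name
remove_arrows <- function(x) {sub("[[:space:]]↑", "", x)}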

DowJones <- "https://en.wikipedia.org/wiki/Historical_components_of_the_Dow_Jones_Industrial_Average" %>% 
  GET(config = config(ssl_verifypeer = FALSE)) %>%
  read_html() %>%
  html_node(".wikitable") %>%
  html_table(fill = TRUE) %>%
  as_tibble() %>%
  filter(!grepl('↓|Dropped', X1)) %>%
  rowid_to_column("index") %>% 
  melt(id.vars="index", value.name="Dow Jones Industrial Average") %>%
  select(-c("variable","index")) %>%
  mutate(across("Dow Jones Industrial Average", remove_arrows)) %>% as_tibble()

symbolData <- read.csv("~/nasdaq_screener.csv") %>% as_tibble()

> head(DowJones)
# A tibble: 6 × 1
  `Dow Jones Industrial Average`
  <chr>                         
1 3M Company                    
2 American Express Company      
3 Amgen Inc.                    
4 Apple Inc.                    
5 The Boeing Company            
6 Caterpillar Inc.

> head(symbolData)  
# A tibble: 6 × 11
  Symbol Name                                                   Last.Sale  Net.Change X..Change  Market.Cap Country         IPO.Year  Volume Sector Industry
  <chr>  <chr>                                                  <chr>           <dbl> <chr>           <dbl> <chr>              <int>   <int> <chr>  <chr>   
1 A      "Agilent Technologies Inc. Common Stock"               "$133.73 "     5.58   4.35%     40167959890 "United States"     1999 3144474 "Capi… "Electr…
2 AA     "Alcoa Corporation Common Stock "                      "$77.85 "      4.55   6.21%     14332165382 ""                  2016 7327361 "Basi… "Metal …
3 AAC    "Ares Acquisition Corporation Class A Ordinary Shares" "$9.76 "       0.01   0.10%      1220000000 ""                  2021   99883 "Fina… "Busine…
4 AACG   "ATA Creativity Global American Depositary Shares"     "$1.36 "       0.02   1.49%        42672611 "China"               NA    7920 "Misc… "Servic…
5 AACI   "Armada Acquisition Corp. I Common Stock"              "$9.81 "       0.01   0.10%       203160195 "United States"     2021     264 ""     ""      
6 AACIW  "Armada Acquisition Corp. I Warrant"                   "$0.23 "      -0.0599 -20.66%             0 "United States"     2021  184363 ""     ""

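Previous attempts

I have tried many approaches, including %in% / %chin%, grep / grepl, agrepl / agrep, str_detect, converting the DowJones data frame to a list, and various others I can't remember. Everything I have tried so far has returned either an empty tibble or an error message about differing column lengths. Some examples: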

filter(symbolData, sapply(1:nrow(.), function(i) grepl(DowJones$`Dow Jones Industrial Average`[i], symbolData$Security.Name[i])))
#returns empty tibble

filter(symbolData, str_detect(symbolData$Security.Name, DowJones$`Dow Jones Industrial Average`) == TRUE)
Warning message:
In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
  longer object length is not a multiple of shorter object length
#returns empty tibble

filter(symbolData, unlist(Map(function(x, y) grepl(x, y), DowJones$`Dow Jones Industrial Average`, symbolData$Security.Name)))
Warning message:
In mapply(FUN = f, ..., SIMPLIFY = FALSE) :
  longer argument not a multiple of length of shorter
#returns empty tibble

filter(symbolData, map2_lgl(symbolData$Security.Name, DowJones$`Dow Jones Industrial Average`,  str_detect))
Error: Problem with `filter()` input `..1`.
ℹ Input `..1` is `map2_lgl(...)`.
x Mapped vectors must have consistent lengths:
* `.x` has length 5587
* `.y` has length 30

filter(symbolData, agrepl(DowJones$`Dow Jones Industrial Average`, symbolData$Security.Name, ignore.case = T, fixed = F))
#returns empty tibble
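
One likely reason these attempts come back empty is that they compare the two vectors pairwise: element i of the short vector (recycled as needed) is tested only against element i of the long one, so a company is found only if it happens to sit at a matching position. The sketch below illustrates this with made-up stand-in vectors (`security_names` and `dow_names` are hypothetical, not the real data) and contrasts it with an any-to-any comparison.

```r
# Toy stand-ins for the real columns (hypothetical values, illustration only)
security_names <- c("Apple Inc. - Common Stock",
                    "Amgen Inc. - Common Stock",
                    "Zynga Inc. - Class A Common Stock",
                    "Intel Corporation - Common Stock")
dow_names <- c("Amgen Inc.", "Apple Inc.")

# Pairwise comparison: dow_names is recycled, so each security name is
# tested against exactly one Dow name, chosen by position.
pairwise <- mapply(function(pattern, x) grepl(pattern, x, fixed = TRUE),
                   dow_names, security_names, USE.NAMES = FALSE)

# Any-to-any comparison: each security name is tested against every
# Dow name and kept if at least one of them occurs in it.
matches_any <- vapply(
  security_names,
  function(x) any(vapply(dow_names, grepl, logical(1), x = x, fixed = TRUE)),
  logical(1),
  USE.NAMES = FALSE
)

security_names[pairwise]     # no rows survive the pairwise test
security_names[matches_any]  # the Apple and Amgen entries survive
```

`fixed = TRUE` treats each company name as a literal string, so the dots in "Inc." are not interpreted as regex wildcards.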

score 0 · Accepted Answer

How many should it return? This returns 7:

symbolData %>% filter(str_remove_all(Security.Name, " - .*") %in% DowJones$`Dow Jones Industrial Average`)

# A tibble: 7 x 8
  Symbol Security.Name                                 Market.Category Test.Issue Financial.Status Round.Lot.Size ETF   NextShares
  <chr>  <chr>                                         <chr>           <chr>      <chr>                     <int> <chr> <chr>     
1 AAPL   Apple Inc. - Common Stock                     Q               N          N                           100 N     N         
2 AMGN   Amgen Inc. - Common Stock                     Q               N          N                           100 N     N         
3 CSCO   Cisco Systems, Inc. - Common Stock            Q               N          N                           100 N     N         
4 HON    Honeywell International Inc. - Common Stock   Q               N          N                           100 N     N         
5 INTC   Intel Corporation - Common Stock              Q               N          N                           100 N     N         
6 MSFT   Microsoft Corporation - Common Stock          Q               N          N                           100 N     N         
7 WBA    Walgreens Boots Alliance, Inc. - Common Stock Q               N          N                           100 N     N
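
The line above depends on the " - " separator in Security.Name. The `head(symbolData)` output in the question shows names without that separator (e.g. "Agilent Technologies Inc. Common Stock"), so a prefix match may be a more forgiving alternative. This is only a sketch under assumptions: `symbol_sample`, `dow_sample`, and the helper `starts_with_any()` are hypothetical names introduced for illustration, and the sample rows are made up.

```r
# Hypothetical sample rows covering both name formats seen in this thread
symbol_sample <- data.frame(
  Symbol = c("AAPL", "AMGN", "A", "AACI"),
  Name = c("Apple Inc. - Common Stock",
           "Amgen Inc. Common Stock",
           "Agilent Technologies Inc. Common Stock",
           "Armada Acquisition Corp. I Common Stock"),
  stringsAsFactors = FALSE
)
dow_sample <- c("3M Company", "Amgen Inc.", "Apple Inc.")

# Hypothetical helper: TRUE for each element of x that begins with
# at least one of the strings in prefixes.
starts_with_any <- function(x, prefixes) {
  vapply(x, function(name) any(startsWith(name, prefixes)),
         logical(1), USE.NAMES = FALSE)
}

# Keep only the rows whose Name starts with one of the index names
dow_rows <- symbol_sample[starts_with_any(symbol_sample$Name, dow_sample), ]
dow_rows$Symbol
```

With dplyr, the same helper could be used as filter(symbolData, starts_with_any(Name, DowJones$`Dow Jones Industrial Average`)). Prefix matching can over-match when one company name is the beginning of another, and it still misses names that differ by more than a suffix, so the result is worth checking by eye.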
