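Problem
I have a tibble of basic stock ticker information (available as a .csv file here: https://www.nasdaq.com/market-activity/stocks/screener).
How can I filter this tibble (call it symbolData) to keep only the companies listed in a second, much smaller tibble (call it DowJones)? Note that company names are not exactly the same across the two datasets (e.g. "Apple Inc. - Common Stock" in symbolData versus "Apple Inc." in DowJones).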
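To make the mismatch concrete, here is a minimal sketch with toy data. The objects toySymbols and toyDow and their values are hypothetical stand-ins, not the real datasets; they only illustrate why an exact comparison cannot find any company.

```r
# Toy stand-ins for symbolData and DowJones (hypothetical values)
toySymbols <- data.frame(
  Symbol = c("AAPL", "MMM", "AACG"),
  Name   = c("Apple Inc. - Common Stock",
             "3M Company - Common Stock",
             "ATA Creativity Global American Depositary Shares"),
  stringsAsFactors = FALSE
)
toyDow <- data.frame(
  `Dow Jones Industrial Average` = c("Apple Inc.", "3M Company"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Exact matching: no name is identical in both sets, so every entry is FALSE
exact_hits <- toySymbols$Name %in% toyDow$`Dow Jones Industrial Average`

# The rows the filter is supposed to keep
wanted <- c("AAPL", "MMM")
```

The goal is a filter that returns the rows in wanted even though exact_hits is FALSE throughout.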
Reprex
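#packages
library(dplyr)
library(tibble)
library(httr)
library(utils)
library(reshape2)
library(xml2)
library(rvest)
library(stringr) # str_detect(), used in the attempts below
library(purrr)   # map2_lgl(), used in the attempts below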
remove_arrows <- function(x) {sub("[[:space:]]↑", "", x)}

DowJones <- "https://en.wikipedia.org/wiki/Historical_components_of_the_Dow_Jones_Industrial_Average" %>%
  GET(config = config(ssl_verifypeer = FALSE)) %>%
  read_html() %>%
  html_node(".wikitable") %>%
  html_table(fill = TRUE) %>%
  as_tibble() %>%
  filter(!grepl('↓|Dropped', X1)) %>%
  rowid_to_column("index") %>%
  melt(id.vars="index", value.name="Dow Jones Industrial Average") %>%
  select(-c("variable","index")) %>%
  mutate(across("Dow Jones Industrial Average", remove_arrows)) %>%
  as_tibble()

symbolData <- read.csv("~/nasdaq_screener.csv") %>% as_tibble()
> head(DowJones)
# A tibble: 6 × 1
`Dow Jones Industrial Average`
<chr>
1 3M Company
2 American Express Company
3 Amgen Inc.
4 Apple Inc.
5 The Boeing Company
6 Caterpillar Inc.
> head(symbolData)
# A tibble: 6 × 11
Symbol Name Last.Sale Net.Change X..Change Market.Cap Country IPO.Year Volume Sector Industry
<chr> <chr> <chr> <dbl> <chr> <dbl> <chr> <int> <int> <chr> <chr>
1 A "Agilent Technologies Inc. Common Stock" "$133.73 " 5.58 4.35% 40167959890 "United States" 1999 3144474 "Capi… "Electr…
2 AA "Alcoa Corporation Common Stock " "$77.85 " 4.55 6.21% 14332165382 "" 2016 7327361 "Basi… "Metal …
3 AAC "Ares Acquisition Corporation Class A Ordinary Shares" "$9.76 " 0.01 0.10% 1220000000 "" 2021 99883 "Fina… "Busine…
4 AACG "ATA Creativity Global American Depositary Shares" "$1.36 " 0.02 1.49% 42672611 "China" NA 7920 "Misc… "Servic…
5 AACI "Armada Acquisition Corp. I Common Stock" "$9.81 " 0.01 0.10% 203160195 "United States" 2021 264 "" ""
6 AACIW "Armada Acquisition Corp. I Warrant" "$0.23 " -0.0599 -20.66% 0 "United States" 2021 184363 "" ""
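Previous attempts
I have tried a number of approaches, including %in% / %chin%, grep / grepl, agrepl / agrep, str_detect, turning the DowJones data frame into a list, and various other methods that I can't recall. Everything I have tried so far has either returned an empty tibble or produced an error message about differing column lengths. Some examples: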
filter(symbolData, sapply(1:nrow(.), function(i) grepl(DowJones$`Dow Jones Industrial Average`[i], symbolData$Security.Name[i])))
#returns empty tibble
filter(symbolData, str_detect(symbolData$Security.Name, DowJones$`Dow Jones Industrial Average`) == TRUE)
Warning message:
In stri_detect_regex(string, pattern, negate = negate, opts_regex = opts(pattern)) :
longer object length is not a multiple of shorter object length
#returns empty tibble
filter(symbolData, unlist(Map(function(x, y) grepl(x, y), DowJones$`Dow Jones Industrial Average`, symbolData$Security.Name)))
Warning message:
In mapply(FUN = f, ..., SIMPLIFY = FALSE) :
longer argument not a multiple of length of shorter
#returns empty tibble
filter(symbolData, map2_lgl(symbolData$Security.Name, DowJones$`Dow Jones Industrial Average`, str_detect))
Error: Problem with `filter()` input `..1`.
ℹ Input `..1` is `map2_lgl(...)`.
x Mapped vectors must have consistent lengths:
* `.x` has length 5587
* `.y` has length 30
filter(symbolData, agrepl(DowJones$`Dow Jones Industrial Average`, symbolData$Security.Name, ignore.case = T, fixed = F))
#returns empty tibble
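A likely common thread in these attempts is that they pair the two columns element by element (row i of one with row i of the other), or hand grepl() a vector of patterns when it only uses the first one. What the question calls for is a test of every ticker name against every Dow Jones name. The following is a minimal sketch of that idea on toy data, not a definitive solution. The helper matches_any and the toy vectors are hypothetical, and fixed = TRUE is an assumption chosen so that the "." in "Inc." is treated literally rather than as a regular expression.

```r
# Hypothetical helper: TRUE where a name contains at least one of the patterns
matches_any <- function(names, patterns) {
  # One logical vector per pattern, each as long as `names`
  per_pattern <- lapply(patterns, function(p) grepl(p, names, fixed = TRUE))
  # Combine them element-wise with OR
  Reduce(`|`, per_pattern)
}

# Toy data (hypothetical values, not the real datasets)
toyNames <- c("Apple Inc. - Common Stock",
              "Agilent Technologies Inc. Common Stock",
              "3M Company - Common Stock")
toyDowNames <- c("Apple Inc.", "3M Company")

keep <- matches_any(toyNames, toyDowNames)
# keep is TRUE FALSE TRUE

# Possible use with the real data, assuming the names are in a column called
# Name as head(symbolData) suggests (swap in Security.Name if that is the column):
# dplyr::filter(symbolData, matches_any(Name, DowJones$`Dow Jones Industrial Average`))
```

Substring matching like this is only as good as the names themselves: a company worded differently in the two sources would still be missed, and a short name could match more rows than intended, so the result is worth checking by eye.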