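I'm wondering whether anyone can point me to tools, packages, or code for detecting changes in the peer groups used for relative performance evaluation.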

I have a data frame that contains all the peers used by a given company (CIK) over the years. An example of the data is given below:

CIK <- c("21344","21344", "21344", "21344", "21344", "21344", "21344", "21344", "21344")
FiscalYear <- c("2013", "2014", "2015", "2016", "2017", "2014", "2015", "2016", "2017")
PeerCIK <- c("1800","1800","1800","1800","1800","21456","21456","21456","21456")
dataframe <- data.frame(CIK, FiscalYear, PeerCIK)

This results in the following table:

    CIK FiscalYear PeerCIK
1 21344       2013    1800
2 21344       2014    1800
3 21344       2015    1800
4 21344       2016    1800
5 21344       2017    1800
6 21344       2014   21456
7 21344       2015   21456
8 21344       2016   21456
9 21344       2017   21456

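Now I want to determine whether a peer (PeerCIK) is present for the entire period covered by the company (CIK). So I first need to determine the first and last year for each CIK (in this example it is obvious, 2013-2017, but I need to do this for many companies). The code I use for this is: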

library(dplyr)

# First and last fiscal year observed for each company.
df2 <- dataframe %>%
    group_by(CIK) %>% 
    summarise(
        start = min(FiscalYear), 
        end = max(FiscalYear)
    )
> df2
    CIK start  end
1 21344  2013 2017

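Next, I need to determine whether all the distinct peers are present throughout that period. If they are not, there must have been a change in the peer group (a peer was added to or removed from the group). This is where I don't know how to proceed. What I ultimately want is a data frame that shows, for each company (CIK) and each fiscal year, whether the peer group changed compared with the previous year (a dummy variable equal to 1 if it changed). So a change occurs when a peer is added after the start year, or when a peer is no longer included even though the end year for that particular CIK has not yet been reached.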

Expected result

For the example above, I would get the following result, because company 21456 is added from 2014 onward, which is a change compared with 2013:

    CIK FiscalYear change
1 21344       2013      0
2 21344       2014      1
3 21344       2015      0
4 21344       2016      0
5 21344       2017      0

I really hope someone can help me; please let me know.

1 Answer

A slightly different approach using expand(), full_join(), and a few helper variables. It should cover most of your edge cases:

library(tidyverse)
dataframe %>%
    # Add helper variable to indicate present relationships.
    mutate(
        present = 1
    ) %>%
    # Generate all possible combinations of CIK, FiscalYear, and PeerCIK
    # and join with our data.
    full_join(
        dataframe %>% expand(CIK, FiscalYear, PeerCIK),
        by = c("CIK", "FiscalYear", "PeerCIK")
    ) %>%
    # Set the helper variable to 0 wherever it is missing,
    # which is the case in your newly joined empty data from `expand(...)`.
    mutate(
        present = ifelse(is.na(present), 0, present)
    ) %>%
    # Sort the data because now the order will be important.
    arrange(CIK, PeerCIK, FiscalYear) %>%
    # Group by CIK-PeerCIK relationship...
    group_by(
        CIK, PeerCIK
    ) %>%
    # ...and compare each FiscalYear to the previous FiscalYear. 
    mutate(
        # Check if a relationship was added compared to the year before.
        added = case_when(
            row_number() == 1 ~ 0,
            lag(present) == 0 & present == 1 ~ 1, 
            TRUE ~ 0
        ),
        # Check if a relationship was removed compared to the year before.
        removed = case_when(
            row_number() == 1 ~ 0,
            lag(present) == 1 & present == 0 ~ 1, 
            TRUE ~ 0
        ),
        # Combine those two into one variable.
        change = ifelse(abs(added) + abs(removed) > 0, 1, 0)
    ) %>%
    ungroup() %>%
    # Now to the summary: Group by CIK and FiscalYear...
    group_by(
        CIK, FiscalYear
    ) %>%
    # ...and calculate all sums for each CIK and FiscalYear.
    summarize(
        # Total number of present relationships in this year.
        num_present = sum(present),
        # Number of added relationships in this year.
        num_added = sum(added),
        # Number of removed relationships in this year.
        num_removed = sum(removed),
        # Was there any change in this year?
        # An alternative would be `sum(change)` to
        # indicate the number of changed relationships.
        change = max(change)
    ) %>%
    ungroup()

Result:

# A tibble: 5 × 6
  CIK   FiscalYear num_present num_added num_removed change
  <chr> <chr>            <dbl>     <dbl>       <dbl>  <dbl>
1 21344 2013                 1         0           0      0
2 21344 2014                 2         1           0      1
3 21344 2015                 2         0           0      0
4 21344 2016                 2         0           0      0
5 21344 2017                 2         0           0      0
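
As a cross-check, the same change flag can be sketched without any joins by comparing each year's set of peers with the set from the previous observed year. This is only a minimal base R sketch under stated assumptions: peer_set_changes is a hypothetical helper name (not part of any package), years are assumed to be four-digit strings so that they sort correctly as text, and "previous year" here means the previous year that appears in the data for that company.

```r
# Example data from the question.
dataframe <- data.frame(
    CIK = rep("21344", 9),
    FiscalYear = c("2013", "2014", "2015", "2016", "2017",
                   "2014", "2015", "2016", "2017"),
    PeerCIK = c(rep("1800", 5), rep("21456", 4)),
    stringsAsFactors = FALSE
)

# Hypothetical helper: for each company, flag the years whose set of
# peers differs from the set in the previous observed year.
peer_set_changes <- function(df) {
    per_company <- lapply(split(df, as.character(df$CIK)), function(d) {
        # One vector of peers per fiscal year; split() orders the
        # years by their sorted unique values.
        peers_by_year <- split(as.character(d$PeerCIK),
                               as.character(d$FiscalYear))
        years <- names(peers_by_year)
        change <- vapply(seq_along(years), function(i) {
            # The first year has nothing to compare against.
            if (i == 1) return(0)
            # setequal() ignores order and duplicates.
            as.numeric(!setequal(peers_by_year[[years[i]]],
                                 peers_by_year[[years[i - 1]]]))
        }, numeric(1))
        data.frame(
            CIK = as.character(d$CIK[1]),
            FiscalYear = years,
            change = change,
            stringsAsFactors = FALSE
        )
    })
    result <- do.call(rbind, per_company)
    rownames(result) <- NULL
    result
}

result <- peer_set_changes(dataframe)
print(result)
```

Because the data is split by CIK first, each company is only compared across its own years, whereas expand(CIK, FiscalYear, PeerCIK) above crosses every company, year, and peer found in the whole data frame. With many companies it may be worth comparing the two results. Note that this sketch only reports whether the set changed, not how many peers were added or removed, and a year in which a company has no peers at all does not appear in the output.
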
Answered 2021-11-13T04:47:52.937