
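I am trying to change the unit of analysis of a dataset from reported incidents to the individuals who reported them. Since the same individual may have reported more than once, I used the compare.dedup function from R's RecordLinkage package to identify matching pairs, that is, pairs of incidents reported by the same person. However, I am struggling to export all of the pairs into a single dataset for further analysis.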

Here is the code for some dummy data:

# Dummy data: six reported incidents, some of them filed by the same person.
# Built with data.frame() and factor() so that no factor has duplicated levels.
incidents <- data.frame(
  Date = as.Date(c("01-02-2014", "02-02-2014", "02-02-2014",
                   "03-02-2014", "04-02-2014", "05-02-2014"),
                 format = "%d-%m-%Y"),
  Name = factor(c("Dave", "Joe", "David", "Joseph", "Jo", "Dave")),
  Surname = factor(c("Evans", "Miles", "Evans", "Myles", "Doe", "Evans")),
  Sex = factor(c("Male", "Male", "Male", "Male", "Female", "Male"),
               levels = c("Male", "Female")),
  DOB = as.Date(c("14-02-1988", "01-05-1987", "14-02-1988",
                  "01-05-1987", "04-02-1999", "14-02-1988"),
                format = "%d-%m-%Y")
)

When printed, incidents looks like this:

        Date   Name Surname    Sex        DOB
1 2014-02-01   Dave   Evans   Male 1988-02-14
2 2014-02-02    Joe   Miles   Male 1987-05-01
3 2014-02-02  David   Evans   Male 1988-02-14
4 2014-02-03 Joseph   Myles   Male 1987-05-01
5 2014-02-04     Jo     Doe Female 1999-02-04
6 2014-02-05   Dave   Evans   Male 1988-02-14

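I have managed to print each pair on a single row, but what I am after is to cluster all matching records into one row per person (see below).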

I ran the following code to identify and extract the matching pairs:

library(RecordLinkage)

# Generate the candidate pairs
pairs <- compare.dedup(incidents,
                       identity = NA,
                       blockfld = FALSE,
                       phonetic = c(2),            # phonetic comparison on Name
                       phonfun = pho_h,
                       strcmp = c(3, 4, 5),        # string comparison on Surname, Sex, DOB
                       strcmpfun = levenshteinSim, # Levenshtein-based similarity
                       exclude = c(1))             # leave the report Date out of the comparison

# Generate the weights
weightedpairs <- emWeights(pairs, cutoff = 0.7)

# Classify the pairs
emresult <- emClassify(weightedpairs)

I can get the linked pairs with each pair on a single row:

links <- getPairs(emresult, show = "links", single.rows = TRUE)

links

    id1     Date.1 Name.1 Surname.1 Sex.1      DOB.1 id2     Date.2 Name.2 Surname.2 Sex.2      DOB.2    Weight
1.1   1 2014-02-01   Dave     Evans  Male 1988-02-14   6 2014-02-05   Dave     Evans  Male 1988-02-14 20.876240
1     1 2014-02-01   Dave     Evans  Male 1988-02-14   3 2014-02-02  David     Evans  Male 1988-02-14 10.208543
3     3 2014-02-02  David     Evans  Male 1988-02-14   6 2014-02-05   Dave     Evans  Male 1988-02-14 10.208543
2     2 2014-02-02    Joe     Miles  Male 1987-05-01   4 2014-02-03 Joseph     Myles  Male 1987-05-01  9.886615

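However, what I want to achieve is to merge all of the matches, so that I end up with one row per individual, with that person's reports ordered by report date. More or less like this: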

        Date   Name Surname    Sex        DOB      Date2  Name2 Surname2 Sex2       DOB2      Date3 Name3 Surname3 Sex3       DOB3
1 2014-02-01   Dave   Evans   Male 1988-02-14 2014-02-02  David    Evans Male 1988-02-14 2014-02-05  Dave    Evans Male 1988-02-14
2 2014-02-02    Joe   Miles   Male 1987-05-01 2014-02-03 Joseph    Myles Male 1987-05-01
3 2014-02-04     Jo     Doe Female 1999-02-04         NA     NA       NA

I was wondering if anyone has suggestions on how to achieve this?

Thanks in advance!
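
For reference, here is a minimal base-R sketch of one way the clustering described above might be approached. It is not a RecordLinkage feature: it treats each linked pair as an edge between two records, groups records that are connected (directly or through other records) into one person, and then spreads each person's reports across columns with reshape(). The names link_ids, person, report and wide are made up for illustration, the id pairs are copied by hand from the id1 and id2 columns of the links output in the question, and character columns are used instead of factors to keep the example small.

```r
# Toy copy of the dummy data from the question (character columns for simplicity)
incidents <- data.frame(
  Date = as.Date(c("2014-02-01", "2014-02-02", "2014-02-02",
                   "2014-02-03", "2014-02-04", "2014-02-05")),
  Name = c("Dave", "Joe", "David", "Joseph", "Jo", "Dave"),
  Surname = c("Evans", "Miles", "Evans", "Myles", "Doe", "Evans"),
  Sex = c("Male", "Male", "Male", "Male", "Female", "Male"),
  DOB = as.Date(c("1988-02-14", "1987-05-01", "1988-02-14",
                  "1987-05-01", "1999-02-04", "1988-02-14")),
  stringsAsFactors = FALSE
)

# Assumption: id pairs copied from the id1/id2 columns of the links table;
# in practice they could be taken as links[, c("id1", "id2")]
link_ids <- data.frame(id1 = c(1, 1, 3, 2), id2 = c(6, 3, 6, 4))

# Step 1: every record starts as its own person. Each pass pulls both ends of
# every link down to the smaller label, until a full pass changes nothing.
person <- seq_len(nrow(incidents))
repeat {
  previous <- person
  for (k in seq_len(nrow(link_ids))) {
    a <- link_ids$id1[k]
    b <- link_ids$id2[k]
    lowest <- min(person[a], person[b])
    person[a] <- lowest
    person[b] <- lowest
  }
  if (identical(person, previous)) break
}

# Step 2: number each person's reports in date order
incidents$person <- person
incidents <- incidents[order(incidents$person, incidents$Date), ]
incidents$report <- ave(seq_len(nrow(incidents)), incidents$person,
                        FUN = seq_along)

# Step 3: one row per person, with columns Date1, Name1, ..., Date2, Name2, ...
wide <- reshape(incidents, idvar = "person", timevar = "report",
                direction = "wide", sep = "")
wide
```

The first report's columns come out suffixed with 1 (Date1, Name1, and so on) rather than unsuffixed as in the layout above, and people with fewer reports get NA in the unused columns. Because the grouping follows chains of links, a single wrong link would merge two people, so the classification threshold still deserves a manual check.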
