r - Mapping multiple IDs using R

Question

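The idea is as follows. Each patient has a unique patient ID, which we call hidenic_id. However, the same patient may be admitted to the hospital several times. Each entry, on the other hand, has a unique emtek_id.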

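Patient 110380 was admitted on April 14, 2001 at 11:08, then transferred and discharged on April 24, 2001 at 18:16. This patient was admitted again on May 11, 2001 at 23:24, so he now has a different emtek_id. He was discharged on May 25, 2001 at 16:26. So you need to assign the correct emtek_ids by checking the dates. If the date in the merged file falls within the admission-to-discharge period (or is very close to it, within 24 hours), we can assign that emtek_id.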

How can I assign the different emtek_ids to entries that have a hidenic_id and an admit time?

score 1 · Accepted Answer

I have a few ideas worth sharing.

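First, make the emtek_id from the hidenic_id and the date. Second, make the emtek_id logical to parse, e.g. emtek_id@dataTime. Third, make the database a global vector. Depending on memory limits, there has to be a faster way than this, but it might give you a few ideas.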
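To illustrate the second idea, here is a minimal sketch of building an ID of the form hidenic_id@datetime and splitting it back into its parts. The helper names build_emtek_id and parse_emtek_id are hypothetical, and the "YYYY-MM-DD HH:MM" timestamp layout and the UTC time zone are assumptions made for the example.

```r
# Hypothetical helpers; the "ID@YYYY-MM-DD HH:MM" layout is an assumption.
build_emtek_id <- function(hidenic_id, time) {
  paste0(hidenic_id, "@", format(time, "%Y-%m-%d %H:%M"))
}

parse_emtek_id <- function(emtek_id) {
  # Locate the first "@" and split the string around it
  pos <- regexpr("@", emtek_id, fixed = TRUE)
  data.frame(
    hidenic_id = substr(emtek_id, 1, pos - 1),
    time = as.POSIXct(substr(emtek_id, pos + 1, nchar(emtek_id)),
                      format = "%Y-%m-%d %H:%M", tz = "UTC"),
    stringsAsFactors = FALSE
  )
}

admit  <- as.POSIXct("2001-04-14 11:08", tz = "UTC")
id     <- build_emtek_id("ID110380", admit)
parsed <- parse_emtek_id(id)
```

Because the patient ID and the admission time can both be recovered from the string, no separate lookup table is needed to go from an emtek_id back to its patient.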

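The main problems are handling NA values and incorrect hidenic_ids, validating the hidenic_id(s), and padding the IDs if you don't have a leading character (which would be a quick fix). Finally, how do you want to handle input that is incorrect but not NA/null? For example, say you enter "ID" instead of "ID12345": do you want to treat that as a call to assign a new value, or prompt for either a correct input or an NA value? I will assume you only feed it correct ID inputs or NA values, but that is my trivializing assumption.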
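One possible way to handle these input cases is sketched below. The function name normalize_id is hypothetical, and the rule that a valid ID is "ID" followed by exactly five digits is an assumption taken from the "ID12345" example; malformed input raises an error here, though prompting again would be an equally reasonable choice.

```r
# Hypothetical validator: assumes a valid ID is "ID" followed by exactly 5 digits.
normalize_id <- function(x) {
  # Treat a real NA and the typed text "NA" the same way
  if (is.na(x) || identical(x, "NA")) return(NA_character_)
  # Bare digits: add the leading "ID" and zero-pad to 5 digits
  if (grepl("^[0-9]{1,5}$", x)) x <- sprintf("ID%05d", as.integer(x))
  # Anything else that does not match the pattern is rejected
  if (!grepl("^ID[0-9]{5}$", x)) stop("malformed hidenic_id: ", x)
  x
}

normalize_id("ID12345")  # already valid, returned unchanged
normalize_id("123")      # padded and prefixed
normalize_id("NA")       # becomes a real NA
```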

Here is some pseudo-code to get the idea started. You choose how to store the data (e.g. a csv file, then use data.table::fread()):

# this file's name is "make.hidenic_id.R"
library(data.table)
library(stringr)
set.seed(101)
# one might want a backup written, perhaps conditionally updating it every hour or so.
# assumes each csv holds a single column of IDs; [[1]] extracts it as a vector
database.hidenic_id <- as.character(data.table::fread("database.filename.hidenic_id.csv")[[1]])
database.emtek_id   <- as.character(data.table::fread("database.filename.emtek_id.csv")[[1]])

make.hidenic_Id <- function(in.hidenic_id) {
  if (is.na(in.hidenic_id) || !(in.hidenic_id %in% database.hidenic_id)) {
    # draw random IDs until one is not already in the database
    repeat {
      new.hidenic_id <- paste0("ID", str_pad(sample.int(99999, 1), 5, pad = "0"))
      if (!(new.hidenic_id %in% database.hidenic_id)) break
    }
    # make new emtek_id: <hidenic_id>@<YYYY-MM-DD HH:MM>
    new.emtek_id <- paste0(new.hidenic_id, "@", format(Sys.time(), "%Y-%m-%d %H:%M"))

    # update both global databases
    database.hidenic_id <<- c(database.hidenic_id, new.hidenic_id)
    database.emtek_id   <<- c(database.emtek_id, new.emtek_id)
  } else {
    new.emtek_id <- paste0(in.hidenic_id, "@", format(Sys.time(), "%Y-%m-%d %H:%M"))
    # update database.emtek_id
    database.emtek_id <<- c(database.emtek_id, new.emtek_id)
  }
  return(new.emtek_id)
}

temp <- readline(prompt = "Enter hidenic_id OR type \"NA\": ")
# readline() returns text, so turn the typed string "NA" into a real NA
if (identical(temp, "NA")) temp <- NA_character_
make.hidenic_Id(temp)
# fwrite() expects a table, so wrap each vector in a one-column data.table
data.table::fwrite(data.table(emtek_id = database.emtek_id), "database.filename.emtek_id.csv")
data.table::fwrite(data.table(hidenic_id = database.hidenic_id), "database.filename.hidenic_id.csv")

and call the file with

source("make.hidenic_id.R")

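I'm not following many "good practices" for managing bad input data or optimizing the search, but this is a reasonable start. Some other good practices would be to use longer integers or a different leading string, but you never said we could use the input value to make the ID.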

You could say this was inspired by the census, since every geographic ID variable there is just one giant string.

score 0 · Accepted Answer

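I was interested in your problem, so I created some mock data and tried to solve it, but I ran into some confusion myself and then posted my own question, which I think is what you are asking, only more general. You can see the responses here: How to tell if a time point exists between a set of before and after times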

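My post generates what I believe you are starting with, and the checked answer is what I believe you are looking for. The full code is below. You will need to install zoo and IRanges. Also, I did this in R version 2.15.3; IRanges did not install properly in 3.0.0.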

## package installation
## biocLite() matches the R versions mentioned above; on current R versions use
## install.packages("BiocManager"); BiocManager::install("IRanges") instead
source("http://bioconductor.org/biocLite.R")
biocLite("IRanges")
install.packages("zoo")


## generate the emtek and hidenic file data
library(zoo)
date_string <- paste("2001", sample(12, 10, replace = TRUE), sample(28, 10), sep = "-")
time_string <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26",
                 "23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")

## tz = "UTC" avoids NA times at daylight-saving gaps in the local time zone
entry_emtek <- strptime(paste(date_string, time_string), "%Y-%m-%d %H:%M:%S", tz = "UTC")
entry_emtek <- entry_emtek[order(entry_emtek)]
exit_emtek <- entry_emtek + 3600 * 24
emtek_file <- data.frame(emtek_id = 1:10, entry_emtek, exit_emtek)

hidenic_id <- 110380:110479
date_string <- paste("2001", sample(12, 100, replace = TRUE), sample(28, 100, replace = TRUE), sep = "-")
time_string <- rep(c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26",
                     "23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26"), 10)
hidenic_time <- strptime(paste(date_string, time_string), "%Y-%m-%d %H:%M:%S", tz = "UTC")
hidenic_time <- hidenic_time[order(hidenic_time)]
hidenic_file <- data.frame(hidenic_id, hidenic_time)

## Find the intersection of emtek and hidenic times.  This part was done by user: agstudy
library(IRanges)
## create the time intervals (entry to exit)
subject <- IRanges(as.numeric(emtek_file$entry_emtek),
                   as.numeric(emtek_file$exit_emtek))
## create zero-width time intervals (start = end here)
query <- IRanges(as.numeric(hidenic_file$hidenic_time),
                 as.numeric(hidenic_file$hidenic_time))
## find overlaps once and extract rows (both time points and intervals)
hits <- findOverlaps(query, subject)
emt.ids <- subjectHits(hits)
hid.ids <- queryHits(hits)
cbind(hidenic_file[hid.ids, ], emtek_file[emt.ids, ])
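
The overlap above matches on time only. As a rough sketch of how the patient ID and the 24-hour tolerance from the question could also be taken into account, the base R example below uses the admission and discharge times of patient 110380 from the question. The column names, the emtek_id values "E1" and "E2", and the entry times are made up for illustration.

```r
# Stays: one row per admission (emtek_id values are made up)
stays <- data.frame(
  hidenic_id = c(110380, 110380),
  emtek_id   = c("E1", "E2"),
  admit      = as.POSIXct(c("2001-04-14 11:08", "2001-05-11 23:24"), tz = "UTC"),
  discharge  = as.POSIXct(c("2001-04-24 18:16", "2001-05-25 16:26"), tz = "UTC"),
  stringsAsFactors = FALSE
)

# Entries to label: patient ID plus a timestamp (made-up times)
entries <- data.frame(
  hidenic_id = c(110380, 110380, 110380),
  event_time = as.POSIXct(c("2001-04-20 09:00", "2001-05-11 08:00",
                            "2001-06-30 12:00"), tz = "UTC")
)

tolerance <- 24 * 3600  # 24 hours, in seconds

# Pair each entry with every stay of the same patient
entries$row <- seq_len(nrow(entries))
pairs <- merge(entries, stays, by = "hidenic_id")

# Keep pairs where the entry falls inside the stay, widened by the tolerance
inside <- pairs$event_time >= pairs$admit - tolerance &
          pairs$event_time <= pairs$discharge + tolerance
hits <- pairs[inside, c("row", "emtek_id")]

# Entries with no matching stay get NA; if several stays match, the first is used
entries$emtek_id <- hits$emtek_id[match(entries$row, hits$row)]
entries
```

The second entry falls a few hours before the second admission and is still assigned to it because of the tolerance, while the third entry matches no stay and is left as NA. Pairing every entry with every stay of a patient is fine for small data, but for large files an interval-based approach such as the IRanges code above is likely to scale better.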
