
I am trying to merge two large matrices by row.names in R with merge, but it is taking quite some time. Is it possible to parallelize merge, perhaps with the foreach package? Or are there faster solutions that do the same job?

I have 8 cores and 24 GB of RAM. Both matrices are about 1.4 GB and consist of ~900 rows and ~22000 columns.

Here is the code to reproduce a small example of my data set:

df1 <- data.frame(x = 1:3, y = 1:3, row.names = c('r1', 'r2', 'r3'))
df2 <- data.frame(z = 5:7, row.names = c('r1', 'r3', 'r7'))
dfMerged <- merge(df1, df2, by = "row.names", all = TRUE)
dfMerged[is.na(dfMerged)] <- 0

1 Answer


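The same merge should be faster in data.table. I think it should also be possible to parallelize it, but that would probably get more complicated. Here is the same merge in data.table: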

library(data.table)

# Create data.table objects; the row names become an ordinary column, var
dt1 <- data.table(x = 1:3, y = 1:3, var = c('r1', 'r2', 'r3'))
dt2 <- data.table(z = 5:7, var = c('r1', 'r3', 'r7'))

# Set merge keys
setkey(dt1, var)
setkey(dt2, var)

# Perform full outer join on the shared key
dtMerged <- merge(dt1, dt2, all = TRUE)

# Replace NAs with zeros by reference (edited for a more efficient answer suggested by Arun)
for (j in c("x", "y", "z"))
  set(dtMerged, i = which(is.na(dtMerged[[j]])), j = j, value = 0L)
dtMerged

   var x y z
1:  r1 1 1 5
2:  r2 2 2 0
3:  r3 3 3 6
4:  r7 0 0 7
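
Since the question starts from matrices with row names and ~22000 columns, here is a minimal sketch of how the same approach might be applied without typing the column names by hand. The matrices m1 and m2 are small hypothetical stand-ins for the real data, and the join column name "rn" is simply what keep.rownames = TRUE produces; I have not timed this on data of the size described.

```r
library(data.table)

# Hypothetical small matrices standing in for the real ~900 x ~22000 inputs
m1 <- matrix(c(1:3, 1:3), nrow = 3,
             dimnames = list(c("r1", "r2", "r3"), c("x", "y")))
m2 <- matrix(5:7, nrow = 3,
             dimnames = list(c("r1", "r3", "r7"), "z"))

# keep.rownames = TRUE stores the row names in a column called "rn"
dt1 <- as.data.table(m1, keep.rownames = TRUE)
dt2 <- as.data.table(m2, keep.rownames = TRUE)

# Full outer join on the former row names
dtMerged <- merge(dt1, dt2, by = "rn", all = TRUE)

# Replace NAs by reference in every column except the join column,
# so the loop does not depend on a hard-coded list of names
for (j in setdiff(names(dtMerged), "rn")) {
  set(dtMerged, i = which(is.na(dtMerged[[j]])), j = j, value = 0L)
}

dtMerged
```

If a matrix is needed again afterwards, as.matrix(dtMerged, rownames = "rn") should turn the rn column back into row names, assuming a reasonably recent data.table version.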
answered 2014-05-05T15:02:19.800