I'm using data.table (1.8.9) and the := operator to update the values in one table from the values in another. The table to be updated (dt1) has many factor columns, and the table with the updates (dt2) has similar columns with values that may not exist in the other table. If the columns in dt2 are characters I get an error message, but when I factorize them I get incorrect values.
How can I update a table without converting all factors to characters first?
Here is a simplified example:
library(data.table)
set.seed(3957)
## Create some sample data
## Note column y is a factor
dt1<-data.table(x=1:10,y=factor(sample(letters,10)))
dt1
## x y
## 1: 1 m
## 2: 2 z
## 3: 3 t
## 4: 4 b
## 5: 5 l
## 6: 6 a
## 7: 7 s
## 8: 8 y
## 9: 9 q
## 10: 10 i
setkey(dt1,x)
set.seed(9068)
## Create a second table that will be used to update the first one.
## Note column y is not a factor
dt2<-data.table(x=sample(1:10,5),y=sample(letters,5))
dt2
## x y
## 1: 2 q
## 2: 7 k
## 3: 3 u
## 4: 6 n
## 5: 8 t
## Join the first and second tables on x and attempt to update column y
## where there is a match
dt1[dt2,y:=i.y]
## Error in `[.data.table`(dt1, dt2, `:=`(y, i.y)) :
## Type of RHS ('character') must match LHS ('integer'). To check and
## coerce would impact performance too much for the fastest cases. Either
## change the type of the target column, or coerce the RHS of := yourself
## (e.g. by using 1L instead of 1)
## Create a third table that is the same as the second, except y
## is also a factor
dt3<-copy(dt2)[,y:=factor(y)]
## Join the first and third tables on x and attempt to update column y
## where there is a match
dt1[dt3,y:=i.y]
dt1
## x y
## 1: 1 m
## 2: 2 i
## 3: 3 m
## 4: 4 b
## 5: 5 l
## 6: 6 b
## 7: 7 a
## 8: 8 l
## 9: 9 q
## 10: 10 i
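## No error message this time, but the values are wrong: the underlying
## integer codes from dt3 appear to be copied as-is and read against the
## levels of dt1$y, rather than matched by label.
## For example, row 2 should be q but it is i.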
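To check that reading of the output, here is a minimal base R sketch (no data.table needed). It assumes the join simply copies the integer codes of dt3$y into dt1$y without remapping them; that is my interpretation of the result, not something I have confirmed in the data.table source. The names lev1 and y3 are only for illustration and stand in for levels(dt1$y) and dt3$y.

```r
## Levels of dt1$y as factor() would build them: the ten sampled
## letters, sorted alphabetically
lev1 <- sort(c("m", "z", "t", "b", "l", "a", "s", "y", "q", "i"))

## dt3$y: the update values as a factor, with its own (different) levels
y3 <- factor(c("q", "k", "u", "n", "t"))

## Integer codes of y3 relative to its own levels (k, n, q, t, u)
codes <- as.integer(y3)

## Reading those codes against dt1's levels gives the labels that
## actually showed up in rows 2, 7, 3, 6 and 8 of dt1
wrong <- lev1[codes]
print(codes)
print(wrong)
```

The labels this produces (i, a, m, b, l) match the unexpected values in dt1 above, which is why I suspect the codes are being reused against the wrong set of levels.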
Page 3 of the data.table help file says:
When LHS is a factor column and RHS is a character vector with items missing from the factor levels, the new level(s) are automatically added (by reference, efficiently), unlike base methods.
This makes it seem like what I've tried should work, but obviously I'm missing something. I wonder if this is related to this similar issue:
rbindlist two data.tables where one has factor and other has character type for a column