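I wrote a function that anonymizes names in a data frame given a key. Once it has anonymized a large number of names it slows to a crawl, and I don't understand why.
The data frame in question is a set of 4733 tweets collected through the Twitter API, where each row is one tweet with 32 columns of data. Names should be anonymized wherever they occur, so I don't want to restrict the function to only a few of those 32 columns.
The key is a data frame containing 211121 pairs of real and fake names; both the real names and the fake names are unique within it. After anonymizing roughly 100k names, the function slows down dramatically.
The function looks like this: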
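pseudonymize <- function(df, key) {
  for (name in key$realNames) {
    # Replace every occurrence of this real name, in every column, with its fake name
    df <- as.data.frame(apply(df, 2, function(column) gsub(name, key[key$realNames == name, 2], column)))
  }
  df  # return the result; a for loop on its own returns NULL
}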
Is there anything obvious here that would cause the slowdown? I have no experience at all with optimizing code for speed.
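Reading the loop as written, each iteration rebuilds the whole data frame through apply() and as.data.frame(), and also rescans all of key$realNames to find the matching fake name. With 211121 names that means 211121 full passes over the data plus 211121 scans of the key. Below is a minimal sketch of the same loop with that per-iteration overhead hoisted out. pseudonymize_indexed is a hypothetical name, and fixed = TRUE assumes the names are meant to be matched literally rather than as regular expressions. It has not been timed against the real data, so treat it as a variant that does less work per name, not as a confirmed fix.

```r
# Hypothetical variant of pseudonymize(): same substring replacement,
# but the key lookup and the data-frame rebuild are moved out of the loop.
pseudonymize_indexed <- function(df, key) {
  real <- as.character(key$realNames)
  fake <- as.character(key$fakeNames)

  # Convert every column to character once, keeping the data frame shape
  df[] <- lapply(df, as.character)

  for (i in seq_along(real)) {
    # real[i] and fake[i] are paired by position, so the key is never searched
    df[] <- lapply(df, function(column) {
      gsub(real[i], fake[i], column, fixed = TRUE)
    })
  }
  df
}
```

This still makes one pass over the data per name, so the total cost keeps growing with the size of the key.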
Edit 1:
Here are a few rows of the data frame to be anonymized.
"https://twitter.com/__jgil/statuses/825559753447313408","__jgil",0.000576911235261567,756,4,13,17,7,16,23,10,0.28166915052161,0.390123456790124,0.00271311644806025,0.474529795261862,0.00641025649383664,"@jadahung20 GIRL I am tooooooo salty tonight lolll","lolll","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",4057,214,241,"Canada","Nouvelle-Ecosse","Middleton","indefini","Shari"
"https://twitter.com/__paigewhite/statuses/827988259573788673","__paigewhite",0,1917,0,8,8,0,9,9,16,0.143476044852192,0.162056634159209,0.000172947386274259,0,0,"@abbytutty_ i miss emily lololol _Ù÷â_Ù÷É","lololol","adjoint","anglais","indefini","anglais","anglais","non","iPhone, Twitter",8366,392,661,"Canada","Nouvelle-Ecosse","indefini","indefini","Shari"
"https://twitter.com/_brookehynes/statuses/821022926287884288","_brookehynes",0,1917,1,6,7,1,7,8,1,1,1,0.000196850793912616,0.00393656926735126,0.200000002980232,"@tdesj3 @belle lol yea doubt it.","lol","adjoint","indefini","anglais","anglais","anglais","non","iPhone, Twitter",1184,87,70,"Canada","Nouvelle-Ecosse","Halifax","indefini","Shari"
Here are a few rows of the key.
"","realNames","fakeNames"
"1","________","Tajid_Pinkley"
"2","____________aho","Monica_Yujiri"
"3","___________ass","Alexander_Garay-Grajeda"
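Edit 2:
I have reduced the DF to only the two columns that need anonymizing, which made things faster, but it still gives up after getting through about 155k names.
As requested in the comments, here is the dput() output of the first three rows of the DF to be anonymized.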
structure(list(
utilisateur = c("___Yeliab", "__courtlezz", "__courtlezz"),
texte = c("@EmilyIsPro ik lol", "@NikkiErica21 there was a sighting in sunset ridge too. Keep Winnie and bob safe lol", "@NikkiErica21 lol yes _Ã\231։")
),
row.names = c(NA, 3L),
class = "data.frame")
Here is the dput() output of the first three rows of the key.
structure(list(
realNames = c("________", "____________aho", "___________ass"),
fakeNames = c("Abhinav_Chang", "Caleb_Dunn-Sparks", "Taryn_Hunzicker")
),
row.names = c(NA, 3L),
class = "data.frame")
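Given the two-column shape above, one alternative worth sketching is to look each name up once instead of running one gsub() per key entry. The sketch below assumes that real names occur only as the whole value of utilisateur or as an @mention in texte; that is an assumption, not something established by the samples. pseudonymize_lookup and its arguments are hypothetical names. Because it matches whole handles, a short key entry such as "________" cannot match inside a longer one such as "____________aho", which substring matching with gsub() would do.

```r
# Hypothetical single-pass alternative: extract candidate handles once,
# then translate them through the key with match().
pseudonymize_lookup <- function(df, key,
                                user_col = "utilisateur",
                                text_col = "texte") {
  real <- as.character(key$realNames)
  fake <- as.character(key$fakeNames)

  # Swap any value found in the key; leave everything else untouched
  swap <- function(x) {
    idx <- match(x, real)
    found <- !is.na(idx)
    x[found] <- fake[idx[found]]
    x
  }

  # User-name column: whole-value replacement
  df[[user_col]] <- swap(as.character(df[[user_col]]))

  # Text column: locate each @mention (assumed charset: letters, digits, "_")
  txt <- as.character(df[[text_col]])
  m <- gregexpr("(?<=@)[A-Za-z0-9_]+", txt, perl = TRUE)
  regmatches(txt, m) <- lapply(regmatches(txt, m), swap)
  df[[text_col]] <- txt

  df
}
```

Called as pseudonymize_lookup(df, key), this makes one pass over each column regardless of how many names the key holds. A name that appears in texte without a leading @ would be left alone, so the assumption above needs checking against the real data.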