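I have two sample data frames, df1 and df2, shown below.
df1 contains a list of selected tennis match fixtures, with the player names (player1_name, player2_name) and the match date. Full names are used for the players here.
df2 contains a list of all tennis match results (winner, loser) for each date. Here, the names use the first letter of the first name and the full surname. The player names for fixtures and results were scraped from different websites, so in some cases the surnames may not match exactly. With that in mind, I want to add a new column to df1 that says whether player1 or player2 won. Basically, I want to map player1_name and player2_name from df1 to winner and loser from df2 by some kind of partial matching, restricted to the same date.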
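To make the naming difference concrete, here is a minimal sketch of how a full name from df1 could be reduced to the "initial plus surname" form used in df2. The helper name abbreviate_name is hypothetical and not part of my code; it assumes the first name is a single word followed by a space.

```r
# Hypothetical helper (an assumption for illustration): convert a full name
# such as "Laslo Djere" into the df2-style form "L Djere".
abbreviate_name <- function(full_name) {
  # Keep the first character of the first word and drop the rest of that word.
  # Names without a space are returned unchanged.
  sub("^([^ ])[^ ]* +", "\\1 ", full_name)
}

short_names <- abbreviate_name(c("Laslo Djere", "Roberto Carballes"))
print(short_names)
```

Even after this step, "R Carballes" still differs from "R Carballes Baena" in df2, which is why an exact join is not enough and some fuzzy matching is needed.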
dput(df1)
structure(list(date = structure(c(18534, 18534, 18534, 18534,
18534, 18534, 18534), class = "Date"), player1_name = c("Laslo Djere",
"Hugo Dellien", "Quentin Halys", "Steve Johnson", "Henri Laaksonen",
"Thiago Monteiro", "Andrej Martin"), player2_name = c("Kevin Anderson",
"Ricardas Berankis", "Marcos Giron", "Roberto Carballes", "Pablo Cuevas",
"Nikoloz Basilashvili", "Joao Sousa")), row.names = c(NA, -7L
), class = "data.frame")
dput(df2)
structure(list(date = structure(c(18534, 18534, 18534, 18534,
18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534, 18534,
18534, 18534, 18534, 18534, 18534, 18534, 18534), class = "Date"),
winner = c("L Harris", "M Berrettini", "M Polmans", "C Garin",
"A Davidovich Fokina", "D Lajovic", "K Anderson", "R Berankis",
"M Giron", "A Rublev", "N Djokovic", "R Carballes Baena",
"A Balazs", "P Cuevas", "T Monteiro", "S Tsitsipas", "D Shapovalov",
"G Dimitrov", "R Bautista Agut", "A Martin"), loser = c("A Popyrin",
"V Pospisil", "U Humbert", "P Kohlschreiber", "H Mayot",
"G Mager", "L Djere", "H Dellien", "Q Halys", "S Querrey",
"M Ymer", "S Johnson", "Y Uchiyama", "H Laaksonen", "N Basilashvili",
"J Munar", "G Simon", "G Barrere", "R Gasquet", "J Sousa"
)), row.names = c(NA, -20L), class = "data.frame")
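I wrote a custom function that uses the RecordLinkage package to match a string to its closest match in a vector of strings. I could write some very inefficient code around this function, but before going down that road I wanted to see whether there is a more efficient way to do it.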
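library(RecordLinkage)  # provides levenshteinSim()

# For each element of `string`, return the closest entry of `stringVector`,
# or NA when the best similarity is below `max_threshold`.
ClosestMatch <- function(string, stringVector, max_threshold = 0.5) {
  matches <- character(length(string))
  for (i in seq_along(string)) {
    similarity <- levenshteinSim(string[i], stringVector)
    if (max(similarity) >= max_threshold) {
      matches[i] <- stringVector[which.max(similarity)]
    } else {
      matches[i] <- NA_character_
    }
  }
  return(matches)
}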
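For reference, this is roughly the result I am after, sketched on a small subset of the data above. It is only an illustration under some assumptions: it swaps in base R's utils::adist for levenshteinSim so it runs without RecordLinkage, and the names name_sim and find_winner, the new column name winner, and the 0.5 threshold are placeholders rather than a settled design.

```r
# Toy subset of the data frames above so the sketch runs on its own.
match_day <- as.Date(18534, origin = "1970-01-01")
df1 <- data.frame(
  date = rep(match_day, 3),
  player1_name = c("Laslo Djere", "Steve Johnson", "Thiago Monteiro"),
  player2_name = c("Kevin Anderson", "Roberto Carballes", "Nikoloz Basilashvili")
)
df2 <- data.frame(
  date = rep(match_day, 4),
  winner = c("K Anderson", "N Djokovic", "R Carballes Baena", "T Monteiro"),
  loser = c("L Djere", "M Ymer", "S Johnson", "N Basilashvili")
)

# Similarity in [0, 1]: 1 - Levenshtein distance / length of the longer string.
# Uses utils::adist (base R) as a stand-in for RecordLinkage::levenshteinSim.
name_sim <- function(a, b) {
  1 - as.vector(utils::adist(a, b)) / pmax(nchar(a), nchar(b))
}

# Hypothetical function: for each fixture, look only at results on the same
# date and decide which orientation (player1 won or player2 won) fits best.
find_winner <- function(fixtures, results, threshold = 0.5) {
  # Reduce full names to "initial surname", the form used in the results.
  short1 <- sub("^([^ ])[^ ]* +", "\\1 ", fixtures$player1_name)
  short2 <- sub("^([^ ])[^ ]* +", "\\1 ", fixtures$player2_name)
  vapply(seq_len(nrow(fixtures)), function(i) {
    same_day <- results[results$date == fixtures$date[i], ]
    if (nrow(same_day) == 0) return(NA_character_)
    # A result row only counts if BOTH names match, so take the weaker score.
    p1_won <- pmin(name_sim(short1[i], same_day$winner),
                   name_sim(short2[i], same_day$loser))
    p2_won <- pmin(name_sim(short2[i], same_day$winner),
                   name_sim(short1[i], same_day$loser))
    if (max(p1_won, p2_won) < threshold) return(NA_character_)
    if (max(p1_won) >= max(p2_won)) "player1" else "player2"
  }, character(1))
}

df1$winner <- find_winner(df1, df2)
print(df1)
```

Requiring both names of a fixture to match the same result row is what lets "R Carballes" pair with "R Carballes Baena" without being confused with an unrelated player on the same date. This still loops over rows, so it shows the desired output rather than the efficient solution I am asking about.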