我只有一列名为“标题”的小标题。
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
2 zostavax shingles vaccine
3 xray operator
4 workplace mental health wellbeing workshop
5 zostavax recall toolkit
6 xray meetint
7 workplace mental health and wellbeing
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
我希望找到类似的记录并将它们组合在一起(同时保持它们的索引):
> dat
# A tibble: 13 x 1
title
<chr>
1 lymphoedema clinic
8 lymphoedema early intervenstion
9 lymphoedema expo
10 lymphoedema for breast care nurses
2 zostavax shingles vaccine
5 zostavax recall toolkit
3 xray operator
6 xray meetint
11 xray meeting and case studies
12 xray online examination
13 xray operator in service paediatric extremities
4 workplace mental health wellbeing workshop
7 workplace mental health and wellbeing
我正在使用下面的函数来查找彼此足够接近的字符串(cutoff = 0.75)
compareJW <- function(string1, string2, cutoff)
{
require(RecordLinkage)
jarowinkler(string1, string2) > cutoff
}
我已经实现了下面的循环,以便在一个新的数据框中一起“发送”类似的记录,但它不能正常工作,我尝试了一些变化,但还没有任何工作。
# create new database
newDB <- data.frame(matrix(ncol = ncol(dat), nrow = 0))
colnames(newDB) <- names(dat)
newDB <- as_tibble(newDB)
for(i in 1:nrow(dat))
{
# print(dat$title[i])
for(j in 1:nrow(dat))
{
print(dat$title[i])
print(dat$title[j])
# score <- jarowinkler(dat$title[i], dat$title[j])
if(dat$title[i] != dat$title[j]
&&
compareJW(dat$title[i], dat$title[j], 0.75))
{
print("if")
# newDB <- rbind(newDB,
# dat$title[i],
# dat$title[j])
}
else
{
print("else")
# newDB <- rbind(newDB, dat$title[i])
}
}
}
(我在循环中插入了打印“看看发生了什么”)
可重复的数据:
dat <-
structure(list(title = c("lymphoedema clinic", "zostavax shingles vaccine",
"xray operator", "workplace mental health wellbeing workshop",
"zostavax recall toolkit", "xray meetint", "workplace mental health and wellbeing",
"lymphoedema early intervenstion", "lymphoedema expo", "lymphoedema for breast care nurses",
"xray meeting and case studies", "xray online examination", "xray operator in service paediatric extremities"
)), row.names = c(NA, -13L), class = c("tbl_df", "tbl", "data.frame"
))
请问有什么建议吗?编辑:我还想要一个名为“组”的新索引列,如下所示:
> dat
# A tibble: 13 x 1
index group title
<chr>
1 1 lymphoedema clinic
8 1 lymphoedema early intervenstion
9 1 lymphoedema expo
10 1 lymphoedema for breast care nurses
2 2 zostavax shingles vaccine
5 2 zostavax recall toolkit
3 3 xray operator
6 3 xray meetint
11 3 xray meeting and case studies
12 3 xray online examination
13 3 xray operator in service paediatric extremities
4 4 workplace mental health wellbeing workshop
7 4 workplace mental health and wellbeing