I have the following data:
attributes <- c("apple-water-orange", "apple-water", "apple-orange", "coffee", "coffee-croissant", "green-red-yellow", "green-red-blue", "green-red","black-white","black-white-purple")
df <- data.frame(attributes)
df
attributes
1 apple-water-orange
2 apple-water
3 apple-orange
4 coffee
5 coffee-croissant
6 green-red-yellow
7 green-red-blue
8 green-red
9 black-white
10 black-white-purple
What I want is another column that assigns a category to each row, based on how similar the observations are.
category <- c(1,1,1,2,2,3,3,3,4,4)
df <- data.frame(attributes, category)
df
attributes category
1 apple-water-orange 1
2 apple-water 1
3 apple-orange 1
4 coffee 2
5 coffee-croissant 2
6 green-red-yellow 3
7 green-red-blue 3
8 green-red 3
9 black-white 4
10 black-white-purple 4
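This is clustering in a broader sense, but I believe most clustering methods only work on numeric data, and from what I have read online, one-hot encoding has many drawbacks.
Does anyone know how to accomplish this? Perhaps some kind of word-matching method?
It would also be great if I could tune the degree of similarity with a parameter (coarse versus fine "clustering").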
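To make the idea concrete, here is a minimal sketch of one possible word-matching approach, not a definitive solution: split each string on "-", measure the Jaccard distance between the resulting word sets, and cut a hierarchical clustering tree at a tunable height. The names `jaccard_dist` and `cut_height`, the average linkage, and the threshold of 0.8 are all assumptions chosen for illustration.

```r
attributes <- c("apple-water-orange", "apple-water", "apple-orange",
                "coffee", "coffee-croissant",
                "green-red-yellow", "green-red-blue", "green-red",
                "black-white", "black-white-purple")

# Split every string into its set of words
tokens <- strsplit(attributes, "-", fixed = TRUE)

# Jaccard distance between two word sets (hypothetical helper):
# 1 - (number of shared words) / (number of distinct words overall)
jaccard_dist <- function(a, b) {
  1 - length(intersect(a, b)) / length(union(a, b))
}

# Pairwise distance matrix over all rows
n <- length(tokens)
d <- matrix(0, n, n, dimnames = list(attributes, attributes))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- jaccard_dist(tokens[[i]], tokens[[j]])
  }
}

# Hierarchical clustering on the distances (average linkage is an assumption)
hc <- hclust(as.dist(d), method = "average")

# Tuning parameter (assumed value): rows sharing no word are at distance 1,
# so any cut below 1 keeps unrelated groups apart
cut_height <- 0.8
category <- unname(cutree(hc, h = cut_height))

df <- data.frame(attributes, category)
df
```

Lowering `cut_height` should give more, finer groups, and raising it toward 1 gives fewer, coarser ones. `plot(hc)` draws the dendrogram, which may help when picking a threshold.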
Thanks in advance for any ideas!