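I have a large data set that I imported from Excel, and I want to build a term-frequency table for it. However, when I use strsplit, the result includes quotation marks and other punctuation, which gives wrong counts.
There seems to be a small mistake in how I am using strsplit, and I need help because I cannot figure it out myself.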
library(readxl)
df = read_excel("C:/Users/BM Consulting/Documents/Book2.xlsx", col_types=c("text","numeric"), range=cell_cols("A:B"))
vect <- c(df[1])
vectsplit <- strsplit(tolower(vect), "\\s+")
vectlev <- unique(unlist(vectsplit))
vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels=vectlev)))
The output of vect looks like this:
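 [1] "3 inch c clamp" "baby vice" "baby vice" "baby vice"
 [5] "bench vice" "bench vice" "bench vice" "bench vice"
 [9] "bench voice" "bench wise" "bench wise heavy" "bench wise table"
[13] "box for tools" "c clamp" "c clamp set" "c clamps"
[17] "carpenter tools" "carpenter tools low price" "cast iron pipe" "clamp"
[21] "clamp set" "clamp woodworking" "g clamp" "g clamp set 3 inch"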
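I need every word extracted on its own. When I use strsplit, it includes all the punctuation.
Below is a small part of the vectsplit I get. It includes all the quotation marks, backslashes, and commas, which I do not want.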
 [1] "c(\"3" "inch" "c" "clamp\"," "\"baby" "vice\"," "\"baby" "vice"
 [9] "bench\"," "\"baby" "vice\"," "\"bench\"," "\"bench" "vice\"," "\"bench" "vice"
[17] "clamp\"," "\"bench" "vice\"," "\"bench" "voice\"," "\"bench" "wise\"," "\"bench"
[25] "wise" "heavy\"," "\"bench" "wise" "table\"," "\"box" "for" "tools\","
[33] "\"c" "clamp\"," "\"c" "clamp" "set\"," "\"c" "clamps\"," "\"carpenter"
[41] "tools\"," "\"carpenter" "tools" "low" "price\"," "\"cast" "iron" "pipe\","
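
A minimal sketch of what probably goes wrong and one possible way around it. The stray `c(\"`, quotes, and commas are consistent with `c(df[1])` producing a list rather than a character vector: `tolower()` then coerces that list to a single deparsed string before `strsplit` ever sees it. The sketch below extracts the column with `df[[1]]` instead. The small data frame and its column names (`term`, `count`) are hypothetical stand-ins for the imported sheet, and `wordfreq` is an extra name introduced only for illustration.

```r
# Hypothetical stand-in for the sheet imported with read_excel().
df <- data.frame(term  = c("3 inch c clamp", "baby vice", "bench vice", "c clamp set"),
                 count = c(10, 5, 7, 3),
                 stringsAsFactors = FALSE)

# df[[1]] returns the first column as a plain character vector.
# c(df[1]) would instead return a list holding that vector, which
# tolower() coerces to one string such as 'c("3 inch c clamp", ...)'.
vect <- df[[1]]

# Split each phrase on runs of whitespace (note the doubled backslash).
vectsplit <- strsplit(tolower(vect), "\\s+")

# Distinct words, in order of first appearance.
vectlev <- unique(unlist(vectsplit))

# Term-frequency matrix: one row per word, one column per phrase.
vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels = vectlev)))

# Overall word counts across all phrases.
wordfreq <- table(factor(unlist(vectsplit), levels = vectlev))
print(wordfreq)
```

With a character vector as input, each element of `vectsplit` holds only the words of one phrase, so no quotes, backslashes, or commas should appear in the tokens.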