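I have a large set of data that I imported from Excel, and I want to build a term frequency table for it. However, when I use strsplit, the result includes quotation marks and other punctuation, which gives wrong results.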

There is a small mistake in how I am using strsplit, and I need help because I can't work it out myself.

df = read_excel("C:/Users/BM Consulting/Documents/Book2.xlsx", col_types=c("text","numeric"), range=cell_cols("A:B"))

vect <- c(df[1])

vectsplit <- strsplit(tolower(vect), "\\s+")

vectlev <- unique(unlist(vectsplit))

vecttermf <- sapply(vectsplit, function(x) table(factor(x, levels=vectlev)))

Printing vect gives this:

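[1] "3 inch c clamp" "baby vice" "baby vice" "baby vice"
[5] "bench vice" "bench vice" "bench vice" "bench vice"
[9] "bench voice" "bench wise" "bench wise heavy" "bench wise table"
[13] "box for tools" "c clamp" "c clamp set" "c clamp"
[17] "carpenter tools" "carpenter tools low price" "cast iron pipe" "clamp"
[21] "clamp set" "clamp woodworking" "g clamp" "g clamp set 3 inch"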

I need to get every single word out on its own. When I use strsplit, the output includes all the punctuation.

Below is a small part of the vectsplit I get. It includes all the quotes, backslashes and commas, which I don't want.

[1] "c(\"3" "inch" "c" "clamp\"," "\"baby" "vice\"," "\"baby" "vice"
[9] "bench\"," "\"baby" "vice\"," "\"bench\"," "\"bench" "vice\"," "\"bench" "vice"
[17] "clamp\"," "\"bench" "vice\"," "\"bench" "voice\"," "\"bench" "wise\"," "\"bench"
[25] "wise" "heavy\"," "\"bench" "wise" "table\"," "\"box" "for" "tools\","
[33] "\"c" "clamp\"," "\"c" "clamp" "set\"," "\"c" "clamp\"," "\"carpenter"
[41] "tools\"," "\"carpenter" "tools" "low" "price\"," "\"cast" "iron" "pipe\","

1 Answer

If you check the class of vect, you will notice that it is not a character vector but a list.

vect<-c(df[1])
class(vect)
> "list"

If you define vect as follows instead, the problem goes away:

vect<-df[[1]]
class(vect)
> "character"

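If you define vect this way and then use strsplit, it should work fine. Keep in mind that the two kinds of subsetting produce outputs of different classes: [1] keeps the enclosing structure (a one-column data frame, which is a list), while [[1]] extracts the column itself.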
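As a rough illustration of the difference, here is a minimal sketch on a small made-up data frame. The name `toy` and its values are assumptions for illustration only, not the asker's spreadsheet. It shows why the list version picks up stray quotes (`tolower()` coerces a list with `as.character()`, which deparses the whole column into a single string) and how the `[[1]]` version leads to a plain word count.

```r
# Hypothetical stand-in for the imported sheet (not the asker's data)
toy <- data.frame(term = c("Baby vice", "bench vice", "c clamp set"),
                  count = c(2, 5, 1),
                  stringsAsFactors = FALSE)

# Single brackets keep the list structure; tolower() then coerces the list
# with as.character(), which deparses the whole column into one string.
as_list <- c(toy[1])
collapsed <- tolower(as_list)
length(collapsed)   # 1
collapsed           # starts with c( and contains literal quotes and commas

# Double brackets extract the character vector itself.
terms <- toy[[1]]
words <- unlist(strsplit(tolower(terms), "\\s+"))

# Word frequencies across the whole column, most frequent first
freq <- sort(table(words), decreasing = TRUE)
freq
```

This counts words over the whole column at once. If a per-row table is wanted instead, as in the question's sapply call, the same `strsplit(tolower(terms), "\\s+")` result can be passed to it unchanged.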

answered 2019-08-01T17:38:36.157