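I'm trying to remove misspellings from my text analysis data, so I'm using the dictionary feature of the quanteda package. It works for unigrams, but it gives unexpected output for bigrams. I'm not sure how to handle misspellings so that they don't sneak into my bigrams and trigrams.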
library(quanteda)

# Two short test documents
ZTestCorp1 <- c("The new law included a capital gains tax, and an inheritance tax.",
                "New York City has raised a taxes: an income tax and a sales tax.")
ZcObj <- corpus(ZTestCorp1)

# Dictionary of correctly spelled words, one key per word
mydict <- dictionary(list("the"="the", "new"="new", "law"="law",
                          "capital"="capital", "gains"="gains", "tax"="tax",
                          "inheritance"="inheritance", "city"="city"))

# Bigram dfm with the dictionary applied
Zdfm1 <- dfm(ZcObj, ngrams=2, concatenator=" ",
             what = "fastestword",
             toLower=TRUE, removeNumbers=TRUE,
             removePunct=TRUE, removeSeparators=TRUE,
             removeTwitter=TRUE, stem=FALSE,
             ignoredFeatures=NULL,
             language="english",
             dictionary=mydict, valuetype="fixed")
wordsFreq1 <- colSums(sort(Zdfm1))
Current output:
> wordsFreq1
the new law capital gains tax inheritance city
0 0 0 0 0 0 0 0
Without the dictionary, the output is as follows:
> wordsFreq
        tax and         the new         new law    law included      included a       a capital
              2               1               1               1               1               1
  capital gains       gains tax          and an  an inheritance inheritance tax        new york
              1               1               1               1               1               1
      york city        city has      has raised        raised a         a taxes        taxes an
              1               1               1               1               1               1
      an income      income tax           and a         a sales       sales tax
              1               1               1               1               1
Expected bigrams:
The new
new law
law capital
capital gains
gains tax
tax inheritance
inheritance city
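P.S. I assumed that tokenization is done after dictionary matching, but judging from the results I see, that doesn't seem to be the case.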
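To make the order of operations I have in mind concrete, here is a minimal base-R sketch (no quanteda calls). It assumes the word list acts as a whitelist that is applied to the unigram tokens first, and that bigrams are formed only from the tokens that survive. The helper name `bigrams_after_filter` and the simple letters-only tokenizer are hypothetical and only for illustration; this is not how `dfm()` works internally.

```r
# Hypothetical helper (not part of quanteda): keep only whitelisted tokens,
# then pair each remaining token with the token that follows it.
bigrams_after_filter <- function(text, keep) {
  # Lower-case the text and split on any run of non-letter characters
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  # Drop every token that is not in the whitelist (including empty strings)
  tokens <- tokens[tokens %in% keep]
  # Fewer than two tokens left means no bigram can be formed
  if (length(tokens) < 2) return(character(0))
  # Pair token i with token i + 1
  paste(head(tokens, -1), tail(tokens, -1))
}

keep <- c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")
bigrams_after_filter(
  "The new law included a capital gains tax, and an inheritance tax.", keep)
# [1] "the new"         "new law"         "law capital"     "capital gains"
# [5] "gains tax"       "tax inheritance" "inheritance tax"
```

Because the filter runs before pairing, words such as "included" or "and" never appear in a bigram, which is the behaviour I was hoping the dictionary would give me.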
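On a side note, I tried creating my dictionary object as
mydict <- dictionary(list(mydict=c("the", "new", "law", "capital", "gains",
                                   "tax", "inheritance", "city")))
but it didn't work, so I had to fall back on the approach shown earlier, which I don't think is efficient.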
Update: added the output I get with Ken's solution:
> (myDfm1a <- dfm(ZcObj, verbose = FALSE, ngrams=2,
+ keptFeatures = c("the", "new", "law", "capital", "gains", "tax", "inheritance", "city")))
Document-feature matrix of: 2 documents, 14 features.
2 x 14 sparse Matrix of class "dfmSparse"
       features
docs    the_new new_law law_included a_capital capital_gains gains_tax tax_and an_inheritance
  text1       1       1            1         1             1         1       1              1
  text2       0       0            0         0             0         0       1              0
       features
docs    inheritance_tax new_york york_city city_has income_tax sales_tax
  text1               1        0         0        0          0         0
  text2               0        1         1        1          1         1