Questions tagged [term-document-matrix]


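147 questions

0 votes

3 answers

4395 views

r - TermDocumentMatrix sometimes throws an error

I am building a word cloud from tweets by different sports teams. This code runs successfully about 1 time in 10:

The other 9 times out of 10, it throws the following error:

Any ideas? I have Googled it, but so far without success. Please bear in mind that I am an absolute beginner in R!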

r word-cloud term-document-matrix

2014-09-06T10:31:54.310

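0 votes

2 answers

1121 views

r - tm package: output findAssocs() as a matrix instead of a list in R

Consider the following list:

How can I get a data frame that has all the terms associated with these 3 words in its columns and shows:

the corresponding correlation coefficient, if it exists
NA if the term is not associated with the word (for example, the pair (oil, they) would show NA)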
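
The reshaping asked about here can be sketched in base R. This is only an illustration: the `assocs` list below is made-up data shaped like `findAssocs()` output (a named list of named numeric vectors), and the words and coefficients in it are assumptions, not values from the question.

```r
# Hypothetical findAssocs()-style result: one named numeric vector per word.
assocs <- list(
  oil  = c(opec = 0.87, prices = 0.80),
  opec = c(oil = 0.87, they = 0.76),
  gas  = c(prices = 0.65)
)

# The union of all associated terms becomes the row names.
all_terms <- sort(unique(unlist(lapply(assocs, names))))

# Indexing each vector by the full term list yields NA for absent terms.
assoc_mat <- sapply(assocs, function(v) unname(v[all_terms]))
rownames(assoc_mat) <- all_terms

# One column per queried word, one row per associated term.
assoc_df <- as.data.frame(assoc_mat)
print(assoc_df)
```

In this sketch the cell for the pair (oil, they) is NA because "they" is not among the terms associated with "oil".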

r matrix tm term-document-matrix

2014-09-24T03:59:30.930

0 votes

1 answer

727 views

r - R: build a TermDocumentMatrix with a removeSparseTerms argument

Can I remove sparse terms while creating a tm::TermDocumentMatrix object?

I tried:

But it does not work.
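
For comparison, here is a minimal sketch of two routes that tm does provide, assuming a tiny made-up corpus (the `docs` vector is illustrative only). The first calls `removeSparseTerms()` on a finished matrix; the second uses the `bounds$global` control option, which filters terms by document frequency during construction. Whether the second matches what the asker attempted is unknown, since the original code is not shown.

```r
library(tm)

# Illustrative corpus; "oil" is the only term that occurs in every document.
docs <- c("oil prices rise", "oil output falls", "opec meets over oil")
corpus <- VCorpus(VectorSource(docs))

# Route 1: build the full matrix, then drop sparse terms afterwards.
tdm_full  <- TermDocumentMatrix(corpus)
tdm_dense <- removeSparseTerms(tdm_full, sparse = 0.5)

# Route 2: keep only terms found in at least 2 documents while building.
tdm_bounded <- TermDocumentMatrix(
  corpus,
  control = list(bounds = list(global = c(2, Inf)))
)

print(Terms(tdm_dense))
print(Terms(tdm_bounded))
```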

r text-mining tm term-document-matrix

2014-10-21T11:05:02.483

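0 votes

1 answer

1881 views

r - R: clustering documents

I have a DocumentTermMatrix that looks like this:

With the tm package, the Hamming distance between 2 documents can be computed. Now I want to cluster all documents whose Hamming distance is smaller than 3. So here I would like cluster 1 to be documents 1 and 2, and cluster 2 to be documents 3 and 4. Is it possible to do this?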
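
One possible approach, sketched in base R on a made-up binary matrix (the `dtm` values are an assumption, since the original matrix is not shown): on 0/1 data the Manhattan distance equals the Hamming distance, and cutting a single-linkage tree just below 3 groups documents that are linked by distances smaller than 3.

```r
# Hypothetical binary document-term matrix (rows = documents).
dtm <- matrix(
  c(1, 1, 0, 0, 0, 0,
    1, 1, 1, 0, 0, 0,
    0, 0, 0, 1, 1, 1,
    0, 0, 0, 1, 1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("doc", 1:4), paste0("term", 1:6))
)

# On 0/1 data, Manhattan distance counts differing positions (Hamming).
d <- dist(dtm, method = "manhattan")

# Single linkage merges clusters at the smallest pairwise distance.
hc <- hclust(d, method = "single")

# Hamming distances are integers, so cutting at 2.5 means "less than 3".
clusters <- cutree(hc, h = 2.5)
print(clusters)
```

Note that single linkage chains: two documents can end up in the same cluster through intermediate documents even if their own distance is 3 or more. Complete linkage would instead require every pair in a cluster to be within the threshold.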

r matrix cluster-analysis hamming-distance term-document-matrix

2014-10-27T09:49:32.470

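0 votes

1 answer

6708 views

r - Large text corpus breaks tm_map

I have been struggling with this for the past few days. I searched all the SO archives and tried the suggested solutions, but I cannot seem to get it to work. I have sets of txt documents in folders such as 2000 06, 1995 -99, etc., and want to run some basic text mining operations, such as creating a document-term matrix and a term-document matrix and doing some operations based on word co-locations. My script works on a smaller corpus, but it fails when I try it on the bigger corpus. I have pasted the code for one such folder operation.

When I use the mc.cores=1 argument in tm_map, the operation runs indefinitely. However, if I use the lazy=TRUE argument in tm_map, it seemingly goes well, but subsequent operations give this error.

I have been looking all over for a solution but have consistently failed. Any help would be greatly appreciated!

Best!
K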

r text-mining tm text-analysis term-document-matrix

2014-11-09T23:30:15.213

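0 votes

2 answers

4357 views

r - How to select only a subset of corpus terms for TermDocumentMatrix creation in tm

I have a huge corpus, and I am interested only in the occurrences of a handful of terms that I know up front. Is there a way to create a term-document matrix from the corpus using the tm package, where only the terms I specify up front are used and included?

I know that I can subset the resulting TermDocumentMatrix of the corpus, but I want to avoid building the full term-document matrix to start with, due to memory size constraints.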
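
As a hedged illustration, tm's `dictionary` control option restricts the matrix to a given set of terms at construction time. The corpus and the `wanted` vector below are made up for the sketch. Each document is still tokenized in full, so how much memory this saves on a very large corpus is not something the example demonstrates.

```r
library(tm)

# Illustrative corpus and a hypothetical list of terms of interest.
docs <- c("oil prices rise again", "opec cuts oil output", "markets shrug")
corpus <- VCorpus(VectorSource(docs))
wanted <- c("oil", "opec")

# Only terms listed in the dictionary are tabulated into the matrix.
tdm <- TermDocumentMatrix(corpus, control = list(dictionary = wanted))

inspect(tdm)
```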

r tm corpus term-document-matrix

2014-11-19T03:12:58.960

0 votes

1 answer

1130 views

r - TermDocumentMatrix as.matrix uses large amounts of memory

I'm currently using the tm package to extract terms to cluster on for duplicate detection in a decently sized database of 25k items (30 MB). This runs on my desktop, but when I try to run it on my server, it seems to take an ungodly amount of time. On closer inspection, I found that I had blown through 4 GB of swap running the line apply(posts.TmDoc, 1, sum) to calculate the frequencies of the terms. Furthermore, even running as.matrix generates a 3 GB object on my desktop; see http://imgur.com/a/wllXv

Is this necessary just to generate a frequency count for 18k terms on 25k items? Is there any other way to generate the frequency count without coercing the TermDocumentMatrix to a matrix or a vector?

I cannot remove terms based on sparseness, as that is how the actual algorithm is implemented. It looks for terms that are common to at least 2 but not more than 50 items and groups on them, calculating a similarity value for each group.
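
One way to get such counts without a dense conversion, sketched on a toy corpus (the `docs` vector is made up): a TermDocumentMatrix is stored as a sparse `simple_triplet_matrix` from the slam package, so `slam::row_sums()` can total each term directly, and the number of items containing each term can be read off the stored row indices.

```r
library(tm)
library(slam)

# Toy corpus standing in for the 25k items.
docs <- c("apple banana apple", "banana cherry", "cherry cherry banana")
tdm <- TermDocumentMatrix(VCorpus(VectorSource(docs)))

# Total occurrences per term, computed on the sparse representation.
term_freq <- slam::row_sums(tdm)

# Number of documents containing each term: tdm$i holds the row (term)
# index of every non-zero cell, so tabulating it counts documents.
doc_freq <- tabulate(tdm$i, nbins = nrow(tdm))
names(doc_freq) <- Terms(tdm)

# Terms shared by at least 2 but not more than 50 items.
shared <- names(doc_freq)[doc_freq >= 2 & doc_freq <= 50]
print(shared)
```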

Here is the code in context for reference

r tm term-document-matrix

2014-12-08T10:27:11.567

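0 votes

2 answers

10099 views

r - R and tm package: create a term-document matrix with a dictionary of one or two words?

Purpose: I want to create a term-document matrix using a dictionary that has compound words, or bigrams, as some of its keywords.

Web search: Being new to text mining and to the tm package in R, I went online to figure out how to do this. Below are some relevant links that I found:

Background: Of these, I preferred the solution that uses NGramTokenizer from the RWeka package in R, but I ran into a problem. In the example code below, I create three documents and place them in a corpus. Note that Docs 1 and 2 each contain two words. Doc 3 contains only one word. My dictionary keywords are two bigrams and one unigram.

Problem: The NGramTokenizer solution from the links above does not count Doc 3 correctly.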

I was expecting the count for jedi to be 1 for Doc 3 and 0 for the other two. Is there something I am misunderstanding?

r tm n-gram term-document-matrix rweka

2015-01-19T20:33:18.247

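0 votes

1 answer

139 views

r - My DocumentTermMatrix is reduced to zero columns

Train.tsv contains 156,060 lines of text with 4 columns named Phrase, PhraseID, SentenceID and Sentiment (on a scale of 0 to 4). The Phrase column holds the text lines. (The tm package is already loaded.) R version: 3.1.2; OS: Windows 7, 64-bit, 4 GB RAM.

Here are the first 6 lines of the train file.

Here I wrote two functions: one to clean the corpus and another to build the DTM (document-term matrix). I also associate each sentiment value with its line of text. Now when I check the dimensions of dtm1, it shows 156060 rows but 0 columns.

So how can I generate a DTM with sentiment labels?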

r text-mining tm term-document-matrix

2015-01-31T05:35:35.527

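0 votes

1 answer

137 views

r - How to build a term-document matrix in R

I would like to know whether it is possible to build a TermDocumentMatrix without using the tm package.

I was thinking of combining two for loops with grep, but unfortunately I did not manage to create anything useful.

Thanks in advance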
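
A minimal base R sketch of the idea, without for loops or grep: split each document into tokens and cross-tabulate terms against documents. The `docs` vector and the simple letters-only tokenizer are assumptions made for the illustration.

```r
# Illustrative documents, named so the matrix columns are labelled.
docs <- c(d1 = "The cat sat", d2 = "The dog sat down", d3 = "Cat and dog")

# Lower-case, split on anything that is not a letter, drop empty tokens.
tokens <- strsplit(tolower(docs), "[^a-z]+")
tokens <- lapply(tokens, function(x) x[nzchar(x)])

# Long format: one row per (term, document) occurrence.
long <- data.frame(
  term = unlist(tokens, use.names = FALSE),
  doc  = rep(names(docs), lengths(tokens)),
  stringsAsFactors = FALSE
)

# Cross-tabulate into a term-by-document count matrix.
tdm <- unclass(table(term = long$term, doc = long$doc))

# Binary (presence/absence) version of the same matrix.
tdm_binary <- (tdm > 0) * 1L

print(tdm)
```

The result is a dense matrix, which is fine for small inputs; for large corpora a sparse representation would be a better fit.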

r matrix binary term-document-matrix

2015-03-16T17:34:03.763
