Questions tagged "countvectorizer"

0 votes

1 answer

1526 views

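apache-spark - Spark - How to create a Spark dataframe containing an array of values for the countVectorizer model

I am trying to run Spark's countVectorizer model. As part of this, I am reading a CSV file and creating a DataFrame (inp_DF) from it.

It has 3 columns, as shown below:

I need to create a 4th column in the same dataframe that holds an array of the values from all 3 of these columns, for example:

Question 1: Is there a simple command like .concat to achieve this?

The array is needed because the input to the countVectorizer model must be a column containing an array of values. It must not be of string type, as the error message below indicates: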

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column state must be of type equal to one of the following types: [ArrayType(StringType,true), ArrayType(StringType,false)] but was actually of type StringType.
    at scala.Predef$.require(Predef.scala:224)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnTypes(SchemaUtils.scala:58)
    at org.apache.spark.ml.feature.CountVectorizerParams$class.validateAndTransformSchema(CountVectorizer.scala:75)
    at org.apache.spark.ml.feature.CountVectorizer.validateAndTransformSchema(CountVectorizer.scala:123)
    at org.apache.spark.ml.feature.CountVectorizer.transformSchema(CountVectorizer.scala:188)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
    at org.apache.spark.ml.feature.CountVectorizer.fit(CountVectorizer.scala:155)
    at org.apache.spark.examples.ml.CountVectorizerExample$.main(CountVectorizerExample.scala:54)
    at org.apache.spark.examples.ml.CountVectorizerExample.main(CountVectorizerExample.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Java HotSpot(TM) Client VM warning: ignoring option MaxPermSize=300m; support was removed in 8.0

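I tried to create an array from these 3 columns of the input dataframe, but the array elements are enclosed in square brackets [].

A sample code snippet is given below for reference:

Question 2: How can I remove the square brackets [] from these array elements and create a new column in the dataframe using the array's values?

Question 3: Can we supply a single column's values as input to the countVectorizer model and get features as output?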

2017-09-05T07:33:24.750

0 votes

1 answer

335 views

python - Python: how to turn list of word counts into format suitable for CountVectorizer

I have ~100,000 lists of strings of the form:
['the: 652', 'of: 216', 'in: 168', 'to: 159', 'is: 145'] etc.
which essentially makes up my corpus. Each list contains the words from a document and their word counts.

How can I put this corpus into a form that I can feed into CountVectorizer?

Is there a quicker way than turning each list into a string containing 'the' 652 times, 'of' 216 times, etc.?
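Since the counts already exist, one option is to skip CountVectorizer and build the same kind of sparse document-term matrix directly from the counts. The sketch below swaps in scikit-learn's `DictVectorizer` for that purpose; the helper `parse_counts` and the two-document corpus are hypothetical and only mirror the `'word: count'` format shown in the question.

```python
from sklearn.feature_extraction import DictVectorizer


def parse_counts(entries):
    """Turn ['the: 652', 'of: 216'] into {'the': 652, 'of': 216}."""
    counts = {}
    for entry in entries:
        # Split on the last colon so a word that contains ':' still parses.
        word, _, count = entry.rpartition(":")
        counts[word.strip()] = int(count)
    return counts


# Hypothetical corpus in the format from the question.
corpus = [
    ["the: 652", "of: 216", "in: 168", "to: 159", "is: 145"],
    ["the: 10", "cat: 3", "is: 2"],
]

vectorizer = DictVectorizer()  # produces a scipy sparse matrix by default
X = vectorizer.fit_transform([parse_counts(doc) for doc in corpus])

# Column order follows vectorizer.feature_names_ (sorted alphabetically).
the_column = vectorizer.vocabulary_["the"]
```

The result has one row per document and one column per distinct word, which is the layout CountVectorizer would have produced from the expanded text, without repeating each word hundreds of times.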

python python-2.7 nlp nltk countvectorizer

2017-09-16T10:20:03.353

0 votes

0 answers

283 views

python-3.x - IndexError: index 2 is out of bounds for axis 1 with size 2

I am getting an index-out-of-bounds error on the line doctopic = clf.fit_transform(dtm). My Data folder contains two CSV files. Can someone explain how to fix this IndexError?

python-3.x jupyter-notebook decomposition countvectorizer

user8678674

2017-09-26T19:24:41.737

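0 votes

2 answers

1007 views

python - Return a list of each word in a pandas cell along with that word's total count across the whole column

I have a pandas dataframe df that looks like this:

I want to generate a column2 that lists each word in the row together with that word's total count across the whole column. So the output would look something like this:

I tried using sklearn but could not achieve the above. Any help is appreciated.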
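A minimal sketch of one way to get this without sklearn, using `collections.Counter` for the column-wide totals. The frame contents and the name `column1` are assumptions, because the original data is not shown in the question.

```python
from collections import Counter

import pandas as pd

# Hypothetical data standing in for the frame from the question.
df = pd.DataFrame({"column1": ["apple banana", "banana cherry", "apple apple"]})

# Split every cell on whitespace, then count each word over the whole column.
tokens = df["column1"].str.split()
totals = Counter(word for row in tokens for word in row)

# For each row, pair every word with its column-wide total.
df["column2"] = tokens.apply(lambda row: [(word, totals[word]) for word in row])
```

Here each entry of `column2` is a list of `(word, total)` tuples; a different output shape (for example a dict per row) would only change the expression inside the lambda.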

python scikit-learn word-frequency countvectorizer

2017-10-01T07:50:44.773

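0 votes

1 answer

506 views

python - How to convert multiple sentences into bigrams in Python

I am fairly new to Python and I want to convert a set of sentences into bigrams. Is there a way to do this? For example:

With ngram = 2, I expect the vocabulary to look something like:

so that X can be transformed into:

Is there a function in CountVectorizer that I can use for this?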
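CountVectorizer has an `ngram_range` parameter for this; `ngram_range=(2, 2)` keeps only bigrams. The sentences below are hypothetical, since the question's example data is not shown.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sentences; the originals are not shown in the question.
sentences = ["the cat sat down", "the cat ran away"]

# (2, 2) means: minimum n-gram length 2, maximum n-gram length 2.
vectorizer = CountVectorizer(ngram_range=(2, 2))
X = vectorizer.fit_transform(sentences)

# Maps each bigram to its column index in X.
vocabulary = vectorizer.vocabulary_
bigram_counts = X.toarray()
```

Note that the default tokenizer drops single-character tokens such as "a", so bigrams are built from the remaining words. Using `ngram_range=(1, 2)` would keep unigrams alongside the bigrams.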

python text-mining n-gram countvectorizer

2017-10-08T06:48:20.200

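0 votes

2 answers

574 views

python-2.7 - sklearn CountVectorizer

I have a question about using vocabulary_.get, with the code below. As shown, I used CountVectorizer in a machine learning exercise to get the number of occurrences of specific words.

Output:

So my question is why 'bistie' shows the correct feature number, i.e. 2, while 'BESTIE' shows None. Does vocabulary_.get not work well with uppercase input?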
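The likely explanation is that CountVectorizer lowercases text by default (`lowercase=True`), so the fitted vocabulary only contains lowercase keys and a lookup with an uppercase key finds nothing. The documents below are hypothetical and only illustrate that behaviour.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents; the question's own data is not shown.
docs = ["My BESTIE is here", "bestie forever"]

# Default settings: every token is lowercased before the vocabulary is built.
vec = CountVectorizer().fit(docs)
lower_hit = vec.vocabulary_.get("bestie")  # a column index
upper_hit = vec.vocabulary_.get("BESTIE")  # None, no such key

# Option 1: lowercase the lookup key to match the vocabulary.
fixed_hit = vec.vocabulary_.get("BESTIE".lower())

# Option 2: keep the original casing when fitting.
vec_cased = CountVectorizer(lowercase=False).fit(docs)
cased_hit = vec_cased.vocabulary_.get("BESTIE")
```

With `lowercase=False`, "BESTIE" and "bestie" become two separate features, which is usually not what a word-count exercise wants, so lowercasing the key is often the simpler fix.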

python-2.7 machine-learning scikit-learn countvectorizer

2017-10-09T15:39:43.670

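0 votes

0 answers

498 views

python - Memory error in Python when using CountVectorizer() on a pandas dataframe

I am using the code below to build a document-term matrix in Python.

For a dataset of 10,000 records the code works fine, but with a large dataset of about 1,100,000 records I get a memory error during execution.

Can someone tell me where I am going wrong?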
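The question's code is not shown, so the cause can only be guessed. One common source of this error is converting the vectorizer output to a dense array or DataFrame (for example with `.toarray()`), which allocates rows × vocabulary cells. A sketch that stays sparse throughout, assuming a hypothetical `text` column:

```python
import numpy as np
import pandas as pd
from scipy import sparse
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical frame; the real one has about 1,100,000 rows.
df = pd.DataFrame({"text": ["red apple", "green apple", "red car"]})

# A smaller integer dtype halves the memory used for the stored counts.
# On real data, min_df / max_features can also shrink the vocabulary.
vectorizer = CountVectorizer(dtype=np.int32)
dtm = vectorizer.fit_transform(df["text"])  # scipy sparse matrix

# Aggregate on the sparse matrix directly instead of calling .toarray().
term_totals = np.asarray(dtm.sum(axis=0)).ravel()
is_sparse = sparse.issparse(dtm)
```

Most scikit-learn estimators accept the sparse matrix as is, so a dense copy is rarely needed.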

python out-of-memory nltk sklearn-pandas countvectorizer

2017-10-13T05:27:50.023

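0 votes

1 answer

1167 views

python-3.x - Making CountVectorizer faster for large datasets

Hello, I just want to cluster movies based on their titles. My function works well on my data, but I have one big problem: my sample is large, 150,000 movies, and it is very slow; it takes 3 days to cluster all of them.

Process:

Sort the movie titles by length.

Transform the movies with CountVectorizer and compute the similarity for each one (for every movie being clustered, I fit the vectorizer and transform the target movie each time).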
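Refitting the vectorizer for every movie is probably the main cost in the process described above. A sketch of the alternative, fitting once on all titles and computing the similarities in a single vectorized call; the titles are hypothetical and the question's clustering logic is not reproduced.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical titles standing in for the 150,000 in the question.
titles = ["star wars", "star wars episode one", "toy story", "toy story two"]

# Fit a single vectorizer on the whole collection, once.
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(titles)

# Pairwise cosine similarity for all titles at once; dense_output=False keeps
# the result sparse, which helps because most title pairs share no words.
similarity = cosine_similarity(X, dense_output=False)
```

At 150,000 titles even a sparse all-pairs matrix may be large, so computing it in row chunks (for example `cosine_similarity(X[start:stop], X, dense_output=False)`) is one way to bound memory.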

python-3.x performance scikit-learn countvectorizer

2017-10-31T08:52:17.673

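0 votes

1 answer

872 views

python-3.x - Topic distribution in a gensim ldamodel trained with CountVectorizer

My task is as follows:

My task is to estimate the LDA model parameters on the corpus and find a list of 10 topics with the 10 most significant words in each topic, which I did like this:

This passed the autograder fine. The next task is to find the topic distribution for a new document, which I tried to do as follows:

However, this just returns

gensim.interfaces.TransformedCorpus

I also saw the following statement in the documentation: "You can then infer topic distributions on new, unseen documents, with >>> doc_lda = lda[doc_bow]", but had no success with that either. Any help is appreciated.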

python-3.x gensim topic-modeling countvectorizer

2017-11-16T09:04:42.343

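0 votes

0 answers

539 views

scikit-learn - Supplying stop_words with bigrams in sklearn CountVectorizer()

Is there a cheap and simple way to keep sklearn's CountVectorizer from applying the stop_words parameter only to unigrams, and make it also stop bigrams? What I mean is illustrated in the following snippet:

So what this code does is output the following:

As you can see, I would like the bigram "hello" (which was passed in as a stop word) to be filtered out. I have seen some posts where people use pipelines or custom analyzers, and I have gone through the documentation, but is there no simpler way to solve this?

Thanks!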
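CountVectorizer applies `stop_words` to single tokens before any n-grams are built, so a multi-word entry in the stop list never matches. A fairly compact version of the custom-analyzer approach mentioned in the question wraps the built-in analyzer and filters its output. The stop list and documents below are hypothetical.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stop list containing both a unigram and a bigram.
stop_ngrams = {"hello", "hello you"}

# Reuse the standard preprocessing, tokenization and n-gram generation.
base_analyzer = CountVectorizer(ngram_range=(1, 2)).build_analyzer()


def analyzer(doc):
    """Generate unigrams and bigrams, then drop any that are in the stop list."""
    return [ngram for ngram in base_analyzer(doc) if ngram not in stop_ngrams]


vectorizer = CountVectorizer(analyzer=analyzer)
X = vectorizer.fit_transform(["hello you there", "you there"])
kept = sorted(vectorizer.vocabulary_)
```

This filters after n-gram generation, which differs from the built-in `stop_words` behaviour: a stop word here does not cause its neighbours to be joined into a new bigram.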

scikit-learn stop-words countvectorizer

2017-11-16T20:04:03.050
