python - Computing TF for tags in a movie database with Python/GraphLab

Question

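Given a large number of movies and their associated tags (the tags are keywords), how can I compute a TF or TF-IDF vector for each movie? Is there a library in GraphLab or Python that does this automatically? Here is my input: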

   print HH_tag_5K

  +---------+-----------------+
  | movieId |       tag       |
  +---------+-----------------+
  |   2324  |   bittersweet   |
  |   2324  |    holocaust    |
  |   2324  |   World War II  |
  |   357   |      Garath     |
  |   260   | Science Fiction |
  |  55267  |   large family  |
  |  55267  |    realistic    |
  |  55267  |     romantic    |
  |  55267  |   Steve Carell  |
  |  55267  |    the music    |
  +---------+-----------------+
  [194527 rows x 2 columns]
  Note: Only the head of the SFrame is printed.
  You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

I think sklearn.feature_extraction.text.TfidfVectorizer is the answer to this problem, but I have not figured out how to apply it to my data. Thanks.

Reference: link to sklearn.feature_extraction.text.TfidfVectorizer

score 0 · Accepted Answer

Here is how to do it with graphlab:

import graphlab
sf = graphlab.SFrame({'movie_id': [1, 2, 3],
                      'title': ['the dog is brown',
                                'the cat is brown',
                                'the mouse is yellow']})
sf['tf_idf'] = graphlab.text_analytics.tf_idf(sf['title'])

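The SFrame now has another column named tf_idf that contains dictionaries. The keys of each dictionary are the words in the corresponding title, and the values are the tf-idf scores.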

+----------+---------------------+-------------------------------+
| movie_id |        title        |             tf_idf            |
+----------+---------------------+-------------------------------+
|    1     |   the dog is brown  | {'brown': 0.40546510810816... |
|    2     |   the cat is brown  | {'brown': 0.40546510810816... |
|    3     | the mouse is yellow | {'is': 0.0, 'mouse': 1.098... |
+----------+---------------------+-------------------------------+
[3 rows x 3 columns]
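The scores in the table are consistent with the weighting tf * log(N / df): 'brown' occurs in 2 of 3 titles, giving log(3/2) ≈ 0.405, and 'is' occurs in all 3, giving log(1) = 0. The following is a minimal pure-Python sketch under that assumption, not GraphLab's actual implementation. It first checks the weighting against the three titles, then applies it to the question's data by grouping the tags per movie and treating each whole tag as one term. The helper `tf_idf` and the variable names are hypothetical, and `rows` is just the head of the question's table typed in by hand.

```python
import math
from collections import Counter, defaultdict


def tf_idf(docs):
    """Return one {term: tf * log(N / df)} dict per document.

    docs is a list of token lists. The weighting is an assumption,
    chosen because it matches the scores shown in the table.
    """
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter(term for doc in docs for term in set(doc))
    return [
        {term: count * math.log(float(n_docs) / df[term])
         for term, count in Counter(doc).items()}
        for doc in docs
    ]


# Check the weighting against the three titles from the answer.
titles = ['the dog is brown', 'the cat is brown', 'the mouse is yellow']
title_scores = tf_idf([t.split() for t in titles])

# (movieId, tag) rows copied by hand from the head of the question's table.
rows = [(2324, 'bittersweet'), (2324, 'holocaust'), (2324, 'World War II'),
        (357, 'Garath'), (260, 'Science Fiction'), (55267, 'large family'),
        (55267, 'realistic'), (55267, 'romantic'), (55267, 'Steve Carell'),
        (55267, 'the music')]

# Group tags by movie so that each movie becomes one "document" whose
# terms are whole tags ('World War II' is not split into words).
tags_by_movie = defaultdict(list)
for movie_id, tag in rows:
    tags_by_movie[movie_id].append(tag)

movie_ids = list(tags_by_movie)
tag_scores = dict(zip(movie_ids,
                      tf_idf([tags_by_movie[m] for m in movie_ids])))
```

Here `tag_scores` maps each movieId to a dictionary of tag scores, in the same shape as the tf_idf column above. Note that TfidfVectorizer applies IDF smoothing and L2 normalization by default, so its numbers would not match these exactly.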
