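Given many movies and their associated tags (the tags are keywords), how can I compute a TF-IDF vector for each movie? Is there a library in GraphLab or Python that does this automatically? Here is my input: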

   print HH_tag_5K

  +---------+-----------------+
  | movieId |       tag       |
  +---------+-----------------+
  |   2324  |   bittersweet   |
  |   2324  |    holocaust    |
  |   2324  |   World War II  |
  |   357   |      Garath     |
  |   260   | Science Fiction |
  |  55267  |   large family  |
  |  55267  |    realistic    |
  |  55267  |     romantic    |
  |  55267  |   Steve Carell  |
  |  55267  |    the music    |
  +---------+-----------------+
  [194527 rows x 2 columns]
  Note: Only the head of the SFrame is printed.
  You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.

In fact, I think sklearn.feature_extraction.text.TfidfVectorizer is the answer to this problem, but I have not figured out how to apply it to my data. Thanks.

Reference: link to sklearn.feature_extraction.text.TfidfVectorizer

1 Answer

Here is how to do it with graphlab:

import graphlab
sf = graphlab.SFrame({'movie_id': [1, 2, 3],
                      'title': ['the dog is brown',
                                'the cat is brown',
                                'the mouse is yellow']})
sf['tf_idf'] = graphlab.text_analytics.tf_idf(sf['title'])

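The SFrame now has another column named tf_idf that contains dictionaries. Each dictionary's keys are the words in the corresponding title, and the values are the tf-idf scores.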

+----------+---------------------+-------------------------------+
| movie_id |        title        |             tf_idf            |
+----------+---------------------+-------------------------------+
|    1     |   the dog is brown  | {'brown': 0.40546510810816... |
|    2     |   the cat is brown  | {'brown': 0.40546510810816... |
|    3     | the mouse is yellow | {'is': 0.0, 'mouse': 1.098... |
+----------+---------------------+-------------------------------+
[3 rows x 3 columns]
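
The scores in the table are consistent with the plain formula tf × ln(N / df), where N is the number of documents and df is the number of documents containing the word. That formula is inferred from the printed output rather than taken from the graphlab documentation, so treat the following pure-Python sketch as a sanity check and not as a description of the library's internals:

```python
import math
from collections import Counter

titles = ["the dog is brown", "the cat is brown", "the mouse is yellow"]

# Term frequencies per title (whitespace tokenization is an assumption).
docs = [Counter(title.split()) for title in titles]
n_docs = len(docs)

# Document frequency: in how many titles each word appears.
# Iterating over a Counter yields each distinct word once per title.
doc_freq = Counter(word for doc in docs for word in doc)

# tf * ln(N / df) for every word of every title.
scores = [
    {word: tf * math.log(float(n_docs) / doc_freq[word]) for word, tf in doc.items()}
    for doc in docs
]
```

Under this formula a word that appears in every title, such as 'is', scores 0.0, which matches the third row of the table.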
answered 2016-03-31T06:29:00.607