I am preparing data to run k-means in GraphLab and am running into the following error:

 tmp = data.select_columns(['a.item_id'])
 tmp['sku'] = tmp['a.item_id'].apply(lambda x: x.split(','))
 tmp = tmp.unpack('sku')

 kmeans_model = gl.kmeans.create(tmp, num_clusters=K)

 Feature 'sku.0' excluded because of its type. Kmeans features must be int, float, dict, or array.array type.
 Feature 'sku.1' excluded because of its type. Kmeans features must be int, float, dict, or array.array type.

Here are the current data types of each column:

a.item_id   str
sku.0   str
sku.1   str

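If I could convert the data type from str to int, I think it should work. However, working with SFrames is trickier than with the standard Python libraries. Any help getting there is appreciated.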

1 Answer

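The kmeans model does accept features in dictionary form, just not in list form. This is slightly different from what you have now, because a dictionary loses the order of your SKUs, but in terms of model quality I suspect it actually makes more sense. The key function is count_words, in the text analytics toolkit.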

https://dato.com/products/create/docs/generated/graphlab.text_analytics.count_words.html

import graphlab as gl
sf = gl.SFrame({'item_id': ['abc,xyz,cat', 'rst', 'abc,dog']})
sf['sku_count'] = gl.text_analytics.count_words(sf['item_id'], delimiters=[','])

model = gl.kmeans.create(sf, num_clusters=2, features=['sku_count'])
print(model.cluster_id)

+--------+------------+----------------+
| row_id | cluster_id |    distance    |
+--------+------------+----------------+
|   0    |     1      | 0.866025388241 |
|   1    |     0      |      0.0       |
|   2    |     1      | 0.866025388241 |
+--------+------------+----------------+
[3 rows x 3 columns]
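
To make the dictionary feature and the distance column more concrete, here is a minimal plain-Python sketch that does not use GraphLab at all. It assumes that count_words with delimiters=[','] yields one {token: count} dict per row, that keys missing from a dict are treated as zeros, and that the distance is Euclidean distance to the mean of the cluster members. The helper names centroid and distance are made up for this illustration and are not part of any library.

```python
from collections import Counter
import math

rows = ['abc,xyz,cat', 'rst', 'abc,dog']

# One {sku: count} dict per row; an approximation (assumed) of what
# count_words returns when ',' is the only delimiter.
bags = [dict(Counter(row.split(','))) for row in rows]

def centroid(members):
    # Mean of sparse vectors; a key missing from a dict counts as 0.
    keys = set().union(*members)
    return {k: sum(m.get(k, 0) for m in members) / float(len(members))
            for k in keys}

def distance(a, b):
    # Euclidean distance between two sparse vectors stored as dicts.
    keys = set(a) | set(b)
    return math.sqrt(sum((a.get(k, 0) - b.get(k, 0)) ** 2 for k in keys))

# Rows 0 and 2 share a cluster in the output above; row 1 is alone.
center = centroid([bags[0], bags[2]])
d0 = distance(bags[0], center)
d2 = distance(bags[2], center)
d1 = distance(bags[1], centroid([bags[1]]))
print(bags)
print(d0, d1, d2)
```

Under these assumptions, rows 0 and 2 each sit at sqrt(0.75), roughly 0.866, from their shared centroid and row 1 sits at 0.0, which agrees with the table above up to floating-point precision. Rows 0 and 2 are grouped together because they share the SKU abc, which is the kind of overlap the dictionary representation captures even though SKU order is lost.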
answered 2016-06-22T16:16:38.677