python - using numpy to reduce the size of the matrix

Question

I have to create an adjacency matrix of users and TV shows, where the rows are the users and the columns are the TV shows. If a user follows a TV show, the matrix has a 1 in that cell, otherwise a 0. I have already collected this information from Twitter. In total there are 140 TV shows and approximately 530000 unique users. I am using the following Python code to generate the matrix:

NoTvShows: Total number of TV shows(IDs)
unique_users: All the unique users
collected_users: This is a list of lists. The sublists correspond to TV shows and list the IDs of the followers.

matrix = []
main_matrix = []
for i in range(NoTvShows):
    for every_user in unique_users:
        if every_user in collected_users[i]:
            matrix.append(1)
        else:
            matrix.append(0)
    main_matrix.append(matrix)
    matrix = []

the_matrix = zip(*main_matrix)
simplejson.dump(the_matrix,fwrite)
fwrite.close()

When I run my program on the server, it crashes because it takes too much time and memory. I know I can use numpy to reduce the size of my matrix and then use it to compute similarities between the users. However, I am not sure how to use numpy in this code to generate the reduced matrix.

I hope someone can guide me in this regard.

Thank you

Richa

score 6 · Accepted Answer

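A sparse matrix (as @phg suggested) is a good idea, since most entries in the matrix are probably 0 (assuming most users follow only a few TV shows).

More importantly, though, you're building the matrix in a very inefficient way (making lots of Python lists and copying them around) rather than putting the data into a nice compact numpy array in the first place. You're also spending a lot of time searching through lists (with the in statement), which your loop doesn't need at all.

This code loops over the follower lists and looks up each ID's user number in the user_ids dictionary. You could adapt it to a sparse matrix class fairly simply (I think just by switching np.zeros to scipy.sparse.coo_matrix).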

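import numpy as np

# Map each user ID to its column index so lookups take constant time.
user_ids = dict((user, i) for i, user in enumerate(unique_users))

# np.zeros takes the shape as a single tuple.
follower_matrix = np.zeros((NoTvShows, len(unique_users)), dtype=bool)
for show_idx, followers in enumerate(collected_users):
    for user in followers:
        follower_matrix[show_idx, user_ids[user]] = 1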
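The sparse variant mentioned above might look like the sketch below. One caveat: scipy.sparse.coo_matrix does not support element assignment, so rather than a literal swap for np.zeros this version collects the row and column index of every follow and builds the matrix in one call. The toy unique_users and collected_users values are made up for illustration, and follower_sparse and user_by_show are names chosen here, not from the question.

```python
import numpy as np
from scipy import sparse

# Toy stand-ins for the question's variables (hypothetical data).
unique_users = ["alice", "bob", "carol", "dave"]
collected_users = [["alice", "carol"], ["bob"], ["alice", "bob", "dave"]]
NoTvShows = len(collected_users)

user_ids = {user: i for i, user in enumerate(unique_users)}

# Collect the (row, column) coordinates of every nonzero entry.
rows, cols = [], []
for show_idx, followers in enumerate(collected_users):
    for user in followers:
        rows.append(show_idx)
        cols.append(user_ids[user])

data = np.ones(len(rows), dtype=bool)
follower_sparse = sparse.coo_matrix(
    (data, (rows, cols)), shape=(NoTvShows, len(unique_users))
)

# Transpose so that rows are users, then convert to CSR for fast row access.
user_by_show = follower_sparse.T.tocsr()
print(user_by_show.toarray().astype(int))
```

Only the nonzero entries are stored, so memory use grows with the number of follows rather than with users times shows.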

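Once you have the matrix, you really, really don't want to save it as JSON unless you have to: it's a very wasteful format for numeric matrices. numpy.save is best if you're only going to use the matrix in numpy. numpy.savetxt would also work; it at least eliminates the brackets and commas, and will probably have less memory overhead while writing. But when you have a 0-1 matrix with a boolean dtype, numpy.save needs one byte per matrix element (numpy stores each boolean in a full byte; numpy.packbits can bring that down to one bit), whereas numpy.savetxt needs two bytes (an ASCII '0' or '1' plus a space or newline), and JSON uses at least three bytes, I think (a digit, a comma, and a space, plus some brackets per row).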
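The size difference can be checked on a small example. This is a rough sketch on a made-up random matrix (140 shows by 1000 users, not the real data), writing to in-memory buffers instead of files; the variable names are illustrative only.

```python
import io
import numpy as np

# Hypothetical small stand-in for the real 140 x 530000 matrix.
rng = np.random.default_rng(0)
follower_matrix = rng.random((140, 1000)) < 0.05  # boolean, mostly False

# np.save: binary .npy format, one byte per boolean element plus a small header.
npy_buf = io.BytesIO()
np.save(npy_buf, follower_matrix)

# np.savetxt: one ASCII digit plus a space or newline per element.
txt_buf = io.BytesIO()
np.savetxt(txt_buf, follower_matrix.astype(np.uint8), fmt="%d")

# np.packbits: 8 booleans per byte; the original width is needed to unpack.
packed = np.packbits(follower_matrix, axis=1)
restored = np.unpackbits(
    packed, axis=1, count=follower_matrix.shape[1]
).astype(bool)

print(len(npy_buf.getvalue()), len(txt_buf.getvalue()), packed.nbytes)
```

The packed array can itself be written with numpy.save, as long as the original number of columns is stored somewhere so the matrix can be unpacked later.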

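You may also be talking about dimensionality reduction techniques. That's quite possible as well: there are many ways to reduce your 140-dimensional vectors (which TV shows each user follows) to a lower dimension, through some kind of PCA-type technique, a topic model, or something based on clustering. If your only concern is that it takes a long time to build the matrix, though, that won't help at all, since those techniques typically need the full original matrix and then give you a lower-dimensional version. Try my suggestions here; if that isn't good enough, try a sparse matrix, and only then worry about fancy ways of reducing the data (probably by learning a dimensionality reduction on a subset of the data and then constructing the rest).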
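For reference, a PCA-type reduction can be sketched with plain numpy via the singular value decomposition. This is only an illustration on made-up random data (200 users, 140 shows); the target dimension k = 10 is an arbitrary choice, and as noted above it still requires the full matrix to exist first.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical users-by-shows matrix: 200 users, 140 shows, about 5% ones.
X = (rng.random((200, 140)) < 0.05).astype(float)

# Centre each column, then project onto the top-k right singular vectors
# (this is PCA computed through the SVD).
k = 10  # target dimensionality, chosen arbitrarily for illustration
X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
X_reduced = X_centered @ Vt[:k].T  # shape (n_users, k)

print(X_reduced.shape)
```

Each user is then described by k numbers instead of 140, which makes pairwise similarity computations cheaper.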

score 3 · Answer

You may want to use a sparse matrix to save space. I found this for scipy: http://docs.scipy.org/doc/scipy/reference/sparse.html

I hope that is what you meant.

score 1 · Answer

Here is another approach, if you're interested. It assumes your users are in sorted order, but they can be numeric or string IDs:

import numpy as np

# The setup (users must already be sorted for searchsorted to work)
users = ['bob', 'dave', 'steve']
users = np.array(users)
collected_users = [['bob'], ['dave'], ['steve', 'dave'], ['bob', 'steve', 'dave']]
NoTvShows = len(collected_users)

# The meat
matrix = np.zeros((NoTvShows, len(users)), dtype=bool)
for i, watches in enumerate(collected_users):
    # Binary-search each follower's position in the sorted users array.
    index = users.searchsorted(watches)
    matrix[i, index] = 1
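If the users array is not already sorted, one possible workaround is to sort a copy and map the positions back to the original column order with np.argsort. This is a sketch under that assumption; the names order and sorted_users are introduced here for illustration.

```python
import numpy as np

users = np.array(['steve', 'bob', 'dave'])  # deliberately unsorted
collected_users = [['bob'], ['dave'], ['steve', 'dave'], ['bob', 'steve', 'dave']]

order = np.argsort(users)     # indices that would sort `users`
sorted_users = users[order]

matrix = np.zeros((len(collected_users), len(users)), dtype=bool)
for i, watches in enumerate(collected_users):
    # Position of each follower within the sorted copy.
    pos = np.searchsorted(sorted_users, watches)
    # Map back to the follower's column in the original `users` order.
    matrix[i, order[pos]] = True

print(matrix.astype(int))
```

Note that searchsorted does not check membership: an ID missing from users is silently mapped to a neighbouring position (or to an out-of-range index), so this only works when every follower appears in users.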
