string - Octave is very slow when constructing a word-sentence matrix

Question

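I have a vocabulary (a vector of strings) and a file full of sentences. I want to build a matrix that shows how often each sentence contains each word. My current implementation is extremely slow, and I believe it can be made much faster. A single sentence of about ten words takes roughly a minute.

Can you explain why this happens and how to speed it up?

Note: I use a sparse matrix because a dense one would not fit in memory. The vocabulary contains about 10,000 words. Running the program does not exhaust my available memory, so that is not the problem.

Here is the relevant code. Variables that are not shown, such as totalLineCount, vocab, and vocabCount, were initialized earlier.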

% initiate sentence structure
wordSentenceMatrix = sparse(vocabCount, totalLineCount);
% fill the sentence structure
fid = fopen(fileLocation, 'r');
lineCount = 0;
while ~feof(fid),
    line = fgetl(fid);
    lineCount = lineCount + 1;
    line = strsplit(line, " ");
    % go through each word and increase the corresponding value in the matrix
    for j=1:size(line,2),
        for k=1:vocabCount,
            w1 = line(j);
            w2 = vocab(k);
            if strcmp(w1, w2),
                wordSentenceMatrix(k, lineCount) = wordSentenceMatrix(k, lineCount) + 1;
            end;
        end;
    end;
end;

score 1 · Accepted Answer

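A sparse matrix is actually stored in memory as three arrays. In simplified terms, you can describe its storage as an array of row indices, an array of column indices, and an array of the nonzero entry values. (The slightly more complicated real layout is called compressed sparse column.)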
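
As a rough illustration of the simplified three-array picture (not the actual compressed sparse column layout), `find` returns the row indices, column indices, and values of a sparse matrix. The small matrix below is made up for the example.

```octave
% A made-up 3x3 sparse matrix with three nonzero entries.
S = sparse([1 3 2], [1 1 3], [5 7 9], 3, 3);

% find returns the three arrays of the simplified description:
% row indices, column indices, and nonzero values (column by column).
[rowIdx, colIdx, vals] = find(S);

% Show one (row, column, value) triplet per line.
disp([rowIdx, colIdx, vals]);
```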

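By growing the sparse matrix element by element in your code, you repeatedly change the structure (the sparsity pattern) of that matrix. This is not recommended, because it involves a lot of memory copying.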
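
A minimal sketch of the alternative: collect all (row, column) pairs first and call `sparse` once. `sparse` sums the values of repeated index pairs, so one entry per word occurrence turns into a count. The indices below are made up for the example.

```octave
% Made-up word occurrences: word 2 appears twice in sentence 1,
% word 3 once in sentence 1, and word 1 once in sentence 2.
rowIdx = [2; 3; 2; 1];   % vocabulary index of each occurrence
colIdx = [1; 1; 1; 2];   % sentence index of each occurrence

% One call builds the whole matrix; duplicate (row, column) pairs are summed.
A = sparse(rowIdx, colIdx, ones(numel(rowIdx), 1), 3, 2);

% The same matrix built element by element. Every new nonzero entry
% changes the sparsity pattern, which is the slow path for large inputs.
B = sparse(3, 2);
for n = 1 : numel(rowIdx)
    B(rowIdx(n), colIdx(n)) = B(rowIdx(n), colIdx(n)) + 1;
end

disp(full(A));
disp(isequal(A, B));
```

Both approaches give the same counts; only the cost of getting there differs.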

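The way you look up a word's index in the vocabulary is also slow, because for every word in a sentence you iterate over the entire vocabulary. A better approach is to use a Java HashMap in Matlab.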
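
For comparison, here is a sketch of the same lookup idea using `containers.Map` instead of a Java HashMap. This is a swapped-in alternative, and it assumes a MATLAB or Octave version that provides `containers.Map`. The vocabulary and the sentence are made up for the example.

```octave
% Made-up vocabulary; in the question this would be the cell array vocab.
vocab = {'the', 'cat', 'sat', 'mat'};

% Map each word to its position in the vocabulary (built once).
vocabMap = containers.Map(vocab, num2cell(1 : numel(vocab)));

% Look up each word of a sentence; words not in the vocabulary are skipped.
words = strsplit('the cat sat on the mat', ' ');
idx = [];
for j = 1 : numel(words)
    if isKey(vocabMap, words{j})
        idx(end + 1) = vocabMap(words{j});   % tiny example, so growing idx is fine
    end
end

disp(idx);
```

Each lookup goes through the map rather than scanning the vocabulary, which is the point of the hash map suggestion.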

I modified your code as follows:

rowIdx = [];
colIdx = [];
% map each vocabulary word to its row index
vocabHashMap = java.util.HashMap;
for k = 1 : vocabCount
    vocabHashMap.put(vocab{k}, k);
end

fid = fopen(fileLocation, 'r');
lineCount = 0;
while ~feof(fid),
    line = fgetl(fid);
    lineCount = lineCount + 1;
    line = strsplit(line, " ");
    % record one (row, column) pair per word that is in the vocabulary
    for j = 1 : length(line)
        k = vocabHashMap.get(line{j});
        if ~isempty(k)   % skip words that are not in the vocabulary
            rowIdx = [rowIdx; k];
            colIdx = [colIdx; lineCount];
        end
    end
end
fclose(fid);
assert(length(rowIdx) == length(colIdx));
nonzeros = length(rowIdx);
% sparse sums repeated (row, column) pairs, which yields the word counts
wordSentenceMatrix = sparse(rowIdx, colIdx, ones(nonzeros, 1), vocabCount, lineCount);

Of course, if you know the length of your text collection in advance, you should preallocate the memory for rowIdx and colIdx:

rowIdx = zeros(nonzeros, 1);
colIdx = zeros(nonzeros, 1);
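
With preallocated arrays, the loop has to write by position instead of appending. A minimal sketch, assuming an upper bound on the number of word occurrences is known (the hypothetical `maxEntries`) and using made-up index pairs in place of the file-reading loop:

```octave
maxEntries = 10;                 % hypothetical upper bound on word occurrences
rowIdx = zeros(maxEntries, 1);
colIdx = zeros(maxEntries, 1);
count = 0;                       % number of slots filled so far

% Stand-in for the file loop: each row is a
% (vocabulary index, sentence index) pair.
pairs = [2 1; 3 1; 2 1; 1 2];
for n = 1 : size(pairs, 1)
    count = count + 1;
    rowIdx(count) = pairs(n, 1);
    colIdx(count) = pairs(n, 2);
end

% Drop the unused tail before building the matrix.
rowIdx = rowIdx(1 : count);
colIdx = colIdx(1 : count);
wordSentenceMatrix = sparse(rowIdx, colIdx, ones(count, 1), 3, 2);
disp(full(wordSentenceMatrix));
```

Trimming to `count` matters because leftover zeros are not valid indices for `sparse`.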

Port it to Octave if you can.
