arrays - 更快地搜索一个巨大的数组 matlab

Question

我目前正在 matlab 中实现一种算法，该算法通过购买某些文章的客户数据库进行搜索。该数据库如下所示：

[ 0   1   2   3   4   5 NaN NaN;
  4   6   7   8 NaN NaN NaN NaN;
...]

只是那个东西的大小是 size(data) = [90810 30]。现在我想在那个数据库中找到频繁的项目集（不要过多地使用工具箱）。我将在这里提供一个玩具示例：

toyset = [
  0,  1,  2,  3,  4,  5,  6,  7,  8,  9;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  5,  6,  7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
  1,  6,  7,  9, 10, 11,NaN,NaN,NaN,NaN;
  2,  4,  8, 11, 12,NaN,NaN,NaN,NaN,NaN];

当应用最小支持 0.5 [support = (occurences_of_set) / (all_sets) ] 时，这将生成以下项集：

frequent_itemsets = [
  7,NaN,NaN;
  6,NaN,NaN;
  5,NaN,NaN;
  6,  7,NaN;
  5,  7,NaN;
  5,  6,NaN;
  5,  6,  7];

我现在的问题是找出项目集在数据集中的频率。目前我使用以下算法（顺便说一句效果很好）：

function list = preprocess(subjectArray, combinations, progressBar)
% =========================================================================
% 
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
% 
% =========================================================================
% 
%   preprocesses the array; Finds the frequency of articles
%   subjectArray    - Array that contains customer data
%   combinations    - The article combinations to be found
%   progressBar     - The progress bar to indicate the progress of the 
%                     algorithm 
% 
% =========================================================================

    [countCustomers,maxSizeCustomers] = size(subjectArray);
    [countCombinations,sizeCombinations] = size(combinations);
    list=zeros(1,countCombinations);

    for i = 1:countCustomers
        waitbar(i/countCustomers,progressBar,sprintf('Preprocess: %.0f/%.0f\nSet size:%.0f',i,countCustomers,sizeCombinations));
        for k = 1 : countCombinations
            helpArray = zeros(1,maxSizeCustomers);
            help2Array = zeros(1,sizeCombinations);
            for j = 1:sizeCombinations
                helpArray = helpArray + (subjectArray(i,:) == combinations(k,j));
                help2Array(j) = any(helpArray);
            end
            list(k) = list(k) + all(help2Array);
        end
    end
end

我唯一的问题是需要 AGES ！！！字面上地！！是否有任何简单的可能性（除了长度为 1 的集合，我知道可以通过简单的计数来加快速度）来加快速度？

我认为这是：

helpArray = helpArray + (subjectArray(i,j) == combinations(k,:));

是瓶颈？但我不确定，因为我不知道 matlab 执行某些操作的速度有多快。

感谢您研究它，请注意_

我最终做了什么：

function list = preprocess(subjectArray, combinations)
% =========================================================================
% 
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
% 
% =========================================================================
% 
%   preprocesses the array; Finds the frequency of articles
%   subjectArray    - Array that contains customer data
%   combinations    - The article combinations to be found
% 
% =========================================================================

    [countCustomers,maxSizeCustomers] = size(subjectArray);
    [countCombinations,sizeCombinations] = size(combinations);
    list=zeros(1,countCombinations);


    if sizeCombinations == 1
        for i = 1 : countCustomers
            for j = 1 : maxSizeCustomers
                x = subjectArray(i,j) + 1;
                if isnan(x), break; end
                list(x+1) = list(x+1) + 1;
            end
        end
    else
        for i = 1:countCombinations
            logical = zeros(size(subjectArray));
            for j = 1:sizeCombinations
                logical = logical + (subjectArray == combinations(i,j));
            end
            list(i) = sum(sum(logical,2) == sizeCombinations);
        end
    end
end

感谢所有的支持！

score 3 · Accepted Answer

抱歉，我不能发表评论（我想我的声誉太低了）频繁项集挖掘非常复杂。如果您有一个庞大的数据集，并且您为一个项目（集合）选择了较低的阈值，那么您的方法（先验？）您必须准备等待很长时间:) 通常当您处理嵌套的 for 循环时matlab 你也会遇到性能低下的问题。你选择了什么门槛？你的数据集有多大？

score 1 · Accepted Answer

我立即看到的三个建议：

首先，您的等待栏为您的搜索增加了三分半钟。根据这个线程：http ://www.mathworks.com/matlabcentral/newsreader/view_thread/261380如果包含等待栏，它需要执行 240,000 个项目的代码额外执行 550 秒，将其扩展到 90,000 并且您仍然有 3加时半分钟。

要计算最初频繁出现的选项，请使用逻辑索引的总和，例如，查看 7 在数据集中出现的频率。

logical7=subjectArray==7;
numOf7s=sum(sum(logical7));

对每个值都这样做，我有一种感觉，即使会有额外的代码，它也会大大加快初始处理速度。

为了使该代码更好，您可以执行以下操作

预分配逻辑垫，每个 3d 切片代表一个数字（第 6 个切片代表频率。= 5，第 7 个切片代表频率。= 6）

logMat=zeros([size(subjectArray) maxPossibleVal+1])%Max Possible val 是 9 在玩具盒中。

然后用逻辑#矩阵填充每个切片

for i=0:maxPossibleVal
  logMat(:,:,i+1) = subjectArray==i;
end

再一次，您可以从每个逻辑切片中获取总和，并且可以从日志中删除小于某个阈值的总和。mat（我也会使用逻辑索引来删除不符合阈值的切片）

现在，对所有内容进行逻辑索引的好处是，您可以将切片与加法或乘法相结合，以获得不同的组合频率。您甚至可以旋转这些操作的结果，然后使用“sum”命令，然后进行逻辑索引以获得两个或三个数字一起出现的频率。

logM7=logMat(:,:,8)
logM8=logMat(:,:,9)

combo7and8=logical(double(logM7)+double(logM8))

%你也许可以用 | 替换它使这更简单/更快

freq7and8=sum(sum(combo7and8')==2)

%sum 默认情况下查找列的总和，将我们的行转换为列，然后找出哪些行等于 2，将所有逻辑 1 加在一起，就得到了频率。每个数据集中的 7 和 8。

整篇文章可以总结为两点：

取下等待栏

知道在代码中几乎所有地方都可以使用逻辑索引，这比 for 循环快得多

arrays - 更快地搜索一个巨大的数组 matlab

2 回答 2

Related

Reference