I am currently implementing an algorithm in MATLAB that searches a database of customers who bought certain articles. The database looks like this:
[ 0 1 2 3 4 5 NaN NaN;
4 6 7 8 NaN NaN NaN NaN;
...]
except that the real thing has size(data) = [90810 30]. Now I want to find the frequent itemsets in that database (without relying too heavily on toolboxes). Here is a toy example:
toyset = [
0, 1, 2, 3, 4, 5, 6, 7, 8, 9;
5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
5, 6, 7,NaN,NaN,NaN,NaN,NaN,NaN,NaN;
1, 6, 7, 9, 10, 11,NaN,NaN,NaN,NaN;
2, 4, 8, 11, 12,NaN,NaN,NaN,NaN,NaN];
Applying a minimum support of 0.5 [support = (occurrences_of_set) / (all_sets)] yields the following itemsets:
frequent_itemsets = [
7,NaN,NaN;
6,NaN,NaN;
5,NaN,NaN;
6, 7,NaN;
5, 7,NaN;
5, 6,NaN;
5, 6, 7];
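To make the support definition concrete, here is a minimal sketch that recomputes the support of one itemset on the toy data. The names `itemset`, `containsAll`, and `support` are my own and only for illustration; NaN padding never matches because a comparison with NaN is always false.

```matlab
toyset = [
    0, 1, 2,   3,   4,   5,   6,   7,   8,   9;
    5, 6, 7, NaN, NaN, NaN, NaN, NaN, NaN, NaN;
    5, 6, 7, NaN, NaN, NaN, NaN, NaN, NaN, NaN;
    1, 6, 7,   9,  10,  11, NaN, NaN, NaN, NaN;
    2, 4, 8,  11,  12, NaN, NaN, NaN, NaN, NaN];

itemset = [5, 6, 7];                      % hypothetical itemset to check

% A customer contains the itemset if every item occurs somewhere in the row.
containsAll = true(size(toyset, 1), 1);
for item = itemset
    containsAll = containsAll & any(toyset == item, 2);
end

% Fraction of customers whose row contains all items of the set.
support = sum(containsAll) / size(toyset, 1);
```

For {5, 6, 7} the first three customers qualify, so the support is 3/5 = 0.6, which clears the 0.5 threshold.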
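My problem now is working out how often each itemset occurs in the dataset. Currently I use the following algorithm (which works fine, by the way):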
function list = preprocess(subjectArray, combinations, progressBar)
% =========================================================================
%
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
%
% =========================================================================
%
% preprocesses the array; Finds the frequency of articles
% subjectArray - Array that contains customer data
% combinations - The article combinations to be found
% progressBar  - The progress bar to indicate the progress of the
%                algorithm
%
% =========================================================================
[countCustomers,maxSizeCustomers] = size(subjectArray);
[countCombinations,sizeCombinations] = size(combinations);
list=zeros(1,countCombinations);
for i = 1:countCustomers
    waitbar(i/countCustomers,progressBar,sprintf('Preprocess: %.0f/%.0f\nSet size:%.0f',i,countCustomers,sizeCombinations));
    for k = 1 : countCombinations
        helpArray = zeros(1,maxSizeCustomers);
        for j = 1:sizeCombinations
            helpArray = helpArray + (subjectArray(i,:) == combinations(k,j));
        end
        % Count the customer only if every item matched. any(helpArray) is
        % already true after the first hit, so compare the number of matches
        % instead (assumes no article repeats within a customer row).
        list(k) = list(k) + (sum(helpArray) == sizeCombinations);
    end
end
end
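My only problem is that it takes AGES!!! Literally!! Is there any easy way to speed this up (apart from sets of length 1, which I know can be handled faster by simple counting)?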
I think this line:
helpArray = helpArray + (subjectArray(i,:) == combinations(k,j));
is the bottleneck? But I am not sure, because I don't know how fast MATLAB executes particular operations.
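One way to settle the bottleneck question is to measure rather than guess. The sketch below times a row-by-row count against a whole-matrix comparison for a single pair using `tic`/`toc`. The data are random stand-ins that I made up, not the real database, and `pair`, `countLoop`, and `countVec` are hypothetical names; timings will differ by machine and MATLAB version.

```matlab
rng(0);                                    % reproducible stand-in data (assumption)
nCustomers = 2000;
data = nan(nCustomers, 30);
for r = 1:nCustomers
    n = randi(30);                         % number of articles this customer bought
    data(r, 1:n) = randperm(100, n) - 1;   % distinct article IDs in 0..99
end
pair = [5, 6];                             % hypothetical itemset to count

% Variant 1: one customer row at a time
tic;
countLoop = 0;
for r = 1:nCustomers
    row = data(r, :);
    countLoop = countLoop + (any(row == pair(1)) && any(row == pair(2)));
end
tLoop = toc;

% Variant 2: compare the whole matrix at once
tic;
countVec = sum(any(data == pair(1), 2) & any(data == pair(2), 2));
tVec = toc;

fprintf('loop: %.4f s, vectorised: %.4f s, counts %d / %d\n', ...
        tLoop, tVec, countLoop, countVec);
```

Both variants should report the same count. For a per-line breakdown of the full function, MATLAB's profiler (`profile on`, run the code, then `profile viewer`) is the more direct tool.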
Thanks for looking into it.
What I ended up doing:
function list = preprocess(subjectArray, combinations)
% =========================================================================
%
% Creates a list which indicates how often an article-combination given by
% combinations is present in the array of Customers
%
% =========================================================================
%
% preprocesses the array; Finds the frequency of articles
% subjectArray - Array that contains customer data
% combinations - The article combinations to be found
%
% =========================================================================
[countCustomers,maxSizeCustomers] = size(subjectArray);
[countCombinations,sizeCombinations] = size(combinations);
list=zeros(1,countCombinations);
if sizeCombinations == 1
    % Simple counting. Assumes combinations lists the article IDs 0,1,2,...
    % in order, so article x is counted at index x+1.
    for i = 1 : countCustomers
        for j = 1 : maxSizeCustomers
            x = subjectArray(i,j);
            if isnan(x), break; end   % NaN padding: no more articles in this row
            list(x+1) = list(x+1) + 1;
        end
    end
else
    for i = 1:countCombinations
        % matches(r,c) is 1 where customer r's c-th article belongs to
        % combination i (renamed from "logical", which shadows the built-in)
        matches = zeros(size(subjectArray));
        for j = 1:sizeCombinations
            matches = matches + (subjectArray == combinations(i,j));
        end
        % A customer contains the set when all of its items matched
        list(i) = sum(sum(matches,2) == sizeCombinations);
    end
end
end
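A different way to organise the same count, sketched here under assumptions of my own rather than taken from the function above: if the article IDs are non-negative integers, the customer array can be converted once into a logical customer-by-article matrix, after which each combination becomes a column lookup. The names `present` and `counts` are hypothetical.

```matlab
toyset = [
    0, 1, 2,   3,   4,   5,   6,   7,   8,   9;
    5, 6, 7, NaN, NaN, NaN, NaN, NaN, NaN, NaN;
    5, 6, 7, NaN, NaN, NaN, NaN, NaN, NaN, NaN;
    1, 6, 7,   9,  10,  11, NaN, NaN, NaN, NaN;
    2, 4, 8,  11,  12, NaN, NaN, NaN, NaN, NaN];
combinations = [5, 6; 5, 7; 6, 7];         % example 2-item sets from the toy data

nCustomers = size(toyset, 1);
maxId = max(toyset(:));                    % max skips the NaN padding by default

% present(r, id+1) is true when customer r bought article id
% (assumption: IDs are integers >= 0, hence the +1 offset).
present = false(nCustomers, maxId + 1);
mask = ~isnan(toyset);
[rowIdx, ~] = find(mask);                  % customer row of every real entry
ids = toyset(mask);                        % article IDs, same column-major order
present(sub2ind(size(present), rowIdx, ids + 1)) = true;

% A combination is present for a customer if all of its columns are true.
counts = zeros(1, size(combinations, 1));
for k = 1:size(combinations, 1)
    counts(k) = sum(all(present(:, combinations(k, :) + 1), 2));
end
```

Dividing `counts` by `nCustomers` gives the support of each combination. With a wide range of article IDs, a `sparse` logical matrix may be the more practical choice; I have not measured this on the real data.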
Thanks for all the support!