matlab - Unable to work out the ground truth in a database when calculating mean average precision and recall in MATLAB

Question

Suppose I have datasets of the following sizes:

train = 500,000 * 960  %number of training samples (vector) each of 960 length

B_base = 1000000*960 %number of base samples  (vector) each of 960 length


Query = 1000*960  %number of query samples  (vector) each of 960 length

truth_nn = 1000*100

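truth_nn contains the ground-truth neighbors in the form of the pre-computed k nearest neighbors and their squared Euclidean distances. So the columns of truth_nn represent the k = 100 nearest neighbors. I am finding it difficult to apply the nearest-neighbor search in my code snippet. Can somebody please show how to apply the ground-truth neighbors truth_nn to find the mean average precision and recall?

It would be very helpful if somebody could show a small example by creating any data matrix, query matrix, and ground-truth neighbors in the form of the pre-computed k nearest neighbors and their squared Euclidean distances. I tried creating a sample database.

Suppose the base data is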

B_base = [1 1; 2 2; 3 2; 4 4; 5 6];

and the query data is

 Query = [1 1; 2 1; 6 2];

[neighbors, distances] = knnsearch(B_base,Query,'k',2);

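would find the 2 nearest neighbors.

Question 1: How do I create the ground-truth data containing the ground-truth neighbors and the pre-computed k nearest-neighbor distances? This is called mean average precision recall. I tried to implement the nearest-neighbor search and the average precision recall as shown below, but I am not sure how to apply the ground-truth table.

Question 2:

I am trying to apply the k-nearest-neighbor search by first converting the real-valued features to binary.

I cannot work out how to apply the k-nearest-neighbor search for different values of k (10, 20, 50), nor how to check how much of the data is correctly recalled using the GIST database. In the GIST truth_nn() file, when I specify truth_nn(i,1:k) for query vector i, the function AveragePrecision throws an error. So if somebody could show, using any sample ground truth similar in structure to the one in GIST, how to specify k correctly and compute the average precision recall, I would be able to apply that solution to the GIST database. This is my approach so far; it would be a great help if the correct way were shown on any example, since I could then relate it to the GIST database. The problem is how to find the neighbors from the ground truth and compare them with the neighbors obtained after sorting the distances.

I am also interested in how to apply pdist2() instead of the current distance calculation, because it takes a long time.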

numQueryVectors = size(Query,1);
numDataVectors  = size(B_base,1);
k = 50;

% Calculate distances
for i = 1:numQueryVectors
    dist = sum((repmat(Query(i,:),numDataVectors,1) - B_base).^2, 2);
    [sortval, sortpos] = sort(dist,'ascend');
    neighborIds(i,:) = sortpos(1:k);
    neighborDistances(i,:) = sqrt(sortval(1:k));
end

% Sorting calculated nearest neighbor distances for k = 50

% HOW DO I SPECIFY k = 50 in the ground truth, truth_nn
for i = 1:numQueryVectors
    AP(i) = AveragePrecision(neighborIds(i,:), truth_nn(i,:));
end
mAP = mean(AP);


function ap = AveragePrecision(rank_id, truth_id)
    truth_num = length(truth_id);
    truth_pos = zeros(truth_num,1);

    for j = 1:50  %% for k = 50 nearest neighbors
        truth_pos(j) = find(rank_id == truth_id(j));
    end
    truth_pos = sort(truth_pos, 'ascend');

    % compute average precision as the area below the recall-precision curve
    ap = 0;
    delta_recall = 1/truth_num;
    for j = 1:truth_num
        p = j/truth_pos(j);
        ap = ap + p*delta_recall;
    end
end

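Update: Based on the solution, I tried to calculate the average precision using the formula given here and the reference code. However, I am not sure whether my approach is correct, because the theory says I need to rank the returned results by index, and I do not fully understand this. Average precision is needed to judge the quality of the retrieval algorithm.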

precision = positives/total_data;
recall = positives /(positives+negatives);
truth_pos = sort(positives, 'ascend');
truth_num = length(truth_pos);

ap = 0;
delta_recall = 1/truth_num;
for j=1:truth_num
    p = j/truth_pos(j);
    ap = ap + p*delta_recall;
end
ap

The result is ap = Inf, with positives = 0 and negatives = 150. This means knnsearch() is not working at all.

score 1 · Accepted Answer

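I think you are doing extra work. This process is simple in MATLAB and can operate on whole arrays at once. That should be faster than for loops and easier to read.

If there are no errors, your truth_nn and neighbors should contain the same data, with one row per query. knnsearch already returns its results sorted in ascending order of distance, so column 1 is the nearest neighbor, column 2 the second nearest, column 3 the third nearest, and so on. There is no need to sort the data again.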
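To make that column ordering concrete, here is a minimal sketch of how a ground-truth table with this layout could be built by brute force. The names truth_idx, truth_sqdist and sqDist are hypothetical, k = 2 is an assumed value, and the toy data is the one from the question. It uses only core functions, so it does not depend on knnsearch.

```matlab
% Toy data taken from the question.
B_base = [1 1; 2 2; 3 2; 4 4; 5 6];
Query  = [1 1; 2 1; 6 2];
k = 2;   % number of ground-truth neighbors to keep (assumed value)

% Squared Euclidean distances, one row per query and one column per base
% vector, using ||q - b||^2 = ||q||^2 + ||b||^2 - 2*q*b'.
% With the Statistics and Machine Learning Toolbox,
% pdist2(Query, B_base, 'squaredeuclidean') should return the same matrix.
sqDist = bsxfun(@plus, sum(Query.^2, 2), sum(B_base.^2, 2)') - 2 * (Query * B_base');
sqDist = max(sqDist, 0);   % clip tiny negative values caused by round-off

% Sort each row so that column 1 holds the nearest base vector.
[sortedDist, sortedIdx] = sort(sqDist, 2, 'ascend');
truth_idx    = sortedIdx(:, 1:k);    % ground-truth neighbor indices
truth_sqdist = sortedDist(:, 1:k);   % their squared Euclidean distances

disp(truth_idx);
disp(truth_sqdist);
```

For the toy data this gives truth_idx = [1 2; 1 2; 4 3] with squared distances [0 2; 1 1; 8 9]. With a base set of a million vectors the full distance matrix may not fit in memory, so the queries would probably have to be processed in batches.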

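Just compare truth_nn with neighbors to get your statistics. Here is a simple example to show you how the program should work. It will not work on your data without some modification.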

%in your example this is provided, I created my own
truth_nn = [1,2;
            1,3;
            4,3];

B_base = [1 1; 2 2; 3 2; 4 4; 5 6];
Query = [1 1; 2 1; 6 2];

%performs k-nearest-neighbor search
num_neighbors = 2;
[neighbors, distances] = knnsearch(B_base,Query,'k',num_neighbors);

%--- output---
% neighbors = [1,2; 
%              1,2; notice this doesn't match truth_nn 1,3
%              4,3]
% distances = [     0    1.4142;
%              1.0000    1.0000;
%              2.8284    3.0000];

%computes statistics; nnz counts the number of nonzero elements, so the
%first call counts every entry of neighbors that matches truth_nn and the
%second counts every entry that does not
%NOTE1: the indexing on truth_nn (:,1:num_neighbors) says all rows
%       but only use the first num_neighbors columns. This should
%       prevent the dimension mismatch error you were getting
positives = nnz(neighbors == truth_nn(:,1:num_neighbors));     %result = 5
negatives = nnz(neighbors ~= truth_nn(:,1:num_neighbors));     %result = 1
%NOTE2: I've switched this from truth_nn to neighbors, this helps
%       when you change num_neighbors
total_data = numel(neighbors);               %result = 6

percent_incorrect = 100*(negatives / total_data);   % 16.6667
percent_correct   = 100*(positives / total_data);   % 83.3333
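The comparison above counts a match only when the same index sits in the same column. If order within the top k should not matter, recall and average precision can be computed per query with ismember instead. The following is a minimal sketch under that assumption, not a reference implementation: the names recallAtK, AP, mAP and meanRecall are hypothetical, and average precision is taken as the sum of the precision values at the ranks where a true neighbor is retrieved, divided by k, which mirrors the formula in the question's AveragePrecision function. A true neighbor that was not retrieved simply contributes zero, so no division by zero or empty find result can occur.

```matlab
% Toy inputs from the example above; truth_nn is hand-made, as in the answer.
truth_nn  = [1 2; 1 3; 4 3];   % ground-truth neighbor ids, nearest first
neighbors = [1 2; 1 2; 4 3];   % ids returned by the search for k = 2
k = 2;                         % evaluate against the first k truth columns

numQueries = size(neighbors, 1);
recallAtK  = zeros(numQueries, 1);
AP         = zeros(numQueries, 1);

for i = 1:numQueries
    truth_k = truth_nn(i, 1:k);                 % restrict the truth to k neighbors
    hits = ismember(neighbors(i, :), truth_k);  % true where a returned id is a true neighbor

    % Recall at k: fraction of the k true neighbors that were retrieved.
    recallAtK(i) = nnz(hits) / k;

    % Average precision: precision at every rank where a hit occurs,
    % summed and divided by the number of true neighbors.
    ranks = find(hits);                          % ranks of the hits, ascending
    precisionAtHits = (1:numel(ranks)) ./ ranks;
    AP(i) = sum(precisionAtHits) / k;
end

mAP = mean(AP);
meanRecall = mean(recallAtK);
```

On the toy data the second query retrieves only one of its two true neighbors, so its recall and average precision are both 0.5 and the other two queries score 1. To evaluate at k = 10, 20 or 50 on a larger table, the same loop can be rerun with neighbors and truth_nn both limited to their first k columns.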
