function rate = KNN(Train_data,Train_label,Test_data,Test_label,k,Distance_mark)
% K-Nearest-Neighbor classifier (K-NN classifier)
%Input:
% Train_data, Test_data are the training set and the test set,
% respectively. (Each row is a data point.)
% Train_label, Test_label are column vectors. They are the labels of the
% training set and the test set, respectively.
% k is the number of nearest neighbors.
% Distance_mark: ['Euclidean', 'L2' | 'L1' | 'Cos']
% 'Cos' stands for the cosine distance.
%Output:
% rate: accuracy of the K-NN classifier
%
% Examples:
%
% %Classification problem with three classes
% A = rand(50,300);
% B = rand(50,300)+2;
% C = rand(50,300)+3;
% % label vector for the three classes
% gnd = [ones(300,1);2*ones(300,1);3*ones(300,1)];
% fea = [A B C]';
% trainIdx = [1:150,301:450,601:750]';
% testIdx = [151:300,451:600,751:900]';
% fea_Train = fea(trainIdx,:);
% gnd_Train = gnd(trainIdx);
% fea_Test = fea(testIdx,:);
% gnd_Test = gnd(testIdx);
% rate = KNN(fea_Train,gnd_Train,fea_Test,gnd_Test,1)
%
%
%
%Reference:
%
% If you use this Matlab code, we would appreciate it very much if you cite the following paper:
% Jie Gui, Tongliang Liu, Dacheng Tao, Zhenan Sun, Tieniu Tan, "Representative Vector Machines: A Unified Framework for Classical Classifiers", IEEE
% Transactions on Cybernetics.
%This code was written by Gui Jie on the evening of 2009/03/11.
%If you find any bugs in the code, feel free to contact me.
if nargin < 5
error('Not enough arguments!');
elseif nargin < 6
Distance_mark='L2';
end
[n, dim] = size(Test_data); % number of test samples and feature dimension
train_num = size(Train_data, 1); % number of training samples
% Normalize each feature to have zero mean and unit variance.
% If you need the following four lines, you can uncomment them.
% M = mean(Train_data); % mean & std of the training data set
% S = std(Train_data);
% Train_data = (Train_data - ones(train_num, 1) * M)./(ones(train_num, 1) * S); % normalize training data set
% Test_data = (Test_data-ones(n,1)*M)./(ones(n,1)*S); % normalize data
U = unique(Train_label); % class labels
nclasses = length(U);%number of classes
Result = zeros(n, 1);
Count = zeros(nclasses, 1);
dist=zeros(train_num,1);
for i = 1:n
% compute distances between test data and all training data and
% sort them
test=Test_data(i,:);
for j=1:train_num
train = Train_data(j,:);
V = test - train;
switch Distance_mark
case {'Euclidean', 'L2'}
dist(j,1) = norm(V,2); % Euclidean (L2) distance
case 'L1'
dist(j,1) = norm(V,1); % L1 (city-block) distance
case 'Cos'
dist(j,1) = acos(test*train'/(norm(test,2)*norm(train,2))); % cosine distance (angle between the two vectors)
otherwise
dist(j,1) = norm(V,2); % default: Euclidean distance
end
end
[Dummy Inds] = sort(dist);
% compute the class labels of the k nearest samples
Count(:) = 0;
for j = 1:k
ind = find(Train_label(Inds(j)) == U); % find the label of the j-th nearest neighbor
Count(ind) = Count(ind) + 1;
end % Count: the number of the k nearest neighbors that belong to each class
% determine the class of the data sample
[dummy ind] = max(Count);
Result(i) = U(ind);
end
correctnumbers=length(find(Result==Test_label));
rate=correctnumbers/n;
-------------------------------------------------- The code ends here --------------------------------------------------
The difference between cosine distance and cosine similarity
Cosine similarity is the basic measure most commonly used in traditional document classification to gauge the distance between documents. It measures the distance between two d-dimensional vectors by the difference in their angle; the resulting value lies between 0 and 1: the closer the angle between the two vectors, the closer the value is to 1, and otherwise the closer it is to 0. Suppose there are two points a = [a1, a2, ..., ad] and b = [b1, b2, ..., bd] in d-dimensional space; then their cosine similarity can be expressed as:
cosineSimilarity(a,b) = dot(a,b) / (norm(a)*norm(b)) [I think this should be called cosineSimilarity, not cosineDistance. The larger the similarity, the smaller the distance should be. For example, when the angle between a and b is 0, the two vectors are most similar, so the similarity is largest and the distance is smallest.]
dot(a,b) denotes the inner product of a and b. Since the vector inner product is defined as a·b = |a| × |b| × cosθ (in general θ ∈ [0, π], see http://baike.baidu.com/view/1485493.htm), the value defined above is not restricted to 0 ~ 1 but lies in [-1, 1]. There are two ways to turn it into a distance:
(1) The code above is correct: it uses acos to convert the cosine into an angle in [0, π]. The distance does not have to be restricted to 0 ~ 1; my code maps it to [0, π], and a larger value means a larger distance;
(2) cosineDistance(a,b) = 1 - cosineSimilarity(a,b) = 1 - dot(a,b) / (norm(a)*norm(b)). The range of cosineDistance is then [0, 2].
Example:
a = [1 1 1]; b = [1 0 0];
cosineSimilarity = dot(a,b) / (norm(a)*norm(b))
cosineSimilarity =
0.5774
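For completeness, here is a small sketch contrasting the two conversions above on the same pair of vectors (the acos form used in the KNN code and the 1 - similarity form); the variable names are only illustrative:
a = [1 1 1]; b = [1 0 0];
cosSim = dot(a, b) / (norm(a)*norm(b)); % cosine similarity, in [-1, 1]
distAngle = acos(cosSim); % option (1): the angle, in [0, pi], as used in the KNN code above
distOneMinus = 1 - cosSim; % option (2): 1 - similarity, in [0, 2]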
See p. 3 of "Minimal Local Reconstruction Error Measure Based Discriminant Feature Extraction and Classification":
For robustness, before classification, we generally need to normalize feature vectors, making the length of each feature vector to be 1, that is, x -> x /|| x || .
The normalized Euclidean distance is equivalent to the cosine distance.
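A quick numerical check of that equivalence (a sketch with made-up vectors; after unit-length normalization, ||x - y||^2 = 2 - 2*cos(x, y), i.e. twice the "1 - similarity" form of the cosine distance):
x = rand(1, 5); y = rand(1, 5);
xn = x / norm(x); yn = y / norm(y); % normalize both vectors to unit length
euclidSq = norm(xn - yn)^2; % squared Euclidean distance between the normalized vectors
cosDist = 1 - dot(x, y) / (norm(x)*norm(y)); % cosine distance, option (2) above
% euclidSq equals 2*cosDist up to numerical precision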
(1) Lin Zhu, a junior labmate, suggested that replacing the loops with a pre-computed distance matrix would save time, since loops are slow in Matlab; for large sample sets, however, the loops are still required, otherwise the computation runs out of memory. I recall that Jinsong, who taught our course, also provided a KNN code, and his was likewise implemented with loops. Matlab has a built-in function knnclassify; the code SPP_1NN.m accompanying the paper "Sparsity preserving projections" uses that function. On ASLAN, my KNN and knnclassify give exactly the same recognition rate. A vectorized distance-matrix sketch is given after this note.
(2) An extremely important point: the fourth line from the end of the program must not be written as Result(i) = ind. That happens to work for data sets such as Yale, whose labels are simply 1, 2, 3, ..., but it fails for binary classification with labels 1 and -1. SRC_QC and SRC_QC2 are similar: their third line from the end must not be Result(i) = index but Result(i) = classLabel(index). Originally only that one place was fixed; in fact, line 50 of SRC_QC2 and line 42 of SRC_QC also need ii changed to classLabel(ii). It was precisely this bug that produced the erroneous result that SRC has a 50% error rate on ASLAN with zero variance. The corrected SRC_QC2 and SRC_QC programs are in the ASLAN directory.
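A minimal sketch of the distance-matrix idea in (1), assuming the Euclidean distance and the variable names used in the function above; for very large sets this n-by-train_num matrix may itself exhaust memory, which is why the loop version is kept:
% squared Euclidean distances between every test point and every training point (n x train_num)
D2 = bsxfun(@plus, sum(Test_data.^2, 2), sum(Train_data.^2, 2)') - 2*Test_data*Train_data';
[dummy, Inds] = sort(sqrt(max(D2, 0)), 2); % sort each row in ascending order; nearest neighbors come first
nearestLabels = Train_label(Inds(:, 1:k)); % n x k matrix with the labels of the k nearest training points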