http://blog.csdn.net/dark_scope/article/details/9421061
==========================================================================================
I have been studying Deep Learning for a while now, reading plenty of blogs and papers.
To be honest, though, that approach is light on actual implementation: my computer is not very powerful, and I am not yet capable of writing a toolbox of my own.
I have only written some code within existing frameworks while following Andrew Ng's UFLDL tutorial (that code is on github).
Later I found a MATLAB Deep Learning toolbox whose code is quite simple, which makes it well suited for studying the algorithms.
A MATLAB implementation also dispenses with much of the data-structure code, so the algorithmic ideas come through very clearly.
So I want to walk through this toolbox's code to consolidate what I have learned, and to lay the groundwork for hands-on work later.
(This article reads the algorithms purely from the code's point of view; for the theory you still need to read the papers.
I will give the names of some relevant papers along the way. The goal is to trace the algorithmic flow, not to dig into the underlying theory and formulas.)
==========================================================================================
Code used: DeepLearnToolbox. Thanks to the toolbox's author.
==========================================================================================
Chapter one starts by analyzing the NN (neural network) code, since this is the overall framework for all of deep learning; see UFLDL.
==========================================================================================
First look at \tests\test_example_NN.m. Skipping the data normalization part, the key lines are:
(To get the comments syntax-highlighted, I changed the % in the MATLAB code to //.)
nn = nnsetup([784 100 10]);
opts.numepochs = 1;   // Number of full sweeps through data
opts.batchsize = 100; // Take a mean gradient step over this many samples
[nn, L] = nntrain(nn, train_x, train_y, opts);
[er, bad] = nntest(nn, test_x, test_y);
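Before moving on, here is a minimal sketch of the data shapes this example expects (toy data with hypothetical sizes; the real script loads MNIST, whose inputs have 784 = 28*28 features):

m = 200;                                       // number of samples (must be divisible by batchsize)
train_x = rand(m, 784);                        // one flattened image per row, values in [0,1]
labels  = randi(10, m, 1);                     // class indices 1..10
train_y = full(sparse(1:m, labels, 1, m, 10)); // one-hot targets, m x 10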
These few simple steps train an NN. Evidently the most important functions here are nnsetup, nntrain, and nntest,
so let's analyze them one by one, starting with \NN\nnsetup.m
nnsetup
function nn = nnsetup(architecture)
    //First, get the overall structure of the network from the architecture argument. nn.n is the number of layers; refer to the sample call nnsetup([784 100 10]) above.

    nn.size = architecture;
    nn.n = numel(nn.size);
    //Next comes a pile of parameters; each will be explained when it is actually used
    nn.activation_function = 'tanh_opt'; // Activation functions of hidden layers: 'sigm' (sigmoid) or 'tanh_opt' (optimal tanh).
    nn.learningRate = 2; // learning rate Note: typically needs to be lower when using 'sigm' activation function and non-normalized inputs.
    nn.momentum = 0.5; // Momentum
    nn.scaling_learningRate = 1; // Scaling factor for the learning rate (each epoch)
    nn.weightPenaltyL2 = 0; // L2 regularization
    nn.nonSparsityPenalty = 0; // Non sparsity penalty
    nn.sparsityTarget = 0.05; // Sparsity target
    nn.inputZeroMaskedFraction = 0; // Used for Denoising AutoEncoders
    nn.dropoutFraction = 0; // Dropout level (http://www.cs.toronto.edu/~hinton/absps/dropout.pdf)
    nn.testing = 0; // Internal variable. nntest sets this to one.
    nn.output = 'sigm'; // output unit 'sigm' (=logistic), 'softmax' and 'linear'
    //Initialize each layer's structure: three fields W, vW, and p, of which W holds the main parameters
    //vW is the temporary variable used when updating the parameters (momentum), and p is the so-called sparsity (more on this when we reach that code)
    for i = 2 : nn.n
        // weights and weight momentum
        nn.W{i - 1} = (rand(nn.size(i), nn.size(i - 1)+1) - 0.5) * 2 * 4 * sqrt(6 / (nn.size(i) + nn.size(i - 1)));
        nn.vW{i - 1} = zeros(size(nn.W{i - 1}));

        // average activations (for use with sparsity)
        nn.p{i} = zeros(1, nn.size(i));
    end
end
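One line worth pausing on is the weight initialization: each W{i-1} is drawn uniformly from ±4*sqrt(6/(fan_in+fan_out)), which matches the normalized initialization of Glorot & Bengio ("Understanding the difficulty of training deep feedforward networks"), including the factor of 4 they suggest for sigmoid-family units. A small sketch of what that line produces for the first layer of nnsetup([784 100 10]):

fan_in = 784; fan_out = 100;
r = 4 * sqrt(6 / (fan_out + fan_in));          // half-width of the uniform range
W = (rand(fan_out, fan_in + 1) - 0.5) * 2 * r; // the extra column is the bias weight
// (rand - 0.5) * 2 is uniform in [-1, 1], so W is uniform in [-r, r]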
nntrain
That is roughly the setup process; next comes training. Open \NN\nntrain.m.
We skip the code that validates the input arguments and go straight to the key part.
For the denoising part, see the paper: Extracting and Composing Robust Features with Denoising Autoencoders.
m = size(train_x, 1);
//m is the number of training samples
//note that we set opts in the calling code; batchsize is the batch size used for the gradient steps
batchsize = opts.batchsize; numepochs = opts.numepochs;
numbatches = m / batchsize; //number of batches
assert(rem(numbatches, 1) == 0, 'numbatches must be a integer');
L = zeros(numepochs*numbatches,1);
n = 1;
//numepochs is the number of passes over the training data
for i = 1 : numepochs
    tic;
    kk = randperm(m);
    //train the batches in shuffled order; randperm(m) generates a random permutation of 1 to m
    for l = 1 : numbatches
        batch_x = train_x(kk((l - 1) * batchsize + 1 : l * batchsize), :);
        //Add noise to input (for use in denoising autoencoder)
        //adding noise is the part the denoising autoencoder needs
        //see the paper "Extracting and Composing Robust Features with Denoising Autoencoders"
        //concretely, a fraction of the entries in the training samples is set to 0; inputZeroMaskedFraction gives that fraction
        if(nn.inputZeroMaskedFraction ~= 0)
            batch_x = batch_x.*(rand(size(batch_x))>nn.inputZeroMaskedFraction);
        end
        batch_y = train_y(kk((l - 1) * batchsize + 1 : l * batchsize), :);
        //these three functions:
        //nnff does the feedforward pass, nnbp does backpropagation, nnapplygrads performs the gradient-descent update
        //we analyze their code below
        nn = nnff(nn, batch_x, batch_y);
        nn = nnbp(nn);
        nn = nnapplygrads(nn);
        L(n) = nn.L;
        n = n + 1;
    end

    t = toc;
    if ishandle(fhandle)
        if opts.validation == 1
            loss = nneval(nn, loss, train_x, train_y, val_x, val_y);
        else
            loss = nneval(nn, loss, train_x, train_y);
        end
        nnupdatefigures(nn, fhandle, loss, opts, i);
    end

    disp(['epoch ' num2str(i) '/' num2str(opts.numepochs) '. Took ' num2str(t) ' seconds' '. Mean squared error on training set is ' num2str(mean(L((n-numbatches):(n-1))))]);
    nn.learningRate = nn.learningRate * nn.scaling_learningRate;
end
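To make the indexing concrete, here is a standalone sketch (toy sizes, hypothetical variables) of the shuffling, batch slicing, and denoising mask used above:

m = 6; batchsize = 3; numbatches = m / batchsize;
train_x = (1:m)' * [1 1];                  // 6 samples, 2 features each
kk = randperm(m);                          // random permutation, e.g. [4 1 6 3 2 5]
for l = 1 : numbatches
    idx = kk((l - 1) * batchsize + 1 : l * batchsize);
    batch_x = train_x(idx, :);             // the l-th batch, rows in shuffled order
    fraction = 0.5;                        // plays the role of inputZeroMaskedFraction
    mask = rand(size(batch_x)) > fraction; // each entry kept with probability 1 - fraction
    batch_x = batch_x .* mask;             // surviving entries unchanged, the rest zeroed
end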
Below we analyze the three functions nnff, nnbp, and nnapplygrads.
nnff
nnff performs the feedforward pass, which is actually very simple: just run the whole network forward once.
Of course, the dropout and sparsity computations happen along the way.
For details see the paper "Improving Neural Networks with Dropout" and UFLDL's Autoencoders and Sparsity.
function nn = nnff(nn, x, y)
    //NNFF performs a feedforward pass
    // nn = nnff(nn, x, y) returns a neural network structure with updated
    // layer activations, error and loss (nn.a, nn.e and nn.L)

    n = nn.n;
    m = size(x, 1);

    x = [ones(m,1) x];
    nn.a{1} = x;

    //feedforward pass
    for i = 2 : n-1
        //run the forward computation with whichever activation function was chosen
        //look back at the activation_function parameter set in nnsetup
        //sigm is the sigmoid function; tanh_opt is a tanh variant that this toolbox seems to have tweaked slightly:
        //tanh_opt is 1.7159*tanh(2/3.*A)
        switch nn.activation_function
            case 'sigm'
                // Calculate the unit's outputs (including the bias term)
                nn.a{i} = sigm(nn.a{i - 1} * nn.W{i - 1}');
            case 'tanh_opt'
                nn.a{i} = tanh_opt(nn.a{i - 1} * nn.W{i - 1}');
        end

        //the dropout computation; dropoutFraction is a parameter that can be set in nnsetup
        if(nn.dropoutFraction > 0)
            if(nn.testing)
                nn.a{i} = nn.a{i}.*(1 - nn.dropoutFraction);
            else
                nn.dropOutMask{i} = (rand(size(nn.a{i}))>nn.dropoutFraction);
                nn.a{i} = nn.a{i}.*nn.dropOutMask{i};
            end
        end
        //compute sparsity; nonSparsityPenalty is the penalty coefficient on units that miss sparsityTarget
        //calculate running exponential activations for use with sparsity
        if(nn.nonSparsityPenalty>0)
            nn.p{i} = 0.99 * nn.p{i} + 0.01 * mean(nn.a{i}, 1);
        end

        //Add the bias term
        nn.a{i} = [ones(m,1) nn.a{i}];
    end
    switch nn.output
        case 'sigm'
            nn.a{n} = sigm(nn.a{n - 1} * nn.W{n - 1}');
        case 'linear'
            nn.a{n} = nn.a{n - 1} * nn.W{n - 1}';
        case 'softmax'
            nn.a{n} = nn.a{n - 1} * nn.W{n - 1}';
            nn.a{n} = exp(bsxfun(@minus, nn.a{n}, max(nn.a{n},[],2)));
            nn.a{n} = bsxfun(@rdivide, nn.a{n}, sum(nn.a{n}, 2));
    end
    //error and loss
    //compute the error
    nn.e = y - nn.a{n};

    switch nn.output
        case {'sigm', 'linear'}
            nn.L = 1/2 * sum(sum(nn.e .^ 2)) / m;
        case 'softmax'
            nn.L = -sum(sum(y .* log(nn.a{n}))) / m;
    end
end
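For reference, the two hidden-layer activations are plain elementwise functions. Here are sketches consistent with the formulas quoted in the comments above (the toolbox ships these as sigm.m and tanh_opt.m; treat the exact bodies here as my reconstruction):

function X = sigm(P)
    // logistic sigmoid, elementwise
    X = 1 ./ (1 + exp(-P));
end

function f = tanh_opt(A)
    // the scaled tanh recommended in LeCun's "Efficient BackProp"
    f = 1.7159 * tanh(2/3 .* A);
end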
nnbp
Code: \NN\nnbp.m
nnbp carries out the back propagation. The process is fairly standard and essentially matches the Neural Network material in UFLDL.
The parts worth noting are, again, dropout and sparsity:
if(nn.nonSparsityPenalty>0)
    pi = repmat(nn.p{i}, size(nn.a{i}, 1), 1);
    sparsityError = [zeros(size(nn.a{i},1),1) nn.nonSparsityPenalty * (-nn.sparsityTarget ./ pi + (1 - nn.sparsityTarget) ./ (1 - pi))];
end

// Backpropagate first derivatives
if i+1==n // in this case in d{n} there is not the bias term to be removed
    d{i} = (d{i + 1} * nn.W{i} + sparsityError) .* d_act; // Bishop (5.56)
else // in this case in d{i} the bias term has to be removed
    d{i} = (d{i + 1}(:,2:end) * nn.W{i} + sparsityError) .* d_act;
end

if(nn.dropoutFraction>0)
    d{i} = d{i} .* [ones(size(d{i},1),1) nn.dropOutMask{i}];
end
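For context, the d_act factor multiplied in above is the derivative of the chosen activation, conveniently expressed through the stored activations themselves. A sketch of how nnbp computes it, based on my reading of the file (check \NN\nnbp.m for the exact code):

switch nn.activation_function
    case 'sigm'
        d_act = nn.a{i} .* (1 - nn.a{i});                       // s' = s(1-s)
    case 'tanh_opt'
        d_act = 1.7159 * 2/3 * (1 - 1/(1.7159)^2 * nn.a{i}.^2); // f' = 1.7159*(2/3)*(1 - (f/1.7159)^2)
end
// afterwards the gradients are accumulated roughly as
// nn.dW{i} = (d{i + 1}' * nn.a{i}) / size(d{i + 1}, 1);
// (with the bias column of d stripped for the inner layers)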
This is just the implementation; d{i} in the code is this layer's delta value, which UFLDL explains.
dW{i} is essentially the computed gradient, except that some terms are added and some adjustments made afterwards.
For the underlying theory see the paper "Improving Neural Networks with Dropout" and UFLDL's Autoencoders and Sparsity.
nnapplygrads
Code file: \NN\nnapplygrads.m
for i = 1 : (nn.n - 1)
    if(nn.weightPenaltyL2>0)
        dW = nn.dW{i} + nn.weightPenaltyL2 * nn.W{i};
    else
        dW = nn.dW{i};
    end

    dW = nn.learningRate * dW;

    if(nn.momentum>0)
        nn.vW{i} = nn.momentum*nn.vW{i} + dW;
        dW = nn.vW{i};
    end

    nn.W{i} = nn.W{i} - dW;
end
This part is simple: nn.weightPenaltyL2 is the weight decay term, another parameter that can be set in nnsetup.
If it is nonzero, the weight penalty is added in to guard against overfitting; the step is then blended with the momentum term, and finally nn.W{i} is updated.
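In equation form, with learning rate eta, L2 penalty lambda, and momentum mu, the update performed above is:

    v <- mu*v + eta*(dW + lambda*W)
    W <- W - v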
nntest
nntest could not be simpler: it just calls nnpredict and compares the result against the test set.
function [er, bad] = nntest(nn, x, y)
    labels = nnpredict(nn, x);
    [~, expected] = max(y,[],2);
    bad = find(labels ~= expected);
    er = numel(bad) / size(x, 1);
end
nnpredict
Code file: \NN\nnpredict.m
function labels = nnpredict(nn, x)
    nn.testing = 1;
    nn = nnff(nn, x, zeros(size(x,1), nn.size(end)));
    nn.testing = 0;

    [~, i] = max(nn.a{end},[],2);
    labels = i;
end
Still very simple: predict is just one nnff pass to get the final output.
max(nn.a{end},[],2) returns each row's maximum value along with the column it occurs in, so labels ends up holding the predicted class indices.
(This test seems to be written specifically for classification problems; in general, the nnff output is all you need.)
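As a tiny concrete example of that max call (the second output is the column index of each row's maximum):

a = [0.1 0.7 0.2;
     0.6 0.3 0.1];
[v, labels] = max(a, [], 2);   // v = [0.7; 0.6], labels = [2; 1]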
Summary
Overall, the neural network code is conventional and easy to follow, essentially matching
the material in UFLDL,
with only the dropout and denoising parts added on top.
This article does not aspire to explain all of it thoroughly; it just lays out a route you can follow through the code, deepening your understanding of the algorithms and your ability to apply them.