3 Convolutional Neural Networks
Typically, convolutional layers are interspersed with sub-sampling layers to reduce computation time and to gradually build up further spatial and configural invariance. A small sub-sampling factor is desirable, however, in order to maintain specificity at the same time. Of course, this idea is not new, but the concept is both simple and powerful. The mammalian visual cortex and models thereof [12, 8, 7] draw heavily on these themes, and auditory neuroscience has revealed in the past ten years or so that these same design paradigms can be found in the primary and belt auditory areas of the cortex in a number of different animals [6, 11, 9]. Hierarchical analysis and learning architectures may yet be the key to success in the auditory domain.
3.1 Convolution Layers
Let's move forward with deriving the backpropagation updates for convolutional layers in a network. At a convolution layer, the previous layer's feature maps are convolved with learnable kernels and put through the activation function to form the output feature map. Each output map may combine convolutions with multiple input maps. In general, we have that

$$x_j^{\ell} = f\Big(\sum_{i \in M_j} x_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}\Big),$$
where M_j represents a selection of input maps, and the convolution is of the "valid" border handling type when implemented in MATLAB. Some common choices of input maps include all-pairs or all-triplets, but we will discuss how one might learn combinations below. Each output map is given an additive bias b; however, for a particular output map the input maps will be convolved with distinct kernels. That is to say, if output map j and map k both sum over input map i, then the kernels applied to map i are different for output maps j and k.
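As a concrete illustration, here is a minimal MATLAB sketch of this feed-forward pass, assuming hypothetical names: cell arrays xprev (the previous layer's maps) and k (the kernels), index sets M{j}, a bias vector b, and the logistic sigmoid for f:

    % Feed-forward pass for one convolution layer (hypothetical names).
    for j = 1:numOutputMaps
        u = b(j);                                      % additive bias b_j
        for i = M{j}                                   % selected input maps M_j
            u = u + conv2(xprev{i}, k{i,j}, 'valid');  % "valid" convolution
        end
        x{j} = 1 ./ (1 + exp(-u));                     % f(u), logistic sigmoid
    end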
3.1.1 Computing the gradients
We assume that each convolution layer ℓ is followed by a downsampling layer ℓ+1. The backpropagation algorithm says that in order to compute the sensitivity for a unit at layer ℓ, we should first sum over the next layer's sensitivities corresponding to units that are connected to the node of interest in the current layer ℓ, and multiply each of those connections by the associated weights defined at layer ℓ+1. We then multiply this quantity by the derivative of the activation function evaluated at the current layer's pre-activation inputs, u. In the case of a convolutional layer followed by a downsampling layer, one pixel in the next layer's associated sensitivity map δ corresponds to a block of pixels in the convolutional layer's output map. Thus each unit in a map at layer ℓ connects to only one unit in the corresponding map at layer ℓ+1. To compute the sensitivities at layer ℓ efficiently, we can upsample the downsampling layer's sensitivity map to make it the same size as the convolutional layer's map, and then just multiply the upsampled sensitivity map from layer ℓ+1 with the activation derivative map at layer ℓ, element-wise. The weights defined at a downsampling layer map are all equal to β (a constant, see section 3.2), so we just scale the previous step's result by β to finish the computation of δ^ℓ. We can repeat the same computation for each map j in the convolutional layer, pairing it with the corresponding map in the subsampling layer:
$$\delta_j^{\ell} = \beta_j^{\ell+1}\left(f'(u_j^{\ell}) \circ \mathrm{up}(\delta_j^{\ell+1})\right)$$
where up(·) denotes an upsampling operation that simply tiles each pixel in the input horizontally and vertically n times in the output if the subsampling layer subsamples by a factor of n. As we will discuss below, one possible way to implement this function efficiently is to use the Kronecker product.
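As a sketch of the sensitivity computation above (using the kron trick elaborated in section 3.4), with delta_sub, fprime_u, and beta as hypothetical names for δ_j^{ℓ+1}, f′(u_j^ℓ), and β_j^{ℓ+1}:

    up = @(x) kron(x, ones(n));    % up(.): tile each pixel into an n-by-n block
    delta{j} = beta * (fprime_u .* up(delta_sub));  % sensitivity at the conv layer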
Now that we have the sensitivities for a given map, we can immediately compute the bias gradient by simply summing over all the entries in δ_j^ℓ:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v} (\delta_j^{\ell})_{uv}.$$
Finally, the gradients for the kernel weights are computed using backpropagation, except in this case the same weights are shared across many connections. We'll therefore sum the gradients for a given weight over all the connections that mention this weight, just as we did for the bias term:

$$\frac{\partial E}{\partial k_{ij}^{\ell}} = \sum_{u,v} (\delta_j^{\ell})_{uv}\, (p_i^{\ell-1})_{uv} \qquad (7)$$
where (p_i^{ℓ-1})_{uv} is the patch in x_i^{ℓ-1} that was multiplied elementwise by k_{ij}^ℓ during convolution in order to compute the element at (u, v) in the output convolution map x_j^ℓ. At first glance it may appear that we need to painstakingly keep track of which patches in the input map correspond to which pixels in the output map (and its corresponding map of sensitivities), but equation (7) can be implemented in a single line of MATLAB using convolution over the "valid" region of overlap:

$$\frac{\partial E}{\partial k_{ij}^{\ell}} = \mathrm{rot180}\left(\mathrm{conv2}\left(x_i^{\ell-1},\ \mathrm{rot180}(\delta_j^{\ell}),\ \mathrm{'valid'}\right)\right)$$
Here we rotate the δ image in order to perform cross-correlation rather than convolution, and rotate the output back so that when we perform convolution in the feed-forward pass, the kernel will have the expected orientation.
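In code, the one-liner above (together with the bias gradient) might look like the following sketch, where delta{j} and xprev{i} are hypothetical names for the sensitivity and input maps:

    rot180 = @(m) rot90(m, 2);     % rotate a matrix by 180 degrees
    dk{i,j} = rot180(conv2(xprev{i}, rot180(delta{j}), 'valid'));  % dE/dk_ij
    db(j)   = sum(delta{j}(:));    % dE/db_j: sum over the sensitivity map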
3.2 Sub-sampling Layers
A subsampling layer produces downsampled versions of the input maps. If there are N input maps, then there will be exactly N output maps, although the output maps will be smaller. More formally,

$$x_j^{\ell} = f\left(\beta_j^{\ell}\, \mathrm{down}(x_j^{\ell-1}) + b_j^{\ell}\right),$$

where down(·) represents a sub-sampling function. Typically this function will sum over each distinct n-by-n block in the input image so that the output image is n times smaller along both spatial dimensions. Each output map is given its own multiplicative bias β and an additive bias b. We can also simply throw away every other sample in the image [10].
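A minimal sketch of this forward pass, assuming the map's side lengths are divisible by n and the logistic sigmoid for f:

    % Sub-sampling forward pass: sum each distinct n-by-n block, then scale and bias.
    function y = subsample_forward(x, n, beta, b)
        [H, W] = size(x);
        blocks = reshape(x, n, H/n, n, W/n);     % expose the n-by-n blocks
        d = squeeze(sum(sum(blocks, 1), 3));     % down(x): per-block sums
        y = 1 ./ (1 + exp(-(beta * d + b)));     % f(beta * down(x) + b)
    end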
3.2.1 Computing the gradients
The difficulty here lies in computing the sensitivity maps. Once we've got them, the only learnable parameters we need to update are the bias parameters β and b. We will assume that the subsampling layers are surrounded above and below by convolution layers. If the layer following the subsampling layer is a fully connected layer, then the sensitivity maps for the subsampling layer can be computed with the vanilla backpropagation equations introduced in section 2.
When we tried to compute the gradient of a kernel in section 3.1.1, we had to figure out which patch in the input corresponded to a given pixel in the output map. Here, we must figure out which patch in the current layer's sensitivity map corresponds to a given pixel in the next layer's sensitivity map in order to apply a delta recursion that looks something like equation (4). Of course, the weights multiplying the connections between the input patch and the output pixel are exactly the weights of the (rotated) convolution kernel. This is again efficiently implemented using convolution:
$$\delta_j^{\ell} = f'(u_j^{\ell}) \circ \mathrm{conv2}\left(\delta_j^{\ell+1},\ \mathrm{rot180}(k_j^{\ell+1}),\ \mathrm{'full'}\right)$$
As before, we rotate the kernel to make the convolution function perform cross-correlation. Notice that in this case, however, we require the "full" convolution border handling, to borrow again from MATLAB's nomenclature. This small difference lets us deal with the border cases easily and efficiently, where the number of inputs to a unit at layer ℓ+1 is not the full size of the n×n convolution kernel. In those cases, the "full" convolution will automatically pad the missing inputs with zeros.
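A sketch of this computation, where delta_next{j} and knext{j} are hypothetical names for δ_j^{ℓ+1} and k_j^{ℓ+1} at the convolution layer above, and fprime_u holds f′(u_j^ℓ):

    rot180 = @(m) rot90(m, 2);
    % The 'full' option zero-pads the border cases automatically.
    delta{j} = fprime_u .* conv2(delta_next{j}, rot180(knext{j}), 'full');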
At this point we're ready to compute the gradients for b and β. The additive bias is again just the sum over the elements of the sensitivity map:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v} (\delta_j^{\ell})_{uv}.$$
(Patrice Simard's "pulling" vs. "pushing" appears to be unnecessary if you use conv with zero padding to compute the sensitivities and gradients.)
The multiplicative bias β will of course involve the original down-sampled map computed at the current layer during the feedforward pass. For this reason, it is advantageous to save these maps during the feedforward computation, so we don't have to recompute them during backpropagation. Let's define

$$d_j^{\ell} = \mathrm{down}(x_j^{\ell-1}).$$

Then the gradient for β is given by

$$\frac{\partial E}{\partial \beta_j} = \sum_{u,v}\left(\delta_j^{\ell} \circ d_j^{\ell}\right)_{uv}.$$
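Both bias gradients reduce to one-liners in MATLAB; in this sketch, d{j} is the down-sampled map saved during the feed-forward pass (hypothetical names):

    db(j)    = sum(delta{j}(:));            % additive bias: sum of sensitivities
    dbeta(j) = sum(sum(delta{j} .* d{j}));  % multiplicative bias: weighted by d_j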
3.3 Learning Combinations of Feature Maps
Oftentimes, it is advantageous to provide an output map that involves a sum over several convolutions of different input maps. In the literature, the input maps that are combined to form a given output map are typically chosen by hand. We can, however, attempt to learn such combinations during training. Let α_{ij} denote the weight given to input map i when forming output map j. Then output map j is given by

$$x_j^{\ell} = f\left(\sum_{i=1}^{N_{\mathrm{in}}} \alpha_{ij}\left(x_i^{\ell-1} * k_i^{\ell}\right) + b_j^{\ell}\right),$$
subject to the constraints

$$\sum_i \alpha_{ij} = 1, \quad \text{and} \quad 0 \le \alpha_{ij} \le 1.$$

These constraints can be enforced by setting the α_{ij} variables equal to the softmax over a set of unconstrained, underlying weights c_{ij}:

$$\alpha_{ij} = \frac{\exp(c_{ij})}{\sum_k \exp(c_{kj})}.$$
Because each set of weights c_{ij} for fixed j is independent of all other such sets for any other j, we can consider the updates for a single map and drop the subscript j. Each map is updated in the same way, except with different j indices.
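A sketch of the combined forward pass for one output map (subscript j dropped), with xprev, k, c, and b as hypothetical names for the input maps, kernels, unconstrained weights, and bias:

    alpha = exp(c) / sum(exp(c));   % softmax: each alpha_i in (0,1), summing to 1
    u = b;
    for i = 1:numel(xprev)
        u = u + alpha(i) * conv2(xprev{i}, k{i}, 'valid');
    end
    xj = 1 ./ (1 + exp(-u));        % f(sum_i alpha_i (x_i * k_i) + b)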
The derivative of the softmax function is given by

$$\frac{\partial \alpha_k}{\partial c_i} = \delta_{ki}\alpha_i - \alpha_i\alpha_k \qquad (8)$$

(where here δ is used as the Kronecker delta), while the derivative of (1) with respect to the α_i variables at layer ℓ is
$$\frac{\partial E^n}{\partial \alpha_i} = \frac{\partial E^n}{\partial u^{\ell}} \frac{\partial u^{\ell}}{\partial \alpha_i} = \sum_{u,v}\left(\delta^{\ell} \circ \left(x_i^{\ell-1} * k_i^{\ell}\right)\right)_{uv} \qquad (9)$$
Here, δ^ℓ is the sensitivity map corresponding to an output map with inputs u. Again, the convolution is the "valid" type so that the result will match the size of the sensitivity map. We can now use the chain rule to compute the gradients of the error function (1) with respect to the underlying weights c_i:
$$\frac{\partial E^n}{\partial c_i} = \sum_k \frac{\partial E^n}{\partial \alpha_k} \frac{\partial \alpha_k}{\partial c_i} = \alpha_i\left(\frac{\partial E^n}{\partial \alpha_i} - \sum_k \alpha_k \frac{\partial E^n}{\partial \alpha_k}\right) \qquad (10)$$
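Equation (10) vectorizes directly. In this sketch, alpha and dEdalpha are column vectors, with dEdalpha(i) holding the result of (9) for input map i (hypothetical names):

    dEdc = alpha .* (dEdalpha - sum(alpha .* dEdalpha));   % equation (10)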
3.3.1 Enforcing Sparse Combinations
We can also try to impose sparseness constraints on the distribution of weights α_i for a given map by adding a regularization penalty Ω(α) to the final error function. In doing so, we'll encourage some of the weights to go to zero. In that case, only a few input maps would contribute significantly to a given output map, as opposed to all of them. Let's write the error for a single pattern as

$$\tilde{E}^n = E^n + \lambda \sum_{i,j} |\alpha_{ij}| \qquad (11)$$
and find the contribution of the regularization term to the gradient for the weights c_i. The user-defined parameter λ controls the tradeoff between minimizing the fit of the network to the training data, and ensuring that the weights mentioned in the regularization term are small according to the 1-norm. We will again consider only the weights α_i for a given output map and drop the subscript j. First, we need that

$$\frac{\partial \Omega}{\partial \alpha_i} = \lambda\,\mathrm{sign}(\alpha_i) \qquad (12)$$
everywhere except at the origin. Combining this result with (8) will allow us to derive the contribution:

$$\frac{\partial \Omega}{\partial c_i} = \sum_k \frac{\partial \Omega}{\partial \alpha_k} \frac{\partial \alpha_k}{\partial c_i} \qquad (13)$$
$$= \lambda\left(|\alpha_i| - \alpha_i \sum_k |\alpha_k|\right) \qquad (14)$$
The final gradients for the weights c_i when using the penalized error function (11) can be computed using (13) and (9):

$$\frac{\partial \tilde{E}^n}{\partial c_i} = \frac{\partial E^n}{\partial c_i} + \frac{\partial \Omega}{\partial c_i}.$$
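Continuing the previous sketch, the regularization contribution (14) and the final penalized gradient might be computed as follows, with lambda the user-defined tradeoff parameter:

    dOmegadc  = lambda * (abs(alpha) - alpha * sum(abs(alpha)));  % equation (14)
    dEtildedc = dEdc + dOmegadc;                                  % penalized gradient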
3.4 Making it Fast with MATLAB
In a network with alternating sub-sampling and convolution layers, the main computational bottlenecks are:

1. During the feedforward pass: downsampling the convolutional layer's output maps.
2. During backpropagation: upsampling of a higher sub-sampling layer's deltas to match the size of the lower convolutional layer's output maps.
3. Application of the sigmoid and its derivative.
The convolutions performed during both the feedforward and backpropagation stages are also computational bottlenecks, of course, but assuming the 2D convolution routine is efficiently implemented, there isn't much we can do about them.
One might be tempted, however, to use MATLAB's built-in image processing routines to handle the up- and down-sampling operations. For up-sampling, imresize will do the job, but with significant overhead. A faster alternative is to use the Kronecker product function kron, with the matrix to be upsampled and a matrix of ones. This can be an order of magnitude faster.
When it comes to the down-sampling step during the feedforward pass, imresize does not provide the option to downsample by summing over distinct n-by-n blocks. The "nearest-neighbor" method will replace a block of pixels by only one of the original pixels in the block. An alternative is to apply blkproc to each distinct block, or some combination of im2col and colfilt. While both of these options compute only what's necessary and nothing more, repeated calls to the user-defined block-processing function impose significant overhead. A much faster alternative in this case is to convolve the image with a matrix of ones, and then simply take every other entry using standard indexing (i.e., y = x(1:2:end, 1:2:end)). Although convolution in this case actually computes four times as many outputs (assuming 2x downsampling) as we really need, this method is still (empirically) an order of magnitude or so faster than the previously mentioned approaches.
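A sketch of this convolution-plus-indexing trick for a 2x block-sum downsample:

    s = conv2(x, ones(2), 'valid');   % sums over every (overlapping) 2x2 window
    y = s(1:2:end, 1:2:end);          % keep only the distinct-block sums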
Most authors, it seems, implement the sigmoid activation function and its derivative using inline function definitions. At the time of this writing, "inline" MATLAB function definitions are not at all like C macros, and take a huge amount of time to evaluate. Thus, it is often worth it to simply replace all references to f and f′ with the actual code. There is of course a tradeoff between optimizing your code and maintaining readability.
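For example, rather than defining f through inline('1./(1+exp(-u))'), one can write the expression in place; for the logistic sigmoid the derivative can also reuse the forward result:

    x = 1 ./ (1 + exp(-u));   % f(u), written in place instead of via inline()
    dfdu = x .* (1 - x);      % f'(u) = f(u)(1 - f(u)) for the logistic sigmoid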
4 Practical Training Issues (Incomplete)
4.1 Batch vs. Online Updates

Stochastic descent vs. batch learning.
4.2 Learning Rates

LeCun's stochastic online method (diagonal approx to the Hessian). Is it worth it? Viren's idea: at least have a different rate for each layer, because gradients at the lower layers are smaller and less reliable. LeCun makes similar statements in [5].
4.3 Choice of Error Function

Squared error (MLE) vs. cross-entropy. The latter can be more effective for some classification tasks [2].
4.4 Checking Your Work with Finite-Differences
Finite-differences can be an indispensable tool when it comes time to verify that you've got your backpropagation implementation (or derivation) correct. It is remarkably easy to make many mistakes and still have a network that appears to be learning something. Checking the gradients your code produces against finite-difference estimates is one way to make sure you don't have any errors. For a single input pattern, estimate the gradients using second-order finite differences,

$$\frac{\partial E}{\partial w_i} \approx \frac{E(w_i + \epsilon) - E(w_i - \epsilon)}{2\epsilon},$$

and check against the gradients your backpropagation code is returning. Epsilon should be small, but not so small as to cause numerical precision problems. Something like ε = 10⁻⁸ could be OK. Note that using finite differences to train the network is wildly inefficient (i.e., O(W²) for W weights in the network); the O(W) speed advantage of backpropagation is well worth the hassle.
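As a sketch, assuming a hypothetical helper loss(w) that runs the forward pass on the pattern and returns E for a flattened weight vector w, with backpropgrad holding the gradients from your code:

    epsilon = 1e-8;
    numgrad = zeros(size(w));
    for i = 1:numel(w)
        wp = w; wp(i) = wp(i) + epsilon;    % perturb weight i up
        wm = w; wm(i) = wm(i) - epsilon;    % and down
        numgrad(i) = (loss(wp) - loss(wm)) / (2 * epsilon);
    end
    err = max(abs(numgrad(:) - backpropgrad(:)))   % should be near zero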
References

[1] C. M. Bishop. "Neural Networks for Pattern Recognition", Oxford University Press, New York, 1995.

[2] F. J. Huang and Y. LeCun. "Large-scale Learning with SVM and Convolutional for Generic Object Categorization", in: Proc. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 284-291, 2006.

[3] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. "Backpropagation applied to handwritten zip code recognition", Neural Computation, 1(4), 1989.

[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition", Proceedings of the IEEE, vol. 86, pp. 2278-2324, November 1998.

[5] Y. LeCun, L. Bottou, G. Orr, and K. Muller. "Efficient BackProp", in: Neural Networks: Tricks of the Trade, G. Orr and K. Muller (eds.), Springer, 1998.

[6] J. P. Rauschecker and B. Tian. "Mechanisms and streams for processing of 'what' and 'where' in auditory cortex", Proc. Natl. Acad. Sci. USA, 97(22), pp. 11800-11806, 2000.

[7] T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, and T. Poggio. "A Theory of Object Recognition: Computations and Circuits in the Feedforward Path of the Ventral Stream in Primate Visual Cortex", CBCL Paper #259/AI Memo #2005-036, Massachusetts Institute of Technology, October 2005.

[8] T. Serre, A. Oliva, and T. Poggio. "A Feedforward Architecture Accounts for Rapid Categorization", Proc. Natl. Acad. Sci. USA, 104(15), pp. 6424-6429, 2007.

[9] S. Shamma. "On the role of space and time in auditory processing", TRENDS in Cognitive Sciences, Vol. 5, No. 8, 2001.

[10] P. Y. Simard, D. Steinkraus, and J. C. Platt. "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis", Proceedings of the International Conference on Document Analysis and Recognition, pp. 958-962, 2003.

[11] F. E. Theunissen, K. Sen, and A. Doupe. "Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds", J. Neurosci., Vol. 20, pp. 2315-2331, 2000.

[12] D. Zoccolan, M. Kouh, J. DiCarlo, and T. Poggio. "Trade-off between selectivity and tolerance in monkey anterior inferior temporal cortex", J. Neurosci., 2007.