3 Convolutional Neural Networks
Typically, convolutional layers are interspersed with sub-sampling layers to reduce computation time and to gradually build up further spatial and configural invariance. A small sub-sampling factor is desirable, however, in order to maintain specificity at the same time. Of course, this idea is not new, but the concept is both simple and powerful. The mammalian visual cortex and models thereof [12, 8, 7] draw heavily on these themes, and auditory neuroscience has revealed in the past ten years or so that these same design paradigms can be found in the primary and belt auditory areas of the cortex in a number of different animals [6, 11, 9]. Hierarchical analysis and learning architectures may yet be the key to success in the auditory domain.
3.1 Convolution Layers
Let's move forward with deriving the backpropagation updates for convolutional layers in a network. At a convolution layer, the previous layer's feature maps are convolved with learnable kernels and put through the activation function to form the output feature map. Each output map may combine convolutions with multiple input maps. In general, we have that

$$x_j^{\ell} = f\Big(\sum_{i \in M_j} x_i^{\ell-1} * k_{ij}^{\ell} + b_j^{\ell}\Big),$$
where M_j represents a selection of input maps, and the convolution is of the "valid" border handling type when implemented in MATLAB. Some common choices of input maps include all-pairs or all-triplets, but we will discuss how one might learn combinations below. Each output map is given an additive bias b; however, for a particular output map the input maps will be convolved with distinct kernels. That is to say, if output map j and map k both sum over input map i, then the kernels applied to map i are different for output maps j and k.
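As a concrete illustration, here is a minimal MATLAB sketch of this feed-forward pass, assuming hypothetical names: cell arrays xprev (the previous layer's maps) and k (the kernels), index sets M{j}, a bias vector b, and the logistic sigmoid for f:

    % Feed-forward pass for one convolution layer (hypothetical names).
    for j = 1:numOutputMaps
        u = b(j);                                      % additive bias b_j
        for i = M{j}                                   % selected input maps M_j
            u = u + conv2(xprev{i}, k{i,j}, 'valid');  % "valid" convolution
        end
        x{j} = 1 ./ (1 + exp(-u));                     % f(u), logistic sigmoid
    end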
3.1.1 Computing the gradients
We assume that each convolution layer ℓ is followed by a downsampling layer ℓ+1. The backpropagation algorithm says that in order to compute the sensitivity for a unit at layer ℓ, we should first sum over the next layer's sensitivities corresponding to units that are connected to the node of interest in the current layer ℓ, and multiply each of those connections by the associated weights defined at layer ℓ+1. We then multiply this quantity by the derivative of the activation function evaluated at the current layer's pre-activation inputs, u. In the case of a convolutional layer followed by a downsampling layer, one pixel in the next layer's associated sensitivity map δ corresponds to a block of pixels in the convolutional layer's output map. Thus each unit in a map at layer ℓ connects to only one unit in the corresponding map at layer ℓ+1. To compute the sensitivities at layer ℓ efficiently, we can upsample the downsampling layer's sensitivity map to make it the same size as the convolutional layer's map, and then just multiply the upsampled sensitivity map from layer ℓ+1 with the activation derivative map at layer ℓ, element-wise. The weights defined at a downsampling layer map are all equal to β (a constant, see section 3.2), so we just scale the previous step's result by β to finish the computation of δ^ℓ. We can repeat the same computation for each map j in the convolutional layer, pairing it with the corresponding map in the subsampling layer:
$$\delta_j^{\ell} = \beta_j^{\ell+1}\left(f'(u_j^{\ell}) \circ \mathrm{up}(\delta_j^{\ell+1})\right)$$
where up(·) denotes an upsampling operation that simply tiles each pixel in the input horizontally and vertically n times in the output if the subsampling layer subsamples by a factor of n. As we will discuss below, one possible way to implement this function efficiently is to use the Kronecker product.
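As a sketch of the sensitivity computation above (using the kron trick elaborated in section 3.4), with delta_sub, fprime_u, and beta as hypothetical names for δ_j^{ℓ+1}, f′(u_j^ℓ), and β_j^{ℓ+1}:

    up = @(x) kron(x, ones(n));    % up(.): tile each pixel into an n-by-n block
    delta{j} = beta * (fprime_u .* up(delta_sub));  % sensitivity at the conv layer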
Now that we have the sensitivities for a given map, we can immediately compute the bias gradient by simply summing over all the entries in δ_j^ℓ:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v} (\delta_j^{\ell})_{uv}.$$
Finally, the gradients for the kernel weights are computed using backpropagation, except in this case the same weights are shared across many connections. We'll therefore sum the gradients for a given weight over all the connections that mention this weight, just as we did for the bias term:

$$\frac{\partial E}{\partial k_{ij}^{\ell}} = \sum_{u,v} (\delta_j^{\ell})_{uv}\, (p_i^{\ell-1})_{uv} \qquad (7)$$
where (p_i^{ℓ-1})_{uv} is the patch in x_i^{ℓ-1} that was multiplied elementwise by k_{ij}^ℓ during convolution in order to compute the element at (u, v) in the output convolution map x_j^ℓ. At first glance it may appear that we need to painstakingly keep track of which patches in the input map correspond to which pixels in the output map (and its corresponding map of sensitivities), but equation (7) can be implemented in a single line of MATLAB using convolution over the "valid" region of overlap:

$$\frac{\partial E}{\partial k_{ij}^{\ell}} = \mathrm{rot180}\left(\mathrm{conv2}\left(x_i^{\ell-1},\ \mathrm{rot180}(\delta_j^{\ell}),\ \mathrm{'valid'}\right)\right)$$
Here we rotate the δ image in order to perform cross-correlation rather than convolution, and rotate the output back so that when we perform convolution in the feed-forward pass, the kernel will have the expected orientation.
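In code, the one-liner above (together with the bias gradient) might look like the following sketch, where delta{j} and xprev{i} are hypothetical names for the sensitivity and input maps:

    rot180 = @(m) rot90(m, 2);     % rotate a matrix by 180 degrees
    dk{i,j} = rot180(conv2(xprev{i}, rot180(delta{j}), 'valid'));  % dE/dk_ij
    db(j)   = sum(delta{j}(:));    % dE/db_j: sum over the sensitivity map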
3.2 Sub-sampling Layers
A subsampling layer produces downsampled versions of the input maps. If there are N input maps, then there will be exactly N output maps, although the output maps will be smaller. More formally,

$$x_j^{\ell} = f\left(\beta_j^{\ell}\, \mathrm{down}(x_j^{\ell-1}) + b_j^{\ell}\right),$$

where down(·) represents a sub-sampling function. Typically this function will sum over each distinct n-by-n block in the input image so that the output image is n times smaller along both spatial dimensions. Each output map is given its own multiplicative bias β and an additive bias b. We can also simply throw away every other sample in the image [10].
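A minimal sketch of this forward pass, assuming the map's side lengths are divisible by n and the logistic sigmoid for f:

    % Sub-sampling forward pass: sum each distinct n-by-n block, then scale and bias.
    function y = subsample_forward(x, n, beta, b)
        [H, W] = size(x);
        blocks = reshape(x, n, H/n, n, W/n);     % expose the n-by-n blocks
        d = squeeze(sum(sum(blocks, 1), 3));     % down(x): per-block sums
        y = 1 ./ (1 + exp(-(beta * d + b)));     % f(beta * down(x) + b)
    end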
3.2.1 Computing the gradients
The difficulty here lies in computing the sensitivity maps. Once we've got them, the only learnable parameters we need to update are the bias parameters β and b. We will assume that the subsampling layers are surrounded above and below by convolution layers. If the layer following the subsampling layer is a fully connected layer, then the sensitivity maps for the subsampling layer can be computed with the vanilla backpropagation equations introduced in section 2.
When we tried to compute the gradient of a kernel in section 3.1.1, we had to figure out which patch in the input corresponded to a given pixel in the output map. Here, we must figure out which patch in the current layer's sensitivity map corresponds to a given pixel in the next layer's sensitivity map in order to apply a delta recursion that looks something like equation (4). Of course, the weights multiplying the connections between the input patch and the output pixel are exactly the weights of the (rotated) convolution kernel. This is again efficiently implemented using convolution:
$$\delta_j^{\ell} = f'(u_j^{\ell}) \circ \mathrm{conv2}\left(\delta_j^{\ell+1},\ \mathrm{rot180}(k_j^{\ell+1}),\ \mathrm{'full'}\right)$$
As before, we rotate the kernel to make the convolution function perform cross-correlation. Notice that in this case, however, we require the "full" convolution border handling, to borrow again from MATLAB's nomenclature. This small difference lets us deal with the border cases easily and efficiently, where the number of inputs to a unit at layer ℓ+1 is not the full size of the n×n convolution kernel. In those cases, the "full" convolution will automatically pad the missing inputs with zeros.
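A sketch of this computation, where delta_next{j} and knext{j} are hypothetical names for δ_j^{ℓ+1} and k_j^{ℓ+1} at the convolution layer above, and fprime_u holds f′(u_j^ℓ):

    rot180 = @(m) rot90(m, 2);
    % The 'full' option zero-pads the border cases automatically.
    delta{j} = fprime_u .* conv2(delta_next{j}, rot180(knext{j}), 'full');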
At this point we're ready to compute the gradients for b and β. The additive bias is again just the sum over the elements of the sensitivity map:

$$\frac{\partial E}{\partial b_j} = \sum_{u,v} (\delta_j^{\ell})_{uv}.$$
(Patrice Simard's "pulling" vs. "pushing" appears to be unnecessary if you use conv with zero padding to compute the sensitivities and gradients.)
The multiplicative bias β will of course involve the original down-sampled map computed at the current layer during the feedforward pass. For this reason, it is advantageous to save these maps during the feedforward computation, so we don't have to recompute them during backpropagation. Let's define

$$d_j^{\ell} = \mathrm{down}(x_j^{\ell-1}).$$

Then the gradient for β is given by

$$\frac{\partial E}{\partial \beta_j} = \sum_{u,v}\left(\delta_j^{\ell} \circ d_j^{\ell}\right)_{uv}.$$
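Both bias gradients reduce to one-liners in MATLAB; in this sketch, d{j} is the down-sampled map saved during the feed-forward pass (hypothetical names):

    db(j)    = sum(delta{j}(:));            % additive bias: sum of sensitivities
    dbeta(j) = sum(sum(delta{j} .* d{j}));  % multiplicative bias: weighted by d_j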
3.3 Learning Combinations of Feature Maps
Oftentimes, it is advantageous to provide an output map that involves a sum over several convolutions of different input maps. In the literature, the input maps that are combined to form a given output map are typically chosen by hand. We can, however, attempt to learn such combinations during training. Let α_{ij} denote the weight given to input map i when forming output map j. Then output map j is given by

$$x_j^{\ell} = f\left(\sum_{i=1}^{N_{\mathrm{in}}} \alpha_{ij}\left(x_i^{\ell-1} * k_i^{\ell}\right) + b_j^{\ell}\right),$$
subject to the constraints

$$\sum_i \alpha_{ij} = 1, \quad \text{and} \quad 0 \le \alpha_{ij} \le 1.$$

These constraints can be enforced by setting the α_{ij} variables equal to the softmax over a set of unconstrained, underlying weights c_{ij}:

$$\alpha_{ij} = \frac{\exp(c_{ij})}{\sum_k \exp(c_{kj})}.$$
Because each set of weights c_{ij} for fixed j is independent of all other such sets for any other j, we can consider the updates for a single map and drop the subscript j. Each map is updated in the same way, except with different j indices.
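A sketch of the combined forward pass for one output map (subscript j dropped), with xprev, k, c, and b as hypothetical names for the input maps, kernels, unconstrained weights, and bias:

    alpha = exp(c) / sum(exp(c));   % softmax: each alpha_i in (0,1), summing to 1
    u = b;
    for i = 1:numel(xprev)
        u = u + alpha(i) * conv2(xprev{i}, k{i}, 'valid');
    end
    xj = 1 ./ (1 + exp(-u));        % f(sum_i alpha_i (x_i * k_i) + b)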
The derivative of the softmax function is given by

$$\frac{\partial \alpha_k}{\partial c_i} = \delta_{ki}\alpha_i - \alpha_i\alpha_k \qquad (8)$$

(where here δ is used as the Kronecker delta), while the derivative of (1) with respect to the α_i variables at layer ℓ is
$$\frac{\partial E^n}{\partial \alpha_i} = \frac{\partial E^n}{\partial u^{\ell}} \frac{\partial u^{\ell}}{\partial \alpha_i} = \sum_{u,v}\left(\delta^{\ell} \circ \left(x_i^{\ell-1} * k_i^{\ell}\right)\right)_{uv} \qquad (9)$$
Here, δ^ℓ is the sensitivity map corresponding to an output map with inputs u. Again, the convolution is the "valid" type so that the result will match the size of the sensitivity map. We can now use the chain rule to compute the gradients of the error function (1) with respect to the underlying weights c_i:
$$\frac{\partial E^n}{\partial c_i} = \sum_k \frac{\partial E^n}{\partial \alpha_k} \frac{\partial \alpha_k}{\partial c_i} = \alpha_i\left(\frac{\partial E^n}{\partial \alpha_i} - \sum_k \alpha_k \frac{\partial E^n}{\partial \alpha_k}\right) \qquad (10)$$
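Equation (10) vectorizes directly. In this sketch, alpha and dEdalpha are column vectors, with dEdalpha(i) holding the result of (9) for input map i (hypothetical names):

    dEdc = alpha .* (dEdalpha - sum(alpha .* dEdalpha));   % equation (10)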
3.3.1 Enforcing Sparse Combinations
We can also try to impose sparseness constraints on the distribution of weights α_i for a given map by adding a regularization penalty Ω(α) to the final error function. In doing so, we'll encourage some of the weights to go to zero. In that case, only a few input maps would contribute significantly to a given output map, as opposed to all of them. Let's write the error for a single pattern as

$$\tilde{E}^n = E^n + \lambda \sum_{i,j} |\alpha_{ij}| \qquad (11)$$
and find the contribution of the regularization term to the gradient for the weights c_i. The user-defined parameter λ controls the tradeoff between minimizing the fit of the network to the training data, and ensuring that the weights mentioned in the regularization term are small according to the 1-norm. We will again consider only the weights α_i for a given output map and drop the subscript j. First, we need that

$$\frac{\partial \Omega}{\partial \alpha_i} = \lambda\,\mathrm{sign}(\alpha_i) \qquad (12)$$
everywhere except at the origin. Combining this result with (8) will allow us to derive the contribution:

$$\frac{\partial \Omega}{\partial c_i} = \sum_k \frac{\partial \Omega}{\partial \alpha_k} \frac{\partial \alpha_k}{\partial c_i} \qquad (13)$$
$$= \lambda\left(|\alpha_i| - \alpha_i \sum_k |\alpha_k|\right) \qquad (14)$$
The final gradients for the weights c_i when using the penalized error function (11) can be computed using (13) and (9):

$$\frac{\partial \tilde{E}^n}{\partial c_i} = \frac{\partial E^n}{\partial c_i} + \frac{\partial \Omega}{\partial c_i}.$$
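Continuing the previous sketch, the regularization contribution (14) and the final penalized gradient might be computed as follows, with lambda the user-defined tradeoff parameter:

    dOmegadc  = lambda * (abs(alpha) - alpha * sum(abs(alpha)));  % equation (14)
    dEtildedc = dEdc + dOmegadc;                                  % penalized gradient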
3.4 Making it Fast with MATLAB
In a network with alternating sub-sampling and convolution layers, the main computational bottlenecks are:

1. During the feedforward pass: downsampling the convolutional layer's output maps.
2. During backpropagation: upsampling of a higher sub-sampling layer's deltas to match the size of the lower convolutional layer's output maps.
3. Application of the sigmoid and its derivative.
The convolutions performed during both the feedforward and backpropagation stages are also computational bottlenecks, of course, but assuming the 2D convolution routine is efficiently implemented, there isn't much we can do about them.
One might be tempted, however, to use MATLAB's built-in image processing routines to handle the up- and down-sampling operations. For up-sampling, imresize will do the job, but with significant overhead. A faster alternative is to use the Kronecker product function kron, with the matrix to be upsampled and a matrix of ones. This can be an order of magnitude faster.
When it comes to the down-sampling step during the feedforward pass, imresize does not provide the option to downsample by summing over distinct n-by-n blocks. The "nearest-neighbor" method will replace a block of pixels by only one of the original pixels in the block. An alternative is to apply blkproc to each distinct block, or some combination of im2col and colfilt. While both of these options compute only what's necessary and nothing more, repeated calls to the user-defined block-processing function impose significant overhead. A much faster alternative in this case is to convolve the image with a matrix of ones, and then simply take every other entry using standard indexing (i.e., y = x(1:2:end, 1:2:end)). Although convolution in this case actually computes four times as many outputs (assuming 2x downsampling) as we really need, this method is still (empirically) an order of magnitude or so faster than the previously mentioned approaches.
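A sketch of this convolution-plus-indexing trick for a 2x block-sum downsample:

    s = conv2(x, ones(2), 'valid');   % sums over every (overlapping) 2x2 window
    y = s(1:2:end, 1:2:end);          % keep only the distinct-block sums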
Most authors, it seems, implement the sigmoid activation function and its derivative using inline function definitions. At the time of this writing, "inline" MATLAB function definitions are not at all like C macros, and take a huge amount of time to evaluate. Thus, it is often worth it to simply replace all references to f and f′ with the actual code. There is of course a tradeoff between optimizing your code and maintaining readability.
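For example, rather than defining f through inline('1./(1+exp(-u))'), one can write the expression in place; for the logistic sigmoid the derivative can also reuse the forward result:

    x = 1 ./ (1 + exp(-u));   % f(u), written in place instead of via inline()
    dfdu = x .* (1 - x);      % f'(u) = f(u)(1 - f(u)) for the logistic sigmoid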
4 Practical Training Issues (Incomplete)
4.1 Batch vs. Online Updates

Stochastic descent vs. batch learning.
4.2 Learning Rates

LeCun's stochastic online method (diagonal approx to the Hessian). Is it worth it? Viren's idea: at least have a different rate for each layer, because gradients at the lower layers are smaller and less reliable. LeCun makes similar statements in [5].
4.3 Choice of Error Function

Squared error (MLE) vs. cross-entropy. The latter can be more effective for some classification tasks [2].
4.4 Checking Your Work with Finite-Differences
Finite-differences can be an indispensable tool when it comes time to verify that you've got your backpropagation implementation (or derivation) correct. It is remarkably easy to make many mistakes and still have a network that appears to be learning something. Checking the gradients your code produces against finite-difference estimates is one way to make sure you don't have any errors. For a single input pattern, estimate the gradients using second-order finite differences,

$$\frac{\partial E}{\partial w_i} \approx \frac{E(w_i + \epsilon) - E(w_i - \epsilon)}{2\epsilon},$$

and check against the gradients your backpropagation code is returning. Epsilon should be small, but not so small as to cause numerical precision problems. Something like ε = 10⁻⁸ could be OK. Note that using finite differences to train the network is wildly inefficient (i.e., O(W²) for W weights in the network); the O(W) speed advantage of backpropagation is well worth the hassle.
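As a sketch, assuming a hypothetical helper loss(w) that runs the forward pass on the pattern and returns E for a flattened weight vector w, with backpropgrad holding the gradients from your code:

    epsilon = 1e-8;
    numgrad = zeros(size(w));
    for i = 1:numel(w)
        wp = w; wp(i) = wp(i) + epsilon;    % perturb weight i up
        wm = w; wm(i) = wm(i) - epsilon;    % and down
        numgrad(i) = (loss(wp) - loss(wm)) / (2 * epsilon);
    end
    err = max(abs(numgrad(:) - backpropgrad(:)))   % should be near zero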
References

[1] C. M. Bishop. "Neural Networks for Pattern Recognition", Oxford University Press, New York, 1995.

[2] F. J. Huang and Y. LeCun. "Large-scale Learning with SVM and Convolutional for Generic Object Categorization", in: Proc. 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 1, pp. 284-291, 2006.

[3] Y. LeCun, B. Boser, J. Denker, D. Henderson, R. Howard, W. Hubbard, and L. Jackel. "Backpropagation applied to handwritten zip code recognition", Neural Computation, 1(4), 1989.

[4] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition", Proceedings of the IEEE, vol. 86, pp. 2278-2324, November 1998.

[5] Y. LeCun, L. Bottou, G. Orr, and K. Muller. "Efficient BackProp", in: Neural Networks: Tricks of the Trade, G. Orr and K. Muller (eds.), Springer, 1998.

[6] J. P. Rauschecker and B. Tian. "Mechanisms and streams for processing of 'what' and 'where' in auditory cortex", Proc. Natl. Acad. Sci. USA, 97(22), pp. 11800-11806, 2000.

[7] T. Serre, M. Kouh, C. Cadieu, U. Knoblich, G. Kreiman, and T. Poggio. "A Theory of Object Recognition: Computations and Circuits in the Feedforward Path of the Ventral Stream in Primate Visual Cortex", CBCL Paper #259/AI Memo #2005-036, Massachusetts Institute of Technology, October 2005.

[8] T. Serre, A. Oliva, and T. Poggio. "A Feedforward Architecture Accounts for Rapid Categorization", Proc. Natl. Acad. Sci. USA, 104(15), pp. 6424-6429, 2007.

[9] S. Shamma. "On the role of space and time in auditory processing", TRENDS in Cognitive Sciences, Vol. 5, No. 8, 2001.

[10] P. Y. Simard, D. Steinkraus, and J. C. Platt. "Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis", Proceedings of the International Conference on Document Analysis and Recognition, pp. 958-962, 2003.

[11] F. E. Theunissen, K. Sen, and A. Doupe. "Spectral-temporal receptive fields of nonlinear auditory neurons obtained using natural sounds", J. Neurosci., Vol. 20, pp. 2315-2331, 2000.

[12] D. Zoccolan, M. Kouh, J. DiCarlo, and T. Poggio. "Trade-off between selectivity and tolerance in monkey anterior inferior temporal cortex", J. Neurosci., 2007.