Packt Machine Learning with Spark
Machine Learning with Spark
Copyright © 2015 Packt Publishing
All rights reserved. No part of this book may be reproduced, stored in a retrieval
system, or transmitted in any form or by any means, without the prior written
permission of the publisher, except in the case of brief quotations embedded in
critical articles or reviews.
Every effort has been made in the preparation of this book to ensure the accuracy
of the information presented. However, the information contained in this book is
sold without warranty, either express or implied. Neither the author, nor Packt
Publishing, and its dealers and distributors will be held liable for any damages
caused or alleged to be caused directly or indirectly by this book.
Packt Publishing has endeavored to provide trademark information about all of the
companies and products mentioned in this book by the appropriate use of capitals.
However, Packt Publishing cannot guarantee the accuracy of this information.
First published: February 2015
Production reference: 1170215
Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK
ISBN 978-1-78328-851-9
www.packtpub.com
Cover image by Akshay Paunikar (akshaypaunikar4@gmail.com)
Credits
Author: Nick Pentreath
Reviewers: Andrea Mostosi, Hao Ren, Krishna Sankar
Commissioning Editor: Rebecca Youe
Acquisition Editor: Rebecca Youe
Content Development Editor: Susmita Sabat
Technical Editors: Vivek Arora, Pankaj Kadam
Copy Editor: Karuna Narayanan
Project Coordinator: Milton Dsouza
Proofreaders: Simran Bhogal, Maria Gould, Ameesha Green, Paul Hindle
Indexer: Priya Sane
Graphics: Sheetal Aute, Abhinash Sahu
Production Coordinator: Nitesh Thakur
Cover Work: Nitesh Thakur
About the Author
Nick Pentreath has a background in financial markets, machine learning, and
software development. He has worked at Goldman Sachs Group, Inc.; as a research
scientist at the online ad targeting start-up Cognitive Match Limited, London; and
led the Data Science and Analytics team at Mxit, Africa's largest social network.
He is a cofounder of Graphflow, a big data and machine learning company focused
on user-centric recommendations and customer intelligence. He is passionate about
combining commercial focus with machine learning and cutting-edge technology to
build intelligent systems that learn from data to add value to the bottom line.
Nick is a member of the Apache Spark Project Management Committee.
Acknowledgments
Writing this book has been quite a rollercoaster ride over the past year, with many
ups and downs, late nights, and working weekends. It has also been extremely
rewarding to combine my passion for machine learning with my love of the Apache
Spark project, and I hope to bring some of this out in this book.
I would like to thank the Packt Publishing team for all their assistance throughout
the writing and editing process: Rebecca, Susmita, Sudhir, Amey, Neil, Vivek,
Pankaj, and everyone who worked on the book.
Thanks also go to Debora Donato at StumbleUpon for assistance with data- and
legal-related queries.
Writing a book like this can be a somewhat lonely process, so it is incredibly helpful
to get the feedback of reviewers to understand whether one is headed in the right
direction (and what course adjustments need to be made). I'm deeply grateful to
Andrea Mostosi, Hao Ren, and Krishna Sankar for taking the time to provide such
detailed and critical feedback.
I could not have gotten through this project without the unwavering support of all
my family and friends, especially my wonderful wife, Tammy, who will be glad to
have me back in the evenings and on weekends once again. Thank you all!
Finally, thanks to all of you reading this; I hope you find it useful!
About the Reviewers
Andrea Mostosi is a technology enthusiast. An innovation lover since he was a
child, he started a professional job in 2003 and worked on several projects, playing
almost every role in the computer science environment. He is currently the CTO at
The Fool, a company that tries to make sense of web and social data. During his free
time, he likes traveling, running, cooking, biking, and coding.
I would like to thank my geek friends: Simone M, Daniele V, Luca T,
Luigi P, Michele N, Luca O, Luca B, Diego C, and Fabio B. They are
the smartest people I know, and comparing myself with them has
always pushed me to be better.
Hao Ren is a software developer who is passionate about Scala, distributed
systems, machine learning, and Apache Spark. He was an exchange student at EPFL
when he learned about Scala in 2012. He is currently working in Paris as a backend
and data engineer for ClaraVista, a company that focuses on high-performance
marketing. His work responsibility is to build a Spark-based platform for purchase
prediction and a new recommender system.
Besides programming, he enjoys running, swimming, and playing basketball and
badminton. You can learn more at his blog http://www.invkrh.me.
Krishna Sankar is a chief data scientist at BlackArrow, where he is focusing
on enhancing user experience via inference, intelligence, and interfaces. Earlier
stints include working as a principal architect and data scientist at Tata America
International Corporation, director of data science at a bioinformatics start-up
company, and as a distinguished engineer at Cisco Systems, Inc. He has spoken at
various conferences about data science (http://goo.gl/9pyjmh), machine learning
(http://goo.gl/ssem2r), and social media analysis (http://goo.gl/d9ypvq). He
has also been a guest lecturer at the Naval Postgraduate School. He has written a few
books on Java, wireless LAN security, Web 2.0, and now on Spark. His other passion
is LEGO robotics. Earlier in April, he was at the St. Louis FLL World Competition as
a robots design judge.
www.PacktPub.com
Support files, eBooks, discount offers, and more
For support files and downloads related to your book, please visit
www.PacktPub.com.
Did you know that Packt offers eBook versions of every book published, with PDF
and ePub files available? You can upgrade to the eBook version at www.PacktPub.com
and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at service@packtpub.com for more details.
At www.PacktPub.com, you can also read a collection of free technical articles,
sign up for a range of free newsletters, and receive exclusive discounts and offers
on Packt books and eBooks.
PACKTLIB
https://www2.packtpub.com/books/subscription/packtlib
Do you need instant solutions to your IT questions? PacktLib is Packt's online digital
book library. Here, you can search, access, and read Packt's entire library of books.
Why subscribe?
Fully searchable across every book published by Packt
Copy and paste, print, and bookmark content
On demand and accessible via a web browser
Free access for Packt account holders
If you have an account with Packt at www.PacktPub.com, you can use this to access
PacktLib today and view 9 entirely free books. Simply use your login credentials for
immediate access.
Table of Contents
Preface
Chapter 1: Getting Up and Running with Spark
Installing and setting up Spark locally
Spark clusters
The Spark programming model
SparkContext and SparkConf
The Spark shell
Resilient Distributed Datasets
Creating RDDs
Spark operations
Caching RDDs
Broadcast variables and accumulators
The first step to a Spark program in Scala
The first step to a Spark program in Java
The first step to a Spark program in Python
Getting Spark running on Amazon EC2
Launching an EC2 Spark cluster
Summary
Chapter 2: Designing a Machine Learning System
Introducing MovieStream
Business use cases for a machine learning system
Personalization
Targeted marketing and customer segmentation
Predictive modeling and analytics
Types of machine learning models
The components of a data-driven machine learning system
Data ingestion and storage
Data cleansing and transformation