FastDataProcessingwithSpark2ThirdEdition.pdf
Fast_Data_Processing_with_Spark_2_-_Third_Edition.pdf, Fast Data Processing with Spark, Third Edition

Fast Data Processing with Spark 2
Third Edition

Copyright © 2016 Packt Publishing

All rights reserved. No part of this book may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, without the prior written permission of the publisher, except in the case of brief quotations embedded in critical articles or reviews.

Every effort has been made in the preparation of this book to ensure the accuracy of the information presented. However, the information contained in this book is sold without warranty, either express or implied. Neither the authors, nor Packt Publishing, and its dealers and distributors will be held liable for any damages caused or alleged to be caused directly or indirectly by this book.

Packt Publishing has endeavored to provide trademark information about all of the companies and products mentioned in this book by the appropriate use of capitals. However, Packt Publishing cannot guarantee the accuracy of this information.

First published: October 2013
Second edition: March 2015
Third edition: October 2016
Production reference: 1141016

Published by Packt Publishing Ltd.
Livery Place
35 Livery Street
Birmingham B3 2PB, UK

ISBN 978-1-78588-927-1
www.packtpub.com

Credits

Author: Krishna Sankar
Reviewers: Sumit Pal, Alexis Roos
Commissioning Editor: Akram Hussain
Acquisition Editor: Tushar Gupta
Content Development Editor: Nikhil Borkar
Technical Editor: Madhunikita Sunil Chindarkar
Copy Editor: Safis Editing
Project Coordinator: Suzzane Coutinh
Proofreader: Safis Editing
Indexer: Tejal Daruwale Soni
Graphics: Kirk Penha
Production Coordinator: Melwyn D'sa

About the author

Krishna Sankar is a Senior Specialist - AI Data Scientist with Volvo Cars, focusing on autonomous vehicles. His earlier stints include Chief Data Scientist at http://cadenttech.tv/, Principal Architect/Data Scientist at Tata America Intl. Corp., Director of Data Science at a bioinformatics startup, and Distinguished Engineer at Cisco.
He has spoken at various conferences, including ML tutorials at Strata SJC and London 2016, Spark Summit [goo.gl/ab301D], Strata-Spark Camp, OSCON, PyCon, and PyData. He writes about Robots Rules of Order [goo.gl/5yyRv6], Big Data Analytics - Best of the Worst [goo.gl/ImWCaz], predicting NFL, Spark [http://goo.gl/e4kqmd], Data Science [http://goo.gl/9pyjmhj], Machine Learning [http://goo.gl/sxf53n], and Social Media Analysis [http://goo.gl/d9ypvQ], and has been a guest lecturer at the Naval Postgraduate School. His occasional blogs can be found at https://doubleclix.wordpress.com/. His other passions are flying drones (he is working towards a Drone Pilot License (FAA UAS Pilot)) and Lego robotics: you will find him at the St. Louis FLL World Competition as a Robots Design Judge.

My first thanks goes to you, the reader, who is taking time to understand the technologies that Apache Spark brings to computation, and to the developers of the Spark platform. The book reviewers Sumit and Alexis did a wonderful and thorough job morphing my rough materials into correct, readable prose. This book is the result of dedicated work by many at Packt, notably Nikhil Borkar, the Content Development Editor, who deserves all the credit. Madhunikita, as always, has been the guiding force behind the hard work to bring the materials together, in more than one way. On a personal note, my bosses at Volvo, viz. Petter Horling, Vedad Cajic, Andreas Wallin, and Mats Gustafsson, are a constant source of guidance and insights. And of course, my spouse Usha and son Kaushik always have an encouraging word; special thanks to Usha's father, Mr. Natarajan, whose wisdom we all rely upon, and my late mom for her kindness.

About the reviewers

Sumit Pal has more than 22 years of experience in the software industry in various roles, spanning companies from startups to enterprises. He is a big data, visualization, and data science consultant, a software architect, and a big data enthusiast who builds end-to-end data-driven analytic systems.
He has worked for Microsoft (SQL Server development team), Oracle (OLAP development team), and Verizon (big data analytics team) in a career spanning 22 years. Currently, he works for multiple clients, advising them on their data architectures and big data solutions, and does hands-on coding with Spark, Scala, Java, and Python. He has extensive experience in building scalable systems across the stack, from middle tier and data tier to visualization for analytics applications, using big data and NoSQL databases.

Sumit has deep expertise in database internals, data warehouses, dimensional modeling, and data science with Java, Python, and SQL. Sumit started his career as part of the SQL Server development team at Microsoft in 1996-97 and then worked as a core server engineer for Oracle on their OLAP development team in Burlington, MA. Sumit has also worked at Verizon as an associate director for big data architecture, where he strategized, managed, architected, and developed platforms and solutions for analytics and machine learning applications. He has also served as Chief Architect at ModelN/LeapfrogRX (2006-2013), where he architected the middle tier core analytics platform with an open source OLAP engine (Mondrian) on J2EE and solved some complex dimensional ETL, modeling, and performance optimization problems. Sumit has an MS and a BS in computer science.

Alexis Roos (@alexisroos) has over 20 years of software engineering experience with strong expertise in data science, big data, and application infrastructure.
Currently an engineering manager at Salesforce, Alexis is managing a team of backend engineers building entry-level Salesforce CRM (SalesforceIQ). Prior to that, Alexis designed a comprehensive US business graph built from billions of records using Spark, GraphX, MLlib, and Scala at Radius Intelligence.

Alexis also worked for Couchbase, Concurrent Inc. startups, and Sun Microsystems/Oracle for over 13 years, and for several large SIs in Europe, where he built and supported dozens of architectures of distributed applications across a range of verticals including telecommunications, healthcare, finance, and government. Alexis holds a master's degree in computer science with a focus on cognitive science. He has spoken at dozens of conferences worldwide (including Spark Summit, Scala by the Bay, Hadoop Summit, and JavaOne), delivered university courses, and participated in industry panels.

www.PacktPub.com

For support files and downloads related to your book, please visit www.PacktPub.com.

Did you know that Packt offers eBook versions of every book published, with PDF and ePub files available? You can upgrade to the eBook version at www.PacktPub.com, and as a print book customer, you are entitled to a discount on the eBook copy.
Get in touch with us at service@packtpub.com for more details.

At www.PacktPub.com, you can also read a collection of free technical articles, sign up for a range of free newsletters, and receive exclusive discounts and offers on Packt books and eBooks.

Mapt
https://www.packtpub.com/mapt

Get the most in-demand software skills with Mapt. Mapt gives you full access to all Packt books and video courses, as well as industry-leading tools to help you plan your personal development and advance your career.

Why subscribe?
  Fully searchable across every book published by Packt
  Copy and paste, print, and bookmark content
  On demand and accessible via a web browser

Table of contents

Preface

Chapter 1: Installing Spark and Setting Up Your Cluster
  Directory organization and convention
  Installing the prebuilt distribution
  Building Spark from source
  Downloading the source
  Compiling the source with Maven
  Compilation switches
  Testing the installation
  Spark topology
  A single machine
  Running Spark on EC2
  Downloading EC-scripts
  Running Spark on EC2 with the scripts
  Deploying Spark on Elastic MapReduce
  Deploying Spark with Chef (Opscode)
  Deploying Spark on Mesos
  Spark on YARN
  Spark standalone mode
  References
  Summary

Chapter 2: Using the Spark Shell
  The Spark shell
  Exiting out of the shell
  Using Spark shell to run the book code
  Loading a simple text file
  Interactively loading data from S3
  Running the Spark shell in Python
  Summary

Chapter 3: Building and Running a Spark Application
  Building Spark applications
  Data wrangling with iPython
  Developing Spark with Eclipse
  Developing Spark with other IDEs
  Building your Spark job with Maven
  Building your Spark job with something else
  References
  Summary

Chapter 4: Creating a SparkSession Object
  SparkSession versus SparkContext
  Building a SparkSession object
  SparkContext - metadata
  Shared Java and Scala APIs
  Python
  iPython
  Reference
  Summary

Chapter 5: Loading and Saving Data in Spark
  Spark abstractions
  RDDs
  Data modalities
  Data modalities and Datasets/DataFrames/RDDs
  Loading data into an RDD
  Saving your data
  References
  Summary

Chapter 6: Manipulating Your RDD
  Manipulating your RDD in Scala and Java
  Scala RDD functions
  Functions for joining the PairRDD classes
  Other PairRDD functions
  Double RDD functions
  General RDD functions
  Java RDD functions
  Spark Java function classes
  Common Java RDD functions
  Methods for combining JavaRDDs
  Functions on JavaPairRDDs
  Manipulating your RDD in Python
  Standard RDD functions
  The PairRDD functions
  References
  Summary

Chapter 7: Spark 2.0 Concepts
  Code and datasets for the rest of the book
  Code
  IDE
  iPython startup and test
  Datasets
  Car-mileage
  Northwind industries sales data
  Titanic passenger list
  State of the Union speeches by POTUS
  Movie lens dataset
  The data scientist and Spark features
  Who is this data scientist DevOps person?
  The data lake architecture
  Data hub
  Reporting hub
  Analytics hub
  Spark v2.0 and beyond
  Apache Spark - evolution
  Apache Spark - the full stack
  The art of a big data store - Parquet
  Column projection and data partition
  Compression
  Smart data storage and predicate pushdown
  Support for evolving schema
  Performance
  References
  Summary

Chapter 8: Spark SQL
  The Spark SQL architecture
  Spark SQL how-to in a nutshell
  Spark SQL with Spark 2.0
  Spark SQL programming
  Datasets/DataFrames
  SQL access to a simple data table
  Handling multiple tables with Spark SQL
  Aftermath
  References
User comments
A very good resource, worth studying.
A really nice resource!!