Practical Data Science
Practical Data Science - A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets

Book Description: Learn how to build a data science technology stack and perform good data science with repeatable methods. You will learn how to turn data lakes into business assets.

Practical Data Science: A Guide to Building the Technology Stack for Turning Data Lakes into Business Assets
Andreas Francois Vermeulen
West Kilbride, North Ayrshire, United Kingdom
ISBN-13 (pbk): 978-1-4842-3053-4
ISBN-13 (electronic): 978-1-4842-3054-1
https://doi.org/10.1007/978-1-4842-3054-1
Library of Congress Control Number: 2018934681
Copyright © 2018 by Andreas Francois Vermeulen

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.

Trademarked names, logos, and images may appear in this book. Rather than use a trademark symbol with every occurrence of a trademarked name, logo, or image, we use the names, logos, and images only in an editorial fashion and to the benefit of the trademark owner, with no intention of infringement of the trademark.

The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.

While the advice and information in this book are believed to be true and accurate at the date of publication, neither the author nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein.

Managing Director, Apress Media LLC: Welmoed Spahr
Acquisitions Editor: Susan McDermott
Development Editor: Laura Berenson
Coordinating Editor: Rita Fernando
Cover designed by eStudio Calamar
Cover image designed by Freepik (www.freepik.com)

Distributed to the book trade worldwide by Springer Science+Business Media New York, 233 Spring Street, 6th Floor, New York, NY 10013. Phone 1-800-SPRINGER, fax (201) 348-4505, e-mail orders-ny@springer-sbm.com, or visit www.springeronline.com. Apress Media, LLC is a California LLC and the sole member (owner) is Springer Science + Business Media Finance Inc (SSBM Finance Inc). SSBM Finance Inc is a Delaware corporation.

For information on translations, please e-mail rights@apress.com, or visit www.apress.com/rights-permissions.

Apress titles may be purchased in bulk for academic, corporate, or promotional use. eBook versions and licenses are also available for most titles. For more information, reference our Print and eBook Bulk Sales web page at www.apress.com/bulk-sales.

Any source code or other supplementary material referenced by the author in this book is available to readers on GitHub via the book's product page, located at www.apress.com/9781484230534. For more detailed information, please visit www.apress.com/source-code.

Printed on acid-free paper

Table of Contents

About the Author
About the Technical Reviewer
Acknowledgments
Introduction

Chapter 1: Data Science Technology Stack
    Rapid Information Factory Ecosystem
    Data Science Storage Tools
        Schema-on-Write and Schema-on-Read
    Data Lake
    Data Vault
        Hubs
        Links
        Satellites
    Data Warehouse Bus Matrix
    Data Science Processing Tools
    Spark
        Spark Core
        Spark SQL
        Spark Streaming
        MLlib Machine Learning Library
        GraphX
    Mesos
    Akka
    Cassandra
    Kafka
        Kafka Streams
        Kafka Connect
    Elastic Search
    Scala
    Python
    MQTT (MQ Telemetry Transport)
    What's Next?

Chapter 2: Vermeulen-Krennwallner-Hillman-Clark
    Windows
    Linux
    It's Now Time to Meet Your Customer
        Vermeulen PLC
        Krennwallner AG
        Hillman Ltd
        Clark Ltd
    Processing Ecosystem
        Apache Spark
        Apache Mesos
        Akka
        Apache Cassandra
        Kafka
        Message Queue Telemetry Transport
    Example Ecosystem
        Python
        Is Python Ready?
        R
        Development Environment
        R Packages
    Sample Data
        IP Addresses Data Sets
        Customer Data Sets
        Logistics Data Sets
    Summary

Chapter 3: Layered Framework
    Definition of Data Science Framework
    Cross-Industry Standard Process for Data Mining (CRISP-DM)
        Business Understanding
        Data Understanding
        Data Preparation
        Modeling
        Evaluation
        Deployment
    Homogeneous Ontology for Recursive Uniform Schema
    The Top Layers of a Layered Framework
        The Basics for Business Layer
        The Basics for Utility Layer
        The Basics for Operational Management Layer
        The Basics for Audit, Balance, and Control Layer
        The Basics for Functional Layer
    Layered Framework for High-Level Data Science and Engineering
        Windows
        Linux
    Summary

Chapter 4: Business Layer
    Business Layer
        The Functional Requirements
        The Nonfunctional Requirements
        Common Pitfalls with Requirements
    Engineering a Practical Business Layer
        Requirements
        Requirements Registry
        Traceability Matrix
    Summary

Chapter 5: Utility Layer
    Basic Utility Design
        Data Processing Utilities
        Maintenance Utilities
        Processing Utilities
    Engineering a Practical Utility Layer
        Maintenance Utility
        Data Utility
        Processing Utility
    Summary

Chapter 6: Three Management Layers
    Operational Management Layer
        Processing-Stream Definition and Management
        Parameters
        Scheduling
        Monitoring
        Communication
        Alerting
    Audit, Balance, and Control Layer
        Audit
        Process Tracking
        Data Provenance
        Data Lineage
        Balance
        Control
    Yoke Solution
        Producer
        Consumer
        Directed Acyclic Graph Scheduling
        Yoke Example
    Cause-and-Effect Analysis System
    Functional Layer
    Data Science Process
        Start with a What-if Question
        Take a Guess at a Potential Pattern
        Gather Observations and Use Them to Produce a Hypothesis
        Use Real-World Evidence to Verify the Hypothesis
        Collaborate Promptly and Regularly with Customers and Subject Matter Experts As You Gain Insights
    Summary

Chapter 7: Retrieve Superstep
    Data Lakes
    Data Swamps
        1. Start with Concrete Business Questions
        2. Data Governance
        3. Data Quality
        4. Audit and Version Management
    Training the Trainer Model
    Understanding the Business Dynamics of the Data Lake
    R Retrieve Solution
        Vermeulen PLC
        Krennwallner AG
        Hillman Ltd
        Clark Ltd
    Actionable Business Knowledge from Data Lakes
    Engineering a Practical Retrieve Superstep
        Vermeulen PLC
        Krennwallner AG
        Hillman Ltd
        Clark Ltd
    Connecting to Other Data Sources
        SQLite
        Date and Time
        Other Databases
        PostgreSQL
        Microsoft SQL Server
        MySQL
        Oracle
        Microsoft Excel
        Apache Spark
        Apache Cassandra
        Apache Hive
        Apache Hadoop
        Amazon S3 Storage
        Amazon Redshift
        Amazon Web Services
    Summary

Chapter 8: Assess Superstep
    Assess Superstep
    Errors
        Accept the Error
        Reject the Error
        Correct the Error
        Create a Default Value
    Analysis of Data
        Completeness
        Uniqueness
        Timeliness
        Validity
        Accuracy
    Practical Actions
        Missing Values in Pandas
    Engineering a Practical Assess Superstep
        Vermeulen PLC
        Krennwallner AG
        Hillman Ltd
        Clark Ltd
    Summary

Chapter 9: Process Superstep
    Data Vault
        Hubs
        Links
        Satellites
        Reference
    Time-Person-Object-Location-Event Data Vault
        Time Section
        Person Section
        Object Section
        Location Section
        Event Section
    Engineering a Practical Process Superstep
        Time