Mining of Massive Datasets
The book is based on Stanford Computer Science course CS246: Mining Massive Datasets (and CS345A: Data Mining). The book, like the course, is designed at the undergraduate computer science level with no formal prerequisites. To support deeper explorations, most of the chapters are supplemented with further reading references.

Preface

This book evolved from material developed over several years by Anand Rajaraman and Jeff Ullman for a one-quarter course at Stanford. The course, CS345A, titled "Web Mining," was designed as an advanced graduate course, although it has become accessible and interesting to advanced undergraduates.

What the book is about

At the highest level of description, this book is about data mining. However, it focuses on data mining of very large amounts of data, that is, data so large it does not fit in main memory. Because of the emphasis on size, many of our examples are about the Web or data derived from the Web. Further, the book takes an algorithmic point of view: data mining is about applying algorithms to data, rather than using data to "train" a machine-learning engine of some sort. The principal topics covered are:

1. Distributed file systems and map-reduce as a tool for creating parallel algorithms that succeed on very large amounts of data.
2. Similarity search, including the key techniques of minhashing and locality-sensitive hashing (a minimal illustrative sketch of minhashing follows this preface).
3. Data-stream processing and specialized algorithms for dealing with data that arrives so fast it must be processed immediately or lost.
4. The technology of search engines, including Google's PageRank, link-spam detection, and the hubs-and-authorities approach.
5. Frequent-itemset mining, including association rules, market-baskets, the A-Priori Algorithm and its improvements.
6. Algorithms for clustering very large, high-dimensional datasets.
7. Two key problems for Web applications: managing advertising and recommendation systems.

Prerequisites

CS345A, although its number indicates an advanced graduate course, has been found accessible by advanced undergraduates and beginning master's students. In the future, it is likely that the course will be given a mezzanine-level number. The prerequisites for CS345A are:

1. The first course in database systems, covering application programming in SQL and other database-related languages such as XQuery.
2. A sophomore-level course in data structures, algorithms, and discrete mathematics.
3. A sophomore-level course in software systems, software engineering, and programming languages.

Exercises

The book contains extensive exercises, with some for almost every section. We indicate harder exercises or parts of exercises with an exclamation point. The hardest exercises have a double exclamation point.

Support on the Web

You can find materials from past offerings of CS345A at

http://infolab.stanford.edu/~ullman/mining/mining.html

There, you will find slides, homework assignments, project requirements, and, in some cases, exams.

Acknowledgements

We would like to thank Foto Afrati and Arun Marathe for critical readings of the draft of this manuscript. Errors were also reported by Shrey Gupta, Mark Storus, and Roshan Sumbaly. The remaining errors are ours, of course.

A.R.
J.D.U.
Palo Alto, CA
April 2010
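The following short Python sketch, written for this page rather than taken from the book, illustrates minhashing (topic 2 above): it estimates the Jaccard similarity of two sets by comparing their minimum values under many random hash functions. The linear hash-function family, the modulus, and the example sets are illustrative assumptions, not the book's code.

    import random

    PRIME = 2147483647  # a large prime modulus; an assumption of this sketch

    def minhash_signature(s, hash_params):
        # One signature entry per hash function h(x) = (a*x + b) mod PRIME;
        # each entry is the smallest hash value over the set's elements.
        return [min((a * x + b) % PRIME for x in s) for a, b in hash_params]

    def estimate_jaccard(s1, s2, num_hashes=200, seed=0):
        # Draw random coefficients (a, b) for each hash function.
        rng = random.Random(seed)
        params = [(rng.randrange(1, PRIME), rng.randrange(PRIME))
                  for _ in range(num_hashes)]
        sig1 = minhash_signature(s1, params)
        sig2 = minhash_signature(s2, params)
        # The fraction of positions where the signatures agree estimates
        # the Jaccard similarity |s1 intersect s2| / |s1 union s2|.
        return sum(x == y for x, y in zip(sig1, sig2)) / num_hashes

    # Two small sets of (say) shingle IDs with true Jaccard similarity 2/6 = 1/3.
    print(estimate_jaccard({1, 2, 3, 4}, {3, 4, 5, 6}))

With 200 hash functions the printed estimate should land near the true value of 1/3; Chapter 3 develops why two sets' minhash values agree with probability equal to their Jaccard similarity.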
Contents

1 Data Mining
  1.1 What is Data Mining?
    1.1.1 Statistical Modeling
    1.1.2 Machine Learning
    1.1.3 Computational Approaches to Modeling
    1.1.4 Summarization
    1.1.5 Feature Extraction
  1.2 Statistical Limits on Data Mining
    1.2.1 Total Information Awareness
    1.2.2 Bonferroni's Principle
    1.2.3 An Example of Bonferroni's Principle
    1.2.4 Exercises for Section 1.2
  1.3 Things Useful to Know
    1.3.1 Importance of Words in Documents
    1.3.2 Hash Functions
    1.3.3 Indexes
    1.3.4 Secondary Storage
    1.3.5 The Base of Natural Logarithms
    1.3.6 Power Laws
    1.3.7 Exercises for Section 1.3
  1.4 Outline of the Book
  1.5 Summary of Chapter 1
  1.6 References for Chapter 1

2 Large-Scale File Systems and Map-Reduce
  2.1 Distributed File Systems
    2.1.1 Physical Organization of Compute Nodes
    2.1.2 Large-Scale File-System Organization
  2.2 Map-Reduce
    2.2.1 The Map Tasks
    2.2.2 Grouping and Aggregation
    2.2.3 The Reduce Tasks
    2.2.4 Combiners
    2.2.5 Details of Map-Reduce Execution
    2.2.6 Coping With Node Failures
  2.3 Algorithms Using Map-Reduce
    2.3.1 Matrix-Vector Multiplication by Map-Reduce
    2.3.2 If the Vector v Cannot Fit in Main Memory
    2.3.3 Relational-Algebra Operations
    2.3.4 Computing Selections by Map-Reduce
    2.3.5 Computing Projections by Map-Reduce
    2.3.6 Union, Intersection, and Difference by Map-Reduce
    2.3.7 Computing Natural Join by Map-Reduce
    2.3.8 Generalizing the Join Algorithm
    2.3.9 Grouping and Aggregation by Map-Reduce
    2.3.10 Matrix Multiplication
    2.3.11 Matrix Multiplication with One Map-Reduce Step
    2.3.12 Exercises for Section 2.3
  2.4 Extensions to Map-Reduce
    2.4.1 Workflow Systems
    2.4.2 Recursive Extensions to Map-Reduce
    2.4.3 Pregel
    2.4.4 Exercises for Section 2.4
  2.5 Efficiency of Cluster-Computing Algorithms
    2.5.1 The Communication-Cost Model for Cluster Computing
    2.5.2 Elapsed Communication Cost
    2.5.3 Multiway Joins
    2.5.4 Exercises for Section 2.5
  2.6 Summary of Chapter 2
  2.7 References for Chapter 2

3 Finding Similar Items
  3.1 Applications of Near-Neighbor Search
    3.1.1 Jaccard Similarity of Sets
    3.1.2 Similarity of Documents
    3.1.3 Collaborative Filtering as a Similar-Sets Problem
    3.1.4 Exercises for Section 3.1
  3.2 Shingling of Documents
    3.2.1 k-Shingles
    3.2.2 Choosing the Shingle Size
    3.2.3 Hashing Shingles
    3.2.4 Shingles Built from Words
    3.2.5 Exercises for Section 3.2
  3.3 Similarity-Preserving Summaries of Sets
    3.3.1 Matrix Representation of Sets
    3.3.2 Minhashing
    3.3.3 Minhashing and Jaccard Similarity
    3.3.4 Minhash Signatures
    3.3.5 Computing Minhash Signatures
    3.3.6 Exercises for Section 3.3
  3.4 Locality-Sensitive Hashing for Documents
    3.4.1 LSH for Minhash Signatures
    3.4.2 Analysis of the Banding Technique
    3.4.3 Combining the Techniques
    3.4.4 Exercises for Section 3.4
  3.5 Distance Measures
    3.5.1 Definition of a Distance Measure
    3.5.2 Euclidean Distances
    3.5.3 Jaccard Distance
    3.5.4 Cosine Distance
    3.5.5 Edit Distance
    3.5.6 Hamming Distance
    3.5.7 Exercises for Section 3.5
  3.6 The Theory of Locality-Sensitive Functions
    3.6.1 Locality-Sensitive Functions
    3.6.2 Locality-Sensitive Families for Jaccard Distance
    3.6.3 Amplifying a Locality-Sensitive Family
    3.6.4 Exercises for Section 3.6
  3.7 LSH Families for Other Distance Measures
    3.7.1 LSH Families for Hamming Distance
    3.7.2 Random Hyperplanes and the Cosine Distance
    3.7.3 Sketches
    3.7.4 LSH Families for Euclidean Distance
    3.7.5 More LSH Families for Euclidean Spaces
    3.7.6 Exercises for Section 3.7
  3.8 Applications of Locality-Sensitive Hashing
    3.8.1 Entity Resolution
    3.8.2 An Entity-Resolution Example
    3.8.3 Validating Record Matches
    3.8.4 Matching Fingerprints
    3.8.5 A LSH Family for Fingerprint Matching
    3.8.6 Similar News Articles
    3.8.7 Exercises for Section 3.8
  3.9 Methods for High Degrees of Similarity
    3.9.1 Finding Identical Items
    3.9.2 Representing Sets as Strings
    3.9.3 Length-Based Filtering
    3.9.4 Prefix Indexing
    3.9.5 Using Position Information
    3.9.6 Using Position and Length in Indexes
    3.9.7 Exercises for Section 3.9
  3.10 Summary of Chapter 3
  3.11 References for Chapter 3

4 Mining Data Streams
  4.1 The Stream Data Model
    4.1.1 A Data-Stream-Management System
    4.1.2 Examples of Stream Sources
    4.1.3 Stream Queries
    4.1.4 Issues in Stream Processing
  4.2 Sampling Data in a Stream
    4.2.1 A Motivating Example
    4.2.2 Obtaining a Representative Sample
    4.2.3 The General Sampling Problem
    4.2.4 Varying the Sample Size
    4.2.5 Exercises for Section 4.2
  4.3 Filtering Streams
    4.3.1 A Motivating Example
    4.3.2 The Bloom Filter
    4.3.3 Analysis of Bloom Filtering
    4.3.4 Exercises for Section 4.3
  4.4 Counting Distinct Elements in a Stream
    4.4.1 The Count-Distinct Problem
    4.4.2 The Flajolet-Martin Algorithm
    4.4.3 Combining Estimates
    4.4.4 Space Requirements
    4.4.5 Exercises for Section 4.4
  4.5 Estimating Moments
    4.5.1 Definition of Moments
    4.5.2 The Alon-Matias-Szegedy Algorithm for Second Moments
    4.5.3 Why the Alon-Matias-Szegedy Algorithm Works
    4.5.4 Higher-Order Moments
    4.5.5 Dealing With Infinite Streams
    4.5.6 Exercises for Section 4.5
  4.6 Counting Ones in a Window
    4.6.1 The Cost of Exact Counts
    4.6.2 The Datar-Gionis-Indyk-Motwani Algorithm
    4.6.3 Storage Requirements for the DGIM Algorithm
    4.6.4 Query Answering in the DGIM Algorithm
    4.6.5 Maintaining the DGIM Conditions
    4.6.6 Reducing the Error
    4.6.7 Extensions to the Counting of Ones
    4.6.8 Exercises for Section 4.6
  4.7 Decaying Windows
    4.7.1 The Problem of Most-Common Elements
    4.7.2 Definition of the Decaying Window
    4.7.3 Finding the Most Popular Elements
  4.8 Summary of Chapter 4
  4.9 References for Chapter 4

5 Link Analysis
  5.1 PageRank
    5.1.1 Early Search Engines and Term Spam
    5.1.2 Definition of PageRank
    5.1.3 Structure of the Web
    5.1.4 Avoiding Dead Ends
    5.1.5 Spider Traps and Taxation
    5.1.6 Using PageRank in a Search Engine
    5.1.7 Exercises for Section 5.1
  5.2 Efficient Computation of PageRank
    5.2.1 Representing Transition Matrices
    5.2.2 PageRank Iteration Using Map-Reduce
    5.2.3 Use of Combiners to Consolidate the Result Vector
    5.2.4 Representing Blocks of the Transition Matrix
    5.2.5 Other Efficient Approaches to PageRank Iteration
    5.2.6 Exercises for Section 5.2
  5.3 Topic-Sensitive PageRank
    5.3.1 Motivation for Topic-Sensitive PageRank
    5.3.2 Biased Random Walks
    5.3.3 Using Topic-Sensitive PageRank
    5.3.4 Inferring Topics from Words
    5.3.5 Exercises for Section 5.3
  5.4 Link Spam
    5.4.1 Architecture of a Spam Farm
    5.4.2 Analysis of a Spam Farm
    5.4.3 Combating Link Spam
    5.4.4 TrustRank
    5.4.5 Spam Mass
    5.4.6 Exercises for Section 5.4
  5.5 Hubs and Authorities
    5.5.1 The Intuition Behind HITS
    5.5.2 Formalizing Hubbiness and Authority
    5.5.3 Exercises for Section 5.5
  5.6 Summary of Chapter 5
  5.7 References for Chapter 5

6 Frequent Itemsets
  6.1 The Market-Basket Model
    6.1.1 Definition of Frequent Itemsets
    6.1.2 Applications of Frequent Itemsets
    6.1.3 Association Rules
    6.1.4 Finding Association Rules with High Confidence
    6.1.5 Exercises for Section 6.1
  6.2 Market Baskets and the A-Priori Algorithm
    6.2.1 Representation of Market-Basket Data
    6.2.2 Use of Main Memory for Itemset Counting
    6.2.3 Monotonicity of Itemsets
    6.2.4 Tyranny of Counting Pairs
    6.2.5 The A-Priori Algorithm
    6.2.6 A-Priori for All Frequent Itemsets
    6.2.7 Exercises for Section 6.2
  6.3 Handling Larger Datasets in Main Memory
    6.3.1 The Algorithm of Park, Chen, and Yu
    6.3.2 The Multistage Algorithm
    6.3.3 The Multihash Algorithm
    6.3.4 Exercises for Section 6.3
  6.4 Limited-Pass Algorithms
    6.4.1 The Simple, Randomized Algorithm
    6.4.2 Avoiding Errors in Sampling Algorithms
    6.4.3 The Algorithm of Savasere, Omiecinski, and Navathe
    6.4.4 The SON Algorithm and Map-Reduce
    6.4.5 Toivonen's Algorithm
    6.4.6 Why Toivonen's Algorithm Works
    6.4.7 Exercises for Section 6.4
  6.5 Counting Frequent Items in a Stream
    6.5.1 Sampling Methods for Streams
    6.5.2 Frequent Itemsets in Decaying Windows
    6.5.3 Hybrid Methods
    6.5.4 Exercises for Section 6.5
  6.6 Summary of Chapter 6
  6.7 References for Chapter 6

7 Clustering
  7.1 Introduction to Clustering Techniques
    7.1.1 Points, Spaces, and Distances
    7.1.2 Clustering Strategies
    7.1.3 The Curse of Dimensionality
    7.1.4 Exercises for Section 7.1
  7.2 Hierarchical Clustering
    7.2.1 Hierarchical Clustering in a Euclidean Space
    7.2.2 Efficiency of Hierarchical Clustering
    7.2.3 Alternative Rules for Controlling Hierarchical Clustering
    7.2.4 Hierarchical Clustering in Non-Euclidean Spaces
    7.2.5 Exercises for Section 7.2