Big Data: Hadoop MapReduce
MapReduce Fundamentals and Algorithm Design

New Task: run thousands of simulations
You have sets of parameters for thousands of small simulations.
Divide the parameter sets among k computers.
f runs the simulation and produces some output; apply it to every item.
Now we have a big distributed set of simulation results.
http://escience.washington.edu/get-help-now/dave-williams-simulating-muscle-dynamics-cloud

Find the most common word in each document
You have millions of documents.
Distribute the documents among k computers.
f finds the most common word in a single document.
Now we have a big distributed list of (doc id, word) pairs.

University of Washington

Consider a slightly more general program to compute the word frequency of every word in a single document.
Example output: (people, 2), (government, 6), (assume, 1), (history, 2)

Abridged Declaration of Independence
A Declaration By the Representatives of the United States of America, in General Congress Assembled

When in the course of human events it becomes necessary for a people to advance from that subordination in which they have hitherto remained, and to assume among powers of the earth the equal and independent station to which the laws of nature and of nature's god entitle them, a decent respect to the opinions of mankind requires that they should declare the [...] which [...]

We hold these truths to be self-evident; that all men are created equal and independent; that from that equal creation they derive rights inherent and inalienable, among which are the preservation of life, and liberty, and the pursuit of happiness; that to secure these ends, governments are instituted among men, deriving their just power from the consent of the governed; that whenever any form of government shall become destructive of these ends, it is the right of the people to alter or to abolish it, and to institute new government, laying it's foundation on such principles and organizing it's power in such form, as to them shall seem most likely to effect their safety and happiness.
Prudence indeed will dictate that governments long established should not be changed for light and transient causes: and accordingly all experience hath shewn that mankind are more disposed to suffer while evils are sufferable, than to right themselves by abolishing the forms to which they are accustomed. But when a long train of abuses and usurpations, begun at a distinguished period, and pursuing invariably the same object, evinces a design to reduce them to arbitrary power, it is their right, it is their duty, to throw off such government and to provide new guards for future security. Such has been the patient sufferings of the colonies; and such is now the necessity which constrains them to expunge their former systems of government. The history of his present majesty is a history of unremitting injuries and usurpations, among which no one fact stands single or solitary to contradict the uniform tenor of the rest, all of which have in direct object the establishment of an absolute tyranny over these states. To prove this, let facts be submitted to a candid world, for the truth of which we pledge a faith yet unsullied by falsehood.

Compute the word frequency of 5M documents
You have millions of documents.
Distribute the documents among k computers.
For each document, f returns a set of (word, freq) pairs.
Now we have a big distributed list of sets of word freqs.

There's a pattern here
A function that maps a read to a trimmed read
A function that maps a TIFF image to a PNG image
A function that maps a set of parameters to a simulation result
A function that maps a document to its most common word
A function that maps a document to a histogram of word frequencies

What if we want to compute the word frequency across all documents?
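The pattern in the slides above, one function f applied independently to every item, can be sketched in Python as follows. This is only an illustration under stated assumptions: the names tokenize, word_freq, most_common_word and apply_to_all are hypothetical, the tokenizer is a simple lowercase regex, documents are assumed non-empty, and a local thread pool stands in for the k computers. Note that it yields one small histogram per document, not a single histogram across all documents.

```python
import re
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def tokenize(doc):
    # Assumption: words are runs of lowercase letters and apostrophes.
    return re.findall(r"[a-z']+", doc.lower())


def word_freq(doc):
    # f: one document -> histogram of word frequencies.
    return dict(Counter(tokenize(doc)))


def most_common_word(doc):
    # f: one document -> its most common word (assumes a non-empty document).
    return Counter(tokenize(doc)).most_common(1)[0][0]


def apply_to_all(f, items, k=4):
    # A pool of k worker threads stands in for the k computers;
    # results come back in the same order as the input items.
    with ThreadPoolExecutor(max_workers=k) as pool:
        return list(pool.map(f, items))


docs = ["the people and the government", "a history of history"]
histograms = apply_to_all(word_freq, docs)
top_words = apply_to_all(most_common_word, docs)
```

Swapping word_freq for most_common_word changes what is computed but not how the work is distributed, which is the point of the pattern.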
[Figure: three source documents (US Constitution, Declaration of Independence, Articles of Confederation) combined into one word-frequency list]
Example output: (people, [illegible]), (government, 123), (assume, 23), (history, 38)

Compute the word frequency across 5M documents
You have millions of documents.
Distribute the documents among k computers.
map: for each document, return a set of (word, freq) pairs.
But we don't want a bunch of little histograms; we want one big histogram.
Now what?
How can we make sure that a single computer has access to every occurrence of a given word, regardless of which document it appeared in?

Compute the word frequency across 5M documents
Distribute the documents among k computers.
map: for each document, return a set of (word, freq) pairs.
Now we have a big distributed list of sets of word freqs.
reduce: now just count the occurrences of each word.
We have our distributed histogram.
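A minimal single-process sketch of the whole pipeline follows. The step the slides leave implicit is the shuffle between map and reduce: hashing each word to a partition means that one reducer sees every occurrence of a given word, whichever document it came from. The function names (map_doc, shuffle, reduce_word, word_count) and the crc32-based partitioning are assumptions made for illustration, not Hadoop's API, and plain loops stand in for the distributed machines.

```python
import re
import zlib
from collections import Counter, defaultdict


def map_doc(doc):
    # map: one document -> list of (word, freq) pairs.
    words = re.findall(r"[a-z']+", doc.lower())
    return list(Counter(words).items())


def shuffle(mapped, k):
    # Route every pair for a given word to the same one of k partitions.
    partitions = [defaultdict(list) for _ in range(k)]
    for pairs in mapped:
        for word, freq in pairs:
            idx = zlib.crc32(word.encode("utf-8")) % k
            partitions[idx][word].append(freq)
    return partitions


def reduce_word(word, freqs):
    # reduce: add up the per-document counts of one word.
    return word, sum(freqs)


def word_count(docs, k=3):
    mapped = [map_doc(doc) for doc in docs]  # would run on k computers
    partitions = shuffle(mapped, k)
    # Each partition is reduced on its own: a distributed histogram.
    return [dict(reduce_word(w, fs) for w, fs in part.items())
            for part in partitions]


docs = ["the people and the government", "the history of the people"]
histogram = word_count(docs, k=3)

# Merging the partitions is only for inspection; each word lives in one.
merged = {}
for part in histogram:
    merged.update(part)
```

The result stays split across k partitions, matching the "distributed histogram" of the final slide; merging is done here only to make the totals easy to check.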