TAO: Facebook’s Distributed Data Store for the Social Graph
Facebook’s Distributed Data Store for the Social GraphActions may be encoded either as objects or associ- 3.4 Association Query APIations. Both Cathys comment and Davids like'represent actions taken by a user but only the comment resultsThe starting point for any TAO association query is anin a new object. Associations naturally model actionsoriginating object and an association type. This is thethat can happen at most once or record state transitions,natural result of searching for a specific type of informa-such as the acceptance of an event invitation, while retion about a particular object. Consider the example ineatable actions are better represented as objectsFigure 1. In order to display the Checkin object, theAlthough associations are directed, it is common for application needs to enumerate all tagged users and thean association to be tightly coupled with an inverse edgemost recently added commentsIn this example all of the associations have an inverse A characteristic of the social graph is that most of theexcept for the link of type COMMENT. No inverse data is old, but many of the queries are for the newestedge is required here since the application does not tra-subset. This creation-time locality arises whenever anverse from the comment to the ChECKIn object. Once application focuses on recent items. If the Alice in Fig-the checkin's id is known, rendering Figure l a only re- ure I is a famous celebrity then there might be thousandsquires traversing outbound associations. Discovering theof comments attached to her checkin, but only the mostcheckin object, however, requires the inbound edges or recent ones will be rendered by defaultthat an id is stored in another Facebook systemTAOS association queres are organized around assoThe schemas for object and association types describe ciation lists. We define an association list to be the list ofonly the data contained in instances. They do not impose all associations with a particular idl and atype, arrangedany restrictions on the edge types that can connect to a in descending order by the time fieldparticular node type, or the node types that can terminatean edge type. The same atype is used to represent auAssociation lists:(id1, atype)→anew…aodthorship of the checkin object and the comment object in For example, the list(i, COMMENT)has edges to theFigure 1, for example. Self-edges are allowedexamples comments about i, most recent firstTAOs queries on associations lists3.2 Object APIassoc-get(idl, atype, id2set, high?, low?TAOS object API provides operations to allocate a newreturns all of the associations(id1, atype, id2)andobject and id, and to retrieve, update, or delete the objecttheir time and data, where id2 E id 2set and highassociated with an id. a notable omission is a compare≥time≥low( Gif specified). The optional timeand-set functionality, whose usefulness is substantiallbounds are to improve cacheability for large assoreduced by TAOs eventual consistency semantics. Theciation lists(see§update operation can be applied to a subset of the fieldsassoc-count(idl, atype)-returns the size of the3.3 Association apiassociation list for(id 1, atype), which is the numes of type atype that originate at id1Many edges in the social graph are bidirectional, eiassoc_range(id 1, atype, pos, limit)-returns elther symmetrically like the examples FRIEND rela-ements of the(idl, atype)association list with intionship or asymmetrically like AUTHORED and AUdexi∈pos,pos+ limit)THORED_BY. Bidirectional edges are modeled as twoseparate associations. TAO provides support for keepingassoc_time_range idl, atype, high, low, limit)returns elements from the(id 1, atype) associationassociations in sync with their inverses, by allowing as-list. starting with the first association where timesociation types to be configured with an inverse type. Forsuch associations, creations, updates, and deletions arehigh, returning only edges where time 2 lowautomatically coupled with an operation on the inverseAO enforces a per-atype upper bound(typicallyassociation. Symmetric bidirectional types are their own 6,000)on the actual limit used for an association queryinverses. The association write operations areTo enumerate the elements of a longer association listassoc-add (idl, atype, id2, time,(k-v)*)the client must issue multiple queries, using pos or highAdds or overwrites the association(idl, atype, id2),to specity a starting pointand its inverse(id 1, inv(atype), id2 )if definedFor the example shown in Figure I we can map someassoc_delete(id1, atype, id2 )-Deletes the asso-possible queries to the TAO API as follows:ciation(id1, atype, id2)and the inverse if it exists·“50 most recent comments on alice' s checkin”→assoc_change-type(id1, atype, id, newtype)assoc_range(632, COMMENT, 0. 50)Changes the association(id 1, atype, id2 )to(id1,·“ How many checkins at the Gg bridge?”→newtype, id2), if(id1, atype, id2)existsassoc_count(534, CHECKINUSENIX Association2013 USENIX Annual Technical Conference(USENIX ATC 13)514 TAO Architectureor write. For cache misses and write requests, the servercontacts other caches and/or databasesIn this section we describe the units that make up TAOThe TAO in-memory cache contains objects, associand the multiple layers of aggregation that allow it to ation lists, and association counts. We fill the cache onscale across data centers and geographic regions. TAOdemand and evict items using a least recently used (Lru)is separated into two caching layers and a storage layer.policy. Cache servers understand the semantics of their4.1 Storage Lavercontents and use them to answer queries even if the exactquery has not been previously processed, e.g. a cachedObjects and associations were stored in My SQL at Face- count of zero is sufficient to answer a range querybook even before TAO was built; it was the backing storeWrite operations on an association with an inversefor the original PHP implementation of the API. This may involve two shards, since the forward edge is storedmade it the natural choice for TAOs persistent storageon the shard for idl and the inverse edge is on the shardThe TAo API is mapped to a small set of simple for id2. The tier member that receives the query fromSQL queries, but it could also be mapped efficiently to the client issues an RPC call to the member hosting id2range scans in a non-sQL data storage system such as which will contact the database to create the inverse aSso-LevelDb 3] by explicitly maintaining the required in- ciation. Once the inverse write is complete, the cachingdexes. When evaluating the suitability of a backing store server issues a write to the database for id 1. TAO doesfor TAO. however, it is important to consider the data not provide atomicity between the two updates. Ifaccesses that don t use the API. These include back- failure occurs the forward may exist without an inverseups, bulk import and deletion of data, bulk migrations these hanging associations are scheduled for repair by anfrom one data format to another, replica creation, asyn- asynchronous jobchronous replication, consistency monitoring tools, andoperational debugging. An alternate store would also 4.3 Client Communication Stackhave to provide atomic write transactions, efficient granular writes, and few latency outliersIt is common for hundreds of objects and associationsGiven that TAO needs to handle a far larger volume ofof to be queried while ren dering a Facebook page, which isdata than can be stored on a single MysQL server, welikely to require communication with many cache serversdivide data into logical shards. Each shard is containedin a short period of time. The challenges of the resultingin a logical database. Database servers are responsibleall-to-all communication are similar to those faced by ourfor one or more shards. In practice, the number of shardsmemcache pools. TAO and memcache share most of thear exceeds the number of servers, we tune the shard toclient stack described by Nishtala et al. [21]. The latencyserver mapping to balance load across different hosts. Byof TAo requests can be much higher than those of memdefault all object types are stored in one table, and acache, because TAO requests may access the databaseassociation types in anotherso to avoid head-of-line blocking on multiplexed connecEach object id contains an embedded shard_id thattions we use a protocol with out-of-order responsesidentifies its hosting shard. Objects are bound to a shardfor their entire lifetime. An association is stored on the4.4 Leaders and followersshard of its idl, so that every association query can beIn theory a single cache tier could be scaled to handle anyserved from a single server. Two ids are unlikely to map foreseeable aggregate request rate, so long as shards areto the same server unless they were explicitly colocatedsmall enough. In practice, though, large tiers are probat creation timelematic because they are more prone to hot spots and they4.2 Caching Layerhave a quadratic growth in all-to-all connectionsTo add servers while limiting the maximum tier sizeTAO s cache implements the complete API for clients, we split the cache into two levels: a leader tier and mulhandling all communication with databases. The caching tiple follower tiers. Some of TAO's advantages over alayer consists of multiple cache servers that together lookaside cache architecture(as described in$ 2. 1)relyform a tier. A tier is collectively capable of responding to on having a single cache coordinator per database; thisany TAO request. (We also refer to the set of databases split allows us to keep the coordinators in a single tierin one region as a tier. Each request maps to a single per region. As in the single-tier configuration, each tiercache server using a sharding scheme similar to the one contains a set of cache servers that together are capabledescribed in 8 4.I. There is no requirement that tiers have of responding to any TAO query; that is, every shard inthe same number of hoststhe system maps to one caching server in each tier. LeadClients issue requests directly to the appropriate cache ers(members of the leader tier)behave as described inserver, which is then responsible for completing the read 84.2, reading from and writing to the storage layer. Fol62 2013 USENIX Annual Technical Conference(USENIX ATC 13)USENIX ASSOCiationlowers(members of follower tiers) will instead forwardread misses and writes to a leader. Clients communicateMaster Region for ShardSlave Region for ShardClients FollowersFollowerwith the closest follower tier and never contact leadersLeaderdirectly; if the closest follower is unavailable they failCacheacheover to another nearby follower tierGiven this two-level caching hierarchy, care must betaken to keep TAO caches consistent. Each shard ishosted by one leader, and all writes to the shard gothrough that leader, so it is naturally consistent. Follow-Master ddSlave ders, on the other hand, Imust be explicitly notified of updates made via other follower tiersTAO provides eventual consistency [33, 35] by asynFigure 2: Multi-region TAO configuration. The masterchronously sending cache maintenance messages fromregion sends read misses, writes, and embedded con-the leader to the followers. An object update in the leaderSistency messages to the master database(A). Consisenqueues invalidation messages to each correspondingtency messages are delivered to the slave leader(b)asfollower. The follower that issued the write is updatedthe replication stream updates the slave database. Slavesynchronously on reply from the leader; a version numleader sends writes to the master leader (C) and readber in the cache maintenance message allows it to be igmisses to the replica DB D). The choice of master andnored when it arrives later. Since we cache only conslave is made separately for each shardtiguous prefixes of association lists. invalidating an association might truncate the list and discard many edgesInstead, the leader sends a refill message to notify follow- read misses to be serviced locally. As with the leader/ers about an association write. If a follower has cached follower design, we propagate update notifications asynthe association, then the refill request triggers a query to chronously to maximize performance and availability, atthe leader to update the follower's now-stale association the expense of data freshnesslist.8 6.1 discusses the consistency of this design andThe social graph is tightly interconnected; it is not posalso how it tolerates failuressible to group users so that cross-partition requests areLeaders serialize concurrent writes that arrive from rare. This means that each TAO follower must be localfollowers. Because a single leader mediates all of theto a tier of databases holding a complete multi-petabyterequests for an idl, it is also ideally positioned to protect copy of the social graph. It would be prohibitively exthe database from thundering herds. The leader ensures pensive to provide full replicas in every data centerthat it does not issue concurrent overlapping queries toOur solution to this problem is to choose data centerthe database and also enforces a limit on the maximumlocations that are clustered into only a few regions, wherenumber of pending queries to a shardthe intra-region latency is small (typically less than I mil4.5 Scaling Geographicallylisecond). It is thne complete ceof the social graph per region Figure 2 shows the overallThe leader and followers configuration allows TAO to architecture of the master/slave TAO systemcale to handle a high workload, since read throughputFollowers behave identically in all regions, forwardingscales with the total number of follower servers in all read misses and writes to the local region s leader tiertiers. Implicit in the design, however, is the assumptionLeaders querythe local regions database regardless ofthat the network latencies from follower to leader and whether it is the master or slave. Writes. however areleader to database are low. This assumption is reasonable forwarded by the local leader to the leader that is in theif clients are restricted to a single data center, or even to region with the master database. This means that reada set of data centers in close proximity. It is not true, latency is independent of inter-region latencyhowever, in our production environmentThe master region is controlled separately for eachAs our social networking applications computing and shard, and is automatically switched to recover from thenetwork requirements have grown, we have had to ex- failure of a database. Writes that fail during the switchpand beyond a single geographical location: today, fol- are reported to the client as failed, and are not retriedlower tiers can be thousands of miles apart. In this con- Note that since each cache hosts multiple shards, a serverfiguration, network round trip times can quickly become may be both a master and a slave at the same time. Wethe bottleneck of the overall architecture. Since read prefer to locate all of the master databases in a single remisses by followers are 25 times as frequent as writes in gion. When an inverse association is mastered in a differour workloads, we chose a master/slave architecture that ent region, TAO must traverse an extra inter-region linkrequires writes to be sent to the master, but that allows to forward the inverse writeUSENIX Association2013 USENIX Annual Technical Conference(USENIX ATC 13)53TAO embeds invalidation and refill messages in the bucket is tracked by simply sliding the entries down. Wedatabase replication stream. These messages are deliv- achieve additional memory efficiency by adding a tableered in a region immediately after a transaction has been that maps the each active atype to a 16 bit value. Thisreplicated to a slave database. Delivering such messages lets us map(id1, atype) to a 32-bit count in 14 bytesearlier would create cache inconsistencies, as reading negative entry, which records the absence of any id2 forfrom the local database would provide stale data. At an(id, atype), takes only 10 bytes. This optimizationFacebook TAO and memcache use the same pipeline for allows us to hold about 20% more items in cache for adelivery of invalidations and refills [21]given system configurationIf a forwarded write is successful then the local leaderwill update its cache with the fresh value, even though 5.2 MySQL Mappingthe local slave database probably has not yet been updated by the asynchronous replication stream. In thisRecall that we divide the space of objects and associ-case followers will receive two invalidates or refills from ations into shards. Each shard is assigned to a logicalthe write. one that is sent when the write succeeds andMySQL database that has a table for objects and a tableone that is sent when the write's transaction is replicatedlicated for associations. All of the fields of an object are serialto the local slave databaseized into a single data column This approach allowsTAO's masterslave design ensures that all reads canus to store objects of different types within the same tabe satisfied within a single region, at the expense of po-ble, Objects that benefit from separate data managementtentially returning stale data to clients. As long as a userpolices are stored in separate custom tablesconsistently queries the same follower tier. the user willAssociations are stored similarly to objects, but to sup-typically have a consistent view of TAO state. We discuss port range queries, their tables have an additional indexexceptions to this in the next sectionbased on idl, atype, and time. To avoid potentially expensive SELECT COUNT queries, association counts5 Implementationare stored in a separate tablePrevious sections describe how TAO servers are aggre- 5.3 Cache Sharding and Hot Spotsgated to handle large volumes of data and query ratesThis section details important optimizations for perforShards are mapped onto cache servers within a tier usingmance and storage efficiencyconsistent hashing [15]. This simplifies tier expansionsand request routing. However, this semi-random assign5.1 Caching Serversment of shards to cache servers can lead to load imbalance: some followers will shoulder a larger portion ofTAO's caching layer serves as an intermediary betweenthe request load than others. TAO rebalances load amongclients and the databases. It aggressively caches objectsfollowers with shard cloning, in which reads to a shardand associations to provide good read performanceare served by multiple followers in a tier. ConsistencyTAOS memory management is based on Facebooksmanagement messages for a cloned shard are sent to allcustomized memcached, as described by Nishtala et followers hosting that shardaL. [21]. TAO has a slab allocator that manages slabs ofn our workloads, it is not uncommon for a popularequal size iteMs, a thread-Safe hash table, LRU evictionobject to be queried orders of magnitude more often thanamong items of equal size, and a dynamic slab rebalancerother objects. Cloning can distribute this load acrossthat keeps the LRU eviction ages similar across all typesmany followers, but the high hit rate for these objectsof slabs. a slab item can hold one node or one edge listmakes it worthwhile to place them in a small client-sideTo provide better isolation, TAO partitions the availcache. When a follower responds to a query for a hotable Ram into arenas, selecting the arena by the objectitem, it includes the object or associations access rateor association type. This allows us to extend the cache If the access rate exceeds a certain threshold. the tAolifetime of important types, or to prevent poor cache cit- client caches the data and version. By including the verizen from evicting the data of better-behaved types. soSion nuInber in subsequent queries. the follower can omitfar we have only manually configured arenas to addressthe data in replies if the data has not changed since thespecific problems, but it should be possible to automati-previous version. The access rate can also be used tocally size arenas to improve TAO's overall hit ratethrottle client requests for very hot objectsFor small fixed-size items, such as association countsthe memory overhead of the pointers for bucket items in 5.4 High-Degree Objectsthe main hash table becomes significant. We store thesitems separately, using direct-mapped 8-way associative Many objects have more than 6,000 associations with thecaches that require no pointers. LRU order within each same atype emanating from them, So TAO does not cache54 2013 USENIX Annual Technical Conference (UseNIX ATC 13)USENIX ASSOCiationthe complete association list. It is also common that as-The changeset cannot always be safely applied to thesoc-get queries are performed that have an empty result followers cache contents, because the follower's cache(no edge exists between the specified idl and id 2). Un- may be stale if the refill or invalidate from a second folfortunately, for high-degree objects these queries will al- lower's update has not yet been delivered. We resolvways go to the database, because the queried id2 could this race condition in most cases with a version numberbe in the uncached tail of the association listthat is present in the persistent store and the cache. TheWe have addressed this inefficiency in the cache im- version number is incremented during each update, Soplementation by modifying client code that is observed the follower can safely invalidate its local copy of theto issue problematic queries. One solution to this prob- data if the changeset indicates that its pre-update valuelem is to use assoc- count to choose the query direction, was stale. Version numbers are not exposed to the taosince checking for the inverse edge is equivalent. In clients. In slave regions, this scheme is vulnerable tosome cases where both ends of an edges are high-degree a rare race condition between cache eviction and stornodes, we can also leverage application-domain knowl- age server update propagation. The slave storage serveredge to improve cacheability. Many associations set the may hold an older version of a piece of data than whattime field to their creation time, and many objects in- is cached by the caching server, so if the post-changesetclude their creation time as a field. Since an edge to a entry is evicted from cache and then reloaded from thenode can only be created after the node has been created, database, a client may observe a value go back in timewe can limit the id2 search to associations whose time in a single follower tier. Such a situation can only ocis than the object's creation time. So long as an edge cur if it takes longer for the slave regions storage serverolder than the object is present in cache then this query to receive an update than it does for a cached item to becan be answered directly by a tAo followerevicted from cache, which is rare in practiceAlthough tao does not provide strong consistency for6 Consistency and Fault Toleranceits clients, because it writes to My sOL synchronouslythe master database is a consistent source of truth ThisTwo of the most important requirements for TAO are allows us to provide stronger consistency for the smallavailability and performance. When failures occur we subset of requests that need it. TAO reads may be markedwould like to continue to render facebook. even if theok, even if the critical, in which case they will be proxied to the masterdata is stale. In this section, we describe the consistency region. We could use critical reads during an authentica-model of TAo under normal operation, and how TAo tion process, for example, so that replication lag doesn'tsacrifices consistency under failure modesallow use of stale credentials6.1 Consistency6.2 Failure Detection and handlingUnder normal operation, objects and associations in TAOTAO scales to thousands of machines over multiple geare eventually consistent [33, 35]; after a write, TAo ographical locations, so transient and permanent failguarantees the eventual delivery of an invalidation or re- ures are commonplace. Therefore, it is important thatfill to all tiers. Given a sufficient period of time during tao detect potential failures and route around themwhich external inputs have quiesced, all copies of data TAO servers employ aggressive network timeouts so asin TAO will be consistent and reflect all successful write not to continue waiting on responses that may never ar-operations to all objects and associations. Replication rive. Each TAO server maintains per-destination timelag is usually less than one secondouts. marking hosts as down if there are several consecIn normal operation(at most one failure encountered utive timeouts, and remembering downed hosts so thatby a request TAO provides read-after-write consistency subsequent requests can be proactively aborted. Thiswithin a single tier. TAO synchronously updates the simple failure detector works well, although it does notcache with locally written values by having the maalways preserve full capacity in a brown-out scenarioleader return a changeset when the write is successful. such as bursty packet drops that limit TCP throughputThis changeset is propagated through the slave leader (if Upon detection of a failed server, TAO routes around theany) to the follower tier that originated the write query. failures in a best effort fashion in order to preserve availIf an inverse type is configured for an association, thability and performance at the cost of consistency. Wewrites to associations of that type may affect both the actively probe failed machines to discover when(if)theyid1's and the id2's shard. In these cases, the changeset recover.returned by the master leader contains both updates, andDatabase failures Databases are marked down in athe slave leader(if any and the follower that forwarded global configuration if they crash, if they are taken of-the write must each send the changeset to the id2's shard fline for maintenance, or if they are replicating from ain their respective tiers before returning to the callermaster database and they get too far behind. When aUSENIX Association2013 USENIX Annual Technical Conference(USENiX ATC 13)55master database is down, one of its slaves is automati-read requests 99.8 o write requests0.2cally promoted to be the new master.ssoC-815.7 %0 assoc_add52.5assoc_range40.9%assoc-del8.3When a regions slave database is down cache missesassoc_time-range 2.8 assoc-change_type 0.9are redirected to the TAO leaders in the region hosting theassoc count11.7% obj-add16.5%database master. Since cache consistency messages areobi_get28.9 % obj_update20.7%enbedded in the database's replication stream, howeverobj_delete2.0%they can't be delivered by the primary mechanism During the time that a slave database is down an additional Figure 3: Relative frequencies for client requests to TAobinlog tailer is run on the master database, and the refrom all Facebook products. Reads account for almostfills and invalidates are delivered inter-regionally. When all of the calls to the APIthe slave database comes back up, invalidation and refillmessages from the outage will be delivered again7 Production WorkloadLeader failures: When a leader cache server failsfollowers automatically route read and write requestsFacebook has a single instance of TAO in productionround it. Followers reroute read misses directly toMulti-tenancy in a system such as TAO allows us tothe database, Writes to a failed leader. in contrastamortize operational costs and share excess capacityare rerouted to a random member of the leader's tieramong clients. It is also an important enabler for rapidThis replacement leader performs the write and associproduct innovation, because new applications can link toated actions, such as modifying the inverse association existing data and there is no need to move data or proand sending invalidations to followers. The replacement vision servers as an application grows from one user toleader also enqueues an asynchronous invalidation to the hundreds of millions. Multi-tenancy is especially im-original leader that will restore its consistency. Theseportant for objects, because it allows the entire 64-bit idasvnchronous invalidates are recorded both on the coorspace to be handled uniformly without an extra step todinating node and inserted into the replication streamresolve the otypewhere they are spooled until the leader becomes availThe TAO system contains many follower tiers spreadable. If the failing leader is partially available then folacross several geographic regions. Each region has onelowers may see a stale value until the leaders consiscomplete set of databases, one leader cache tier, and attency is restoredleast two follower tiers. Our TAO deployment continuously processes a billion reads and millions of writesRefill and invalidation failures: Leaders send refillsand invalidations asynchronously. If a follower is unper second. We are not aware of another geographicallydistributed graph data store at this scalereachable, the leader queues the message to disk to bTo characterize the workload that is seen by TAo, wedelivered at a later time. Note that a follower may becaptured a random sample of 6.5 million requests over aleft with stale data if these messages are lost due to per-40 day period. In this section, we describe the results ofmanent leader failure. This problem is solved by a bulkan analysis of that sampleinvalidation operation that invalidates all objects and asAt a high level, our workload shows the followinssociations from a shard _id. After a failed leader box ischaracteristicsreplaced, all of the shards that map to it must be invali-dated in the followers, to restore consistency.reads are much more frequent than writesFollower failures; In the event that a tao followermost edge queries have empty results; andfails, followers in other tiers share the responsibility ofquery frequency, node connectivity, and data sizeserving the failed hosts shards. We configure each TAOhave distributions with long tailsclient with a primary and backup follower tier. In nor- Figure 3 breaks down the load on TAO. Reads domimal operations requests are sent only to the primary. If nate, with only 0.2 %o of requests involving a write. Thethe server that hosts the shard for a particular request has majority of association reads resulted in empty associabeen marked down due to timeouts, then the request is tion lists. Calls to assoc-get found an association onlysent instead to that shards server in the backup tier. Be19.6%0 of the time, 31. 0% of the calls to assoc_range incause failover requests still go to a server that hosts the our trace had a non-empty result, and only 1.9%o of thecorresponding shard, they are fully cacheable and do not calls to assoc-time-range returned any edgesrequire extra consistency work. Read and write requestsFigure 4 shows the distribution of the return valuesfrom the client are failed over in the same way. Note that from assoc count. 45% of calls return zero. among thefailing over between different tiers may cause read-after- non-zero values, although small values are the most comwrite consistency to be violated if the read reaches the mon, 1% of the return values were>>500, 000failover target before the writes refill or invalidateFigure 5 shows the distribution of the number of asso56 2013 USENIX Annual Technical Conference(USENIX ATC 13)USENIX ASSOCiationassociationsobjects L310%310assoc count return value01248242526272820210211212213214215216217aFigure 4: assoc_count frequency in our production envi- Figure 6: The size of the data stored in associations andronment. I o of returned counts were >512Kobjects that were returned by the TAO API. Associationstypically store much less data than objects. The aver100%age association data size was 97. 8 bytes for the 60.5%0assoc_rangeassoc_time_rangeof returned associations that had some data. The averageobject data size was 673 bytes6000C0avg aggregate hit rateg50004000C0300c0of returned assocs2000c01000C0Figure 5: The number of edges returned by assoc_rangeand assoc-time_range queries. 64%0 of the non-empty「T+TT858687888990919293949596979899100results had I edge, 13 of which had a limit of 1follower hit rate(%)Figure 7: Throughput of an individual follower in ourciations returned for range and time-range queries, andproduction environment. Cache misses and writes arethe subset that hit the limit for returned associationsInore expensive than cache hits, so the peak query rateMost range and time range queries had large clientrises with hit rate. Writes are included in this graph assupplied limits. 12% of the queries hadonit95% of the remaining queries had limit 1000. Lessthan 1%c of the return values for queries with a limit >Iactually reached the limitsmaller. However, large values are frequent enough thatAlthough queries for non-existent associations were the system must deal with them efficientlycommon. this is not the case for objects a valid id isonly produced during object creation, so obj-get can only 8 Performancein empty result if the object has been removedor if the objects creation has not yet been replicated to Running a single TAo deployment for all of Facebookthe current region Neither of these cases occurred in allows us to benefit from economies of scale and makesour trace: every object read was successful. This doesnt it easy for new products to integrate with existing pormean that objects were never deleted -it just means that tions of the social graph. In this section, we report on thethere was never an attempt to read a deleted objectperformance of TAO under a real work loadFigure 6 shows the distribution of the data sizes forAvailability: Over a period of 90 days, the fractionTAO query results. 39.5% of the associations queried of failed tAo queries as measured from the web serverby clients contained no data. Our implementation allowswas 4.9x 106 Care must be taken when interpretingobjects to store IMB of data and associations to store this number, since the failure of one tao query might64K of data(although a custom table must be configured prevent the client from issuing another query with a dyfor associations that store more than 255 bytes of data). namic data dependence on the first. TAOs failures mayThe actual size of most objects and associations is much also be correlated with those of other dependent systems.USENIX Association2013 USENIX Annual Technical Conference(USENIX ATC 13 )57hit lat(msec) miss lat (msecread performance and throughput over consistency. Weoperationag99%50%avg99%observed that TAOs slave storage servers lag their masassoc count1.12.528.95.026.218681.02.425.95.814.5143.1ter by less than 1 second during 85% of the tracing win-assoc_ getassoc-range1.12.324.85411.293.6dow, by less than 3 seconds 99 %o of the time, and by lessassoc_time_range1.33.232.85.811.947.2than 10 seconds 99. 8% of the timebj-get1.02427.08.275.3186.4Failover: Follower caches directly contact thedatabase when a leader is unavailable; this failover pathFigure 8: Client-observed TAO latency in milliseconds was used on 0.15% of follower cache misses over ourfor read requests, including client API overheads and net- sample. Failover for write requests involves delegatingwork traversal, separated by cache hits and cache misses. those requests to a random leader, which occurred for0.045 of association and object writes. Slave databaseswere promoted to be the master 0.25 of the time due toremote region40%naster regionplanned maintenance or unplanned downtime30%9 Related Work820%TAO is a geographically distributed eventually consistent graph store optimized for reads. Previous distributedsystems works have explored relaxed consistency, graphdatabases. and read-optimized storage. To our knowl0102030405060708090100110120130edge, TAO is the first to combine all of these techniqueswrite latency(msec)in a single system at large scaleEventual consistency: Terry et al. [33 describeFigure 9: Write latency from clients in the same region eventual consistency, the relaxed consistency modelas database masters, and from a region 58 msec awaywhich is used by tao. werner describes read-after-writeconsistency as a property of some variants of eventualFollower capacity: The peak throughput of a followerGeographically distributed data stores: The Codadepends on its hit rate. Figure 7 shows the highest 15- file system uses data replication to improve performanceminute average throughput we observe in production for and availability in the face of slow or unreliable netour current hardware configuration, which has 144GBworks [29. Unlike Coda, TAO does not allow writesof RAM, 2 Intel Xeon 8 core E5-2660 CPUs running at in portions of the system that are disconnected2.2Ghz with Hyperthreading, and 10 Gigabit ethernetMegastore is a storage system that uses Paxos acrossHit rates and latency: Asof the data collectigeographically distributed data centers to provide strongprocess that was described in$7, we measured latencies consistency guarantees and high availability [5]. Spanin the client application; these measurements include all ner, the next generation globally distributed database denetwork latencies and the time taken to traverse the phpveloped at Google after Megastore, introduces the conTAO client stack. Requests were sampled at the same cept of a time aPi that exposes time uncertainty andrate in all regions. TAOS overall hit rate for reads was leverages that to improve commit throughput and provide96. 490. Figure 8 shows the client-observed latencies for snapshot isolation for reads [8 TAO addresses a veryreads. obj-get has higher miss latencies than the other different use case, providing no consistency guaranteesreads because objects typically have more data(see Fibut handling many orders of magnitude more requestsure 6). assoc_count requests to the persistent store have a Distributed hash tables and key-value systems: Unlarger idl working set than other association queries, and structured key-value systems are an attractive approachhence make poorer use of the database's buffer cacheto scaling distributed storage because data can be easilyTAOs writes are performed synchronously to the maspartitioned and little communication is needed betweenter database, so writes from other regions include an partitions. Amazon,s Dynamo [101 demonstrates hointer-region round trip. Figure 9 compares the latency they can be used in building flexible and robust comin two data centers that are 58. I milliseconds away from mercial systems. Drawing inspiration from dynamoach other (average round trip). Average write latency LinkedIn's Voldemort [4] also implements a distributedn the same region as the master was 12.1 msec; in the key-value store but for a social network. TAO acceptsremote region it was 74.4=58.1 +16.3 mseclower write availability than dynamo in exchange forRplication lagTAOs asynchronous replplication ofavoiding the programming complexities that arise fromwrites between regions is a design trade-off that favors multi-master conflict resolution. The simplicity of key-58 2013 USENIX Annual Technical Conference (USeNiX ATC 13)USENIX ASSOCiation
下载地址
用户评论