GFS
===

What problem does this paper address?
  Need to store/access massive data sets efficiently
  Want to use really cheap (i.e., unreliable) hardware
  Spread data over many unreliable machines => failure is the norm
What kind of data was GFS used for?
  Map/reduce input and output (not intermediary files)
  Copy of the web, web index, etc.
  Producer/consumer queues
What is the architecture?
  One master server (with replicated state)
  Many chunk servers (hundreds, maybe thousands)
    Broken into different racks; less bandwidth between racks
    Store 64MB chunks of files, identified by globally unique IDs
  Potentially many clients accessing same or different files on the cluster
What is the client interface to GFS?  Is it a file system?
  Not really a FS--just a library applications can use to access storage
  Sounds like the interface is mostly:
    Read, Write - as usual
    Append - appends *at least* once, with gaps and/or inconsistencies
  Also, presumably: Open, Delete, Snapshot, some readdir equivalent
    (that calls FindMatchingFiles from Table 6)
What are the possible consistency guarantees for a byte range? (Table 1, p. 4)
  Defined - state equivalent to serial application of client operations
  Consistent but undefined - will read the same data from all replicas of a chunk
  Inconsistent - different chunk replicas have different data
How to deal with these weird semantics? (sec 2.7.2)
  Applications should append records with checksums
    Use checksums to detect gaps or garbage
  Use unique record IDs to filter out duplicate records, if necessary
What does the master actually store?
  File names, stored in prefix-compressed form (details are vague)
  Ownership, permissions, etc.
  Mappings:
    File -> Metadata (*), Chunks (*), Locks
    Chunk -> Version (*), Reference count (*), Replicas, Leases
    (*) = stored stably in a log
  Storage: under 64 bytes per file, 64 bytes per 64MB chunk (2.6.1)
  Chunk replicas not stored persistently--why not?
    The chunk server is the ultimate authority over what it is storing
    Master restarts?
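Aside: the record-hardening advice from sec 2.7.2 above can be sketched in a few lines. This is a toy illustration, not GFS code; the framing format, the helper names, and the use of MD5 are all my own assumptions:

```python
import hashlib
import struct

HEADER = ">IQ16s"  # payload length, record ID, MD5 checksum
HEADER_LEN = struct.calcsize(HEADER)  # 28 bytes

def encode_record(record_id: int, payload: bytes) -> bytes:
    """Frame a record as: length, unique record ID, checksum, payload."""
    checksum = hashlib.md5(payload).digest()
    return struct.pack(HEADER, len(payload), record_id, checksum) + payload

def decode_records(data: bytes):
    """Yield valid payloads, skipping garbage/padding and duplicate appends."""
    seen_ids = set()
    offset = 0
    while offset + HEADER_LEN <= len(data):
        length, record_id, checksum = struct.unpack_from(HEADER, data, offset)
        payload = data[offset + HEADER_LEN : offset + HEADER_LEN + length]
        if len(payload) == length and hashlib.md5(payload).digest() == checksum:
            if record_id not in seen_ids:  # filter duplicates from client retries
                seen_ids.add(record_id)
                yield payload
            offset += HEADER_LEN + length
        else:
            offset += 1  # garbage: resynchronize byte by byte
```

The checksum catches padding and torn writes; the record ID catches the at-least-once duplication from client retries.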
    Just ask the chunk servers what they store
How is the master fault tolerant?
  Replicates its state to backups
  Shadow masters can act as read-only masters while the real master is unavailable
What do chunk servers store?
  Chunks (one Linux file per chunk)
  Chunk metadata: version number, checksum
What happens when a client reads a file? (2.4)
  Client asks master for (file name, chunk index)
  Master returns: chunk ID, version, locations of replicas
    Master may return info for subsequent chunks, too; client can cache it
  Client reads the chunk from the nearest chunk server
    "Nearest" determined from IP address, thanks to Google's network topology
What happens when a client writes a file? (3.1)
  One of the chunk servers is the *primary* for the chunk
    This is determined by holding a lease from the master server
    Master increments the version number every time it grants a lease
    Leases are renewed through periodic heartbeats
  1. Client asks master who the primary and secondary replicas are
     If there is no primary, master grants a lease to one before returning
  2. Master replies, and client caches the reply
  3. Client pushes data to the primary and backup servers
     Uses a chain configuration (Figure 2)
     At this point the data sits in an LRU buffer cache, not in a file
     Why a chain?  To make best use of topology and max out upstream bandwidth
     What is the latency? (3.2)
       System uses cut-through routing (forward before receiving all data)
       B-byte write, R replicas, T throughput, L per-hop latency: B/T + R*L
       Could you lower latency?
         Maybe stripe chunks across replicas and have them reconcile?
         In an ideal world this would get: B/T + 2L (client -> all -> all)
         But in reality it could cross the top-of-rack switch more often,
           reducing effective throughput T
  4. Client asks the primary to write the already-pushed data
     Primary assigns a consecutive serial number, applies the write locally
  5. Primary broadcasts op and serial number to the secondaries
  6. Secondaries ack to the primary when the write is done
  7. Primary replies to the client.  Possible replies?
     Success everywhere, or
     Error(s) at one or more backups (and success at primary)
       Means the byte range is inconsistent
     What might cause errors?
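Aside: the pipelined-push latency bound B/T + R*L from 3.2 above is easy to calculate. The 1 ms hop latency below is my own illustrative number; the paper only says L is well below 1 ms on its 100 Mbps links:

```python
def push_latency_s(nbytes: int, replicas: int,
                   link_bps: float, hop_latency_s: float) -> float:
    """Ideal time to push B bytes through a chain of R replicas with
    cut-through forwarding: B/T (pipelined transfer) + R*L (per-hop latency)."""
    return nbytes * 8 / link_bps + replicas * hop_latency_s

# 1 MB to 3 replicas over 100 Mbps links with 1 ms per hop:
# 0.080 s of transfer plus 0.003 s of hop latency, roughly 83 ms.
elapsed = push_latency_s(10**6, 3, 100e6, 1e-3)
```

Because forwarding starts before a server has received the whole chunk, the chain adds only R*L, not R*B/T, on top of the base transfer time.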
       One or more backups unreachable
       Data expired from the LRU buffer cache before being written to a file
     What does the client do on error?  Repeats steps 3-7 a few times, then goto 1
     How does a record get appended multiple times?  When the client repeats
How does the GFS master store and manage the namespace? (4.1)
  Big table mapping: full pathname -> metadata, read-write lock
  No data structure actually corresponds to directories
    But prefix compression avoids the storage overhead
    (Don't need N copies of the string "/d1/d2/d3/" for N files in that dir)
  What locks are required to access /d1/d2/d3/leaf?
    Need read-locks on d{1,2,3}, read- or write-lock on leaf depending on operation
    No write lock on the parent directory for file creation!
      To create /home/user/foo, just read-lock /home and /home/user
        (plus a write lock on the new name /home/user/foo itself)
    What about renaming /home/user to /save/user?
      Read-lock /home and /save, but write-lock /home/user and /save/user
      The write locks will prevent a concurrent create of /home/user/foo
      But note (sec 3.1) the master revokes primary leases for renamed files
How do snapshots work? (3.4, 4.1)
  Bump reference counts, use copy-on-write
  Must revoke any lease from the primary of a chunk that is made copy-on-write
  What happens when writing a copy-on-write chunk?
    Client asks to write chunk C, master notices refcount > 1
    Master asks the same chunk servers to copy C into C' with a new chunk ID
    C and C' are on the same servers, so no network bandwidth is required
  What happens when you snapshot a directory?
    Take a write lock on the directory (prevents new files being created underneath)
    Recursively snapshot underlying directories and files
What happens when a client deletes a file?
  It is renamed to a temporary name and garbage collected later
What happens when a chunk server fails?
  Prioritize chunks to re-replicate: # missing replicas, not deleted, load-balance
What happens when the master reboots?
  Reconstructs state: asks chunk servers what chunks they have
  How is startup fast?  Checkpoint + operation log
    Checkpoint is a B-tree that can be directly mmapped--nothing to parse
  What if a chunk server has a lower version number?
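Aside: the per-pathname locking rules of sec 4.1 above fit in a few lines. This is a toy model; the function and variable names are invented:

```python
def locks_for(path: str, write_leaf: bool) -> list:
    """Locks the master acquires to operate on `path`: read-locks on every
    ancestor name, plus a read- or write-lock on the full pathname itself."""
    parts = path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    locks = [(p, "read") for p in ancestors]
    locks.append((path, "write" if write_leaf else "read"))
    return locks

# Creating /home/user/foo: read-locks on /home and /home/user plus a write
# lock on /home/user/foo -- no directory is write-locked, so concurrent
# creates in /home/user can proceed in parallel.
```

A rename of /home/user, by contrast, write-locks the name /home/user itself, which conflicts with the read lock any concurrent create underneath it would need; that is what serializes the two operations.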
    Consider it stale--that server no longer has a replica
  What if a chunk server has a higher version number? (4.5)
    Master updates its own version number; it must have crashed before logging the grant
Explain the shapes in Figure 3 (p. 12)
  (a) Reads: theoretical limit increases with clients until the inter-switch link saturates
      Why does the slope decrease?  Increased chance of two clients hitting the same server
  (b) Writes: theoretical limit is 1/3 that of reads, because data is replicated 3 times
      Real system also hits a full-duplex network stack artifact
      Slope decreases for the same reason as reads
  (c) Appends: everyone appends to the same file--chunk server of the last chunk is the bottleneck
      Real system experiences congestion + variance in client throughput
How does the master place newly allocated chunks? (4.2)
  Tries to even out disk utilization
  Makes sure chunks live in at least two racks, for availability
What happens when a chunk server is down and misses an update?
  Master assigns a version number when it grants a primary lease
  Primary assigns a serial number to each chunk mutation
  So the stale copy will be missing an operation, e.g., have an old version number
  Just discard the stale chunk copy and re-replicate
Why do chunk servers checksum chunks? (2.7.1, 5.2)
  A disk interface problem resulted in corruption.  Good idea anyway.
  Note the incremental checksum technique makes appends very cheap
How does the system recover from missing chunks/chunk servers?
  Master sends periodic heartbeats to chunk servers
  Heartbeats and replies contain lists of chunk IDs and versions
    Can detect if a chunk server doesn't have a chunk
    Can also detect if a chunk server has a chunk it doesn't need
      This is how garbage collection works
  Master re-replicates missing chunks
    Careful not to interfere too much with other workload
    But prioritizes reaching 2x replication if only one replica of a chunk remains
How does replication compare with erasure coding?
What are the potential bottlenecks in GFS?
  Master?
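Aside: the master's version-number logic from 4.5 above fits in one function. A sketch under my own naming; returns are a tuple I invented for illustration:

```python
def on_chunk_report(master_version: int, reported_version: int):
    """Master's reaction to the version a chunk server reports for a chunk.
    Returns (new master version, replica_is_stale)."""
    if reported_version < master_version:
        # Replica missed a mutation while its server was down:
        # stale, schedule it for garbage collection and re-replicate.
        return master_version, True
    if reported_version > master_version:
        # Master must have crashed after granting a lease but before
        # logging the version bump: adopt the higher version as current.
        return reported_version, False
    return master_version, False
```

The asymmetry is the point: a lower version always means the replica is stale, while a higher version always means the master's own record is behind.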
    Not really--maybe if clients constantly create/delete files
    Old version required a linear scan of the namespace; replaced by binary search
    Don't need asymptotic scalability if the master handles 10^6 times less data/work!
  Single file read concurrently by all clients
    E.g., happened with an executable in a batch system => use higher replication
  Weird Linux performance artifacts in the chunk servers
    E.g., mmap + page fault, or the full-duplex network stack artifact
    Probably mostly fixed in modern Linux kernels
Do they convince us GFS is the right general approach for large data?
  It coevolved with its applications, resulting in a quirky interface
  Google has since moved on to Colossus, but fewer details are known about it
  Still, a good study in how "cutting corners" can be a good thing
  Overall design is reasonably simple and the system was obviously quite useful
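Aside on the namespace fix mentioned above: with the namespace kept as a sorted list of full pathnames, "all entries under a directory" becomes two binary searches for a prefix range instead of a linear scan. A sketch with an invented helper name:

```python
import bisect

def entries_under(namespace: list, dirpath: str) -> list:
    """All full pathnames under dirpath, found with two binary searches.
    `namespace` must be sorted lexicographically."""
    prefix = dirpath.rstrip("/") + "/"
    lo = bisect.bisect_left(namespace, prefix)
    # "0" is the character right after "/", so prefix[:-1] + "0" is an
    # upper bound on every pathname that starts with the prefix.
    hi = bisect.bisect_left(namespace, prefix[:-1] + "0")
    return namespace[lo:hi]
```

Note the upper-bound trick relies on "/" sorting immediately before "0" in ASCII, which is why "/d10/..." correctly falls outside the "/d1/" range.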