GFS
===

What is the problem?
  Need to store/access massive data sets efficiently
  Want to use really cheap (i.e., unreliable) hardware
  Spread data over many unreliable machines => failure is the norm

What is the architecture?
  One master server (with replicated state)
  Many chunk servers (hundreds, maybe thousands)
    Broken into different racks; less bandwidth between racks
    Store 64MB chunks of files, identified by globally unique ID
  Potentially many clients accessing same or different files on cluster

What is the client interface to GFS? Is it a file system?
  Not really a FS--just a library applications can use to access storage
  Sounds like the interface is mostly:
    Read, Write - as usual
    Append - appends *at least* once, with gaps and/or inconsistencies
  Explain Table 1 (p. 4):
    Consistent - read same data from all replicas of a chunk
    Defined - state equivalent to serial application of client operations
  Also, presumably: Open, Delete, Snapshot, some readdir equivalent
    (FindMatchingFiles)

How to deal with the weird semantics? (sec 2.7.2)
  Applications should append records with checksums
    Use checksums to detect gaps or garbage
    Use unique record IDs to filter out duplicate records, if necessary
  (A record-framing sketch appears at the end of these notes)

What does the master actually store?
  File names, stored in prefix-compressed form (details are vague)
  Ownership, permissions, etc.
  Mappings:
    File -> Chunks (*)
    Chunk -> Version (*), Replicas, Reference count (*), Leases
    (*) = stored stably in a log
  Average: 100 bytes of metadata per file
  (A metadata sketch appears at the end of these notes)

What do chunk servers store?
  Chunks (one Linux file per chunk)
  Chunk metadata: version number, checksum

What happens when a client reads a file (2.4)?
  Client asks the master for (file name, chunk index)
  Master returns chunk ID, version, locations of replicas
    Master may return info for subsequent chunks; client can cache it
  Client reads the chunk from the nearest chunk server
    Nearest is easy to determine because of Google's simple network topology
  (A read-path sketch appears at the end of these notes)

What happens when a client writes a file?
  One of the chunk servers is the *primary* for the chunk
    This is determined by having a lease from the master server
    Master increments the version number every time it grants a lease
    Leases are renewed through periodic heartbeats
  Client asks the master who the primary and secondary replicas are for the chunk
  Client sends data to all replicas, in chain fashion (Figure 3)
    Why? To make best use of topology and max out each machine's upstream bandwidth
    Trade-off is higher latency...
    Could you have lower latency? Maybe the client stripes chunks across
      replicas and has them reconcile?
  Does the master ever need to revoke a lease?
    Yes, for example when renaming or snapshotting a file

What happens when a client deletes a file?

What happens when a chunk server fails?
  Prioritize chunks to re-replicate: # of missing replicas, whether the file
    is still live (not deleted), and load-balancing across chunk servers

What happens when the master reboots?
  Reconstruct state: ask chunk servers what chunks they have
  What if a chunk server has a lower version number than the master?
    Consider it stale--that server no longer has a valid replica
  What if a chunk server has a higher version number than the master?
    Master updates its own version number; the master must have crashed
    before logging the new version

Explain shapes in Figure 3 (p. 12)
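
Appendix: illustrative sketches (Go)
  These sketches are not from the paper; all package, type, function, and
  field names are hypothetical. Each one only illustrates the idea referenced
  above, under the stated assumptions.

Master metadata (see "What does the master actually store?"):
  A minimal sketch of the kind of in-memory tables the master might keep,
  assuming only the fields listed in the notes.

    // Hypothetical in-memory tables for the master's metadata; names and
    // types are illustrative, not from the paper.
    package gfsmaster

    import "time"

    type ChunkID uint64 // globally unique chunk handle

    type ChunkInfo struct {
        Version  uint64    // bumped each time a lease is granted (logged stably)
        Replicas []string  // chunk server addresses (soft state, rebuilt at reboot)
        RefCount int       // >1 after a snapshot => copy-on-write (logged stably)
        Primary  string    // current lease holder, if any (soft state)
        LeaseExp time.Time // leases are renewed via periodic heartbeats
    }

    type FileInfo struct {
        Owner  string
        Mode   uint32
        Chunks []ChunkID // file -> ordered chunk handles (logged stably)
    }

    // The whole namespace fits in memory: the notes estimate ~100 bytes per file.
    type Master struct {
        Files  map[string]*FileInfo // file names are prefix-compressed in the real system
        Chunks map[ChunkID]*ChunkInfo
    }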
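
Read path (see "What happens when a client reads a file?"):
  A sketch of the client-side steps, assuming hypothetical Master and
  ChunkServer RPC interfaces; only the 64 MB chunk size is from the paper.

    // Hypothetical client-side read path; the Master and ChunkServer
    // interfaces stand in for RPCs the paper describes only in prose.
    package gfsclient

    const chunkSize = 64 << 20 // 64 MB chunks (from the paper)

    // What the master returns for one (file name, chunk index) pair.
    type ChunkLoc struct {
        ChunkID  uint64
        Version  uint64
        Replicas []string // chunk server addresses, assumed sorted nearest-first
    }

    type Master interface {
        Lookup(path string, chunkIndex int64) ChunkLoc
    }
    type ChunkServer interface {
        ReadChunk(chunkID, version uint64, off, n int64) []byte
    }

    type Client struct {
        master  Master
        servers map[string]ChunkServer
        cache   map[string]map[int64]ChunkLoc // file -> chunk index -> location
    }

    // Read maps a byte offset to a chunk index, asks the master where that
    // chunk lives (caching the answer), and reads from the nearest replica.
    func (c *Client) Read(path string, offset, n int64) []byte {
        idx := offset / chunkSize
        if c.cache[path] == nil {
            c.cache[path] = map[int64]ChunkLoc{}
        }
        loc, ok := c.cache[path][idx]
        if !ok {
            loc = c.master.Lookup(path, idx) // one round trip to the master
            c.cache[path][idx] = loc         // later reads of this chunk skip the master
        }
        nearest := c.servers[loc.Replicas[0]]
        return nearest.ReadChunk(loc.ChunkID, loc.Version, offset%chunkSize, n)
    }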
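
Record framing for append (see sec 2.7.2):
  A sketch of the application-level defense against at-least-once append:
  each record carries a length, a unique ID, and a checksum, so readers can
  skip garbage and drop duplicates. The framing format (len | id | crc32 |
  payload) is made up for illustration; GFS itself specifies no such format.

    // Writers frame each appended record; readers validate and de-duplicate.
    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "hash/crc32"
    )

    // encodeRecord frames a payload as: len(4) | id(8) | crc32(4) | payload.
    func encodeRecord(id uint64, payload []byte) []byte {
        buf := new(bytes.Buffer)
        binary.Write(buf, binary.LittleEndian, uint32(len(payload)))
        binary.Write(buf, binary.LittleEndian, id)
        binary.Write(buf, binary.LittleEndian, crc32.ChecksumIEEE(payload))
        buf.Write(payload)
        return buf.Bytes()
    }

    // decodeRecords scans a chunk's contents, skipping garbage (bad checksums,
    // padding) and filtering records whose ID has already been seen.
    func decodeRecords(data []byte) [][]byte {
        seen := map[uint64]bool{}
        var out [][]byte
        for len(data) >= 16 {
            n := binary.LittleEndian.Uint32(data[0:4])
            id := binary.LittleEndian.Uint64(data[4:12])
            sum := binary.LittleEndian.Uint32(data[12:16])
            if int(n) > len(data)-16 {
                break // truncated record or padding: stop scanning
            }
            payload := data[16 : 16+n]
            data = data[16+n:]
            if crc32.ChecksumIEEE(payload) != sum || seen[id] {
                continue // garbage or duplicate append: drop it
            }
            seen[id] = true
            out = append(out, payload)
        }
        return out
    }

    func main() {
        // Simulate a chunk where record 2 was appended twice (at-least-once).
        chunk := append(encodeRecord(1, []byte("hello")), encodeRecord(2, []byte("world"))...)
        chunk = append(chunk, encodeRecord(2, []byte("world"))...) // duplicate
        for _, r := range decodeRecords(chunk) {
            fmt.Printf("%s\n", r) // prints "hello", "world" exactly once each
        }
    }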