GFS
===

What is the problem?
  Need to store/access massive data sets efficiently
  Want to use really cheap (i.e., unreliable) hardware
  Spread data over many unreliable machines => failure is the norm

What is the architecture?
  One master server (with replicated state)
  Many chunk servers (hundreds, maybe thousands)
    Broken into different racks; less bandwidth between racks
    Store 64MB chunks of files, identified by globally unique ID
  Potentially many clients accessing same or different files on cluster

What is the client interface to GFS? Is it a file system?
  Not really a FS--just a library applications can use to access storage
  Sounds like the interface is mostly:
    Read, Write - as usual
    Append - appends *at least* once, with gaps and/or inconsistencies
  Explain Table 1 (p. 4):
    Consistent - read same data from all replicas of a chunk
    Defined - state equivalent to serial application of client operations
  Also, presumably: Open, Delete, Snapshot, some readdir equivalent
    (FindMatchingFiles)

How to deal with the weird semantics? (sec 2.7.2)
  Applications should append records with checksums
    Use checksums to detect gaps or garbage
    Use unique record IDs to filter out duplicate records, if necessary
  (A record-framing sketch appears at the end of these notes)

What does the master actually store?
  File names, stored in prefix-compressed form (details are vague)
  Ownership, permissions, etc.
  Mappings:
    File -> Chunks (*)
    Chunk -> Version (*), Replicas, Reference count (*), Leases
    (*) = stored stably in a log
  Average: 100 bytes of metadata per file
  (A metadata sketch appears at the end of these notes)

What do chunk servers store?
  Chunks (one Linux file per chunk)
  Chunk metadata: version number, checksum

What happens when a client reads a file (2.4)?
  Client asks the master for (file name, chunk index)
  Master returns chunk ID, version, locations of replicas
    Master may return info for subsequent chunks; client can cache it
  Client reads the chunk from the nearest chunk server
    Nearest is easy to determine because of Google's simple network topology
  (A read-path sketch appears at the end of these notes)

What happens when a client writes a file?
  One of the chunk servers is the *primary* for the chunk
    This is determined by having a lease from the master server
    Master increments the version number every time it grants a lease
    Leases are renewed through periodic heartbeats
  Client asks the master who the primary and secondary replicas are for the chunk
  Client sends data to all replicas, in chain fashion (Figure 3)
    Why? To make best use of topology and max out each machine's upstream bandwidth
    Trade-off is higher latency...
    Could you have lower latency? Maybe the client stripes chunks across
      replicas and has them reconcile?
  Does the master ever need to revoke a lease?
    Yes, for example when renaming or snapshotting a file

What happens when a client deletes a file?

What happens when a chunk server fails?
  Prioritize chunks to re-replicate: # of missing replicas, whether the file
    is still live (not deleted), and load-balancing across chunk servers

What happens when the master reboots?
  Reconstruct state: ask chunk servers what chunks they have
  What if a chunk server has a lower version number than the master?
    Consider it stale--that server no longer has a valid replica
  What if a chunk server has a higher version number than the master?
    Master updates its own version number; the master must have crashed
    before logging the new version

Explain shapes in Figure 3 (p. 12)
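
Appendix: illustrative sketches (Go)
  These sketches are not from the paper; all package, type, function, and
  field names are hypothetical. Each one only illustrates the idea referenced
  above, under the stated assumptions.

Master metadata (see "What does the master actually store?"):
  A minimal sketch of the kind of in-memory tables the master might keep,
  assuming only the fields listed in the notes.

    // Hypothetical in-memory tables for the master's metadata; names and
    // types are illustrative, not from the paper.
    package gfsmaster

    import "time"

    type ChunkID uint64 // globally unique chunk handle

    type ChunkInfo struct {
        Version  uint64    // bumped each time a lease is granted (logged stably)
        Replicas []string  // chunk server addresses (soft state, rebuilt at reboot)
        RefCount int       // >1 after a snapshot => copy-on-write (logged stably)
        Primary  string    // current lease holder, if any (soft state)
        LeaseExp time.Time // leases are renewed via periodic heartbeats
    }

    type FileInfo struct {
        Owner  string
        Mode   uint32
        Chunks []ChunkID // file -> ordered chunk handles (logged stably)
    }

    // The whole namespace fits in memory: the notes estimate ~100 bytes per file.
    type Master struct {
        Files  map[string]*FileInfo // file names are prefix-compressed in the real system
        Chunks map[ChunkID]*ChunkInfo
    }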
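
Read path (see "What happens when a client reads a file?"):
  A sketch of the client-side steps, assuming hypothetical Master and
  ChunkServer RPC interfaces; only the 64 MB chunk size is from the paper.

    // Hypothetical client-side read path; the Master and ChunkServer
    // interfaces stand in for RPCs the paper describes only in prose.
    package gfsclient

    const chunkSize = 64 << 20 // 64 MB chunks (from the paper)

    // What the master returns for one (file name, chunk index) pair.
    type ChunkLoc struct {
        ChunkID  uint64
        Version  uint64
        Replicas []string // chunk server addresses, assumed sorted nearest-first
    }

    type Master interface {
        Lookup(path string, chunkIndex int64) ChunkLoc
    }
    type ChunkServer interface {
        ReadChunk(chunkID, version uint64, off, n int64) []byte
    }

    type Client struct {
        master  Master
        servers map[string]ChunkServer
        cache   map[string]map[int64]ChunkLoc // file -> chunk index -> location
    }

    // Read maps a byte offset to a chunk index, asks the master where that
    // chunk lives (caching the answer), and reads from the nearest replica.
    func (c *Client) Read(path string, offset, n int64) []byte {
        idx := offset / chunkSize
        if c.cache[path] == nil {
            c.cache[path] = map[int64]ChunkLoc{}
        }
        loc, ok := c.cache[path][idx]
        if !ok {
            loc = c.master.Lookup(path, idx) // one round trip to the master
            c.cache[path][idx] = loc         // later reads of this chunk skip the master
        }
        nearest := c.servers[loc.Replicas[0]]
        return nearest.ReadChunk(loc.ChunkID, loc.Version, offset%chunkSize, n)
    }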
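
Record framing for append (see sec 2.7.2):
  A sketch of the application-level defense against at-least-once append:
  each record carries a length, a unique ID, and a checksum, so readers can
  skip garbage and drop duplicates. The framing format (len | id | crc32 |
  payload) is made up for illustration; GFS itself specifies no such format.

    // Writers frame each appended record; readers validate and de-duplicate.
    package main

    import (
        "bytes"
        "encoding/binary"
        "fmt"
        "hash/crc32"
    )

    // encodeRecord frames a payload as: len(4) | id(8) | crc32(4) | payload.
    func encodeRecord(id uint64, payload []byte) []byte {
        buf := new(bytes.Buffer)
        binary.Write(buf, binary.LittleEndian, uint32(len(payload)))
        binary.Write(buf, binary.LittleEndian, id)
        binary.Write(buf, binary.LittleEndian, crc32.ChecksumIEEE(payload))
        buf.Write(payload)
        return buf.Bytes()
    }

    // decodeRecords scans a chunk's contents, skipping garbage (bad checksums,
    // padding) and filtering records whose ID has already been seen.
    func decodeRecords(data []byte) [][]byte {
        seen := map[uint64]bool{}
        var out [][]byte
        for len(data) >= 16 {
            n := binary.LittleEndian.Uint32(data[0:4])
            id := binary.LittleEndian.Uint64(data[4:12])
            sum := binary.LittleEndian.Uint32(data[12:16])
            if int(n) > len(data)-16 {
                break // truncated record or padding: stop scanning
            }
            payload := data[16 : 16+n]
            data = data[16+n:]
            if crc32.ChecksumIEEE(payload) != sum || seen[id] {
                continue // garbage or duplicate append: drop it
            }
            seen[id] = true
            out = append(out, payload)
        }
        return out
    }

    func main() {
        // Simulate a chunk where record 2 was appended twice (at-least-once).
        chunk := append(encodeRecord(1, []byte("hello")), encodeRecord(2, []byte("world"))...)
        chunk = append(chunk, encodeRecord(2, []byte("world"))...) // duplicate
        for _, r := range decodeRecords(chunk) {
            fmt.Printf("%s\n", r) // prints "hello", "world" exactly once each
        }
    }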