Distributed Systems Lab 2: Replicated File-Store

In this lab, you will build a replicated file-store, based on the simple file-store you worked on in the previous lab. This lab will also introduce you to event-driven programming, which is a convenient programming style for implementing complex network protocols.

This lab involves significantly more work than the previous lab assignment, so you should try to get started as soon as possible. To make sure that you start at least two days early, you must submit your solutions for exercises 1, 2, and 3 of this lab assignment on Feb 13, and solutions for all exercises, including exercise 4, are due on Feb 15. If your first submission on Feb 13 has some remaining bugs, that's fine; we will not deduct any points for bugs that you fix in your final submission.

Part 1: Preparing your development environment

This lab uses an event-driven library called libasync. If you are using one of the cardinal.stanford.edu machines to do this lab, we have pre-installed the library for you, and you can skip this step. Otherwise, you will need to download and install it on your machine as follows. The first package below, dmalloc, is a debugging memory allocator that the SFS build can optionally use (note the --with-dmalloc configure flag); the second, sfs1, contains libasync itself:

% wget http://www.scs.stanford.edu/07wi-cs244b/dmalloc-5.4.2.tgz
...
% tar zxf dmalloc-5.4.2.tgz
% cd dmalloc-5.4.2
% ./configure && make
...
% su
Password: ****
# make install
...
# exit
% cd ..
% wget http://www.scs.stanford.edu/07wi-cs244b/sfs1.tar.gz
...
% tar zxf sfs1.tar.gz
% cd sfs1
% ./configure --without-sfsuser --without-sfsgroup --with-dmalloc
checking for a BSD-compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
...
config.status: creating config.h
config.status: executing depfiles commands
% make
...
% su
Password: ****
# make install
...
# exit
% 

Download the code for this lab from http://www.scs.stanford.edu/07wi-cs244b/lab2.tar.gz and build it as follows, to make sure you have a functional environment.

% wget http://www.scs.stanford.edu/07wi-cs244b/lab2.tar.gz
...
% tar zxf lab2.tar.gz
% make
PATH=$PATH:/afs/ir.stanford.edu/class/cs244b/sfs/bin rpcc -h -o rep.h rep.x
g++ -I/usr/local/include/sfs -I/afs/ir.stanford.edu/class/cs244b/sfs/include/sfs -g   -c -o clientproxy.o clientproxy.C
PATH=$PATH:/afs/ir.stanford.edu/class/cs244b/sfs/bin rpcc -c -o rep.C rep.x
g++ -I/usr/local/include/sfs -I/afs/ir.stanford.edu/class/cs244b/sfs/include/sfs -g   -c -o rep.o rep.C
g++ -I/usr/local/include/sfs -I/afs/ir.stanford.edu/class/cs244b/sfs/include/sfs -g   -c -o bcast.o bcast.C
g++ clientproxy.o rep.o bcast.o -o clientproxy -L/usr/local/lib/sfs -L/afs/ir.stanford.edu/class/cs244b/sfs/lib/sfs -lsfscrypt -larpc -lasync -lresolv
g++ -I/usr/local/include/sfs -I/afs/ir.stanford.edu/class/cs244b/sfs/include/sfs -g   -c -o cohort.o cohort.C
g++ -I/usr/local/include/sfs -I/afs/ir.stanford.edu/class/cs244b/sfs/include/sfs -g   -c -o execbackend.o execbackend.C
g++ -I/usr/local/include/sfs -I/afs/ir.stanford.edu/class/cs244b/sfs/include/sfs -g   -c -o exitcb.o exitcb.C
g++ cohort.o rep.o execbackend.o bcast.o exitcb.o -o cohort -L/usr/local/lib/sfs -L/afs/ir.stanford.edu/class/cs244b/sfs/lib/sfs -lsfscrypt -larpc -lasync -lresolv
g++ -I/usr/local/include/sfs -I/afs/ir.stanford.edu/class/cs244b/sfs/include/sfs -g   -c -o groupstat.o groupstat.C
g++ groupstat.o rep.o bcast.o -o groupstat -L/usr/local/lib/sfs -L/afs/ir.stanford.edu/class/cs244b/sfs/lib/sfs -lsfscrypt -larpc -lasync -lresolv
g++ -I/usr/local/include/sfs -I/afs/ir.stanford.edu/class/cs244b/sfs/include/sfs -g   -c -o groupleave.o groupleave.C
g++ groupleave.o rep.o bcast.o -o groupleave -L/usr/local/lib/sfs -L/afs/ir.stanford.edu/class/cs244b/sfs/lib/sfs -lsfscrypt -larpc -lasync -lresolv
% ./cohort
fatal: Usage: ./cohort backend-server-path group-udp-port join-cohort:port
% 

As in the previous lab, you may get an error message about a missing libstdc++.so.6 if you are using the cardinal.stanford.edu machines. To fix it, run setenv LD_LIBRARY_PATH /usr/pubsw/lib if you are using csh or tcsh (or if using bash, run export LD_LIBRARY_PATH=/usr/pubsw/lib). You may also want to add this command to your ~/.cshrc or ~/.bash_profile file so that you don't need to run it every time you log in.

Finally, this lab requires a small change to your file-store daemon cfd from the previous lab so that it interfaces properly with this lab's code. The patch to lab1/cfd.cc is distributed in lab2/lab1.patch; apply it in lab1 and rebuild cfd:

% cd lab1
% patch -p0 < ../lab2/lab1.patch
patching file cfd.cc
% make
g++ -g   -c -o cfd.o cfd.cc
g++ -o cfd cf_svc.o cf_xdr.o cfd.o cfd_ops.o -lpthread
% 

Part 2: Understanding the initial code

This lab will make extensive use of an event-driven library called libasync. The reference materials page includes a handout called Using TCP through sockets, which describes this library. Read this handout to get an understanding of how to program using libasync.

In this lab, you will take the file-store daemon cfd from the previous lab and, treating it as a deterministic state-machine, replicate it among many cohorts, which together form a replication group. One of the cohorts will be the primary, responsible for ordering all incoming requests. The other, backup cohorts will all execute the same requests in the same order as determined by the primary. When some cohorts fail, a view change operation allows the remaining cohorts (if a majority of cohorts remains) to choose a new primary and a new set of backups. This protocol is described in detail in the Paxos Made Practical handout available on the reference materials page.

The prototype code we have provided for this lab is capable of forming a replication group consisting of one cohort and executing client requests. The cohort program implements one cohort in a replication group. The first argument is the path to an RPC server that provides the deterministic state-machine to be replicated; you should specify the path to the cfd daemon from lab1. The second argument is a UDP port number on which the cohort should listen for broadcast RPC requests. This port number should be the same for all of the cohorts in your replication group, but should not conflict with anyone else's. (Each cohort also listens on a dynamically-chosen unicast UDP port for requests directed specifically to it, rather than to the entire group.) The third argument specifies the replication group that this cohort should join. To join an existing replication group, the argument should be ip-address:port-number to contact a specific cohort, or 0:group-port-number to send a broadcast RPC to group-port-number asking to join the group. You can also tell the cohort to start an entirely new replication group by specifying a third argument of 0:0. When forming a new replication group, the cohort will try to make sure that no one else is already using the UDP port number you have specified, but this mechanism is not fool-proof.
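
For concreteness, here are the three forms of the third argument. The IP address and port numbers below are only examples:

% ./cohort ../lab1/cfd 5555 0:0                    # form a brand-new replication group
% ./cohort ../lab1/cfd 5555 0:5555                 # join by broadcasting to UDP port 5555
% ./cohort ../lab1/cfd 5555 171.64.15.252:49856    # join through a specific cohort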

Because your file-store client, cfc, from lab1 cannot talk to a cohort directly, a proxy is necessary to take a request from cfc, send it to an appropriate cohort, and send the response back to cfc. This proxy program is called clientproxy, and it takes two arguments. The first specifies the TCP port number on which it should listen for requests from cfc, and the second specifies the broadcast UDP port number for your replication group.

Make sure that you can run the provided cohort and clientproxy and execute commands on your file-store daemon using cfc:

% ./cohort ../lab1/cfd 5555 0:0 &
Got backend server running on port 56731
Cohort running on 171.64.15.252:49856
Formed a new synthetic view.
% ./clientproxy 7777 5555 &
clientproxy listening on port 7777
% ../lab1/cfc localhost 7777 mkfile /hello
Got a VIEWINFO response for view 1:12339880554001362870
Using new VIEWINFO as latest, primary 171.64.15.252:49856
% ../lab1/cfc localhost 7777 write /hello world
% ../lab1/cfc localhost 7777 read /hello
world
% 

The asynchronous RPC library we are using, libarpc (which is closely coupled with libasync), provides an easy way of tracing RPC requests, both on the client side and on the server side. You can observe the RPCs made by clientproxy by re-starting it as follows. (If you are using bash, you may omit env from the command.)

% env ACLNT_TRACE=10 ./clientproxy 7777 5555
clientproxy listening on port 7777

Now if you run ../lab1/cfc localhost 7777 read /hello in another window, clientproxy will dump all of the RPC client requests and responses to stderr:

% env ACLNT_TRACE=10 ./clientproxy 7777 5555
clientproxy listening on port 7777
ACLNT_TRACE: call rep_bcast_prog_1:REP_BCAST_VIEWINFO x=d90f0480
ACLNT_TRACE: reply rep_bcast_prog_1:REP_BCAST_VIEWINFO x=d90f0480
execute_viewinfo REPLY = {
  viewid_t vid = {
    u_int64_t counter = 0x1;
    u_int64_t manager = 0xab400ffc00006fb6;
  };
  net_address_t primary = {
    u_int32_t ipaddr = 0xab400ffc;
    u_int32_t port = 0xc2c0;
  };
};
Got a VIEWINFO response for view 1:12339880554001362870
Using new VIEWINFO as latest, primary 171.64.15.252:49856
ACLNT_TRACE: call rep_prog_1:REP_EXECUTE x=ce05341c
execute_arg ARGS = {
  u_int64_t client = 0xab400ffc00007170;
  u_int64_t rid = 0x0;
  viewid_t vid = {
    u_int64_t counter = 0x1;
    u_int64_t manager = 0xab400ffc00006fb6;
  };
  opaque request<> = [52] {
    24, 5d, 2f, 24, 00, 00, 00, 00,
    00, 00, 00, 02, 00, 06, 71, 4e,
    00, 00, 00, 01, 00, 00, 00, 01,
    00, 00, 00, 00, 00, 00, 00, 00,
    00, 00, 00, 00, 00, 00, 00, 00,
    00, 00, 00, 06, 2f, 68, 65, 6c,
    ...
  };
};
ACLNT_TRACE: reply rep_prog_1:REP_EXECUTE x=ce05341c
execute_res REPLY = {
  bool ok = true;
  opaque reply<> = [40] {
    24, 5d, 2f, 24, 00, 00, 00, 01,
    00, 00, 00, 00, 00, 00, 00, 00,
    00, 00, 00, 00, 00, 00, 00, 00,
    00, 00, 00, 00, 00, 00, 00, 06,
    77, 6f, 72, 6c, 64, 0a, 00, 00
  };
};

The above output shows that clientproxy sends a REP_BCAST_VIEWINFO request from the rep_bcast_prog_1 RPC protocol, receives a response, and then issues a REP_EXECUTE request from the rep_prog_1 RPC protocol. Look through the RPC interface definition in lab2/rep.x to understand what's going on.

Exercise 1. Read through the interface definition in lab2/rep.x and through the Paxos Made Practical handout. Under what circumstances would a cohort not execute a request in a REP_EXECUTE RPC, and instead send back an execute_viewinfo structure to the client? What might go wrong if the cohort did execute the request? How does the client proceed when it receives an execute_viewinfo structure in a reply?

Place your answers in a file called answers.txt in the lab2/ directory; an empty answers.txt file should already exist there.

You can similarly trace any server RPCs being received by a process by setting ASRV_TRACE=10 in your environment. For example, you could observe the same RPCs in the above example by tracing the RPC server requests and responses at the cohort process:

% env ASRV_TRACE=10 ./cohort ../lab1/cfd 5555 0:0
Got backend server running on port 56731
Cohort running on 171.64.15.252:49856
Formed a new synthetic view.
ASRV_TRACE: serve rep_bcast_prog_1:REP_BCAST_VIEWINFO x=681e702c in4=171.64.15.252:49877
ASRV_TRACE: reply rep_bcast_prog_1:REP_BCAST_VIEWINFO x=681e702c
execute_viewinfo REPLY = {
  viewid_t vid = {
    u_int64_t counter = 0x1;
    u_int64_t manager = 0xab400ffc00007358;
  };
  net_address_t primary = {
    u_int32_t ipaddr = 0xab400ffc;
    u_int32_t port = 0xc2df;
  };
};
...

Part 3: Writing code in libasync

To familiarize yourself with libasync, you will first write a simple program, groupstat, that will allow you to query a replication group for its current status, such as the current view-id and the identifiers and addresses of the primary and any backup cohorts.

You can look at the code in clientproxy.C to see how one starts running libasync and how to issue broadcast and unicast RPC requests. The ptr<aclnt> rep_client::c_ member issues unicast RPC requests, and the ptr<aclnt> rep_client::bc_ member issues broadcast RPCs.

Your groupstat program should start out by broadcasting a REP_BCAST_VIEWINFO RPC to find the primary of the current view, and then send another RPC to the primary to find out the current view state.
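
Sketched below is the rough shape of this flow in libasync. This is only a sketch, not working code: the broadcast client bc, the argument varg, and the callback name are placeholders of ours, and you should copy the real client setup and the REP_BCAST_VIEWINFO argument type from clientproxy.C and rep.x.

static execute_viewinfo vres;           /* filled in by the RPC reply */

static void
viewinfo_cb (clnt_stat err)
{
  if (err)
    fatal << "REP_BCAST_VIEWINFO failed\n";
  /* vres.primary now holds the current primary's address.  Allocate a
     unicast aclnt to that address (as rep_client does) and issue your
     view-state RPC from this callback. */
}

/* Inside main, after building the broadcast client bc and the argument
   varg the same way clientproxy.C does: */
bc->call (REP_BCAST_VIEWINFO, &varg, &vres, wrap (viewinfo_cb));
amain ();   /* enter the libasync event loop; never returns */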

As defined, the protocol does not provide a way to ask a cohort for the current view state, so you should add a new RPC to the REP_PROG protocol that returns a view_t with the current view information. You will also need to modify cohort.C, in particular cohort::dispatch, to return the contents of the cohort::view_ member in response to your newly-added RPC request.
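
For example, the new procedure might be declared inside the REP_PROG version block of rep.x along the following lines. The name REP_GETVIEW and the procedure number are placeholders; pick any name and any unused number:

view_t
REP_GETVIEW (void) = 7;    /* placeholder name and procedure number */

Re-running make will then have rpcc regenerate rep.h and rep.C with the marshaling code and a REP_GETVIEW constant you can dispatch on in cohort::dispatch.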

Your completed groupstat program's output should look something like this:

% ./groupstat
fatal: Usage: ./groupstat group-udp-port
% ./groupstat 5555
Current view: 1:12340429876023553061
Primary: 12340429876023553061 at 171.66.3.151:34217
% 

Exercise 2. Implement the groupstat program as described above. The program should take as an argument the broadcast UDP port number of the group, and print out the current view-id of the group, and the mid_t identifiers and network addresses of the primary and backup cohorts.

Part 4: More than one cohort in a group

To add a second cohort to an existing replication group, you can run ./cohort ../lab1/cfd 5555 0:5555. This instructs the new cohort to broadcast a REP_BCAST_LETMEIN RPC request to UDP port 5555. When the other cohort receives this RPC request, it will attempt to perform a view change, and include the caller (the newly started cohort) as a member in the new view.

The code provided in this lab is incomplete in that it cannot complete a view change. Read through Section 5 of the Paxos Made Practical handout to understand how a view change operation proceeds at the protocol level. The view change code in the cohort is implemented by cohort::view_change_initiate() in cohort.C. The current code determines the set of underlings that should receive a REP_VIEW_CHANGE request, and sends out these requests, but the callback that handles the responses, cohort::view_change_prepare_cb(), is not implemented. It will be your job to finish this code.

To start with, you should decide what state needs to be kept by a view change manager in the struct view_change_attempt. For example, you may want to keep track of the nodes that have replied to your REP_VIEW_CHANGE and REP_NEW_VIEW requests. You may find the set abstraction useful for this purpose (implemented in set.h). A set is a template class that keeps track of a set of objects of the same type, such as machine identifiers mid_t. Given two sets a and b, you can determine whether a contains a majority of the elements of b by calling a.contains_majority_of(b). You can also get a set<mid_t> containing the machine IDs of all members of a view v by calling *view2mids(v).
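
For instance, a majority check over REP_VIEW_CHANGE replies might look roughly like the sketch below. The name replied, its insert method, and oldview are assumptions; check set.h and your own view_change_attempt for the real interface:

set<mid_t> replied;    /* in struct view_change_attempt: who has accepted */

/* In the reply callback, after the cohort with machine ID m accepts: */
replied.insert (m);
if (replied.contains_majority_of (*view2mids (oldview)))
  proceed_to_new_view ();   /* hypothetical next step: send REP_NEW_VIEW */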

We have already provided an implementation of the REP_VIEW_CHANGE, REP_NEW_VIEW, and REP_INIT_VIEW requests in cohort::dispatch; your view change code should issue these RPCs and wait for the appropriate responses.

Whenever your code receives a response to an RPC request, it should make sure that the view change attempt hasn't been aborted (by checking that attempt->aborted is false). You should also make sure that you properly handle the case when you receive an RPC response for a view change that was aborted, but a new view change has been started in the meantime.
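
One simple discipline that handles both cases is a guard at the top of every view-change callback, as in the sketch below; attempt is the view_change_attempt the RPC was issued for (passed through wrap()), and va_ is a hypothetical member pointing at the cohort's current attempt:

if (attempt->aborted || attempt != va_)
  return;   /* stale reply: the attempt was aborted or superseded */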

Exercise 3. Implement the view change functionality, and make sure you can join a new cohort to an existing replication group. The provided code detects dead view members, and will initiate a view change when one of the cohorts becomes unresponsive. Check that your code forms a new view when you kill the primary cohort in a group of 3 cohorts, and that your file-store state is still there. Make sure your implementation correctly handles the case when the primary of a replication group with only one cohort member initiates a view change and forms a new view with the same one primary cohort member (itself).
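
One possible test sequence, assuming clientproxy is still running on port 7777 as in Part 2 (job numbers and output will vary):

% ./cohort ../lab1/cfd 5555 0:0 &
% ./cohort ../lab1/cfd 5555 0:5555 &
% ./cohort ../lab1/cfd 5555 0:5555 &
% ./groupstat 5555                 # note which cohort is the primary
% kill %1                          # kill the initial primary
% ./groupstat 5555                 # a new view with a new primary should form
% ../lab1/cfc localhost 7777 read /hello    # file-store state should survive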

For this part, you can omit handling of REP_VIEW_CHANGE replies that have a non-NULL newview field, and you can ignore the include_me flag. Note: if you choose not to handle a non-NULL newview field, your code must not form a new view in that case; it should instead, for example, abort the view change operation.

Remember that you can use ASRV_TRACE and ACLNT_TRACE to see what RPC messages are being sent and received by all of your cohorts; this can greatly simplify debugging.

Hand in your solutions to exercises 1, 2, and 3 by the Feb 13 due date.

Part 5: Making it practical

Most real systems need to be taken down periodically for maintenance, or migrated from one physical machine to another for performance or load-balancing reasons. Our current prototype does not allow a cohort to be removed gracefully from a replication group; removal is treated like a complete cohort failure.

The handout Paxos Made Practical describes a way for a cohort to indicate its desire to leave a replication group: it replies to a REP_VIEW_CHANGE RPC request with the include_me field set to false. In this part of the lab, you will extend the cohort implementation to set this field when the cohort wants to exit the replication group, and to terminate once a new view without it has been fully formed.

To implement this functionality, you will first need to modify the view change code you implemented in the previous part to take the include_me flag into account, and to exclude from the new view, whenever possible, any cohorts that indicate their desire to leave the group. Again, refer to the Paxos Made Practical handout for an explanation of when it is acceptable to exclude a cohort from a view.

Second, you should provide a way for the administrator to instruct a specific cohort to leave a group. Add a new RPC request type that asks a cohort to leave the group, and implement the groupleave program (we have provided an empty groupleave.C file) to invoke this RPC on a given cohort. This RPC should start a view change, and the cohort should then reply to any REP_VIEW_CHANGE requests with include_me set to false.
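
As in Part 3, the RPC itself can be a one-line addition to rep.x; the name and procedure number below are again placeholders:

void
REP_LEAVE (void) = 8;    /* placeholder; pick an unused procedure number */

The handler in cohort::dispatch can then set a flag on the cohort and initiate a view change; the flag tells the cohort to answer subsequent REP_VIEW_CHANGE requests with include_me set to false.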

The Paxos Made Practical handout discusses when it is safe for an exiting cohort to cease functioning: only after a new view has been formed and a majority of the new view's cohorts have acknowledged the new view primary's initial REP_REPLICATE request forming the new view. You should come up with a way for a cohort that wants to leave the group to learn the new view's primary. You should also modify the primary to keep track of responses received for the view-forming REP_REPLICATE request, and export this information to any other cohort which is waiting for it to reach a majority. Finally, make sure that the exiting cohort waits for the appropriate condition to hold before calling exit(0) to cease functioning.

Exercise 4. Add support for exiting a replication group, as described above. You should implement a groupleave program and extend the cohort.C code to gracefully leave a replication group when requested by groupleave.

Challenge! (Optional.) Implement one of the two optimizations mentioned in Paxos Made Practical: leases for locally executing read-only requests, or witness cohorts that do not maintain a running instance of the file-store and simply log updates (but only when the primary or one of the backups is down).

To add support for local read-only execution of requests, you will need to determine whether a given request will modify the file-store state machine or not. One way to do this would be to modify the cohort code to decode each currently-opaque request, and decide whether the request is read-only based on the operation from the CFS protocol. The str2xdr() function will let you unmarshal a buffer into an XDR data structure.
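
For example, the read-only check might look roughly like the sketch below. Here cfs_request and CFS_READ are hypothetical stand-ins for whatever request type and read-only opcode your lab-1 protocol file defines, and arg is the execute_arg from the REP_EXECUTE request:

cfs_request req;     /* hypothetical request type from your lab-1 .x file */
bool readonly = false;
if (str2xdr (req, str (arg->request.base (), arg->request.size ())))
  readonly = (req.op == CFS_READ);    /* hypothetical read-only opcode */
/* If unmarshaling fails, conservatively treat the request as read-write. */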

Part 6: Turn in your lab

To turn in your answers for this lab, you must package up your code and answers.txt and submit them to cs244b-staff@scs.stanford.edu by the turn-in deadline. Package everything with make submit.tar.gz, and either send the resulting submit.tar.gz file to us as an attachment, or use make turnin to mail it to us automatically:

% make submit.tar.gz
tar -zcf submit.tar.gz Makefile *.[Cchx] answers.txt
% make turnin
tar -zcf submit.tar.gz Makefile *.[Cchx] answers.txt
uuencode submit.tar.gz lab2.tar.gz | mail cs244b-staff@scs.stanford.edu,username@stanford.edu
% 

Make sure that you receive the copy of your submission sent to your email address (username@stanford.edu) if you use make turnin.

Hand in your solutions to all exercises (1, 2, 3, and 4) by the Feb 15 due date.