G22.3250 Lab 3: Caching, asynchronous web proxy

Due date: Monday Feb. 24, 12:25pm. (Free extension to 12am if you show up to class on time.)

Introduction

Now that you have programmed a synchronous web proxy and an asynchronous TCP proxy, you will create an asynchronous caching web proxy. Unlike the TCP proxy, your web proxy will involve interactions between different connections (through the shared cache) as well as more involved treatment of individual connections.

In this handout, we use client to mean an application program that establishes connections for the purpose of sending requests. Typically the client is a web browser (e.g., lynx or Netscape). We use server to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server). Note that a proxy can act as both a client and server. Moreover, a proxy could communicate with other proxies (e.g., a cache hierarchy).

Design Criteria

From the first lab, you should already be somewhat familiar with the HTTP/1.0 spec, RFC 1945. However, in the first lab you did not have to worry about caching or HTTP headers. This lab will require you to interpret and possibly rewrite parts of a request message before forwarding it onto the server. In particular, your proxy must have the following properties:

Caching. The proxy must cache web files when RFC 1945 allows it to. Web caches decrease latency and total network load at the cost of sometimes serving stale data.
Non-blocking. Your web proxy must operate asynchronously, so that it can talk to multiple clients and servers concurrently. You'll use the same C++ async library you used in the first two labs.
Transparency. The proxy should be transparent to the client and server. Your design cannot rely on any special modification to the client or server.

Your proxy should function correctly for any HTTP/1.0 GET, POST, or HEAD request. However, you may ignore any references to cookies and authentication, except that you should not cache the results of requests that carry client authorization (see the Authorization header description in section 10.2 of RFC 1945).

Desirable Properties of Your Web Proxy

Your web proxy should tolerate many simultaneous requests. The web proxy will accept connections from multiple clients and forward them using multiple connections to the appropriate servers. No client or server should be able to hang the web proxy by refusing to read or write data on its connection.

You should ensure that your proxy serves cached pages to clients when RFC 1945 allows, and only contacts a server when it has to. RFC 1945 specifies headers that clients and servers may provide to help control caching: Expires, If-Modified-Since, Last-Modified, and Pragma: no-cache. You should make sure that your software obeys these headers. However, you'll find you have a certain amount of freedom in exactly how you decide whether you can serve a cached page to a client, or whether you must re-fetch it from the server.

You'll want to search RFC 1945 for any warnings about ``proxy'' behavior.

The HTTP Protocol

You might want to look back at the first lab to refresh your memory about how HTTP works, how to test it with the telnet and nc utilities, and how to configure your browser to use a web proxy.

In particular, you will want to use nc to see what web browsers send to web proxies. To do this, you might listen on port 8888 as follows:

% nc -l 8888
listening on [any] ... 8888

Then try to retrieve a URL using this port as a proxy:

% env http_proxy=http://localhost:8888/ lynx -source http://www.yahoo.com

You will see netcat print out the request headers:

% nc -l 8888
GET http://www.yahoo.com/ HTTP/1.0
Host: www.yahoo.com
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.2rel.1 libwww-FM/2.14

The first line asks for a file called http://www.yahoo.com using HTTP version 1.0. Look in RFC 1945 for details on the remaining lines.

In order to help you figure out how to make your web proxy work, the class web server machine, www.scs.cs.nyu.edu, is also running a copy of the Squid web proxy on port 3128. You can also use nc to look at the output from this real web proxy. For example, set your browser's proxy to http://www.scs.cs.nyu.edu:3128/ and run the following command:

% nc -l 8888

Now try to retrieve a web page from nc using the web proxy. Say you ran the above command on machine class5. Run:

% env http_proxy=http://www.scs.cs.nyu.edu:3128/ lynx -source http://class5.scs.cs.nyu.edu:8888

nc will show the following request:

% nc -l 8888
GET / HTTP/1.0
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.2rel.1 libwww-FM/2.14
Via: 1.0 ludlow.scs.cs.nyu.edu:3128 (Squid/2.3.STABLE4)
X-Forwarded-For: 204.168.181.41
Host: class5.scs.cs.nyu.edu:8888
Cache-Control: max-age=259200
Connection: keep-alive

Look for differences between the web browser's request and the corresponding proxy request.

Running and testing the proxy

The proxy should take exactly one argument, a port number on which to listen. For example, to run the proxy on port 2000:

% ./webproxy 2000

Getting started

After you have a general understanding of the problem, play with nc and begin your web proxy design. Once you have convinced yourself of the correctness of your design, you should implement the web proxy. You may then discover new problems and need to modify your design appropriately.

For this lab, you will use a modified version of the HTTP/1.0 parser from the first lab, but through a different interface. Do not use the httpreq_parse function, and do not expect the parse function you do use to behave like httpreq_parse.

You will actually use two classes from http.h, httpreq and httpresp. Both are fed data from a method int parse (suio *) that removes lines of HTTP headers from a suio structure. parse returns > 0 on completion (after which any data following the headers, such as the start of the body of a POST request, will still be in the suio structure), 0 if it needs more data to see the complete headers, and -1 on a parse error. Note that unlike the first lab, this interface to the HTTP parsers is stateful--the parser will remove any data it consumes from the suio structure you pass to parse.

Several fields of the httpreq and httpresp structure will be useful to you. The most important field is _headers, which contains the actual headers of an HTTP request or response. The content of _headers is slightly modified so as to make it appropriate to forward on. For example, if the line

   GET http://www.yahoo.com/ HTTP/1.0

is fed to the parse method, headers will actually contain:

   GET / HTTP/1.0

This simplifies the task of forwarding headers on to a remote server. If reqp is an httpreq *, then reqp->_headers->tosuio ()->output (fd) would write (some of) the headers to file descriptor fd.

Note that several of the fields in the parse structure are of type str. Since you were not using str structures in the first lab, these were made private, and accessed through inline functions. For example, _url is private, and accessed by the method:

   const char *url() { return _url.cstr(); }

In this lab, you are free to access the str fields directly, as you may find that more convenient. Feel free to change the private: directive into a public: in http.h.

The following routines may also be useful to you:

void suio_print (suio *, str)
Adds the contents of the string onto the end of the suio structure in such a way that the string will not be garbage collected until the bytes have been removed from the suio structure.
str httperror (int status, str statmsg, str url, str description);
Returns a fully formatted HTTP message (status, headers and all) containing an error message. You can use this to report an error back to the user of the proxy. For example, if tcpconnect fails, you might want to creat an error message and stick it in suio *buf:
```
    suio_print (buf, httperror (503, "Service Unavailable",
                                url, strerror (errno)));
```
void suio::take (suio *uio)
Appends the contents of uio on to the end of this, removing the bytes from uio. As an example, if you wanted to append the body of an HTTP reply (in suio *buf) onto the headers built up in an httpresp structure, you might do the following:
```
    resp->_headers << "\r\n";    // httpresp does not add blank line
    resp->_headers.tosuio ()->take (buf);
```

Handin Procedure

You should hand in a gzipped tarball produced by gmake distcheck as in the previous labs, and a typescript of any tests. The handin directories for this lab are located under ~class/handin/lab3.

There is a tester program, called test-webproxy2, located in ~class/bin. Please be sure to include a run of this tester program in any typescript you turn in. A working web proxy should look like this:

% test-webproxy2 ./webproxy
Basic page retrieval: success
Cached retrieval: success
Non-cachable retrieval: success
Timeouting client: success
%

Requirements

Your proxy should statisfy the following minimum criteria:

Basic:

GET, conditional GET, POST, HEAD request works
Error response works (e.g. Error 404)
Images/Binary file works

Asynchronous I/O:

handles multiple requests simultaneously using libasync
deals with hanging(non-responding) client/server

Caching:

Has memory cache or disk cache (has reasonable size restriction)
Obeys expires header
does not cache the non-cachable and correctly handles "Pragma: no cache" header