G22.3033-010 Lab 3: A Web Proxy

Due date: Thursday, Feb 14.

Introduction

Now that you have programmed an asynchronous TCP proxy, you will create an asynchronous caching web proxy. Unlike the TCP proxy, your web proxy will involve interactions between different connections (through the shared cache) as well as more involved treatment of individual connections.

In this handout, we use client to mean an application program that establishes connections for the purpose of sending requests[3]. Typically the client is a web browser (e.g., lynx or Netscape). We use server to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server)[1]. Note that a proxy can act as both a client and server. Moreover, a proxy could communicate with other proxies (e.g., a cache hierarchy).

Design Criteria

The HTTP/1.0 spec, RFC 1945, defines a web proxy as a transparent, trusted intermediary between web clients and web servers for the purpose of making requests on behalf of clients. Requests are serviced internally or by passing them, with possible translation, on to other servers. A proxy must interpret and, if necessary, rewrite a request message before forwarding it. In particular, your proxy must address:

Caching. The proxy must cache web files when RFC 1945 allows it to. Web caches decrease latency and total network load at the cost of sometimes serving stale data[5].
Non-blocking. Your web proxy must operate asynchronously, so that it can talk to multiple clients and servers concurrently. You'll use the same C++ async library you used in the first two labs.
Transparency. The proxy should be transparent to the client and server. Your design cannot rely on any special modification to the client or server.

Your proxy should function correctly for any HTTP/1.0 GET, POST, or HEAD request. However, you may ignore any references to cookies and authentication, except that you should not cache the results of requests that carry client authorization.

Desirable Properties of Your Web Proxy

Your web proxy should tolerate many simultaneous requests. The web proxy will accept connections from multiple clients and forward them using multiple connections to the appropriate servers. No client or server should be able to hang the web proxy by refusing to read or write data on its connection.

You should ensure that your proxy serves cached pages to clients when RFC 1945 allows, and only contacts a server when it has to. RFC 1945 specifies headers that clients and servers may provide to help control caching: Expires, If-Modified-Since, Last-Modified, and Pragma: no-cache. You should make sure that your software obeys these headers. However, you'll find you have a certain amount of freedom in exactly how you decide whether you can serve a cached page to a client, or whether you must re-fetch it from the server.
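As a rough illustration of that freedom, here is a minimal sketch of one possible policy. The names cache_entry, cache_action, and decide are hypothetical, and the sketch assumes the date headers have already been converted to time_t values.

```cpp
#include <ctime>

// Hypothetical cache metadata for one stored page. Times are assumed to be
// already parsed from the HTTP date headers into time_t values.
struct cache_entry {
  bool has_expires;            // response carried an Expires header
  std::time_t expires;
  bool has_last_modified;      // response carried a Last-Modified header
  std::time_t last_modified;
};

enum class cache_action {
  serve_cached,     // answer from the cache without contacting the server
  conditional_get,  // revalidate with If-Modified-Since
  refetch           // fetch a fresh copy unconditionally
};

// One conservative policy (an assumption of this sketch, not the only
// behavior RFC 1945 permits).
cache_action
decide (const cache_entry &e, bool client_no_cache, std::time_t now)
{
  if (client_no_cache)                      // client sent "Pragma: no-cache"
    return cache_action::refetch;
  if (e.has_expires && now < e.expires)     // still fresh
    return cache_action::serve_cached;
  if (e.has_last_modified)                  // stale or unknown: revalidate
    return cache_action::conditional_get;
  return cache_action::refetch;
}
```

A proxy following this sketch would send If-Modified-Since with the stored Last-Modified value in the conditional_get case. Other policies, such as serving pages without an Expires header for a short time, are also defensible.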

You'll want to search RFC 1945 for any warnings about "proxy" behavior.

The HTTP Protocol

The Hypertext Transfer Protocol (HTTP) is the most commonly used protocol on the web today. For this lab, you will use the somewhat out-of-date version 1.0 of HTTP.

The HTTP protocol assumes a reliable connection and, in current practice, uses TCP to provide this connection. Thus, we can use libasync with TCP sockets just as in the previous labs.

The HTTP protocol is a request/response protocol. When a client opens a connection, it immediately sends its request for a file. A web server then responds with the file or an error message. You can try out the protocol yourself. For example, try:

% telnet www.nyu.edu 80

Then type

GET / HTTP/1.0

followed by two carriage returns. See what you get.

To form the path to the file to be retrieved on a server, the client takes everything after the machine name and port number. For example, http://www.nyu.edu/postcards/ means we should ask for the file /postcards/. If you see a URL with nothing after the machine name and port, then / is assumed. (The server determines what page to return when just given /. Typically this default page is index.html or home.html.)

On most servers, the HTTP protocol lives on port 80. However, one can specify a different port number in the URL. For example, entering http://www.scs.cs.nyu.edu:3128/ into your favorite web browser connects to the machine www.scs.cs.nyu.edu on port 3128 using the HTTP protocol.
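The URL rules above (default path /, default port 80) can be sketched as follows. This is only an illustration in plain C++: http_url and parse_http_url are hypothetical names, and the sketch assumes a lowercase http:// prefix.

```cpp
#include <cstdlib>
#include <string>

// Hypothetical result type for this sketch.
struct http_url {
  std::string host;
  int port;          // 80 unless the URL names another port
  std::string path;  // "/" when nothing follows the host and port
};

// Splits "http://host[:port][/path]" into its parts. Returns false if the
// prefix is not "http://", the host is empty, or the port is not a number
// between 1 and 65535.
bool
parse_http_url (const std::string &url, http_url *out)
{
  const std::string scheme = "http://";
  if (url.compare (0, scheme.size (), scheme) != 0)
    return false;

  const std::string::size_type npos = std::string::npos;
  const std::string::size_type start = scheme.size ();
  const std::string::size_type slash = url.find ('/', start);
  const std::string hostport =
    url.substr (start, slash == npos ? npos : slash - start);
  out->path = slash == npos ? std::string ("/") : url.substr (slash);

  const std::string::size_type colon = hostport.find (':');
  out->host = hostport.substr (0, colon);
  out->port = 80;
  if (colon != npos) {
    const std::string digits = hostport.substr (colon + 1);
    if (digits.empty () || digits.size () > 5
        || digits.find_first_not_of ("0123456789") != npos)
      return false;
    out->port = std::atoi (digits.c_str ());
    if (out->port < 1 || out->port > 65535)
      return false;
  }
  return !out->host.empty ();
}
```

For example, http://www.scs.cs.nyu.edu:3128/ yields host www.scs.cs.nyu.edu, port 3128, and path /, while http://www.nyu.edu/postcards/ yields port 80 and path /postcards/.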

The format of the request for HTTP is quite simple. A request consists of a method followed by arguments, each separated by a space and terminated by a carriage return/linefeed pair. Your web proxy should support three methods: GET, POST, and HEAD[3]. Methods take two arguments: the file to be retrieved and the HTTP version. Additional headers can follow the request. The web proxy will especially care about the following headers: Allow, Date, Expires, From, If-Modified-Since, Pragma: no-cache, Server. However, your proxy must handle the other HTTP/1.0 headers[3]. Fortunately, the web proxy can forward most headers verbatim to the appropriate server. Only a handful of headers require proxy intervention.
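The httpreq class supplied with this lab already parses requests for you, so the following is only a sketch of the request-line format described above. The names request_line and parse_request_line are hypothetical, and the sketch accepts only full requests that include an HTTP version.

```cpp
#include <string>

// Hypothetical holder for the three space-separated fields.
struct request_line {
  std::string method;   // e.g. "GET"
  std::string target;   // the file, or a full URL when talking to a proxy
  std::string version;  // e.g. "HTTP/1.0"
};

// Parses "METHOD SP target SP version" with an optional trailing CRLF.
// Accepts only the three methods this lab requires.
bool
parse_request_line (std::string line, request_line *out)
{
  // Drop the terminating carriage return/linefeed pair, if present.
  while (!line.empty () && (line.back () == '\r' || line.back () == '\n'))
    line.pop_back ();

  const std::string::size_type npos = std::string::npos;
  const std::string::size_type sp1 = line.find (' ');
  if (sp1 == npos)
    return false;
  const std::string::size_type sp2 = line.find (' ', sp1 + 1);
  if (sp2 == npos || line.find (' ', sp2 + 1) != npos)
    return false;  // need exactly three fields

  out->method = line.substr (0, sp1);
  out->target = line.substr (sp1 + 1, sp2 - sp1 - 1);
  out->version = line.substr (sp2 + 1);

  if (out->target.empty () || out->version.compare (0, 5, "HTTP/") != 0)
    return false;
  return out->method == "GET" || out->method == "POST"
         || out->method == "HEAD";
}
```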

Once the request line is received, the web proxy should continue reading the input from the client until it encounters a blank line. The proxy should then fetch the appropriate file and send back a response (usually the file contents) and close the connection.
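Detecting that blank line in a growing input buffer can be sketched as below. headers_end is a hypothetical helper operating on a std::string, not part of the lab's libraries, and it assumes lines end with a carriage return/linefeed pair.

```cpp
#include <string>

// Returns the number of bytes up to and including the blank line that ends
// the headers, or std::string::npos if the blank line has not arrived yet.
std::string::size_type
headers_end (const std::string &buf)
{
  const std::string blank = "\r\n\r\n";  // end of last header + empty line
  const std::string::size_type pos = buf.find (blank);
  return pos == std::string::npos ? pos : pos + blank.size ();
}
```

Any bytes past the returned offset (for instance the body of a POST request) are not headers and must be kept for forwarding.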

Using a Web Proxy

To use a web proxy, you must configure your web browser. For lynx, wget, or Mosaic, you must set an environment variable. The following runs lynx with the proxy set to www.scs.cs.nyu.edu port 3128:

% env http_proxy=http://www.scs.cs.nyu.edu:3128/ lynx

In Netscape, find the Network Preferences and manually set up a proxy. For instance, you can set the HTTP proxy to www.scs.cs.nyu.edu and the port to 3128. Remember to revert your changes when you are done. Not all requests will work transparently through the www.scs.cs.nyu.edu proxy. Note also that for security reasons, you cannot use the www.scs.cs.nyu.edu proxy from outside NYU. (For the purposes of a quick test, however, you can bypass security by running your tcpproxy on one of the class machines.)

HTTP in Action!

How does one watch an HTTP request in action? To make a simple HTTP request emulating a browser, you can use telnet. However, telnet does not let you watch incoming TCP connections. For this, you need a more sophisticated tool, such as nc (NetCat). nc lets you read and write data across network connections using UDP or TCP[10]. The class machines already have nc installed. If you need nc for another machine, go to http://www.l0pht.com/~weld/netcat/index.html to download and install it.

To use nc to listen to the network on port 8888, run:

% nc -lp 8888
listening on [any] ... 8888

Now try to retrieve a URL using this port as a proxy:

% env http_proxy=http://localhost:8888/ lynx -source http://www.yahoo.com

You will see netcat print out the request headers:

% nc -lp 8888
GET http://www.yahoo.com/ HTTP/1.0
Host: www.yahoo.com
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.2rel.1 libwww-FM/2.14

The first line asks for a file called http://www.yahoo.com/ using HTTP version 1.0. Note that the request names the full URL, not just the path, because it is addressed to a proxy. Look in RFC 1945 for details on the remaining lines.

The above shows what a web browser sends to a web proxy. Now we'll try to obtain sample data from a real web proxy (www.scs.cs.nyu.edu, port 3128). Set your browser's proxy to http://www.scs.cs.nyu.edu:3128/ and run the following command:

% nc -lp 8888

Now try to retrieve a web page from nc using the web proxy. Say you ran the above command on machine class5. Run:

% env http_proxy=http://www.scs.cs.nyu.edu:3128/ lynx -source http://class5.scs.cs.nyu.edu:8888

nc will show the following request:

% nc -lp 8888

GET / HTTP/1.0
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.2rel.1 libwww-FM/2.14
Via: 1.0 ludlow.scs.cs.nyu.edu:3128 (Squid/2.3.STABLE4)
X-Forwarded-For: 204.168.181.41
Host: class5.scs.cs.nyu.edu:8888
Cache-Control: max-age=259200
Connection: keep-alive

Look for differences between the web browser's request and the corresponding proxy request.

Running and testing the proxy

The proxy should take exactly one argument, a port number on which to listen. For example, to run the proxy on port 2000:

% ./webproxy 2000

To test the proxy, you can use the program test-webproxy (a binary of which should be in ~class/bin, which is in your path). Run test-webproxy with your proxy as an argument:

% test-webproxy ./webproxy
trying to launch webproxy: ./webproxy listen port 1697
Starting server on port 1530....
Test Phase 1: test if cachable pages are cached...Succeeded...
Test Phase 2: test if non-cachable pages are NOT cached...Succeeded...
Test Phase 3: test timeout behavior...Succeeded...
%

The test program runs three tests:

Test Phase 1: test if cachable pages are cached...
This test is to check if you have implemented any caching scheme at all. In the test-webproxy program, the server keeps track of which of the cachable pages have been accessed before and denies access to them the second time. Thus if you have caching in your proxy, all the clients will be able to succeed in getting the cachable pages (first time from the server, second time onwards from your proxy).
Warning: Since the HTTP responses from the server are randomly generated, you need to restart your webproxy every time you run this test if you are running webproxy manually in gdb.
Test Phase 2: test if non-cachable pages are NOT cached...
This test checks whether your cache manager caches anything that is not allowed to be cached. In the test-webproxy program, the server randomly changes the responses to a particular URL (if the type of response associated with the URL is not cachable). The client always checks the response it got from the proxy against the actual response data on the server. Therefore, if your proxy caches anything non-cachable, the response the client gets will differ from the server's and you will get the following error message:
```
http response (#num) different from the server offset...
```
Specifically, if (#num) is:
- 5: Your proxy is caching a http response that has "Pragma: no-cache" header
- 6: Your proxy is caching a http response that is expired
- 7: Your proxy is caching "HTTP/1.0 404 Not Found" response
- 8: Your proxy is caching a response whose corresponding request has "Pragma: no-cache" header
- 9: Your proxy is caching a response whose corresponding request has an "Authorization" header
- 10: Your proxy is caching a response which is the result of a "POST" request
Test Phase 3: test timeout behavior...
In the test-webproxy program, the server refuses to read any data from the client after it has accepted the connection. Your proxy should be able to time out and disconnect the corresponding client and server.
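The non-cachable cases listed under Test Phase 2 can be collected into a single predicate. The following is a minimal sketch: the exchange struct and may_cache are hypothetical names, and restricting caching to GET requests with a 200 status is a conservative assumption of this sketch rather than something the test program is known to require.

```cpp
#include <ctime>
#include <string>

// Hypothetical summary of one request/response pair, filled in by whatever
// header parser the proxy uses.
struct exchange {
  std::string method;        // "GET", "POST", or "HEAD"
  int status;                // response status code, e.g. 200 or 404
  bool req_no_cache;         // request carried "Pragma: no-cache"
  bool req_authorization;    // request carried an Authorization header
  bool resp_no_cache;        // response carried "Pragma: no-cache"
  bool resp_has_expires;     // response carried an Expires header
  std::time_t resp_expires;  // parsed Expires value, if present
};

// Returns true if the response may be stored in the cache.
bool
may_cache (const exchange &x, std::time_t now)
{
  if (x.method != "GET")     // case 10 (POST); HEAD has no body to reuse
    return false;
  if (x.status != 200)       // case 7 (404), plus other errors (assumption)
    return false;
  if (x.resp_no_cache)       // case 5
    return false;
  if (x.resp_has_expires && x.resp_expires <= now)  // case 6
    return false;
  if (x.req_no_cache)        // case 8
    return false;
  if (x.req_authorization)   // case 9
    return false;
  return true;
}
```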

You can run only one of the test phases by supplying it as an additional argument on the command line:

% test-webproxy ./webproxy 2
trying to launch webproxy: ./webproxy listen port 1652
Starting server on port 1588....
Test Phase 2: test if non-cachable pages are NOT cached...Succeeded...
%

You may wish to run your proxy under the debugger while testing it. You can do so by supplying a port number to test-webproxy instead of a program name. For example:

% ./webproxy 2000 &
[1] 7013
% test-webproxy -d 2000 1
Starting server on port 1625....
Test Phase 1: test if cachable pages are cached...Succeeded...
shutting down proxy..
Killed
%

Administrivia

Where to Start?

Read over some of the suggested literature at the end of this document.

After you have a general understanding of the problem, play with nc and begin your web proxy design. Once you have convinced yourself of the correctness of your design, you should implement the web proxy. Likely you will discover new, fascinating problems and will need to modify your design appropriately.

For this lab, we will provide you with a simple HTTP/1.0 parser to save you the pain of parsing. The files http.h and http.C (also available in ~class/src on the class machines) implement a simple HTTP header parser that is convenient to use with libasync.

http.h defines two useful classes, httpreq and httpresp. Both are fed data from a method int parse (suio *) that removes lines of HTTP headers from a suio structure. parse returns 1 on completion (after which any data following the headers will still be in the suio structure), 0 if it needs more data to see the complete headers, and -1 on a parse error.

The following routines may also be useful:

void suio_print (suio *, str)
Adds the contents of the string onto the end of the suio structure in such a way that the string will not be garbage collected until the bytes have been removed from the suio structure.
str httperror (int status, str statmsg, str url, str description);
Returns a fully formatted HTTP message (status, headers and all) containing an error message. You can use this to report an error back to the user of the proxy. For example, if tcpconnect fails, you might want to create an error message and stick it in suio *buf:
```
    suio_print (buf, httperror (503, "Service Unavailable",
                                url, strerror (errno)));
```
void suio::take (suio *uio)
Appends the contents of uio on to the end of this, removing the bytes from uio. As an example, if you wanted to append the body of an HTTP reply (in suio *buf) onto the headers built up in an httpresp structure, you might do the following:
```
    resp->headers << "\r\n";    // httpresp does not add blank line
    resp->headers.tosuio ()->take (buf);
```

Handin Procedure

You should hand in a gzipped tarball produced by make dist as in the previous labs. The handin directories for this lab are located under ~class/handin/lab3.

The lab is due by the beginning of class on Wednesday, February 14th.

Requirements

Your proxy should satisfy the following minimum criteria:

GET, conditional GET, POST, and HEAD requests work
Error responses work (e.g., Error 404)
Images and binary files work

Handles multiple requests simultaneously using libasync
Deals with hanging (non-responding) clients and servers

Has a memory cache or disk cache (with a reasonable size restriction)
Obeys the Expires header
Does not cache non-cachable responses and correctly handles the "Pragma: no-cache" header
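One way to meet the size restriction on a memory cache is to evict the least recently used pages once a byte limit is reached. The sketch below is only one possible design: lru_cache is a hypothetical class built on the standard library, keyed by URL, and it counts only body bytes toward the limit.

```cpp
#include <cstddef>
#include <list>
#include <map>
#include <string>
#include <utility>

// Hypothetical in-memory cache keyed by URL, bounded by total body bytes.
class lru_cache {
public:
  explicit lru_cache (std::size_t max_bytes)
    : max_bytes_ (max_bytes), bytes_ (0) {}

  // Stores a body, evicting least recently used entries until it fits.
  // Bodies larger than the whole cache are not stored.
  void put (const std::string &url, const std::string &body) {
    erase (url);  // replace any older copy
    if (body.size () > max_bytes_)
      return;
    while (bytes_ + body.size () > max_bytes_) {
      const std::string victim = order_.back ().first;  // copy before erase
      erase (victim);
    }
    order_.push_front (std::make_pair (url, body));
    index_[url] = order_.begin ();
    bytes_ += body.size ();
  }

  // Returns the cached body and marks it most recently used, or a null
  // pointer if the URL is not cached.
  const std::string *get (const std::string &url) {
    index_t::iterator it = index_.find (url);
    if (it == index_.end ())
      return nullptr;
    order_.splice (order_.begin (), order_, it->second);
    return &it->second->second;
  }

  std::size_t bytes () const { return bytes_; }

private:
  typedef std::list<std::pair<std::string, std::string> > order_t;
  typedef std::map<std::string, order_t::iterator> index_t;

  void erase (const std::string &url) {
    index_t::iterator it = index_.find (url);
    if (it == index_.end ())
      return;
    bytes_ -= it->second->second.size ();
    order_.erase (it->second);
    index_.erase (it);
  }

  std::size_t max_bytes_;
  std::size_t bytes_;
  order_t order_;  // front = most recently used
  index_t index_;  // URL -> position in order_
};
```

A real proxy would store the response headers and expiry information alongside each body.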

References

[1] Apache Web Proxy, http://www.apache.org/docs/mod/mod_proxy.html.
[2] T. Berners-Lee. Propagation, Replication and Caching on the Web, http://www.w3.org/Propagation/.
[3] T. Berners-Lee, et al. RFC 1945: Hypertext Transfer Protocol - HTTP/1.0, May 1996.
[4] CERN Web Proxy, http://www.w3.org/Daemon/User/Proxies/Proxies.html.
[5] A. Dingle, T. Partl. Web Cache Coherence, http://sun3.ms.mff.cuni.cz/~dingle/webcoherence.html, May 1996.
[6] SquidCache, http://squid.nlanr.net/Squid/.
[7] R. Fielding, et al. RFC 2616: Hypertext Transfer Protocol - HTTP/1.1, June 1999.
[8] J. Franks, et al. RFC 2069: An Extension to HTTP: Digest Access Authentication, January 1997.
[9] J. C. Mogul, et al. RFC 2145: Use and Interpretation of HTTP Version Numbers, May 1997.
[10] Netcat. http://c0re.l0pht.com/~weld/netcat/.
[11] D. Wessels. Web Caching Reading List, http://ircache.nlanr.net/Cache/reading.html.