Now that you have programmed a synchronous web proxy and an asynchronous TCP proxy, you will create an asynchronous caching web proxy. Unlike the TCP proxy, your web proxy will involve interactions between different connections (through the shared cache) as well as more involved treatment of individual connections.
In this handout, we use client to mean an application program that establishes connections for the purpose of sending requests. Typically the client is a web browser (e.g., lynx or Netscape). We use server to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server). Note that a proxy can act as both a client and server. Moreover, a proxy could communicate with other proxies (e.g., a cache hierarchy).
From the first lab, you should already be somewhat familiar with the HTTP/1.0 spec, RFC 1945. However, in the first lab you did not have to worry about caching or HTTP headers. This lab will require you to interpret and possibly rewrite parts of a request message before forwarding it on to the server. In particular, your proxy must have the following properties:
Your proxy should function correctly for any HTTP/1.0 GET, POST, or HEAD request. However, you may ignore any references to cookies and authentication, except that you should not cache the results of requests that carry client authorization (see the Authorization header description in section 10.2 of RFC 1945).
Your web proxy should tolerate many simultaneous requests. The web proxy will accept connections from multiple clients and forward them using multiple connections to the appropriate servers. No client or server should be able to hang the web proxy by refusing to read or write data on its connection.
You should ensure that your proxy serves cached pages to clients when RFC 1945 allows, and only contacts a server when it has to. RFC 1945 specifies headers that clients and servers may provide to help control caching: Expires, If-Modified-Since, Last-Modified, and Pragma: no-cache. You should make sure that your software obeys these headers. However, you'll find you have a certain amount of freedom in exactly how you decide whether you can serve a cached page to a client, or whether you must re-fetch it from the server.
You'll want to search RFC 1945 for any warnings about "proxy" behavior.
From the first lab, you should also be familiar with the telnet and nc utilities, and with how to configure your browser to use a web proxy.
In particular, you will want to use nc to see what web browsers send to web proxies. To do this, you might listen on port 8888 as follows:
% nc -l 8888
listening on [any] ... 8888
Then try to retrieve a URL using this port as a proxy:
% env http_proxy=http://localhost:8888/ lynx -source http://www.yahoo.com

You will see netcat print out the request headers:
% nc -l 8888
GET http://www.yahoo.com/ HTTP/1.0
Host: www.yahoo.com
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.2rel.1 libwww-FM/2.14
The first line asks for the URL http://www.yahoo.com/ using HTTP version 1.0. Look in RFC 1945 for details on the remaining lines.
In order to help you figure out how to make your web proxy work, the class web server machine, www.scs.cs.nyu.edu, is also running a copy of the Squid web proxy on port 3128. You can also use nc to look at the output from this real web proxy. For example, set your browser's proxy to http://www.scs.cs.nyu.edu:3128/ and run the following command:
% nc -l 8888
Now try to retrieve a web page from nc using the web proxy. Say you ran the above command on machine class5. Run:
% env http_proxy=http://www.scs.cs.nyu.edu:3128/ lynx -source http://class5.scs.cs.nyu.edu:8888

nc will show the following request:
% nc -l 8888
GET / HTTP/1.0
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.2rel.1 libwww-FM/2.14
Via: 1.0 ludlow.scs.cs.nyu.edu:3128 (Squid/2.3.STABLE4)
X-Forwarded-For: 204.168.181.41
Host: class5.scs.cs.nyu.edu:8888
Cache-Control: max-age=259200
Connection: keep-alive
Look for differences between the web browser's request and the corresponding proxy request.
Your proxy will take the port it should listen on as a command-line argument. For example, to listen on port 2000:

% ./webproxy 2000
After you have a general understanding of the problem, play with nc and begin your web proxy design. Once you have convinced yourself of the correctness of your design, you should implement the web proxy. You may then discover new problems and need to modify your design appropriately.
For this lab, you will use a modified version of the HTTP/1.0 parser from the first lab, but through a different interface. Do not use the httpreq_parse function, and do not expect the parse function you do use to behave like httpreq_parse.
You will actually use two classes from http.h, httpreq and httpresp. Both are fed data from a method int parse (suio *) that removes lines of HTTP headers from a suio structure. parse returns > 0 on completion (after which any data following the headers, such as the start of the body of a POST request, will still be in the suio structure), 0 if it needs more data to see the complete headers, and -1 on a parse error. Note that unlike the first lab, this interface to the HTTP parsers is stateful: the parser removes any data it consumes from the suio structure you pass to parse.
Several fields of the httpreq and httpresp structures will be useful to you. The most important field is _headers, which contains the actual headers of an HTTP request or response. The content of _headers is slightly modified so as to make it appropriate to forward on. For example, if the line

GET http://www.yahoo.com/ HTTP/1.0

is fed to the parse method, _headers will actually contain:

GET / HTTP/1.0

This simplifies the task of forwarding headers on to a remote server. If reqp is an httpreq *, then reqp->_headers->tosuio ()->output (fd) would write (some of) the headers to file descriptor fd.
Note that several of the fields in these structures are of type str. Since you were not using str structures in the first lab, these fields were made private and accessed through inline functions. For example, _url is private, and accessed by the method:

const char *url() { return _url.cstr(); }

In this lab, you are free to access the str fields directly, as you may find that more convenient. Feel free to change the private: directive into a public: in http.h.
The following routines may also be useful to you:
suio_print: Adds the contents of the string onto the end of the suio structure in such a way that the string will not be garbage collected until the bytes have been removed from the suio structure.
httperror: Returns a fully formatted HTTP message (status, headers and all) containing an error message. You can use this to report an error back to the user of the proxy. For example, if tcpconnect fails, you might want to create an error message and stick it in suio *buf:
suio_print (buf, httperror (503, "Service Unavailable", url, strerror (errno)));
take: Appends the contents of uio onto the end of this suio, removing the bytes from uio. As an example, if you wanted to append the body of an HTTP reply (in suio *buf) onto the headers built up in an httpresp structure, you might do the following:
resp->_headers << "\r\n"; // httpresp does not add blank line
resp->_headers.tosuio ()->take (buf);
There is a tester program, called test-webproxy2, located in ~class/bin. Please be sure to include a run of this tester program in any typescript you turn in. A working web proxy should look like this:
% test-webproxy2 ./webproxy
Basic page retrieval: success
Cached retrieval: success
Non-cachable retrieval: success
Timeouting client: success
%
Your proxy should satisfy the following minimum criteria:
Basic:
Asynchronous I/O:
Caching: