G22.3250 Lab 1: Synchronous web proxy

Due date: Monday Feb. 3, 12:25pm. (Free extension to 12am if you show up to class on time.)

Handouts: The handout from the first lecture, Using TCP through sockets, will be useful for this lab.

Introduction

In this lab assignment you will write a simple web proxy. A web proxy is a program that reads a request from a browser, forwards that request to a web server, reads the reply from the web server, and forwards the reply back to the browser. People typically use web proxies to cache pages for better performance, to modify web pages in transit (e.g. to remove annoying advertisements), or for weak anonymity.

You'll be writing a web proxy to learn about how to structure servers. For this assignment you'll start simple; in particular your proxy need only handle a single connection at a time. It should accept a new connection from a browser, completely handle the request and response for that browser, and then start work on the next connection. (A real web proxy would be able to handle many connections concurrently.)

In this lab, we use client to mean an application program that establishes connections for the purpose of sending requests[3], typically a web browser (e.g., lynx or Netscape). We use server to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server). Note that a proxy acts as both a client and server. Moreover, a proxy could communicate with other proxies (e.g., in a cache hierarchy).

Design Requirements

Your proxy will speak a subset of the HTTP/1.0 protocol, which is defined in RFC 1945. You're only responsible for a small subset of HTTP/1.0, so you can ignore most of the spec. You should make sure your proxy satisfies these requirements:

GET requests work.
Images/Binary files are transferred correctly.
Your webproxy should properly handle Full-Requests (RFC 1945, Section 4.1) up to, and including, 65535 bytes. You should close the connection if a Full-Request is larger than that.
You must support URLs with a numerical IP address instead of the server name (e.g. http://216.165.108.9/).
You are not allowed to use fork().
You may not allocate more than 100MB of memory.
You cannot have more than 32 open file descriptors.
Your proxy should correctly service each request if possible. If an error occurs, and it is possible for the proxy to continue with subsequent requests, it should close the connection and then proceed to the next request. If an error occurs from which the proxy cannot reasonably recover, the proxy should print an error message on the standard error and call exit(1). There are not many non-recoverable errors; perhaps the only ones are failure of the initial socket(), bind(), listen() calls, or a call to accept(). The proxy should never dump core except in situations beyond your control (e.g. a hardware or operating system failure).
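The request-size requirement above can be checked while bytes are still arriving. The following is a minimal sketch, not part of the skeleton: the names kMaxRequestBytes, RequestState, and classify_request are made up for illustration, and the blank-line test is simplified (it recognizes only CRLF CRLF and LF LF). In the lab itself, httpreq_parse tells you when the headers are complete; this sketch only shows how the 65535-byte cap interacts with an incomplete request.

```cpp
#include <cstddef>
#include <string>

// Largest Full-Request the proxy must accept (from the requirements list).
const std::size_t kMaxRequestBytes = 65535;

// Hypothetical names; the lab skeleton does not define these.
enum RequestState { REQ_INCOMPLETE, REQ_COMPLETE, REQ_TOO_LARGE };

// Classify everything received so far on one client connection.
RequestState classify_request(const std::string &buf) {
  const std::size_t npos = std::string::npos;

  // Length of the request up to and including the first blank line.
  std::size_t len = npos;
  std::size_t crlf = buf.find("\r\n\r\n");
  if (crlf != npos) len = crlf + 4;
  std::size_t lf = buf.find("\n\n");
  if (lf != npos && (len == npos || lf + 2 < len)) len = lf + 2;

  if (len != npos)
    return len <= kMaxRequestBytes ? REQ_COMPLETE : REQ_TOO_LARGE;

  // No blank line yet. If the cap is already used up, any completion
  // would exceed it, so the connection can be closed now.
  return buf.size() >= kMaxRequestBytes ? REQ_TOO_LARGE : REQ_INCOMPLETE;
}
```

On REQ_TOO_LARGE the proxy would close the client connection and go on to the next one; on REQ_INCOMPLETE it would read more data and classify again.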

You do not have to worry about correct implementation of any of the following features; just ignore them as best you can:

POST or HEAD requests.
URLs of any type other than http.
HTTP-headers (RFC 1945, Section 4.2).

If your browser can fetch pages and images through your proxy, and your proxy passes the tester (see below), you're done.

HTTP example without a web proxy

HTTP is a request/response protocol that runs over TCP. A client opens a connection to a web server and sends a request for a file; the server responds with some status information and the file contents, and then closes the connection.

You can try out HTTP yourself:

% telnet www.scs.cs.nyu.edu 80[return]
GET / HTTP/1.0[return]
[return]
HTTP/1.1 200 OK
Date: Mon, 27 Jan 2003 03:08:59 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.10 OpenSSL/0.9.7-beta3
Last-Modified: Wed, 22 Jan 2003 07:24:49 GMT
ETag: "22c11-c33-3e2e4741"
Accept-Ranges: bytes
Content-Length: 3123
Connection: close
Content-Type: text/html

...

The telnet command connects to www.scs.cs.nyu.edu on port 80, the default port for HTTP (web) servers. The line GET / HTTP/1.0 is a request to get the web page /. The blank line ends the header section of the request. The server then locates the web page and sends it back. You should see it on your screen.

To form the path to the file to be retrieved on a server, the client takes everything after the machine name. For example, http://www.scs.cs.nyu.edu/G22.3250/index.html means we should ask for the file /G22.3250/index.html. If you see a URL with nothing after the machine name, then / is assumed---the server figures out what page to return when just given /. Typically this default page is /index.html.

On most servers, the HTTP server lives on port 80. However, one can specify a different port number in the URL. For example, typing http://glimpse.cs.arizona.edu:1994 in your browser will tell it to find a web server on port 1994 of glimpse.cs.arizona.edu. (This server may not exist.)
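The two rules above (the path defaults to / and the port defaults to 80) can be sketched in a few lines of C++. This is only an illustration: split_http_url is a hypothetical helper, not part of the provided http.C parser, and it assumes a lowercase http:// prefix.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper: split "http://host[:port][/path]" into its parts.
// Returns false for non-http URLs, an empty host, or a malformed port.
// The output arguments are only meaningful when it returns true.
bool split_http_url(const std::string &url, std::string &host,
                    int &port, std::string &path) {
  const std::string scheme = "http://";
  if (url.compare(0, scheme.size(), scheme) != 0) return false;

  // The machine name (and optional port) runs up to the first '/'.
  std::size_t start = scheme.size();
  std::size_t slash = url.find('/', start);
  std::string authority = (slash == std::string::npos)
                              ? url.substr(start)
                              : url.substr(start, slash - start);

  // Nothing after the machine name means "/" is assumed.
  path = (slash == std::string::npos) ? "/" : url.substr(slash);

  port = 80;  // default HTTP port
  std::size_t colon = authority.find(':');
  if (colon != std::string::npos) {
    std::string digits = authority.substr(colon + 1);
    if (digits.empty() || digits.size() > 5) return false;
    for (std::size_t i = 0; i < digits.size(); ++i)
      if (digits[i] < '0' || digits[i] > '9') return false;
    port = std::atoi(digits.c_str());
    if (port < 1 || port > 65535) return false;
    authority.erase(colon);
  }
  host = authority;
  return !host.empty();
}
```

Because the host is returned as a plain string, a numerical address such as 216.165.108.9 passes through unchanged.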

HTTP (request) example with a web proxy

Before you can do this example, you need to tell your web browser how to use a web proxy. For lynx, wget, or Mosaic, you must set an environment variable. The following runs lynx with the proxy set to www.scs.cs.nyu.edu port 3128:

 % env http_proxy=http://www.scs.cs.nyu.edu:3128/ lynx

In Netscape or Mozilla, choose ``Edit'' ---> ``Preferences''. Then choose ``Advanced'' ---> ``Proxies''. Click on ``Manual proxy configuration''. Now set the ``HTTP proxy'' to www.scs.cs.nyu.edu and port 3128. Mozilla will now send all HTTP requests to this web proxy rather than directly to web servers. Remember to revert your changes! (Not all requests will work transparently through the www.scs.cs.nyu.edu proxy.) Note also that for security reasons, you cannot use the www.scs.cs.nyu.edu proxy from outside NYU.

Now on to actually using the proxy...

You can use the nc command to peek at HTTP requests that a browser sends to a web proxy. nc lets you read and write data across network connections using UDP or TCP[4]. The class machines have nc installed.

First we'll examine the requests that a browser sends to the proxy. We'll use nc to listen on a port and direct our web browser (Lynx) to use that host and port as a proxy. We're going to let nc listen on port 8888 and tell Lynx to use a web proxy on port 8888.

% nc -l 8888

This tells nc to listen on port 8888. You may have to choose a port other than 8888 because someone else may already be using it. Choose a number greater than 1024 and less than 65536. Now try, on the same machine, to retrieve a web page using port 8888 as a proxy:

% env http_proxy=http://localhost:8888/ lynx -source http://www.yahoo.com

This tells Lynx to fetch http://www.yahoo.com using a web proxy on port 8888, which happens to be our spy friend nc.

Netcat neatly prints out the request headers that Lynx sent:

% nc -l 8888
GET http://www.yahoo.com/ HTTP/1.0
Host: www.yahoo.com
Accept: text/html, text/plain, application/vnd.rn-rn_music_package, application/x-freeamp-theme, audio/mp3, audio/mpeg, audio/mpegurl, audio/scpls, audio/x-mp3, audio/x-mpeg, audio/x-mpegurl, audio/x-scpls, audio/mod, image/*, video/mpeg, video/*
Accept: application/pgp, application/pdf, application/postscript, message/partial, message/external-body, x-be2, application/andrew-inset, text/richtext, text/enriched, x-sun-attachment, audio-file, postscript-file, default, mail-file
Accept: sun-deskset-message, application/x-metamail-patch, application/msword, text/sgml, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6b

The GET request on the first line tells the proxy to get the file http://www.yahoo.com using HTTP version 1.0. Notice how this request is quite different from the example without a web proxy! The protocol and machine name (http://www.yahoo.com) are now part of the request. In the previous example this part was omitted. Look in RFC 1945 for details on the remaining lines.

HTTP (reply) example with a web proxy

The previous example shows the HTTP request. Now we'll try to see what a real web proxy (www.scs.cs.nyu.edu port 3128) sends to a web server. To achieve this, we use nc as a fake web server. Start the ``fake server'' on a class machine, say class1.scs.cs.nyu.edu, with the following command:

class1 1% nc -l 8888

Again, you may have to choose a different number if 8888 turns out to be taken by someone else.

% env http_proxy=http://sure.lcs.mit.edu:3128/ lynx -source http://pain.lcs.mit.edu:8888

Needless to say, you should replace 8888 by whatever port you chose to run nc on. nc will show the following request:

class1 1% nc -l 8888
GET / HTTP/1.0
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.4rel.1 libwww-FM/2.14
Via: 1.0 supervised-residence.lcs.mit.edu:3128 (Squid/2.4.STABLE4)
X-Forwarded-For: 18.26.4.76
Host: pain.lcs.mit.edu:8888
Cache-Control: max-age=259200
Connection: keep-alive

Notice how the web proxy stripped away the http://pain.lcs.mit.edu:8888 part from the request!

Your web proxy

Your web proxy will have to translate requests that the client makes (the ones that start with ``GET http://machinename'') into requests that the server understands. You will start with some skeletal code containing an HTTP parser, which should make your life easier.
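To make the translation concrete, here is a rough sketch of rewriting just the request line. The function to_origin_request_line is a hypothetical name and is not in the skeleton; in your proxy you would more likely build the line from the method and path that the provided parser extracts.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: turn the absolute-URL request line that a browser
// sends to a proxy into the path-only form sent to the origin server.
// Returns "" if the line is not of the form "METHOD http://... VERSION".
std::string to_origin_request_line(const std::string &line) {
  std::size_t sp1 = line.find(' ');
  if (sp1 == std::string::npos) return "";
  std::size_t sp2 = line.find(' ', sp1 + 1);
  if (sp2 == std::string::npos) return "";

  const std::string scheme = "http://";
  std::string url = line.substr(sp1 + 1, sp2 - sp1 - 1);
  if (url.compare(0, scheme.size(), scheme) != 0) return "";

  // Everything from the first '/' after the machine name is the path;
  // a URL with nothing after the machine name means "/".
  std::size_t slash = url.find('/', scheme.size());
  std::string path = (slash == std::string::npos) ? "/" : url.substr(slash);

  // Keep the method and version exactly as the client sent them.
  return line.substr(0, sp1 + 1) + path + line.substr(sp2);
}
```

With the Lynx request shown earlier, the input line GET http://www.yahoo.com/ HTTP/1.0 becomes GET / HTTP/1.0.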

Your web proxy will take a single argument specifying the TCP port on which to listen (so different users on the same class machine can avoid conflicting with each other).
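Validating that argument is worth a few lines, since a bad port would otherwise only show up as a confusing bind() failure. A minimal sketch, assuming a helper named parse_port (a made-up name, not part of the skeleton):

```cpp
#include <cerrno>
#include <cstdlib>

// Hypothetical helper: parse a command-line port argument.
// Returns the port number (1 to 65535), or -1 if arg is not a valid port.
int parse_port(const char *arg) {
  if (arg == nullptr || *arg == '\0') return -1;

  char *end = nullptr;
  errno = 0;
  long value = std::strtol(arg, &end, 10);

  // Reject overflow and trailing junk such as "80x".
  if (errno != 0 || *end != '\0') return -1;
  if (value < 1 || value > 65535) return -1;
  return static_cast<int>(value);
}
```

A main() could call parse_port(argv[1]) and, on -1, print a usage message on standard error and call exit(1).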

Once the request line has been received, the web proxy should continue reading the input from the client until it encounters a blank line. The proxy should then fetch the URL from the appropriate server, forward the response back to the client, and close the connection. The proxy should forward response data as it arrives, rather than buffering the entire response; this allows the proxy to handle huge responses without running out of memory.
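The forwarding step described above can be sketched with plain read() and write() calls on two file descriptors. This is an illustration under assumptions, not the required design: relay_response and write_all are made-up names, the 8192-byte buffer size is arbitrary, and the skeleton may lead you toward libasync facilities instead.

```cpp
#include <cerrno>
#include <cstddef>
#include <unistd.h>

// Write exactly len bytes, retrying after short writes and EINTR.
// Returns false on any other write error.
static bool write_all(int fd, const char *buf, size_t len) {
  while (len > 0) {
    ssize_t n = write(fd, buf, len);
    if (n < 0) {
      if (errno == EINTR) continue;
      return false;
    }
    buf += n;
    len -= static_cast<size_t>(n);
  }
  return true;
}

// Hypothetical helper: copy bytes from server_fd to client_fd until the
// server closes its end of the connection. Only one fixed-size buffer is
// used, so memory use does not grow with the size of the response.
// Returns false on a read or write error.
bool relay_response(int server_fd, int client_fd) {
  char buf[8192];
  for (;;) {
    ssize_t n = read(server_fd, buf, sizeof buf);
    if (n == 0) return true;  // end of file: the server is done
    if (n < 0) {
      if (errno == EINTR) continue;
      return false;
    }
    if (!write_all(client_fd, buf, static_cast<size_t>(n))) return false;
  }
}
```

Whether it returns true or false, the caller would close both descriptors and move on to the next connection. Note that writing to a socket the client has already closed raises SIGPIPE, which terminates the process by default, so a proxy built this way would probably want to ignore that signal.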

Your web proxy has to support the GET method only [3]. A GET method takes two arguments: the file to be retrieved and the HTTP version. Additional headers may follow the request.

Getting Started

Enough talking. Now do something.

We have provided a skeleton webproxy project directory. It is available in ~class/src/webproxy1.tar.gz. Start by unpacking the source code in your home directory. On the class machines, you can do so with the following commands:

% tar xzf ~class/src/webproxy1.tar.gz
% cd webproxy1
% sh ./setup
automake: configure.in: installing `./install-sh'
automake: configure.in: installing `./mkinstalldirs'
automake: configure.in: installing `./missing'
configure.in: 22: required file `./ltconfig' not found
automake: Makefile.am: installing `./INSTALL'
automake: Makefile.am: installing `./COPYING'
+ autoconf
+ set +x
%

The skeleton source tree contains source files http.C, http.h, and webproxy1.C. The first two files will help you parse HTTP requests. webproxy1.C is a pretty useless server that just prints "synchronous_proxy unimplemented... goodbye". This is printed by a function called synchronous_proxy which you must now implement.

Next, you must configure the software and generate a Makefile--a set of instructions for how to compile the software. For this class, we will use the GNU autoconf and automake tools to generate Makefiles. You will also be linking against the libasync library that is part of SFS. On the class machines, generate the Makefile with the following commands:

% setenv DEBUG -g
% ./configure --with-sfs=/usr/local/os/sfs-dbg
creating cache ./config.cache
checking for a BSD compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking whether make sets ${MAKE}... yes
checking for working aclocal... found
checking for working autoconf... found
checking for working automake... found
checking for working autoheader... found
checking for working makeinfo... found
checking host system type... i386-unknown-openbsd3.2
...
updating cache ./config.cache
creating ./config.status
creating Makefile
creating config.h
%

It is very important that you supply the argument --with-sfs=/usr/local/os/sfs-dbg to ./configure. If you don't, things will appear to work, but you will get a version of libasync without built-in debugging sanity checks. Your assignment will be linked against debugging libraries for grading, so you want to make sure you get the benefit of the sanity checking while testing the software yourself.

Once the software is configured, you can build webproxy1 by running gmake. (Note that this is gmake with a g, and not make. At the end of the assignment you will make a software distribution that compiles with any make, but for development you must use gmake which is GNU make.)

% gmake
c++ -DHAVE_CONFIG_H -I. ... -c /home/c/dm/webproxy1/http.C
c++ -DHAVE_CONFIG_H -I. ... -c /home/c/dm/webproxy1/webproxy1.C
/bin/sh ./libtool --mode=link c++ ... -o webproxy1 ...
mkdir .libs
c++ ... -o webproxy1 http.o webproxy1.o ...
%

That's it! You've now built webproxy1.
To test it, type (for example, on class1):

class1 5% ./webproxy1 1234

In another window, you can now type:

% telnet class1.scs.cs.nyu.edu 1234
Trying 216.165.109.103...
Connected to class3.
Escape character is '^]'.
synchronous_proxy unimplemented... goodbye
Connection closed by foreign host.
%

Out-of-directory builds

Warning: Read this section or your compiles will be infuriatingly slow!

It is often useful to compile a program in a different directory from the source code. There are several reasons for this. One may want to compile the same source tree multiple times--for example once with debugging, once without. Using two copies of the same source tree would make it a pain to keep the two compiled versions in sync. Another issue is that C++ object files and executables can get pretty large--especially with debugging information. Thus it is considerably faster to compile on a local disk when the source code is not local. Finally, when you have limited backed-up disk space, there is no reason to waste it on huge C++ executables, since these can always be recreated from the source in the event of a disk crash.

Autoconf easily supports compiling in a different directory. You simply need to run the configure script from whatever directory you wish the compile to take place. However, when a source tree is being used for out-of-directory builds, you cannot also perform an in-place build. The following example illustrates how one might compile webproxy1 out-of-directory using local disk space on the machine class2:

class2 1% cd webproxy1
class2 2% gmake distclean
rm -f config.h
rm -f *.tab.c
...
rm -f config.status
class2 3% mkdir /home/c2/scratch/student
class2 4% cd /home/c2/scratch/student
class2 5% mkdir webproxy1
class2 6% cd webproxy1
class2 7% setenv DEBUG -g
class2 8% ~/webproxy1/configure --with-sfs=/usr/local/os/sfs-dbg
creating cache ./config.cache
checking for a BSD compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
...
creating Makefile
creating config.h
class2 9% gmake
c++ -DHAVE_CONFIG_H -I. -I/home/c/student/webproxy1 -I. ...
/bin/sh ./libtool --mode=link c++ -g -ansi -Wall -Wsign-compare ...
mkdir .libs
...
class2 10%

(The gmake distclean command cleans up any previous in-directory build, restoring the webproxy1 directory to its pristine state. If you never ran ./configure in that directory, you do not need to run gmake distclean. In fact, gmake distclean will fail in that case, which is fine.)

http.C and http.h: an HTTP parser

The http.C and http.h files implement an HTTP request parser. http.h defines the class httpreq, and a function httpreq_parse for filling in the fields of the structure. To parse a request, first create an httpreq object. Then, parse the (potentially incomplete) HTTP request by feeding it to:

int httpreq_parse (httpreq *resp, const char *buf, size_t len);

until the function returns a positive number, indicating that the headers are complete. buf should be the buffer that contains the (potentially incomplete) HTTP request. len is the length of the HTTP request fragment in buf. Notice that httpreq_parse needs to see the whole request you have read so far, not just the most recent chunk. httpreq_parse returns 1 if the HTTP request is complete, 0 if it needs more data to complete, or -1 on a parse error. httpreq_parse does not modify the contents of buf.

Once httpreq_parse returns 1, you can call, among others, the following methods on the httpreq object:

char *method()    The 'type' of request (POST, GET, HEAD)
char *host()      The destination host
short port()      The destination port
char *path()      The filename part of the requested URL
char *url()       The requested URL

Here's a simple program that illustrates the use of httpreq.

#include <stdio.h>
#include <stdlib.h>   /* exit */
#include <string.h>   /* strcpy, strcat, strlen */
#include "http.h"

int
main()
{
  httpreq r;
  char buf[512];
  int ret;

  // incomplete header: httpreq_parse returns 0, so no path is printed
  strcpy (buf, "GET http://web.mit.edu/index.html");
  ret = httpreq_parse (&r, buf, strlen (buf));
  printf ("ret %d file %s\n", ret, ret > 0 ? r.path () : "(none)");

  // complete header: pass the whole request read so far, not just the new part
  strcat (buf, " HTTP/1.0\r\n\r\n");
  ret = httpreq_parse (&r, buf, strlen (buf));
  printf ("ret %d file %s\n", ret, ret > 0 ? r.path () : "(none)");
  exit (0);
}

Documentation

You may want to read Using TCP through sockets to learn about socket programming in C/C++. See also the BSD IPC manual for an alternate treatment of socket programming.

Running and testing the proxy

Your proxy program should take exactly one argument, a port number on which to listen. For example, to run the proxy on port 2000:

% ./webproxy1 2000

As a first test of the proxy you should attempt to use it to browse the web. Set up your web browser to use your proxy, running on one of the class machines, and experiment with a variety of different pages.

When you think your proxy is ready, you can run it against our tester, test-webproxy1, the source of which is in ~class/test-webproxy1.C. Run the tester with your proxy as an argument:

% test-webproxy1 ./webproxy1

Note that this may take several minutes to complete. The test program runs the following tests:

Ordinary fetch: This test is the "normal case". We send a normal HTTP 1.0 GET request and expect the correct web page.
Split request: This test splits the HTTP request into two chunks. The first chunk contains a partial HTTP request. The second chunk completes the first, after which the tester expects the correct web page contents to come back.
Large request: The tester does a request of exactly 65535 bytes.
Large response: The tester fetches a web page larger than the maximum amount of memory available to your web proxy.
Zero-size response: The tester fetches a web page without a body.
Recover after bad connect: The tester sends a request with a URL that specifies a false port. Your proxy will attempt to make a connection to a bogus port. Soon thereafter, the tester tries to fetch a valid page to see if your proxy is still doing ok.
Malformed request: The tester sends an HTTP request that is not syntactically correct. After that, it tries to fetch a valid page to see if your proxy is still doing ok.
Premature client close(): The tester sends a partial HTTP request and then closes the connection. After that, it tries to fetch a valid page to see if your proxy is still doing ok.
Infinitely long request: The tester swamps your proxy with a request larger than 65535 bytes. The tester expects your proxy to close the connection. After that, it tries to fetch a valid page to see if your proxy is still doing ok.
Stress test: The tester stress tests your web proxy with a ruthless combination of ordinary fetches, split requests, malformed requests, and large responses. This may expose memory leaks, unclosed connections, and random other bugs.

Collaboration policy

You must write all the code you hand in for the programming assignments, except for code that we give you as part of the assignment. You are not allowed to look at anyone else's solution (and you're not allowed to look at solutions from previous years). You may discuss the assignments with other students, but you may not look at or copy each others' code.

How/What to hand in

You must submit two files:

A complete software distribution of the webproxy1 program.
A script file showing how you tested the program.

To build a software distribution, run the command:

% gmake distcheck
rm -rf webproxy1-0.0
mkdir webproxy1-0.0
chmod 777 webproxy1-0.0
here=`cd . && pwd`; \
top_distdir=`cd webproxy1-0.0 && pwd`; \
distdir=`cd webproxy1-0.0 && pwd`; \
cd /home/c/dm/webproxy1 \
&& automake-1.4 --include-deps --build-dir=$here --srcdir-name=/home/c/dm/webproxy1 --output-dir=$top_distdir --gnu Makefile
chmod -R a+r webproxy1-0.0
...
gmake[1]: Leaving directory `/disk/c3/scratch/dm/webproxy1-0.0/=build'
rm -rf webproxy1-0.0
==============================================
webproxy1-0.0.tar.gz is ready for distribution
==============================================
%

To turn in your distribution, copy it to the directory ~class/handin/lab1/username where username is your username:

% cp webproxy1-0.0.tar.gz ~class/handin/lab1/`logname`/
%

To create a script file, use the script command. When you run script, everything you type gets saved in a file called typescript. Press CTRL-D to finish the script. The typescript should be copied to the same directory as the software distribution. For example:

% script
Script started, output file is typescript
% test-webproxy1 ./webproxy1
...
% ^D
Script done, output file is typescript
% cp typescript ~class/handin/lab1/`logname`/
%

If you have any problems with submission, please contact the instructor.

References

[1] Apache Web Proxy, http://www.apache.org/docs/mod/mod_proxy.html.
[2] T. Berners-Lee, et al. RFC 1945: Hypertext Transfer Protocol - HTTP/1.0, May 1996.
[3] CERN Web Proxy, http://www.w3.org/Daemon/User/Proxies/Proxies.html.
[4] nc man page.