In this lab assignment you will write a simple web proxy. A web proxy is a program that reads a request from a browser, forwards that request to a web server, reads the reply from the web server, and forwards the reply back to the browser. People typically use web proxies to cache pages for better performance, to modify web pages in transit (e.g. to remove annoying advertisements), or for weak anonymity.
You'll be writing a web proxy to learn about how to structure servers. For this assignment you'll start simple; in particular your proxy need only handle a single connection at a time. It should accept a new connection from a browser, completely handle the request and response for that browser, and then start work on the next connection. (A real web proxy would be able to handle many connections concurrently.)
In this lab, we use client to mean an application program that establishes connections for the purpose of sending requests[3], typically a web browser (e.g., lynx or Netscape). We use server to mean an application program that accepts connections in order to service requests by sending back responses (e.g., the Apache web server). Note that a proxy acts as both a client and server. Moreover, a proxy could communicate with other proxies (e.g., in a cache hierarchy).
Since your proxy handles only one connection at a time, it need not fork(). On a non-recoverable error, the proxy should print a message and exit(1). There are not many non-recoverable errors; perhaps the only ones are failure of the initial socket(), bind(), or listen() calls, or a call to accept(). The proxy should never dump core except in situations beyond your control (e.g. a hardware or operating system failure).
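Concretely, the handling of those few fatal errors during setup might look like the following sketch (plain POSIX sockets; the helper name setup_listener is ours, not part of any skeleton code):

```cpp
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <sys/socket.h>
#include <netinet/in.h>

// Create a TCP socket listening on the given port.  Failure of any of
// these calls is one of the few non-recoverable errors, so print a
// message and exit(1).  Per-connection errors, by contrast, should be
// handled without exiting.
int
setup_listener (unsigned short port)
{
  int fd = socket (AF_INET, SOCK_STREAM, 0);
  if (fd < 0) {
    perror ("socket");
    exit (1);
  }

  struct sockaddr_in sin;
  memset (&sin, 0, sizeof (sin));
  sin.sin_family = AF_INET;
  sin.sin_addr.s_addr = htonl (INADDR_ANY);
  sin.sin_port = htons (port);

  if (bind (fd, (struct sockaddr *) &sin, sizeof (sin)) < 0) {
    perror ("bind");
    exit (1);
  }
  if (listen (fd, 5) < 0) {
    perror ("listen");
    exit (1);
  }
  return fd;
}
```

An accept() failure could in principle also be retried, but the text above treats it as fatal, which is fine for this assignment.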
You do not have to worry about correct implementation of any of the following features; just ignore them as best you can:
If your browser can fetch pages and images through your proxy, and your proxy passes the tester (see below), you're done.
HTTP is a request/response protocol that runs over TCP. A client opens a connection to a web server and sends a request for a file; the server responds with some status information and the file contents, and then closes the connection.
You can try out HTTP yourself:
% telnet www.scs.cs.nyu.edu 80[return]
GET / HTTP/1.0[return]
[return]
HTTP/1.1 200 OK
Date: Mon, 27 Jan 2003 03:08:59 GMT
Server: Apache/1.3.26 (Unix) mod_ssl/2.8.10 OpenSSL/0.9.7-beta3
Last-Modified: Wed, 22 Jan 2003 07:24:49 GMT
ETag: "22c11-c33-3e2e4741"
Accept-Ranges: bytes
Content-Length: 3123
Connection: close
Content-Type: text/html
...
The telnet command connects to www.scs.cs.nyu.edu on port 80, the default port for HTTP (web) servers. The line GET / HTTP/1.0 is a request to get the web page /. The blank line ends the header section of the request. The server then locates the web page and sends it back. You should see it on your screen.
To form the path to the file to be retrieved on a server, the client takes everything after the machine name. For example, http://www.scs.cs.nyu.edu/G22.3250/index.html means we should ask for the file /G22.3250/index.html. If you see a URL with nothing after the machine name, then / is assumed---the server figures out what page to return when just given /. Typically this default page is /index.html.
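That splitting of a URL into machine name, port, and path can be sketched as follows (a hypothetical helper for illustration only; in the assignment the httpreq class does this parsing for you):

```cpp
#include <cstdlib>
#include <string>

// Split "http://host[:port]/path" into host, port, and path.
// If no path is given, "/" is assumed; if no port, 80.
// Hypothetical helper; assumes url has already been checked to
// start with "http://".
void
split_url (const std::string &url, std::string &host,
           unsigned short &port, std::string &path)
{
  std::string rest = url.substr (7);               // skip "http://"
  std::string::size_type slash = rest.find ('/');
  std::string hostport =
    slash == std::string::npos ? rest : rest.substr (0, slash);
  path = slash == std::string::npos ? "/" : rest.substr (slash);

  std::string::size_type colon = hostport.find (':');
  if (colon == std::string::npos) {
    host = hostport;
    port = 80;
  } else {
    host = hostport.substr (0, colon);
    port = (unsigned short) atoi (hostport.substr (colon + 1).c_str ());
  }
}
```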
On most servers, the HTTP server lives on port 80. However, one can specify a different port number in the URL. For example, typing http://glimpse.cs.arizona.edu:1994 in your browser will tell it to find a web server on port 1994 of glimpse.cs.arizona.edu. (This server may not exist.)
Before you can do this example, you need to tell your web browser how to use a web proxy. For lynx, wget, or Mosaic, you must set an environment variable. The following runs lynx with the proxy set to www.scs.cs.nyu.edu port 3128:
% env http_proxy=http://www.scs.cs.nyu.edu:3128/ lynx
In Netscape or Mozilla, choose ``Edit'' ---> ``Preferences''. Then choose ``Advanced'' ---> ``Proxies''. Click on ``Manual proxy configuration''. Now set the ``HTTP proxy'' to www.scs.cs.nyu.edu and port 3128. Mozilla will now send all HTTP requests to this web proxy rather than directly to web servers. Remember to revert your changes! (Not all requests will work transparently through the www.scs.cs.nyu.edu proxy.) Note also that for security reasons, you cannot use the www.scs.cs.nyu.edu proxy from outside NYU.
Now on to actually using the proxy...
You can use the nc command to peek at HTTP requests that a browser sends to a web proxy. nc lets you read and write data across network connections using UDP or TCP[4]. The class machines have nc installed.
First we'll examine the requests that a browser sends to the proxy. We'll use nc to listen on a port and direct our web browser (Lynx) to use that host and port as a proxy. We're going to let nc listen on port 8888 and tell Lynx to use a web proxy on port 8888.
% nc -l 8888
This tells nc to listen on port 8888. Chances are that you will have to choose a different port number than 8888 because someone else may be using that port. Choose a number greater than 1024 and less than 65536. Now try, on the same machine, to retrieve a web page using port 8888 as a proxy:
% env http_proxy=http://localhost:8888/ lynx -source http://www.yahoo.com
This tells Lynx to fetch http://www.yahoo.com using a web proxy on port 8888, which happens to be our spy friend nc. Netcat neatly prints out the request headers that Lynx sent:
% nc -l 8888
GET http://www.yahoo.com/ HTTP/1.0
Host: www.yahoo.com
Accept: text/html, text/plain, application/vnd.rn-rn_music_package, application/x-freeamp-theme, audio/mp3, audio/mpeg, audio/mpegurl, audio/scpls, audio/x-mp3, audio/x-mpeg, audio/x-mpegurl, audio/x-scpls, audio/mod, image/*, video/mpeg, video/*
Accept: application/pgp, application/pdf, application/postscript, message/partial, message/external-body, x-be2, application/andrew-inset, text/richtext, text/enriched, x-sun-attachment, audio-file, postscript-file, default, mail-file
Accept: sun-deskset-message, application/x-metamail-patch, application/msword, text/sgml, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.4rel.1 libwww-FM/2.14 SSL-MM/1.4.1 OpenSSL/0.9.6b
The GET request on the first line tells the proxy to get the file http://www.yahoo.com using HTTP version 1.0. Notice how this request is quite different from the example without a web proxy! The protocol and machine name (http://www.yahoo.com) are now part of the request. In the previous example this part was omitted. Look in RFC 1945 for details on the remaining lines.
The previous example shows the HTTP request. Now we'll try to see what a real web proxy (www.scs.cs.nyu.edu port 3128) sends to a web server. To achieve this, we use nc to be a fake web server. Start the ``fake server'' on a class machine, say class1.scs.cs.nyu.edu with the following command:
class1 1% nc -l 8888
Again, you may have to choose a different port number if 8888 turns out to be taken by someone else. Then, from another machine, direct Lynx through the real proxy at your fake server:

% env http_proxy=http://sure.lcs.mit.edu:3128/ lynx -source http://pain.lcs.mit.edu:8888
Needless to say, you should replace 8888 by whatever port you chose to run nc on. nc will show the following request:
class1 1% nc -l 8888
GET / HTTP/1.0
Accept: text/html, text/plain, text/sgml, video/mpeg, image/jpeg, image/tiff, image/x-rgb, image/png, image/x-xbitmap, image/x-xbm, image/gif, application/postscript, */*;q=0.01
Accept-Encoding: gzip, compress
Accept-Language: en
User-Agent: Lynx/2.8.4rel.1 libwww-FM/2.14
Via: 1.0 supervised-residence.lcs.mit.edu:3128 (Squid/2.4.STABLE4)
X-Forwarded-For: 18.26.4.76
Host: pain.lcs.mit.edu:8888
Cache-Control: max-age=259200
Connection: keep-alive
Notice how the web proxy stripped away the http://pain.lcs.mit.edu:8888 part from the request!
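The transformation on the request line can be sketched roughly as follows (a hypothetical helper for illustration; in the real assignment the httpreq parser hands you the pieces, so you won't do this string surgery by hand):

```cpp
#include <string>

// Turn a proxy-style request line such as
//   "GET http://pain.lcs.mit.edu:8888/foo HTTP/1.0"
// into the server-style line "GET /foo HTTP/1.0".
// Hypothetical sketch; assumes a well-formed GET request line.
std::string
rewrite_request_line (const std::string &line)
{
  std::string::size_type sp1 = line.find (' ');    // after the method
  std::string::size_type sp2 = line.rfind (' ');   // before the version
  std::string url = line.substr (sp1 + 1, sp2 - sp1 - 1);

  // Drop "http://host[:port]", keeping everything from the next '/'.
  std::string::size_type slash = url.find ('/', 7);
  std::string path = slash == std::string::npos ? "/" : url.substr (slash);

  return line.substr (0, sp1 + 1) + path + line.substr (sp2);
}
```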
Your web proxy will have to translate the requests that the client makes (the ones that start with ``GET http://machinename'') into requests that the server understands. You will start with some skeletal code containing an HTTP parser, which should make your life easier.
Your web proxy will take a single argument specifying the TCP port on which to listen (so different users on the same class machine can avoid conflicting with each other).
Once the request line has been received, the web proxy should continue reading the input from the client until it encounters a blank line. The proxy should then fetch the URL from the appropriate server, forward the response back to the client, and close the connection. The proxy should forward response data as it arrives, rather than buffering the entire response; this allows the proxy to handle huge responses without running out of memory.
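Two pieces of that step are worth sketching in ordinary blocking-socket code (illustrative helpers, not the skeleton's API; in the real assignment you would use httpreq_parse rather than scanning for the blank line yourself):

```cpp
#include <cstring>
#include <unistd.h>

// True once buf (the entire request read so far) contains the blank
// line ("\r\n\r\n") that terminates the headers.
bool
headers_complete (const char *buf, size_t len)
{
  for (size_t i = 3; i < len; i++)
    if (!memcmp (buf + i - 3, "\r\n\r\n", 4))
      return true;
  return false;
}

// Copy the server's response to the client as it arrives, one buffer
// at a time, so the proxy never holds the whole response in memory.
void
stream_response (int serverfd, int clientfd)
{
  char buf[8192];
  ssize_t n;
  while ((n = read (serverfd, buf, sizeof (buf))) > 0) {
    ssize_t off = 0;
    while (off < n) {
      ssize_t w = write (clientfd, buf + off, n - off);
      if (w <= 0)
        return;                 // client went away; just give up
      off += w;
    }
  }
}
```

The inner loop around write matters: a single write may accept fewer bytes than requested, so the sketch keeps writing until the whole buffer has been forwarded.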
Your web proxy has to support the GET method only [3]. A GET method takes two arguments: the file to be retrieved and the HTTP version. Additional headers may follow the request.
Enough talking. Now do something.
We have provided a skeleton webproxy project directory. It is available in ~class/src/webproxy1.tar.gz. Start by unpacking the source code in your home directory. On the class machines, you can do so with the following commands:
% tar xzf ~class/src/webproxy1.tar.gz
% cd webproxy1
% sh ./setup
automake: configure.in: installing `./install-sh'
automake: configure.in: installing `./mkinstalldirs'
automake: configure.in: installing `./missing'
configure.in: 22: required file `./ltconfig' not found
automake: Makefile.am: installing `./INSTALL'
automake: Makefile.am: installing `./COPYING'
+ autoconf
+ set +x
%
The skeleton source tree contains source files http.C, http.h, and webproxy1.C. The first two files will help you parse HTTP requests. webproxy1.C is a pretty useless server that just prints "synchronous_proxy unimplemented... goodbye". This is printed by a function called synchronous_proxy which you must now implement.
Next, you must configure the software and generate a Makefile--a set of instructions for how to compile the software. For this class, we will use the GNU autoconf and automake tools to generate Makefiles. You will also be linking against the libasync library that is part of SFS. On the class machines, generate the Makefile with the following commands:

% setenv DEBUG -g
% ./configure --with-sfs=/usr/local/os/sfs-dbg
creating cache ./config.cache
checking for a BSD compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
checking whether make sets ${MAKE}... yes
checking for working aclocal... found
checking for working autoconf... found
checking for working automake... found
checking for working autoheader... found
checking for working makeinfo... found
checking host system type... i386-unknown-openbsd3.2
...
updating cache ./config.cache
creating ./config.status
creating Makefile
creating config.h
%

It is very important that you supply the argument --with-sfs=/usr/local/os/sfs-dbg to ./configure. If you don't, things will appear to work, but you will get a version of libasync without built-in debugging sanity checks. Your assignment will be linked against debugging libraries for grading, so you want to make sure you get the benefit of the sanity checking while testing the software yourself.
Once the software is configured, you can build webproxy1 by running gmake. (Note that this is gmake with a g, and not make. At the end of the assignment you will make a software distribution that compiles with any make, but for development you must use gmake, which is GNU make.)
% gmake
c++ -DHAVE_CONFIG_H -I. ... -c /home/c/dm/webproxy1/http.C
c++ -DHAVE_CONFIG_H -I. ... -c /home/c/dm/webproxy1/webproxy1.C
/bin/sh ./libtool --mode=link c++ ... -o webproxy1 ...
mkdir .libs
c++ ... -o webproxy1 http.o webproxy1.o ...
%

That's it! You've now built webproxy1. To test it, type (for example, on class1):

class1 5% ./webproxy1 1234

In another window, you can now type:

% telnet class1.scs.cs.nyu.edu 1234
Trying 216.165.109.103...
Connected to class3.
Escape character is '^]'.
synchronous_proxy unimplemented... goodbye
Connection closed by foreign host.
%
It is often useful to compile a program in a different directory from the source code. There are several reasons for this. One may want to compile the same source tree multiple times--for example once with debugging, once without. Using two copies of the same source tree would make it a pain to keep the two compiled versions in sync. Another issue is that C++ object files and executables can get pretty large--especially with debugging information. Thus it is considerably faster to compile on a local disk when the source code is not local. Finally, when you have limited backed-up disk space, there is no reason to waste it on huge C++ executables, since these can always be recreated from the source in the event of a disk crash.
Autoconf easily supports compiling in a different directory. You simply need to run the configure script from whatever directory you wish the compile to take place. However, when a source tree is being used for out-of-directory builds, you cannot also perform an in-place build. The following example illustrates how one might compile webproxy1 out-of-directory using local disk space on the machine class2:
class2 1% cd webproxy1
class2 2% gmake distclean
rm -f config.h
rm -f *.tab.c
...
rm -f config.status
class2 3% mkdir /home/c2/scratch/student
class2 4% cd /home/c2/scratch/student
class2 5% mkdir webproxy1
class2 6% cd webproxy1
class2 7% setenv DEBUG -g
class2 8% ~/webproxy1/configure --with-sfs=/usr/local/os/sfs-dbg
creating cache ./config.cache
checking for a BSD compatible install... /usr/bin/install -c
checking whether build environment is sane... yes
...
creating Makefile
creating config.h
class2 9% gmake
c++ -DHAVE_CONFIG_H -I. -I/home/c/student/webproxy1 -I. ...
/bin/sh ./libtool --mode=link c++ -g -ansi -Wall -Wsign-compare ...
mkdir .libs
...
class2 10%

(The gmake distclean command cleans up any previous in-directory build, restoring the webproxy1 directory to its pristine state. If you never ran ./configure in that directory, you do not need to run gmake distclean. In fact, gmake distclean will fail in that case, which is fine.)
The http.C and http.h files implement an HTTP request parser. http.h defines the class httpreq, and a function httpreq_parse for filling in the fields of the structure. To parse a request, first create an httpreq object. Then, parse the (potentially incomplete) HTTP request by feeding it to:

int httpreq_parse (httpreq *resp, const char *buf, size_t len);

until the function returns a positive number, indicating that the headers are complete. buf should be the buffer that contains the (potentially incomplete) HTTP request. len is the length of the HTTP request fragment in buf. Notice that httpreq_parse needs to see the whole request you have read so far. httpreq_parse returns 1 if the HTTP request is complete, 0 if it needs more data to complete, or -1 on a parse error. httpreq_parse does not modify the contents of buf.
Once httpreq_parse returns 1, you can call---amongst others---the following methods on the resulting httpreq object. Here's a simple program that illustrates the use of httpreq:
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include "http.h"

int
main ()
{
  httpreq r;
  char buf[512];
  int ret;

  // incomplete header
  strcpy (buf, "GET http://web.mit.edu/index.html");
  ret = httpreq_parse (&r, buf, strlen (buf));
  printf ("ret %d file %s\n", ret, ret > 0 ? r.path () : "(none)");

  // complete header
  strcat (buf, " HTTP/1.0\r\n\r\n");
  ret = httpreq_parse (&r, buf, strlen (buf));
  printf ("ret %d file %s\n", ret, ret > 0 ? r.path () : "(none)");

  exit (0);
}
You may want to read Using TCP through sockets to learn about socket programming in C/C++. See also the BSD IPC manual for an alternate treatment of socket programming.
Your proxy program should take exactly one argument, a port number on which to listen. For example, to run the proxy on port 2000:
% ./webproxy1 2000
As a first test of the proxy you should attempt to use it to browse the web. Set up your web browser to use one of the class machines running your proxy as a proxy and experiment with a variety of different pages.
When you think your proxy is ready, you can run it against the test program test-webproxy1, our tester, the source of which is in ~class/test-webproxy1.C. Run the tester with your proxy as an argument:
% test-webproxy1 ./webproxy1
Note that this may take several minutes to complete. The test program runs the following tests:
This test is the "normal case". We send a normal HTTP 1.0 GET request and expect the correct web page.
This test splits the HTTP request in two chunks. The first chunk contains a partial HTTP request. The second chunk completes the first, after which the tester expects the correct web page contents to come back.
The tester does a request of exactly 65535 bytes.
The tester fetches a web page larger than the maximum amount of memory available to your web proxy.
The tester fetches a web page without a body.
The tester sends a request with a URL that specifies a false port. Your proxy will attempt to make a connection to a bogus port. Soon thereafter, the tester tries to fetch a valid page to see if your proxy is still doing ok.
The tester sends an HTTP request that is not syntactically correct. After that, it tries to fetch a valid page to see if your proxy is still doing ok.
The tester sends a partial HTTP request and then closes the connection. After that, it tries to fetch a valid page to see if your proxy is still doing ok.
The tester swamps your proxy with a request larger than 65535 bytes. The tester expects your proxy to close the connection. After that, it tries to fetch a valid page to see if your proxy is still doing ok.
The tester stress tests your web proxy with a ruthless combination of ordinary fetches, split requests, malformed requests, and large responses. This may expose memory leaks, unclosed connections, and random other bugs.
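The swamping test suggests enforcing a hard cap while accumulating the request. One way to sketch it (the 65535 limit comes from the tester description above; the helper itself is hypothetical):

```cpp
#include <cstring>
#include <unistd.h>

// Accumulate a request from clientfd, giving up if it grows past the
// 65535-byte limit the tester swamps you with.  Returns the number of
// bytes read once the blank line ending the headers arrives, or -1 if
// the client overflowed the limit or hung up early (in either case the
// caller should close the connection and move on).  Sketch only; the
// real proxy would hand the buffer to httpreq_parse instead of
// scanning for "\r\n\r\n" by hand.
enum { MAXREQ = 65535 };

ssize_t
read_request (int clientfd, char *buf)
{
  size_t sofar = 0;
  for (;;) {
    if (sofar >= MAXREQ)
      return -1;                              // request too big
    ssize_t n = read (clientfd, buf + sofar, MAXREQ - sofar);
    if (n <= 0)
      return -1;                              // EOF or error mid-request
    sofar += n;
    for (size_t i = 3; i < sofar; i++)        // look for the blank line
      if (!memcmp (buf + i - 3, "\r\n\r\n", 4))
        return (ssize_t) sofar;
  }
}
```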
You must write all the code you hand in for the programming assignments, except for code that we give you as part of the assignment. You are not allowed to look at anyone else's solution (and you're not allowed to look at solutions from previous years). You may discuss the assignments with other students, but you may not look at or copy each others' code.
% gmake distcheck
rm -rf webproxy1-0.0
mkdir webproxy1-0.0
chmod 777 webproxy1-0.0
here=`cd . && pwd`; \
top_distdir=`cd webproxy1-0.0 && pwd`; \
distdir=`cd webproxy1-0.0 && pwd`; \
cd /home/c/dm/webproxy1 \
&& automake-1.4 --include-deps --build-dir=$here --srcdir-name=/home/c/dm/webproxy1 --output-dir=$top_distdir --gnu Makefile
chmod -R a+r webproxy1-0.0
...
gmake[1]: Leaving directory `/disk/c3/scratch/dm/webproxy1-0.0/=build'
rm -rf webproxy1-0.0
==============================================
webproxy1-0.0.tar.gz is ready for distribution
==============================================
%

To turn in your distribution, copy it to the directory ~class/handin/lab1/username, where username is your username:
% cp webproxy1-0.0.tar.gz ~class/handin/lab1/`logname`/
%

To create a script file, use the script command. When you run script, everything you type gets saved in a file called typescript. Press CTRL-D to finish the script. The typescript should be copied to the same directory as the software distribution. For example:

% script
Script started, output file is typescript
% test-webproxy1 ./webproxy1
...
% ^D
Script done, output file is typescript
% cp typescript ~class/handin/lab1/`logname`/
%

If you have any problems with submission, please contact the instructor.