Thursday, July 14, 2005

What a Tangled Web

This is one of those posts that won't exactly appeal to anyone I hang out with. To them, it will seem trivially simple, and not worth talking about. I'd agree, except I feel like teaching something about the Web, and this is something I found surprising when I first learned about it.

You're reading this blog in your trusty web browser. Have you wondered how the webpage arrived there? What is going on that allows you to see this page? A browser is a sophisticated program, but it lets you perform powerful operations rather simply. For example, if you know where my webpage is located, you type the URL into the browser, and bam, you get a webpage.

What is a URL anyway? In a nutshell, it's a name for some location on the web. To find that location, one needs to consult the equivalent of a directory service. Think of a phone book. It maps names to phone numbers. You use the number to call up whomever. Similarly, there's a service (the Domain Name System, or DNS) that converts, say, www.blogger.com to an "address" called an IP address.
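If you want to see this "phone book" lookup for yourself, here's a minimal sketch using Python's standard library. The hostname is just an example; any name you'd type into a browser works.

```python
# Ask DNS (the web's phone book) to turn a name into an IP address.
# "www.blogger.com" is only an example hostname.
import socket

address = socket.gethostbyname("www.blogger.com")
print(address)  # prints something like 66.102.7.191 (the exact number varies)
```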

Then, your browser attempts to connect (this is like making a phone call) to the server. "Connect" is perhaps not quite the right word. It sends a message to the server, such as "give me a webpage." If the server is a webserver and the page is legitimate, it sends the page back.

Now what's a server? A server is a program running on a computer connected to the Internet. A server typically has a set of tasks it can perform, such as fetching webpages. For your request to be understood, both your browser (which is called a client) and the server have to agree on how to talk to each other. Browsers and servers agree to talk to each other using a protocol called HTTP.
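Here's a sketch of that client-server conversation, using Python's built-in http.client module. The hostname is an arbitrary example; the point is simply that the client asks and the server answers.

```python
# A browser's side of the conversation: connect, ask for a page, read the reply.
import http.client

connection = http.client.HTTPConnection("www.example.com")
connection.request("GET", "/")           # "give me a webpage"
response = connection.getresponse()      # the server's reply
print(response.status, response.reason)  # e.g. 200 OK
page = response.read()                   # the page itself, as bytes of HTML
print(page[:200])                        # peek at the first part of it
connection.close()
```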

Perhaps you've heard of people saying that computers are merely 0's and 1's. That's technically true. However, those 0's and 1's mean something. It's just like saying people who communicate only ever use the English alphabet, numbers, and maybe some punctuation. Yes, but you can combine them to make words.

Unlike people, computers are dumb, but extremely quick and precise. So if you make a very precise request using an agreed-upon protocol (such as HTTP), then a server that knows that protocol can respond. Even if the server understands what the browser is asking of it, there's no guarantee that it can provide it. To give an analogy, there are some libraries, mainly university libraries, where you tell the librarian what books you want, and he goes and gets them for you. Suppose you give him the call numbers of the books you want.

It's possible that someone has already checked out the book, or that the call number is bogus. Either way, even by following the rules, the librarian, who acts as the server, may not be able to fulfill the request.
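You can see the "bogus call number" case in the same small sketch as before. The path below is deliberately made up, so the server understands the request perfectly well but (most likely) has nothing to hand back.

```python
# The server speaks HTTP fine, but the "call number" points at nothing.
import http.client

connection = http.client.HTTPConnection("www.example.com")
connection.request("GET", "/no-such-page-here")  # a made-up path
response = connection.getresponse()
print(response.status, response.reason)          # typically 404 Not Found
connection.close()
```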

How many of you even realized there were servers? Before the days of online shopping, you could easily have imagined that your computer had "access" to webpages elsewhere and somehow "retrieved" them. However, a more accurate model is that there is something (in this case a server) that processes your request and sends the result back.

Have you ever heard of a web spider? From the sound of it, a web spider is some 0's and 1's that goes from computer to computer, like a paperboy delivering papers. That's not what it is at all. A web spider is much like your browser. It makes a request to get a webpage from some server. The webpage is sent back. The spider checks the links in the webpage, and then makes requests to other servers to get the contents of those links. The only trick is to avoid getting the same pages over and over, which can happen if webpage A refers to webpage B, which refers to webpage C, which refers back to webpage A.

Thus, a web spider only ever operates from one spot, and is more like a telemarketer who sits at a phone and makes calls out to other people.
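Here's a toy spider, sketched with Python's standard library. The starting URL and the page limit are arbitrary choices; the essential trick is the "visited" set, which keeps the A-to-B-to-C-back-to-A loop from trapping it.

```python
# A toy web spider: fetch a page, collect its links, follow them,
# and remember what's already been visited so loops don't repeat forever.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, limit=10):
    visited = set()
    queue = [start_url]
    while queue and len(visited) < limit:
        url = queue.pop(0)
        if url in visited:
            continue                  # already fetched this page: skip it
        visited.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # bogus link, server down, etc.
        collector = LinkCollector()
        collector.feed(html)
        for link in collector.links:
            queue.append(urljoin(url, link))   # turn relative links into full URLs
    return visited

print(crawl("http://www.example.com/"))   # an example starting point
```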

A browser typically makes multiple requests per webpage. For example, suppose you request a webpage. The server sends back something written in, say, HTML. The browser looks at the content and notices, say, images. The images aren't actually in the file; what's there are the URLs of the images. The browser must then make additional requests to fetch the images from those URLs.

For example, suppose you have a webpage that shows your cat and dog. The files are called cat.jpg and dog.jpg, and you have some text describing the cat and dog. You give your friend the URL for your webpage. They type it into their browser. They get the file back, but it contains only text written as HTML. It doesn't contain the images at all. Instead, it has URLs for those images. The browser recognizes this, makes more requests to get the images, and then displays them in the page.

It does this so quickly, you often don't notice.
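Here's a sketch of that second round of requests. The HTML below stands in for the cat-and-dog page, and the page address is hypothetical; the code finds the image URLs that a browser would then go fetch, one request each.

```python
# What the browser receives is text plus the *URLs* of the images,
# not the images themselves. It must ask for each image separately.
from html.parser import HTMLParser
from urllib.parse import urljoin

html = """
<html><body>
  <p>My cat, asleep as usual.</p> <img src="cat.jpg">
  <p>My dog, waiting for dinner.</p> <img src="dog.jpg">
</body></html>
"""

class ImageCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.images = []
    def handle_starttag(self, tag, attrs):
        if tag == "img":
            for name, value in attrs:
                if name == "src" and value:
                    self.images.append(value)

page_url = "http://www.example.com/pets.html"   # hypothetical address of the page

collector = ImageCollector()
collector.feed(html)

for src in collector.images:
    # One extra request per image, using its full URL.
    print("would also request:", urljoin(page_url, src))
```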

Webpages are often files stored away on some computer, which you can refer to by URL. However, there are times when webpages can't be stored this way. For example, think of Google. It can't know ahead of time every single search you might make. When you ask it to locate information about baby penguins, it does a sophisticated search, determines which webpages are relevant, and returns a webpage that is made on the fly.

There's no close analogy to this, but I'll try to create one. Suppose you go to a CD store. You make a request based on a feeling, like "sad songs with piano". The guy figures out which songs fit the category and burns you a CD with those songs. The CD isn't there ahead of time because you could make any unusual request, and there's no way to stock a CD for every possible one. It's more efficient to take the request, figure out the songs, and burn the CD.

These on-the-fly webpages are called dynamic webpages. It's not because they are "dynamic" and exciting. It's because the webpage isn't created until you make the request. And the webpage isn't even saved. A temporary page is made and sent to your browser, but it isn't kept permanently on the server. If you make the same request again, it may have to recompute the webpage.
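Here's a minimal sketch of a dynamic page, using Python's built-in http.server module. Nothing is stored on disk: the HTML is built fresh for every request, from whatever appears in the URL. The port number is an arbitrary choice; run it and visit something like http://localhost:8000/baby+penguins in a browser.

```python
# A tiny server that invents a page on the fly for each request.
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import unquote

class DynamicPage(BaseHTTPRequestHandler):
    def do_GET(self):
        query = unquote(self.path.lstrip("/")).replace("+", " ")
        # Build the page right now; it is never saved anywhere.
        body = "<html><body><h1>Results for: %s</h1></body></html>" % query
        data = body.encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/html; charset=utf-8")
        self.send_header("Content-Length", str(len(data)))
        self.end_headers()
        self.wfile.write(data)

HTTPServer(("localhost", 8000), DynamicPage).serve_forever()
```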

The kind of communication I've been talking about is called client-server. Your browser is a client. A client makes requests. A server honors the requests. A protocol is used to let the client communicate requests to a server, so both sides know what's going on.

The other communication model is called peer-to-peer. This is when it's not entirely appropriate to call one side a client and the other a server, since each computer does both. Perhaps a useful example of this is BitTorrent. The basic idea behind BitTorrent is that you make a request for some file; often it's music or a video, but it can be anything really. As you make the request, pieces of the file are downloaded to your computer, possibly from many different computers. In the meanwhile, if the file is suitably popular, the pieces on your computer are being copied to other computers that also want the file you want.

Why is this good? Suppose, at any one time, there are 10,000 people who want a lengthy article about baby penguins, and you are downloading it. Since so many people are in the process of downloading it, you can get pieces of the file in parallel from many different sources. Perhaps up to 10,000 people could each provide you part of the file. Meanwhile, other people can get your pieces.

The great thing is that BitTorrent does all this for you. You don't have to figure out how to break the file into pieces. You are also both client and server: you are requesting a file, but BitTorrent is also serving it to others. This is what is meant by peer-to-peer.
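Here's a toy simulation of the peer-to-peer idea (not the real BitTorrent protocol): a file is split into numbered pieces, each peer starts with a few of them, and every peer both asks others for pieces it lacks and hands out pieces it already has. All the names and numbers here are made up for illustration.

```python
# A made-up peer-to-peer exchange: everyone is both client and server.
import random

FILE_PIECES = set(range(10))        # the complete file is pieces 0..9

class Peer:
    def __init__(self, name, pieces):
        self.name = name
        self.pieces = set(pieces)   # what this peer has so far

    def exchange(self, other):
        # Acting as a client: take one piece the other peer has and we lack.
        wanted = other.pieces - self.pieces
        if wanted:
            self.pieces.add(random.choice(sorted(wanted)))
        # Acting as a server: give the other peer one piece it lacks.
        offered = self.pieces - other.pieces
        if offered:
            other.pieces.add(random.choice(sorted(offered)))

peers = [Peer("you", {0, 1}), Peer("alice", {2, 3, 4}), Peer("bob", {5, 6, 7, 8, 9})]

rounds = 0
while not all(p.pieces == FILE_PIECES for p in peers):
    a, b = random.sample(peers, 2)   # two peers meet and swap pieces
    a.exchange(b)
    rounds += 1

print("everyone has the whole file after", rounds, "exchanges")
for p in peers:
    print(p.name, sorted(p.pieces))
```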

What's just as amazing is the Internet itself. You are at your browser at one point on the Internet. The webserver is at another point. What connects the two sides? The Internet! Fortunately, it's not so hard to understand, because there's a great analogy: the phone system. You can talk to your friends or to a business or whomever. It's the job of the phone company to provide the service that connects you. It doesn't care what is being sent over the telephone wires (for the most part). Similarly, the Internet doesn't care what is sent along it.

The Internet, interestingly enough, does not have a central command. It is distributed. The reason for this had to do with something you wouldn't expect: war. In particular, nuclear war. It was thought that if control of the network was not centralized in any one location, then an attack could not knock out the whole thing. Traffic could always be rerouted through sections of the network that were still working.

Since there is no central command, people love it, because they see the Internet as the freest kind of speech. This is one reason for the multitude of blogs.

There's your lesson about the web.
