5.2 The Web
If you want to communicate with another computer on the Internet then your computer needs to know the answer to three questions: What are you looking for? Where is it? And how do we get there? The computers and software that make up Internet infrastructure can help provide the answers. Let’s look at how it all comes together.
Read the following post by Coralie Mercier from World Wide Web Consortium (W3C) about the contributions of Sir Tim Berners-Lee to our lives.
Rules, Rules, Rules
The following fun video is a quick summary of many of the terms you will encounter in this chapter. The video afterward provides a different style and perspective on the same content. Both are recommended so that the sections beneath are easier to comprehend.
A Packet’s Tale
The URL: “What Are You Looking For?”
When you type an address into a Web browser (sometimes called a URL for uniform resource locator), you’re telling your browser what you’re looking for. Anatomy of a Web Address” describes how to read a typical URL.
The http:// you see at the start of most Web addresses stands for hypertext transfer protocol. A protocol is a set of rules for communication—sort of like grammar and vocabulary in a language like English. The http protocol defines how Web browser and Web servers communicate and is designed to be independent from the computer’s hardware and operating system. It doesn’t matter if messages come from a PC, a Mac, a huge mainframe, or a pocket-sized smartphone; if a device speaks to another using a common protocol, then it will be heard and understood.
The Internet supports lots of different applications, and many of these applications use their own application transfer protocol to communicate with each other. The server that holds your e-mail uses something called SMTP, or simple mail transfer protocol, to exchange mail with other e-mail servers throughout the world. FTP, or file transfer protocol, is used for—you guessed it—file transfer. FTP is how most Web developers upload the Web pages, graphics, and other files for their Web sites. Even the Web uses different protocols. When you surf to an online bank or when you’re ready to enter your payment information at the Web site of an Internet retailer, the http at the beginning of your URL will probably change to https (the “s” is for secure). That means that communications between your browser and server will be encrypted for safe transmission. The beauty of the Internet infrastructure is that any savvy entrepreneur can create a new application that rides on top of the Internet.
Hosts and Domain Names
The next part of the URL in our diagram holds the host and domain name. Think of the domain name as the name of the network you’re trying to connect to, and think of the host as the computer you’re looking for on that network.
Many domains have lots of different hosts. For example, Yahoo!’s main Web site is served from the host named “www” (at the address http://www.yahoo.com), but Yahoo! also runs other hosts including those named “finance” (finance.yahoo.com), “sports” (sports.yahoo.com), and “games” (games.yahoo.com).
Host and Domain Names: A Bit More Complex Than That
While it’s useful to think of a host as a single computer, popular Web sites often have several computers that work together to share the load for incoming requests. Assigning several computers to a host name offers load balancing and fault tolerance, helping ensure that all visits to a popular site like http://www.google.com won’t overload a single computer, or that Google doesn’t go down if one computer fails.
It’s also possible for a single computer to have several host names. This might be the case if a firm were hosting several Web sites on a single piece of computing hardware.
Some domains are also further broken down into subdomains—many times to represent smaller networks or subgroups within a larger organization. For example, the address http://www.rhsmith.umd.edu is a University of Maryland address with a host “www” located in the subdomain “rhsmith” for the Robert H. Smith School of Business. International URLs might also include a second-level domain classification scheme. British URLs use this scheme, for example, with the BBC carrying the commercial (.co) designation—http://www.bbc.co.uk—and the University of Oxford carrying the academic (.ac) designation—http://www.ox.ac.uk. You can actually go 127 levels deep in assigning subdomains, but that wouldn’t make it easy on those who have to type in a URL that long.
Most Web sites are configured to load a default host, so you can often eliminate the host name if you want to go to the most popular host on a site (the default host is almost always named “www”). Another tip: most browsers will automatically add the “http://” for you, too.
Host and domain names are not case sensitive, so you can use a combination of upper and lower case letters and you’ll still get to your destination.
Path Name and File Name
Look to the right of the top-level domain and you might see a slash followed by either a path name, a file name, or both. If a Web address has a path and file name, the path maps to a folder location where the file is stored on the server; the file is the name of the file you’re looking for.
Most Web pages end in “.html,” indicating they are in hypertext markup language. While http helps browsers and servers communicate, html is the language used to create and format (render) Web pages. A file, however, doesn’t need to be .html; Web servers can deliver just about any type of file: Acrobat documents (.pdf), PowerPoint documents (.ppt or .pptx), Word docs (.doc or .docx), JPEG graphic images (.jpg), and—as we’ll see in Chapter 13 “Information Security: Barbarians at the Gateway (and Just About Everywhere Else)”—even malware programs that attack your PC. At some Web addresses, the file displays content for every visitor, and at others (like amazon.com), a file will contain programs that run on the Web server to generate custom content just for you.
You don’t always type a path or file name as part of a Web address, but there’s always a file lurking behind the scenes. A Web address without a file name will load content from a default page. For example, when you visit “google.com,” Google automatically pulls up a page called “index.html,” a file that contains the Web page that displays the Google logo, the text entry field, the “Google Search” button, and so on. You might not see it, but it’s there.
Butterfingers, beware! Path and file names are case sensitive—amazon.com/books is considered to be different from amazon.com/BOOKS. Mistype your capital letters after the domain name and you might get a 404 error (the very unfriendly Web server error code that means the document was not found).
IP Addresses and the Domain Name System: “Where Is It? And How Do We Get There?”
The IP Address
If you want to communicate, then you need to have a way for people to find and reach you. Houses and businesses have street addresses, and telephones have phone numbers. Every device connected to the Internet has an identifying address, too—it’s called an IP (Internet protocol) address.
A device gets its IP address from whichever organization is currently connecting it to the Internet. Connect using a laptop at your university and your school will assign the laptop’s IP address. Connect at a hotel, and the hotel’s Internet service provider lends your laptop an IP address. Laptops and other end-user machines might get a different IP address each time they connect, but the IP addresses of servers rarely change. It’s OK if you use different IP addresses during different online sessions because services like e-mail and Facebook identify you by your username and password. The IP address simply tells the computers that you’re communicating with where they can find you right now. IP addresses can also be used to identify a user’s physical location, to tailor search results, and to customize advertising. See Chapter 14 “Google: Search, Online Advertising, and Beyond” to learn more.
IP addresses are usually displayed as a string of four numbers between 0 and 255, separated by three periods. Want to know which IP address your smartphone or computer is using? Visit a Web site like ip-adress.com (one “d”), whatismyipaddress.com, or ipchicken.com.
The DNS: The Internet’s Phonebook
You can actually type an IP address of a Web site into a Web browser and that page will show up. But that doesn’t help users much because four sets of numbers are really hard to remember.
This is where the domain name service (DNS) comes in. The domain name service is a distributed database that looks up the host and domain names that you enter and returns the actual IP address for the computer that you want to communicate with. It’s like a big, hierarchical set of phone books capable of finding Web servers, e-mail servers, and more. These “phone books” are called nameservers—and when they work together to create the DNS, they can get you anywhere you need to go online.
To get a sense of how the DNS works, let’s imagine that you type www.yahoo.com into a Web browser. Your computer doesn’t know where to find that address, but when your computer connected to the network, it learned where to find a service on the network called a DNS resolver. The DNS resolver can look up host/domain name combinations to find the matching IP address using the “phone book” that is the DNS. The resolver doesn’t know everything, but it does know where to start a lookup that will eventually give you the address you’re looking for. If this is the first time anyone on that network has tried to find “www.yahoo.com,” the resolver will contact one of thirteen identical root nameservers. The root acts as a lookup starting place. It doesn’t have one big list, but it can point you to a nameserver for the next level, which would be one of the “.com” nameservers in our example. The “.com” nameserver can then find one of the yahoo.com nameservers. The yahoo.com nameserver can respond to the resolver with the IP address for www.yahoo.com, and the resolver passes that information back to your computer. Once your computer knows Yahoo!’s IP address, it’s then ready to communicate directly with www.yahoo.com. The yahoo.com nameserver includes IP addresses for all Yahoo!’s public sites: www.yahoo.com, games.yahoo.com, sports.yahoo.com, finance.yahoo.com, and so on.
The system also remembers what it’s done so the next time you need the IP address of a host you’ve already looked up, your computer can pull this out of a storage space called a cache, avoiding all those nameserver visits. Caches are periodically cleared and refreshed to ensure that data referenced via the DNS stays accurate.
Distributing IP address lookups this way makes sense. It avoids having one huge, hard-to-maintain, and ever-changing list. Firms add and remove hosts on their own networks just by updating entries in their nameserver. And it allows host IP addresses to change easily, too. Moving your Web server off-site to a hosting provider? Just update your nameserver with the new IP address at the hosting provider, and the world will invisibly find that new IP address on the new network by using the same old, familiar host/domain name combination. The DNS is also fault-tolerant—meaning that if one nameserver goes down, the rest of the service can function. There are exact copies at each level, and the system is smart enough to move on to another nameserver if its first choice isn’t responding.
- The Internet is a network of networks. Internet service providers connect with one another to share traffic, enabling any Internet-connected device to communicate with any other.
- URLs may list the application protocol, host name, domain name, path name, and file name, in that order. Path and file names are case sensitive.
- A domain name represents an organization. Hosts are public services offered by that organization. Hosts are often thought of as a single computer, although many computers can operate under a single host name and many hosts can also be run off a single computer.
- You don’t buy a domain name but can register it, paying for a renewable right to use that domain name. Domains need to be registered within a generic top-level domain such as “.com” or “.org” or within a two-character country code top-level domain such as “.uk,” “.ly,” or “.md.”
- Registering a domain that uses someone else’s trademark in an attempt to extract financial gain is considered cybersquatting. The United States and other nations have anticybersquatting laws, and ICANN has a dispute resolution system that can overturn domain name claims if a registrant is considered to be cybersquatting.
- Every device connected to the Internet has an IP address. These addresses are assigned by the organization that connects the user to the Internet. An IP address may be assigned temporarily, for use only during that online session.
- We’re running out of IP addresses. The current scheme (IPv4) is being replaced by IPv6, a scheme that will give us many more addresses and additional feature benefits but is not backward compatible with the IPv4 standard. Transitioning to IPv6 will be costly and take time.
- The domain name system is a distributed, fault-tolerant system that uses nameservers to map host/domain name combinations to IP addresses.
Questions and Exercises
- Find the Web page for your school’s information systems department. What is the URL that gets you to this page? Label the host name, domain name, path, and file for this URL. Are there additional subdomains? If so, indicate them, as well.
- Go to a registrar and see if someone has registered your first or last name as a domain name. If so, what’s hosted at that domain? If not, would you consider registering your name as a domain name? Why or why not?
- Investigate cases of domain name disputes. Examine a case that you find especially interesting. Who were the parties involved? How was the issue resolved? Do you agree with the decision?
- Describe how the DNS is fault-tolerant and promotes load balancing. Give examples of other types of information systems that might need to be fault-tolerant and offer load balancing. Why?
- Research DNS poisoning online. List a case, other than the one mentioned in this chapter, where DNS poisoning took place. Which network was poisoned, who were the victims, and how did hackers exploit the poisoned system? Could this exploit have been stopped? How? Whose responsibility is it to stop these kinds of attacks?
- Why is the switch from IPv4 to IPv6 so difficult? What key principles, discussed in prior chapters, are slowing migration to the new standard?
Arnoldy, B., “IP Address Shortage to Limit Internet Access,” USA Today, August 3, 2007.
Bosker, B., “The 11 Most Expensive Domain Names Ever,” The Huffington Post, March 10, 2010.
Davis, J., “Secret Geek A-Team Hacks Back, Defends Worldwide Web,” Wired, Nov. 24, 2008.
Godin, D., “Cache-Poisoning Attack Snares Top Brazilian Bank,” The Register, April 22, 2009.
Hutchinson, J., “ICANN, Verisign Place Last Puzzle Pieces in DNSSEC Saga,” NetworkWorld, May 2, 2010.
Konrad R. and E. Hansen, “Madonna.com Embroiled in Domain Ownership Spat,” CNET, August 21, 2000.
Kotadia, M., “MikeRoweSoft Settles for an Xbox,” CNET, January 26, 2004.
Maney, K., “Tuvalu’s Sinking, But Its Domain Is on Solid Ground,” USA Today, April 27, 2004.
McCullagh, D., “Ethical Treatment of PETA Domain,” Wired, August 25, 2001.
Morson, D., “Apple VP Ive Loses Domain Name Bid,” MacWorld, May 12, 2009.
Shankland, S., “Google Tries to Break IPv6 Logjam by Own Example,” CNET, March 27, 2009.
Streitfeld, D., “Web Site Feuding Enters Constitutional Domain,” The Washington Post, September 11, 2000.
Ward, M., “Internet Approaches Addressing Limit,” BBC News, May 11, 2010.
Publisher Information by University of Minnesota is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.