Homework 3. Java web crawler

The problem

In this homework, you will write a simple web crawler in Java. As will be seen below, it is so simple that it is not a well-behaved crawler, and so it should be used only in controlled environments. It is not suitable for production use on the Internet.

Definitions

A Restricted URL (RURL) is a URI (see Internet RFC 2396) that satisfies all of the following constraints.

- The URI must be absolute. For example, cs131/index.html is not an RURL.
- The URI must not contain a fragment. For example, http://www.cs.ucla.edu/index.html#financial_aid is not an RURL.
- The scheme must be http. For example, ftp://ftp.cs.ucla.edu/pub/ is not an RURL.
- The URI must contain an authority that is a server without any userinfo. For example, http://cs131ta@www.cs.ucla.edu/ is not an RURL. Also, http:/root/classes/cs131 is not an RURL.
- The URI must not contain a query. For example, http://www.cs.ucla.edu/search.html?q=homework2 is not an RURL.
- Segments within the URI must not contain parameters. For example, http://www.cs.ucla.edu/cs131;fall2003/index.html is not an RURL.
- The URI must not contain any opaque components. For example, http:cs131/index.html is not an RURL.
- The URI must not contain apostrophe characters ('). For example, http://www.cs.ucla.edu/3_o'clock.html is not an RURL.

Every RURL is a URL. However, many URLs are not RURLs. Please see Internet RFC 2396 for the difference between URIs and URLs.

Assignment

Write a web crawler program GetRURLs that takes two arguments. The first argument is required, and is an RURL ROOT; the second is optional, and is a nonnegative integer N specifying the depth, with the default depth being 1. Your web crawler should output (to standard output) a text file listing all RURLs reachable from ROOT via N or fewer links. The output should contain each RURL on a separate line, and should not contain any textual duplicates; order is not important.
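The constraints above can be checked mechanically. Here is one possible sketch, not a definitive solution: the class name RURLCheck is our own, the apostrophe test is done on the raw string before parsing, and a full solution may need stricter checking of the server authority than java.net.URI's getHost() provides.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical sketch: test whether a string satisfies the RURL constraints.
public class RURLCheck {
    static boolean isRURL(String s) {
        if (s.indexOf('\'') >= 0)       // no apostrophe characters
            return false;
        URI u;
        try {
            u = new URI(s);             // must parse as a URI at all
        } catch (URISyntaxException e) {
            return false;
        }
        return u.isAbsolute()                     // rejects cs131/index.html
            && !u.isOpaque()                      // rejects http:cs131/index.html
            && "http".equals(u.getScheme())       // rejects ftp://...
            && u.getFragment() == null            // no #fragment
            && u.getQuery() == null               // no ?query
            && u.getUserInfo() == null            // rejects http://cs131ta@...
            && u.getHost() != null                // server-based authority required
            && (u.getPath() == null
                || u.getPath().indexOf(';') < 0); // no parameters in segments
    }
}
```

Note that the host check also rejects http:/root/classes/cs131, which has no authority at all.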
If N is zero, your web crawler should not retrieve any resources and should simply output ROOT. If the first argument is not an RURL, or if the second argument is given and is not a nonnegative integer that is less than 5, your web crawler should print a diagnostic and should exit with a nonzero status without doing any other work.

Your web crawler should use simple text comparison to test whether two URLs are duplicates. For example, it should not assume that http://www.cs.ucla.edu/index.html and http://www.cs.ucla.edu/./index.html are duplicates, even though they happen to identify the same web page.

If two RURLs overlap in the input, you should output only the RURL that starts first; if they start at the same location, you should output only the longer of the two. Several of the above examples of non-RURLs contain initial prefixes that are RURLs. In such cases, the initial prefixes should be output. For example, your web crawler should output the RURL http://www.cs.ucla.edu/index.html even when it is an initial prefix of the non-RURL http://www.cs.ucla.edu/index.html#financial_aid.

An RURL B is reachable via one link from another RURL A if A identifies a resource that, when considered as a text file, contains an instance of B. Your web crawler should not parse resources as HTML, XML, or anything else: it should merely treat each resource as a text file and look for instances of RURLs in that text file as described above.

If your web crawler cannot successfully retrieve a resource, it should print a diagnostic "cannot retrieve U" on standard error, where U is the resource's RURL, and it should otherwise behave as if the corresponding resource is empty (for example, your web crawler should not exit with nonzero status merely because of this problem). Only one such diagnostic should be printed per non-retrievable RURL. Diagnostics may be printed in any order, but may not be interleaved.
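The overlap rule above can be realized by a greedy left-to-right, longest-match scan: at each candidate position, take the longest prefix that is an RURL, then resume scanning after it. The sketch below is illustrative only (the class name Extract is our own, and the inner loop is quadratic, so a real solution would want something faster); isRURL is a sketch of the validity test from the Definitions section.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: extract RURLs from a resource treated as plain text.
// Greedy longest-match realizes the overlap rule: the RURL that starts first
// wins, and at equal starting positions the longer one wins.
public class Extract {
    // Sketch of the RURL validity test from the Definitions section.
    static boolean isRURL(String s) {
        if (s.indexOf('\'') >= 0)
            return false;
        URI u;
        try {
            u = new URI(s);
        } catch (URISyntaxException e) {
            return false;
        }
        return u.isAbsolute() && !u.isOpaque()
            && "http".equals(u.getScheme())
            && u.getFragment() == null && u.getQuery() == null
            && u.getUserInfo() == null && u.getHost() != null
            && (u.getPath() == null || u.getPath().indexOf(';') < 0);
    }

    static List extractRURLs(String text) {
        List found = new ArrayList();       // Java 1.4 style: no generics
        int i = 0;
        while ((i = text.indexOf("http:", i)) >= 0) {
            int end = -1;
            // Find the longest prefix starting at i that is an RURL, if any.
            for (int j = text.length(); j > i; j--)
                if (isRURL(text.substring(i, j))) { end = j; break; }
            if (end >= 0) {
                found.add(text.substring(i, end));
                i = end;                    // resume after the match: no overlaps
            } else {
                i++;                        // no RURL starts here
            }
        }
        return found;
    }
}
```

On the non-RURL http://www.cs.ucla.edu/index.html#financial_aid, for example, this scan yields the initial-prefix RURL http://www.cs.ucla.edu/index.html, as the specification requires.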
Your web crawler may accept additional optional arguments that are of use to you during debugging.

Your web crawler need not obey A Standard for Robot Exclusion; however, you should be very careful when using your web crawler to avoid excess use of network resources.

Your web crawler should not have arbitrary restrictions on the length of input lines. Also, it should not assume that the last input line ends in a newline character.

To turn in your assignment, submit a file GetRURLs.java containing your class definitions. The first line of GetRURLs.java should be a comment containing your name and student ID. Make sure that your code works with the Java 1.4.2 installation on SEASnet. The command java -version should output the following text:

java version "1.4.2"
Java(TM) 2 Runtime Environment, Standard Edition (build 1.4.2-b28)
Java HotSpot(TM) Client VM (build 1.4.2-b28, mixed mode)
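One way to meet the input requirements is to avoid line-oriented reading altogether: filling a fixed-size char buffer imposes no limit on line length and is indifferent to whether the last line ends in a newline. The sketch below (class name Fetch is our own choice, and it uses only APIs available in Java 1.4) also shows the required failure behavior: diagnose each non-retrievable RURL once on standard error and treat its resource as empty.

```java
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.net.URL;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch: retrieve a resource as plain text.
public class Fetch {
    private static final Set failed = new HashSet(); // RURLs already diagnosed

    static String retrieve(String rurl) {
        try {
            Reader in = new InputStreamReader(new URL(rurl).openStream());
            StringBuffer sb = new StringBuffer();     // Java 1.4: no StringBuilder
            char[] buf = new char[8192];
            int n;
            while ((n = in.read(buf)) > 0)            // no notion of "lines" at all
                sb.append(buf, 0, n);
            in.close();
            return sb.toString();
        } catch (IOException e) {
            if (failed.add(rurl))                     // diagnose each RURL only once
                System.err.println("cannot retrieve " + rurl);
            return "";                                // behave as if the resource is empty
        }
    }
}
```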
Examples

Here is a shell transcript to illustrate the usage. Note that the order of the output lines may differ in your implementation. This transcript assumes that you are running bash, which is a more standard shell than the usual csh that SEASnet defaults to. The trailing "; echo $?" causes bash to print the exit status of the previous command.

> bash
$ java GetRURLs ftp://www.cs.ucla.edu/index.html; echo $?
GetRURLs: error: root is not an RURL
1
$ java GetRURLs http://www.cs.ucla.edu/index.html 5; echo $?
GetRURLs: error: depth out of range
1
$ java GetRURLs http://www.cs.ucla.edu/index.html ouch; echo $?
GetRURLs: error: depth is not an integer
1
$ java GetRURLs http://www.cs.ucla.edu/classes/fall03/cs131/hw/hw3.html; echo $?
http://cs131ta
http://ftp.rfc-editor.org/in-notes/rfc2396.txt
http://www.cs.ucla.edu/3_o
http://www.cs.ucla.edu/classes/fall03/cs131/hw/hw3-answer.html
http://www.cs.ucla.edu/classes/fall03/cs131/hw/hw3.html
http://www.cs.ucla.edu/classes/winter03/cs131/midterm/2a.txt
http://www.cs.ucla.edu/classes/winter03/cs131/midterm/2q.txt
http://www.cs.ucla.edu/cs131
http://www.cs.ucla.edu/./index.html
http://www.cs.ucla.edu/index.html
http://www.cs.ucla.edu/search.html
http://www.gnu.org/software/bash/
http://www.robotstxt.org/wc/norobots.html
http://www.w3.org/1999/xhtml
0
$ java GetRURLs http://www.cs.ucla.edu/classes/fall03/cs131/hw/hw3.html 2; echo $?
cannot retrieve http://cs131ta
cannot retrieve http://www.cs.ucla.edu/3_o
cannot retrieve http://www.cs.ucla.edu/classes/fall03/cs131/hw/hw3-answer.html
cannot retrieve http://www.cs.ucla.edu/cs131
cannot retrieve http://www.cs.ucla.edu/search.html
http://cs131ta
http://ftp.rfc-editor.org/in-notes/rfc2396.txt
http://www.cs.ucla.edu/3_o
http://www.cs.ucla.edu/classes/fall03/cs131/hw/hw3-answer.html
http://www.cs.ucla.edu/classes/fall03/cs131/hw/hw3.html
http://www.cs.ucla.edu/classes/winter03/cs131/midterm/2a.txt
http://www.cs.ucla.edu/classes/winter03/cs131/midterm/2q.txt
http://www.cs.ucla.edu/cs131
http://www.cs.ucla.edu/./index.html
http://www.cs.ucla.edu/index.html
http://www.cs.ucla.edu/search.html
http://www.gnu.org/software/bash/
http://www.robotstxt.org/wc/norobots.html
http://www.w3.org/1999/xhtml
http://www.chr.ucla.edu/chr/tabs/frameset_main_jobs.html
http://www.cs.ucla.edu/classes/ta/faq.html
http://www.cs.ucla.edu/csd/CS_handbook/tofc.html
http://www.cs.ucla.edu/grad
http://www.cs.ucla.edu/ugrad
http://www.registrar.ucla.edu/schedule/
http://www.seasoasa.ucla.edu/adm_ugrad.html
http://www.seasoasa.ucla.edu/curric.html
… [Standard output truncated in the interest of brevity.]
0
$ #
$ #
$ # The following test cases were added on 2003-11-06.
$ #
$ java GetRURLs http://www.cs.ucla.0rg/index.html; echo $?
GetRURLs: error: root is not an RURL
1
$ java GetRURLs http://10.0.0.0.0/index.html; echo $?
GetRURLs: error: root is not an RURL
1
$ java GetRURLs http://www.cs.ucla.edu/classes/winter03/cs131/midterm/1a.txt 1; echo $?
http://www.cs.ucla.edu/classes/winter03/cs131/midterm/1a.txt
0
$ java GetRURLs http://131.179.128.22:/classes/winter03/cs131/midterm/1a.txt 1; echo $?
http://131.179.128.22:/classes/winter03/cs131/midterm/1a.txt
0
$ java GetRURLs http://www.cs.ucla.edu.:080/classes/winter03/cs131/midterm/1a.txt 1; echo $?
http://www.cs.ucla.edu.:080/classes/winter03/cs131/midterm/1a.txt
0
$ java GetRURLs http://www.cs.ucla.edu:4294967376/classes/winter03/cs131/midterm/1a.txt 1; echo $?
cannot retrieve http://www.cs.ucla.edu:4294967376/classes/winter03/cs131/midterm/1a.txt
http://www.cs.ucla.edu:4294967376/classes/winter03/cs131/midterm/1a.txt
0
$ java GetRURLs http://131.179.128.22:18446744073709551696/classes/winter03/cs131/midterm/1a.txt 1; echo $?
cannot retrieve http://131.179.128.22:18446744073709551696/classes/winter03/cs131/midterm/1a.txt
http://131.179.128.22:18446744073709551696/classes/winter03/cs131/midterm/1a.txt
0
© 2003 Paul Eggert. See copying rules. $Id: hw3.html,v 1.6 2003/11/09 07:55:50 eggert Exp $