XML Tokenizer

You do not need to implement TokenizerImpl; a .class file is provided for you. When compiling, include the classpath so that the compiler can find this .class file:
javac -cp ../.. *.java

At the lowest level of abstraction, an XML file can be viewed as a sequence of characters (after all, XML is just text). At a higher level of abstraction, an XML file can be viewed as a sequence of tokens, where each token is one of the following:

a start tag
an end tag
text data (a sequence of characters that appears between two tags)

When writing programs that process XML data, it is necessary to convert the raw characters in the XML file into these higher-level tokens, which are suitable for processing by the program. It is therefore convenient to have a class that reads the characters from an XML file and returns the tokens that appear in the file. We call such a class an XML tokenizer.

In order to help you implement your XML parser, we have implemented an XML tokenizer class that your parser may use to break XML data into tokens. This tokenizer does much of the hard work required to parse XML and will make writing your parser much easier. The tokenizer class is named TokenizerImpl, and it implements the Tokenizer interface shown below:

import java.io.IOException;

public interface Tokenizer {
    static final int BOF = 0;
    static final int START_TAG = 1;
    static final int END_TAG = 2;
    static final int TEXT = 3;
    static final int EOF = 4;

    int nextToken() throws IOException, XMLException;

    int getTokenType();
  
    String getTagName();

    String getAttributeValue(String name);
    int getAttributeCount();
    String getAttributeName(int i);
    String getAttributeValue(int i);

    String getText();
}

To create a TokenizerImpl object, call new TokenizerImpl and pass the constructor an InputStream that contains the XML data that you want to parse. After creating a tokenizer object, your parser can call the tokenizer's nextToken method to extract the tokens from the input stream. The nextToken method returns an integer that tells the caller what kind of token was found next. This value is one of the following token types:

START_TAG: the next token in the file was a start tag
END_TAG: the next token in the file was an end tag
TEXT: the next token in the file was a text string
EOF: this stands for "end of file", which means that there are no more tokens in the file

After nextToken returns, the tokenizer object contains additional information about the token that was just read, which may be accessed by calling the other methods on the Tokenizer interface. These methods are briefly described below:

int getTokenType()
This method returns the type of the current token, which is the same value that was returned from the last call to nextToken. This could be START_TAG, END_TAG, TEXT, or EOF. If getTokenType is called before the first call to nextToken, it returns BOF, which stands for "beginning of file".

String getTagName()
This method can only be called if the current token type is START_TAG or END_TAG. It returns the element name that appeared within the tag. For example, if the token was </student> the returned tag name would be "student".

String getAttributeValue(String name)
This method can only be called if the current token type is START_TAG. It returns the value of the attribute with the specified name.

int getAttributeCount()
This method can only be called if the current token type is START_TAG. It returns the number of attributes that appeared in the start tag.

String getAttributeName(int i)
This method can only be called if the current token type is START_TAG. It returns the name of the tag's ith attribute. The tag's attributes are numbered starting at zero according to the order in which they appear in the file.

String getAttributeValue(int i)
This method can only be called if the current token type is START_TAG. It returns the value of the tag's ith attribute. The tag's attributes are numbered starting at zero according to the order in which they appear in the file.

String getText()
This method can only be called if the current token type is TEXT. It returns the string of characters associated with the text token.
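
As an illustration, the nextToken loop and the accessor methods might be combined as in the sketch below. It is written against the Tokenizer interface only. The interface and XMLException are repeated as stand-ins so that the example compiles on its own, and FakeTokenizer and TokenDump are hypothetical names that are not part of the provided code. FakeTokenizer simply replays the tokens of <name lang="en">Bill White</name>, and its behavior (for example, returning null for an unknown attribute name) is an assumption of the fake, not a statement about TokenizerImpl.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Stand-ins so the sketch compiles on its own. Remove them when compiling
// against the provided Tokenizer, XMLException, and TokenizerImpl.
class XMLException extends Exception {
    XMLException(String message) { super(message); }
}

interface Tokenizer {
    int BOF = 0, START_TAG = 1, END_TAG = 2, TEXT = 3, EOF = 4;
    int nextToken() throws IOException, XMLException;
    int getTokenType();
    String getTagName();
    String getAttributeValue(String name);
    int getAttributeCount();
    String getAttributeName(int i);
    String getAttributeValue(int i);
    String getText();
}

// Hypothetical fake that replays: <name lang="en">Bill White</name>
class FakeTokenizer implements Tokenizer {
    private final int[] types = { START_TAG, TEXT, END_TAG, EOF };
    private int pos = -1; // -1 until nextToken is called for the first time

    public int nextToken() {
        if (pos < types.length - 1) pos++; // stay on EOF once it is reached
        return types[pos];
    }
    public int getTokenType() { return pos < 0 ? BOF : types[pos]; }
    public String getTagName() { return "name"; }
    // Returning null for an unknown name is an assumption of this fake.
    public String getAttributeValue(String name) { return "lang".equals(name) ? "en" : null; }
    public int getAttributeCount() { return 1; }
    public String getAttributeName(int i) { return "lang"; }
    public String getAttributeValue(int i) { return "en"; }
    public String getText() { return "Bill White"; }
}

class TokenDump {
    // Reads every token up to EOF and returns one description per token.
    // Each accessor is called only for the token types that allow it.
    static List<String> dump(Tokenizer t) throws IOException, XMLException {
        List<String> lines = new ArrayList<>();
        while (t.nextToken() != Tokenizer.EOF) {
            int type = t.getTokenType();
            if (type == Tokenizer.START_TAG) {
                StringBuilder sb = new StringBuilder("START_TAG " + t.getTagName());
                for (int i = 0; i < t.getAttributeCount(); i++) {
                    sb.append(' ').append(t.getAttributeName(i))
                      .append('=').append(t.getAttributeValue(i));
                }
                lines.add(sb.toString());
            } else if (type == Tokenizer.END_TAG) {
                lines.add("END_TAG " + t.getTagName());
            } else if (type == Tokenizer.TEXT) {
                lines.add("TEXT " + t.getText());
            }
        }
        return lines;
    }
}
```

With the provided classes, the same method could presumably be called as TokenDump.dump(new TokenizerImpl(inputStream)), where inputStream is any InputStream that contains the XML data.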

Example

The following example shows the tokens that the tokenizer would return for a sample XML file. Suppose that you have an XML file containing the following student record:

<student id="123456789" gpa='3.56' phone="(801)375-1234">
   <name>Bill White</name>
   <address>300 West 721 North Provo, UT 84604</address>
   <major>Computer Science</major>
</student>

For this file, the tokenizer would return the following tokens, one each time nextToken is called:

Token Type	Tag Name	Attribute Names	Attribute Values	Text
`START_TAG`	"student"	[0] "id" [1] "gpa" [2] "phone"	[0] "123456789" [1] "3.56" [2] "(801)375-1234"	-
`TEXT`	-	-	-	"\n   "
`START_TAG`	"name"	-	-	-
`TEXT`	-	-	-	"Bill White"
`END_TAG`	"name"	-	-	-
`TEXT`	-	-	-	"\n   "
`START_TAG`	"address"	-	-	-
`TEXT`	-	-	-	"300 West 721 North Provo, UT 84604"
`END_TAG`	"address"	-	-	-
`TEXT`	-	-	-	"\n   "
`START_TAG`	"major"	-	-	-
`TEXT`	-	-	-	"Computer Science"
`END_TAG`	"major"	-	-	-
`TEXT`	-	-	-	"\n"
`END_TAG`	"student"	-	-	-
`EOF`	-	-	-	-

Note: TEXT tokens include any whitespace that occurs between tags.
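
Since whitespace between tags arrives as TEXT tokens, a parser that only cares about element content may want to skip TEXT tokens that contain nothing but whitespace. One possible helper is sketched below. The class and method names are hypothetical and are not part of the provided code; a parser could pass it the result of getText() and ignore the token when it returns true.

```java
class WhitespaceFilter {
    // Returns true when the text is empty or consists only of whitespace
    // characters, as is the case for the indentation between tags.
    static boolean isWhitespaceOnly(String text) {
        for (int i = 0; i < text.length(); i++) {
            if (!Character.isWhitespace(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }
}
```

Whether skipping such tokens is appropriate depends on the parser: whitespace inside an element such as <name> may be meaningful data, so this check is best applied only where text content is not expected.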