XML Parser

Note: Projects are to be completed by each student individually (not by groups of students).

After you read through the project description below, you may wish to read the Tips at the bottom of this page.

You can download all of the source files for this project in a zip file here.

Objectives

This project has the following objectives:
  1. Gain experience implementing general tree structures using a first-child/next-sibling representation
  2. Gain experience implementing doubly-linked lists
  3. Gain additional experience writing recursive algorithms
  4. Gain additional experience with inheritance
  5. Develop your ability to successfully complete increasingly complex projects

Pre-Requisites

In order to understand this specification, you need to be familiar with XML and the XML tokenizer that is provided with the project files. These topics are discussed in the following documents, which should be read before continuing:

Introduction to XML
XML Tokenizer

Introduction

Programs that operate on XML-formatted data need tools to help them read and write XML files. For example, the XML tokenizer provided for this project makes it easier to read XML files. While an XML tokenizer is useful, it is still fairly low-level and somewhat tedious to use. We can make processing XML data even easier by building an XML parser. The input to an XML parser is an XML file, and the output is a tree structure stored in memory that represents the structure and content of the original XML document. Programs can then operate on this tree structure to process the information that was contained in the original document. It is much easier to manipulate a tree structure like this than it is to directly use XML-formatted data. Such a tree structure is also easier to use than the stream of tokens that is returned by an XML tokenizer because the tree stores the entire document in memory all at once, making it easy for the program to operate on the document. The operation of an XML parser is demonstrated below.



In addition to reading existing XML files, programs also need to create (or write) new XML files. This is essentially the reverse of the parsing process shown in the previous figure. In this case, a program would create a tree structure in memory that represents the document to be created, and then call an XML printer to take the tree structure and store it to a file in XML format. This process is shown below.



This project requires you to write an XML parser and an XML printer. Your XML parser should use the provided XML tokenizer to read the contents of an XML file and create a tree structure that represents the document. In the other direction, your XML printer will be responsible for taking a document tree and writing it to a file in XML format.

Reading and Writing XML Files

The entry point to your XML parser and XML printer is provided by the ParserFactory class, which is defined as follows:

public class ParserFactory {
   public static Parser createParser() {}
   public static Document createDocument() {}
   public static Element createElement(String tagName) {}
   public static Text createText(String data) {}

}

A program that needs to read an XML file can call ParserFactory.createParser, which returns a Parser object. The Parser object implements the following interface:
public interface Parser {
   Document parse(InputStream input) throws IOException, XMLException;
}

The program calls Parser.parse, passing it an input stream containing the XML file, and the method returns a Document object that stores a tree structure representing the original XML document. The program can then traverse the nodes of the tree to accomplish its intended functionality.

A program that needs to create a new XML file calls ParserFactory.createDocument, which returns a Document object. The returned Document object is the root node of an empty document tree. The program can then create new Element and Text nodes and insert them into the document tree. After creating a document tree with the desired structure, the program calls Document.write to write the document to a file in XML format.

Of course, it is up to you to write classes that implement the Parser and Document interfaces.

Document Trees

A document tree consists of nodes that represent the structure of the document and store all of the information in the document. When reading or writing XML files, each node in the document tree is one of three kinds:
  1. A Document node is at the root and represents the entire XML document
  2. Element nodes represent XML elements. Each Element node stores the elements name and its attribute names and values.
  3. Text nodes represent text content that goes between tags. Each Text node stores a text string.
Document and Element nodes may have child nodes underneath them in the tree, but Text nodes may not.

Each different kind of node is represented by a separate interface.

Document Nodes

public interface Document extends Node {
   void write(OutputStream output) throws IOException;
}

The Document interface represents the root of a document tree. It does not store any real data, but it does provide a method for writing the document to a file. A program can call ParserFactory.createDocument to create a new document, and then call ParserFactory.createElement and ParserFactory.createText to create Element and Text nodes for insertion into the tree. Once a complete tree has been created, it can be written to a file by calling Document.write.

Element Nodes

 
public interface Element extends Node {
   String getTagName();
   void setTagName(String tagName);

   String getAttributeValue(String name);
   void setAttribute(String name, String value);

   int getAttributeCount();
   String getAttributeName(int i);
   String getAttributeValue(int i);
}
The Element interface represents a tree node that stores information about an XML element. It has methods for getting and setting the elements name, as well as for getting and setting its attributes.

Text Nodes

public interface Text extends Node {
   String getData();
   void setData(String data);
}
The Text interface represents a tree node that stores text content that appears between tags in the document. It has methods for getting and setting the text string that is stored by the node.

The Node Interface

You may have noticed that the Document, Element, and Text interfaces all extend an interface named Node. Node defines a bunch of methods that are common to all types of nodes. Rather than duplicating these methods on each interface, they are defined once on Node and then inherited by Document, Element, and Text. The inheritance relationship looks like this:



The methods defined on Node allow us to do three things:
  1. Determine the type of the node (Document, Element, or Text)
  2. Traverse the nodes in the tree
  3. Modify the structure of the tree by inserting, deleting, and replacing nodes in the tree
The Node interface is based on a first-child/next-sibling representation of the document tree, as can be seen below.
public interface Node {
   static final int DOCUMENT_NODE = 1;
   static final int ELEMENT_NODE = 2;
   static final int TEXT_NODE = 3;

   int getType();

   Node getParent();
   Node getFirstChild();
   Node getLastChild();
   Node getNextSibling();
   Node getPreviousSibling();

   void insertChildBefore(Node newChild, Node child);
   void appendChild(Node newChild);
   void removeChild(Node child);
   void replaceChild(Node newChild, Node oldChild);
}

If you have a reference to a Node, you can call the getType method to determine what type of node it is (DOCUMENT_NODE, ELEMENT_NODE, or TEXT_NODE). You can call its getParent, getFirstChild, getLastChild, getNextSibling, and getPreviousSibling methods to access the node's parent and child nodes.Finally, you can modify the nodes list of children by calling the insertChildBefore, appendChild, removeChild, and replaceChild methods. Remember, these methods are available on all types of nodes through inheritance.

In a valid document tree, the Document node at the root will have exactly one child. This child will be an Element node that represents the top-level element in the XML file. The children of an Element node may be either Text nodes or nested Element nodes depending on the contents of the element (the stuff between its start and end tags). It is also possible for an Element to have no children at all, which is the case for empty elements. Text nodes never have children.

Document Tree Example

Here we show an example that demonstrates what the document tree would look like for a particular XML file. Suppose that you have an XML file containing the following student record:
<student id="123456789" gpa='3.56' phone="(801)375-1234">
   <name>Bill White</name>
   <address>300 West 721 North Provo, UT 84604</address>
   <major>Computer Science</major>
</student>

The document tree for this file would look like this: (click on the image to enlarge)

Code Examples

This section provides some code examples that demonstrate how the classes and interfaces described in previous sections work together to let programs read and write XML files.

Parsing an XML File

This method shows how to parse an XML file.
Document readXMLFile(String filename) throws Exception {
   FileInputStream input = new FileInputStream(fileName);
   Parser parser = ParserFactory.createParser();
   Document doc = parser.parse(input);
   input.close();
   return doc;
}

Counting Students

This method counts the number of <student> elements that appear in an XML file named "class-roll.xml".
Document doc = readXMLFile("class-roll.xml");
int count = countStudents(doc);
System.out.println("There are " + count + " students");
 
int countStudents(Node node) {
   int count = 0;
   if (node.getType() == Node.ELEMENT_NODE) {
      Element elem = (Element)node;
      if (elem.getTagName().equals("student")) {
         ++count;
      }
   }
   Node child = node.getFirstChild();
   while (child != null) {
      count += countStudents(child);
      child = child.getNextSibling();
   }
   return count;
}

Creating an XML File

This method shows how to create an XML file with the following contents:
<student id="123456789" gpa='3.56' phone="(801)375-1234">
   <name>Bill White</name>
   <address>300 West 721 North Provo, UT 84604</address>
   <major>Computer Science</major>
</student>

void createStudentFile() throws IOException {
   Document doc = ParserFactory.createDocument();
   Element student = ParserFactory.createElement("student");
   student.setAttribute("id", "123456789");
   student.setAttribute("gpa", "3.56");
   student.setAttribute("phone", "(801)375-1234");
   doc.appendChild(student);
   student.appendChild(ParserFactory.createText("\n "));
   Element name = ParserFactory.createElement("name");
   name.appendChild(ParserFactory.createText("Bill White"));
   student.appendChild(name);
   student.appendChild(ParserFactory.createText("\n "));
   Element address = ParserFactory.createElement("address");
   address.appendChild(ParserFactory.createText("300 West 721 North Provo, UT 84604"));
   student.appendChild(address);
   student.appendChild(ParserFactory.createText("\n "));
   Element major = ParserFactory.createElement("major");
   major.appendChild(ParserFactory.createText("Computer Science"));
   student.appendChild(major);
   student.appendChild(ParserFactory.createText("\n"));
   FileOutputStream output = new FileOutputStream("student-file.xml");
   doc.write(output);
   output.close();
}

Requirement #1: Write a Class that Implements the Parser Interface

The file Parser.java defines the Parser interface. You will also need the file named XMLException.java. You are required to write a class that implements the Parser interface. For example, you might name your class ParserImpl. You should put your implementation class in the cs235.xml package.

For each method on the Parser interface, you should read the comments that accompany the method in Parser.java, and ensure that your implementation of the method works according to the comments.

Your parser should use the provided XML tokenizer to parse the basic units of information from an XML file. As the tokenizer returns the various tokens to the parser, the parser should build up the tree structure that accurately represents the document.

Requirement #2: Detect Unbalanced Tags

Your parser must check to make sure that all of the start tags and end tags in the XML file are properly balanced. When your parser detects unbalanced tags, it should throw an XMLException. This is in addition to the XML errors that are already detected by your XML tokenizer.

Here are some examples of files in which the tags are not properly balanced.
Example 1:

<dog>
</cat>

Example 2:

<dog>
<cat>
</dog>
</cat>

Example 3:

<dog>
<cat>
</cat>

Example 4:

<cat>
</cat>
</dog>

Every start tag must have a corresponding end tag. Every end tag must have a corresponding start tag. Elements must also be properly nested (i.e., an end tag must match the most recent start tag that has not already been closed). When these rules are violated, your parser must throw an XMLException.

Requirement #3: Write Classes that Implement the Document, Element, and Text Interfaces

The Document, Element, and Text interfaces are defined in the files Document.java, Element.java, and Text.java, respectively. You will also need the file named Node.java. You are required to write classes that implement the Document, Element, and Text interfaces. For example, you might name your classes DocumentImpl, ElementImpl, and TextImpl. You should put your implementation classes in the cs235.xml package.

For each method on these interfaces, you should read the comments that accompany the method and ensure that your implementation works according to the comments.

Requirement #4: Implement the Methods on the ParserFactory Class

Implement the factory methods on the ParserFactory class so that they return instances of your implementation classes. This class is found in ParserFactory.java.

Requirement #5: Test Your Code Using the TestDriver Program

The files TestDriver.java and TestNode.java contain a test program that you can use to automatically test your code and ensure that it works properly. The program does not require any command-line arguments, but it does depend on several files whose names end in .xml and .ser which can be downloaded from the project web page. These files must be in the same directory from which you run the TestDriver program.

Note: All specifications on this page are required for pass off. The provided TestDriver does not test the Document class, nor does it test the Node class adequately, but these classes will be tested thoroughly during pass-off. You will need to do your own testing in these situations.

Requirement #6: Pass Off Your Project With a TA

Go through the submission process and meet with a TA for your pass off interview.

Implementation Advice

You are required to implement four interfaces: Parser, Document, Element, and Text. Each of these interfaces should have a corresponding class that implements the interface: ParserImpl, DocumentImpl, ElementImpl, TextImpl. Additionally, it is a good idea to also write a class that implements the Node interface (probably named NodeImpl) that will implement all of the Node methods in one place. You should make DocumentImpl, ElementImpl, and TextImpl subclasses of NodeImpl so that they can inherit all of its method implementations. If you do it this way, your inheritance hierarchy will look like this: You should implement your document trees using a first-child/next-sibling tree representation. The design of the Node interface assumes that this is what you are going to do. Although it is possible to use a representation other than first-child/next-sibling, it would be unwise to do so. Additionally, you will learn more if you use first-child/next-sibling. If you use first-child/next-sibling, your NodeImpl class should look something like this:

class NodeImpl implements Node {

   NodeImpl parent;
   NodeImpl prevSibling;
   NodeImpl nextSibling;
   NodeImpl firstChild;
   NodeImpl lastChild;

   public NodeImpl() {...}
   public int getType() {
      // this should be overridden in DocumentImpl, ElementImpl,
      // and TextImpl to return the appropriate type
      throw new UnsupportedOperationException();
   }

   public Node getParent() {
      return parent;
   }

   public Node getFirstChild() {
      return firstChild;
   }

   public Node getLastChild() {
      return lastChild;
   }

   public Node getNextSibling() {
      return nextSibling;
   }

   public Node getPreviousSibling() {
      return prevSibling;
   }

   public void insertChildBefore(Node newChild, Node child) {...}
   public void appendChild(Node newChild) {...}
   public void removeChild(Node child) {...}
   public void replaceChild(Node newChild, Node oldChild) {...}
}

Tips

  1. One of the trickiest parts of this lab is getting your node manipulation methods to work right. The test/passoff drivers call each of these methods several times on your parse tree in such a way that, if all of these methods are working correctly, the tree should come out identical to how it was before the manipulations. If any of these methods (removeChild(), appendChild(), insertChildBefore(), getLastChild(), getFirstChild(), etc.) does not work perfectly, whole sections of your tree will be obliterated or repeated in the post-manipulations tree.
  2. You'll need your Node classes before you can actually get anywhere with your parser. You can write the parser first, you know the functionality of the nodes based on the interfaces they implement, but you won't be able to compile/run it until you have working Node classes. I recommend writing/testing your Node classes thoroughly before beginning to write the parser.
  3. Make plenty of use of the TestNode class that is provided for download on the Project Description page. It can be a great tool in helping you trace problems in your tree. It has an excellent toString() method that prints out a node and it's children (if you pass it in the node at the root of the tree, it will give a String representation of the whole tree). It also prints out white space characters as their corresponding escape sequences. To create a TestNode, just pass it's constructor any object that is an instance of a class that implements the Node interface. Here's an example of how to use the TestNode class, in this example, doc is an instance of a class that implements the Node and Document interfaces (i.e. it is the root of a document parse tree):

    TestNode myTest = new TestNode(doc);
    System.out.println(myTest.toString());
  4. If you write your Node classes correctly, you should never have to mess with children or sibling pointers in the parser. To add a node to the tree, you should always use one of the methods on the Node classes (like append child).
  5. The basic idea of creating the parse tree is this: Somehow a node is kept track of which is the node that all tokens encountered are being added onto as children. When a start tag token is encountered, a new Element Node is created and that becomes the current node. When an end tag token is encountered, the parent of the current node becomes the current node.
  6. The Tokenizer will throw an IllegalStateException if you try to call a method out of place. For example calling getText() on a StartTag is not needed and impossible. These errors can often occur if you forget to add breaks in your switch statements.