Spelling Checker

Note: Projects are to be completed by each student individually (not by groups of students).

A spelling checker reads the words in a document and outputs a list of misspelled words. For each misspelled word, the program outputs the word followed by a list of the line numbers where the word is found in the document.

The program decides which words are misspelled by using a dictionary of correct words. The program searches the dictionary for each word in the document. If a word from the document cannot be found in the dictionary, the word is added to the list of misspelled words.

After you read this specification, you might want to check out the Tips at the bottom of the page.

Project Files

The following files are required to complete the project. You should download these files with a web browser. You are not allowed to modify these files in any way except for implementing methods that were intentionally left empty. Specifically, interface definitions may not be modified, and all code must remain in the cs235.spellcheck package.

spellcheck-delivery.zip : Contains all the files listed below.
SpellingChecker.java
CheckerFactory.java
SpellCheckGUI.java
TestDriver.java
words.txt
dict.txt
test.txt
short.txt
empty.txt
one-word.txt
bill-of-rights.txt
emancipation-proclamation.txt
gettysburg-address.txt
monroe-doctrine.txt
preamble.txt

NOTE: When downloading these files, use SHIFT+click (or right-click, Save As...) to download a literal copy of these text files. Do not cut-and-paste these files, or you may have problems with the TestDriver.

Dictionary File Format

The spelling checker reads one or more dictionaries of correct words. Each dictionary is a text file that contains a list of words, one word per line as shown in this example.

many
this
dogs
fly
think

Loading dictionaries is additive; if you load 200 dictionaries, all of the (unique) words in all 200 dictionaries will now be in the set of correct words. Loading a new dictionary into the spelling checker should not remove any of the words that were previously in the set of correct words (although new words that are duplicates of old words need not be added).

Document Files

The documents to be spell checked are just regular text files (i.e., sequences of characters). The spelling checker reads in the file to be spell checked and extracts all of the words from the file. Each word in the file is checked against the words in the currently loaded dictionaries in order to detect misspelled words. If a word is not found in the currently loaded dictionaries, it is considered to be misspelled. A document file might contain English prose, or it could just be a list of words. The program does the same thing in either case: 1) extracts the words in the file and 2) checks to see if they're in the dictionary.

How does the program find the words in the document file? Words are defined as sequences of letters that are separated by characters that are not letters. The characters that you need to check for are numbers, white space (spaces, tabs, new lines, etc.), periods, commas, colons, semi-colons, parentheses, apostrophes, hyphens, equals, pluses, percents, slashes, quotes, asterisks, square braces, curly braces, less than, greater than, question marks, exclamation marks, ampersands, vertical bars and underscores. (Look here for a literal list of these non-letter characters/delimiters).

Output Format

The output of the spelling checker is a sorted list of words, one word per line. Use a \n character to indicate a new line. Each word is lower case and followed by a colon and a comma-separated list of line numbers where the word is found in the document. Make sure to remember the space after the comma.

short: 1, 2
this: 1
very: 2

Requirement #1: Write a Class that Implements the `cs235.spellcheck.SpellingChecker` Interface

The file SpellingChecker.java defines an interface named SpellingChecker. SpellingChecker has two methods. The first method is loadDictionary which loads a dictionary file into the spelling checker. More than one dictionary can be loaded into the spelling checker dictionary. The spelling checker uses the union of all the dictionaries that are loaded. The second method is checkDocument which reads the words from a document and checks them against the loaded dictionaries. When you look for a word in the dictionary, the comparison should not be case-sensitive.

You are required to write a class that implements the SpellingChecker interface. You might name your class something like SpellingCheckerImpl. Put your implementation class in the cs235.spellcheck package.

Your spelling checker implementation could use a number of supporting classes such as Dictionary, Document, and Index. You could also use a number of classes from the Java class library such as Set, Map, and List. (NOTE: Set, Map, and List are interfaces. You should use one of the classes that implements them.) A Dictionary object could contain a Set of words. A Document object could have the ability to parse a document into individual words. An Index object could contain a Map of key/value pairs where the key is a word and the value is a List of line numbers. The exact implementation is left up to you.

Requirement #2: Implement the Method in the `cs235.spellcheck.CheckerFactory` class

The file CheckerFactory.java contains a class named CheckerFactory. This class has one method named createSpellingChecker. You must implement this method so that it creates a new instance of your spelling checker class and returns it.

Requirement #3: Test Your Code

You are responsible for testing your code and making sure that it works. To assist your testing effort (and to make it more fun), we have provided a program with a graphical user interface that allows you to interactively load dictionaries and documents, and run your spelling checker to highlight the misspelled words. This program is found in SpellCheckGUI.java.

A test driver program is also provided to help you test your code. The test driver is found in TestDriver.java. The files words.txt, dict.txt, test.txt, short.txt, empty.txt, one-word.txt, bill-of-rights.txt, emancipation-proclamation.txt, gettysburg-address.txt, monroe-doctrine.txt, preamble.txt must be in the current directory when you run the test driver.

Using the test driver should help you get your code working and help you determine when it is ready to be passed off. In fact, you are required to make your code work with the test driver before passing it off. When you pass off your program, a program similar (but not identical) to the test driver will be used to test your code.

Requirement #4: Pass Off Your Project with a TA

Go through the submission process and then meet with a TA for a pass off interview.

Tips

Remember that Java is case-sensitive. If your dictionary is all in lower-case, and your file has some words in upper-case, the upper-case words won't be in the dictionary!
When using the Scanner, you want to tokenize by non-letter characters. You tell the Scanner to delimit by non-letter characters by using the following pattern: "[^A-Za-z]+". See the API for how to use patterns with the Scanner class.
Remember, it should be possible to use the same instantiated SpellingChecker object to check several documents, make sure that misspelled words found in one run of checkDocument() do not show up in the list of misspelled words for subsequent runs unless they really should be there (ie. make sure that you clear your list of misspelled words every time you scan a new document)
Make sure that you never remove words from your dictionary of correct words when a new dictionary file is loaded, all words from all previously loaded dictionary files should still be in your dictionary after the file finishes loading.
Think carefully about how you implement the project. Think about the different abstract data types available to you and which ones are best suited to the task at hand. Just a hint, finding an element in an ArrayList or a Vector takes linear time, while finding them with the contains() method on a set or a map takes O(1) or O(log n) time on average.