CS575, Artificial Intelligence II
Spring 1999
Assignment 3: Decision Trees
Object
To understand the standard techniques for generating decision trees automatically
from data sets of attribute values, and to use standard software based on the
C5.0 series.
Domain
We will use recognition of handwritten characters as the domain. The task will be to
distinguish handwritten versions of a single character from other characters and/or
images.
Procedure
Create raw handwritten characters on the supplied grids. Below is
a sample of a few cells from the grid. The first three are positive examples; the last is
a negative example.
Submit the grids for scanning (at 300 pixels per inch). A raw image file will be
returned to you. It will be a bitmap image in which black pixels are represented by
an unset bit (a 0) and white pixels by a set bit (a 1). You can then read the raw image
into a program to extract the bitmaps for each character. The extraction technique will
involve a 'histogram' method, i.e. counting black pixels in each row and column. The
crosses at the grid points will be about 30 pixels across, and the lines will be 4 or 5
pixels wide. A histogram of one column of 15 cells will then look like the chart below.
A zero count for a column will indicate that the whole column is white space; by doing
the same for the rows of the image, each character in each cell can be isolated.
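The histogram method described above can be sketched as follows. This is a minimal sketch, assuming the raw scan has already been read into a 2-D list of 0/1 pixel values (with 0 meaning black, matching the raw file format); the function names and the zero threshold are illustrative choices, not part of the assignment.

```python
def runs_of_ink(counts, threshold=0):
    """Return (start, end) index pairs for maximal runs where the
    black-pixel count exceeds the threshold."""
    runs, start = [], None
    for i, c in enumerate(counts):
        if c > threshold and start is None:
            start = i
        elif c <= threshold and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(counts)))
    return runs

def segment(img):
    """Split a bitmap (2-D list, 0 = black) into per-cell sub-images
    using row and column histograms of black pixels."""
    col_counts = [sum(1 for row in img if row[x] == 0)
                  for x in range(len(img[0]))]
    row_counts = [sum(1 for v in row if v == 0) for row in img]
    cells = []
    for r0, r1 in runs_of_ink(row_counts):
        for c0, c1 in runs_of_ink(col_counts):
            cells.append([row[c0:c1] for row in img[r0:r1]])
    return cells
```

Because the characters sit in a regular grid, intersecting the row runs with the column runs carves the image into one sub-image per cell.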
Save each character from each cell as a separate image file, or at least in a form
that allows each character to be analyzed separately. Make sure you have a suitable
number of positive and negative examples, and make a split between training examples
and examples for testing the decision tree once it has been learned. To be objective,
you might choose the split with a random-number technique.
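One way to make the random split objective is to shuffle the examples with a fixed seed and cut at a chosen fraction. This is a sketch; the 70/30 ratio and the seed value are arbitrary choices, not requirements of the assignment.

```python
import random

def split_examples(examples, train_fraction=0.7, seed=42):
    """Randomly partition examples into (training, test) lists.
    A fixed seed makes the split reproducible."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

Fixing the seed lets you rerun the experiment on exactly the same split while still choosing it without human bias.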
Write a program to extract features from each character. These can start with the
histogram counts, and can range from features as simple as 'there are at least n
black pixels in this section' to sophisticated features based on connected sets of
black pixels. Try to include many features, but beware of the overfitting problem:
picking useless features will confuse the decision tree algorithm. Each feature will
have either a small number of possible values (e.g. yes/no, small/medium/large) or
it can be an integer. Consult the documentation for C5.0 for details. The output of
this program will be a list of attribute value sets, ready for input to C5.0.
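The attribute-value output can be written directly in the two-file layout that the C4.5/C5.0 family reads: a .names file describing the classes and attributes, and a .data file with one comma-separated case per line, class label last. This sketch assumes that layout; the attribute names here are made up, and you should check the C5.0 documentation for the precise .names syntax.

```python
def write_c5_files(stem, attributes, cases):
    """Write stem.names and stem.data for C5.0.
    attributes: list of (name, kind) pairs, where kind is the string
    'continuous' or a list of discrete values.
    cases: list of (feature_values, label) pairs, label 'yes'/'no'."""
    with open(stem + ".names", "w") as f:
        f.write("yes, no.\n\n")  # the two possible final decisions
        for name, kind in attributes:
            if kind == "continuous":
                f.write(f"{name}: continuous.\n")
            else:
                f.write(f"{name}: {', '.join(kind)}.\n")
    with open(stem + ".data", "w") as f:
        for values, label in cases:
            f.write(",".join(str(v) for v in values) + f",{label}\n")
```

With the files named char.names and char.data, C5.0 can then be run on the filestem char.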
Extract a decision tree from your training data using C5.0. Consult the program's
documentation and run the samples provided to get a feel for it. Use a final decision of
yes or no.
Test your tree by running the manual tester, or by writing a small program to read in
and run the rules produced by C5.0. Again consult the documentation for details.
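If you write your own small tester, one simple approach is to transcribe the learned tree into a nested structure and walk it for each test character. The tree shape and feature names below are made-up placeholders for illustration, not output from C5.0.

```python
def classify(tree, features):
    """tree is either a class label ('yes'/'no') or a tuple
    (attribute, threshold, left, right): go left if
    features[attribute] <= threshold, else right."""
    while not isinstance(tree, str):
        attr, threshold, left, right = tree
        tree = left if features[attr] <= threshold else right
    return tree

def accuracy(tree, test_cases):
    """test_cases: list of (features, true_label) pairs."""
    hits = sum(classify(tree, f) == label for f, label in test_cases)
    return hits / len(test_cases)
```

Running accuracy over the held-out positive and negative examples gives the figure to report in the deliverables.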
Deliverables
Your original grid with handwritten characters.
Your program code to extract characters from the grid.
Your program code to produce features for each character.
The data input to C5.0.
The output (as a decision tree) of C5.0, running on your character feature data.
The results of running the decision tree on test characters, both positive and negative.
Due Date
Monday, April 19th.