CS 167/467 C Programming
Fall 1999
Graduate Project: Document Analysis

 

Goal

To integrate all aspects of the C language in one medium-sized project, including design, coding, testing, and debugging.

Introduction

Document analysis aims to capture the information in a printed document so that it can be stored and processed in digital form. The whole job has several phases:

1.      scanning the original to produce a digital image

2.      locating zones in the image that contain text and/or pictures

3.      running an optical character recognition algorithm on the zones that contain text

4.      assembling the processed zones into a standard document format, such as Microsoft Word or FrameMaker

The project is the second phase: zone location. You will be provided with binary files containing image data for several pages, each with a mixture of text and pictures. The pages vary from very simple (two or three zones of text only) to complex (mixed text and picture zones with arbitrary placement on the page). Below are three such pages.

 

  

Locating zones

We will assume the document images are all black and white and consist of a two-dimensional array of pixels (picture elements). Each pixel is either black or white. A zone is then an area of mostly black pixels surrounded by all white pixels. The crucial problems are how to locate such an area and how small to allow each zone to be. Many techniques have been tried, but most rely on a fairly simple idea: using vertical and horizontal histograms. Consider a single character, such as the letter ‘a’:

 

 a


 

It consists of many black pixels in a background of white pixels. If the number of black pixels in each row and in each column were counted, we could draw a histogram of these counts, such as in Figure 1.

 

 

Figure 1 Sample histogram
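The counting step can be sketched as follows. This is only an illustration under stated assumptions: the function name compute_histograms and the helper is_black are hypothetical, the image is assumed to be held in memory row by row, and a pixel is assumed to be black when its byte is nonzero (the handout does not say which byte values mean black, so that test may need to be inverted or replaced by a threshold).

```c
#include <stddef.h>

/* Assumption: a pixel is black when its byte is nonzero. The handout does
   not state the convention, so adjust this test to match the sample files. */
static int is_black(unsigned char pixel)
{
    return pixel != 0;
}

/* Count the black pixels in every row and every column of a row-major
   image. row_counts needs room for `rows` entries, col_counts for `cols`. */
void compute_histograms(const unsigned char *image, size_t rows, size_t cols,
                        unsigned int *row_counts, unsigned int *col_counts)
{
    size_t r, c;

    for (r = 0; r < rows; r++)
        row_counts[r] = 0;
    for (c = 0; c < cols; c++)
        col_counts[c] = 0;

    for (r = 0; r < rows; r++) {
        for (c = 0; c < cols; c++) {
            /* Pixel (r, c) lives at offset r * cols + c. */
            if (is_black(image[r * cols + c])) {
                row_counts[r]++;
                col_counts[c]++;
            }
        }
    }
}
```

A single pass over the image fills both arrays, so the cost is one visit per pixel.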

 

Clearly the boundaries of the black pixel area are given by the change from no black pixels (count = 0) to some black pixels (count > 0). These give the position of the zone boundary.
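As a rough sketch of this boundary test, the hypothetical function below scans one histogram and reports the first and last positions with a nonzero count. The name find_extent and its interface are assumptions made for illustration, not part of the assignment.

```c
#include <stddef.h>

/* Find the first and last index whose count is greater than zero.
   Returns 1 and fills *first and *last, or returns 0 (leaving them
   untouched) when every count is zero. */
int find_extent(const unsigned int *counts, size_t n,
                size_t *first, size_t *last)
{
    size_t i;
    int found = 0;

    for (i = 0; i < n; i++) {
        if (counts[i] > 0) {
            if (!found) {
                *first = i;   /* change from count = 0 to count > 0 */
                found = 1;
            }
            *last = i;        /* keeps moving until the final nonzero count */
        }
    }
    return found;
}
```

Applied to the row counts it gives the top and bottom of the black area, and applied to the column counts it gives the left and right edges.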

Although we would not want to give a single character its own zone, we would want to give a paragraph a zone. The vertical histogram for a paragraph looks something like Figure 2. Each line of text has a white region surrounding it, but the paragraph itself is surrounded by a larger region of white.

Figure 2 Vertical histogram of four-line paragraph

 

The criteria for the boundaries of a zone clearly vary with the size of the characters within the zone, and the spacing between lines.
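One possible way to express such a criterion, offered only as a sketch, is a minimum gap size: white gaps shorter than the threshold are treated as line spacing inside a paragraph, while longer gaps separate zones. The names struct band, split_bands, and the min_gap parameter are hypothetical, and choosing a good threshold for each image is left open here.

```c
#include <stddef.h>

/* A run of rows [start, end], inclusive, treated as one band. */
struct band {
    size_t start;
    size_t end;
};

/* Split a row histogram into bands. Rows with a nonzero count contain ink.
   White gaps shorter than min_gap rows are absorbed into the current band;
   a gap of min_gap rows or more ends the band. Stores at most max_bands
   entries in out and returns the number stored. */
size_t split_bands(const unsigned int *counts, size_t n, size_t min_gap,
                   struct band *out, size_t max_bands)
{
    size_t i, nbands = 0, gap = 0;
    int in_band = 0;

    for (i = 0; i < n; i++) {
        if (counts[i] > 0) {
            if (!in_band) {
                if (nbands == max_bands)
                    break;              /* output array is full */
                out[nbands].start = i;
                in_band = 1;
            }
            out[nbands].end = i;
            gap = 0;
        } else if (in_band) {
            gap++;
            if (gap >= min_gap) {       /* gap is large enough to split */
                nbands++;
                in_band = 0;
                gap = 0;
            }
        }
    }
    if (in_band)
        nbands++;                       /* close a band that reaches the end */
    return nbands;
}
```

With a small min_gap each text line becomes its own band; with a larger one the lines of a paragraph merge into a single band, which is closer to the kind of zone the project asks for.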

Your program only has to calculate these histograms; there is no need to display them as charts as in Figures 1 and 2.

Which zones to find

With so much variability, it is easy to see that a program could find zones as big as the whole page (the page margins would be the boundaries) and as small as the individual characters. Your program should find zones of an in-between size, so a zone is:

·        a paragraph

·        a picture

·        a title or subtitle (regard these as special paragraphs)

·        nothing else

Designing and writing the program

Please pay attention to the design of the program as regards modularity, use of global variables, and so on. Points will be awarded for the design as reflected in the source code itself. Your program should be able to read any image file (see below), find the zones in it, and print out the pixel coordinates of all zones found.
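As one illustration of keeping the output step in its own small module, the sketch below defines a zone record and formats its four corners. Everything here is an assumption: the handout leaves the output format open, so the names struct zone and format_zone, the (row, column) ordering, and the line layout are hypothetical choices.

```c
#include <stdio.h>

/* Hypothetical zone record: inclusive pixel bounds of one rectangle. */
struct zone {
    size_t top, bottom;   /* first and last row */
    size_t left, right;   /* first and last column */
};

/* Write one line listing the four corners as (row,column) pairs in the
   order top-left, top-right, bottom-left, bottom-right.
   Returns the value of snprintf: the length of the formatted line. */
int format_zone(char *buf, size_t size, size_t index, const struct zone *z)
{
    return snprintf(buf, size,
                    "zone %lu: (%lu,%lu) (%lu,%lu) (%lu,%lu) (%lu,%lu)\n",
                    (unsigned long)index,
                    (unsigned long)z->top,    (unsigned long)z->left,
                    (unsigned long)z->top,    (unsigned long)z->right,
                    (unsigned long)z->bottom, (unsigned long)z->left,
                    (unsigned long)z->bottom, (unsigned long)z->right);
}
```

The caller can then pass the buffer to fputs to print it. Passing zones around as records like this, instead of through global variables, is one way to keep the modules independent.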

Input image files

Several files will be made available for testing, as mentioned in the introduction. Each file is a binary image file consisting of a sequence of pixel values. All files have the same number of rows and columns, and every file is organized row by row, starting at the top of the image, with each row holding one value per column. Thus, for instance, an image 10 pixels high by 5 pixels wide would be stored in the file as the sequence of values:

Row 0: 65, 70, 60, 65, 45

Row 1: 40, 56, 70, 54, 45

Row 2: 0, 0, 10, 0, 0

...

Row 9: 0, 0, 0, 0, 0

 

The actual number of rows and columns will be the same for each image.

The files will be in binary format, with each pixel value taking up one byte. So, for instance, if the image is 10x5 pixels (10 rows of 5 columns), the file will have 50 bytes of data, each byte corresponding to one pixel.
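A minimal sketch of loading such a file is shown below. The function name read_image is hypothetical, and the image dimensions are passed in as parameters because the handout does not state the actual number of rows and columns.

```c
#include <stdio.h>
#include <stdlib.h>

/* Read a rows x cols image stored one byte per pixel, row by row from the
   top. Returns a malloc'd buffer of rows * cols bytes that the caller must
   free, or NULL if the file cannot be opened or is too short. */
unsigned char *read_image(const char *path, size_t rows, size_t cols)
{
    FILE *fp;
    unsigned char *image;
    size_t total = rows * cols;

    fp = fopen(path, "rb");             /* "rb": binary mode, no translation */
    if (fp == NULL)
        return NULL;

    image = malloc(total);
    if (image == NULL) {
        fclose(fp);
        return NULL;
    }

    if (fread(image, 1, total, fp) != total) {
        free(image);                    /* short read: file is too small */
        fclose(fp);
        return NULL;
    }

    fclose(fp);
    return image;
}
```

After loading, pixel (r, c) is found at image[r * cols + c], which matches the row-by-row layout described above.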

Deliverables

1.      The source code with appropriate comments that give good information about your design.

2.      Output, in a suitable format, showing the coordinates of the four corners of each of the zones found by your program for all the sample images.

3.      An email containing the source code of your program, sent to the grader (login byao).

Grading

·        design: 20%

·        coding: 40%

·        performance (how many images are zoned correctly): 40%

Due Date

December 15th, before 5pm.