CS167 C Programming
Spring 1998

Building a Concordance

Object

To test understanding of arrays and pointers, especially the use of an array of strings (pointers to characters).

Problem Statement

Write a program in C to build a concordance from a piece of arbitrary text. A concordance is a list of all the different words that occur in a piece of text.

Outline

Assuming the concordance will essentially be an array of pointers to characters (i.e. strings), a number of problems will have to be solved:

  1. How many different words to expect.
  2. How to avoid using too much space for each word.
  3. How to recognize a word, i.e. which characters are in the word, and which are not.
  4. What to do about words split across lines with a hyphen,
    words separated only by a forward slash, and punctuation in general.
  5. What to do about capital letters, i.e. "this" and "This" are the same word.

The success of your program is dependent on answering these questions correctly, incorporating them into a design, and then writing the appropriate code.
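
As a starting point for questions 2 to 4, here is a minimal sketch of one possible approach: read one character at a time into a fixed-size buffer and treat any run of letters as a word. The names ReadWord and MAX_WORD are hypothetical, and the rule "every non-letter ends a word" is only an assumption. It splits "cat's" into "cat" and "s", and it does not rejoin words hyphenated across lines, so you may well decide on different rules.

```c
#include <ctype.h>
#include <stdio.h>

#define MAX_WORD 64   /* assumed upper bound on word length, for the caller's buffer */

/* Reads the next run of letters from in into buf, storing at most
   size - 1 characters plus the terminating null (size must be >= 1).
   Returns 1 if a word was read, 0 at end of input. */
int ReadWord(FILE *in, char *buf, size_t size) {
    int c;
    size_t n = 0;

    /* Skip anything that is not a letter. */
    while ((c = getc(in)) != EOF && !isalpha(c))
        ;
    if (c == EOF)
        return 0;

    /* Collect letters; letters that do not fit in buf are dropped. */
    do {
        if (n + 1 < size)
            buf[n++] = (char)c;
    } while ((c = getc(in)) != EOF && isalpha(c));

    buf[n] = '\0';
    return 1;
}
```

A caller could declare char word[MAX_WORD] and loop on ReadWord(stdin, word, sizeof word) until it returns 0. Because the buffer is reused for every word, only the copy made for each new word takes up permanent space.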

The design

Your design should use at least three functions, including the main function (see the section below).

Space allocation for new words

A pointer to each new word that you read from the input stream must be stored in the array of pointers. Your design should include a separate function to allocate space for the word. I will be sending you the correct code for this operation via email. You should take my code and incorporate it, with suitable comments, into your function. Here is what you will get:


/***********************************************************************
 This function takes a pointer to a character that is assumed to be the
 start of an array.  The length of the string (up to but not including
 the null character at the end) is calculated and space allocated for
 a copy of the string using calloc, a standard library function. The 
 string is then copied into the new space by strcpy, a string library 
 function. Finally, a pointer to the new string is returned. To use this 
 function include both <stdlib.h> and <string.h> in your program.
************************************************************************/
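char *AllocateString(char *s) {
    /* Room for the characters plus the terminating null character. */
    char *t = calloc(strlen(s) + 1, sizeof(char));
    if (t != NULL)          /* calloc returns NULL if no memory is available */
        strcpy(t, s);
    return t;               /* NULL tells the caller that allocation failed */
}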

Note that you will need to include both stdlib.h and string.h for this to work.
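
To show how the returned pointer might be stored in the array of pointers, here is a rough sketch. AddWord and MAX_WORDS are hypothetical names, and the capacity of 1000 is only an assumed answer to the "how many different words" question. AllocateString is repeated, with a NULL check added, so that the sketch compiles on its own.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_WORDS 1000   /* assumed capacity; choose your own estimate */

/* The handout's function, repeated here so the sketch is self-contained;
   the NULL check is an addition. */
char *AllocateString(char *s) {
    char *t = calloc(strlen(s) + 1, sizeof(char));
    if (t != NULL)
        strcpy(t, s);
    return t;
}

/* Stores a private copy of word at list[count].
   Returns the new count, or -1 if the list is full or memory ran out. */
int AddWord(char *list[], int count, char *word) {
    char *copy;

    if (count >= MAX_WORDS)
        return -1;
    copy = AllocateString(word);
    if (copy == NULL)
        return -1;
    list[count] = copy;
    return count + 1;
}
```

The copy matters because the buffer you read words into is normally reused: storing the buffer's own address would leave every entry in the list pointing at the most recently read word.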

Checking for words already seen

So that the concordance list has no duplicates, you must check the list every time a word is read from the input stream to see whether the word is already present. Use the library function strcmp to do this; it is declared in the header file string.h, which you must include at the top of your file.
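
The check can be sketched as a simple linear search. The function name FindWord and its parameters are hypothetical; the only library call is strcmp, which returns 0 when the two strings are equal.

```c
#include <string.h>

/* Returns the index of word in list[0..count-1], or -1 if it is absent. */
int FindWord(char *list[], int count, const char *word) {
    int i;

    for (i = 0; i < count; i++) {
        if (strcmp(list[i], word) == 0)   /* 0 means the strings match */
            return i;
    }
    return -1;
}
```

A word would then be added to the concordance only when FindWord returns -1. Note that strcmp is case sensitive, so "This" and "this" compare as different unless the word is converted to one case first.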

Capital letters

Although these can be handled easily by using your own C code, the library header ctype.h contains a number of useful routines for handling characters. In particular, "tolower" converts a character to lower case, while "toupper" converts it to upper case. In addition, there are testing functions like "isdigit", "isalpha" etc. Type "man ctype" at the UNIX prompt to get a full list of these useful routines.
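
For illustration, a word can be converted to lower case in place with tolower. The function name LowerCaseWord is hypothetical. The cast to unsigned char is a precaution, since the ctype.h routines expect a value in the range of unsigned char (or EOF).

```c
#include <ctype.h>

/* Converts every character of the string s to lower case, in place. */
void LowerCaseWord(char *s) {
    for (; *s != '\0'; s++)
        *s = (char)tolower((unsigned char)*s);
}
```

Calling this on each word before it is looked up or stored means that "This" and "this" end up as the same entry.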

Input

The input will be from standard input (no file opening and closing is necessary) and can be any piece of text. Do not use a C program as input. Instead you should use a piece of normal English text. We will test your program on a simple piece of text that we will choose.

Output

Your program should print a numbered list of the different words found in the text, e.g.:

1: the
2: cat
3: sat
4: on
5: mat
6: she
7: was
8: sleepy
9: and
10: comfortable
...
...
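
A listing in this format could be produced along the following lines. This is only a sketch: the name PrintConcordance is hypothetical, and it takes the output stream as a parameter purely for convenience; your program would pass stdout.

```c
#include <stdio.h>

/* Prints list[0..count-1], one word per line, numbered from 1. */
void PrintConcordance(FILE *out, char *list[], int count) {
    int i;

    for (i = 0; i < count; i++)
        fprintf(out, "%d: %s\n", i + 1, list[i]);
}
```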

Deliverables

  1. The source code, with good layout and comments.
  2. Your sample output.
  3. In addition, mail your source code to the grader, login hhuang.

Due date

Monday, April 13th, 1998, before 5:00pm.