
Assignment #2

Learning StarOffice
An Introduction to the Linux Tools : Grep and gawk

CS371 Spring 2000

Date Issued 01/25/'00

Date Due 02/02/'00

1. StarOffice

StarOffice is an X-based integrated office suite for Linux. The latest version of StarOffice includes three tools for office use:
1. StarWriter is a word processor. It behaves much like Microsoft Word.
2. StarCalc is a spreadsheet tool. It is similar to Microsoft Excel.
3. StarImpress is a presentation graphics package. It is much like Microsoft PowerPoint.
StarOffice lets you save documents in other formats such as ASCII, RTF (Rich Text Format), and HTML, in addition to various office tool formats. More information about StarOffice and its installation is available at the COG Information site.

2. Regular Expressions in Unix

A regular expression is a pattern that matches some regular (predictable) text. It can be plain text (case matters), e.g. Fox, or it can be "enhanced" text that also contains special characters, e.g. '^[aeiou].*[aeiou]$' (this regular expression matches a line that starts with a vowel, continues with any number of characters, possibly none, and ends with a vowel). The Unix shell has a similar but simpler set of special characters, often called "wild card" characters because they can represent other characters. Their official name is Filename Metacharacters. Due to their special properties, these metacharacters should not be used as regular parts of a file name. You can force them to be used in a file name, but you must then always quote the file name (or put a backslash before each metacharacter) to keep the shell from interpreting them as special characters.
Try out the following metacharacters on the command line.

ls f*

This lists all files that begin with the letter f. * is the metacharacter here.

ls *.txt

This lists all files that end with the suffix .txt.

Metacharacters
Some of the Metacharacters used by the unix shell are:
*         :     Matches zero or more characters
?        :     Matches a single character
[ ]      :     Matches any one of the enclosed characters
\       :     Turns off the special meaning of the metacharacter that follows
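
The shell metacharacters listed above can be tried safely in a scratch directory. The following is only a sketch: the file names (a.txt, bar.txt, fig.c, foo.txt) and the variable names are made up for illustration, and echo is used instead of ls so that the expanded names end up in variables.

```shell
# Work in a scratch directory so no real files are touched.
start_dir=$PWD
demo_dir=$(mktemp -d)
cd "$demo_dir" || exit 1
touch a.txt bar.txt fig.c foo.txt    # hypothetical file names

starts_f=$(echo f*)       # names beginning with f
ends_txt=$(echo *.txt)    # names ending in .txt
one_char=$(echo ?.txt)    # exactly one character before .txt
a_or_b=$(echo [ab]*)      # names beginning with a or b
no_glob=$(echo \*.txt)    # backslash turns off the special meaning of *

printf '%s -> %s\n' 'f*' "$starts_f"
printf '%s -> %s\n' '*.txt' "$ends_txt"
printf '%s -> %s\n' '?.txt' "$one_char"
printf '%s -> %s\n' '[ab]*' "$a_or_b"
printf '%s -> %s\n' '\*.txt' "$no_glob"

# Return to where we started and remove the scratch directory.
cd "$start_dir" && rm -rf "$demo_dir"
```

With these four files, f* expands to fig.c and foo.txt, ?.txt expands only to a.txt, and the escaped \*.txt is passed through as the literal text *.txt.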

Some of the Metacharacters used by ed, vi, grep, sed (note the different meaning of * and ?):
.        :     Matches any single character except newline
*        :     Matches any number (or none) of the single character that immediately precedes it; e.g., since '.' means any character, '.*' matches any number of any characters
^        :     Matches the following regular expression at the beginning of the line
$        :     Matches the preceding regular expression at the end of the line
[ ]      :     Matches any one of the enclosed characters
+        :     Matches one or more instances of the preceding regular expression (extended regular expressions only, as in egrep and awk)
?        :     Matches zero or one instance of the preceding regular expression (extended regular expressions only, as in egrep and awk)
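
A small experiment may make the regular expression metacharacters above more concrete. This is a sketch under assumptions: the word list and variable names are invented, and grep -E is used as the standard way to ask grep for the extended syntax in which + and ? are special.

```shell
# A made-up word list, one word per line.
words=$(mktemp)
printf '%s\n' apple banana idea end color colour colouur > "$words"

# ^ and $ anchor the match; .* allows anything (or nothing) in between.
vowel_both_ends=$(grep '^[aeiou].*[aeiou]$' "$words")

# -E selects the extended syntax, where ? and + are metacharacters.
optional_u=$(grep -E '^colou?r$' "$words")
repeated_u=$(grep -E '^colou+r$' "$words")

printf 'vowel at both ends:\n%s\n' "$vowel_both_ends"
printf 'zero or one u:\n%s\n' "$optional_u"
printf 'one or more u:\n%s\n' "$repeated_u"

rm -f "$words"
```

The first pattern selects apple and idea, the second selects color and colour, and the third selects colour and colouur.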

3. The grep command

Grep is used to search one or more files for a given character string or pattern and to print the lines that match. Grep itself never changes the files; replacing a string is a job for an editor or for a tool such as sed. The name comes from the Unix text editor ed, which provides the command sequence
g/re/p for global/regular expression/print.
For example, a unix system user can enter on a command line:
grep html index.html
The result would be to display any line in the index.html file that includes the character string html. The entire line is displayed as can be seen below:

sydney:[4] grep html index.html
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
     <A HREF="cs272/index.html">cs272</A>
 M <A HREF="ndr.html">My class</A>
   <A HREF="pict.html">Nature
sydney:[5]

Grep can be used simply as a way of searching, especially through multiple files. It can also be used to locate words or strings that need to be replaced, although the replacement itself must be done with another tool.

For example, the command
grep if sample.c
prints all the lines in the file "sample.c" that contain the string "if", including lines where it appears inside a longer word such as "endif".

Grep allows the string argument to be specified as a regular expression, which is a way of specifying a string that allows certain metacharacters (special keyboard characters such as the period) to stand for other characters or to further define the way the pattern matching should work. For example:
grep '.*hood' essay1

would search the file essay1 and display every line containing a word with the string hood. The period (dot) and the asterisk (*) together indicate that any characters may precede the string hood, any number of times. (Because grep finds a match anywhere in a line, grep hood essay1 selects the same lines. The quote marks keep the shell from treating the * as a filename metacharacter; quotes are also required where the argument is a phrase or contains a blank. For example, grep " R " filename explicitly searches for an R with a space on both sides.)

An example is shown below:

sydney:[13] grep '.*intf' maxline.c
printf ("Hello! This is an output of the program maxline.c\n");
fprintf (stdout,"%s\n", save);
sydney:[14]

In the above example, all lines in the file maxline.c that contain the string intf are printed on the screen.

For more detailed help on grep, you can look through the man pages: www.elcafe.com/man/man1/grep.1.html

or type the command

sydney[2]% man grep

A few more examples :

grep scanf *.*
Searches all files in the current directory whose names contain a period and prints out all lines containing the string `scanf'.

grep 'Leach' phone
Searches the file phone for all occurrences of the pattern Leach. Note that the single quotes are optional in this case, but to be on the safe side single quotes should always be used around regular expressions in grep.

grep 'Leach' phone | grep Jacob
Notice that the second grep is not followed by a filename. The first grep takes its input from the file, extracts all the lines that contain the pattern Leach and passes them to the next grep through the pipe. The second grep takes each line from the pipe and extracts only those lines that contain the pattern Jacob. The effect is a two level filtering process.

grep '[Bb]er' phone
Lines containing the pattern `ber' or `Ber' are picked up by grep.

grep '[A-Z]er' phone
[A-Z] matches exactly one character. Any string that consists of a capital letter followed by er will match this pattern.

The egrep command

Egrep is similar to grep but it offers certain advanced features that grep does not support. It supports multiple search patterns, separated by the pipe symbol "|".
For example:
egrep 'Smith|Monica|Brad' phone
The above command prints the lines in the file phone which contain Smith or Monica or Brad.

More information about the egrep can be found at the man page by typing the command
sydney[2]% man egrep
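
The grep and egrep examples in this section can be tried on a small stand-in for the file phone. The entries below are invented for illustration, as are the variable names. The sketch uses grep -E, the standard spelling of egrep, and sets LC_ALL=C as a precaution so that [A-Z] means exactly the 26 capital letters.

```shell
LC_ALL=C; export LC_ALL      # keep the [A-Z] range predictable
phone=$(mktemp)              # stands in for the file "phone"
cat > "$phone" <<'EOF'
Leach Jacob 555-0101
Leach Monica 555-0102
Smith Amber 555-0103
Berger Brad 555-0104
Weber Tom 555-0105
EOF

# Two-level filter: lines with Leach, then only those that also have Jacob.
two_level=$(grep 'Leach' "$phone" | grep Jacob)
# -c counts the matching lines instead of printing them.
ber_count=$(grep -c '[Bb]er' "$phone")
# One capital letter followed by er.
capital_er=$(grep '[A-Z]er' "$phone")
# Alternation needs the extended syntax (egrep or grep -E).
any_name=$(grep -E 'Smith|Monica|Brad' "$phone")

printf 'Leach then Jacob: %s\n' "$two_level"
printf 'lines with ber or Ber: %s\n' "$ber_count"
printf 'capital + er: %s\n' "$capital_er"
printf 'Smith, Monica or Brad:\n%s\n' "$any_name"

rm -f "$phone"
```

On this data, '[Bb]er' matches three lines (Amber, Berger and Weber), while '[A-Z]er' matches only the Berger line.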

4. The gawk command

gawk is a programming language designed to make many common information retrieval and text manipulation tasks easy to state and to perform. The basic operation of gawk is to scan a set of input lines in order, searching for lines which match any of a set of patterns which the user has specified. For each pattern, an action can be specified; this action will be performed on each line that matches the pattern. In gawk the patterns may be more general than in grep, and the actions allowed are more involved than merely printing the matching line.
Some examples:
The gawk program
{print $3, $2}
prints the third and second columns of a table in that order.
The program
$2 ~ /A|B|C/
prints all input lines with an A, B, or C in the second field.

The command
gawk program [files]
executes the gawk commands in the string program on the set of named files, or on the standard input if there are no files. The statements can also be placed in a file pfile, and executed by the command
gawk -f pfile [files]

Program Structure
   A gawk program is a sequence of statements of the form:
         pattern   { action }
         pattern   { action }
         ...
Each line of input is matched against each of the patterns in turn. For each pattern that matches, the associated action is executed.   When all the patterns have been tested, the next line is fetched and the matching starts over. Either the pattern or the action may be left out, but not both.   If there is no action for a pattern, the matching line is simply copied to the output. (Thus a line which matches several patterns can be printed several times.) If there is no pattern for an action, then the action is performed for every input line. A line which matches no pattern is ignored. Since patterns and actions are both optional, actions must be enclosed in braces to distinguish them from patterns.
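
The matching rules described above can be observed with a three-line input. This is a minimal sketch: the input text and the counter n are made up, and awk is used as the command name (gawk accepts the same program).

```shell
input='a b c
x y
b'

output=$(printf '%s\n' "$input" | awk '
# pattern and action
/a/      { print "has a: " $0 }
# pattern only: the matching line is copied to the output
/b/
# another pattern and action
NF == 3  { print "three fields: " $0 }
# action only: runs for every input line
         { n++ }
END      { print n, "lines read" }
')

printf '%s\n' "$output"
```

The first input line matches three patterns and therefore appears three times in the output, the second line matches none and is not printed, and the third is copied once by the pattern-only rule.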

Records and Fields
Awk input is divided into "records" terminated by a record separator. The default record separator is a newline, so by default awk processes its input a line at a time. The number of the current record is available in a variable named NR. Each input record is considered to be divided into "fields". Fields are normally separated by white space (blanks or tabs), but the input field separator may be changed, as described below. Fields are referred to as $1, $2, and so forth, where $1 is the first field, and $0 is the whole input record itself. Fields may be assigned to. The number of fields in the current record is available in a variable named NF. The variables FS and RS refer to the input field and record separators; they may be changed at any time to any single character. The optional command line argument -Fc may also be used to set FS to the character c. If the record separator is empty, an empty input line is taken as the record separator, and blanks, tabs and newlines are treated as field separators. The variable FILENAME contains the name of the current input file.
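
As an illustration of NR, NF, -F and an empty RS, the following sketch feeds two tiny made-up inputs to awk. The data and the variable names are assumptions chosen only for the demonstration.

```shell
# Colon-separated input in the style of /etc/passwd (invented entries).
by_colon=$(printf 'root:x:0:0\nalice:x:1000:1000:Alice\n' |
    awk -F: '{ print NR, NF, $1 }')

# Empty RS: blank lines separate records, and newlines also separate fields.
by_paragraph=$(printf 'a b\nc\n\nd e\n' |
    awk 'BEGIN { RS = "" } { print NR, NF, $1 }')

printf 'split on colons:\n%s\n' "$by_colon"
printf 'split on blank lines:\n%s\n' "$by_paragraph"
```

Each output line shows the record number, the number of fields, and the first field. In the second run the first record spans two input lines and has three fields.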

Printing
An action may have no pattern, in which case the action is executed for all lines. The simplest action is to print some or all of a record; this is accomplished by the awk command print. The awk program
{ print }
prints each record, thus copying the input to the output intact. More useful is to print a field or fields from each record. For instance,
print $2, $1
prints the first two fields in reverse order. Items separated by a comma in the print statement will be separated by the current output field separator when output. Items not separated by commas will be concatenated, so
print $1 $2
runs the first and second fields together.
The predefined variables NF and NR can be used; for example
{ print NR, NF, $0 }
prints each record preceded by the record number and the number of fields.
Output may be diverted to multiple files; the program
{ print $1 >"foo1"; print $2 >"foo2" }
writes the first field, $1, on the file foo1, and the second field on file foo2. The >> notation can also be used:
print $1 >>"foo"
appends the output to the file foo. (In each case, the output files are created if necessary.) The file name can be a variable or a field as well as a constant; for example,
print $1 >$2
uses the contents of field 2 as a file name.
Naturally there is a limit on the number of output files that can be open at once; in the original awk it was 10, and newer implementations such as gawk allow many more.

An example of an awk program is shown below. The output of the UNIX "who" command has 6 fields.
     sydney:[5] who
     skamenan ttyp0    Sep 6 23:08 (nericos.NMSU.Edu)
     ndatta   ttyp2    Sep 7 13:25 (guinness.CS.NMSU)
     sydney:[6] who | awk '{ print $1, $5 }'
     skamenan 23:08
     ndatta 13:25

Patterns
A pattern in front of an action acts as a selector that determines whether the action is to be executed. A variety of expressions may be used as patterns: regular expressions, arithmetic relational expressions, string valued expressions, and arbitrary boolean combinations of these.

BEGIN and END
The special pattern BEGIN matches the beginning of the input, before the first record is read. The pattern END matches the end of the input, after the last record has been processed. BEGIN and END thus provide a way to gain control before and after processing, for initialization and wrap-up.
As an example, the field separator can be set to a colon by
    BEGIN     { FS = ":" }
                     ... rest of program ...
Or the input lines may be counted by
END { print NR }
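In older versions of awk, BEGIN must be the first pattern if present, and END must be the last. Gawk does not require this, but following the convention keeps programs easy to read.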
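
BEGIN and END are often used together, one for setup and one for a summary. The sketch below assumes a made-up list of name:score pairs and a variable named total; it is meant only to show where each of the three kinds of rule runs.

```shell
scores='alice:90
bob:70
carol:80'

summary=$(printf '%s\n' "$scores" | awk '
# runs once, before the first record is read
BEGIN { FS = ":" }
# runs for every record
      { total += $2 }
# runs once, after the last record has been processed
END   { print NR, total, total / NR }
')

printf 'records, sum, average: %s\n' "$summary"
```

For this input the summary line is 3 240 80: three records, a sum of 240, and an average of 80.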

Regular Expressions
    Awk regular expressions include the regular expression forms found in the UNIX text editor ed and in grep. In addition, awk allows parentheses for grouping, | for alternatives, + for "one or more", and ? for "zero or one", all as in lex.   Character classes may be abbreviated: [a-zA-Z0-9] is the set of all letters and digits. As an example, the awk program
/[Aa]ho|[Ww]einberger|[Kk]ernighan/
will print all lines which contain any of the names Aho , Weinberger or Kernighan, whether capitalized or not.
    Regular expressions (with the extensions listed above) must be enclosed in slashes, just as in ed and sed. Within a regular expression, blanks and the regular expression metacharacters are significant.   One can also specify that any field or variable matches a regular expression (or does not match it) with the operators ~ and !~. The program
$1 ~ /[jJ]ohn/
prints all lines where the first field contains john or John. Notice that this will also match Johnson, St Johnsbury, and so on. To restrict it to exactly [jJ]ohn, use
$1 ~ /^[jJ]ohn$/
The caret ^ refers to the beginning of a line or field; the dollar sign $ refers to the end.
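
The difference between the unanchored and the anchored match can be checked on a few invented names. In this sketch the second field is just a label, so that the output shows which lines were selected; all names and variables are assumptions.

```shell
names='John 1
Johnson 2
john 3
StJohnsbury 4
Mary 5'

# Unanchored: the first field only has to contain john or John.
loose=$(printf '%s\n' "$names" | awk '$1 ~ /[jJ]ohn/ { print $2 }')
# Anchored at both ends: the first field must be exactly john or John.
exact=$(printf '%s\n' "$names" | awk '$1 ~ /^[jJ]ohn$/ { print $2 }')
# Negated match: the first field does not contain john or John.
others=$(printf '%s\n' "$names" | awk '$1 !~ /[jJ]ohn/ { print $2 }')

printf 'unanchored:\n%s\n' "$loose"
printf 'anchored:\n%s\n' "$exact"
printf 'no match:\n%s\n' "$others"
```

The unanchored pattern selects lines 1 to 4, the anchored pattern selects only lines 1 and 3, and !~ selects line 5.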

Relational Expressions
An awk pattern can be a relational expression involving the usual relational operators <, <=, ==, !=, >=, and >. An example is
$2 > $1 + 100
which selects lines where the second field is more than 100 greater than the first field. Similarly,
NF % 2 == 0
prints lines with an even number of fields.

Built-in Functions
 Awk provides a length function to compute the length of a string of characters. This program prints each record, preceded by its length:
 {print length, $0}
length by itself is a pseudo-variable which yields the length of the current record; length (argument) is a function which yields the length of its argument, as in the equivalent
 {print length($0), $0}
The argument may be any expression.
 Awk also provides the arithmetic functions sqrt, log, exp, and int, for square root, base e logarithm, exponential, and integer part of their respective arguments. In the original awk, the name of one of these built-in functions, without argument or parentheses, stands for the value of the function on the whole record; in gawk only length may be used this way. The program
length < 10 || length > 20
prints lines whose length is less than 10 or greater than 20.
The function sprintf(f, e1, e2, ...) produces the value of the expressions e1, e2, etc., in the printf format specified by f. Thus, for example,
 x = sprintf("%8.2f %10d", $1, $2)
sets x to the string produced by formatting the values of $1 and $2.
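
The built-in functions can be tried without any input file by putting the calls in a BEGIN action. The sketch below uses invented values; the %10d conversion is an assumption made so that the format is accepted by current awk implementations.

```shell
LC_ALL=C; export LC_ALL      # make sure the decimal point is a period

# length of each record, followed by the record itself
lengths=$(printf 'hi\nhello world\n' | awk '{ print length($0), $0 }')

# arithmetic functions, called with explicit arguments
numbers=$(awk 'BEGIN { print sqrt(16), int(7.9) }')

# sprintf returns the formatted string instead of printing it
formatted=$(awk 'BEGIN { x = sprintf("%8.2f %10d", 3.14159, 42); print "[" x "]" }')

printf '%s\n' "$lengths"
printf '%s\n' "$numbers"
printf '%s\n' "$formatted"
```

The brackets around the sprintf result make the padding visible: 3.14 is right-aligned in a field of 8 characters and 42 in a field of 10.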

Variables, Expressions, and Assignments
Awk variables take on numeric (floating point) or string values according to context. For example, in
x = 1
x is clearly a number, while in
x = "smith"
it is clearly a string. Strings are converted to numbers and vice versa whenever context demands it. For instance,
x = "3" + "4"
assigns 7 to x. Strings which cannot be interpreted as numbers in a numerical context will generally have numeric value zero, but it is unwise to count on this behavior.
Arithmetic is done internally in floating point. The arithmetic operators are +, -, *, /, and % (mod). The C increment ++ and decrement -- operators are also available, and so are the assignment operators +=, -=, *=, /=, and %=. These operators may all be used in expressions.

Field Variables
Fields in awk share essentially all of the properties of variables: they may be used in arithmetic or string operations, and may be assigned to. Thus one can replace the first field with a sequence number like this:
{ $1 = NR; print }
or accumulate two fields into a third, like this:
{ $1 = $2 + $3; print $0 }
Field references may be numerical expressions, as in
{ print $i, $(i+1), $(i+n) }
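Whether a field is deemed numeric or string depends on context. In ambiguous cases like
if ($1 == $2) ...
the original awk treats the fields as strings; gawk compares them as numbers when both fields look like numbers, and as strings otherwise.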

Arrays
Array elements are not declared; they spring into existence by being mentioned. Subscripts may have any non-null value, including non-numeric strings. As an example of a conventional numeric subscript, the statement
x[NR] = $0
assigns the current input record to the NR-th element of the array x.
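
Both kinds of subscript can be seen in a short experiment. This is a sketch with invented input; the array names x and seen are arbitrary.

```shell
# Numeric subscripts: store every record, then print them in reverse order.
reversed=$(printf 'one\ntwo\nthree\n' | awk '
    { x[NR] = $0 }
END { for (i = NR; i >= 1; i--) print x[i] }
')

# String subscripts: count how often each word appears in the first field.
counts=$(printf 'apple\npear\napple\n' | awk '
    { seen[$1]++ }
END { print seen["apple"], seen["pear"] }
')

printf 'reversed:\n%s\n' "$reversed"
printf 'apple and pear counts: %s\n' "$counts"
```

Neither array is declared anywhere; the elements come into existence the first time they are assigned or incremented.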

You can find some on-line documentation about awk at
http://intranet/COG/Software/gawk/gawk_toc.html and http://www.maths.tcd.ie/pub/tutorials/awk/tutorial.html
You can also check the manual pages for both grep and awk commands. To check the manual pages, use the man command as shown below.

sydney[3]% man awk

Homework Assignment

The McCabe cyclomatic complexity of the program can be defined in three different ways that yield the same result:

1. The number of regions of the program flow graph corresponds to the cyclomatic complexity.
2. Cyclomatic complexity is defined as
    V = E - N + 2
    where E is the number of program flow graph edges and N is the number of flow graph nodes.
3. It can be also defined as
    V = P + 1
   where P is the number of predicate nodes in the program flow graph.
For the example above:
      The flow graph has 3 regions
      V = 8 edges - 7 nodes + 2 = 3
      V = 2 predicate nodes + 1 = 3
    In practice it is enough to count the number of `if' statements, the number of `case' branches in the `switch' statements, the number of conditional expressions (expr ? ... ), and the number of loop statements `for', `while', `do-while', and then add 1 to the total.
    A number of experiments have shown that some important properties of a program (e.g. the number of errors, the effort needed to design the program, the number of tests needed to cover all branches of the program, etc.) are correlated with McCabe complexity.

The task is to design a program that, given the text of a C program, computes the McCabe complexity M of that program, counts the number of source text lines N in the program, and also counts the number of occurrences of each of the 'if', 'for', 'case', '?', and 'while & do-while (counted together)' expressions.

An example C program for the McCabe complexity is given.

For this Laboratory Assignment do the following:

Write a prototype program in the AWK language. For the sake of simplicity, do not worry about excluding occurrences of 'if', 'for' etc. in comments and string constants. Simply count them together with the occurrences in the rest of the text. The McCabe complexity may come out a bit larger in this case, but for the prototype that is O.K. Use the prototype in order to refine the requirements.

Write the Requirements for the program. Describe what is on input, on output, what are the functions of the program, analyze special cases (e.g. occurrences of 'if', 'for' etc. in comments and string constants, multiple occurrences of countable keywords on the same line etc.)
Submit to the TA via e-mail :

1. Prototype program in AWK and example of its execution (input and output obtained) (5 points)

2. Requirement Specification (short document describing the problem) (5 points)

Please direct any questions/comments regarding this assignment to the TA.
