Assignment #2
Learning StarOffice
An Introduction to the Linux Tools: grep and gawk
CS371 Spring 2000
Date Issued 01/25/'00
     Date Due 02/02/'00
1. StarOffice
StarOffice is an X-based integrated office suite for Linux. The latest version of StarOffice includes three tools for office use:
 1. StarWriter is a word processor. It behaves a lot like Microsoft Word.
 2. StarCalc is a spreadsheet tool. It is similar to Microsoft Excel.
 3. StarImpress is a presentation graphics package. It is much like PowerPoint.
        StarOffice lets you save documents in other formats such as ASCII, RTF (Rich Text Format), and HTML, in addition to the various office tool formats. More information about StarOffice and its installation is available at the COG Information site.
 
 
2. Regular Expressions in Unix
 A regular expression is a pattern that matches some regular (predictable) text. It can be plain text (case is significant), e.g. Fox, or it can be "enhanced" text that also contains special characters, e.g. '^[aeiou].*[aeiou]$' (this regular expression matches a line that starts with a lowercase vowel, is followed by any number of characters, and ends with a lowercase vowel). The Unix shell has a similar set of special characters of its own, called "wild card" characters; they can represent other characters in file names. Their official name is Filename Metacharacters. Because of their special properties, these metacharacters should not be used as regular parts of a file name. You can force them to be used in a file name, but you must then always put quotation marks around the file name to keep the metacharacters from being interpreted as special characters.
Try out the following metacharacters on the command line.
 
ls f*
This lists all files whose names begin with the letter f. The * is the metacharacter here.

ls *.txt
This lists all files whose names end with the suffix .txt.

Metacharacters
Some of the metacharacters used by the Unix shell are:
*        :     Matches zero or more characters
?        :     Matches a single character
[ ]      :     Matches any one of the enclosed characters
\        :     Turns off the special meaning of the metacharacter that follows
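
The shell metacharacters listed above can be tried safely in a scratch directory. The following is only a sketch: the directory is created with mktemp and the file names (foo.txt, fig.c, bar.txt, f1, f2) are made up for the demonstration.

```shell
# Hypothetical scratch directory and file names, used only for this demonstration.
dir=$(mktemp -d)
cd "$dir" || exit 1
touch foo.txt fig.c bar.txt f1 f2

star=$(echo f*)          # * matches zero or more characters
qmark=$(echo f?)         # ? matches exactly one character
class=$(echo [bf]*.txt)  # [ ] matches any one of the enclosed characters
escaped=$(echo f\*)      # \ turns off the special meaning of *

echo "f*        -> $star"
echo "f?        -> $qmark"
echo "[bf]*.txt -> $class"
echo "f\\*       -> $escaped"
```

Because no file is literally named f*, the escaped form is passed to echo unchanged.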

Some of the metacharacters used by ed, vi, grep, and sed (note the different meanings of * and ?):
.        :     Matches any single character except newline
*        :     Matches any number (or none) of the single character that immediately precedes it; e.g., since '.' means any character, '.*' means match any number of any characters
^        :     Matches the following regular expression at the beginning of the line
$        :     Matches the preceding regular expression at the end of the line
[ ]      :     Matches any one of the enclosed characters
+        :     Matches one or more instances of the preceding regular expression (extended syntax, as in egrep and gawk)
?        :     Matches zero or one instance of the preceding regular expression (extended syntax, as in egrep and gawk)
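
A short way to see these metacharacters at work is to run grep against a small word list. This is a sketch only: the file words.txt and its contents are invented, and the -E option is used for + and ? on the assumption that the installed grep treats them as extended syntax.

```shell
# Hypothetical sample data; the file name words.txt is made up for this sketch.
tmp=$(mktemp -d)
printf '%s\n' apple echo idea banana color colour ab aab > "$tmp/words.txt"

# ^ and $ anchor the match: lines that start and end with a lowercase vowel.
vowels=$(grep '^[aeiou].*[aeiou]$' "$tmp/words.txt")

# ? makes the preceding character optional (extended syntax, hence -E).
colors=$(grep -E '^colou?r$' "$tmp/words.txt")

# + requires one or more of the preceding character.
as=$(grep -E '^a+b$' "$tmp/words.txt")

echo "$vowels"
echo "$colors"
echo "$as"
```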
 
 

3. The grep command
 Grep is used to search one or more files for a given character string or pattern and print the lines that match. Grep does not change the files it searches; replacing a string requires another tool, such as an editor. The name comes from the Unix text editor ed, which provided the command sequence
g/re/p for global/regular expression/print.
For example, a Unix system user can enter on a command line:
                                                   grep html index.html
 The result is to display every line in the index.html file that includes the character string html. The entire line is displayed, as can be seen below:

sydney:[4] grep html index.html
   <META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<P><FONT SIZE=+1><FONT COLOR="#66FFFF">&nbsp;&nbsp;&nbsp;&nbsp; </FONT><FONT COLOR="#CCFFFF"><A HREF="cs272/index.html">cs272</A></FONT></FONT>
<BR>M <FONT COLOR="#FFFFCC"><A HREF="ndr.html">My class</A></FONT>
<P><FONT COLOR="#CCFFFF"><FONT SIZE=+2>&nbsp;&nbsp; </FONT><FONT SIZE=+1><A HREF="pict.html">Nature
sydney:[5]

Grep can be used simply as a way of searching, especially through multiple files. It can also be used to find the word or string occurrences that need to be replaced with another tool.

For example, the command
grep  if  sample.c
prints all the lines in the file "sample.c" that contain the string "if".
 

Grep allows the string argument to be specified as a regular expression, which is a way of specifying a string that allows certain metacharacters (special keyboard characters such as the period) to stand for other characters or to further define the way the pattern matching should work. For example:
                                                   grep '.*hood' essay1

would search the file essay1 and display every line containing the string hood. The period (dot) and the asterisk (*) indicate that any character may precede the string hood any number of times. (The quote marks keep the shell from treating the * as a filename metacharacter. Quotes are also required where the argument is a phrase or contains a blank. For example, grep " R " filename explicitly searches for an R with a space on both sides.)

An example is shown below:

sydney:[13] grep '.*intf' maxline.c
    printf ("Hello! This is an output of the program maxline.c\n");
        fprintf (stdout,"%s\n", save);
sydney:[14]

In the above example, all lines in the file maxline.c that contain the string intf are printed on the screen.

For more detailed help on grep, you can look through the man page: www.elcafe.com/man/man1/grep.1.html

or type the command

sydney[2]% man grep

 A few more examples:

grep scanf *.*
Searches all files in the current directory whose names contain a period and prints all lines containing the string `scanf'.

grep 'Leach' phone
 Searches the file phone for all occurrences of the pattern Leach. Note that the single quotes are optional in this case, but to be on the safe side single quotes should always be used around regular expressions in grep.

grep 'Leach' phone | grep Jacob
 Notice that the second grep is not followed by a filename. The first grep takes its input from the file, extracts all the lines that contain the pattern Leach and passes them to the next grep through the pipe. The second grep takes each line from the pipe and extracts only those lines that contain the pattern  Jacob. The effect is a two level filtering process.

grep '[Bb]er' phone
Lines containing either `ber' or `Ber' are picked up by grep.

grep '[A-Z]er' phone
 [A-Z] matches exactly one character. Any string that consists of a capital letter followed by er will match this pattern.
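
The phone examples can be reproduced with a small stand-in file. The names and numbers below are invented for illustration. LC_ALL=C is set on the last command as a precaution, on the assumption that some locales let [A-Z] match lowercase letters as well.

```shell
# Hypothetical phone list; every entry is made up.
tmp=$(mktemp -d)
cat > "$tmp/phone" <<'EOF'
Jacob Leach 555-0101
Mary Leach 555-0102
Jacob Weber 555-0103
Tom Berry 555-0104
Ann Smith 555-0105
EOF

# Two-level filter: lines with Leach, then only those that also contain Jacob.
two_level=$(grep 'Leach' "$tmp/phone" | grep Jacob)

# -c counts matching lines: Weber contains "ber", Berry contains "Ber".
either_case=$(grep -c '[Bb]er' "$tmp/phone")

# Only Berry has a capital letter immediately followed by "er".
capital=$(LC_ALL=C grep -c '[A-Z]er' "$tmp/phone")

echo "$two_level"
echo "$either_case $capital"
```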
 

The egrep command
Egrep is similar to grep but offers certain advanced features that grep does not support. It supports multiple search patterns, which are separated by the pipe symbol "|".
For example:
 egrep 'Smith|Monica|Brad' phone
The above command prints the lines in the file phone that contain Smith or Monica or Brad.
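
As a quick check of the alternation syntax, the sketch below counts the matching lines in an invented phone file. It uses grep -E, which accepts the same extended patterns as egrep; the file contents are hypothetical.

```shell
# Hypothetical phone list; every entry is made up.
tmp=$(mktemp -d)
printf '%s\n' 'Ann Smith 555-0105' 'Monica Hall 555-0106' \
              'Tom Berry 555-0104' 'Brad Ray 555-0107' > "$tmp/phone"

# A line matches if it contains any one of the three alternatives.
matches=$(grep -E -c 'Smith|Monica|Brad' "$tmp/phone")
echo "$matches"
```

The pattern must stay inside single quotes; otherwise the shell would treat each | as a pipe.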

More information about egrep can be found in the man page by typing the command
sydney[2]% man egrep
 

4. The gawk command
gawk is designed to be an easy-to-use programming language that lets you work with information stored in files. The main strengths of gawk are its capabilities to do the following:
    * Display some or all of the contents of a file, selecting rows, columns, or fields as necessary.
    * Analyze text for the frequency of words, etc.
    * Prepare formatted output reports based on information in a file.
    * Filter text in a very powerful manner.
    * Perform calculations with numeric information from a file.

Records and Fields

     gawk input is divided into records terminated by a record separator. The default record separator is a newline, so by default gawk
processes its input a line at a time. The number of records read so far is available in a variable named NR. Each input record is considered to be divided into fields. Fields are normally separated by white space (blanks or tabs), but the input field separator may be changed, as described below. Fields are referred to as $1, $2, and so forth, where $1 is the first field, and $0 is the whole input record itself. The number of fields in the current record is available in a variable named NF. The variable FS refers to the field separator, which is white space by default. The variable RS refers to the record separator, which is a newline by default. The variables FS and RS may be changed at any time to any single character.
So by default gawk uses blank characters (spaces or tabs) as field separators unless instructed to use another character.
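
The sketch below prints NR, NF, and $1 for two made-up input lines. It is written with plain awk so that it also runs where gawk is not installed; gawk accepts the same program.

```shell
# Two invented records: the first has three fields, the second has two.
fields=$(printf 'alpha beta gamma\none two\n' | awk '{ print NR, NF, $1 }')
echo "$fields"
```

Each output line holds the record number, the number of fields in that record, and its first field.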

Pattern-Action  Pairs.

The gawk language has a particular format for all instructions. Each command is composed of two parts: a pattern and a corresponding action.
Whenever a pattern is matched, gawk executes the action that corresponds to the pattern.

With gawk, the patterns to be matched are enclosed in a pair of slashes, and the actions are in a pair of braces:

/pattern1/{action1}
/pattern2/{action2}

This format makes it quite easy to tell where the pattern starts and ends, and where the action starts and ends. All gawk programs are sets of
pattern-action pairs. Suppose that you are working on a text file. A typical pattern might match a set of strings, and the action might be to print out parts of the line that matched.

The pattern-action pairs are enclosed in single quotes so that the shell passes them to gawk unchanged and they are kept apart from the file name.
The general gawk command is

gawk '/pattern/{action}'  filename

For example, consider the following:

gawk  '/UNIX/{print $2} '  file2.data

This command searches file2.data line by line, looking for the string UNIX. If it finds UNIX, it prints the second column of that line (record).
 

You can combine more than one pattern-action pair in a command. For example, the command

gawk '/scan/{print $1}   /rumor/{print $2}'   file1

reads file1 once, one record at a time. For each record it prints the first column if the record contains the pattern "scan", and then prints the second column if the record contains the pattern "rumor". Every pattern-action pair is tested against the current record, in the order written, before gawk moves on to the next record.
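
The order of the output shows how the input is processed. In this sketch the contents of file1 are invented, and plain awk stands in for gawk. The output follows the order of the input lines, which indicates that the file is read once and every pair is tried on each record.

```shell
# Hypothetical contents for file1.
tmp=$(mktemp -d)
printf '%s\n' 'rumor one' 'scan two' 'rumor three' > "$tmp/file1"

out=$(awk '/scan/{print $1}   /rumor/{print $2}' "$tmp/file1")
echo "$out"
```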

Simple Patterns.

Gawk numbers all the fields in a record. The first field is $1, the second is $2, and so on. The entire record is $0. gawk allows you to omit the $0 in simple commands: print by itself prints the entire record. Similarly, if there is no action, gawk copies the entire line without change.

Therefore, each of the following instructions results in the same output.

gawk '/parker/{print $0}'    file1
gawk  '/parker/{print} '   file1
gawk  '/parker/'    file1

Suppose that you want to do more than match a simple character string. For example, the command

gawk '$2== "foo" { print $3}'  testfile
 

instructs gawk to compare the second field ($2) of each record in testfile with the string foo. If they are equal, gawk prints the third column ($3). There are several comparison operators:

Equal to                     ==
Not equal to                 !=
Greater than                 >
Less than                    <
Greater than or equal to     >=
Less than or equal to        <=
 

For example

gawk '$4 > 100 ' testfile

will display every line in the testfile in which the value in the fourth column is greater than 100.
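
Both kinds of comparison can be tried on a small stand-in for testfile. The file contents here are invented, and plain awk is used in place of gawk.

```shell
# Hypothetical testfile with four columns.
tmp=$(mktemp -d)
printf '%s\n' 'a foo x 50' 'b bar y 150' 'c foo z 200' > "$tmp/testfile"

# String comparison on the second field; prints the third field of matching records.
third=$(awk '$2 == "foo" { print $3 }' "$tmp/testfile")

# Numeric comparison on the fourth field; with no action the whole line is printed.
big=$(awk '$4 > 100' "$tmp/testfile")

echo "$third"
echo "$big"
```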

Similarly, the normal arithmetic operators are available: plus (+), minus (-), multiplication (*), division (/), exponentiation (^), and remainder (%).

You can combine column numbers and math, too.

{print $3/2}

This action divides the number in the third column by 2 and prints the result.

Some other arithmetic commands are:

{print $1+$2*$3}

{print ($1+$2)*$3}
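
The effect of the parentheses is easiest to see with concrete numbers. The input line "2 3 4" is made up for this sketch, and plain awk stands in for gawk.

```shell
# Multiplication binds tighter than addition: 2 + (3 * 4).
no_parens=$(echo '2 3 4' | awk '{ print $1+$2*$3 }')

# Parentheses force the addition first: (2 + 3) * 4.
parens=$(echo '2 3 4' | awk '{ print ($1+$2)*$3 }')

# Dividing the third column by 2.
half=$(echo '2 3 4' | awk '{ print $3/2 }')

echo "$no_parens $parens $half"
```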
 

Strings are surrounded by double quotes ("").

gawk '$1 != "Tim" {print}'  testfile

This command prints any line in testfile that doesn't have the word Tim in the first column.

You can place strings in the print command too.

gawk ' $1 != "Tim" { print "The entry for", $1, "is not Tim.", $2} '  testfile
 

In some files the separator may not be a space, as in the /etc/passwd file, where the separator is the colon (:).

You can change the field separator on the gawk command line by using the -F option followed by the separator you want to use:

gawk -F":"  '/parker/{ print } ' /etc/passwd

This command changes the field separator to a colon and searches the /etc/passwd file for lines containing the string parker. The new field separator is put in quotation marks to avoid any confusion.
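
To try the -F option without depending on the real /etc/passwd, the sketch below uses an invented two-line file in the same colon-separated layout. The user entries are hypothetical, and plain awk is used in place of gawk.

```shell
# Hypothetical passwd-style file; both entries are made up.
tmp=$(mktemp -d)
printf '%s\n' 'root:x:0:0:root:/root:/bin/sh' \
              'parker:x:1001:1001:P Parker:/home/parker:/bin/csh' > "$tmp/passwd"

# With ":" as the separator, $1 is the login name and $7 is the login shell.
shell_of=$(awk -F":" '/parker/{ print $1, $7 }' "$tmp/passwd")
echo "$shell_of"
```

Without -F":" the second line would be split at the blank in "P Parker" instead of at the colons.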
 

Output may be diverted to multiple files; the program

 { print $1 >"foo1"; print $2 >"foo2" }

writes the first field, $1, on the  file  foo1,  and  the  second field on file foo2.  The >> notation can also be used:
  print $1 >>"foo"
appends  the  output  to the file foo.  (In each case, the output files are created if necessary).  The file name can be a variable
or a field as well as a constant; for example,
  print $1 >$2
uses the contents of field 2 as a file name.
Naturally  there  is  a limit on the number of output files; currently it is 10.
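
The redirections can be tried in a scratch directory so that no existing files are touched. This is a sketch under assumptions: the input lines are invented, the directory comes from mktemp, and plain awk stands in for gawk.

```shell
# Work in a hypothetical scratch directory so foo1 and foo2 are created there.
tmp=$(mktemp -d)
cd "$tmp" || exit 1

# > creates the files: first fields go to foo1, second fields to foo2.
printf '%s\n' 'a 1' 'b 2' | awk '{ print $1 > "foo1"; print $2 > "foo2" }'
first=$(cat foo1)
second=$(cat foo2)

# >> appends to an existing file instead of replacing its contents.
printf '%s\n' 'c 3' | awk '{ print $1 >> "foo1" }'
appended=$(cat foo1)

echo "$appended"
```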
 

If the data is not stored in a file, it can be supplied to gawk through a pipe (|).
For example:

 The output of the UNIX "who" command has five or six fields per line:

sydney[2]% who
estahl   tty4     Jan 25 11:57
ppesara  ttyp1    Jan 25 13:35 (bonobo.CS.NMSU.E)

The command below prints the first and fifth fields of each line, the user name and the login time:

sydney[4]% who |gawk '{print $1,$5}'
estahl 11:57
ppesara 13:35
 

BEGIN and END

     The special pattern  BEGIN  matches  the  beginning  of  the input,  before the first record is read.  The pattern END matches
the end of the input, after the last record has  been  processed. BEGIN and END thus provide a way to gain control before and after
processing, for initialization and wrap-up.
 As an example, the field separator can be set to a colon by
    BEGIN     { FS = ":" }
                     ... rest of program ...
Or the input lines may be counted by
  END  { print NR }
If BEGIN is present, it must be the first pattern;  END  must  be the last if used.
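
BEGIN and END can be combined with an ordinary action in one program. In this sketch the colon-separated input is invented, the variable name sum is arbitrary, and plain awk stands in for gawk.

```shell
# BEGIN sets the separator before any input is read; the middle action
# runs once per record; END runs after the last record.
out=$(printf 'a:1\nb:2\nc:3\n' |
      awk 'BEGIN { FS = ":" } { sum += $2 } END { print NR, sum }')
echo "$out"
```

The output is the number of records followed by the total of the second field.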

After you have saved all of the pattern-action pairs in a file, they are run by gawk with the -f option on the command line:

gawk -f script filename

A gawk program saved in a file this way is called a script.
 

 Regular Expressions
    gawk regular expressions include the regular expression forms found in the UNIX text editor ed and in grep. In addition, gawk allows parentheses for grouping, | for alternatives, + for ``one or more'', and ? for ``zero or one'', all as in lex. Character classes may be abbreviated: [a-zA-Z0-9] is the set of all letters and digits. As an example, the gawk program
  /[Aa]ho|[Ww]einberger|[Kk]ernighan/
will print all lines which contain any of the names Aho, Weinberger or Kernighan, whether capitalized or not.
    Regular expressions (with the extensions listed above) must be enclosed in slashes, just as in ed and sed. Within a regular
expression, blanks and the regular expression metacharacters are significant. One can also specify that any field or variable matches a regular expression (or does not match it) with the operators ~ and !~. The program
 $1 ~ /[jJ]ohn/
prints all lines where the first field matches ``john'' or ``John''. Notice that this will also match ``Johnson'', ``St. Johnsbury'', and so on. To restrict it to exactly [jJ]ohn, use
 $1 ~ /^[jJ]ohn$/
The caret ^ refers to the beginning of a line or field; the dollar sign $ refers to the end.
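
The difference between the unanchored and the anchored match can be seen on a few made-up lines. Plain awk is used here in place of gawk; the input names are hypothetical.

```shell
# Unanchored: the first field only has to contain john or John.
loose=$(printf '%s\n' 'John 1' 'Johnson 2' 'john 3' 'Mary 4' |
        awk '$1 ~ /[jJ]ohn/')

# Anchored with ^ and $: the first field must be exactly john or John.
strict=$(printf '%s\n' 'John 1' 'Johnson 2' 'john 3' 'Mary 4' |
         awk '$1 ~ /^[jJ]ohn$/')

echo "$loose"
echo "---"
echo "$strict"
```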
 

 Built-in Functions
     gawk  provides a length function to compute the length of a string of characters.  This program prints  each  record,  preceded by its length:
    {print length, $0}
length by itself is a pseudo-variable which yields the length of the current  record;  length (argument)  is  a  function  which yields the length of its argument, as in the equivalent
   {print length($0), $0}
The argument may be any expression.
  gawk also provides the arithmetic functions sqrt, log, exp, and int, for square root, base e logarithm, exponential, and integer part of their respective arguments. In gawk, only length may be written without an argument or parentheses, in which case it stands for the length of the whole record. The program
 length < 10 || length > 20
prints lines whose length is less than 10 or greater than 20.
The  function  sprintf(f, e1, e2, ...) produces the value of the expressions e1, e2, etc., in the printf format  specified  by f.  Thus, for example,
   x = sprintf("%8.2f %10ld", $1, $2)
sets  x to the string produced by formatting the values of $1 and $2.
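
A brief sketch of length and sprintf follows. The input values are invented, plain awk stands in for gawk, and the format uses %d rather than the %ld shown above, on the assumption that the length modifier makes no difference here.

```shell
# length with no argument is the length of the whole record.
lens=$(printf 'hello\nhi there\n' | awk '{ print length, $0 }')

# sprintf returns the formatted string instead of printing it:
# $1 in a field 8 wide with 2 decimals, a space, then $2 in a field 10 wide.
fmt=$(echo '3.14159 42' | awk '{ x = sprintf("%8.2f %10d", $1, $2); print x }')

echo "$lens"
echo "[$fmt]"
```

The brackets in the last echo make the padding added by the field widths visible.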

Variables, Expressions, and Assignments
    gawk variables take on numeric  (floating  point)  or  string values according to context.  For example, in
   x = 1
x is clearly a number, while in
  x = "smith"
it  is  clearly  a  string.  Strings are converted to numbers and vice versa whenever context demands it.  For instance,
  x = "3" + "4"
assigns 7 to x.  Strings which cannot be interpreted  as  numbers in  a  numerical  context will generally have numeric value zero,
but it is unwise to count on this behavior.
 Arithmetic is done internally in floating point.  The arithmetic operators are +, -, *, /, and % (mod).  The C increment  ++ and  decrement  --  operators  are also available, and so are the assignment operators +=, -=, *=, /=, and %=.  These operators may
all be used in expressions.
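
The conversions and operators described above can be checked in a BEGIN action, which needs no input file. The variable names x and n are arbitrary, and plain awk stands in for gawk.

```shell
# Strings are converted to numbers because + demands numeric operands.
sum=$(awk 'BEGIN { x = "3" + "4"; print x }')

# Increment and assignment operators: 5 -> 6 -> 16 -> 16 mod 7.
counter=$(awk 'BEGIN { n = 5; n++; n += 10; n %= 7; print n }')

echo "$sum $counter"
```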

 Field Variables
     Fields in gawk share essentially all of the properties of variables -- they may be used in arithmetic or string operations, and may be assigned to. Thus one can replace the first field with a sequence number like this:
 { $1 = NR; print }
or accumulate two fields into a third, like this:
 { $1 = $2 + $3; print $0 }
Field references may be numerical expressions, as in
{ print $i, $(i+1), $(i+n) }
Whether a field is deemed numeric or string depends on context; in ambiguous cases like
 if ($1 == $2) ...
gawk compares the fields as numbers when both look like numbers, and as strings otherwise.
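
Assigning to a field changes the record that print writes out. The sketch below applies the two field assignments shown earlier to invented input, using plain awk in place of gawk.

```shell
# Replace the first field with the record number.
numbered=$(printf 'x 2 3\ny 4 5\n' | awk '{ $1 = NR; print }')

# Store the sum of fields 2 and 3 in field 1.
summed=$(printf 'x 2 3\n' | awk '{ $1 = $2 + $3; print $0 }')

echo "$numbered"
echo "$summed"
```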

 Arrays
     Array  elements are not declared; they spring into existence by being mentioned.  Subscripts  may  have  any  non null value,
including  non numeric  strings.  As an example of a conventional numeric subscript, the statement
   x[NR] = $0
assigns the current input record to  the  NR-th  element  of  the array  x.
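
One use of such an array is to print a file in reverse order. This is only a sketch: the input lines are made up, the for loop in the END action is standard awk syntax that has not been introduced above, and plain awk stands in for gawk.

```shell
# Save every record under its record number, then walk the array backwards.
reversed=$(printf 'one\ntwo\nthree\n' |
           awk '{ x[NR] = $0 } END { for (i = NR; i >= 1; i--) print x[i] }')
echo "$reversed"
```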

You can find some on-line documentation about awk at
http://intranet/COG/Software/gawk/gawk_toc.html  and http://www.maths.tcd.ie/pub/tutorials/awk/tutorial.html
You can also check the manual page for  gawk command. To check the manual pages, use the man command as shown below.

sydney[3]% man gawk
 
 
 

Homework Assignment
The McCabe cyclomatic complexity of a program can be defined in three different ways that yield the same result:

1. The cyclomatic complexity is the number of regions of the program flow graph.
2. Cyclomatic complexity is defined as
    V =  E - N + 2
    where E is the number of program flow graph edges and N is the number of flow graph nodes.
3. It can be also defined as
    V = P + 1
   where P is the number of predicate nodes in the program flow graph.
  For the example above:
      The flow graph has 3 regions
      V = 8 edges -  7 nodes + 2 = 3
      V = 2 predicate nodes + 1 = 3
    In practice it is enough to count the number of `if' statements, the number of `case' branches in the `switch' statements, the number of conditional expressions (expr ? ... ), and the number of loop statements `for', `while', and `do-while'.
    A number of experiments have shown that some important properties of a program (e.g. the number of errors, the effort needed to design the program, the number of tests needed to cover all branches of the program, etc.) are correlated with McCabe
complexity.

        The task for this laboratory and the next one is to design a program that, given the text of a C program, computes the McCabe complexity M of that program, counts the number of source text lines N in the program, and also counts the number of occurrences of each of the 'if', 'for', 'case', '?', 'while', and 'do-while' expressions. Also calculate the Relative McCabe Complexity, which is defined as the ratio of the McCabe complexity to the total number of lines.

Note: Ignore multiple occurrences of keywords on the same line.

An example C program for the McCabe complexity is given.

For this Laboratory Assignment do the following:
 

