Some of the metacharacters used by ed, vi, grep, and sed (note the
different meaning of * and ?):
.     Matches any single character except newline
*     Matches any number (or none) of the single character that
      immediately precedes it; for example, since '.' means any character,
      '.*' means match any number of any characters.
^     Matches the following regular expression at the beginning of the line
$     Matches the preceding regular expression at the end of the line
[ ]   Matches any one of the enclosed characters
+     Matches one or more instances of the preceding regular expression
?     Matches zero or one instance of the preceding regular expression
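For example (a minimal sketch; essay1 is simply a sample file name), the
anchors can be combined with the other metacharacters:
grep '^The' essay1
prints every line of essay1 that begins with the string The, and
grep '^$' essay1
prints every empty line.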
sydney:[4] grep html index.html
<META HTTP-EQUIV="Content-Type"
CONTENT="text/html; charset=iso-8859-1">
<P><FONT SIZE=+1><FONT COLOR="#66FFFF">
</FONT><FONT COLOR="#CCFFFF"><A HREF="cs272/index.html">cs272</A></FONT></FONT>
<BR>M <FONT COLOR="#FFFFCC"><A
HREF="ndr.html">My class</A></FONT>
<P><FONT COLOR="#CCFFFF"><FONT
SIZE=+2> </FONT><FONT SIZE=+1><A HREF="pict.html">Nature
sydney:[5]
Grep can be used simply as a way of searching, especially through multiple files. It can also be used to search for word or string occurrences that need to be replaced.
For example, the command
grep if sample.c
prints all the lines that contain the string "if"
in the file "sample.c".
Grep allows the string argument to be specified as a regular
expression, which is a way of specifying a string that allows certain metacharacters
(special keyboard characters such as the period) to stand for other characters
or to further define the way the pattern matching should work. For example:
grep '.*hood' essay1
would search the file essay1 and display every line containing the string hood. The period (dot) and the asterisk (*) indicate that any number of characters (including none) may precede the string hood. (The quote marks around the string argument in this example are optional, but quotes are required when the argument is a phrase or contains a blank. For example, grep " R " filename explicitly searches for an R with a space on both sides.)
An example is shown below:
sydney:[13] grep '.*intf' maxline.c
printf ("Hello!
This is an output of the program maxline.c\n");
fprintf (stdout,"%s\n", save);
sydney:[14]
In the above example, all lines containing the string intf in the file maxline.c are printed on the screen.
For more detailed help on grep, you can look through the man page: www.elcafe.com/man/man1/grep.1.html
or type the command
sydney[2]% man grep
A few more examples:
grep scanf *.*
Searches all files in the current directory (whose names contain a dot)
and prints out all lines containing the word `scanf'.
grep 'Leach' phone
Searches the file phone for all occurrences of the pattern
Leach.
Note that the single quotes are optional in this case, but to be on the
safe side single quotes should always be used around regular expressions
in grep.
grep 'Leach' phone | grep Jacob
Notice that the second grep is not followed by a filename.
The first grep takes its input from the file, extracts all the lines
that contain the pattern Leach and passes them to the next grep
through
the pipe. The second grep takes each line from the pipe and extracts
only those lines that contain the pattern Jacob. The effect
is a two level filtering process.
grep '[Bb]er' phone
The patterns `ber' and `Ber' are picked up by grep.
grep '[A-Z]er' phone
[A-Z] matches only one character. Any string that starts with a
capital letter followed by er will match this pattern.
More information about egrep can be found in its man page
by typing the command
sydney[2]% man egrep
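egrep also accepts the additional metacharacters listed above. As a minimal
sketch (reusing the phone file from the earlier examples), alternation with |
finds lines containing either name in a single pass:
egrep 'Leach|Jacob' phone
Unlike the two-grep pipeline shown earlier, which keeps only the lines that
contain both names, this command prints every line that contains at least one
of them.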
Records and Fields
gawk input is divided into records
terminated by a record separator. The default record
separator is a newline, so by default awk
processes its input a line at a time. The number of records
read so far is available in a variable named NR.
Each input record is considered
to be divided into fields. Fields are normally
separated by white space -- blanks or
tabs -- but the input field separator may be changed, as described below.
Fields are referred to as $1, $2, and so forth,
where $1 is the first field, and $0 is the whole input
record itself. The number of fields in the current
record is available in a variable named NF. The variable FS
refers to the field separator, which is whitespace by default.
The variable RS refers to the record separator, which is by default
a newline. The variables FS and RS may be changed at any time
to any single character.
So by default gawk uses blank characters (spaces or tabs) as field separators
unless instructed to use another character.
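As a minimal sketch of how these built-in variables fit together (the file
name data.txt is only a placeholder), the following command prints the record
number, the number of fields, and the first field of every input line:
gawk '{ print NR, NF, $1 }' data.txt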
Pattern-Action Pairs.
The gawk language has a particular format for all instructions. Each
command is composed of two parts: a pattern and a corresponding action.
Whenever a pattern is matched, gawk executes the corresponding action.
With gawk, the patterns to be matched are enclosed in a pair of slashes, and the actions are in a pair of braces:
/pattern1/{action1}
/pattern2/{action2}
This format makes it quite easy to tell where a pattern starts and ends,
and where the action starts and ends. All gawk programs are sets of
pattern-action pairs. Suppose that you are working on a text file. A
typical pattern might match a set of strings, and the action
might be to print out parts of the line that matched.
The pattern-action pairs are enclosed in single quotes in order
to differentiate them from the file name.
The general gawk command is
gawk '/pattern/{action}' filename
For example, consider the following:
gawk '/UNIX/{print $2}' file2.data
This command searches file2.data line by line, looking for
the string UNIX. If it finds UNIX, it prints the
second column of that line (record).
You can combine more than one pattern-action pair in a command. For example, the command
gawk '/scan/{print $1} /rumor/{print $2}' file1
reads file1 line by line and applies both pattern-action pairs to each line: for every line containing the pattern "scan" it prints the first field, and for every line containing the pattern "rumor" it prints the second field.
Simple Patterns.
Gawk numbers all the fields in a record. The first field is $1, the second is $2, and so on. The entire record is $0. Gawk allows you to omit the $0 in simple commands. Similarly, if there is no action, gawk prints the entire line without change.
Therefore, each of the following instructions results in the same output.
gawk '/parker/{print $0}' file1
gawk '/parker/{print} ' file1
gawk '/parker/' file1
Suppose that you want to do more than match a simple character string. For example,
gawk '$2== "foo" { print $3}' testfile
instructs gawk to compare the second field ($2) of each record in testfile and check whether it is equal to the string foo. If it is, gawk prints the third field ($3). There are several comparison operators:
==    Equal to
!=    Not equal to
>     Greater than
<     Less than
>=    Greater than or equal to
<=    Less than or equal to
For example
gawk '$4 > 100' testfile
will display every line in testfile in which the value in the fourth column is greater than 100.
Similarly, the normal arithmetic operators are available: addition (+), subtraction (-), multiplication (*), division (/), exponentiation (^), and remainder (%).
You can combine column numbers and arithmetic, too.
{print $3/2}
This action divides the number in the third column by 2.
Some arithmetic examples are:
{print $1+$2*$3}
{print ($1+$2)*$3}
Strings are surrounded by double quotes ("").
gawk '$1 != "Tim" {print}' testfile
This command prints any line in testfile that doesn't have the word Tim in the first column.
You can place strings in the print command too:
gawk '$1 != "Tim" { print "the entry for", $1, "is not Tim.", $2 }' testfile
In some files the separator may not be a space, as in the /etc/passwd file where the separator is the colon (:).
You can change the field separator on the gawk command line by using the -F option followed by the separator you want to use:
gawk -F":" '/parker/{ print }' /etc/passwd
This command changes the field separator to a colon and searches the
/etc/passwd file for lines containing the string parker.
The new field separator is put in quotation marks to avoid any confusion.
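As a further minimal sketch of the -F option (with no pattern, so every line
is processed), the following command prints the first colon-separated field,
that is, the login name, of every entry in /etc/passwd:
gawk -F":" '{ print $1 }' /etc/passwd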
Output may be diverted to multiple files; the program
{ print $1 >"foo1"; print $2 >"foo2" }
writes the first field, $1, on the file foo1, and
the second field on file foo2. The >> notation can also be
used:
print $1 >>"foo"
appends the output to the file foo. (In each
case, the output files are created if necessary). The file name can
be a variable
or a field as well as a constant; for example,
print $1 >$2
uses the contents of field 2 as a file name.
Naturally there is a limit on the number of output
files; currently it is 10.
If the data is not stored in a file, it can be supplied to gawk
through a pipe (|).
For example, the output of the UNIX "who" command has several
whitespace-separated fields (user name, terminal, login date and time, and possibly the remote host):
sydney[2]% who
estahl tty4 Jan 25 11:57
ppesara ttyp1 Jan 25 13:35 (bonobo.CS.NMSU.E)
The command
sydney[4]% who |gawk '{print $1,$5}'
estahl 11:57
ppesara 13:35
BEGIN and END
The special pattern BEGIN matches
the beginning of the input, before the first record
is read. The pattern END matches
the end of the input, after the last record has been processed.
BEGIN and END thus provide a way to gain control before and after
processing, for initialization and wrap-up.
As an example, the field separator can be set to a colon by
BEGIN { FS = ":" }
... rest of program ...
Or the input lines may be counted by
END { print NR }
If BEGIN is present, it must be the first pattern; END
must be the last if used.
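Putting the two together (a minimal sketch; /etc/passwd is used only as a
convenient input file), the following command sets the field separator before
any input is read and reports the number of records after the last one:
gawk 'BEGIN { FS = ":" } END { print NR, "records" }' /etc/passwd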
After you have saved all of the pattern-action pairs in a file, they are run by gawk with the -f option on the command line:
gawk -f script filename
A gawk program stored in a file like this is called a script.
Regular Expressions
gawk regular expressions include the regular expression
forms found in the UNIX text editor ed and in grep.
In addition, gawk allows parentheses for grouping, | for alternatives,
+ for ``one or more'', and ? for ``zero or one'',
all as in lex. Character classes may be abbreviated:
[a-zA-Z0-9] is the set of all letters and digits. As an example,
the awk program
/[Aa]ho|[Ww]einberger|[Kk]ernighan/
will print all lines which contain any
of the names Aho , Weinberger or Kernighan,
whether capitalized or not.
Regular expressions (with the extensions listed
above) must be enclosed in slashes, just as in ed and sed.
Within a regular
expression, blanks and the regular expression metacharacters
are significant. One can also specify that
any field or variable matches a regular expression (or does not match it)
with the operators ~ and !~. The program
$1 ~ /[jJ]ohn/
prints all lines where the first
field matches john or John. Notice that this will
also match Johnson, St Johnsbury, and so
on. To restrict it to exactly [jJ]ohn, use
$1 ~ /^[jJ]ohn$/
The caret ^ refers to the beginning of a line or field; the dollar
sign $ refers to the end.
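As another minimal sketch of the match operator (the file phone is reused
from the grep examples purely for illustration), the following program prints
every record whose second field consists entirely of digits:
gawk '$2 ~ /^[0-9]+$/ { print }' phone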
Built-in Functions
gawk provides a length function
to compute the length of a string of characters. This program prints
each record, preceded by its length:
{print length, $0}
length by itself is a pseudo-variable which yields the length
of the current record; length (argument) is a
function which yields the length of its argument, as in the equivalent
{print length($0), $0}
The argument may be any expression.
gawk also provides the arithmetic functions sqrt,
log, exp, and int, for square root, base e logarithm,
exponential, and integer part of their respective arguments. The
name of one of these built-in functions, without argument or parentheses,
stands for the value of the function on the whole record. The
program
length < 10 || length > 20
prints lines whose length is less than 10 or greater than 20.
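For instance (a minimal sketch; numbers.txt is a hypothetical file with one
number per line), the arithmetic functions can be combined in a single action:
gawk '{ print $1, sqrt($1), int(sqrt($1)) }' numbers.txt
This prints each value together with its square root and the integer part of
that square root.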
The function sprintf(f, e1, e2, ...) produces the
value of the expressions e1, e2, etc., in the printf format specified
by f. Thus, for example,
x = sprintf("%8.2f %10ld", $1, $2)
sets x to the string produced by formatting the values
of $1 and $2.
Variables, Expressions, and Assignments
gawk variables take on numeric (floating
point) or string values according to context. For example,
in
x = 1
x is clearly a number, while in
x = "smith"
it is clearly a string. Strings are converted
to numbers and vice versa whenever context demands it. For instance,
x = "3" + "4"
assigns 7 to x. Strings which cannot be interpreted as
numbers in a numerical context will generally have numeric
value zero,
but it is unwise to count on this behavior.
Arithmetic is done internally in floating point. The arithmetic
operators are +, -, *, /, and % (mod). The C increment ++ and
decrement -- operators are also available, and so are
the assignment operators +=, -=, *=, /=, and %=. These operators
may
all be used in expressions.
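As a minimal sketch of variables and the assignment operators (numbers.txt is
again a hypothetical input file), the following program accumulates the first
field of every record and prints the total at the end:
gawk '{ sum += $1 } END { print "total:", sum }' numbers.txt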
Field Variables
Fields in gawk share
essentially all of the properties of variables -- they may be used in arithmetic
or string operations, and may be assigned to.
Thus one can replace the first field with a sequence number like this:
{ $1 = NR; print }
or accumulate two fields into a third, like this:
{ $1 = $2 + $3; print $0 }
Field references may be numerical expressions, as in
{ print $i, $(i+1), $(i+n) }
Whether a field is deemed numeric or string depends on
context; in ambiguous cases like
if ($1 == $2) ...
fields are treated as strings.
Arrays
Array elements are not declared; they
spring into existence by being mentioned. Subscripts may
have any non-null value,
including non-numeric strings. As an example of a
conventional numeric subscript, the statement
x[NR] = $0
assigns the current input record to the NR-th element
of the array x.
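As a minimal sketch of non-numeric subscripts (file1 is reused from the
earlier examples), the following program counts how many times each
first-field value occurs and prints the counts at the end:
gawk '{ count[$1]++ } END { for (word in count) print word, count[word] }' file1
Here count is an associative array whose subscripts are the strings found in
the first field.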
You can find some on-line documentation about awk at
http://intranet/COG/Software/gawk/gawk_toc.html
and http://www.maths.tcd.ie/pub/tutorials/awk/tutorial.html
You can also check the manual page for gawk
command. To check the manual pages, use the man command as shown
below.
sydney[3]% man gawk
1. The number of regions of the program flow graph corresponds
to the cyclomatic complexity.
2. Cyclomatic complexity is defined as
V = E - N + 2
where E is the number of program flow graph edges
and N is the number of flow graph nodes.
3. It can also be defined as
V = P + 1
where P is the number of predicate nodes in the program flow graph.
For the example above:
The flow graph has 3 regions
V = 8 edges - 7 nodes + 2 = 3
V = 2 predicate nodes + 1 = 3
In practice it is enough to count the number of
`if' statements, the number of `case' branches in the `switch' statements,
the number of conditional expressions (expr ? ... ), and the number of
loop statements `for', `while', `do-while'.
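To connect this back to gawk (a rough, hedged sketch only: it matches the
keywords as plain strings, ignores comments, and counts at most one occurrence
per line, as the assignment below allows), a predicate count, and therefore
V = P + 1, could be approximated like this:
gawk '/if|for|while|case|[?]/ { p++ } END { print "predicates:", p+0, "V =", p+0+1 }' sample.c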
It has been shown by a number of experiments that
some important properties of a program (e.g. the number of errors,
the effort needed to design the program, the number of tests needed to
cover all branches of the program, etc.) are correlated with McCabe
complexity.
The task for this laboratory and the next one is to design a program that, given the text of a C program, computes the McCabe complexity M of that program, counts the number of source text lines N in the program, and also counts the number of occurrences of each of the 'if', 'for', 'case', '?', 'while', and 'do-while' expressions. Also calculate the relative McCabe complexity, which is defined as the ratio of the McCabe complexity to the total number of lines.
Note: Ignore multiple occurrences of keywords on the same line.
An example C program for the McCabe complexity is given.
For this Laboratory Assignment do the following: