Next: Creating output data files Up: Input/Output Previous: Reading input data into Contents Index

How to use DCGs

In this section we describe the use of tokenizers and DCG (definite clause grammar) to produce a very flexible input system. The input routine of the NDI mapper of RiskWise is completely implemented with this method, and we use some of the code in the examples below.

In this approach the data input is split into two parts, a tokenizer and a parser. The tokenizer read the input and splits it into tokens. Each token corresponds to one field in a data item. The parser is used to recognize the structure of the data and to group all data belonging to one item together.

Using these techniques to read data files is a bit of overkill, they are much more powerful and are used for example to read ECLiPSe terms themselves. But, given the right grammar, this process is very fast and extremely easy to modify and extend.

The top level routine read_file opens the file, calls the tokenizer, closes the file and starts the parser. We assume here that at the end of the parsing the complete input stream has been read (third argument in predicate phrase). Normally, we would check the unread part and produce an error message.

:-mode read_file(++,-).
read_file(File,Term):-
        open(File,'read',S),
        tokenizer(S,1,L),
        close(S),
        phrase(file(Term),L,[]).

The tokenizer is a bit complicated, since our NDI data format explicitly mentions end-of-line markers, and does not distinguish between lower and upper case spelling. Otherwise, we might be able to use the built-in tokenizer of ECLiPSe (predicate read_token).

The tokenizer reads one line of the input at a time and returns it as a string. After each line, we insert a end_of_line(N) token into the output with N the current line number. This can be used for meaningful error messages, if the parsing fails (not shown here). We then split the input line into white space separated strings, eliminate any empty strings and return the rest as our tokens.

The output of the tokenizer will be a list of strings intermixed with end-of-line markers.

:-mode tokenizer(++,++,-).
tokenizer(S,N,L):-
        read_string(S,'end_of_line',_,Line),
        !,
        open(string(Line),read,StringStream),
        tokenizer_string(S,N,StringStream,L).
tokenizer(_S,_N,[]).

tokenizer_string(S,N,StringStream,[H|T]):-
        non_empty_string(StringStream,H),
        !,
        tokenizer_string(S,N,StringStream,T).
tokenizer_string(S,N,StringStream,[end_of_line(N)|L]):-
        close(StringStream),
        N1 is N+1,
        tokenizer(S,N1,L).

non_empty_string(Stream,Token):-
        read_string(Stream, " \t\r\n", _, Token1),
        (Token1 = "" ->
            non_empty_string(Stream,Token)
        ;
            Token = Token1
        ).

We now show an example of grammar rules which define one data file of the NDI mapper, the RouterTrafficSample file. The grammar for the file format is shown below:

file              := <file_type_line> 
                     <header_break>
                     [<data_line>]* 
file_type_line    := RouterTrafficSample <newline> 
header_break      := # <newline> 
data_line         := <timestamp> 
                     <router_name> 
                     <tcp_segments_in> 
                     <tcp_segments_out> 
                     <udp_datagrams_in> 
                     <udp_datagrams_in> 
                     <newline> 
timestamp         := <timestamp_ms> 
router_name       := <name_string>  
tcp_segments_in   := integer 
tcp_segments_out  := integer 
udp_datagrams_in  := integer 
udp_datagrams_out := integer

This grammar translates directly into a DCG representation. The start symbol of the grammar is file(X), the argument X will be bound to the parse tree for the grammar. Each rule uses the symbol --> to define a rule head on the left side and its body on the right. All rules for this particular data file replace one non-terminal symbol with one or more non-terminal symbols. The argument in the rules is used to put the parse tree together. For this data file, the parse tree will be a term router_traffic_sample(L) with L a list of terms router_traffic_sample(A,B,C,D,E,F) where the arguments are simple types (atoms and integers).

file(X) --> router_traffic_sample(X).

router_traffic_sample(router_traffic_sample(L)) --> 
        file_type_line("RouterTrafficSample"),
        header_break,
        router_traffic_sample_data_lines(L).

router_traffic_sample_data_lines([H|T]) --> 
        router_traffic_sample_data_line(H), 
        !,
        router_traffic_sample_data_lines(T).
router_traffic_sample_data_lines([]) --> [].

router_traffic_sample_data_line(
            router_traffic_sample(A,B,C,D,E,F)) --> 
        time_stamp(A),
        router_name(B),
        tcp_segments_in(C),
        tcp_segments_out(D),
        udp_datagrams_in(E),
        udp_datagrams_out(F),
        new_line.

tcp_segments_in(X) --> integer(X).

tcp_segments_out(X) --> integer(X).

udp_datagrams_in(X) --> integer(X).

udp_datagrams_out(X) --> integer(X).

Note the cut in the definition of router_traffic_sample_data_lines, which is used to commit to the rule when a complete data line as been read. If a format error occurs in the file, then we will stop reading at this point, and the remaining part of the input will be returned in phrase.

The following rules define the basic symbols of the grammar. Terminal symbols are placed in square brackets, while additional Prolog code is added with braces. The time_stamp rule for example reads one token X. It first checks that X is a string, then converts it to a number N, and then uses a library predicate eclipse_date to convert N into a date representation Date: 2003/04/29 17:49:48, which is returned as the parse result.

file_type_line(X) --> ndi_string(X), new_line.

header_break --> 
        ["#"],
        new_line.

router_name(X) --> name_string(X).

time_stamp(Date) --> 
        [X],
        {string(X),
         number_string(N,X),
         eclipse_date(N,Date)
        }.

integer(N) --> [X],{string(X),number_string(N,X),integer(N)}.

name_string(A) --> ndi_string(X),{atom_string(A,X)}.

ndi_string(X) --> [X],{string(X)}.

new_line --> [end_of_line(_)].

These primitives are reused for all files, so that adding the code to read a new file format basically just requires some rules defining the format.

Next: Creating output data files Up: Input/Output Previous: Reading input data into Contents Index

Warwick Harvey
2004-08-07