NAME

re_graph.pl - Graph regular expression


SYNOPSIS

re_graph.pl [-d] [-f re_file] [-P] [regular expression] [regular expression] [...]


DESCRIPTION

The re_graph.pl program graphs regular expressions. The guts of the regular expression engine is a simple state machine. The various states and operations in the regular expression parser can be displayed using a surprisingly simple diagram.

A few notes on what you are looking at:

The nodes Start and Stop denote the beginning and end of the regular expression.

The solid squares denote atoms. Lines indicate the next state. When a line splits, the state machine will take the top line first. If it's path is blocked it will backup and take the next lower line. This is repeated until it finds a path to the end or all paths are exhausted.

Brown boxes indicate a grouping operation, i.e. ().

Green boxes indicate a zero with test. The state machine will perform the test inside the box before moving ahead.

For more information, see the tutorial below.


OPTIONS

-d
Turn on debugging. The debugging output is printed on the console as regular expressions are compiled.

-f file
Specify a file containing a list of regular expressions, one per line, to be put in the regular expression list.

-P
Enable the Print button. This button doesn't work correctly right now, but you can play with it if you want. When Print is pressed, the results are written into a Postscript file named tmp.ps.

Note: If no regular expressions are specified, a list consisting of the items in the tutorial is loaded.


GUI CONTROLS

Next
Displays the next regular expression in the list.

Previous
Displays the previous regular expression in the list.

Regular Expression Blank
This blank contains the regular expression being graphed.

List
Pops up a window containing a list of expressions. You can select an expression from this list and press OK to graph it. You can also input a regular expression in the blank at the bottom of the window and press New to add it to the list.

Exit
Exits the program.

Print
If it worked, it would print the current graph. But since it is broken it won't even show up unless you put the -P switch on the command line.

When pressed it creates a file called tmp.ps which should contain a postscript version of your graph. (It has a few problems that prevent the output from being what you want.)


TUTORIAL

This tutorial shows what happens when a set of sample regular expressions are graphed. This set of regular expressions closely follows the Chapter 4 of ``Perl for C Programmers'' by Steve Oualline.

The set of regular expressions used for this tutorial is:

    test
    ^Test
    ^ *#
    ^[ \t]*#
    ^\s*#
    ([^#]*)(#.*)
    a|b
    ^(([^#"]*|".*")*)(#.*)
    ^((?:[^#"]*|".*?")*)(#.*)
    ^((?:[^#"]*|".*?(?<!\\)")*)(#.*)

Let's take a look at the graphs produced by these expressions.

/test/
This is a very simple expression. It matches ``test'' anywhere on the line. If you look at the graph of this expression, it consists of three nodes ``Start'', ``EXACT <test>'' and ``END''.

The ``Start'' node indicates the start of the regular expression.

The ``EXACT <test>'' node tells the engine that the text must match the text ``test'' exactly.

The ``END'' node indicates the end of the regular expression. If you reach the ``END'' node, a successful match was found.

Flow is a straight line from ``Start'', through the ``EXACT'' check, to end.

/^Test/
A new item was added with this expression, an anchor. It's named BOL (Beginning of line) and shows up as an additional node.

/^ *#/
Now we start having fun. This expression matches anything that consists of a start of line (^), a bunch of spaces ( *), and a sharp (#).

The way the state machine works it that it starts at ``Start'' and works it's way through the nodes. You'll notice that between ``BOL'' and ``EXACT < >'' there's a fork in the road.

The state machine will take the top branch if possible. So if the next character is a space, the system will take the top branch and match the ``EXACT < >'' node. If not, the bottom branch is taken and we wind up at the ``EXACT <#>'' node.

If there's no path to the ``END'', then we don't have a match.

/^[ \t]*#/
This is the same as the previous example, except the space was replaced by a character set. We call the set ``space and tab''. The system translates this into ``\11\40''. It's the same thing, suitable obfuscated for computer work.

/^\s*#/
Again, the middle as been replace by another token. In this case it's the SPACE token which matches any whitespace.

/([^#]*)(#.*)/
This expression introduces us to the grouping operators. They show as the big brown boxes.

The other change is that we use the expression [^#], which matches anything except a hash mark. Perl changes this to a ``ANYOF'' clauses which matches all characters except the single one we don't want.

Note: This ANYOF node overflows the size of the box. This is a know bug.

/a|b/
Now we introduce the concept of a selection of two different atoms. Note that the branch arrows are drawn smaller to make them stand out.

/^(([^#``]*|''.*``)*)(#.*)/
See the book for what this regular expression tries to match.

This expression adds nested grouping, and some additional stuff that we've seen before.

/^((?:[^#``]*|''.*?``)*)(#.*)/
This is like the previous example, except what was the $2 grouping has been replaced by the ``Group no $'' operator (?:...). Notice that the line around the second group has disappeared and what was $3 is now $2.

(In future versions of this graphing tool we will graph the invisible group operator. We just did figure out how to do it yet.)

Also notice the use of the ``*?'' operator. Remember when going through the nodes, when a branch is encountered, the system will try and take the top one first.

/^((?:[^#``]*|''.*?(?<!\\)``)*)(#.*)/
The grand finale. One new type of node has been introduced: (?<!\\). This is the negative look-behind. It's the red box on the screen. When the state machine sees this, it matches the text behind the current location marker against the indicated text and if it fails then a match against the next node is possible. (Boy does this not translate into English well.)

Basically the clause in question looks for a double quoted string (``xxx''), but will ignore a double quote it's escaped (``xxx\''yyy``).


BUGS / LIMITATIONS

This will not graph all the regular expressions. Some of the more advanced features of the engine are just not handled.

We currently ``graph'' the ``group, no $1'' (?:..) operator by displaying nothing. A box should be put around the expression.

The boxes drawn by the program are a fixed with not related to the size of the text they contain. Text can easily overflow the box.

The system is UNIX/Linux specific. This is caused by only one small section of code should anyone want to port this to a braindamaged operating system.

Better use of color can be made. Specifically all the nodes do not have to be green. Come to think of it they call don't have to be rectangles either.

Sometimes the lines connecting one section to another take some strange twists.

Horizontal scrolling of the regular expression could be better.


FUTURE WORK

I would like to connect this to the execution of the regular expression engine so we can see the thing in action.


LICENSE

Licensed under the GPL. (See the end of the source file for a copy).


AUTHOR

Steve Oualline (oualline@www.oualline.com)