From Citizendium
This addendum is a continuation of the article Perl.

Enhancements for readability

healing slasheritis

In the standard Unix tools such as sed, a regular expression is enclosed in a pair of slashes, i.e. '/pattern/'. A non-printing character is written with the backslash ("escape") character '\'; for example, '/\n/' represents a newline character in a pattern. Certain printing characters also need to be escaped, and the delimiter '/' itself is of course one of them. So, to match a '/', the pattern is written as '/\//'. This need is not uncommon, for example in file (path) names.

It gets confusing quickly if, for example, '\/' is to be replaced by its duplicate '\/\/'. Both the backslash and the slash need to be escaped: '/\\\//' represents the string '\/' inside a pattern definition. The "substitute" construct '$g =~ s/a/b/' (substitute 'a' by 'b') then explodes into the so-called slasheritis, '$g =~ s/\\\//\\\/\\\//', and such regular expression patterns quickly become unreadable.

Perl's solution is to allow pattern delimiters to be chosen "on the fly". After all, Perl knows that a pattern definition begins right after the operator letter ('s' for substitute, 'm' for match) that follows '=~', so why not take the next character as the delimiter? Now you can resolve the above slasheritis by writing '$g =~ s#\\/#\\/\\/#' (you still need to escape the backslash), and everything is (somewhat) clearer again. It is customary to use non-alphanumeric characters such as '!', '#' or '|' as delimiters, but since Perl knows about paired characters such as '<>' or '{}', some well-known Perl authors prefer the style '$g =~ s{\\/} {\\/\\/}', because it is even clearer.
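The three delimiter styles can be compared side by side. The following is a minimal sketch, not taken from the original article: the helper names are hypothetical, and the 'g' modifier is added here so that every occurrence is doubled rather than only the first.

```perl
use strict;
use warnings;

# Hypothetical helpers: each one doubles every '\/' found in its argument.
sub double_with_slashes {
    my ($g) = @_;
    $g =~ s/\\\//\\\/\\\//g;    # slash delimiters: every '/' must be escaped
    return $g;
}

sub double_with_hash {
    my ($g) = @_;
    $g =~ s#\\/#\\/\\/#g;       # '#' delimiters: only the backslash is escaped
    return $g;
}

sub double_with_braces {
    my ($g) = @_;
    $g =~ s{\\/}{\\/\\/}g;      # paired delimiters around pattern and replacement
    return $g;
}

my $path = 'usr\/local\/bin';   # single quotes keep the backslashes literal
print double_with_slashes($path), "\n";
print double_with_hash($path),    "\n";
print double_with_braces($path),  "\n";
```

All three subroutines perform the same substitution; only the amount of escaping differs.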

special symbols

Perl introduced a whole new flock of shortcuts for classes of characters, each usually paired with its (upper case) complement. '/\s/' stands for all "white" characters (blank, tab, newline, and a few special ones), and '/\S/' (capital 'S') stands for all non-white characters. Similarly, '/\d/' stands for numerical characters ("digit") and '/\D/' for non-digits, while '/\w/' stands for "word" characters (letters, digits, and the underscore) and '/\W/' for non-word characters. The whole list can be found in the "Camel" book [1].
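A quick way to see what each shortcut covers is to count its matches in a sample string. This is a minimal sketch; 'count_matches' and the sample text are hypothetical and serve only as illustration.

```perl
use strict;
use warnings;

# count_matches is a hypothetical helper: it counts how many times the
# pattern given in $class matches inside $text.
sub count_matches {
    my ($class, $text) = @_;
    my $count = () = $text =~ /$class/g;   # list assignment in scalar context counts the matches
    return $count;
}

my $sample = "room 42\tfloor 3\n";

# Print the number of matching characters for each class and its complement.
printf "%-3s matches %2d characters\n", $_, count_matches($_, $sample)
    for '\s', '\S', '\d', '\D', '\w', '\W';
```

Each class and its upper-case complement together account for every character of the sample.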

inline comments

Since version 5.002 a regular expression can be written with inline comments if the closing delimiter is followed by the 'x' modifier; whitespace inside the pattern is then ignored, and '#' starts a comment that runs to the end of the line. Here is a short program to eliminate comments from HTML code (by Perl author Tom Christiansen, with his original comments):

#!/usr/bin/perl -p0777
# htdecom -- remove html comments from a document
# taken from the larger striphtml program

require 5.002;

s{ <!                  # comments begin with a `<!'
                       # followed by 0 or more comments;

   (.*?)               # this is actually to eat up comments in non 
                       # random places

    (                  # not suppose to have any white space here

                       # just a quick start; 
     --                # each comment starts with a `--'
       .*?             # and includes all text up to and including
     --                # the *next* occurrence of `--'
       \s*             # and may have trailing while space
                       #   (albeit not leading white space XXX)
    )+                 # repetire ad libitum  XXX should be * not +
   (.*?)               # trailing non comment text
  >                    # up to a `>'
}{
   if ($1 || $3) {     # this silliness for embedded comments in tags
       "<!$1 $3>";
   }
}gesx;                 # mutate into nada, nothing, and niente
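The 'x' modifier pays off for much smaller patterns as well. The following is a minimal sketch that is not part of the original program: 'parse_date' is a hypothetical helper, and the year-month-day format it looks for is an assumption chosen for illustration.

```perl
use strict;
use warnings;

# parse_date is a hypothetical helper: it returns (year, month, day) for the
# first date of the assumed form YYYY-MM-DD, or an empty list if none is found.
sub parse_date {
    my ($text) = @_;
    my ($year, $month, $day) = $text =~ m{
        (\d{4})     # four-digit year
        -           # literal separator
        (\d{2})     # two-digit month
        -           # literal separator
        (\d{2})     # two-digit day
    }x or return;
    return ($year, $month, $day);
}

my @parts = parse_date("third edition published 2000-07-14");
print join('/', @parts), "\n";
```

One caveat applies to this style: the closing delimiter ends the pattern even when it appears inside a comment, so comments should avoid unbalanced delimiter characters.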


  • [1] Larry Wall, Tom Christiansen, Jon Orwant: Programming Perl (the Camel Book). O'Reilly Media, Inc.; 3rd edition (July 14, 2000). ISBN 0596000278. The standard reference.
  • [2] Jeffrey E. F. Friedl: Mastering Regular Expressions (the Owl Book). O'Reilly Media, Inc.; 3rd edition (August 8, 2006). ISBN 0596528124. All you ever need to know about regular expressions; not Perl-specific.