Perl/Addendum

From Citizendium
This addendum is a continuation of the article Perl.

Enhancements for readability

healing slasheritis

In the standard Unix tools such as sed, a regular expression is enclosed in a pair of slashes, i.e. '/pattern/'. A non-printing character is written with the backslash ("escape") character '\', e.g. '/\n/' represents a newline character in a pattern. Certain printing characters also need to be escaped, and the delimiter '/' itself is of course one of them. So, to match a '/', the pattern is written as '/\//'. Having to match slashes is not uncommon, for example in file (path) names.

It gets confusing quickly if, e.g., '\/' is to be replaced by its duplicate '\/\/'. Both the backslash and the slash need to be escaped: '/\\\//' represents the string '\/' inside a pattern definition. The "substitute" construct '$g =~ s/a/b/' (substitute 'a' by 'b') explodes into the so-called slasheritis: '$g =~ s/\\\//\\\/\\\//'. Such regular expression patterns quickly become unreadable.

Perl's solution is to allow the pattern delimiter to be chosen "on the fly". Perl knows exactly that a pattern definition begins after an operator such as 's' (substitute) or 'm' (match), so the character that follows the operator is taken as the delimiter. The above slasheritis can then be resolved by writing '$g =~ s#\\/#\\/\\/#' (the backslash still needs to be escaped), and everything is (somewhat) clearer again. It is customary to use non-alphanumeric characters such as '!', '#' or '|' as delimiters. Since Perl also knows about paired characters such as '<>' or '{}', some well-known Perl authors prefer the style '$g =~ s{\\/} {\\/\\/}', because it separates pattern and replacement even more clearly.
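The three spellings of this substitution can be compared in a minimal sketch. The variable names and the sample string 'a\/b' are assumptions made for illustration only; they do not come from the article.

```perl
use strict;
use warnings;

# Hypothetical sample string containing the two characters '\/' once.
my $original = 'a\/b';

# Slash-delimited form: every '/' and every '\' has to be escaped.
(my $slashes = $original) =~ s/\\\//\\\/\\\//;

# '#' as the delimiter: only the backslash still needs escaping.
(my $hashes = $original) =~ s#\\/#\\/\\/#;

# Paired braces around the pattern and around the replacement.
(my $braces = $original) =~ s{\\/}{\\/\\/};

# All three variables now hold the same string.
print "$_\n" for $slashes, $hashes, $braces;
```

Each line of output should read 'a\/\/b', which suggests that the choice of delimiter changes only how the substitution is written, not what it does.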

special symbols

Perl introduced a whole new flock of shortcuts for classes of characters, usually paired with an upper-case complement. For example, '/\s/' stands for all "white space" characters (blank, tab, newline, and a few special ones), and '/\S/' (capital 'S') stands for all non-white characters. Similarly, '/\d/' stands for numerical characters ("digit") and '/\D/' for non-digits, while '/\w/' stands for "word" characters (letters, digits and the underscore) and '/\W/' for non-word characters. The whole list can be found in the "Camel" book [1].
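A short sketch of how some of these shortcuts behave, following Perl's documented definitions: '\d' matches a digit, '\s' matches white space, and '\w' matches a "word" character (letter, digit or underscore). The sample text and the variable names are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical sample text: letters, digits, an underscore, a blank and a tab.
my $text = "id_42 =\t7";

my @digits     = $text =~ /\d/g;    # each single digit
my @non_digits = $text =~ /\D/g;    # every character that is not a digit
my @word_runs  = $text =~ /\w+/g;   # runs of letters, digits and underscores
my @blank_runs = $text =~ /\s+/g;   # runs of white space

print "digits: @digits\n";
print "words:  @word_runs\n";
printf "%d white space runs, %d non-digit characters\n",
    scalar @blank_runs, scalar @non_digits;
```

In list context with the 'g' modifier, a match returns every matching substring, which makes it easy to see what each class picks out of the text.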

inline comments

Since version 5.002 a regular expression can be written with inline comments, if the closing delimiter is followed by the 'x' modifier. Here is a short program to eliminate comments from HTML code (by Perl author Tom Christiansen, with his original comments):

#!/usr/bin/perl -p0777
#
# htdecom -- remove html comments from a document
# tchrist@perl.com
# 
# taken from the larger striphtml program

require 5.002;

s{ <!                  # comments begin with a `<!'
                       # followed by 0 or more comments;

   (.*?)               # this is actually to eat up comments in non 
                       # random places

    (                  # not suppose to have any white space here

                       # just a quick start; 
     --                # each comment starts with a `--'
       .*?             # and includes all text up to and including
     --                # the *next* occurrence of `--'
       \s*             # and may have trailing while space
                       #   (albeit not leading white space XXX)
    )+                 # repetire ad libitum  XXX should be * not +
   (.*?)               # trailing non comment text
  >                    # up to a `>'
}{
   if ($1 || $3) {     # this silliness for embedded comments in tags
       "<!$1 $3>";
 } 
}gesx;                 # mutate into nada, nothing, and niente
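For a smaller illustration of the same 'x' modifier, the following sketch splits a date into its parts. The input line, the date format and the variable names are assumptions chosen for this example.

```perl
use strict;
use warnings;

# Hypothetical input: an ISO-style date embedded in a sentence.
my $line = 'released on 2000-07-14 by the publisher';

# With the 'x' modifier, white space and '#' comments inside the
# pattern are ignored, so each element can sit on its own line.
my ($year, $month, $day) = $line =~ m{
    (\d{4})     # year: four digits
    -           # literal hyphen
    (\d{2})     # month: two digits
    -           # literal hyphen
    (\d{2})     # day: two digits
}x;

print "$year/$month/$day\n";
```

Because white space in the pattern is ignored under 'x', a blank that is meant to be matched literally has to be written as '\ ' or '[ ]'.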

Literature

  • [1] Larry Wall, Tom Christiansen, Jon Orwant: Programming Perl (the Camel Book). O'Reilly Media, Inc.; 3rd edition (July 14, 2000). ISBN 0596000278. The standard reference.
  • [2] Jeffrey E. F. Friedl: Mastering Regular Expressions (the Owl Book). O'Reilly Media, Inc.; 3rd edition (August 8, 2006). ISBN 0596528124. All you ever need to know about regular expressions, not Perl specific.