This file was automatically generated from http://svn.pugscode.org/pugs/docs/tutorial/ch07_grammars.pod on Sat Aug 1 14:01:20 2009 GMT, revision 27701.


Grammars and Regexes

TODO: This chapter is outdated in some ways

  * It should be explained when we use "rule" and when "regex", and what
    a "subrule" is.
  * The interpolation rules are outdated
  * some of the assertion syntax has changed, for example <foo()> means
    something different now
  * Modifiers: explain :ratchet modifier
  * The match object needs more explanation

Perl 6 "regular expressions" are so far beyond the formal definition of regular expressions that we don't use that name anymore, but simply stick to the abbreviation Regex. Perl 6 regexes bring the full power of recursive descent parsing to the core of Perl, but are comfortably useful even if you don't know anything about recursive descent parsing. In the usual case, all you'll ever need to know is that regexes are patterns for matching text.

Using Regexes

Regexes are a language within a language, with their own syntax and conventions. At the highest level, though, they're just another set of Perl constructs. So the first thing to learn about regexes is the Perl "glue" code for creating and using them.

Immediate Matches

The simplest way to create and use a regex is an immediate match. A regex defined with the m// operator always matches immediately. Substitutions, defined with the s/// operator, also match immediately. A regex defined with the // operator matches immediately when it's in void, boolean, string, or numeric context, or when it's the argument of the smart-match operator (~~).

  if $string ~~ m/ \w+ /      {...}
  if $string ~~ s/ \w+ /word/ {...}
  if $string ~~ / \w+ /       {...}
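
As a minimal sketch of an immediate match, assuming a current Rakudo compiler (the variable contents are made up for illustration), the smart-match runs the regex at once and leaves the result in $/:

```raku
my $string = 'forty-two is the answer';

# The regex is matched immediately; $/ holds the resulting match.
if $string ~~ m/ \w+ / {
    say "first word: " ~ $/;    # first word: forty
}
```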

You can substitute other delimiters, like #...#, [...], and {...} for the standard /.../, though ?...? and (...) are not valid delimiters:

  $string ~~ s/\w+/word/
  $string ~~ s[\w+][word]     # The same
  $string ~~ s{\w+}{word}     # The same
  $string ~~ s#\w+#word#      # The same
  $string ~~ s(\w+)(word)     # Wrong!
  $string ~~ s?\w+?word?      # Wrong!

Modifiers now come in front, using adverb syntax, so a substitution that replaces every match in the string is written:

  $string ~~ s:g/\w+/word/
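
A short, self-contained sketch of the global form, assuming a current Rakudo compiler; the sample string is only illustrative:

```raku
my $string = 'so long and thanks';

# :g repeats the substitution for every match, not just the first one.
$string ~~ s:g/\w+/word/;
say $string;    # word word word word
```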

Also, if you use brackets on the first part of a substitution, the second part can be specified as a pseudoassignment:

  $string ~~ s[\w+] = 'word';

This form also allows assignment operators, so if you want to add one to every number within a string, you can say:

  $string ~~ s:g[\d+] += 1;

If you want to do some processing on the match, you can call a function to prepare the replacement text too:

  $string ~~ s:g[\d+] = build_replacement()

Deferred Matches

Sometimes you want a little more flexibility than an immediate match. The rx// operator defines an anonymous regex that can be executed later.

  $digits = rx/\d+/;

The simple // operator also defines an anonymous regex in all contexts other than void, boolean, string, or numeric, or as an argument of ~~.

  $digits = /\d+/; # store regex
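
The following sketch shows a stored regex being matched later. It assumes a current Rakudo compiler, and the sample lines are hypothetical:

```raku
my $digits = rx/ \d+ /;          # stored here, not matched yet

for 'room 42', 'no numbers' -> $line {
    my $m = $line ~~ $digits;    # the match happens at this point
    say $m ?? $line ~ ': found ' ~ $m
           !! $line ~ ': nothing';
}
# room 42: found 42
# no numbers: nothing
```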

You can use the unary context forcing operators, +, ?, and ~, to force the // operator to match immediately in a context where it ordinarily wouldn't. For a boolean value of success or failure, force boolean context with ?//. For a count of matches, force numeric context with +//. For the matched string value, force string context with ~//.

  $truth  = ?/\d+/;       # match $_ and return success
  $count  = +/(\d+\s+)*/; # match $_ and return count
  $string = ~/^\w+/;      # match $_ and return string

Another option for deferred matches is a regex block. The regex keyword defines a named or anonymous regex, in much the same way that sub declares a subroutine or method declares a method. But the code within the block of a regex is regex syntax, not Perl syntax.

  $digits = regex {\d+};

  regex digits {\d+}

Two more keywords define regexes in the same way as regex, but imply slightly different behavior. token introduces a regex that does not backtrack (more details on that below; for now it's enough to know that it matches simple regexes faster), and rule is the same as token except that whitespace in the regex also matches optional whitespace in the string.
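
To see the difference between regex and token, here is a small sketch, assuming a current Rakudo compiler. The names lenient and ratcheted are hypothetical and the two bodies are identical, so only the declarator changes the outcome:

```raku
my regex lenient   { \w+ b }   # \w+ may give characters back so that "b" can match
my token ratcheted { \w+ b }   # \w+ keeps everything it consumed

say so 'aaab' ~~ / ^ <lenient>   $ /;   # True
say so 'aaab' ~~ / ^ <ratcheted> $ /;   # False
```

In the token, \w+ swallows the final "b" and never releases it, so the literal b has nothing left to match.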

To match a named or anonymous regex, call it as a subregex within another regex. Subregexes, whether they're named regexes or a variable containing an anonymous regex, are enclosed in assertion delimiters <...>. You can read more about assertions in "Assertions" later in this chapter.

  $string ~~ /\d+/;
  # same as
  $string ~~ /<$digits>/;
  $string ~~ /<digits>/;

Table 7-1 summarizes the basic Perl syntax for defining rules.

Grammars

A grammar is a collection of regexes, in much the same way that a class is a collection of methods. In fact, grammars are classes; they're just classes that inherit from the universal base class Regex. This means that grammars can inherit from other grammars, and that they define a namespace for their regexes.

  grammar Hitchhikers {
      token name { Zaphod | Ford | Arthur }
  
      token id   { \d+ }

      ...
  }

Any regex in the current grammar or in one of its parents can be called directly, but a regex from an external grammar needs to have its package specified:

  if $newsrelease ~~ / <Hitchhikers.name> / {
      send_alert($1);
  }

If you want to match against the entire grammar, you can define a regex TOP in that grammar.

  grammar Hitchhikers {
      regex TOP { <name> <id> }
      ...
  }

  $roster ~~ Hitchhikers;           # Calls Hitchhikers.TOP by default
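
A complete grammar along these lines might look like the sketch below. It assumes a current Rakudo compiler, where a grammar is run through its .parse method; that call is an assumption about today's implementation rather than the draft syntax shown above. TOP is declared as a rule here so that the space between the name and the id is allowed, and the input string is hypothetical.

```raku
grammar Hitchhikers {
    rule  TOP  { <name> <id> }
    token name { Zaphod | Ford | Arthur }
    token id   { \d+ }
}

my $m = Hitchhikers.parse('Ford 42');   # starts at TOP by default
say ~$m<name>;    # Ford
say ~$m<id>;      # 42
```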

Grammars are especially useful for complex text or data parsing. In fact, overloading grammar rules for the Perl 6 grammar itself is one way to change how a program is parsed. Instead of having to create complex custom source filters, as was necessary in Perl 5, we can overload the rules in the Perl6::Grammar grammar class to change the very syntax of Perl 6 on the fly.

Building Blocks

Every language has a set of basic components (words or parts of words) and a set of syntax rules for combining them. The "words" in regexes are literal characters (or symbols), some metacharacters (or metasymbols), and escape sequences, while the combining syntax includes other metacharacters, quantifiers, bracketing characters, and assertions.

Metacharacters

The "word"-like metacharacters are ., ^, ^^, $, $$. The . matches any single character, even a newline character. Actually, Perl 6 has the notion of a Unicode level, which determines whether string manipulation happens on the byte, codepoint, or grapheme level. The . matches a character at the current level, which defaults to grapheme. The Unicode level can be adjusted with a pragma or with modifiers. We'll talk more about modifiers in "Modifiers" later in this chapter. The ^ and $ metacharacters are zero-width matches on the beginning and end of a string. They each have doubled alternates ^^ and $$ that match at the beginning and end of every line within a string.
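
The difference between the string anchors and the line anchors can be sketched as follows, assuming a current Rakudo compiler. The .match method with :g is used only to collect every match, and the text is hypothetical:

```raku
my $text = "ford\narthur\nzaphod";

# ^ anchors at the start of the string only, ^^ at the start of every line.
say $text.match(/ ^  \w /, :g).map(*.Str).join(',');   # f
say $text.match(/ ^^ \w /, :g).map(*.Str).join(',');   # f,a,z

# . matches a newline as well.
say so "\n" ~~ / ^ . $ /;                              # True
```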

The |, &, \, #, and := metacharacters are all syntax structure elements. The | is an alternation between two options. The & matches two patterns simultaneously (the patterns must be the same length). The \ turns literal characters into metacharacters (the escape sequences) or turns metacharacters into literal characters. The # marks a comment to the end of the line. Whitespace insensitivity (the old /x modifier) is on by default, so you can start a comment at any point on any line in a regex. Just make sure you don't comment out the symbol that terminates the regex. The := binds a hypothetical variable to the result of a subregex or grouped pattern. Hypotheticals are covered in "Hypothetical Variables" later in this chapter.

The metacharacters (), [], {} and <> are bracketing pairs. The pairs always have to be balanced within the regex, unless they are literal characters (escaped with a \). The brackets () and [] group patterns to match as a single atom. They're often used to capture a result, mark the boundaries of an alternation, or mark a group of patterns with a quantifier, among other things. Parentheses () are capturing and square brackets [] are non-capturing. The {} brackets define a section of Perl code (a closure) within a regex. These closures are always a successful zero-width match, unless the code explicitly calls the fail function. The <...> brackets mark assertions, which handle a variety of constructs including character classes and user-defined quantifiers. Assertions are covered in "Assertions" later in this chapter.
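
A brief sketch of the difference between the two grouping brackets, assuming a current Rakudo compiler and a made-up input string:

```raku
if 'zaphod42' ~~ / ( \D+ ) [ \d+ ] / {
    say $/.list.elems;   # 1, because only the () group captures
    say ~$0;             # zaphod
    say ~$/;             # zaphod42
}
```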

Table 7-2 summarizes the basic set of metacharacters.

Escape Sequences

The escape sequences are literal characters acting as metacharacters, marked with the \ escape. Some escape sequences represent single characters that are difficult to represent literally, like \t for tab, or \x[...] for a character specified by a hexadecimal number. Some represent limited character classes, like \d for digits or \w for word characters. Some represent zero-width positions in a match, like \b for a word boundary. With all the escape sequences that use brackets, (), {}, and <> work in place of [].

Note that since an ordinary variable now interpolates as a literal string by default, the \Q escape is rarely needed. An interpolated array is interpreted as an alternation of all array elements.

Table 7-3 shows the escape sequences for regexes.

Quantifiers

Quantifiers specify the number of times an atom (a single character, metacharacter, escape sequence, grouped pattern, assertion, etc.) will match.

The numeric quantifiers use the ** operator followed by the number of desired matches. For a range of matches you can use a closure that returns a range: a**{2..4} matches two to four a's, and a**{2..Inf} matches two or more a's.

Each quantifier has a minimal alternate form, marked with a trailing ?, that matches the shortest possible sequence first.
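
As a rough illustration, assuming a current Rakudo compiler: the range below is written in the literal form a ** 2..4 that Rakudo accepts, rather than the closure form described above, and the minimal variant is shown with +?.

```raku
say ~('aaaaa' ~~ / a ** 2..4 /);   # aaaa   (at most four)
say ~('aaaaa' ~~ / a+  /);         # aaaaa  (greedy)
say ~('aaaaa' ~~ / a+? /);         # a      (minimal)
```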

Table 7-4 shows the built-in quantifiers.

Assertions

In general, an assertion simply states that some condition or state is true and the match fails when that assertion is false. Many different constructs with many different purposes use assertion syntax.

Assertions match named and anonymous regexes, arrays or hashes containing anonymous regexes, and subroutines or closures that return anonymous regexes. You have to enclose a variable in assertion delimiters to get it to interpolate as an anonymous rule or rules. A bare scalar in a pattern interpolates as a literal string, while a scalar variable in assertion brackets interpolates as an anonymous rule. A bare array in a pattern matches as a series of alternate literal strings, while an array in assertion brackets interpolates as a series of alternate anonymous rules. In the simplest case, a bare hash in a pattern matches a word (\w+) and tries to find that word as one of its keys, while a hash in assertion brackets does the same, but then also matches the associated value as an anonymous rule.

A bare closure in a pattern always matches (unless it calls fail), but a closure in assertion brackets <{...}> must return an anonymous rule, which is immediately matched.

An assertion with parentheses <(...)> is similar to a bare closure in a pattern in that it allows you to include straight Perl code within a rule. The difference is that <(...)> evaluates the return value of the closure in boolean context. The match succeeds if the return value is true and fails if the return value is false.

Assertions match character classes, both named and enumerated. A named rule character class is often more accurate than an enumerated character class. For example, <[a-zA-Z]> is commonly used to match alphabetic characters, but generally what's really needed is the built-in rule <alpha> which matches the full set of Unicode alphabetic characters.
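
The point about accuracy can be checked with a single accented letter. This sketch assumes a current Rakudo compiler, which expects the enumerated range to be written with .. instead of a hyphen:

```raku
say so 'é' ~~ / ^ <[a..zA..Z]> $ /;   # False: not in the ASCII ranges
say so 'é' ~~ / ^ <alpha>      $ /;   # True:  a Unicode alphabetic character
```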

Table 7-5 shows the syntax for assertions.

Modifiers

Modifiers alter the meaning of the pattern syntax. The standard position for modifiers is at the beginning of the rule, right after the m, s, or rx, or after the name in a named rule. Modifiers cannot attach to the outside of a bare /.../. For example:

  m:i /marvin/ # case insensitive
  rule names :i { marvin | ford | arthur }

Multiple modifiers can be chained, and short and long names can be mixed:

  m:s :i :g/ zaphod /
  m:sigspace :i :global / zaphod /

Modifiers can be negated with the :!pair notation, so :!i forces case-sensitive matching.

Most of the modifiers can also go inside the rule, attached to the rule delimiters or to grouping delimiters. Internal modifiers are lexically scoped to their enclosing delimiters, so you get a temporary alteration of the pattern:

  m/:s I saw [:i zaphod] / # only 'zaphod' is case insensitive
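
A small sketch of that scoping, assuming a current Rakudo compiler. Whitespace is spelled out with \s+ here so that the example depends only on the :i modifier, and the input strings are hypothetical:

```raku
my $re = rx/ I \s+ saw \s+ [:i zaphod] /;

say so 'I saw ZAPHOD' ~~ $re;   # True:  :i applies inside the brackets
say so 'i saw zaphod' ~~ $re;   # False: "I" outside the brackets is still case-sensitive
```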

The repetition modifiers (:x, :th, :global, and :exhaustive) and the continue modifier (:cont) can't be lexically scoped, because they alter the return value of the entire rule.

The :x modifier matches the rule a counted number of times. If the modifier expects more matches than the string has, the match fails. It has an alternate form :x() that can take a variable in place of the number.

The :global modifier matches as many times as possible. The :exhaustive modifier also matches as many times as possible, but in as many different ways as possible.

The :th modifier preserves one result from a particular counted match. If the rule matches fewer times than the modifier expects, the match fails. It has several alternate forms. One form--:th()--can take a variable in place of the number. The other forms--:st, :nd, and :rd--are for cases where it's more natural to write :1st, :2nd, :3rd than it is to write :1th, :2th, :3th. Either way is valid, so pick the one that's most comfortable for you.
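
The repetition modifiers can be tried out with the sketch below. It assumes a current Rakudo compiler and passes the modifiers as named arguments to the .match method; note that Rakudo spells the general counted form :nth, which is an assumption relative to the :th spelling used in this chapter.

```raku
my $s = 'a1 b2 c3';

say $s.match(/ \d /, :g).elems;                        # 3
say ~$s.match(/ \d /, :nth(2));                        # 2
say $s.match(/ \d /, :x(2)).map(*.Str).join(',');      # 1,2
```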

By default, rules ignore literal whitespace within the pattern. The :s or :sigspace modifier makes rules sensitive to literal whitespace, but in an intelligent way. Any cluster of literal whitespace acts like an explicit \s+ when it separates two identifiers and \s* everywhere else.

More specifically, any literal whitespace in the regex is translated to an implicit call to <.ws>, where the ws rule matches as described above but can also be overridden by the user.
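
The behavior of significant whitespace can be sketched like this, assuming a current Rakudo compiler and its default ws rule; the strings are made up:

```raku
# Between two words, the whitespace in the pattern requires whitespace in the string.
say so 'zaphod  beeblebrox' ~~ m:s/zaphod beeblebrox/;   # True
say so 'zaphodbeeblebrox'   ~~ m:s/zaphod beeblebrox/;   # False

# Next to punctuation, the whitespace is optional.
say so 'a+b'   ~~ m:s/a '+' b/;                          # True
say so 'a + b' ~~ m:s/a '+' b/;                          # True
```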

There are no modifiers to alter whether the matched string is treated as a single line or multiple lines. That's why the "beginning of string" and "end of string" metasymbols have "beginning of line" and "end of line" counterparts.

Table 7-6 shows the current list of modifiers.

Substitution Modifiers

Special modifiers are available for substitutions that do not make sense on normal matches.

The :samecase modifier (short form :ii) implies the :ignorecase modifier, but also carries the case information over on a character-by-character basis:

   my $s = 'The Quick Brown Fox';
   $s ~~ s:ii/brown/blue/;
   say $s;           # The Quick Blue Fox

If the :sigspace modifier is also present, a slightly more intelligent algorithm is used. If the source string follows one of the case patterns in $table (XXX: make that a proper cross-link), that pattern is recognized and applied onto the substitution string.

   $_ = 'All Words Capitalized';
   s:s:ii/.*/other words/;
   .say;             # Other Words

There's a shortcut for s:s named ss, so you could have written the example above as ss:ii/.*/other words/.

A similar modifier is :sameaccent (short :aa). Instead of carrying case information, it carries accent and marking information.

   my $stuff = 'Möhre';
   $stuff ~~ s:aa/o/a/;
   say $stuff;          # Mähre

The third substitution modifier is :samespace, short :ss. It preserves whitespace that is matched by implicit <.ws> rules:

   my $s = "Some   white\t\n spaces";
   $s ~~ s:ss/\w+ \w+ \w+/Completely different text/;
   # $s is now "Completely   different\t\n text"

Built-in Rules

A number of named rules are provided by default, including a complete set of POSIX-style classes, and Unicode property classes. The list isn't fully defined yet, but Table 7-7 shows a few you're likely to see.

The <null> rule matches a zero-width string (so it's always true) and <prior> matches whatever the most recent successful rule matched. These replace the two behaviors of the Perl 5 null pattern //, which is no longer valid syntax for rules.

Backtracking Control

Backtracking is triggered whenever part of the pattern fails to match. You can also explicitly trigger backtracking by calling the fail function within a closure. Table 7-8 shows some metacharacters and built-in rules relevant to backtracking.

The :ratchet modifier, which is implied by regexes declared with the token or rule keyword, disables backtracking in the subrule, which is the same as adding a : after every atom.

The Match Object

A regex match produces a Match object, which contains all information about the match, including start and end position, matched string, and all captures.

The match object is returned from a regex match, and is also stored in the special variable $/.

   my $match = 'Zaphod Beeblebrox' ~~ m/\w+/;   
   say $match;    # prints Zaphod

In string context it evaluates to the text of the matched part of the string.
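
A few of those properties can be inspected directly. This is a minimal sketch assuming a current Rakudo compiler, where the start and end positions are available through the from and to methods:

```raku
my $match = 'Zaphod Beeblebrox' ~~ m/ \w+ /;

say ~$match;        # Zaphod  (string context)
say $match.from;    # 0       (start position)
say $match.to;      # 6       (position just past the match)
```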

Table summarises the properties of the match object.

The variables $0, $1, $2 etc. are aliases to $/[0], $/[1], $/[2], and $<name> is an alias to $/<name>. Likewise an empty @() is the same as @($/), and %() stands for %($/).

Match variables can also store a different scalar object. A closure in a regex can store such an object by calling make, and the object can then be accessed by forcing scalar context with $( $/ ):

   regex herd :i :s {
         (\d+)
         (\w+)s?
         {
            make Herd.new(
                  animal => $1.capitalize,
                  count  => $0,
                 );
         }
   }
   'Yesterday we saw 4 mooses' ~~ m/ <herd> /;
   # now $($<herd>) contains the new Herd object

This can be used to build object trees directly from regex matches.
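
A stripped-down version of the same idea, assuming a current Rakudo compiler: there the stored object is read back with the made method, which is an assumption relative to the $( $/ ) form described above. The token name doubled is hypothetical.

```raku
# The closure stores a computed value instead of the matched text.
my token doubled { (\d+) { make $0 * 2 } }

if '21' ~~ / <doubled> / {
    say ~$<doubled>;        # 21  (the matched text)
    say $<doubled>.made;    # 42  (the object stored by make)
}
```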

Capture variables are always match objects, and contain the information of their respective sub matches.

   m/ ( a ( geek ) ( passes ) )  ( many tests ) /
      |   |      | |        | |  |            |
      |   $/[0][0] $/[0][1]-+ |  |            |
      |                       |  |            |
      $/[0]-------------------+  $/[1] -------+

If a capturing group is quantified, it automatically becomes an array of match objects. Subsequent matches are not renumbered:

   '12 45 books' ~~ m:s/ ( \d+ )+ (\w+) /;
   say $0[0];     # 12
   say $0[1];     # 45
   say $1;        # books

When a subrule is called with the <subrule> syntax, it produces a named capture called subrule. That name can be changed with the <newname=subrule> syntax.

   token identifier { \w+ }
   token number     { \d+ }
   $_ = '24 hours';
   if m:s/<number> <unit=identifier> / {
      say "Number: $<number>. Unit: $<unit>";
   }

These variables are also available in the regex itself:

  "Zaphod saw Zaphod" ~~ m:s/ <name> \w+ $/<name> /;