This file was automatically generated from http://svn.pugscode.org/pugs/docs/tutorial/ch07_grammars.pod on Sat Aug 1 14:01:20 2009 GMT, revision 27701.
=head0 Grammars and Regexes
TODO: This chapter is outdated in some ways
* It should be explained when we use "rule" and when "regex", and what
a "subrule" is.
* The interpolation rules are outdated
* some of the assertion syntax has changed, for example <foo()> means
something different now
* Modifiers: explain :ratchet modifier
* The match object needs more explanation
Perl 6 "regular expressions" are so far beyond the formal definition of regular expressions that we don't use that name anymore, but simply stick to the abbreviation Regex. Perl 6 regexes bring the full power of recursive descent parsing to the core of Perl, but are comfortably useful even if you don't know anything about recursive descent parsing. In the usual case, all you'll ever need to know is that regexes are patterns for matching text.
Regexes are a language within a language, with their own syntax and conventions. At the highest level, though, they're just another set of Perl constructs. So the first thing to learn about regexes is the Perl "glue" code for creating and using them.
The simplest way to create and use a regex is an immediate match. A regex
defined with the m// operator always immediately matches. Substitutions,
defined with the s/// operator, also immediately match. A regex defined
with the // operator immediately matches when it's in void, boolean,
string, or numeric context, or when it's the argument of the smart-match
operator (~~).
if $string ~~ m/ \w+ / {...}
if $string ~~ s/ \w+ /word/ {...}
if $string ~~ / \w+ / {...}
You can substitute other delimiters, like #...#, [...], and
{...} for the standard /.../, though ?...? and (...) are
not valid delimiters:
$string ~~ s/\w+/word/
$string ~~ s[\w+][word] # The same
$string ~~ s{\w+}{word} # The same
$string ~~ s#\w+#word# # The same
$string ~~ s(\w+)(word) # Wrong!
$string ~~ s?\w+?word? # Wrong!
Modifiers now come in front using adverb syntax, so a global substitution that replaces every match in the string is written:
$string ~~ s:g/\w+/word/
Also, if you use brackets on the first part of a substitution, the second part can be specified as a pseudoassignment:
$string ~~ s[\w+] = 'word';
This form also allows assignment operators, so if you want to add one to every number within a string, you can say:
$string ~~ s:g[\d+] += 1;
If you want to do some processing on the match, you can call a function to prepare the replacement text too:
$string ~~ s:g[\d+] = build_replacement()
Sometimes you want a little more flexibility than an immediate match.
The rx// operator defines an
anonymous regex that can be executed later.
$digits = rx/\d+/;
The simple // operator also defines an anonymous regex in all
contexts other than void, boolean, string, or numeric, or as an
argument of ~~.
$digits = /\d+/; # store regex
You can use the unary context forcing operators, +, ?, and ~,
to force the // operator to match immediately in a context where it
ordinarily wouldn't. For a boolean value of success or failure, force
boolean context with ?//. For a count of matches, force numeric
context with +//. For the matched string value, force string
context with ~//.
$truth = ?/\d+/; # match $_ and return success
$count = +/(\d+\s+)*/; # match $_ and return count
$string = ~/^\w+/; # match $_ and return string
Another option for deferred matches is a regex block. The regex
keyword defines a named or anonymous regex, in much the same way that
sub declares a subroutine or method declares a method. But the
code within the block of a regex is regex syntax, not Perl syntax.
$digits = regex {\d+};
regex digits {\d+}
There are two more keywords that define regexes similarly to regex, but
imply slightly different behavior. token introduces a regex that does
not backtrack (more details on that below; for now it's enough to know
that it matches simple regexes faster), and rule is the same as token
except that whitespace in the regex also matches optional whitespace in
the string.
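The difference between the three keywords can be sketched as follows. This is a minimal illustration that assumes a current Rakudo implementation; the names backtracking, ratcheting, spaced, and unspaced are made up for the example and do not come from this chapter.

```raku
# Hypothetical names; assumes a current Rakudo.
my regex backtracking { \w+ d }     # \w+ may give characters back so 'd' can match
my token ratcheting   { \w+ d }     # \w+ keeps everything it consumed
my rule  spaced       { foo bar }   # pattern whitespace matches string whitespace
my token unspaced     { foo bar }   # pattern whitespace is ignored

my $regex-ok    = so 'abcd' ~~ / ^ <backtracking> $ /;
my $token-ok    = so 'abcd' ~~ / ^ <ratcheting> $ /;
my $rule-ok     = so 'foo   bar' ~~ / ^ <spaced> $ /;
my $unspaced-ok = so 'foo   bar' ~~ / ^ <unspaced> $ /;

say $regex-ok;      # True
say $token-ok;      # False
say $rule-ok;       # True
say $unspaced-ok;   # False
```

The token fails on 'abcd' because \w+ swallows the final d and is not allowed to release it.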
To match a named or anonymous regex, call it as a subregex within
another regex. Subregexes, whether they're named regexes or a variable
containing an anonymous regex, are enclosed in assertion delimiters
<...>. You can read more about assertions in
"Assertions" later in this chapter.
$string ~~ /\d+/;
# same as
$string ~~ /<$digits>/;
$string ~~ /<digits>/;
Table 7-1 summarizes the basic Perl syntax for defining rules.
A grammar is a collection of regexes, in much the same way that a class is
a collection of methods. In fact, grammars are classes; they're just
classes that inherit from the universal base class Regex.
This means that grammars can inherit from other grammars, and that they
define a namespace for their regexes.
grammar Hitchhikers {
token name { Zaphod | Ford | Arthur }
token id { \d+ }
...
}
Any regex in the current grammar or in one of its parents can be called directly, but a regex from an external grammar needs to have its package specified:
if $newsrelease ~~ / <Hitchhikers.name> / {
send_alert($1);
}
If you want to match against the entire grammar, you can define a regex
TOP in that grammar.
grammar Hitchhikers {
regex TOP { <name> <id> }
...
}
$roster ~~ Hitchhikers; # Calls Hitchhikers.TOP by default
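For comparison, here is a small self-contained sketch of matching against a whole grammar. It assumes a current Rakudo, where the .parse method is the usual entry point and starts at TOP; that method, the \s+ separator, and the sample roster string are assumptions of this example rather than something the chapter specifies.

```raku
# Assumes a current Rakudo; .parse is that implementation's entry point.
grammar Hitchhikers {
    token TOP  { <name> \s+ <id> }
    token name { Zaphod | Ford | Arthur }
    token id   { \d+ }
}

my $roster = 'Ford 42';                    # made-up sample input
my $match  = Hitchhikers.parse($roster);   # starts at TOP by default

my $name = ~$match<name>;
my $id   = ~$match<id>;
say $name;   # Ford
say $id;     # 42
```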
Grammars are especially useful for complex text or data parsing. In fact,
overloading the rules of the Perl 6 grammar itself is one way to change
how a program is parsed. Instead of having to create complex custom
source filters, as was necessary in Perl 5, we can overload the rules in
the Perl6::Grammar class to change the very syntax of Perl 6 on the fly.
Every language has a set of basic components (words or parts of words) and a set of syntax rules for combining them. The "words" in regexes are literal characters (or symbols), some metacharacters (or metasymbols), and escape sequences, while the combining syntax includes other metacharacters, quantifiers, bracketing characters, and assertions.
The "word"-like metacharacters are ., ^, ^^, $, $$. The
. matches any single character, even a newline character. Actually,
Perl 6 has the notion of a Unicode level, which determines whether string
manipulation happens on the byte, codepoint or grapheme level. .
matches a character in the current level, which defaults to grapheme.
The Unicode level can be adjusted with a pragma or with modifiers.
We'll talk more about modifiers in "Modifiers" later
in this chapter. The ^ and $ metacharacters are zero-width
matches on the beginning and end of a string. They each have doubled
alternates ^^ and $$ that match at the beginning and end of
every line within a string.
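A short sketch of the string anchors versus the line anchors, assuming a current Rakudo; the sample text and variable names are invented for the example.

```raku
my $text = "one\ntwo\nthree";   # made-up three-line string

# ^ and $ anchor to the whole string, so one word cannot span it.
my $whole-string = so $text ~~ / ^ \w+ $ /;

# ^^ and $$ anchor to each line within the string.
my $one-line = so $text ~~ / ^^ two $$ /;
my @lines    = $text ~~ m:g/ ^^ \w+ $$ /;
my $joined   = @lines.map(*.Str).join(',');

# . matches any character, including a newline.
my $dot-newline = so "a\nb" ~~ / a . b /;

say $whole-string;   # False
say $one-line;       # True
say $joined;         # one,two,three
say $dot-newline;    # True
```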
The |, &, \, #, and := metacharacters are all syntax
structure elements. The | is an alternation between two options. The
& matches two patterns simultaneously (the patterns must be the same
length). The \ turns literal characters into metacharacters (the
escape sequences) or turns metacharacters into literal characters. The
# marks a comment to the end of the line. Whitespace insensitivity
(the old /x modifier) is on by default, so you can start a comment at
any point on any line in a regex. Just make sure you don't comment out
the symbol that terminates the regex. The :=
binds a hypothetical variable to
the result of a subregex or grouped pattern. Hypotheticals are covered
in "Hypothetical Variables" later in this chapter.
The metacharacters (), [], {} and <> are bracketing
pairs. The pairs always have to be balanced within the regex, unless they
are literal characters (escaped with a \). The brackets () and
[] group patterns to match as a single atom. They're often used to
capture a result, mark the boundaries of an alternation, or mark a group
of patterns with a quantifier, among other things. Parentheses () are
capturing and square brackets [] are non-capturing. The {}
brackets define a section of Perl code (a closure) within a regex. These
closures are always a successful zero-width match, unless the code
explicitly calls the fail function. The <...> brackets
mark assertions, which handle a variety of constructs including
character classes and user-defined quantifiers. Assertions are covered
in "Assertions" later in this chapter.
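The bracketing pairs described above can be sketched like this, assuming a current Rakudo. The input strings and the $runs counter are invented for the example.

```raku
# Parentheses capture, square brackets only group.
my $m = 'ab12' ~~ / (\D+) [\d+] /;
my $whole    = ~$m;
my $first    = ~$m[0];
my $captures = $m.list.elems;

say $whole;      # ab12
say $first;      # ab
say $captures;   # 1

# A closure is a zero-width match that runs Perl code when it is reached.
my $runs = 0;
my $ok = so 'abc' ~~ / a { $runs++ } b /;
say $ok;     # True
say $runs;   # 1
```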
Table 7-2 summarizes the basic set of metacharacters.
The escape sequences are literal characters acting as metacharacters,
marked with the \ escape. Some escape sequences represent single
characters that are difficult to represent literally, like \t for
tab, or \x[...] for a character specified by a hexadecimal number.
Some represent limited character classes, like \d for digits or \w
for word characters. Some represent zero-width positions in a match,
like \b for a word boundary. With all the escape sequences that use
brackets, (), {}, and <> work in place of [].
Note that since an ordinary variable now interpolates as a literal
string by default, the \Q escape is rarely needed. An interpolated
array is interpreted as an alternation of all array elements.
Table 7-3 shows the escape sequences for regexes.
Quantifiers specify the number of times an atom (a single character, metacharacter, escape sequence, grouped pattern, assertion, etc.) will match.
The numeric quantifiers use the ** operator followed by the number of
desired matches. For a range of matches you can use a closure that returns
a range: a**{2..4} matches two to four a's, and a**{2..Inf} matches two
or more a's.
Each quantifier has a minimal alternate form, marked with a trailing
?, that matches the shortest possible sequence first.
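A minimal sketch of greedy versus minimal matching, assuming a current Rakudo. Note that the range is written here as a literal after ** (a ** 2..3), which is the spelling that implementation accepts; treat that spelling as an assumption of this example.

```raku
my $text = 'aaaa';

my $greedy  = ~($text ~~ / a+ /);         # takes as many as possible
my $minimal = ~($text ~~ / a+? /);        # takes as few as possible
my $ranged  = ~($text ~~ / a ** 2..3 /);  # two to three, greedy within the range

say $greedy;    # aaaa
say $minimal;   # a
say $ranged;    # aaa
```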
Table 7-4 shows the built-in quantifiers.
In general, an assertion simply states that some condition or state is true and the match fails when that assertion is false. Many different constructs with many different purposes use assertion syntax.
Assertions match named and anonymous regexes, arrays or hashes containing
anonymous regexes, and subroutines or closures that return anonymous
regexes. You have to enclose a variable in assertion delimiters to get it
to interpolate as an anonymous rule or rules. A bare scalar in a pattern
interpolates as a literal string, while a scalar variable in assertion
brackets interpolates as an anonymous rule. A bare array in a pattern
matches as a series of alternate literal strings, while an array in
assertion brackets interpolates as a series of alternate anonymous
rules. In the simplest case, a bare hash in a pattern matches a word
(\w+) and tries to find that word as one of its keys, while a hash in
assertion brackets does the same, but then also matches the associated
value as an anonymous rule.
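The scalar and array cases can be sketched as follows, assuming a current Rakudo. The variable names and sample strings are invented for the example, and hash interpolation is left out because this sketch makes no assumption about its support.

```raku
my $pattern = 'a.c';   # made-up pattern text

# A bare scalar interpolates as a literal string.
my $literal-abc = so 'abc' ~~ / ^ $pattern $ /;
my $literal-dot = so 'a.c' ~~ / ^ $pattern $ /;

# In assertion brackets the same text is treated as a regex, so . is a metacharacter.
my $as-regex = so 'abc' ~~ / ^ <$pattern> $ /;

# A bare array matches as an alternation of literal strings.
my @names = <Zaphod Ford Arthur>;
my $found = ~('Hello Ford' ~~ / @names /);

say $literal-abc;   # False
say $literal-dot;   # True
say $as-regex;      # True
say $found;         # Ford
```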
A bare closure in a pattern always matches (unless it calls fail),
but a closure in assertion brackets <{...}> must return an
anonymous rule, which is immediately matched.
An assertion with parentheses <(...)> is similar to a bare
closure in a pattern in that it allows you to include straight Perl code
within a rule. The difference is that <(...)> evaluates the
return value of the closure in boolean context. The match succeeds if
the return value is true and fails if the return value is false.
Assertions match character classes, both named and enumerated. A named
rule character class is often more accurate than an enumerated character
class. For example, <[a-zA-Z]> is commonly used to match
alphabetic characters, but generally what's really needed is the
built-in rule <alpha> which matches the full set of Unicode
alphabetic characters.
Table 7-5 shows the syntax for assertions.
Modifiers alter the meaning of the pattern syntax. The standard
position for modifiers is at the beginning of the rule, right after
the m, s, or rx, or after the name in a named rule. Modifiers
cannot attach to the outside of a bare /.../. For example:
m:i /marvin/ # case insensitive
rule names :i { marvin | ford | arthur }
Multiple modifiers can be chained, short and long names can be mixed:
m:s :i :g/ zaphod /
m:sigspace :i :global / zaphod /
Modifiers can be negated with the :!pair notation, so :!i forces
case-sensitive matching.
Most of the modifiers can also go inside the rule, attached to the rule delimiters or to grouping delimiters. Internal modifiers are lexically scoped to their enclosing delimiters, so you get a temporary alteration of the pattern:
m/:s I saw [:i zaphod] / # only 'zaphod' is case insensitive
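A small sketch of external versus lexically scoped modifiers, assuming a current Rakudo. The quoted literal 'I saw ' is used instead of :s only to keep the example focused on :i; the strings are invented.

```raku
# :i on the whole match
my $whole = so 'MARVIN' ~~ m:i/ marvin /;

# :i scoped to one group: only 'zaphod' is case insensitive
my $scoped-hit  = so 'I saw ZAPHOD' ~~ m/ 'I saw ' [:i zaphod] /;
my $scoped-miss = so 'i saw zaphod' ~~ m/ 'I saw ' [:i zaphod] /;

say $whole;         # True
say $scoped-hit;    # True
say $scoped-miss;   # False
```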
The repetition modifiers (:x, :th,
:global, and :exhaustive) and the continue modifier (:cont)
can't be lexically scoped, because they alter the return value of the
entire rule.
The :x modifier matches the rule a counted number of times. If
the modifier expects more matches than the string has, the match fails.
It has an alternate form :x() that can take a variable in place
of the number.
The :global modifier matches as many times as possible. The
:exhaustive modifier also matches as many times as possible, but in
as many different ways as possible.
The :th modifier preserves one result from a particular counted
match. If the rule matches fewer times than the modifier expects, the
match fails. It has several alternate forms. One form--:th()--can
take a variable in place of the number. The other forms--:st,
:nd, and :rd--are for cases where it's more natural to
write :1st, :2nd, :3rd than it is to write :1th, :2th,
:3th. Either way is valid, so pick the one that's most comfortable
for you.
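The repetition modifiers can be tried out with a sketch like the following, which assumes a current Rakudo. The sample string and variable names are invented for the example.

```raku
my $text = 'a1b2c3';   # made-up input with three digits

# :g collects every match
my @all    = $text ~~ m:g/ \d /;
my $joined = @all.map(*.Str).join(',');

# :2nd keeps only the second match
my $second = ~($text ~~ m:2nd/ \d /);

# :x(n) succeeds only if the pattern matches n times
my $two-ok  = so $text ~~ m:x(2)/ \d /;
my $five-ok = so $text ~~ m:x(5)/ \d /;

say $joined;    # 1,2,3
say $second;    # 2
say $two-ok;    # True
say $five-ok;   # False
```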
By default, rules ignore literal whitespace within the pattern. The
:s or :sigspace modifier makes rules sensitive to literal whitespace,
but in an intelligent way. Any cluster of literal whitespace acts like an
explicit \s+ when it separates two identifiers and \s* everywhere else.
More specifically, any literal whitespace in the regex is translated to
an implicit call to <.ws>, where the ws rule matches as
mentioned above, but can also be overridden by the user.
There are no modifiers to alter whether the matched string is treated as a single line or multiple lines. That's why the "beginning of string" and "end of string" metasymbols have "beginning of line" and "end of line" counterparts.
Table 7-6 shows the current list of modifiers.
Special modifiers are available for substitutions that do not make sense on normal matches.
The :samecase, or short :ii modifier implies the :ignorecase
modifier, but also carries the case information over on a
character-by-character basis:
my $s = 'The Quick Brown Fox';
$s ~~ s:ii/brown/blue/;
say $s; # The Quick Blue Fox
If the :sigspace modifier is also present, a slightly more
intelligent algorithm is used. If the source string follows one of the
case patterns in $table (XXX: make that a proper cross-link),
that pattern is recognized and applied onto the
substitution string.
$_ = 'All Words Capitalized';
s:s:ii/.*/other words/;
.say; # Other Words
There's a shortcut for s:s named ss, so you could have written the
example above as ss:ii/.*/other words/.
A similar modifier is :sameaccent (short :aa). Instead of carrying
case information, it carries accent and marking information.
my $stuff = 'Möhre';
$stuff ~~ s:aa/o/a/;
say $stuff; # Mähre
The third substitution modifier is :samespace, short :ss. It preserves
whitespace that is matched by implicit <.ws> rules:
my $s = "Some white\t\n spaces";
$s ~~ s:ss/\w+ \w+ \w+/Completely different text/;
# $s is now "Completely different\t\n text"
A number of named rules are provided by default, including a complete set of POSIX-style classes, and Unicode property classes. The list isn't fully defined yet, but Table 7-7 shows a few you're likely to see.
The <null> rule matches a zero-width string (so it's always
true) and <prior> matches whatever the most recent successful
rule matched. These replace the two behaviors of
the Perl 5 null pattern //,
which is no longer valid syntax for rules.
Backtracking is triggered whenever part of the pattern fails to match.
You can also explicitly trigger backtracking by calling the fail
function within a closure. Table 7-8 shows some
metacharacters and built-in rules relevant to backtracking.
The :ratchet modifier, which is implied by regexes declared with the
token or rule keyword, disables backtracking in the subrule, which
is the same as adding a : after every atom.
A regex match produces a Match object, which contains all information about the match, including start and end position, matched string, and all captures.
The match object is returned from a regex match, and is also stored in
the special variable $/.
my $match = 'Zaphod Beeblebrox' ~~ m/\w+/; say $match; # prints Zaphod
In string context it evaluates to the text of the matched part of the string.
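A brief sketch of inspecting a match object, assuming a current Rakudo, where the .from and .to methods report the start and end positions. Those method names are an assumption about that implementation rather than something this chapter states.

```raku
my $match = 'Zaphod Beeblebrox' ~~ m/ \w+ /;

my $text = ~$match;       # matched string
my $from = $match.from;   # start position
my $to   = $match.to;     # position just past the end
my $same = ~$/;           # $/ holds the same match

say $text;   # Zaphod
say $from;   # 0
say $to;     # 6
say $same;   # Zaphod
```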
Table summarises the properties of the match object.
The variables $0, $1, $2 etc. are aliases to $/[0],
$/[1], $/[2], and $<name> is an alias to
$/<name>. Likewise an empty @() is the same as @($/),
and %() stands for %($/).
Match variables can also store a different scalar object. A closure in a
regex can store such an object by calling make, and the object can then
be accessed by forcing scalar context with $( $/ ):
regex herd :i :s {
(\d+)
(\w+)s?
{
make Herd.new(
animal => $1.capitalize,
count => $0,
);
}
}
'Yesterday we saw 4 mooses' ~~ m/ <herd> /;
# now $($<herd>) contains the new Herd object
This can be used to build object trees directly from regex matches.
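As a self-contained variation on the herd example, the sketch below assumes a current Rakudo, where the stored object is read back with the .made method. The Herd class, the minimal \w+? with a word-boundary check, and the use of .made are all assumptions of this example, not the chapter's own code.

```raku
# Hypothetical class for the example.
class Herd { has $.animal; has $.count }

my regex herd {
    (\d+) \s+ (\w+?) s? »      # number, then a word without its plural s
    { make Herd.new(animal => ~$1, count => +$0) }
}

'Yesterday we saw 4 mooses' ~~ / <herd> /;
my $herd = $<herd>.made;       # the object stored by make

say $herd.animal;   # moose
say $herd.count;    # 4
```

A regex is used rather than a token here so that the minimal \w+? can grow until the word boundary is reached.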
Capture variables are always match objects, and contain the information of their respective submatches.
m/ ( a ( geek ) ( passes ) ) ( many tests ) /
   |   |      | |        | | |            |
   |   $/[0][0] $/[0][1]-+ | |            |
   |                       | |            |
   $/[0]-------------------+ $/[1] -------+
If a capturing group is quantified, it automatically becomes an array of match objects. Subsequent matches are not renumbered:
'12 45 books' ~~ m:s/ ( \d+ )+ (\w+) /;
say $0[0]; # 12
say $0[1]; # 45
say $1; # books
When a subrule is called with the <subrule> syntax, it
produces a named capture of name subrule. That name can be
changed with the <newname=subrule> syntax.
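A variation that can be run on its own, assuming a current Rakudo. It uses explicit \s+ inside a non-capturing group instead of :s, so that the captured numbers contain no whitespace; that rewrite and the variable names are choices made for this sketch.

```raku
my $m = '12 45 books' ~~ / [ (\d+) \s+ ]+ (\w+) /;

# The quantified capture becomes an array; the next capture is still number 1.
my $first  = ~$m[0][0];
my $second = ~$m[0][1];
my $word   = ~$m[1];

say $first;    # 12
say $second;   # 45
say $word;     # books
```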
token identifier { \w+ }
token number { \d+ }
$_ = '24 hours';
if m:s/<number> <unit=identifier> / {
say "Number: $<number>. Unit: $<unit>";
}
These variables are also available in the regex itself:
"Zaphod saw Zaphod" ~~ m:s/ <name> \w+ $/<name> /;