Pages

Friday, January 24, 2014

PLUG - Regular Expressions

PLUG - 1/21/2014 - David Oswald - "Regular Expressions" 
Provo Linux Users Group's (PLUG) David Oswald presents on Regular Expressions, for the January 21, 2014 meeting.  The saved YouTube video presentation is 55 minutes.  The examples are in Perl, but the concepts are universal.  Covers rxrx "RegEx Doctor", backtracking and NFA concepts.

"What is backtracking? How do character classes work? What is the "Little Engine that Could(n't)"?"






Notes


Definitions:
  • Literal Characters: abcd... ABC... 123...
  • Operators: m// (match), s/// (substitute), =~ or !~ (bind)
  • Meta-characters: \ | ( ) [ { ^ $ * + ? .
  • Meta symbols: \b \D \t \3 etc...
Example:
$string = "hello world";
say "Match" if $string =~ m/hello/;

Syntactic Shortcuts:
$_ = "hello world";
say "match" if /hello/;

DFA / NFA / Hybrid Engines - (Deterministic Finite Automata) DFAs and (Non-deterministic Finite Automata) NFAs have exactly the same capabilities and limitations. The only difference is notation convenience.
  • DFA: simpler, text-directed match, no backtracking, more limited, awk/egrep
  • NFA: regex-directed match, backtracking, more flexible semantics, grep/less/sed/more/Perl/PHP/Python
Focus for this presentation: Non-Deterministic Finite Automata (NFA) Engines

Two big NFA rules:
  1. The match that begins earliest wins.
  2. The standard quantifiers are greedy.
rxrx - Excellent RegEx debugging tool ("RegEx Doctor") :
  • rxrx - command-line REPL and wrapper for Regexp::Debugger
Quick rxrx overview:
Install:
 cpan Regexp::Debugger

Usage:
 rxrx
 perl -MRegexp::Debugger -E 'Regexp::Debugger::rxrx(@ARGV)'

To set the regex:
 /[SOMETHING]/

To set the string:
 '[SOMETHING]'

To start match engine: 'm'

To quit: 'q'

Commands:
      / : Enter a pattern
      ' : Enter a new literal string
      " : Enter a new double-quoted string
      m : Match current string against current pattern
 q or x : quit debugger and exit
      h : help

Match Commands:
       - : back step
 [enter] : forward step
       c : proceed visually quickly to the end

Backtracking - When partial match fails, the engine backtracks to try an alternative path.  Where the greatest performance hit comes from.

Example of large backtracking out of control:
# fails in 218 steps
/(a*)*[^Bb]$/
'aaaaaab'

Backtracking can be put under control by using "possessive quantifiers": (*+, ++, ?+, {n,m}+)
# fails in 79 steps
/(a*)*+[^Bb]$/

Possessive quantifiers - change the "greedy" quantifiers to "lazy" quantifiers that attempt to match the minimal amount.  Greedy quantifiers attempt to match the greatest left most match.  Lazy quantifiers attempt to match the least left most match.
/.*foo/   # Greedy ; backtrack backwards.
/.*?foo/  # Lazy; backtrack forward.

More NFA rules:
  1. Matches occur as far left as possible.
  2. Alternation has left-to-right precedence.
  3. Alternative matches if every item listed in the alternative matches sequentially.
  4. Backtracking occurs to try higher-pecking-order assertions.
  5. Quantifiers must be satisfied within their permissible range.
  6. Each atom matches according to its designated semantics.
Shorter segments are often easier:
if( /Brian/ && /John/ ) { ... }
# vs
if( /Brian.*John|John.*Brian/ ) { ... }

Progressive matching - starting where the previous pattern starts off

Extended Bracketed Character Classes - Character classes can now have set semantics (intersections, unions, subtraction, symmetric difference, complement)

When to use RegExes:

RegExes are for matching patters, not for things that have a parser, like HTML and JSON.  Should also not be used for Email Addresses.

"Regexes optimal for small HTML parsing problems, pessimal for large ones" -- Tom Christiansesn

Appropriate Alternatives:

  • Complex grammars - use parsing classes
  • Fixed-width fields - use unpack, substr
  • Comma Separated Values - use CSV libraries
  • Uncomplicated, predictable data - Regular Expressions!


3 comments: