Tuesday, May 8, 2012

RegExp Part 2

Position Matching

Position matching uses ^ and $ to anchor a pattern to the beginning or the end of a string. Setting the Pattern property to "^VBScript" successfully matches "VBScript is cool." but fails to match "I like VBScript."
Symbol  Function
^       Matches only at the beginning of a string.
        "^A" matches the first "A" in "An A+ for Anita."
$       Matches only at the end of a string.
        "t$" matches the last "t" in "A cat in the hat".
\b      Matches any word boundary.
        "ly\b" matches "ly" in "possibly tomorrow."
\B      Matches any non-word boundary.
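
The anchors above can be tried out with a short sketch. This post shows no code of its own, so the example assumes Python's re module as a stand-in for the VBScript RegExp object; for these particular patterns the syntax is the same, but the variable names are illustrative only.

```python
import re

# Assumption: Python's re module stands in for the VBScript RegExp object.

# ^ anchors the pattern to the beginning of the string.
starts = re.search(r"^VBScript", "VBScript is cool.")
no_start = re.search(r"^VBScript", "I like VBScript.")

# $ anchors the pattern to the end of the string.
sentence = "A cat in the hat"
last_t = re.search(r"t$", sentence)

# \b requires a word boundary after "ly"; \B requires that there is none.
at_boundary = re.search(r"ly\b", "possibly tomorrow.")
inside_word = re.search(r"ly\B", "slyly")

print(starts is not None, no_start is None)
print(last_t.start(), at_boundary.start(), inside_word.start())
```

The second search returns no match because "VBScript" is present but not at the start of the string.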

Literals

Literals can be alphanumeric characters, ASCII characters given as octal or hexadecimal codes, Unicode characters, or special escaped characters. Since some characters have special meanings in a regular expression, we must escape them: to match one of these characters literally, we precede it with a "\".
Symbol        Function
Alphanumeric  Matches alphabetical and numerical characters literally.
\n            Matches a new line.
\f            Matches a form feed.
\r            Matches a carriage return.
\t            Matches a horizontal tab.
\v            Matches a vertical tab.
\?            Matches ?
\*            Matches *
\+            Matches +
\.            Matches .
\|            Matches |
\{            Matches {
\}            Matches }
\\            Matches \
\[            Matches [
\]            Matches ]
\(            Matches (
\)            Matches )
\xxx          Matches the ASCII character expressed by the octal number xxx.
              "\50" matches "(" or Chr(40).
\xdd          Matches the ASCII character expressed by the hex number dd.
              "\x28" matches "(" or Chr(40).
\uxxxx        Matches the Unicode character expressed by the four hex digits xxxx.
              "\u00A3" matches "£".

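As a rough illustration of escaping, the sketch below again assumes Python's re module in place of the VBScript RegExp object. The sample strings are made up for the example. The octal form is left out on purpose, because Python reads "\50" as a group reference rather than an octal code.

```python
import re

# Assumption: Python's re module stands in for the VBScript RegExp object.
price = "Total (approx.): £3.50?"

# An escaped "." matches only a literal period.
literal_dot = re.findall(r"3\.50", price)

# Unescaped, "." matches any character, so "3x50" also matches.
any_char = re.findall(r"3.50", "3x50")

# \x28 is the hex escape for "(" and \u00A3 is the Unicode escape for "£".
paren = re.search(r"\x28", price)
pound = re.search(r"\u00A3", price)

# An escaped "?" matches the literal question mark at the end.
question = re.search(r"\?$", price)

print(literal_dot, any_char)
print(paren.group(), pound.group(), question.group())
```
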
Character Classes

Character classes define a custom set of characters by placing them within square brackets []. A negated character class is created by placing ^ as the first character inside the []. A hyphen specifies a range of characters. For example, the regular expression "[^a-zA-Z0-9]" matches any character that is not alphanumeric. In addition, some common character sets are available as a backslash followed by a letter.
Symbol  Function
[xyz]   Matches any one character enclosed in the character set.
        "[a-e]" matches "b" in "basketball".
[^xyz]  Matches any one character not enclosed in the character set.
        "[^a-e]" matches "s" in "basketball".
.       Matches any character except \n.
\w      Matches any word character. Equivalent to [a-zA-Z_0-9].
\W      Matches any non-word character. Equivalent to [^a-zA-Z_0-9].
\d      Matches any digit. Equivalent to [0-9].
\D      Matches any non-digit. Equivalent to [^0-9].
\s      Matches any space character. Equivalent to [ \t\r\n\v\f].
\S      Matches any non-space character. Equivalent to [^ \t\r\n\v\f].
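
The character classes in the table can be checked with a small sketch, assuming Python's re module as a stand-in for the VBScript RegExp object. The re.ASCII flag is passed so that \d and \w stay limited to the bracket expressions listed above; without it Python also accepts non-ASCII digits and letters.

```python
import re

# Assumption: Python's re module stands in for the VBScript RegExp object.
word = "basketball"

# First character inside the set a-e, and first character outside it.
first_in_set = re.search(r"[a-e]", word).group()
first_not_in_set = re.search(r"[^a-e]", word).group()

# A negated class with ranges: everything that is not alphanumeric.
non_alnum = re.findall(r"[^a-zA-Z0-9]", "A+ (100%)")

# Shorthand classes; re.ASCII keeps \d equal to [0-9] and \w to [a-zA-Z_0-9].
digits = re.findall(r"\d", "Room 42B", flags=re.ASCII)
word_chars = re.findall(r"\w", "a_1 !", flags=re.ASCII)

print(first_in_set, first_not_in_set)
print(non_alnum, digits, word_chars)
```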

Repetition

Repetition matching specifies how many times the preceding element, whether a character, a character class, or a group, may be repeated in a regular expression.
Symbol  Function
{x}     Matches exactly x occurrences of a regular expression.
        "\d{5}" matches 5 digits.
{x,}    Matches x or more occurrences of a regular expression.
        "\s{2,}" matches at least 2 space characters.
{x,y}   Matches x to y occurrences of a regular expression.
        "\d{2,3}" matches at least 2 but no more than 3 digits.
?       Matches zero or one occurrence. Equivalent to {0,1}.
        "a\s?b" matches "ab" or "a b".
*       Matches zero or more occurrences. Equivalent to {0,}.
+       Matches one or more occurrences. Equivalent to {1,}.
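
A quick sketch of the quantifiers, assuming Python's re module in place of the VBScript RegExp object. The sample strings are invented for the example.

```python
import re

# Assumption: Python's re module stands in for the VBScript RegExp object.

# {x}: exactly five digits.
zip_code = re.search(r"\d{5}", "Seattle, WA 98101").group()

# {x,}: runs of two or more whitespace characters collapse to one space.
collapsed = re.sub(r"\s{2,}", " ", "too    many   spaces")

# {x,y}: two to three digits at a time, so "12345" splits into "123" and "45".
two_or_three = re.findall(r"\d{2,3}", "7 42 123 12345")

# ?: the whitespace between "a" and "b" is optional, but at most one is allowed.
candidates = ["ab", "a b", "a  b"]
optional_space = [s for s in candidates if re.fullmatch(r"a\s?b", s)]

print(zip_code, collapsed)
print(two_or_three, optional_space)
```

The {2,3} case shows that a quantifier takes as many repetitions as it can, up to its upper bound, before the search resumes on the rest of the string.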

Alternation & Grouping

Alternation and grouping are used to build more complex regular expressions. Grouping treats several characters as a single clause, and alternation lets a match succeed on any one of several clauses, which offers more flexibility and control.
Symbol  Function
()      Groups a clause so that it is treated as a single unit. Groups may be nested.
        "(ab)?(c)" matches "abc" or "c".
|       Alternation combines clauses into one regular expression and matches any one of the individual clauses.
        "(ab)|(cd)|(ef)" matches "ab" or "cd" or "ef".

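The two patterns from the table can be exercised as follows. This is a sketch that assumes Python's re module as a stand-in for the VBScript RegExp object.

```python
import re

# Assumption: Python's re module stands in for the VBScript RegExp object.

# The first group is optional, so both "abc" and "c" match in full.
optional_group = re.compile(r"(ab)?(c)")
with_ab = optional_group.fullmatch("abc")
without_ab = optional_group.fullmatch("c")

# Alternation: each match is whichever of the three clauses succeeded.
alternatives = [m.group(0) for m in re.finditer(r"(ab)|(cd)|(ef)", "ab cd ef gh")]

print(with_ab.groups(), without_ab.groups())
print(alternatives)
```

When the optional group does not take part in the match, Python reports it as None.
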
BackReferences

Backreferences let the programmer refer back to the text matched by an earlier portion of the regular expression. This is done with parentheses and the backslash (\) character followed by a single digit. The first parenthesized clause is referred to by \1, the second by \2, and so on.
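Symbol  Function
()\n    Matches the text captured by the nth parenthesized clause, where n is a digit and clauses are numbered by their left parenthesis. (Here \n means a backslash followed by a digit, not the new line escape.)
        "(\w+)\s+\1" matches any word that occurs twice in a row, such as "hubba hubba."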

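The doubled-word pattern can be sketched like this, again assuming Python's re module in place of the VBScript RegExp object. The sample sentence is made up for the example.

```python
import re

# Assumption: Python's re module stands in for the VBScript RegExp object.

# \1 must repeat exactly what the first parenthesized clause captured.
doubled = re.compile(r"(\w+)\s+\1")

hit = doubled.search("That was a hubba hubba moment")
miss = doubled.search("no repeats here")

print(hit.group(0), "|", hit.group(1))
print(miss)
```

Group 0 is the whole match, and group 1 is the word that the backreference had to repeat.
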
