ECMA-262 Core: JavaScript Regular Expressions







Introduction And Review

The topic of regular expressions is hard to master. The syntax is cryptic, the rules are obscure and the topics are not always intuitive. My advice to those who use regular expressions in their code: test, test and test some more.





Presentation Road Map

The topics covering regular expressions are presented from three vantage points: topical definitions, a code illustration, and a detailed discussion. Links to these areas are provided in the following table. The links to the code illustrations and definitions are on separate pages.

The topics have been organized and arranged from general (simple) to specific (complex).


Road Map to Regular Expression Topics
Definitions Discussion Illustration
N/A RegExp Object Create a Simple RegExp Object
Patterns Creating RegExp Objects
Flags Flags Perform a Global Search (The Global Flag)
Meta Characters Meta Characters N/A
Literal Characters Literal Characters N/A
Escape Sequence Escape Sequence Codes N/A
Character Class Character Classes A Simple Character Class
Negation Class A Negation Class
Range Class A Range Class
Predefined Class The Predefined Digits Class
The Predefined Non-Whitespace Class
The Predefined Word Class
Wildcard The Wildcard
Repetition Quantifier Repetition Quantifiers N/A
Greedy Quantifier Greedy Quantifier
Reluctant Quantifier Reluctant Quantifier
Possessive Quantifier Possessive Quantifier
Grouping Complex Patterns Grouping Patterns
Backreference Backreference Patterns
Non-Capturing Groups Non-Capturing Groups Patterns
Alternation Alternation Patterns
Anchors Anchors Anchor Patterns
Assertions Assertions Patterns
Boundaries Boundary Patterns




Discussion



The RegExp Object

Many languages, such as Perl, provide regular expression capabilities. JavaScript provides nearly all of the regular expression functionality that Perl provides. However, the implementations differ greatly. Perl provides special operators to facilitate the many operations related to regular expression processing. JavaScript provides regular expression processing via methods of two object types: RegExp and String. Note that the syntax for building regular expression patterns is nearly the same in Perl and JavaScript.
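
As a rough sketch of the method-based approach described above, the following exercises the standard RegExp methods (test, exec) and the String methods that accept a regular expression (search, match, replace, split). The variable names and the sample sentence are illustrative assumptions, not taken from the original illustration pages.

```javascript
// Methods of the RegExp object
var reDog = /dog/;
var target = "The dog chased another dog.";

var found = reDog.test(target);    // true when the pattern matches anywhere
var result = reDog.exec(target);   // match details, or null when nothing matches
var firstIndex = result.index;     // position of the first match (4)

// Methods of the String object that accept a regular expression
var position = target.search(reDog);           // 4, or -1 when there is no match
var matches = target.match(/dog/g);            // ["dog", "dog"]
var replaced = target.replace(/dog/g, "cat");  // "The cat chased another cat."
var words = "a, b,c".split(/\s*,\s*/);         // ["a", "b", "c"]
```

The same pattern object can be handed to either family of methods; which one to use mostly depends on whether the code starts from the pattern or from the string.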

Conceptually, we can say a RegExp object is made up of:

All links to the RegExp Object methods and properties are off page. Likewise, the links to the String Object methods that deal with regular expressions are:





Creating Regular Expressions Patterns (Two Approaches)

JavaScript provides two approaches for defining regular expression patterns.

Using the RegExp constructor, the regexp pattern is specified with a string as the first argument of the constructor method. The newly created regexp object instance reflects this regexp pattern.

var reDog = new RegExp("dog");

Regular expressions can also be defined with a special literal syntax unique to regular expressions. This syntax is recognizable by two enclosing forward slash characters (/).

var reDog = /dog/;

Just as string literals are recognized by quotes, regular expression literals are recognized by the forward slash characters. The above line of code will create a new regexp object and assign it to the variable "reDog".

Both approaches yield the same result: a new object named "reDog" that contains a regexp pattern of "dog". This pattern can ultimately be used with one of the RegExp methods (and String methods) for matching against a target string.
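
The equivalence of the two approaches can be checked with a short sketch. The variable names are illustrative. One practical difference worth noting: because the constructor takes a string, any backslash in the pattern has to be doubled so that it survives string escaping.

```javascript
var fromConstructor = new RegExp("dog");
var fromLiteral = /dog/;

// Both objects carry the same pattern text.
var samePattern = fromConstructor.source === fromLiteral.source;   // true

// Both match the same target strings.
var bothMatch = fromConstructor.test("hotdog") && fromLiteral.test("hotdog");   // true

// In a string, the backslash of \d must be written as \\d.
var digitsFromString = new RegExp("\\d+");
var digitsFromLiteral = /\d+/;
var sameDigits = digitsFromString.source === digitsFromLiteral.source;   // true
```

The constructor form is the usual choice when the pattern text is only known at run time; the literal form is shorter when the pattern is fixed.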





Flags (Modifiers)

JavaScript regular expression flags are called "modifiers" in the Perl language. Flags hold a special position in regular expression grammar. Flags are not part of the regexp pattern; however, flags govern how the given pattern is applied to the target string. Flags are optional. With the RegExp constructor, flags are passed as a string in the second argument; with a regexp literal, they follow the closing forward slash. The flags are:

Regular Expression Flags
Flag Character Usage Example RegExp
Symbol Character Name
occurs = occurrences
examples are expressed as regular expression literals
i lowercase i usage: case insensitivity
example: match a single occurs of "bird" regardless of case
/bird/i;
g lowercase g usage: cause a global search
example: match all occurs of bird
/bird/g;
m lowercase m usage: perform pattern matching in multiline mode
example: multiline mode is used with the beginning and ending line anchors
/^bird$/m;
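
The effect of each flag in the table can be seen in a small sketch. The sample text and variable names are assumptions made for illustration.

```javascript
var text = "Bird\nbird\nBIRD song";

// No flag: case sensitive, first match only.
var firstOnly = text.match(/bird/);            // matches the lowercase "bird" at index 5

// i flag: case insensitivity.
var insensitive = /bird/i.test("BIRD");        // true

// g flag (combined with i): every occurrence.
var all = text.match(/bird/gi);                // ["Bird", "bird", "BIRD"]

// m flag: ^ and $ also match at the start and end of each line.
var wholeLines = text.match(/^bird$/gim);      // ["Bird", "bird"]
var withoutM = /^bird$/i.test(text);           // false: the string as a whole is not just "bird"

// The constructor takes the flags as a second string argument.
var fromConstructor = new RegExp("bird", "gi");
var sameCount = text.match(fromConstructor).length === all.length;   // true
```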




Meta Characters

The following table contains punctuation characters that have intrinsic meaning in a regular expression pattern. These are meta characters and control the processing of the regular expression.

Regular Expression Control (Meta) Characters
Meta Character Usage Example
Symbol Operation
occurs = occurrences
examples are expressed as regular expression literals
[ ] simple class specifies a simple character class
example: match any one of the letters x, y or z
/[xyz]/
^ negation class the caret specifies a negated character class
example: match any character other than x, y or z
/[^xyz]/
- range the hyphen serves as a range designator
example: match any lowercase letter of the alphabet
/[a-z]/
. wildcard the period serves as a wildcard
a wildcard is a predefined character class
example: matches nearly all literal characters
/./
\w, \W,
\d, \D,
\s, \S
predefined class these tokens represent additional
predefined character classes
example: matches any character that is a digit
/[\d]/
{n,m} quantifier curly braces signify a quantifier
where n = minimum
where m = maximum
example: match between 3 and 4 digits
/\d{3,4}/
? quantifier question mark is same as {0,1}
example: matches zero or one occurs of a digit
/\d?/
+ quantifier the plus sign is same as {1,}
example: matches one or more digit(s)
/\d+/
* quantifier the asterisk is same as {0,}
example: matches zero or more digit(s)
/\d*/
( ) group specifies a grouping
example: match either ay or (ar or ap)
/((ar|ap)|ay)/
?: non-capturing this group will not be remembered
a non-capturing group
example: match one or more occurs of xyz
a backreference cannot be used here
/(?:xyz)+/;
\n backreference the token \n references a former group
the "n" of \n is the reference number of the group
example: \1 references the group xyz
/(xyz)\1/;
| alternation the pipe specifies alternatives
example: matches mar or apr or may
/(mar|apr|may)/
^ anchor the caret will anchor the pattern to the beginning of the string
not to be confused with caret of negative character class
example: anchors pattern to beginning of string
/^JavaScript/;
$ anchor the dollar sign will anchor the pattern to the end of the string
example: anchors pattern to end of string
/JavaScript$/;
\b boundary match a word boundary
example: anchors pattern to word boundary
/\bJava\b/;
\B boundary match a position not a word boundary
example: anchors pattern to non-word boundary
/\Bscript/;
(?=p) assertion a positive look ahead assertion
where p = pattern
example: match if script follows Java
/(Java(?=script))/;
(?!p) assertion a negative look ahead assertion
where p = pattern
example: match if script does not follow Java
/(Java(?!script))/;




Literal Characters

Alphanumeric characters can all be represented in a regular expression. When matched against a string, each alphanumeric character matches itself (a literal match). Characters other than the alphanumerics can also be matched literally. Strings may contain ASCII control characters that must be matched via escape sequences.

Regular Expression Literal Characters
Character Description Unicode ASCII (Hex) ASCII (Dec)
Alphanumeric 0-9 , A-Z , a-z \uxxxx nn nn
\0 the null character \u0000 00 0
\t the horizontal tab character \u0009 09 9
\n the newline character \u000A 0A 10
\v the vertical tab character \u000B 0B 11
\f the form feed character \u000C 0C 12
\r the carriage return character \u000D 0D 13
\xnn escaped hex representation of the ASCII character with hex code nn n/a n/a n/a
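
A brief sketch of the escape forms listed in the table, using illustrative strings and variable names. The named escape, the \xnn form and the \uxxxx form all refer to the same character.

```javascript
var tabbed = "name\tvalue";

// Three ways to write the horizontal tab in a pattern.
var hasTab = /\t/.test(tabbed);              // true
var hasTabHex = /\x09/.test(tabbed);         // true
var hasTabUnicode = /\u0009/.test(tabbed);   // true

// Alphanumerics can also be written by code: hex 41 is "A".
var letterA = /\x41/.test("A") && /\u0041/.test("A");   // true

// Control characters are often matched in order to split text into lines.
var lines = "one\r\ntwo\nthree".split(/\r\n|\r|\n/);    // ["one", "two", "three"]
```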




Escape Sequence Codes

We saw in the above table how some common control characters must be escaped to be used in a regular expression pattern and ultimately matched against a string with the same character content. But what about the meta characters? What if the target string contains characters that we must match but that are part of the regular expression meta character set? They must be escaped! For example, what if we need to match the URI string "http://www.w3c.org"? The pattern must escape the period, since it is a regexp meta character, and the forward slash, since it delimits the regexp literal.

var repattern = /http:\/\/www\.w3c\.org/;

Although escape sequences for strings and those for regular expression patterns are conceptually the same idea, the character set for each differs. Take this link to review string escape sequencing where we explain escape sequencing more thoroughly.
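
The following sketch shows why the escapes matter: an unescaped period is a wildcard and accepts far more than intended. The helper escapeRegExp is a hypothetical convenience function written for this illustration (it is not a built-in method assumed by this page); it backslash-escapes the meta characters so that arbitrary text can be handed to the RegExp constructor.

```javascript
var reUri = /http:\/\/www\.w3c\.org/;
var exact = reUri.test("http://www.w3c.org");        // true

// Without the escapes, each period matches any character.
var loose = /www.w3c.org/.test("wwwXw3cYorg");       // true (probably not intended)
var strict = /www\.w3c\.org/.test("wwwXw3cYorg");    // false

// Hypothetical helper: prefix every meta character with a backslash.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// The forward slash needs no escape here because no literal delimiters are involved.
var built = new RegExp(escapeRegExp("http://www.w3c.org"));
var builtExact = built.test("http://www.w3c.org");   // true
var builtLoose = built.test("http://wwwXw3c.org");   // false
```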





Character Classes

The simple character class is designated with enclosing square brackets. The actual character class for a given regexp is the set of literal characters enclosed inside the square brackets. Each character of the character class is individually matched to the target string until a match is found. Complex character classes (negation and range classes) require additional meta characters within the square brackets. Compound character classes contain combinations of complex classes. The predefined character classes represent common matching patterns that have been given a corresponding shorthand token. Here are some additional observations concerning character classes:

Regular Expression Character Classes
Examples are expressed as regular expression literals.
Predefined classes are like a shorthand; the equivalent longhand is given also.
Class Name (Class Type) Token Symbol Example
Matching Description, Example Description, Equivalent
Simple [ ] /[abc]/
matches: any one character in the class
example: matches any one of a, b or c
Negation (Complex) [^ ] /[^abc]/
matches: any one character except those within the brackets
example: matches any character except a, b and c
Range (Complex) [ - ] /[0-9]/
matches: any one character in the range
example: matches any one of 0,1,2,3,4,5,6,7,8,9
equivalent: /[0123456789]/
Word Character (Predefined) \w /[\w]/
matches: any word character
example: matches all letters (both uppercase and lowercase), all digits and the underscore
equivalent: /[a-zA-Z0-9_]/
Non-Word Character (Predefined) \W /[\W]/
matches: any non-word character
example: matches any character that is not a word character
equivalent: /[^a-zA-Z0-9_]/
Digit (Predefined) \d /[\d]/
matches: any digit
example: matches any character that is a digit
equivalent: /[0-9]/
Non-Digit (Predefined) \D /[\D]/
matches: any character except digits
example: matches any character that is not a digit
equivalent: /[^0-9]/
Whitespace (Predefined) \s /[\s]/
matches: any whitespace character
example: matches any character that is whitespace
equivalent (ASCII only): /[ \t\n\f\r\x0B]/; \s also matches Unicode whitespace such as \u00A0, \u2028 and \u2029
Non-Whitespace (Predefined) \S /[\S]/
matches: any character except whitespace
example: matches any character that is not a tab, newline, vertical tab, form feed, carriage return or space
equivalent (ASCII only): /[^ \t\n\f\r\x0B]/
Wildcard (period) (Predefined) . /./
matches: any character except the line terminators (newline, carriage return, \u2028 and \u2029)
example: acts as a wildcard and matches nearly all of the literal character set
equivalent: /[^\n\r\u2028\u2029]/
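
The classes in the table can be tried out with a few one-line checks. The target strings and variable names below are illustrative only.

```javascript
var simple = /[abc]/.test("cab");            // true: "c" is in the class
var negated = "abcd".match(/[^abc]/)[0];     // "d": first character not in the class
var range = "room 42".match(/[0-9]/g);       // ["4", "2"]

var word = "a_1 !".match(/\w/g);             // ["a", "_", "1"]
var nonWord = "a_1 !".match(/\W/g);          // [" ", "!"]
var digits = "a1b22".match(/\d/g);           // ["1", "2", "2"]
var nonDigits = "a1b22".match(/\D/g);        // ["a", "b"]

var space = /\s/.test("a b");                // true
var nonSpace = "a b".match(/\S/g);           // ["a", "b"]

// The wildcard matches everything except line terminators.
var dot = "a\nb".match(/./g);                // ["a", "b"]
```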




Repetition Quantifiers

Repetition quantifiers are meta characters that specify and control regexp matching repetition. The meta characters and their significance as quantifiers are presented here:

Repetition Quantifiers Meta Characters
Meta Characters Description
{n,m} match the previous pattern at least n times but no more than m times
{n,} match the previous pattern at least n times
{n} match the previous pattern exactly n times
? match zero or one occurs of previous pattern
equivalent: {0,1}
+ match one or more occurs of previous pattern
equivalent: {1,}
* match zero or more occurs of previous pattern
equivalent: {0,}




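The quantifiers must follow the pattern on which the repetition is based. The quantifiers as specified in the above table are intrinsically greedy. The greedy quantifier can be changed to a reluctant quantifier by placing an additional question mark after the greedy quantifier. In regular expression engines that support possessive quantifiers (Java, for example), the greedy quantifier can be changed to a possessive quantifier by placing an additional plus sign after the greedy quantifier. ECMA-262 does not define possessive quantifiers, so a JavaScript engine reports a pattern such as /a++/ as a syntax error; the possessive column below is given for comparison only.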

Repetition Quantifiers (Greedy, Reluctant, Possessive)
Greedy Reluctant Possessive Description
{n,m} {n,m}? {n,m}+ at least n times but no more than m times
{n,} {n,}? {n,}+ at least n times
{n} {n}? {n}+ exactly n times
? ?? ?+ zero or one occurs
+ +? ++ one or more occurs
* *? *+ zero or more occurs
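
The difference between greedy and reluctant quantifiers is easiest to see side by side. The sketch below uses an illustrative HTML-like string; the last lines probe, under the assumption that the engine follows ECMA-262, whether a possessive-style pattern is accepted at all (the engines I am aware of reject it with a SyntaxError).

```javascript
var html = "<b>bold</b> and <i>italic</i>";

// Greedy: takes as much as possible, then backs off only as far as needed.
var greedy = html.match(/<.+>/)[0];          // "<b>bold</b> and <i>italic</i>"

// Reluctant: takes as little as possible.
var reluctant = html.match(/<.+?>/)[0];      // "<b>"

// The same distinction with a counted quantifier.
var upToThree = "aaaaa".match(/a{2,3}/)[0];  // "aaa"
var justTwo = "aaaaa".match(/a{2,3}?/)[0];   // "aa"

// Possessive syntax: built through the constructor so the error can be caught.
var possessiveSupported = true;
try {
  new RegExp("a++");
} catch (e) {
  possessiveSupported = false;
}
```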


To better understand the repetition quantifiers, you may want to link to our illustration section where we provide an example for greedy, reluctant and possessive quantifiers. Here are the links into the illustration:





Complex Patterns

The complex pattern topics include: groupings, backreferences and alternatives.

Complex Patterns
Meta Characters Description
Example
(...)

Grouping
description: a unit (pattern) is grouped and then can be used with quantifiers, alternations, etc.; will remember match scenario for later reference
example: /(xyz)+/; //match one or more occurs of xyz
(?:...)

Grouping
description: a unit (pattern) is grouped and then can be used with quantifiers, alternatives, etc.; will not remember match scenario for later reference
example: /(?:xyz)+/; //match one or more occurs of xyz
\n

Backreference
description: backreference: match back to a group number (n) as remembered in a previous group match result
example: /(xyz)\1/; //equivalent to /xyzxyz/
|

Alternation
description: alternation: match either the left or right pattern
example: /(xyz|abc)/; //match either abc or xyz


Grouping patterns allow you to match pattern parts as a unit. The pattern parts can be literal sequences, character classes or quantifiers. Enclosing these parts inside parentheses will make for a group. Matching is no longer based on a character by character match but instead, matching is based on the group. The typical grouping pattern, (...), will remember the pattern (capturing).

/abcabc/g; //is the same as /(abc){2}/g;

Groupings make references possible.

Backreference patterns are tokens that refer back to a group. The tokens are composed of a reference number preceded by a backslash; example: \2. The regexp matching process keeps track of every group within the regexp. References are numbered in the order in which each group's opening parenthesis is encountered, going from left to right. View the expression below and how the groups are saved:

/(a+ (b? (c* (def?))))/;

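The token \2 relates to the group that opens with (b?, that is, (b? (c* (def?))). Likewise, \1 relates to the outermost group and \4 relates to (def?).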

Note that \2 above does not reference the actual regexp group itself. It references the portion of the target string that the group matched. So, the reference points to the match results: the backreference succeeds only where that same text occurs again in the target string.
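
The group numbering and the behavior of backreferences can be checked with a short sketch. The target strings and variable names are illustrative assumptions.

```javascript
// Groups are numbered by the position of their opening parenthesis.
var reGroups = /(a+ (b? (c* (def?))))/;
var m = reGroups.exec("aa b cc def");
var group1 = m[1];   // "aa b cc def"
var group2 = m[2];   // "b cc def"
var group3 = m[3];   // "cc def"
var group4 = m[4];   // "def"

// A backreference repeats the text that the group matched, not the group's pattern.
var doubled = "hello".match(/(\w)\1/)[0];       // "ll"
var repeats = /(xyz)\1/.test("xyzxyz");         // true
var differs = /(\w\w\w)\1/.test("xyzabc");      // false: "abc" is not a repeat of "xyz"

// A non-capturing group is grouped but not remembered.
var capturing = /(xyz)+/.exec("xyzxyz").length;        // 2: whole match plus one group
var nonCapturing = /(?:xyz)+/.exec("xyzxyz").length;   // 1: whole match only
```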

Alternation patterns allow you to match one of two or more patterns. The alternation meta character is the | symbol. The alternatives are tried from left to right, and the first alternative that matches is used. In this respect regexp alternation works like a short-circuiting logical OR operator rather than a bitwise OR. Alternation has the following behavior:
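
One illustration of alternation, with made-up target strings: the order of the alternatives matters when one alternative is a prefix of another, and parentheses limit how far the | reaches.

```javascript
// Any one of the alternatives may match.
var month = "due in may".match(/(mar|apr|may)/)[1];            // "may"

// Alternatives are tried left to right; the first one that matches wins.
var shortFirst = "javascript".match(/java|javascript/)[0];     // "java"
var longFirst = "javascript".match(/javascript|java/)[0];      // "javascript"

// Without a group, | splits the whole pattern; with a group, only the grouped part.
var grouped = /^gr(a|e)y$/.test("grey");      // true
var ungrouped = /^gra|ey$/.test("monkey");    // true: reads as (^gra) or (ey$)
```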





Anchors

An anchor will position a regexp pattern to a particular area of a target search string. A common anchor that positions a pattern at the beginning of a search string is the meta character "^" (caret). Likewise, the "$" will position a pattern at the end of a search string. By adding the "m" flag (multiline flag), the above string searches can be expanded to search multiple lines where the target string is of a multiline nature.

Boundary patterns are anchors that fix a pattern based on word boundaries. The "\b" token fixes the pattern on a word boundary. The "\B" token fixes the pattern on a non-word boundary. A word boundary is the position between a word character (\w) and a non-word character (\W), or between a word character and the beginning or end of the string; the neighboring character need not be whitespace. Boundary patterns are useful for matching at both the front of a string and at the end of a string. A pattern that uses the word boundary token at the front of the pattern will match the word of the pattern when the target string has the same word as the first word in the string. Likewise, the last word of a target string will be matched when the same word in the pattern is followed by the word boundary token.

Assertion patterns are confusing to say the least. When an assertion is specified as part of a pattern, it tells the interpreter to look ahead in the target string without losing the current position. It is conditional matching; the text examined by the assertion (the look ahead) is not returned as part of the match. A positive assertion will match a group of characters only if they appear immediately before other characters (the other characters being the assertion). A negative assertion will match a group of characters only if they do not appear immediately before those other characters. The parentheses in the assertion meta tokens must not be confused with grouping syntax.

Anchors
Meta Characters Description
Example
^

anchor
description: the caret will anchor the pattern to the beginning of the string; for a multiline search, it will anchor the pattern to the beginning of each line.
example: /^JavaScript/; //will cause a match when target string begins with "JavaScript"
$

anchor
description: the dollar sign will anchor the pattern to the end of the string; for a multiline search, it will anchor the pattern to the end of each line.
example: /JavaScript$/; //will cause a match when target string ends with "JavaScript"
\b

boundary
description: match a word boundary
example: /\bJava\b/; //matches a stand alone word of "Java", handles first and last word when "Java" appears at the beginning and/or end of the target string
\B

boundary
description: match a non-word boundary
example: /\Bscript/; //the word Javascript would cause a match; the stand alone word "script" would not cause a match
(?=p)

assertion
description: a positive look ahead assertion, where p = pattern
example: /(Java(?=script))/; //matches Javascript but not Javacode; returns "Java"
(?!p)

assertion
description: a negative look ahead assertion, where p = pattern
example: /(Java(?!script))/; //matches Javacode but not Javascript; returns "Java"
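
The anchor, boundary and assertion examples from the table can be exercised as follows. The target strings and variable names are illustrative assumptions.

```javascript
// Anchors
var startOk = /^JavaScript/.test("JavaScript rocks");    // true
var startNo = /^JavaScript/.test("I like JavaScript");   // false
var endOk = /JavaScript$/.test("I like JavaScript");     // true
var perLine = /^JavaScript$/m.test("one\nJavaScript\ntwo");   // true with the m flag
var wholeOnly = /^JavaScript$/.test("one\nJavaScript\ntwo");  // false without it

// Boundaries
var standAlone = /\bJava\b/.test("I like Java");    // true
var inside = /\bJava\b/.test("JavaScript");         // false
var punctuated = /\bJava\b/.test("(Java)");         // true: a boundary need not be whitespace
var nonBoundary = /\Bscript/.test("Javascript");    // true
var nonBoundaryNo = /\Bscript/.test("script");      // false

// Look ahead assertions: the asserted text is checked but not returned.
var ahead = /(Java(?=script))/.exec("Javascript")[0];      // "Java"
var aheadNo = /(Java(?=script))/.exec("Javacode");         // null
var notAhead = /(Java(?!script))/.exec("Javacode")[0];     // "Java"
var notAheadNo = /(Java(?!script))/.test("Javascript");    // false
```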



Rx4AJAX        2008 This Site Built By PPThompson