Parts of the Java Tutorial by Sun, Lesson Regular Expressions
http://java.sun.com/docs/books/tutorial/extra/regex/index.html
What are regular expressions?
Regular expressions are a way to describe a set of strings based on common characteristics shared by each string in the set. They can be used as a tool to search, edit or manipulate text or data. You must learn a specific syntax to create regular expressions, one that goes beyond the normal syntax of the Java programming language. Regular expressions range from simple to quite complex, but once you understand the basics of how they're constructed, you'll be able to work out what most regular expressions do.
This tutorial will teach you the regular expression syntax supported by the java.util.regex API (http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/package-summary.html), and will present plenty of working examples to illustrate how the various objects interact. In the world of regular expressions, there are many different flavors to choose from, such as grep, Perl, Tcl, Python, PHP, and awk. The regular expressions in the java.util.regex API are most similar to Perl.
The most basic form of pattern matching supported by this API is the match of a string literal. For example, if the regular expression is foo, and the input string is foo, the match will succeed because these strings are identical.

Current REGEX is: foo
Current INPUT is: foo
Finds the text "foo" starting at index 0 and ending at index 3.

This match was a success. Note that while the input string is 3 characters long, the start index is 0 and the end index is 3. By convention, ranges are inclusive of the beginning index and exclusive of the end index. The string "foo" starts at index 0 and ends at index 3, even though the characters themselves only occupy cells 0, 1, and 2.
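As a rough illustration, the output format and the index convention described above can be reproduced with a few lines of Java. This is only a sketch, not the tutorial's actual test harness: the class name LiteralMatchDemo and the helper findAll are invented for this example. It relies on Matcher.find, Matcher.group, Matcher.start and Matcher.end from java.util.regex, and assumes a Java version with generics.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LiteralMatchDemo {
    // Hypothetical helper: returns one report line per match found in the input.
    static List<String> findAll(String regex, String input) {
        List<String> lines = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            // start() is inclusive, end() is exclusive.
            lines.add("Finds the text \"" + matcher.group()
                    + "\" starting at index " + matcher.start()
                    + " and ending at index " + matcher.end() + ".");
        }
        if (lines.isEmpty()) {
            lines.add("No match found.");
        }
        return lines;
    }

    public static void main(String[] args) {
        for (String line : findAll("foo", "foo")) {
            System.out.println(line);
        }
    }
}
```

Because the end index is exclusive, input.substring(matcher.start(), matcher.end()) yields exactly the matched text.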
With subsequent matches, you'll notice some overlap; the start index for the next match is the same as the end index of the previous match:

Current REGEX is: foo
Current INPUT is: foofoofoo
Finds the text "foo" starting at index 0 and ending at index 3.
Finds the text "foo" starting at index 3 and ending at index 6.
Finds the text "foo" starting at index 6 and ending at index 9.

Metacharacters
This API also supports a number of special characters which can affect the way a pattern is matched. In your regex.txt file, change the regular expression to cat. and the input string to cats. Here's the result:

Current REGEX is: cat.
Current INPUT is: cats
Finds the text "cats" starting at index 0 and ending at index 4.

The match still succeeds, even though the period (.) is not present in the input string. It succeeds because the period is a metacharacter, that is, a character with special meaning interpreted by the matcher. The metacharacter "." means "any character", which is why the match in our example succeeds.
The metacharacters supported by this API are:
([{\^$|)?*+.
Note: In certain situations the special characters listed above will not be treated as metacharacters. You'll encounter this as you learn more about how regular expressions are constructed. You can, however, use this list to check whether or not a specific character will ever be considered a metacharacter. For example, the characters !, @ and # never carry a special meaning.
There are two ways to force a metacharacter to be treated as an ordinary character:
- precede the metacharacter with a backslash, or
- enclose it within \Q (which starts the quote) and \E (which ends it).
When using this technique, the \Q and \E can be placed at any location within the expression, provided that the \Q comes first.
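The two quoting techniques can be compared side by side in a short sketch. The class QuotingDemo and its helper found are hypothetical names used only here. Note that the backslashes are doubled because the expressions are written as Java string literals; in a file read by a test harness, a single backslash would be used.

```java
import java.util.regex.Pattern;

class QuotingDemo {
    // Hypothetical helper: true if the regex matches somewhere in the input.
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Unquoted: "." is a metacharacter and matches the "s".
        System.out.println(found("cat.", "cats"));        // true
        // Backslash: "\." matches only a literal period.
        System.out.println(found("cat\\.", "cats"));      // false
        System.out.println(found("cat\\.", "cat."));      // true
        // \Q...\E: everything in between is taken literally.
        System.out.println(found("\\Qcat.\\E", "cats"));  // false
        System.out.println(found("\\Qcat.\\E", "cat."));  // true
    }
}
```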
If you browse through the Pattern class specification (http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html), you'll see tables summarizing the supported regular expression constructs. In the "Character classes" section you'll find the following:

Character Classes
[abc]           a, b, or c (simple class)
[^abc]          Any character except a, b, or c (negation)
[a-zA-Z]        a through z, or A through Z, inclusive (range)
[a-d[m-p]]      a through d, or m through p: [a-dm-p] (union)
[a-z&&[def]]    d, e, or f (intersection)
[a-z&&[^bc]]    a through z, except for b and c: [ad-z] (subtraction)
[a-z&&[^m-p]]   a through z, and not m through p: [a-lq-z] (subtraction)

The left-hand column specifies the regular expression constructs, while the right-hand column describes the conditions under which each construct will match.
Note: The word "class" in the phrase "character class" does not refer to a .class file. In the context of regular expressions, a character class is a set of characters enclosed within square brackets. It specifies the characters that will successfully match a single character from a given input string.

Simple Classes
The most basic form of a character class is to simply place a set of characters side-by-side within square brackets. For example, the regular expression [bcr]at will match the words "bat", "cat", or "rat" because it defines a character class (accepting either "b", "c", or "r") as its first character.

Current REGEX is: [rcb]at
Current INPUT is: bat
Finds the text "bat" starting at index 0 and ending at index 3.

Current REGEX is: [rcb]at
Current INPUT is: cat
Finds the text "cat" starting at index 0 and ending at index 3.

Current REGEX is: [rcb]at
Current INPUT is: rat
Finds the text "rat" starting at index 0 and ending at index 3.

Current REGEX is: [rcb]at
Current INPUT is: hat
No match found.

In the above examples, the overall match succeeds only when the first letter matches one of the characters defined by the character class. The order of the characters inside the brackets does not matter, so [rcb]at and [bcr]at are equivalent.

Negation
To match all characters except those listed, insert the ^ metacharacter at the beginning of the character class. This technique is known as negation.

Current REGEX is: [^bcr]at
Current INPUT is: bat
No match found.

Current REGEX is: [^bcr]at
Current INPUT is: cat
No match found.

Current REGEX is: [^bcr]at
Current INPUT is: rat
No match found.

Current REGEX is: [^bcr]at
Current INPUT is: hat
Finds the text "hat" starting at index 0 and ending at index 3.

The match is successful only if the first character of the input string is not one of the characters defined by the character class.

Ranges
Sometimes you'll want to define a character class that includes a range of values, such as the letters "a through h" or the numbers "1 through 5". To specify a range, simply insert the - metacharacter between the first and last character to be matched, such as [1-5] or [a-h]. You can also place different ranges beside each other within the class to further expand the match possibilities. For example, [a-zA-Z] will match any letter of the alphabet: a to z (lowercase) or A to Z (uppercase).
Here are some examples of ranges and negation:

Current REGEX is: [a-c]
Current INPUT is: a
Finds the text "a" starting at index 0 and ending at index 1.

Current REGEX is: [a-c]
Current INPUT is: b
Finds the text "b" starting at index 0 and ending at index 1.

Current REGEX is: [a-c]
Current INPUT is: c
Finds the text "c" starting at index 0 and ending at index 1.

Current REGEX is: [a-c]
Current INPUT is: d
No match found.

Current REGEX is: foo[1-5]
Current INPUT is: foo1
Finds the text "foo1" starting at index 0 and ending at index 4.

Current REGEX is: foo[1-5]
Current INPUT is: foo5
Finds the text "foo5" starting at index 0 and ending at index 4.

Current REGEX is: foo[1-5]
Current INPUT is: foo6
No match found.

Current REGEX is: foo[^1-5]
Current INPUT is: foo1
No match found.

Current REGEX is: foo[^1-5]
Current INPUT is: foo6
Finds the text "foo6" starting at index 0 and ending at index 4.

Unions
You can also use unions to create a single character class composed of two or more separate character classes. To create a union, simply nest one class inside the other, such as [0-4[6-8]]. This particular union creates a single character class that matches the numbers 0, 1, 2, 3, 4, 6, 7, and 8.

Current REGEX is: [0-4[6-8]]
Current INPUT is: 0
Finds the text "0" starting at index 0 and ending at index 1.

Current REGEX is: [0-4[6-8]]
Current INPUT is: 5
No match found.

Current REGEX is: [0-4[6-8]]
Current INPUT is: 6
Finds the text "6" starting at index 0 and ending at index 1.

Current REGEX is: [0-4[6-8]]
Current INPUT is: 8
Finds the text "8" starting at index 0 and ending at index 1.

Current REGEX is: [0-4[6-8]]
Current INPUT is: 9
No match found.

Intersections
To create a single character class matching only the characters common to all of its nested classes, use the intersection operator &&, as in [0-9&&[345]]. This particular intersection creates a single character class matching only the numbers common to both character classes: 3, 4, and 5.

Current REGEX is: [0-9&&[345]]
Current INPUT is: 3
Finds the text "3" starting at index 0 and ending at index 1.

Current REGEX is: [0-9&&[345]]
Current INPUT is: 4
Finds the text "4" starting at index 0 and ending at index 1.

Current REGEX is: [0-9&&[345]]
Current INPUT is: 5
Finds the text "5" starting at index 0 and ending at index 1.

Current REGEX is: [0-9&&[345]]
Current INPUT is: 2
No match found.

Current REGEX is: [0-9&&[345]]
Current INPUT is: 6
No match found.

And here's an example that shows the intersection of two ranges:

Current REGEX is: [2-8&&[4-6]]
Current INPUT is: 3
No match found.

Current REGEX is: [2-8&&[4-6]]
Current INPUT is: 4
Finds the text "4" starting at index 0 and ending at index 1.

Current REGEX is: [2-8&&[4-6]]
Current INPUT is: 5
Finds the text "5" starting at index 0 and ending at index 1.

Current REGEX is: [2-8&&[4-6]]
Current INPUT is: 6
Finds the text "6" starting at index 0 and ending at index 1.

Current REGEX is: [2-8&&[4-6]]
Current INPUT is: 7
No match found.

Subtraction
Finally, you can use subtraction to negate one or more nested character classes, such as [0-9&&[^345]]. This example creates a single character class that matches everything from 0 to 9, except the numbers 3, 4, and 5.

Current REGEX is: [0-9&&[^345]]
Current INPUT is: 2
Finds the text "2" starting at index 0 and ending at index 1.

Current REGEX is: [0-9&&[^345]]
Current INPUT is: 3
No match found.

Current REGEX is: [0-9&&[^345]]
Current INPUT is: 4
No match found.

Current REGEX is: [0-9&&[^345]]
Current INPUT is: 5
No match found.

Current REGEX is: [0-9&&[^345]]
Current INPUT is: 6
Finds the text "6" starting at index 0 and ending at index 1.

Current REGEX is: [0-9&&[^345]]
Current INPUT is: 9
Finds the text "9" starting at index 0 and ending at index 1.

Now that we've covered how character classes are created, you may want to review the Character Classes table before continuing with the next section.
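The union, intersection and subtraction classes shown in this section can also be checked directly from Java code. The following is a minimal sketch; the class CharacterClassDemo and its helper matches are hypothetical names, and the helper simply delegates to the static method Pattern.matches, which tests whether the whole input matches the expression.

```java
import java.util.regex.Pattern;

class CharacterClassDemo {
    // Hypothetical helper: true if the entire input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // Union: 0-4 or 6-8.
        System.out.println(matches("[0-4[6-8]]", "5"));     // false
        System.out.println(matches("[0-4[6-8]]", "6"));     // true
        // Intersection: only 3, 4 and 5.
        System.out.println(matches("[0-9&&[345]]", "4"));   // true
        System.out.println(matches("[0-9&&[345]]", "6"));   // false
        // Subtraction: 0-9 except 3, 4 and 5.
        System.out.println(matches("[0-9&&[^345]]", "4"));  // false
        System.out.println(matches("[0-9&&[^345]]", "6"));  // true
    }
}
```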
The Pattern API contains a number of useful predefined character classes, which offer convenient shorthands for commonly-used regular expressions:

Predefined Character Classes
.     Any character (may or may not match line terminators)
\d    A digit: [0-9]
\D    A non-digit: [^0-9]
\s    A whitespace character: [ \t\n\x0B\f\r]
\S    A non-whitespace character: [^\s]
\w    A word character: [a-zA-Z_0-9]
\W    A non-word character: [^\w]

In the table above, each construct in the left-hand column is shorthand for the character class in the right-hand column. For example, \d means a range of digits (0-9), and \w means a word character (any lowercase letter, any uppercase letter, the underscore character, or any digit). Use the predefined classes whenever possible. They make your code easier to read and eliminate errors introduced by malformed character classes.
Constructs beginning with a backslash are called escaped constructs; we previewed escaped constructs in the String Literals section where we mentioned the use of backslash and \Q and \E for quotation. If you are using an escaped construct within a string literal, you must precede the backslash with another backslash for the string to compile. For example:

private final String REGEX = "\\d"; // a single digit

In this example \d is the regular expression; the extra backslash is required so that the code compiles. Our test harness reads the expressions directly from a file, however, so the extra backslash is unnecessary.
The following examples demonstrate the use of predefined character classes. For each case, try to predict the result before you read the output on the last line.
Current REGEX is: .
Current INPUT is: @
Finds the text "@" starting at index 0 and ending at index 1.

Current REGEX is: .
Current INPUT is: 1
Finds the text "1" starting at index 0 and ending at index 1.

Current REGEX is: .
Current INPUT is: a
Finds the text "a" starting at index 0 and ending at index 1.

Current REGEX is: \d
Current INPUT is: 1
Finds the text "1" starting at index 0 and ending at index 1.

Current REGEX is: \d
Current INPUT is: a
No match found.

Current REGEX is: \D
Current INPUT is: 1
No match found.

Current REGEX is: \D
Current INPUT is: a
Finds the text "a" starting at index 0 and ending at index 1.

Current REGEX is: \s
Current INPUT is:  
Finds the text " " starting at index 0 and ending at index 1.

Current REGEX is: \s
Current INPUT is: a
No match found.

Current REGEX is: \S
Current INPUT is:  
No match found.

Current REGEX is: \S
Current INPUT is: a
Finds the text "a" starting at index 0 and ending at index 1.

Current REGEX is: \w
Current INPUT is: a
Finds the text "a" starting at index 0 and ending at index 1.

Current REGEX is: \w
Current INPUT is: !
No match found.

Current REGEX is: \W
Current INPUT is: a
No match found.

Current REGEX is: \W
Current INPUT is: !
Finds the text "!" starting at index 0 and ending at index 1.

In the first three examples, our regular expression is simply . (the "period" or "dot" metacharacter) which indicates "any character." Therefore, the match is successful in all three cases (a randomly-selected @ character, a digit, and a letter). The remaining examples each use a single regular expression construct from the Predefined Character Classes table. You can refer to this table to figure out the logic behind each match:
- \d matches all digits
- \s matches spaces
- \w matches word characters
Alternatively, a capital letter means the opposite:
- \D matches non-digits
- \S matches non-spaces
- \W matches non-word characters
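To see the predefined classes at work from Java source, where each backslash must be doubled inside the string literal, here is a small sketch. The class PredefinedClassDemo, its helper count and the sample input "room 42b" are all made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class PredefinedClassDemo {
    // Hypothetical helper: counts how many times the regex matches in the input.
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "room 42b"; // 8 characters: 5 letters, 1 space, 2 digits
        System.out.println(count("\\d", input)); // 2 digits
        System.out.println(count("\\D", input)); // 6 non-digits
        System.out.println(count("\\s", input)); // 1 whitespace character
        System.out.println(count("\\S", input)); // 7 non-whitespace characters
        System.out.println(count("\\w", input)); // 7 word characters
        System.out.println(count("\\W", input)); // 1 non-word character (the space)
    }
}
```

Each lowercase/uppercase pair partitions the input, so the two counts in a pair add up to the length of the string.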
Quantifiers allow you to specify the number of occurrences to match against. For convenience, the three sections of the API specification describing greedy, reluctant, and possessive quantifiers are presented below. At first glance it may appear that the quantifiers X?, X?? and X?+ do exactly the same thing, since they all promise to match "X, once or not at all". There are subtle implementation differences which will be explained near the end of this section.
Quantifiers
Greedy    Reluctant   Possessive   Meaning
X?        X??         X?+          X, once or not at all
X*        X*?         X*+          X, zero or more times
X+        X+?         X++          X, one or more times
X{n}      X{n}?       X{n}+        X, exactly n times
X{n,}     X{n,}?      X{n,}+       X, at least n times
X{n,m}    X{n,m}?     X{n,m}+      X, at least n but not more than m times

Let's start our look at greedy quantifiers by creating three different regular expressions: the letter "a" followed by either ?, *, or +. The input string in each case is empty.

Current REGEX is: a? // looking for the letter "a", once or not at all
Current INPUT is:
Finds the text "" starting at index 0 and ending at index 0.

Current REGEX is: a* // looking for the letter "a", zero or more times
Current INPUT is:
Finds the text "" starting at index 0 and ending at index 0.

Current REGEX is: a+ // looking for the letter "a", one or more times
Current INPUT is:
No match found.

Zero-Length Matches
In the above example, the match is successful in the first two cases, because the expressions a? and a* both allow for zero occurrences of the letter a. You'll also notice that the start and end indices are both zero, which is unlike any of the examples we've seen so far. The empty input string "" has no length, so the test simply matches nothing at index 0. Matches of this sort are known as zero-length matches. A zero-length match can occur in several cases: in an empty input string, at the beginning of an input string, after the last character of an input string, or in between any two characters of an input string. Zero-length matches are easily identifiable because they always start and end at the same index position.
Let's explore zero-length matches with a few more examples. Change the input string to a single letter "a" and you'll notice something interesting:
Current REGEX is: a?
Current INPUT is: a
Finds the text "a" starting at index 0 and ending at index 1.
Finds the text "" starting at index 1 and ending at index 1.

Current REGEX is: a*
Current INPUT is: a
Finds the text "a" starting at index 0 and ending at index 1.
Finds the text "" starting at index 1 and ending at index 1.

Current REGEX is: a+
Current INPUT is: a
Finds the text "a" starting at index 0 and ending at index 1.

All three quantifiers found the letter "a", but the first two also found a zero-length match at index 1; that is, after the last character of the input string. Remember, the matcher sees the character "a" as sitting in the cell between index 0 and index 1, and our test harness loops until it can no longer find a match. Depending on the quantifier used, the presence of "nothing" at the index after the last character may or may not trigger a match.
Now change the input string to the letter "a" five times in a row and you'll get the following:
Current REGEX is: a?
Current INPUT is: aaaaa
Finds the text "a" starting at index 0 and ending at index 1.
Finds the text "a" starting at index 1 and ending at index 2.
Finds the text "a" starting at index 2 and ending at index 3.
Finds the text "a" starting at index 3 and ending at index 4.
Finds the text "a" starting at index 4 and ending at index 5.
Finds the text "" starting at index 5 and ending at index 5.

Current REGEX is: a*
Current INPUT is: aaaaa
Finds the text "aaaaa" starting at index 0 and ending at index 5.
Finds the text "" starting at index 5 and ending at index 5.

Current REGEX is: a+
Current INPUT is: aaaaa
Finds the text "aaaaa" starting at index 0 and ending at index 5.

The expression a? finds an individual match for each character, since it matches when "a" appears zero or one times. The expression a* finds two separate matches: all of the letter a's in the first match, then the zero-length match after the last character at index 5. And finally, a+ matches all occurrences of the letter a, ignoring the presence of "nothing" at the last index.
At this point, you might be wondering what the results would be if the first two quantifiers encounter a letter other than "a". For example, what happens if they encounter the letter "b", as in "ababaaaab"?
Let's find out:
Current REGEX is: a?
Current INPUT is: ababaaaab
Finds the text "a" starting at index 0 and ending at index 1.
Finds the text "" starting at index 1 and ending at index 1.
Finds the text "a" starting at index 2 and ending at index 3.
Finds the text "" starting at index 3 and ending at index 3.
Finds the text "a" starting at index 4 and ending at index 5.
Finds the text "a" starting at index 5 and ending at index 6.
Finds the text "a" starting at index 6 and ending at index 7.
Finds the text "a" starting at index 7 and ending at index 8.
Finds the text "" starting at index 8 and ending at index 8.
Finds the text "" starting at index 9 and ending at index 9.

Current REGEX is: a*
Current INPUT is: ababaaaab
Finds the text "a" starting at index 0 and ending at index 1.
Finds the text "" starting at index 1 and ending at index 1.
Finds the text "a" starting at index 2 and ending at index 3.
Finds the text "" starting at index 3 and ending at index 3.
Finds the text "aaaa" starting at index 4 and ending at index 8.
Finds the text "" starting at index 8 and ending at index 8.
Finds the text "" starting at index 9 and ending at index 9.

Current REGEX is: a+
Current INPUT is: ababaaaab
Finds the text "a" starting at index 0 and ending at index 1.
Finds the text "a" starting at index 2 and ending at index 3.
Finds the text "aaaa" starting at index 4 and ending at index 8.

Even though the letter "b" appears in cells 1, 3, and 8, the output reports a zero-length match at those locations. The regular expression a? is not specifically looking for the letter "b"; it's merely looking for the presence (or lack thereof) of the letter "a". If the quantifier allows for a match of "a" zero times, anything in the input string that's not an "a" will show up as a zero-length match. The remaining a's are matched according to the rules discussed in the previous examples.
To match a pattern exactly n number of times, simply specify the number inside a set of braces:
Current REGEX is: a{3}
Current INPUT is: aa
No match found.

Current REGEX is: a{3}
Current INPUT is: aaa
Finds the text "aaa" starting at index 0 and ending at index 3.

Current REGEX is: a{3}
Current INPUT is: aaaa
Finds the text "aaa" starting at index 0 and ending at index 3.

Here, the regular expression a{3} is searching for three occurrences of the letter "a" in a row. The first test fails because the input string does not have enough a's to match against. The second test contains exactly 3 a's in the input string, which triggers a match. The third test also triggers a match because there are exactly 3 a's at the beginning of the input string. Anything following that is irrelevant to the first match. If the pattern should appear again after that point, it would trigger subsequent matches:

Current REGEX is: a{3}
Current INPUT is: aaaaaaaaa
Finds the text "aaa" starting at index 0 and ending at index 3.
Finds the text "aaa" starting at index 3 and ending at index 6.
Finds the text "aaa" starting at index 6 and ending at index 9.

To require a pattern to appear at least n times, add a comma after the number:

Current REGEX is: a{3,}
Current INPUT is: aaaaaaaaa
Finds the text "aaaaaaaaa" starting at index 0 and ending at index 9.

With the same input string, this test finds only one match, because the 9 a's in a row satisfy the need for "at least" 3 a's.
Finally, to specify an upper limit on the number of occurrences, add a second number inside the braces:
Current REGEX is: a{3,6} // find at least 3 (but no more than 6) a's in a row
Current INPUT is: aaaaaaaaa
Finds the text "aaaaaa" starting at index 0 and ending at index 6.
Finds the text "aaa" starting at index 6 and ending at index 9.

Here the first match is forced to stop at the upper limit of 6 characters. The second match includes whatever is left over, which happens to be three a's, the minimum number of characters allowed for this match. If the input string were one character shorter, there would not be a second match since only two a's would remain.

Capturing Groups and Character Classes with Quantifiers
Until now, we've only tested quantifiers on input strings containing one character. In fact, quantifiers can only attach to one character at a time, so the regular expression "abc+" would mean "a, followed by b, followed by c one or more times". It would not mean "abc" one or more times. However, quantifiers can also attach to Character Classes and Capturing Groups, such as [abc]+ (a or b or c, one or more times) or (abc)+ (the group "abc", one or more times).
Let's illustrate by specifying the group (dog), three times in a row.

Current REGEX is: (dog){3}
Current INPUT is: dogdogdogdogdogdog
Finds the text "dogdogdog" starting at index 0 and ending at index 9.
Finds the text "dogdogdog" starting at index 9 and ending at index 18.

Current REGEX is: dog{3}
Current INPUT is: dogdogdogdogdogdog
No match found.

Here the first example finds two matches, each made up of three repetitions of "dog", since the quantifier applies to the entire capturing group. Remove the parentheses, however, and the match fails because the quantifier {3} now applies only to the letter "g".
Similarly, we can apply a quantifier to an entire character class:

Current REGEX is: [abc]{3}
Current INPUT is: abccabaaaccbbbc
Finds the text "abc" starting at index 0 and ending at index 3.
Finds the text "cab" starting at index 3 and ending at index 6.
Finds the text "aaa" starting at index 6 and ending at index 9.
Finds the text "ccb" starting at index 9 and ending at index 12.
Finds the text "bbc" starting at index 12 and ending at index 15.

Current REGEX is: abc{3}
Current INPUT is: abccabaaaccbbbc
No match found.

Here the quantifier {3} applies to the entire character class in the first example, but only to the letter "c" in the second.

Differences Among Greedy, Reluctant, and Possessive Quantifiers
As mentioned earlier, there are subtle differences among greedy, reluctant, and possessive quantifiers.
Greedy quantifiers are considered "greedy" because they force the matcher to read in, or eat, the entire input string prior to attempting the first match. If the first match attempt (the entire input string) fails, the matcher backs off the input string by one character and tries again, repeating the process until a match is found or there are no more characters left to back off from. Depending on the quantifier used in the expression, the last thing it will try matching against is 1 or 0 characters.
The reluctant quantifiers, however, take the opposite approach: they start at the beginning of the input string, then reluctantly eat one character at a time looking for a match. The last thing they try is the entire input string.
Finally, the possessive quantifiers always eat the entire input string, trying once (and only once) for a match. Unlike the greedy quantifiers, possessive quantifiers never back off, even if doing so would allow the overall match to succeed.
To illustrate, consider the input string xfooxxxxxxfoo.

Current REGEX is: .*foo // greedy quantifier
Current INPUT is: xfooxxxxxxfoo
Finds the text "xfooxxxxxxfoo" starting at index 0 and ending at index 13.

Current REGEX is: .*?foo // reluctant quantifier
Current INPUT is: xfooxxxxxxfoo
Finds the text "xfoo" starting at index 0 and ending at index 4.
Finds the text "xxxxxxfoo" starting at index 4 and ending at index 13.

Current REGEX is: .*+foo // possessive quantifier
Current INPUT is: xfooxxxxxxfoo
No match found.

The first example uses the greedy quantifier .* to find "anything", zero or more times, followed by the letters "f" "o" "o". Because the quantifier is greedy, the .* portion of the expression first eats the entire input string. At this point, the overall expression cannot succeed, because the last three letters ("f" "o" "o") have already been consumed. So the matcher slowly backs off one letter at a time until the rightmost occurrence of "foo" has been regurgitated, at which point the match succeeds and the search ends.
The second example, however, is reluctant, so it starts by first consuming "nothing". Because "foo" doesn't appear at the beginning of the string, it's forced to swallow the first letter (an "x"), which triggers the first match at 0 and 4. Our test harness continues the process until the input string is exhausted. It finds another match at 4 and 13.
The third example fails to find a match because the quantifier is possessive. In this case, the entire input string is consumed by .*+, leaving nothing left over to satisfy the "foo" at the end of the expression. Use a possessive quantifier for situations where you want to seize all of something without ever backing off; it will outperform the equivalent greedy quantifier in cases where the match is not immediately found.
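The three behaviors described above can be compared in one short program. This is a sketch under stated assumptions: the class QuantifierDemo and its helper findAll are hypothetical names, and the helper just collects the text of every match that Matcher.find reports.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class QuantifierDemo {
    // Hypothetical helper: returns the text of every match, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher matcher = Pattern.compile(regex).matcher(input);
        while (matcher.find()) {
            found.add(matcher.group());
        }
        return found;
    }

    public static void main(String[] args) {
        String input = "xfooxxxxxxfoo";
        // Greedy: one match covering the whole input.
        System.out.println(findAll(".*foo", input));   // [xfooxxxxxxfoo]
        // Reluctant: stops at the first "foo" each time.
        System.out.println(findAll(".*?foo", input));  // [xfoo, xxxxxxfoo]
        // Possessive: .*+ keeps everything, so "foo" can never match.
        System.out.println(findAll(".*+foo", input));  // []
    }
}
```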
In the previous section, we saw how quantifiers attach to one character, character class, or capturing group at a time. But until now, we have not discussed the notion of capturing groups in any detail.
Capturing groups are a way to treat multiple characters as a single unit. They are created by placing the characters to be grouped inside a set of parentheses. For example, the regular expression (dog) creates a single group containing the letters "d" "o" and "g". The portion of the input string that matches the capturing group will be saved in memory for later recall via backreferences (as discussed below in the section Backreferences).

Numbering
As described in the Pattern API (http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Pattern.html), capturing groups are numbered by counting their opening parentheses from left to right. In the expression ((A)(B(C))), for example, there are four such groups:
1. ((A)(B(C)))
2. (A)
3. (B(C))
4. (C)
To find out how many groups are present in the expression, call the groupCount method on a matcher object. The groupCount method returns an int showing the number of capturing groups present in the matcher's pattern. In this example, groupCount would return the number 4, showing that the pattern contains 4 capturing groups.
There is also a special group, group 0, which always represents the entire expression. This group is not included in the total reported by groupCount. Also note that groups beginning with (? are pure, non-capturing groups that do not capture text and do not count towards the group total. (You'll see examples of non-capturing groups later in the section Methods of the Pattern Class.)
It's important to understand how groups are numbered because some Matcher methods accept an int specifying a particular group number as a parameter:
- public int start(int group) (http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Matcher.html):
Returns the start index of the subsequence captured by the given group during the previous match operation.
- public int end(int group) (http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Matcher.html):
Returns the index of the last character, plus one, of the subsequence captured by the given group during the previous match operation.
- public String group(int group) (http://java.sun.com/j2se/1.4.2/docs/api/java/util/regex/Matcher.html):
Returns the input subsequence captured by the given group during the previous match operation.

Backreferences
The section of the input string matching the capturing group(s) is saved in memory for later recall via a backreference. A backreference is specified in the regular expression as a backslash (\) followed by a digit indicating the number of the group to be recalled. For example, the expression (\d\d) defines one capturing group matching two digits in a row, which can be recalled later in the expression via the backreference \1.
To match any 2 digits, followed by the exact same two digits, you would use (\d\d)\1 as the regular expression:

Current REGEX is: (\d\d)\1
Current INPUT is: 1212
Finds the text "1212" starting at index 0 and ending at index 4.

If you change the last two digits, the match will fail:

Current REGEX is: (\d\d)\1
Current INPUT is: 1234
No match found.

For nested capturing groups, backreferencing works in exactly the same way: specify a backslash followed by the number of the group to be recalled.
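Group numbering and backreferences can both be explored with a small sketch like the one below. The class GroupDemo and its helper groups are hypothetical; the helper uses Matcher.matches, Matcher.groupCount and Matcher.group(int), and the input "ABC" is chosen only so that every group in ((A)(B(C))) captures something.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupDemo {
    // Hypothetical helper: returns the text captured by each group
    // (index 0 is the whole match), or null if the input does not match.
    static String[] groups(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        if (!matcher.matches()) {
            return null;
        }
        // groupCount() excludes group 0, hence the +1.
        String[] captured = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            captured[i] = matcher.group(i);
        }
        return captured;
    }

    public static void main(String[] args) {
        String[] captured = groups("((A)(B(C)))", "ABC");
        for (int i = 0; i < captured.length; i++) {
            System.out.println("group " + i + ": " + captured[i]);
        }
        // Backreference \1 must repeat whatever group 1 captured.
        System.out.println(groups("(\\d\\d)\\1", "1212") != null); // true
        System.out.println(groups("(\\d\\d)\\1", "1234") != null); // false
    }
}
```

For the first expression the loop prints ABC for groups 0 and 1, then A, BC and C for groups 2, 3 and 4, following the left-to-right order of the opening parentheses.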
Until now, we've only been interested in whether or not a match is found at some location within a particular input string. We never cared about where in the string the match was taking place.
You can make your pattern matches more precise by specifying such information with boundary matchers. For example, maybe you're interested in finding a particular word, but only if it appears at the beginning or end of a line. Or maybe you want to know if the match is taking place on a word boundary, or at the end of the previous match.
The following table lists and explains all the boundary matchers.

Boundary Matchers
^     The beginning of a line
$     The end of a line
\b    A word boundary
\B    A non-word boundary
\A    The beginning of the input
\G    The end of the previous match
\Z    The end of the input but for the final terminator, if any
\z    The end of the input

The following examples demonstrate the use of boundary matchers ^ and $. As noted above, ^ matches the beginning of a line, and $ matches the end.

Current REGEX is: ^dog$
Current INPUT is: dog
Finds the text "dog" starting at index 0 and ending at index 3.

Current REGEX is: ^dog$
Current INPUT is:       dog
No match found.

Current REGEX is: \s*dog$
Current INPUT is:             dog
Finds the text "            dog" starting at index 0 and ending at index 15.

Current REGEX is: ^dog\w*
Current INPUT is: dogblahblah
Finds the text "dogblahblah" starting at index 0 and ending at index 11.

The first example is successful because the pattern occupies the entire input string. The second example fails because the input string contains extra whitespace at the beginning. The third example specifies an expression that allows for unlimited white space, followed by "dog" on the end of the line. The fourth example requires "dog" to be present at the beginning of a line followed by an unlimited number of word characters.
To check if a pattern begins and ends on a word boundary (as opposed to a substring within a longer string), just use \b on either side; for example, \bdog\b

Current REGEX is: \bdog\b
Current INPUT is: The dog plays in the yard.
Finds the text "dog" starting at index 4 and ending at index 7.

Current REGEX is: \bdog\b
Current INPUT is: The doggie plays in the yard.
No match found.

To match the expression on a non-word boundary, use \B instead:

Current REGEX is: \bdog\B
Current INPUT is: The dog plays in the yard.
No match found.

Current REGEX is: \bdog\B
Current INPUT is: The doggie plays in the yard.
Finds the text "dog" starting at index 4 and ending at index 7.

To require the match to occur only at the end of the previous match, use \G:

Current REGEX is: dog // Without \G
Current INPUT is: dog dog
Finds the text "dog" starting at index 0 and ending at index 3.
Finds the text "dog" starting at index 4 and ending at index 7.

Current REGEX is: \Gdog // With \G
Current INPUT is: dog dog
Finds the text "dog" starting at index 0 and ending at index 3.

Here the second example finds only one match, because the second occurrence of "dog" does not start at the end of the previous match.
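The boundary matchers above can be tried out with a short sketch. The class BoundaryDemo and its helper count are hypothetical names; the helper counts the matches reported by Matcher.find, and the backslashes are doubled because the expressions are Java string literals.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class BoundaryDemo {
    // Hypothetical helper: counts how many times the regex matches in the input.
    static int count(String regex, String input) {
        Matcher matcher = Pattern.compile(regex).matcher(input);
        int matches = 0;
        while (matcher.find()) {
            matches++;
        }
        return matches;
    }

    public static void main(String[] args) {
        // ^ and $: the pattern must span the whole line.
        System.out.println(count("^dog$", "dog"));      // 1
        System.out.println(count("^dog$", "   dog"));   // 0
        // \b: "dog" must be a whole word.
        System.out.println(count("\\bdog\\b", "The dog plays in the yard."));     // 1
        System.out.println(count("\\bdog\\b", "The doggie plays in the yard."));  // 0
        // \B: "dog" must be followed by another word character.
        System.out.println(count("\\bdog\\B", "The doggie plays in the yard."));  // 1
        // \G: each match must start where the previous one ended.
        System.out.println(count("dog", "dog dog"));     // 2
        System.out.println(count("\\Gdog", "dog dog"));  // 1
    }
}
```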