Oracle Regular Expressions

Oracle Regular Expressions

Overview

The regular expressions feature in Oracle Database 10g is a powerful tool for manipulating textual data
A new feature of Oracle Database 10g vastly improves your ability to search and manipulate character data. This feature, regular expressions, is a notation for describing textual patterns. It has long been available in many programming languages and a number of UNIX utilities.
Oracle's implementation of regular expressions comes in the form of various SQL functions and a WHERE clause operator. If you are unfamiliar with regular expressions, this article can give you a glimpse into this new and extremely powerful yet seemingly cryptic capability. Readers already familiar with regular expressions can gain an understanding of how to apply this functionality within the context of the Oracle SQL language.

What Is a Regular Expression?

A regular expression comprises one or more character literals and/or metacharacters. In its simplest format, a regular expression can consist only of character literals, such as the regular expression cat. It is read as the letter c followed by the letters aand t and this pattern matches strings such as cat, location, and catalog. Metacharacters provide algorithms that specify how Oracle should process the characters that make up a regular expression. When you understand the meaning of the various metacharacters, you will see that regular expressions are powerful for isolating and replacing specific textual data.
Data validation, identification of duplicate word occurrences, detection of extraneous white spaces, or parsing of strings are just some of the many uses of regular expressions. You can apply them in order to validate the formats of phone numbers, zip codes, email addresses, Social Security numbers, IP addresses, filenames and pathnames, and so on. Furthermore, you can locate patterns such as HTML tags, numbers, dates, or anything that fits any pattern within any textual data and replace them with other patterns.

Using Regular Expressions With Oracle Database 10g

To harness the power of regular expressions, you can exploit the newly introduced Oracle SQL REGEXP_LIKE operator and the REGEXP_INSTR, REGEXP_SUBSTR, and REGEXP_REPLACE functions. You will see how this new functionality supplements the existing LIKE operator and the INSTR, SUBSTR, and REPLACE functions. In fact, they are similar to the existing operator and functions but now offer powerful pattern-matching capabilities. The searched data can be simple strings or large volumes of text stored in the database character columns. Regular expressions let you search, replace, and validate data in ways you have never thought of before, with a high degree of flexibility.

Basic Examples of Regular Expressions

Before using the new functionality, you need to understand the meaning of some of the metacharacters. The period (.) matches any character (except newline) in a regular expression. For example, the regular expression a.b matches a string containing the letter a, followed by any other single character (except newline), followed by the letter b. The strings axb, xaybx, and abbaare matches because this pattern is buried in the string. If you want to exactly match a three-letter string in which the line begins with aand ends with b, anchor the regular expression. The caret (^) metacharacter indicates the start of a line, and the dollar symbol ($) designates the end of the line (see Table 1). Therefore, the regular expression ^a.b$ matches the strings aab, abb, or axb. To contrast this approach with the familiar pattern matching available with the LIKE operator, you can express such a pattern as a_b, where the underbar (_) is the one-character wildcard.
By default, an individual character or character list in a regular expression matches just once. To indicate multiple occurrences of a character in a regular expression, you apply a quantifier, also called a repetition operator. If you want a match that starts with the letter aand ends with the letter b, your regular expression looks like this: ^a.*b$. The *metacharacter repeats the preceding match any metacharacter (.) zero, one, or more times. The equivalent pattern with the LIKE operator is a%b, with the percent (%) indicating zero, one, or multiple occurrences of any character.
Table 2 shows the complete list of repetition operators. Notice that it contains specific repetition choices that allow more flexibility than the existing LIKE wildcard characters. If you use parentheses around an expression, effectively creating a subexpression, the subexpression can be repeated a certain number of times. For example, the regular expression b(an)*amatches ba, bana, banana, yourbananasplit, and so on.
Oracle's regular expression implementation supports the POSIX (Portable Operating System Interface) character classes, as listed in Table 3. This means that you can be very specific about the type of character you are looking for. Imagine writing a LIKE condition that looks only for nonalphabetic characters.the resulting WHERE clause could easily become very complex.
The POSIX character class must be enclosed by a character list indicated by square brackets ([]). For example, the regular expression matches a lowercase character and {5}matches five consecutive lowercase characters.
Besides the POSIX character classes, you can place individual characters in a character list. For example, the regular expression ^ab[cd]ef$matches the strings abcefand abdef. Either cor dmust be chosen.
Most metacharacters inside a character list are understood as literals, with the exception of the caret (^) and the hyphen (-). Regular expressions appear complicated because some metacharacters have multiple meanings, depending on the context. The ^is just one such metacharacter. If you use it as the first character inside a character list, it means the negation of a character list. Therefore, [^[:]]looks for a pattern consisting of any nondigit character whereas ^looks for matches that start with a digit. The hyphen (-) indicates a range; the regular expression [a-m]matches any of the letters athrough letter m. But it means the literal hyphen if it is the first character in a character list such as in [-afg].
One of the previous examples introduced the use of parentheses to create a subexpression; they allow you to enter alternates separated by the vertical bar (|) alteration metacharacter.
For example, the regular expression t(a|e|i)nallows three possible alternate characters between the letters tand n. Matches include words such as tan, ten, tin, and Pakistanbut not teen, mountain, or tune. Alternatively, the regular expression t(a|e|i)ncan also be expressed as a character list, which is t[aei]n. Table 4summarizes these metacharacters. Although more metacharacters exist, this brief overview is sufficient for understanding the regular expressions this article uses.

The REGEXP_LIKE Operator

The REGEXP_LIKE operator introduces you to regular expression functionality when applied within the Oracle database. Table 5lists the syntax of REGEXP_LIKE.
The following SQL query's WHERE clause shows the REGEXP_LIKE operator, which searches the ZIP column for a pattern that satisfies the regular expression [^[:]]. It will retrieve those rows in the ZIPCODE table for which the ZIP column values contain any character that is not a numeric digit.
SELECT zip
FROM zipcode
WHERE REGEXP_LIKE(zip, '[^[:]]')
ZIP
-----
ab123
123xy
007ab
abcxy 
This example of a regular expression consists only of metacharacters.more specifically the POSIX character class digitdelimited by colons and square brackets. The second set of brackets (as in [^[:]]) encloses a character class list. As previously mentioned, this is required because you can use POSIX character classes only for constructing a character list.

The REGEXP_INSTR Function

This function returns the starting position of a pattern, so it works much like the familiar INSTR function. The syntax of the new REGEXP_INSTR function is shown in Table 6. The main difference between the two functions is that REGEXP_INSTRlets you specify a pattern instead of a specific search string; thus providing greater versatility. The next example uses REGEXP_INSTRto return the starting position of the five-digit zip-code pattern within the string Joe Smith, 10045 Berry Lane, San Joseph, CA 91234. If the regular expression is written as {5}, you will get the house number's starting position instead of the zip code's, because 10045is the first occurrence of five consecutive digits. Therefore, anchor the expression to the end of the line, as indicated with the $metacharacter, and the function will display the starting position of the zip code regardless of the number of digits for the house number.
SELECT REGEXP_INSTR('Joe Smith, 10045 Berry Lane, San Joseph, CA 91234',
'{5}$')
AS rx_instr
FROM dual
RX_INSTR
---------- 
45 
Writing More Complex Patterns

Let's expand on the zip code pattern of the previous example to include an optional four digits. Your pattern may now look like this: {5}(-{4})?$. If your source string ends in either the 5-digit zip code or the 5-digit + 4 zip-code format, you'll be able to show the pattern's starting position.
SELECT REGEXP_INSTR('Joe Smith, 10045 Berry Lane, San Joseph, CA 91234-1234',
' {5}(-{4})?$')
AS starts_at
FROM dual
STARTS_AT
----------
44 
In this example the parenthesized subexpression (-{4})is repeated zero or one times, as indicated by the ?repetition operator. Again, attempting to use the traditional SQL functions to accomplish the same result poses a challenge even to SQL experts. To better explain the various components of the regular expression example, Table 7 contains a description of the individual literals and metacharacters.

The REGEXP_SUBSTR Function

The REGEXP_SUBSTR function, much like the SUBSTR function, extracts part of a string. Table 8 displays the syntax of the new function. In the following example, the string that matches the pattern , [^,]*, is returned. The regular expression searches for a comma followed by a space; then zero or more characters that are not commas, as indicated by [^,]*; and lastly looks for another comma. The pattern will look somewhat similar to a comma-separated values string.
SELECT REGEXP_SUBSTR('first field, second field , third field',
', [^,]*,')
FROM dual
REGEXP_SUBSTR('FIR
------------------
, second field , 
The REGEXP_REPLACE Function

Let's first look at the traditional REPLACESQL function, which substitutes one string with another. Assume your data has extraneous spaces in the text and you would like to replace them with a single space. With the REPLACE function, you need to list exactly how many spaces you want to replace. However, the number of extra spaces may not be the same everywhere in the text. The next example has three spaces between Joeand Smith. The REPLACE function's parameter specifies that two spaces should be replaced with one space. In this case, the result leaves an extra space where there were three spaces in the original string between Joeand Smith.
SELECT REPLACE('Joe Smith',' ', ' ')
AS replace
FROM dual
REPLACE
---------
Joe Smith 
The REGEXP_REPLACE function takes the substitution a step further; the syntax is listed in Table 9. The following query replaces any two or more spaces with a single space. The ( )subexpression contains a single space, which can be repeated two or more times, as indicated by {2,}.
SELECT REGEXP_REPLACE('Joe Smith',
'( ){2,}', ' ')
AS RX_REPLACE
FROM dual
RX_REPLACE
----------
Joe Smith 
Backreferences

A useful feature of regular expressions is the ability to store subexpressions for reuse later; this is also called backreferencing(summarized in Table 10). It allows sophisticated replace capabilities such as swapping patterns in new positions or indicating repeated word or letter occurrences. The matched part of the subexpression is stored in a temporary buffer. The buffer is numbered from left to right and accessed with the \digitnotation, where digitis a number between 1 and 9 and matches the digit-th subexpression, as indicated by a set of parentheses.
The next example shows the name Ellen Hildi Smith transformed to Smith, Ellen Hildi, by referring to the individual subexpressions by number.
SELECT REGEXP_REPLACE(
'Ellen Hildi Smith',
'(.*) (.*) (.*)', '\3, \1 \2')
FROM dual
REGEXP_REPLACE('EL
------------------
Smith, Ellen Hildi 
The SQL statement shows three individual subexpressions enclosed by parentheses. Each individual subexpression consists of a match any metacharacter (.) followed by the *metacharacter, indicating that any character (except newline) must be matched zero or more times. A space separates each subexpression and must be matched as well. The parentheses create subexpressions that capture the values and can be referenced with \digit. The first subexpression is assigned \1, the second \2, and so on. These backreferences are used in the last parameter of this function (\3, \1 \2), which effectively returns the replacement substrings and places them in the desired format (including comma and spaces). Table 11details the individual components of this regular expression.
Backreferences are useful for replacing, formatting, and substituting values, and you can apply them to find adjacent occurrences of values. The next example shows use of the REGEP_SUBSTR function to find any duplicate occurrences of alphanumeric values separated by a space. The displayed result shows the substring that identifies the duplicated words is.
SELECT REGEXP_SUBSTR(
'The final test is is the implementation',
'(+)(+)\1') AS substr
FROM dual
SUBSTR
------
is is 
The Match Parameter Option

You may have noticed that the regular expression operator and functions contain an optional match parameter. This parameter controls case-sensitivity, matching of the newline character, and retaining multiline inputs.

Practical Applications for Regular Expressions

You can use regular expressions not only in queries but also anywhere you can use a SQL operator or function, such as in the PL/SQL language. You can write triggers that take advantage of regular expression functionality in order to validate, generate, or extract values.
The next example illustrates how you can apply the REGEXP_LIKE operator in a column check constraint for data validation. It checks for the correct Social Security number format upon insert or update. Social Security numbers in such formats as 123-45-6789 and 123456789 are acceptable values for this column constraint. Valid data must begin with three digits, followed by a hyphen, two more digits and a hyphen, and lastly another four digits. The alternate expression allows only nine consecutive digits. The vertical bar symbol (|) separates the individual choices.
ALTER TABLE students
  ADD CONSTRAINT stud_ssn_ck CHECK
  (REGEXP_LIKE(ssn,
  '^({3}-{2}-{4}|{9})$'))
Leading or trailing characters are not acceptable, as indicated by ^and $. Make sure your regular expression does not split across multiple lines or contain any extraneous spaces unless you want them to be part of the pattern and matched accordingly. Table 12explains the individual components of this regular expression example.

Next Steps

Visit the Oracle Database 10g page: /products/database/oracle10g/index.html

Comparing Regular Expressions to Existing Functionality

Regular expressions have several advantages over the familiar LIKE operator and INSTR, SUBSTR, and REPLACE functions. These traditional SQL functions have no facility for matching patterns. Only the LIKE operator performs matching of characters, through the use of the %and _wildcards, but LIKEdoes not support repetitions of expressions, complex alternations, ranges of characters, characters lists, POSIX character classes, and so on. Furthermore, the new regular expression functions allow detection of duplicate word occurrences and swapping of patterns. The examples here offer you a glimpse into the world of regular expressions and on how you can apply them in your applications.

A Solid Addition to Your Toolkit

Because they help solve complex problems, regular expressions are very powerful. Some of the functionality of regular expressions is difficult to duplicate by using traditional SQL functions. When you've learned the basic building blocks of this somewhat cryptic language, regular expressions will become an indispensable part of your toolkit in the context of not only SQL but also other programming languages. Although trial and error are sometimes necessary to get your individual pattern right, the elegance and power of regular expressions is indisputable.
Alice Rischert (ar280@yahoo.com) chairs the Database Application Development and Design track at Columbia University's Computer Technology and Application Program. She is the author of the Oracle SQL Interactive Workbook 2nd edition (Prentice Hall, 2002) and the forthcoming Oracle SQL by Example(Prentice Hall, 2003). Rischert has over 15 years of experience as a database architect, DBA, and project leader for Fortune 100 companies and she has worked with Oracle since version 5.

Table 1: Anchoring Metacharacters

Metacharacter Description

^ Anchor the expression to the start of a line

$ Anchor the expression to the end of a line

Table 2: Quantifiers, or Repetition Operators

Quantifier Description

* Match 0 or more times

? Match 0 or 1 time

+ Match 1 or more times

{m} Match exactly m times

{m,} Match at least m times

{m, n} Match at least m times but no more than n times

Table 3: Predefined POSIX Character Classes

Character Class Description

[:] Alphabetic characters

[:] Lowercase alphabetic characters

[:] Uppercase alphabetic characters

[:] Numeric digits

[:] Alphanumeric characters

[:] Space characters (nonprinting), such as carriage return, newline, vertical tab, and form feed

[:] Punctuation characters

[:] Control characters (nonprinting)

[:] Printable characters

Table 4: Alternate Matching and Grouping of Expressions

Metacharacter Description

| Alternation Separates alternates, often used with grouping operator ()

( ) Group Groups subexpression into a unit for alternations, for quantifiers, or for backreferencing (see "Backreferences" section)

[char] Character list Indicates a character list; most metacharacters inside a character list are understood as literals, with the exception of character classes, and the ^and -metacharacters

Table 5: The REGEXP_LIKE Operator

Syntax Description

REGEXP_LIKE(source_string, pattern [, match_parameter]) source_stringsupports character datatypes (CHAR, VARCHAR2, CLOB, NCHAR, NVARCHAR2, and NCLOBbut not LONG). The patternparameter is another name for the regular expression. match_parameterallows optional parameters such as handling the newline character, retaining multiline formatting, and providing control over case-sensitivity.

Table 6: The REGEXP_INSTR Function

Syntax Description

REGEXP_INSTR(source_string, pattern [, start_position [, occurrence [, return_option [, match_parameter]]]]) This function looks for a patternand returns the first position of the pattern. Optionally, you can indicate the start_positionyou want to begin the search. The occurrenceparameter defaults to 1 unless you indicate that you are looking for a subsequent occurrence. The default value of the return_optionis 0, which returns the starting position of the pattern; a value of 1 returns the starting position of the next character following the match.

Table 7: Explanation of 5-digit + 4 Zip-Code Expression

Syntax Description

Empty space that must be matched

[:] POSIX numeric digit class

] End of character list

{5} Repeat exactly five occurrences of the character list

( Start of subexpression

- A literal hyphen, because it is not a range metacharacter inside a character list

[ Start of character list

[:] POSIX [:]class

[ Start of character list

] End of character list

{4} Repeat exactly four occurrences of the character list

) Closing parenthesis, to end the subexpression

? The ?quantifier matches the grouped subexpression 0 or 1 time thus making the 4-digit code optional

$ Anchoring metacharacter, to indicate the end of the line

Table 8: The REGEXP_SUBSTR Function

Syntax Description

REGEXP_SUBSTR(source_string, pattern [, position [, occurrence [, match_parameter]]]) The REGEXP_SUBSTR function returns the substring that matches the pattern.

Table 9: The REGEXP_REPLACE Function

Syntax Description

REGEXP_REPLACE(source_string, pattern [, replace_string [, position [,occurrence, [match_parameter]]]]) This function replaces the matching pattern with a specified replace_string, allowing complex search-and-replace operations.

Table 10: Backreference Metacharacter

Metacharacter Description

\digit Backslash Followed by a digit between 1 and 9, the backslash matches the preceding digit-th parenthesized subexpression. (Note: The backslash has another meaning in regular expressions; depending on the context it can also mean the Escape character

Table 11: Explanation of Pattern-Swap Regular Expression

Regular-Expression Item Description

( Start of first subexpression

. Match any single character except a newline

* Repetition operator, matches previous .metacharacter 0 to n times

) End of first subexpression; result of the match is captured in \1 (In this example, it's Ellen.)

Empty space that needs to be present

( Start of the second subexpression

. Match any single character except a newline

* Repetition operator matches the previous .metacharacter 0 to n times

) End of second subexpression; result of this match is captured in \2 (In this example, it stores Hildi.)

Empty space

( Start of third subexpression

. Match any single character except a newline

* Repetition operator matches .metacharacter 0 to n times

) End of third subexpression; result of this match is captured in \3 (In this example, it holds Smith.)

Table 12: Explanation of the Social Security Number Regular Expression

Regular-Expression Item Description

^ Start of line character (Regular expression cannot have any leading characters before the match.)

( Start subexpression and list alternates separated by the | metacharacter

[ Start of character list

[:] POSIX numeric digit class

] End of character list

{3} Repeat exactly three occurrences of character list

- A hyphen

[ Start of character list

[:] POSIX numeric digit class

] End of character list

{2} Repeat exactly two occurrences of character list

- Another hyphen

[ Start of character list

[:] POSIX numeric digit class

] End of character list

{4} Repeat exactly four occurrences of character list

| Alternation metacharacter; ends the first choice and starts the next alternate expression

[ Start of character list

[:] POSIX numeric digit class.

] End of character list

{9} Repeat exactly nine occurrences of character list

) Ending parenthesis, to close the subexpression group used for alternation

$ Anchoring metacharacter, to indicate the end of the line; no extra characters can follow the pattern

Metacharacter	Description
^	Anchor the expression to the start of a line
$	Anchor the expression to the end of a line

Quantifier	Description
*	Match 0 or more times
?	Match 0 or 1 time
+	Match 1 or more times
{m}	Match exactly m times
{m,}	Match at least m times
{m, n}	Match at least m times but no more than n times

Character Class	Description
[:]	Alphabetic characters
[:]	Lowercase alphabetic characters
[:]	Uppercase alphabetic characters
[:]	Numeric digits
[:]	Alphanumeric characters
[:]	Space characters (nonprinting), such as carriage return, newline, vertical tab, and form feed
[:]	Punctuation characters
[:]	Control characters (nonprinting)
[:]	Printable characters

Metacharacter	Description
\|	Alternation Separates alternates, often used with grouping operator ()
( )	Group Groups subexpression into a unit for alternations, for quantifiers, or for backreferencing (see "Backreferences" section)
[char]	Character list Indicates a character list; most metacharacters inside a character list are understood as literals, with the exception of character classes, and the ^and -metacharacters

Syntax	Description
	Empty space that must be matched
[:]	POSIX numeric digit class
]	End of character list
{5}	Repeat exactly five occurrences of the character list
(	Start of subexpression
-	A literal hyphen, because it is not a range metacharacter inside a character list
[	Start of character list
[:]	POSIX [:]class
[	Start of character list
]	End of character list
{4}	Repeat exactly four occurrences of the character list
)	Closing parenthesis, to end the subexpression
?	The ?quantifier matches the grouped subexpression 0 or 1 time thus making the 4-digit code optional
$	Anchoring metacharacter, to indicate the end of the line

Regular-Expression Item	Description
(	Start of first subexpression
.	Match any single character except a newline
*	Repetition operator, matches previous .metacharacter 0 to n times
)	End of first subexpression; result of the match is captured in \1 (In this example, it's Ellen.)
	Empty space that needs to be present
(	Start of the second subexpression
.	Match any single character except a newline
*	Repetition operator matches the previous .metacharacter 0 to n times
)	End of second subexpression; result of this match is captured in \2 (In this example, it stores Hildi.)
	Empty space
(	Start of third subexpression
.	Match any single character except a newline
*	Repetition operator matches .metacharacter 0 to n times
)	End of third subexpression; result of this match is captured in \3 (In this example, it holds Smith.)

Regular-Expression Item	Description
^	Start of line character (Regular expression cannot have any leading characters before the match.)
(	Start subexpression and list alternates separated by the \| metacharacter
[	Start of character list
[:]	POSIX numeric digit class
]	End of character list
{3}	Repeat exactly three occurrences of character list
-	A hyphen
[	Start of character list
[:]	POSIX numeric digit class
]	End of character list
{2}	Repeat exactly two occurrences of character list
-	Another hyphen
[	Start of character list
[:]	POSIX numeric digit class
]	End of character list
{4}	Repeat exactly four occurrences of character list
\|	Alternation metacharacter; ends the first choice and starts the next alternate expression
[	Start of character list
[:]	POSIX numeric digit class.
]	End of character list
{9}	Repeat exactly nine occurrences of character list
)	Ending parenthesis, to close the subexpression group used for alternation
$	Anchoring metacharacter, to indicate the end of the line; no extra characters can follow the pattern