Click here to read the latest newsletter! This is an iEntry.com Website
Search iEntry News

xp_regex: Regular Expressions in SQL Server 2000

By Dan Farino
Contributing Writer
Article Date: 02.20.03



Download source files/compiled DLL - 58 Kb

Regular Expressions are probably the best tool for text parsing. Using Regular Expressions greatly simplifies anything for which you would normally use CHARINDEX, PATINDEX, SUBSTRING, STUFF, etc.

xp_regex is an Extended Stored Procedure for SQL Server 2000 written in Managed C++ that lets you use Regular Expressions from T-SQL. In case you’re skeptical about the performance when mixing non-.NET code (in this case SQL Server) and .NET code in the same process, I can tell you that I wrote a completely non-.NET version using the Boost Regex++ Regular Expression library and the speed was actually slightly slower. Plus, the .NET Regular Expression library is pretty much a complete implementation of Perl 5.6's (the de factostandard).

There are two Extended Stored Procedures in the DLL:

xp_regex_split
xp_regex_format

1. XP_REGEX_FORMAT

Syntax:
xp_regex_format @input, @regex, @format_string, @result OUTPUT

@input is the text to parse.
@regex is the regular expression to match.
@format_string is used to format the results
@result is an output parameter that will hold the formatted results

All parameters are either VARCHAR or CHAR of any length. (TEXT might work too. Haven't tried…)

xp_regex_format is used to parse an input string and format the results. Probably the best example to demonstrate is by doing some telephone number parsing. I’ll assume you know the Regular Expression syntax for this documentation.

The regex [^d]*(d{3})[^d]*(d{3}) [^d]*(d{4}) will parse just about any phone- number- like string you throw at it. For instance, this code:
DECLARE @out VARCHAR(50)


EXEC xp_regex_format '(310)555-1212',
'[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})'
, '($1) $2-$3', @out OUTPUT PRINT @out

EXEC xp_regex_format '310.555.1212',
'[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})' , '($1) $2-$3', @out OUTPUT PRINT @out

EXEC xp_regex_format ' 310!555 hey! 1212',
'[^d]*(d{3})[^d]*(d{3})[^d]*
(d{4})','($1) $2-$3', @out OUTPUT PRINT @out

EXEC xp_regex_format ' hello, ( 310 ) 555.1212
is my phone number.
Thank you.', '[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})', '($1) $2-$3', @out OUTPUT PRINT @out

prints out:

(310) 555-1212
(310) 555-1212
(310) 555-1212
(310) 555-1212

(Cool!)

2. XP_REGEX_SPLIT

Syntax:
xp_regex_split @input, @regex, @column_number, @result OUTPUT

@input is the text to parse.
@regex is a regular expression that matches the delimiter
@column_number: We’re basically doing a "text-to-columns" here, so @column_number lets you specify which of the resulting columns should be passed back in the @result parameter
@result is an output parameter that will hold the formatted results

@input, @regex and @result are either VARCHAR or CHAR of any length.
@column_number is an INT. Columns are numbered starting at 1.

This function splits text data on some sort of delimiter (comma, pipe,

whatever). The cool thing about a split using regular expressions is that the delimiter does not have to be as consistent as you would normally expect.

For example, take this line as your source data:

one ,two|three : four

In this case, our delimiter is either a comma, pipe or colon with any number of spaces either before or after (or both). In regex form, that is written: s*[,|:]s*

For example:

DECLARE @out VARCHAR(8000)


EXEC xp_regex_split 'one ,two|three : four',
's*[,|:]s*', 1, @out
OUTPUT PRINT @out

EXEC xp_regex_split 'one ,two|three : four',
's*[,|:]s*', 2, @out
OUTPUT PRINT @out

EXEC xp_regex_split 'one ,two|three : four',
's*[,|:]s*', 3, @out
OUTPUT PRINT @out

EXEC xp_regex_split 'one ,two|three : four',
's*[,|:]s*', 4, @out
OUTPUT PRINT @out

prints out:

one 
two
three 
four

Please note that in the case above, there is no performance penalty for running the same split more than once. xp_regex_split caches both the input string and the regular expression. So calling it multiple times while changing only the @column_number parameter is perfectly fine. The actual split is only done the first time and the rest is pulled from cache.

3. FN_XP_REGEX_SPLIT and FN_XP_REGEX_FORMAT

These are user-defined functions that wrap the stored procedures. This way you can use the function as part of a select list or a where clause:

SELECT master.dbo.fn_regex_format(phone_number,
 '[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})', '$1 $2 $3')
FROM
	customers

This would format every phone number in the "customers" table.

4. Installation

a. Copy xpRegex.dll to your Program FilesMicrosoft SQL ServerMSSQLinn folder.
b. Run the SQL script INSTALL.SQL. This will register the procedures and create the user-defined functions.

5. Misc

I've created a set of Performance Counters (the Performance Object is called "xp_regex"). You can use the Windows 2000 System Monitor to see various statistics about xp_regex while it is running.

Comments/corrections/additions are welcome. Please let me know if you find this useful! Thanks!

Originally appeared at http://www.codeproject.com/useritems/xpRegex.asp.
dan@stamps.com
Stamps.com, Inc.
Postage from Your Printer


Newsletter Archive | Submit Article | Advertising Information | About Us | Contact