By Dan Farino
Contributing Writer
Article Date: 02.20.03
Download
source files/compiled DLL - 58 Kb
Regular Expressions are probably the best tool for text parsing. Using
Regular Expressions greatly simplifies anything for which you would normally use
CHARINDEX, PATINDEX, SUBSTRING, STUFF, etc.
xp_regex is an Extended Stored Procedure for SQL Server 2000
written in Managed C++ that lets you use Regular Expressions from T-SQL. In case
you’re skeptical about the performance when mixing non-.NET code (in this case
SQL Server) and .NET code in the same process, I can tell you that I wrote a completely
non-.NET version using the Boost Regex++ Regular Expression library and the speed
was actually slightly slower. Plus, the .NET Regular Expression library is pretty
much a complete implementation of Perl 5.6's (the de factostandard).
There are two Extended Stored Procedures in the DLL:
xp_regex_split
xp_regex_format
1. XP_REGEX_FORMAT
Syntax:
xp_regex_format @input, @regex, @format_string, @result OUTPUT
@input is the text to parse.
@regex is the regular expression to match.
@format_string is used to format the results
@result is an output parameter that will hold the formatted results
All parameters are either VARCHAR or CHAR of any
length. (TEXT might work too. Haven't tried…)
xp_regex_format is used to parse an input string and format the
results. Probably the best example to demonstrate is by doing some telephone number
parsing. I’ll assume you know the Regular Expression syntax for this documentation.
The regex [^d]*(d{3})[^d]*(d{3})
[^d]*(d{4}) will parse just about any
phone- number- like string you throw at it. For instance, this code:
DECLARE @out VARCHAR(50)
EXEC xp_regex_format '(310)555-1212',
'[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})'
, '($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format '310.555.1212',
'[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})'
, '($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format ' 310!555 hey! 1212',
'[^d]*(d{3})[^d]*(d{3})[^d]*
(d{4})','($1) $2-$3', @out OUTPUT
PRINT @out
EXEC xp_regex_format ' hello, ( 310 ) 555.1212
is my phone number.
Thank you.', '[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})', '($1) $2-$3',
@out OUTPUT
PRINT @out
prints out:
(310) 555-1212
(310) 555-1212
(310) 555-1212
(310) 555-1212
(Cool!)
2. XP_REGEX_SPLIT
Syntax:
xp_regex_split @input, @regex, @column_number, @result OUTPUT
@input is the text to parse.
@regex is a regular expression that matches the delimiter
@column_number: We’re basically doing a "text-to-columns" here, so
@column_number lets you specify which of the resulting columns should be passed
back in the @result parameter
@result is an output parameter that will hold the formatted results
@input, @regex and @result are either
VARCHAR or CHAR of any length.
@column_number is an INT. Columns are numbered starting
at 1.
This function splits text data on some sort of delimiter (comma, pipe,
whatever). The cool thing about a split using regular expressions is that the
delimiter does not have to be as consistent as you would normally expect.
For example, take this line as your source data:
one ,two|three : four
In this case, our delimiter is either a comma, pipe or colon with any number
of spaces either before or after (or both). In regex form, that is written: s*[,|:]s*
For example:
DECLARE @out VARCHAR(8000)
EXEC xp_regex_split 'one ,two|three : four',
's*[,|:]s*', 1, @out
OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four',
's*[,|:]s*', 2, @out
OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four',
's*[,|:]s*', 3, @out
OUTPUT
PRINT @out
EXEC xp_regex_split 'one ,two|three : four',
's*[,|:]s*', 4, @out
OUTPUT
PRINT @out
prints out:
one
two
three
four
Please note that in the case above, there is no performance penalty for running
the same split more than once. xp_regex_split caches both the input
string and the regular expression. So calling it multiple times while changing
only the @column_number parameter is perfectly fine. The actual split
is only done the first time and the rest is pulled from cache.
3. FN_XP_REGEX_SPLIT and FN_XP_REGEX_FORMAT
These are user-defined functions that wrap the stored procedures. This way
you can use the function as part of a select list or a where clause:
SELECT master.dbo.fn_regex_format(phone_number,
'[^d]*(d{3})[^d]*(d{3})[^d]*(d{4})', '$1 $2 $3')
FROM
customers
This would format every phone number in the "customers" table.
4. Installation
a. Copy xpRegex.dll to your Program FilesMicrosoft SQL ServerMSSQLinn
folder.
b. Run the SQL script INSTALL.SQL. This will register the procedures and
create the user-defined functions.
5. Misc
I've created a set of Performance Counters (the Performance Object is called
"xp_regex"). You can use the Windows 2000 System Monitor to see various statistics
about xp_regex while it is running.
Comments/corrections/additions are welcome. Please let me know if you find
this useful! Thanks!
Originally appeared at http://www.codeproject.com/useritems/xpRegex.asp.
dan@stamps.com
Stamps.com, Inc.
Postage from Your Printer