Friday, May 1, 2009

Regular Expressions with The Microsoft .NET Framework

Introduction
Regular expressions have been used in various programming languages and tools for many years. The .NET Base Class Libraries include a namespace and a set of classes for utilizing the power of regular expressions. They are designed to be compatible with Perl 5 regular expressions whenever possible.

In addition, the regexp classes implement some extra functionality, such as named capture groups, right-to-left pattern matching, and expression compilation.
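
Here is a minimal sketch of those three extras. The helper names GetYear and LastNumber and the sample strings are hypothetical, chosen only for illustration; the (?<name>...) syntax, RegexOptions.RightToLeft, and RegexOptions.Compiled are the features being shown.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: reads the year out of a yyyy-mm-dd date via a named group.
    public static string GetYear(string input)
    {
        Match m = Regex.Match(input, @"(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})");
        return m.Success ? m.Groups["year"].Value : null;
    }

    // Hypothetical helper: RightToLeft starts the search at the end of the input,
    // so the first match reported is the last number in the string.
    public static string LastNumber(string input)
    {
        Match m = Regex.Match(input, @"\d+", RegexOptions.RightToLeft);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        Console.WriteLine(GetYear("Posted on 2009-05-01"));  // 2009
        Console.WriteLine(LastNumber("10 cats, 20 dogs"));   // 20

        // Expression compilation: the pattern is compiled to IL when constructed.
        Regex compiled = new Regex(@"\d+", RegexOptions.Compiled);
        Console.WriteLine(compiled.IsMatch("abc 7"));         // True
    }
}
```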

In this article, I'll provide a quick overview of the classes and methods of the System.Text.RegularExpressions namespace, some examples of matching and replacing strings, a more detailed walk-through of a grouping structure, and finally, a set of cookbook expressions for use in your own applications.

The RegularExpressions Namespace
The regexp classes live in the System.Text.RegularExpressions namespace. In the released .NET Framework they ship in System.dll, which the C# compiler normally references by default, so no extra /r: switch is needed. For example: csc foo.cs will build the foo.exe assembly once foo.cs contains a using System.Text.RegularExpressions; directive. The command csc /r:System.Text.RegularExpressions.dll foo.cs applies only where the classes are packaged in a separate assembly of that name.

The core of the namespace is small: six classes and one delegate definition. These are:

Capture: Contains the result of a single subexpression capture

CaptureCollection: A sequence of Capture objects

Group: The result of a single capturing group, inherits from Capture

Match: The result of a single expression match, inherits from Group

MatchCollection: A sequence of Match objects

MatchEvaluator: A delegate called for each match during replacement operations

Regex: An instance of a parsed regular expression, immutable once constructed
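
To show how these types fit together, here is a small sketch. The pattern, the sample text, and the Shout helper are assumptions made up for this example. A repeated group records one Capture per repetition, while Group.Value only reports the last one.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical MatchEvaluator target: upper-cases each matched word.
    public static string Shout(Match m)
    {
        return m.Value.ToUpperInvariant();
    }

    static void Main()
    {
        // Group 1 repeats, so it collects several captures within one Match.
        Regex regex = new Regex(@"(\w+\s?)+");
        Match match = regex.Match("one two three");
        Group group = match.Groups[1];

        // Group.Value holds only the last capture of the group.
        Console.WriteLine(group.Value);
        foreach (Capture capture in group.Captures)
        {
            Console.WriteLine("[{0}] at index {1}", capture.Value, capture.Index);
        }

        // MatchCollection: every non-overlapping match in the input.
        MatchCollection words = Regex.Matches("one two three", @"\w+");
        Console.WriteLine(words.Count);  // 3

        // MatchEvaluator: computes the replacement text for each match.
        Console.WriteLine(Regex.Replace("one two", @"\w+", new MatchEvaluator(Shout)));  // ONE TWO
    }
}
```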

The Regex class also contains several static methods:

Escape: Escapes regex metacharacters within a string

IsMatch: Returns true if the supplied regular expression matches anywhere within the string

Match: Returns the first match as a Match instance

Matches: Returns all matches as a MatchCollection

Replace: Replaces each match of the regular expression with a replacement string

Split: Returns an array of strings, splitting the input at each match of the expression

Unescape: Converts any escaped characters within a string back to their literal form
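
As a quick illustration, the sketch below calls each of these static methods once. The Results helper and all of the input strings are hypothetical; only the Regex methods themselves come from the list above.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: runs each static method once and collects the results as strings.
    public static string[] Results()
    {
        string escaped = Regex.Escape("1+1=2");
        return new string[]
        {
            escaped,                                               // 1\+1=2
            Regex.Unescape(escaped),                               // 1+1=2
            Regex.IsMatch("abc123", @"\d").ToString(),             // True
            Regex.Match("abc123", @"\d+").Value,                   // 123
            Regex.Matches("a1b22c333", @"\d+").Count.ToString(),   // 3
            Regex.Replace("a1b22", @"\d+", "#"),                   // a#b#
            string.Join("|", Regex.Split("a, b ,c", @"\s*,\s*"))   // a|b|c
        };
    }

    static void Main()
    {
        foreach (string result in Results())
        {
            Console.WriteLine(result);
        }
    }
}
```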

Regular expressions are used to search for specified patterns in a source string.

Examples:

Pattern #1
Regex objNotNaturalPattern = new Regex("[^0-9]");

Pattern #2
Regex objNaturalPattern = new Regex("0*[1-9][0-9]*");

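Pattern #1 matches any single character that is not a digit from 0 to 9. The square brackets [] define a character class, which can hold ranges such as 0-9, a-z, or A-Z, and a ^ as the first character inside the brackets negates the class. IsMatch therefore returns true for any string that contains at least one non-digit character.

E.g. abc returns true.

123 returns false.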

Pattern #2 matches natural numbers, that is, whole numbers greater than 0. The leading 0* says that the number can be prefixed with any number of zeros, or none. The next part, [1-9], says that it must contain at least one digit from 1 to 9, and the final [0-9]* allows any number of digits from 0 to 9 to follow. Because the pattern is not anchored, IsMatch succeeds when a natural number appears anywhere in the string (abc7 also returns true); wrap the pattern in ^ and $ to validate the whole string.

E.g. 0007 returns true whereas 00 returns false.
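
The two patterns can be tried out with IsMatch as sketched here. The third, anchored pattern and the IsNaturalNumber helper are my own additions for illustration and are not part of the original two patterns.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static readonly Regex objNotNaturalPattern = new Regex("[^0-9]");
    static readonly Regex objNaturalPattern = new Regex("0*[1-9][0-9]*");

    // Assumption: an anchored variant that must match the whole string.
    static readonly Regex wholeNaturalPattern = new Regex("^0*[1-9][0-9]*$");

    // Hypothetical helper: true only if the entire string is a natural number.
    public static bool IsNaturalNumber(string value)
    {
        return wholeNaturalPattern.IsMatch(value);
    }

    static void Main()
    {
        Console.WriteLine(objNotNaturalPattern.IsMatch("abc"));  // True
        Console.WriteLine(objNotNaturalPattern.IsMatch("123"));  // False
        Console.WriteLine(objNaturalPattern.IsMatch("0007"));    // True
        Console.WriteLine(objNaturalPattern.IsMatch("00"));      // False

        // The unanchored pattern accepts a number found anywhere in the string.
        Console.WriteLine(objNaturalPattern.IsMatch("abc7"));    // True
        Console.WriteLine(IsNaturalNumber("abc7"));              // False
        Console.WriteLine(IsNaturalNumber("0007"));              // True
    }
}
```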

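Basic things to be understood in RegEx:

"*" matches zero or more repetitions of the preceding element.
"?" matches zero or one occurrence of the preceding element (a single arbitrary character is matched by ".").
"^" negates a character class when it is the first character inside "[]"; outside brackets it anchors the match to the start of the input.
"[]" defines a character class, including ranges such as 0-9 or a-z.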

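The basic metacharacters can be checked one at a time. This is only a sketch: the Check helper and the sample words are hypothetical, and each pattern is anchored with ^ and $ so that the whole input has to match.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: thin wrapper so each line below reads as input, then pattern.
    public static bool Check(string input, string pattern)
    {
        return Regex.IsMatch(input, pattern);
    }

    static void Main()
    {
        // "*" : zero or more of the preceding element.
        Console.WriteLine(Check("ac", "^ab*c$"));         // True
        Console.WriteLine(Check("abbbc", "^ab*c$"));      // True

        // "?" : zero or one of the preceding element.
        Console.WriteLine(Check("color", "^colou?r$"));   // True
        Console.WriteLine(Check("colouur", "^colou?r$")); // False

        // "." : exactly one arbitrary character.
        Console.WriteLine(Check("cat", "^c.t$"));         // True

        // "[]" : a character class with ranges; "^" inside negates it.
        Console.WriteLine(Check("x7", "^[a-z][0-9]$"));   // True
        Console.WriteLine(Check("77", "^[^0-9]"));        // False
    }
}
```
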
1. Getting numbers in string
First, here we look at how you can get all the numbers in a string and then parse them into integers for easier use in your C# program. The important part of the example is that it splits on every run of non-digit characters in the string, then loops through the resulting strings and uses int.TryParse. The last piece of the split is an empty string (the sentence ends with a non-digit), which int.TryParse simply rejects.

=== Program that uses Regex.Split (C#) ===
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // String containing numbers.
        string sentence = "10 cats, 20 dogs, 40 fish and 1 programmer.";

        // Get all digit sequences as strings.
        string[] digits = Regex.Split(sentence, @"\D+");

        // Now we have each number string.
        foreach (string value in digits)
        {
            // Parse the value to get the number.
            int number;
            if (int.TryParse(value, out number))
            {
                Console.WriteLine(value);
            }
        }
    }
}

=== Output of the program ===

10
20
40
1
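
An alternative worth knowing is to match the digits directly rather than split on everything else. The sketch below is my own variation, not the program above: the ExtractNumbers helper is hypothetical, and it uses Regex.Matches with \d+ so that no empty strings are produced.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: returns every run of digits in the text as an int.
    public static List<int> ExtractNumbers(string text)
    {
        var numbers = new List<int>();
        foreach (Match match in Regex.Matches(text, @"\d+"))
        {
            // TryParse still guards against digit runs too long for an int.
            int number;
            if (int.TryParse(match.Value, out number))
            {
                numbers.Add(number);
            }
        }
        return numbers;
    }

    static void Main()
    {
        string sentence = "10 cats, 20 dogs, 40 fish and 1 programmer.";
        foreach (int number in ExtractNumbers(sentence))
        {
            Console.WriteLine(number);  // 10, 20, 40, 1 on separate lines
        }
    }
}
```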

2. Splitting on multiple whitespaces
Here we see how you can extract all substrings in your string that are separated by whitespace characters. You could also use string Split, but this version is simpler and can also be extended more easily. The example gets all operands and operators from an equation string.

=== Program that tokenizes (C#) ===

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // The equation.
        string operation = "3 * 5 = 15";

        // Split it on whitespace sequences.
        string[] operands = Regex.Split(operation, @"\s+");

        // Now we have each token.
        foreach (string operand in operands)
        {
            Console.WriteLine(operand);
        }
    }
}

=== Output of the program ===

3
*
5
=
15
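
One detail to watch: when the input starts or ends with whitespace, Regex.Split returns empty strings at the edges. Here is a sketch of two ways around that, using a hypothetical Tokenize helper and a padded sample string of my own.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: trims first so no empty tokens appear at the edges.
    public static string[] Tokenize(string text)
    {
        return Regex.Split(text.Trim(), @"\s+");
    }

    static void Main()
    {
        string padded = "  3 *  5 = 15 ";

        // Untrimmed input: an empty string at each end, so 7 pieces.
        Console.WriteLine(Regex.Split(padded, @"\s+").Length);  // 7

        // Trimmed input: just the 5 tokens.
        Console.WriteLine(Tokenize(padded).Length);             // 5

        // string.Split can do the same for a fixed set of separator characters.
        string[] viaString = padded.Split(new char[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);
        Console.WriteLine(viaString.Length);                    // 5
    }
}
```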

3. Getting all uppercase words

Here we look at a method that gets all the words that have an initial uppercase letter in a string. The Regex.Split call used actually just gets all the words, and the loop checks the first letter for its case. In most programs, it is useful to combine regular expressions and manual looping and string operations. Programs are not art projects.

=== Program that collects uppercase words (C#) ===

using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // String containing uppercased words.
        string sentence = "Bob and Michelle are from Indiana.";

        // Get all words.
        string[] uppercaseWords = Regex.Split(sentence, @"\W");

        // Get all uppercased words.
        var list = new List<string>();
        foreach (string value in uppercaseWords)
        {
            // Check the word: skip empty pieces, keep those starting with an uppercase letter.
            if (!string.IsNullOrEmpty(value) &&
                char.IsUpper(value[0]))
            {
                list.Add(value);
            }
        }

        // Write all proper nouns.
        foreach (var value in list)
        {
            Console.WriteLine(value);
        }
    }
}

=== Output of the program ===

Bob
Michelle
Indiana
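
For comparison, the case check can also be moved into the pattern itself. This is a sketch of that alternative, not the approach used above: the CapitalizedWords helper is hypothetical, and the pattern \b\p{Lu}\w* matches a word boundary, one uppercase letter, then the rest of the word.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class Program
{
    // Hypothetical helper: returns the words that begin with an uppercase letter.
    public static List<string> CapitalizedWords(string text)
    {
        var words = new List<string>();
        foreach (Match match in Regex.Matches(text, @"\b\p{Lu}\w*"))
        {
            words.Add(match.Value);
        }
        return words;
    }

    static void Main()
    {
        foreach (string word in CapitalizedWords("Bob and Michelle are from Indiana."))
        {
            Console.WriteLine(word);  // Bob, Michelle, Indiana on separate lines
        }
    }
}
```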

4. Using class-level compiled Regex
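Here we see how you can use a compiled regular expression and store it at the class level. Two things are new here. First, the Regex is stored as a static field, meaning it can be reused throughout the application without recreating it. Second, it is constructed with RegexOptions.Compiled, which compiles the pattern to IL: construction costs more, but matching is usually faster when the same expression runs many times.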

=== Program that uses static compiled Regex (C#) ===

using System;
using System.Text.RegularExpressions;

class Program
{
    static Regex _wordRegex = new Regex(@"\W+", RegexOptions.Compiled);

    static void Main()
    {
        string s = "This is a simple /string/ for Regex.";
        string[] c = _wordRegex.Split(s);
        foreach (string m in c)
        {
            Console.WriteLine(m);
        }
    }
}

=== Output of the program ===

This
is
a
simple
string
for
Regex

5. Using instance Regex with Split
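Here we see an approach that skips the compilation step of the previous example. It creates an interpreted expression with new Regex and stores it as a method-level instance. It produces the same output. An interpreted Regex is cheaper to construct than a compiled one, so it can come out ahead when the expression is used only a few times, while the compiled static field tends to win when the same expression runs many times; measure with your own patterns and inputs.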

=== Program that uses instance Regex (C#) ===

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s = "This is a simple /string/ for Regex.";
        Regex r = new Regex(@"\W+");
        string[] c = r.Split(s);
        foreach (string m in c)
        {
            Console.WriteLine(m);
        }
    }
}

=== Output of the program ===

This
is
a
simple
string
for
Regex

6. Using static Regex.Split
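Here we look at what is often the slowest of the three approaches in this document: the static Regex.Split method in System.Text.RegularExpressions. Each call has to find the pattern in an internal cache of recently used expressions, or parse it again on a cache miss. In these last three examples I use Split, but other methods such as Matches, Match, and Replace have similar characteristics.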

=== Program that uses Regex.Split (C#) ===

using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        string s = "This is a simple /string/ for Regex.";
        string[] c = Regex.Split(s, @"\W+");
        foreach (string m in c)
        {
            Console.WriteLine(m);
        }
    }
}

=== Output of the program ===

This
is
a
simple
string
for
Regex
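
Performance claims like these are best checked on your own machine. Below is a rough timing sketch under stated assumptions: the TimeMs helper, the method names, and the iteration count are all hypothetical, and the numbers will vary with the runtime version, the pattern, and the input. It is not a rigorous benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class Program
{
    const string Input = "This is a simple /string/ for Regex.";
    const string Pattern = @"\W+";

    // Example 4: class-level compiled Regex, built once.
    static readonly Regex CompiledRegex = new Regex(Pattern, RegexOptions.Compiled);

    public static string[] SplitCompiled() { return CompiledRegex.Split(Input); }

    // Example 5: a new interpreted Regex instance on every call.
    public static string[] SplitInstance() { return new Regex(Pattern).Split(Input); }

    // Example 6: the static Regex.Split method.
    public static string[] SplitStatic() { return Regex.Split(Input, Pattern); }

    // Hypothetical timing helper: one warm-up call, then a timed loop.
    static long TimeMs(Func<string[]> split, int iterations)
    {
        split();
        Stopwatch watch = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            split();
        }
        watch.Stop();
        return watch.ElapsedMilliseconds;
    }

    static void Main()
    {
        const int iterations = 100000;  // arbitrary; raise it for steadier numbers
        Console.WriteLine("static compiled field: {0} ms", TimeMs(SplitCompiled, iterations));
        Console.WriteLine("instance per call:     {0} ms", TimeMs(SplitInstance, iterations));
        Console.WriteLine("static Regex.Split:    {0} ms", TimeMs(SplitStatic, iterations));
    }
}
```

All three methods return the same eight strings for this input; the eighth is an empty string, because the sentence ends with a non-word character.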