Introduction to Regular Expressions

Beginning Regular Expressions in C#, Java, and Perl

Let me begin by saying that I'm a beginner, so my methods of using regex are rather primitive, and there's more than one way to do it! (Perl philosophy!)

This tutorial is mainly in C#, with some coverage of Java and Perl syntax. Since it is a regex tutorial, translating between the languages should be easy. The prerequisite is some idea of how the languages work, because the tutorial is long enough without explaining everything each language offers.

(You may ask yourself, why don't I just parse the string myself? Well, one reason is that it's easy to make a mistake; another is that regular expressions are usually compiled and optimized, so they are often faster.)

So let's start with the definition:
A regular expression is a pattern that describes a set of strings (formally, a regular language). It describes what to match in an ordinary string, and serves as a filter to find what you're looking for.

Regular expressions are somewhat standardized: the pattern syntax is largely consistent across languages, even though the syntax for invoking them differs.

The two main uses of regular expressions are matching and replacing. Matching determines whether the pattern occurs in the string and, if so, finds it. Replacing rewrites the parts of the string that match the pattern according to a replacement string.

Let's go over matching first:

Basic Matching
Let's begin with a Hello World to welcome ourselves to the world of regex! (short for regular expression - very commonly used.)

The following is the full C# code: (I will dissect this in a minute)

using System.Text.RegularExpressions;

class RegEx{
    public static void Main(){
        string data = "Hello World!";
        if ( Regex.IsMatch( data, "Hello" ) ){
            System.Console.WriteLine( "Hello Found, joy!" );
        }
        data = "Goodbye World!";
        if ( Regex.IsMatch( data, "Hello" ) ){
        // Sadly, this will never be reached..
            System.Console.WriteLine( "Hello Found, joy!" );
        }
    }
}

This is, of course, very basic regex. IsMatch takes two strings: the first is the input string, and the second is the pattern. It returns true if the pattern is found, and false otherwise.

In this case, data contains the string "Hello World!", and IsMatch finds the pattern "Hello" in "Hello World!", so it returns true. Later on, data is changed to the string "Goodbye World!", and when you evaluate IsMatch this time, "Hello" is not found, so it returns false.

Java, with its recent 1.4 upgrade, finally has regex support! So now we can actually use it..
Let's look at the Java version now:

import java.util.regex.*;
public class RegEx{
    public static void main( String args[] ){
        String data = "Hello World!";
        // This is to compile the pattern - this will increase performance
        Pattern pat = Pattern.compile("Hello");

        Matcher m = pat.matcher( data );
        if ( m.find() ){
            System.out.println("Hello Found!  joy!");
        }
        data = "Goodbye, cruel cruel world!";
        m = pat.matcher( data );
        if ( m.find() ){
            // Similarly, this will never be reached either..
            System.out.println("Hello Found!  joy!");
        }
    }
}

If you familiarized yourself with the C# example, then everything should look very similar, with some notable differences. Instead of passing strings and letting the library do the work for you, you have to explicitly compile the pattern and then apply it to whatever you need to match. This is achieved in the example by:

Pattern pat = Pattern.compile("Hello");

Of course, pat is just the variable name, so you can use any valid identifier. The string passed to the compile method is the pattern that will be compiled. It is "Hello" in this case.

The matcher is the "engine that performs match operations on a character sequence by interpreting a Pattern" (Java 1.4 API).
From the example:

Matcher m = pat.matcher( data );
if ( m.find() ){
    System.out.println("Hello Found!  joy!");
}

m is just the variable name, and the matcher is created with the string to be matched as its argument - data in this case. This is very similar to the C# example, so I will not go any further into this, except for one thing:
find() attempts to find the next subsequence of the string that fits the pattern. What this means is that every time you use it, it will advance its marker. A picture of this would be:

String data = "Hello World!";
Pattern pat = Pattern.compile("Hello");
Matcher m = pat.matcher( data );
// The first time, it should print.
if ( m.find() ){
    System.out.println("Hello Found!  joy!");
}
// However, the second time around, it will not be true because 
// it doesn't search the entire string again, instead it will pick 
// up where it left off..
if ( m.find() ){
    // Won't be reached.
    System.out.println("Hello Found!  joy!");
}

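To get a check that can be repeated on the same Matcher, you can use lookingAt(). It always starts again from the beginning of the string, but note that it only succeeds if the pattern matches at the very start; it does not search further into the string. That is enough here because "Hello" is at the start of the data. (For the closest equivalent of C#'s IsMatch, which searches anywhere in the string, call m.reset() before find(), or create a new Matcher.) So you would have something like this again:

String data = "Hello World!";
Pattern pat = Pattern.compile("Hello");
Matcher m = pat.matcher( data );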
if ( m.lookingAt() ){
    System.out.println("Hello Found!  joy!");
}
if ( m.lookingAt() ){
    // Will be reached this time, double the joy!
    System.out.println("Hello Found!  joy!");
}
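
To make the difference between these Matcher methods concrete, here is a minimal sketch. The class name MatcherModes and the sample strings are made up for illustration; find(), lookingAt(), matches() and reset() are standard java.util.regex.Matcher methods. The point to notice is that lookingAt() only succeeds when the pattern matches at the very start of the input, so find() on a fresh (or reset) matcher is the closer counterpart to C#'s IsMatch.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class MatcherModes {
    static final Pattern HELLO = Pattern.compile("Hello");

    // Closest to C#'s Regex.IsMatch: search anywhere in the input.
    static boolean containsHello(String input) {
        return HELLO.matcher(input).find();
    }

    // True only if the input *starts* with a match.
    static boolean startsWithHello(String input) {
        return HELLO.matcher(input).lookingAt();
    }

    // True only if the *whole* input is a match.
    static boolean isExactlyHello(String input) {
        return HELLO.matcher(input).matches();
    }

    public static void main(String[] args) {
        String data = "Oh, Hello World!";
        System.out.println(containsHello(data));   // true
        System.out.println(startsWithHello(data)); // false
        System.out.println(isExactlyHello(data));  // false

        // reset() rewinds a matcher so find() searches from the start again.
        Matcher m = HELLO.matcher(data);
        System.out.println(m.find()); // true
        System.out.println(m.find()); // false, there is only one "Hello"
        m.reset();
        System.out.println(m.find()); // true again
    }
}
```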

Last, but certainly not least, is the Perl implementation.

#!/usr/bin/perl

$string = "Hello World!";
print "Hello found!  joy!\n" if ( $string =~ /Hello/ );
$string = "Goodbye World!";
#Will not be true.
print "Hello found!  joy!\n" if ( $string =~ /Hello/ );

In Perl, =~ is an operator that matches the string on its left against the pattern in slashes. The code is pretty trivial.

To negate the boolean expression in Java and C#, you of course put ! in front of the call. In Perl, you can use the !~ operator, which is true only if the pattern is not in the string.
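
As a small sketch of the Java side of this, the helper below negates the result of find(). The class and method names (NegatedMatch, lacksHello) are hypothetical and only serve the example.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class NegatedMatch {
    static final Pattern HELLO = Pattern.compile("Hello");

    // Java has no !~ operator; negate the boolean returned by find().
    static boolean lacksHello(String input) {
        return !HELLO.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(lacksHello("Goodbye World!")); // true
        System.out.println(lacksHello("Hello World!"));   // false
    }
}
```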

Pattern Matching

The Java API documentation for the Pattern class (see Links at the end) contains a summary of regular expression constructs. Let's put some of them to use.

Let's say we're trying to match social security numbers. The format for a social security number is xxx-xx-xxxx.

Let's start with the C# code:

using System.Text.RegularExpressions;

class RegEx{
    public static void Main(){
        string[] id = { "123-45-6789", 
                "1234-5-6789", 
                "547-12-6346", 
                "54-12-5623",
                "3513-15134",
                "608-12-61347",
                "8608-12-6134",
                "608-12-6134" };
    
        for ( int i = 0; i < id.Length; i++ )
            if ( Regex.IsMatch( id[i], @"\d{3}-\d{2}-\d{4}" ) )
                System.Console.WriteLine( id[i] );
    }
}

Most of this should look familiar. We'll take a look at the regex line:

if ( Regex.IsMatch( id[i], @"\d{3}-\d{2}-\d{4}" ) )

The only part that is new is the @"\d{3}-\d{2}-\d{4}"
Well, if you look it up, \d means a digit from 0 to 9. {3} means to match the previous element (the \d in this case) exactly three times, no more, no less. This is then followed by a dash, two more digits, another dash, and finally four more digits.

So what's the flaw in this? It doesn't look at the entire string, so the above sample code will match both "608-12-61347" and "8608-12-6134", since a substring of each fits "\d{3}-\d{2}-\d{4}". This is, of course, not ideal, so we'll add a \b at each end. \b means a word boundary, and it works in this case because a boundary cannot occur between two digits, so the match fails wherever extra digits are attached.
(Note that it will still match a number followed by a space and more text, such as "608-12-6134 1", but if you use Match, then you can filter out the space and the one, while keeping the rest of the number. This will be shown later.) So basically this would be the pattern:

@"\b\d{3}-\d{2}-\d{4}\b"

You should be able to write the equivalent in Java and Perl following the same concepts easily: (In Java and Perl respectively)

import java.util.regex.*;
public class RegEx{
    public static void main( String args[] ){
        String[] id = { "123-45-6789", 
                "1234-5-6789", 
                "547-12-6346", 
                "54-12-5623",
                "3513-15134",
                "608-12-61347",
                "8608-12-6134",
                "608-12-6134" };
            Pattern pat = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
        for ( int i = 0; i < id.length; i++ )
            // I've condensed the statement a bit, but you should get the idea..
            if ( pat.matcher(id[i]).find() )
                System.out.println( id[i] );
    }
}

#!/usr/bin/perl

@id = (     "123-45-6789", 
                "1234-5-6789", 
                "547-12-6346", 
                "54-12-5623",
                "3513-15134",
                "608-12-61347",
                "8608-12-6134",
                "608-12-6134" );

foreach $i ( @id ){
    print $i, "\n" if $i =~ /\b\d{3}-\d{2}-\d{4}\b/;
}

The Perl version is trivial, but one may wonder why the Java version is slightly different, with tons of \'s. This is because \ is the escape character in Java string literals - to get a literal backslash into the pattern (and trust me, usually, you would want to) you need to write two of them, the first escaping the second. C# also needs this, except that the @ prefix makes the compiler interpret the string literally (a verbatim string), eliminating the need for the extra backslash.
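
To see what the doubled backslashes actually produce, here is a minimal sketch. The class name BackslashDemo and its members are made up for this example; Pattern.pattern() is the standard method that returns the regex text the Pattern was compiled from.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class BackslashDemo {
    // In Java source, "\\d" is a two-character string: a backslash and 'd'.
    static final String DIGIT = "\\d";

    static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    static boolean containsSsn(String input) {
        return SSN.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(DIGIT.length()); // 2
        // The regex engine sees single backslashes:
        System.out.println(SSN.pattern());  // \b\d{3}-\d{2}-\d{4}\b

        System.out.println(containsSsn("608-12-6134"));  // true
        System.out.println(containsSsn("608-12-61347")); // false
        System.out.println(containsSsn("8608-12-6134")); // false
    }
}
```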

More Matching
Now you should be able, with practice and experience, to check easily whether a pattern occurs in a string! But what if you want to see what the match was? Well, easy, we can use the Match class! Let's revisit our Hello World regex:

using System.Text.RegularExpressions;

class RegEx{
    public static void Main(){
        string data = "Hello World!";
        if ( Regex.IsMatch( data, "Hello" ) ){
            //Instead, we will use Match this time
            Match m = Regex.Match( data, "Hello" );
            // Print it out.
            System.Console.WriteLine( m.ToString() );
        }
    }
}

And the output for this would be just simply "Hello".

So to modify our social security example so that it gets rid of the cases we've mentioned, the code will be as such instead:

using System.Text.RegularExpressions;

class RegEx{
    public static void Main(){
        string[] id = { "123-45-6789", 
                "1234-5-6789", 
                "547-12-6346", 
                "54-12-5623",
                "3513-15134",
                "608-12-61347",
                "8608-12-6134",
                "608-12-6134" };
    
        for ( int i = 0; i < id.Length; i++ )
            if ( Regex.IsMatch( id[i], @"\b\d{3}-\d{2}-\d{4}\b" ) )
                System.Console.WriteLine( Regex.Match( id[i], @"\b\d{3}-\d{2}-\d{4}\b" ) );
    }
}

Now, if you're given the case "608-12-6134 1", it will match, and output "608-12-6134". Rejecting such input entirely, if you don't necessarily want to accept it, is left as an exercise for the reader.

Of course, sometimes you want to match more than once in the same string, so what do you do? Well, a Match object can be viewed loosely as a collection, so you can of course do it this way:

using System.Text.RegularExpressions;

class RegEx{
    public static void Main(){
        string str = "HelloBelloYelloHallo";
        string strMatch = @"\wello";

        if ( Regex.IsMatch( str, strMatch ) )
            for ( Match m = Regex.Match( str, strMatch ); m.Success; m = m.NextMatch() )
                System.Console.WriteLine( m );

    }
}

Simple enough, just one for loop. (Note: In fact, the if statement isn't needed, because Regex.Match returns Match.Empty, whose Success property is false, if nothing matches, but I wanted to be a little more explicit.)

Here's the Java version:

import java.util.regex.*;
public class RegEx{
    public static void main( String args[] ){
        String s = "HelloBelloWelloHalloYello";
        Pattern pat = Pattern.compile("\\wello");
            Matcher m = pat.matcher( s );
            while ( m.find() ){
                System.out.println( m.group() );
            }
    }
}

As briefly mentioned previously, since find() starts off where the last match ended, you can easily use a while loop to call it until it returns false. group() returns the input subsequence matched by the previous match. (It already returns a String, so no toString() call is needed.)
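
The same find() loop can be wrapped in a small helper that collects every match into a list. This is only a sketch: the class AllMatches and the method findAll are hypothetical names, and the code assumes a Java version with generics (newer than the 1.4 release discussed in this tutorial).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class AllMatches {
    // Collects every non-overlapping match of regex in input, in order.
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group()); // group() is the text of the current match
        }
        return found;
    }

    public static void main(String[] args) {
        // "Hallo" is skipped because it has no 'e' after the first letter.
        System.out.println(findAll("\\wello", "HelloBelloWelloHalloYello"));
        // prints [Hello, Bello, Wello, Yello]
    }
}
```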

Finally, the Perl version:

#!/usr/bin/perl

$str = "HelloBelloWelloHalloYello";
print $1, "\n" while ( $str =~ /(\wello)/g );

The only thing that changed is that there is a /g modifier. When you use it, the regular expression engine will keep track of where it finished, and start right where you left off the next time around. Another thing that may be unfamiliar is the $1. What is it? It's capturing, of course, which brings us to the next topic:

Capturing and Groups
Now, what if we want part of the string? For example, you want to parse a series of money values in the following format:
$x.yz
Where x can be any non-negative whole number, and y and z are both single digits (to make up the cents.)

Of course, we'll write something like this to match it:

\$\d+\.\d\d

(The above matches a dollar sign, then one or more digits, then a period, then two more digits. This does not take negative dollar values into account, which we will ignore for the sake of simplicity.)

And the following code:

using System.Text.RegularExpressions;

class Reg{
    public static void Main(){
        // Should match everything except the last two.
        string str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
        string strMatch = @"\$\d+\.\d\d";
    
        for ( Match m = Regex.Match( str, strMatch ); m.Success; m = m.NextMatch() ){
            System.Console.WriteLine( m );
        }
    }    
}

(All this should be second nature hopefully - and do note that we need to escape the $ and the . because they are special characters.)
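
To show why the escapes matter, here is a minimal Java sketch comparing escaped and unescaped versions of the pattern. The class EscapeDemo and the helper found are made-up names for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class EscapeDemo {
    static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Escaped: \$ is a literal dollar sign, \. is a literal period.
        System.out.println(found("\\$\\d+\\.\\d\\d", "$1.57")); // true

        // An unescaped . matches any character, so "1x57" slips through.
        System.out.println(found("\\d+.\\d\\d", "1x57"));   // true
        System.out.println(found("\\d+\\.\\d\\d", "1x57")); // false

        // An unescaped $ is an end-of-input anchor, so digits can never follow it.
        System.out.println(found("$\\d+\\.\\d\\d", "$1.57")); // false
    }
}
```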

Of course, all this is dandy and all, but suppose we want to do something to it, for example, add 5 dollars, then we need to extract this. The way we would do this is using Capture and Groups.

"Capturing groups are numbered by counting their opening parentheses from left to right." - Java 1.4 API
So if we want to capture the dollars (and the cents, as an example), all we have to do is enclose them in parentheses, and we have the resulting pattern:

\$(\d+)\.(\d\d)

and how we would write the code:

using System.Text.RegularExpressions;

class Reg{
    public static void Main(){
        // Should match everything except the last two.
        string str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
        string strMatch = @"\$(\d+)\.(\d\d)";
    
        for ( Match m = Regex.Match( str, strMatch ); m.Success; m = m.NextMatch() ){
            GroupCollection gc = m.Groups;

            System.Console.WriteLine( "The number of captures: " + gc.Count );
            // Group 0 is the entire matched string itself
            // while Group 1 is the first group to be captured.
            for ( int i = 0; i < gc.Count; i++ ){
                Group g = gc[i];
                System.Console.WriteLine( g.Value );
            }
        }
    }
}

So basically, we use m.Groups to get a GroupCollection object of the groups captured. Then we just iterate through them, printing out the value of each. The above is just a way of getting all the captures. To solve our problem of adding 5 dollars, we would do the following:

using System.Text.RegularExpressions;

class Reg{
    public static void Main(){
        // Should match everything except the last two.
        string str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
        string strMatch = @"\$(\d+)\.(\d\d)";
    
        for ( Match m = Regex.Match( str, strMatch ); m.Success; m = m.NextMatch() ){
            GroupCollection gc = m.Groups;

            System.Console.WriteLine( "${0}.{1}", int.Parse(gc[1].Value) + 5, gc[2].Value );
        }
    }
}

As we can see above, we can directly access the values of GroupCollection as a Group object using array notations.

If you understand the concepts so far, the Java implementation is then pretty trivial. You can probably figure them out from the code. I will just show the implementation of both C# programs here:

import java.util.regex.*;
public class RegEx{
    public static void main( String args[] ){
        String str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
        Pattern strMatch = Pattern.compile( "\\$(\\d+)\\.(\\d\\d)" );
            Matcher m = strMatch.matcher( str );
            while ( m.find() ){
                System.out.println("The number of captures: " + ( m.groupCount() + 1 ) );
                for ( int i = 0; i <= m.groupCount(); i++ )
                    System.out.println( m.group(i) );
            }
    }
}

And to add 5 dollars:

import java.util.regex.*;
public class RegEx{
    public static void main( String args[] ){
        String str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
        Pattern strMatch = Pattern.compile( "\\$(\\d+)\\.(\\d\\d)" );
            Matcher m = strMatch.matcher( str );
            while ( m.find() ){
                System.out.println( "$" + ( Integer.parseInt( m.group(1) ) + 5 ) 
                   + "." + m.group(2) );
            }
    }
}

The code looks very similar, with group( int ) taking the group number as the argument. This will throw an exception if the argument is out of bounds. A very notable difference is that groupCount() returns the number of capturing groups, not including the whole match (group 0). Therefore, in the above example, groupCount() will return 2. (Which is why I added one in the first statement, and why the for loop now uses less than or equal '<=' rather than the strictly less than '<' that appeared in C#.)
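
Two details here are easy to trip over, so the sketch below isolates them. The class GroupCountDemo and its methods are hypothetical names. It shows that the addition needs its own parentheses (otherwise Java concatenates the digits as text), and that asking for a group number beyond groupCount() throws IndexOutOfBoundsException.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class GroupCountDemo {
    static final Pattern MONEY = Pattern.compile("\\$(\\d+)\\.(\\d\\d)");

    static String describe(String input) {
        Matcher m = MONEY.matcher(input);
        if (!m.find()) {
            return "no match";
        }
        // The inner parentheses force numeric addition before concatenation;
        // without them the result would read "...: 21" instead of "...: 3".
        return "groups incl. whole match: " + (m.groupCount() + 1)
             + ", dollars=" + m.group(1) + ", cents=" + m.group(2);
    }

    // Returns true if asking for a non-existent group 3 throws.
    static boolean outOfRangeThrows(String input) {
        Matcher m = MONEY.matcher(input);
        if (!m.find()) {
            return false;
        }
        try {
            m.group(3);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println(describe("$316.15"));
        // prints groups incl. whole match: 3, dollars=316, cents=15
        System.out.println(outOfRangeThrows("$316.15")); // true
    }
}
```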

The Perl versions are very simple if you follow it up to here, and have an understanding of Perl:

#!/usr/bin/perl

$str = "\$1.57 \$316.15 \$19.30 \$0.30 \$0.00 \$41.10 \$5.1 \$.5";
print $1, " ", $2, "\n" while ( $str =~ /\$(\d+)\.(\d\d)/g );

In Perl, $ is a sigil denoting scalar variables, so we have to escape it, even in a double-quoted string. Also, some variable names are reserved: you can refer to the captured groups in numerical order with a dollar sign followed by the group number ($1, $2, $3...). Otherwise, it is very easy. Likewise, to add 5 dollars:

#!/usr/bin/perl

$str = "\$1.57 \$316.15 \$19.30 \$0.30 \$0.00 \$41.10 \$5.1 \$.5";
print $1 + 5, " ", $2, "\n" while ( $str =~ /\$(\d+)\.(\d\d)/g );

There. Literally just added 4 characters.

Replacement
Now that we have a decently strong foundation, we can move on! Matching is just half the fun, as now we can play around with Strings!

It is very straightforward to replace something in C#. We can use the method Regex.Replace( string, string, string ). This takes in a string to be manipulated, a pattern string, and a replacement string. For example:

using System.Text.RegularExpressions;

class Reg{
    public static void Main(){
        string str = "Hello World";
        string search = "Hello";
        string replace = "Goodbye";
        str = Regex.Replace( str, search, replace );
        System.Console.WriteLine( str );
    }
}

or a more condensed version:

using System.Text.RegularExpressions;

class Reg{
    public static void Main(){
        string str = Regex.Replace( "Hello World", "Hello", "Goodbye");
        System.Console.WriteLine( str );
    }
}

This will output

Goodbye World

after replacing Hello with Goodbye.

Java uses the method replaceAll( String ), called on a Matcher object. It takes the replacement string as an argument and returns a new String with every match replaced.
Here's how to do it:

import java.util.regex.*;
public class RegEx{
    public static void main( String args[] ){
        String str = "Hello World";
        Pattern strMatch = Pattern.compile( "Hello" );
        Matcher m = strMatch.matcher( str );
        System.out.println( m.replaceAll( "Goodbye" ) );
    }
}
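
As a short sketch of the options Java offers here: Matcher also has replaceFirst(), which replaces only the first match, and String itself has a replaceAll( regex, replacement ) convenience method that compiles the pattern for you. The class ReplaceDemo and its method names are made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class ReplaceDemo {
    static final Pattern HELLO = Pattern.compile("Hello");

    // Replaces every match.
    static String replaceEvery(String input) {
        return HELLO.matcher(input).replaceAll("Goodbye");
    }

    // Replaces only the first match.
    static String replaceOnlyFirst(String input) {
        return HELLO.matcher(input).replaceFirst("Goodbye");
    }

    public static void main(String[] args) {
        String str = "Hello World, Hello Moon";
        System.out.println(replaceEvery(str));     // Goodbye World, Goodbye Moon
        System.out.println(replaceOnlyFirst(str)); // Goodbye World, Hello Moon

        // One-off shortcut on String: the pattern is recompiled on each call.
        System.out.println(str.replaceAll("Hello", "Goodbye"));
        // Goodbye World, Goodbye Moon
    }
}
```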

Perl uses s///g to do the replacement. Basically, the syntax is: s/<pattern>/<replacement>/g (We still use /g because we want the replacement to be global.)
So here is the Perl code:

#!/usr/bin/perl

$str = "Hello World";
$str =~ s/Hello/Goodbye/g;
print $str, "\n";

Of course, all this is fun, but we can also use patterns!

Pattern Replacement
The concept behind this is nothing new: basically, use a pattern as the search argument. For example, if I want to convert all 4-letter words to asterisks, I would simply do this (C#):

using System.Text.RegularExpressions;

class Reg{
    public static void Main(){
        string a = "abcd hello ello yellow";
        string str = Regex.Replace( a, @"\b\w{4}\b", "****");
        System.Console.WriteLine( str );
    }
}

And very similar in Java and Perl, respectively:

import java.util.regex.*;
public class RegEx{
    public static void main( String args[] ){
        String str = "abcd hello ello yellow";
        Pattern strMatch = Pattern.compile( "\\b\\w{4}\\b" );
        Matcher m = strMatch.matcher( str );
        System.out.println( m.replaceAll( "****" ) );
    }
}

#!/usr/bin/perl

$str = "abcd hello ello yellow";
$str =~ s/\b\w{4}\b/\*\*\*\*/g;
print $str;

Replacement and Capturing
Suppose you have pairs of numbers formatted as <int>-<int>, and you want to swap each pair.
Basically, you can capture them, and then replace them using what you captured! The syntax for the replacement string is very much like the capturing you've seen above with Perl. Here's the C# example:

using System.Text.RegularExpressions;

class Reg{
    public static void Main(){
        string str = "30-50 100-200 123-647 952-142 5-1231";
        string search = @"(\d+)-(\d+)";
        string replace = "$2-$1";
        str = Regex.Replace( str, search, replace );
        System.Console.WriteLine( str );
    }
}

So in this example,

@"(\d+)-(\d+)"

captures the two numbers into two groups, then the replacement string

"$2-$1"

takes the second group, puts it in front, followed by a dash, then appends the first group. The result is that every pair of numbers is swapped:

50-30 200-100 647-123 142-952 1231-5

as planned.

$2 refers to the second group of the capture, while $1 refers to the first group of the capture. Of course, this would also apply to all the groups.

Similar code can be used in Java and Perl:

import java.util.regex.*;
public class RegEx{
    public static void main( String args[] ){
        String str = "30-50 100-200 123-647 952-142 5-1231";
        Pattern strMatch = Pattern.compile( "(\\d+)-(\\d+)" );
        Matcher m = strMatch.matcher( str );
        System.out.println( m.replaceAll( "$2-$1" ) );
    }
}

#!/usr/bin/perl

$str = "30-50 100-200 123-647 952-142 5-1231";
$str =~ s/(\d+)-(\d+)/$2-$1/g;
print $str;
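
Capturing and replacing can also be combined when the replacement has to be computed, as in the earlier "add 5 dollars" problem. The following is a minimal sketch, not a definitive implementation: the class AddFiveDollars and the method addFive are hypothetical names, while appendReplacement(), appendTail() and the static Matcher.quoteReplacement() are standard Matcher methods (quoteReplacement needs a Java version newer than 1.4). Quoting is needed because a bare $ in a Java replacement string is read as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the names are for illustration only.
class AddFiveDollars {
    static final Pattern MONEY = Pattern.compile("\\$(\\d+)\\.(\\d\\d)");

    // Rewrites every $x.yz value in the input with 5 added to the dollars.
    static String addFive(String input) {
        Matcher m = MONEY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int dollars = Integer.parseInt(m.group(1)) + 5;
            String replacement = "$" + dollars + "." + m.group(2);
            // quoteReplacement keeps the literal '$' from being treated
            // as a reference to a captured group.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        // "$5.1" has only one digit of cents, so it is left untouched.
        System.out.println(addFive("$1.57 $316.15 $5.1"));
        // prints $6.57 $321.15 $5.1
    }
}
```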

And this concludes the Introduction to Regular Expressions. The key to knowing regex is to keep on practicing!

Links
- Java 1.4 API documentation for java.util.regex: contains the API for the Pattern and the Matcher classes. The Pattern class documentation also contains a summary of regular-expression constructs, and is very useful.
- The MSDN documentation for the Regex class.
Book
- Mastering Regular Expressions by Jeffrey E. F. Friedl - perhaps the best book ever published on regular expressions, and considered a bible by many people.

King Mak, 2002

Reader's Comments

There's a part up there that reads: m == Match.Success it should be either: m.Success or m != m.Empty I kept getting them confused at one point, and then finally just changed it all to m.Success, but I missed that one in editing.
-- Larry Mak, September 18, 2002

There's another editing mistake: The java code: import java.util.regex.*; public class RegEx{ public static void main( String args[] ){ String data = "Hello World!"; // This is to compile the pattern - this will increase performance --> Pattern pat = pattern.compile("Hello"); --> Matcher m = pat.matcher( s ); if ( m.find() ){ System.out.println("Hello Found! joy!"); } data = "Goodbye, cruel cruel world!"; --> m = pat.matcher( s ); if ( m.find() ){ // Similarly, this will never be reached either.. System.out.println("Hello Found! joy!"); } } } pattern.compile should be Pattern.compile (p is capitalized) the variable s should be data instead.
-- Larry Mak, September 18, 2002

I will shortly post up the source code for this tutorial, and all the mistakes that I mentioned above will be corrected.
-- Larry Mak, September 18, 2002

I've corrected the tutorial, so the above mistakes aren't here anymore..
-- Larry Mak, September 18, 2002

larry good tutorial...i was able to follow it really well, and was done well. thanks for the wonderful tutorial.
-- Justin Jones, September 19, 2002

Very thorough overview of RegEx in different languages! Good job.
-- Robert Wlodarczyk, September 20, 2002

Ya, I don't believe in language dependence. (though knowing the API or a language's equivalant generally makes a job easier.) Hope this will also ease the transition between the languages..
-- Larry Mak, September 24, 2002

Good tutorial Larry, well written and organized. Gives someone with no previous knowledge a good understanding of regular expressions, and their use in different languages.
-- Lee Bankewitz, September 26, 2002

Great stuff. Helped me a lot!~
-- Brian Li, October 10, 2002

After just reading O'Reillys "Mastering Regular Expressions" this is a good how-to on RegEx but lacks the basic foundations. It seems that more is needed to explain character classes, matching operators, etc. However this is an excellant guide for some one who knows the basics of RegEx and wishes to apply them to a new language.
-- Kevin Sullivan, October 15, 2002

Thanks for the tutorial Larry. I am beginning to learn regular expressions in one of my classes and this tutorial gives me good examples on how to put them into practice.
-- Luke Walker, October 17, 2002

very cool. Too bad it uses C# already ;-).
-- Aaron Brethorst, October 20, 2002

Very nice tutorial, especially for a beginning introduction.
-- Marcus Griep, November 10, 2002

Great tutorial Larry.
-- Sushant Bhatia, November 12, 2002

good use of semantics. a nice tutorial.
-- Greg Sun, November 16, 2002

Good work, Larry. It's good to see people do their own work for a change. Use of Regex in C# will greatly improve Validator classes in Windows Application projects.
-- Seth Peck, December 06, 2002

Nice Tutorial. Very useful.
-- AJ Tomich, December 11, 2002

A valuable tutorial. I was looking for such a resource. Good job!
-- Sami Ahmed, February 02, 2003
