Beginning Regular Expressions in C#, Java, and Perl
Let me begin by saying that I'm a beginner, so my methods of using regex are rather primitive, and there's more than one way to do it! (Perl philosophy!)
This tutorial is mainly in C#, touching on Java and Perl syntax along the way. Since this is a regex tutorial, it should be easy to translate between the languages. The prerequisite is some idea of how these languages work, because the tutorial is long enough without explaining everything each language offers.
(You may ask yourself, why don't I just parse the string myself? Well, one reason is that it's easy to make a mistake; another is that regular expressions are usually compiled and optimized, so they are often faster.)
So let's start with the definition: a regular expression is a pattern, written in a small pattern language, that describes the text you want to match in an ordinary string. It serves as a filter to find what you're looking for.
Regular expressions are somewhat standardized: the pattern syntax is largely consistent across languages, even though the syntax for calling them differs.
The two main uses of regular expressions are matching and replacing. Matching determines whether the pattern occurs in the string and, if so, finds it. Replacing changes the parts of the string that fit the pattern into something else.
Let's go over matching first:
Basic Matching
Let's begin with a Hello World to welcome ourselves to the world of regex! ("Regex" is a very common short form of "regular expression".)
The following is the full C# code: (I will dissect this in a minute)
using System.Text.RegularExpressions;
class RegEx{
public static void Main(){
string data = "Hello World!";
if ( Regex.IsMatch( data, "Hello" ) ){
System.Console.WriteLine( "Hello Found, joy!" );
}
data = "Goodbye World!";
if ( Regex.IsMatch( data, "Hello" ) ){
System.Console.WriteLine( "Hello Found, joy!" );
}
}
}
|
This is, of course, regex at its most basic. IsMatch takes two strings: the first is the input string and the second is the pattern. It returns true if the pattern is found and false otherwise.
In this case, data contains the string "Hello World!", and IsMatch finds the pattern "Hello" in "Hello World!", so it returns true. Later on, data is changed to the string "Goodbye World!", and when you evaluate IsMatch this time, "Hello" is not found, so it returns false.
Java finally has regex support as of its recent 1.4 release, so now we can actually use it. Let's look at the Java version:
import java.util.regex.*;
public class RegEx{
public static void main( String args[] ){
String data = "Hello World!";
Pattern pat = Pattern.compile("Hello");
Matcher m = pat.matcher( data );
if ( m.find() ){
System.out.println("Hello Found! joy!");
}
data = "Goodbye, cruel cruel world!";
m = pat.matcher( data );
if ( m.find() ){
System.out.println("Hello Found! joy!");
}
}
}
|
If you familiarized yourself with the C# example, everything should look very similar, with some notable differences. Instead of passing the pattern as a string and letting the library do the work for you, you have to explicitly compile the pattern and then apply it to whatever input you need. This is achieved in the example by:
Pattern pat = Pattern.compile("Hello");
|
Of course, pat is just the variable name, so you can use any valid identifier. The string passed to the compile method is the pattern to be compiled; it is "Hello" in this case.
The Matcher is the "engine that performs match operations on a character sequence by interpreting a Pattern" (Java 1.4 API). From the example:
Matcher m = pat.matcher( data );
if ( m.find() ){
System.out.println("Hello Found! joy!");
}
|
m is just the variable name, and the matcher is created with the string to be matched as its argument (data in this case). This is very similar to the C# example, so I will not go any further into it, except for one thing: find() attempts to find the next subsequence of the string that fits the pattern. This means that every call advances the matcher's position past the previous match. To picture this:
String data = "Hello World!";
Pattern pat = Pattern.compile("Hello");
Matcher m = pat.matcher( data );
if ( m.find() ){
System.out.println("Hello Found! joy!");
}
// The second find() resumes after the first match, finds nothing, and returns false.
if ( m.find() ){
System.out.println("Hello Found! joy!");
}
|
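find() does not start over by itself. To get the same effect as C#'s IsMatch on a matcher you have already used, call m.reset() before find(), or simply create a new Matcher. There is also lookingAt(), which always starts from the beginning of the input, but it only succeeds if the match begins at the very first character; unlike IsMatch, it will not find "Hello" in the middle of a string. Since our string starts with "Hello", lookingAt() succeeds both times here:
String data = "Hello World!";
Pattern pat = Pattern.compile("Hello");
Matcher m = pat.matcher( data );
if ( m.lookingAt() ){
System.out.println("Hello Found! joy!");
}
if ( m.lookingAt() ){
System.out.println("Hello Found! joy!");
}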
|
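To make the difference between these calls concrete, here is a small sketch. The class name MatchModes, its helper methods, and the sample strings are all made up for illustration; only find(), lookingAt(), matches(), and reset() come from the Matcher API.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchModes {
    static final Pattern HELLO = Pattern.compile("Hello");

    // True if the pattern occurs anywhere in the input (like C#'s Regex.IsMatch).
    static boolean containsHello(String input) {
        return HELLO.matcher(input).find();
    }

    // True only if the input begins with a match.
    static boolean startsWithHello(String input) {
        return HELLO.matcher(input).lookingAt();
    }

    // True only if the entire input is a match.
    static boolean isExactlyHello(String input) {
        return HELLO.matcher(input).matches();
    }

    // Counts the matches, rewinds the same Matcher with reset(), and counts again.
    static int[] countTwice(String input) {
        Matcher m = HELLO.matcher(input);
        int first = 0;
        while (m.find()) first++;
        m.reset();
        int second = 0;
        while (m.find()) second++;
        return new int[] { first, second };
    }

    public static void main(String[] args) {
        System.out.println(containsHello("Oh, Hello World!"));   // true
        System.out.println(startsWithHello("Oh, Hello World!")); // false
        System.out.println(startsWithHello("Hello World!"));     // true
        System.out.println(isExactlyHello("Hello World!"));      // false
    }
}
```

Using a fresh Matcher per call, as the first three helpers do, sidesteps the "where did find() leave off" question entirely.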
Last but certainly not least, here is the Perl implementation:
#!/usr/bin/perl
$string = "Hello World!";
print "Hello found! joy!\n" if ( $string =~ /Hello/ );
$string = "Goodbye World!";
#Will not be true.
print "Hello found! joy!\n" if ( $string =~ /Hello/ );
|
In Perl, =~ is an operator that matches the string against the pattern between the slashes. The code is pretty trivial.
To negate the test in Java and C#, put ! in front of the call. In Perl, you can use the !~ operator, which is true only if the pattern is not in the string.
Pattern Matching
The Pattern class documentation (see Links at the end) contains a summary of regular expression constructs. Let's put some of them to use.
Let's say we're trying to match social security numbers. The format for a social security number is xxx-xx-xxxx.
Let's start with the C# code:
using System.Text.RegularExpressions;
class RegEx{
public static void Main(){
string[] id = { "123-45-6789",
"1234-5-6789",
"547-12-6346",
"54-12-5623",
"3513-15134",
"608-12-61347",
"8608-12-6134",
"608-12-6134" };
for ( int i = 0; i < id.Length; i++ )
if ( Regex.IsMatch( id[i], @"\d{3}-\d{2}-\d{4}" ) )
System.Console.WriteLine( id[i] );
}
}
|
Most of this should look familiar. We'll take a look at the regex line:
if ( Regex.IsMatch( id[i], @"\d{3}-\d{2}-\d{4}" ) )
|
The only part that is new is the @"\d{3}-\d{2}-\d{4}". If you look it up, \d means a digit from 0 to 9. {3} means to match the preceding element (the \d in this case) exactly three times, no more, no less. This is followed by a dash, two more digits, another dash, and finally four more digits.
So what's the flaw in this? IsMatch doesn't require the pattern to cover the entire string, so the above sample code will match both "608-12-61347" and "8608-12-6134", since \d{3}-\d{2}-\d{4} matches a substring of each. This is, of course, not ideal, so we'll add a \b at each end. \b means a word boundary, and it works in this case because an extra digit right before or after the number means there is no word boundary at that spot. (Note that it will still match if the extra digits are separated by a space, such as "608-12-6134 1", but if you use Match, you can drop the space and the one while keeping the rest of the number. This will be shown later.) So basically this would be the pattern:
@"\b\d{3}-\d{2}-\d{4}\b"
You should be able to write the equivalent in Java and Perl following the same concepts easily: (In Java and Perl respectively)
import java.util.regex.*;
public class RegEx{
public static void main( String args[] ){
String[] id = { "123-45-6789",
"1234-5-6789",
"547-12-6346",
"54-12-5623",
"3513-15134",
"608-12-61347",
"8608-12-6134",
"608-12-6134" };
Pattern pat = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
for ( int i = 0; i < id.length; i++ )
if ( pat.matcher(id[i]).find() )
System.out.println( id[i] );
}
}
|
#!/usr/bin/perl
@id = ( "123-45-6789",
"1234-5-6789",
"547-12-6346",
"54-12-5623",
"3513-15134",
"608-12-61347",
"8608-12-6134",
"608-12-6134" );
foreach $i ( @id ){
print $i, "\n" if $i =~ /\b\d{3}-\d{2}-\d{4}\b/;
}
|
The Perl version is trivial, but one may wonder why the Java version looks slightly different, with tons of \'s. This is because \ is the escape character in Java string literals: to get a literal backslash into the pattern (and trust me, you usually want one) you need to write two of them, the first escaping the second. C# needs this too, except that the @ prefix tells the compiler to take the string literally (a verbatim string), eliminating the need for the extra backslash.
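As a quick check of both points (the doubled backslash and the effect of \b), here is a hedged Java sketch. SsnFilter, LOOSE, BOUNDED, and filter are names I made up; the candidate strings are a subset of the ones used above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class SsnFilter {
    // In Java source, "\\d" is the two-character string \d that the regex engine sees.
    static final Pattern LOOSE = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");
    static final Pattern BOUNDED = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");

    // Returns the candidates in which the pattern is found somewhere.
    static List<String> filter(Pattern pat, String[] candidates) {
        List<String> hits = new ArrayList<>();
        for (String c : candidates) {
            if (pat.matcher(c).find()) hits.add(c);
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] id = { "123-45-6789", "608-12-61347", "8608-12-6134", "608-12-6134" };
        System.out.println("\\d".length());       // 2: a backslash and a 'd'
        System.out.println(filter(LOOSE, id));    // all four candidates
        System.out.println(filter(BOUNDED, id));  // [123-45-6789, 608-12-6134]
    }
}
```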
More Matching
With practice and experience, you should now be able to check easily whether a pattern occurs in a string. But what if you want to see what the match was? Easy: we can use the Match class! Let's revisit our Hello World regex:
using System.Text.RegularExpressions;
class RegEx{
public static void Main(){
string data = "Hello World!";
if ( Regex.IsMatch( data, "Hello" ) ){
Match m = Regex.Match( data, "Hello" );
System.Console.WriteLine( m.ToString() );
}
}
}
|
And the output for this would be just simply "Hello".
So, to modify our social security example so that it prints the match itself and gets rid of the cases we've mentioned, the code becomes:
using System.Text.RegularExpressions;
class RegEx{
public static void Main(){
string[] id = { "123-45-6789",
"1234-5-6789",
"547-12-6346",
"54-12-5623",
"3513-15134",
"608-12-61347",
"8608-12-6134",
"608-12-6134" };
for ( int i = 0; i < id.Length; i++ )
if ( Regex.IsMatch( id[i], @"\b\d{3}-\d{2}-\d{4}\b" ) )
System.Console.WriteLine( Regex.Match( id[i], @"\b\d{3}-\d{2}-\d{4}\b" ) );
}
}
|
Now, if you're given the case "608-12-6134 1", it will match and output "608-12-6134". Rejecting such input outright, if you don't want to accept it, is left to the reader as an exercise.
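One possible approach to that exercise, sketched in Java rather than C#: Matcher.matches() succeeds only when the entire input fits the pattern, so trailing text such as " 1" causes a rejection. The class and method names below are hypothetical.

```java
import java.util.regex.Pattern;

public class SsnValidator {
    static final Pattern SSN = Pattern.compile("\\d{3}-\\d{2}-\\d{4}");

    // matches() requires the whole input to fit the pattern, not just a substring.
    static boolean isWholeSsn(String input) {
        return SSN.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isWholeSsn("608-12-6134"));   // true
        System.out.println(isWholeSsn("608-12-6134 1")); // false
        System.out.println(isWholeSsn("8608-12-6134"));  // false
    }
}
```

No \b is needed here, because nothing is allowed before or after the number at all.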
Of course, sometimes you want to match more than once in the same string, so what do you do? Well, a Match object can be viewed loosely as a link in a chain of matches: NextMatch() returns the one that follows it. So you can, of course, do it this way:
using System.Text.RegularExpressions;
class RegEx{
public static void Main(){
string str = "HelloBelloYelloHallo";
string strMatch = @"\wello";
if ( Regex.IsMatch( str, strMatch ) )
for ( Match m = Regex.Match( str, strMatch ); m.Success; m = m.NextMatch() )
System.Console.WriteLine( m );
}
}
|
Simple enough, just one for loop. (Note: the if statement isn't actually needed, because if nothing matches, Regex.Match returns Match.Empty, whose Success property is false, so the loop body never runs. I just wanted to be a little more explicit.)
Here's the Java version:
import java.util.regex.*;
public class RegEx{
public static void main( String args[] ){
String s = "HelloBelloWelloHalloYello";
Pattern pat = Pattern.compile("\\wello");
Matcher m = pat.matcher( s );
while ( m.find() ){
System.out.println( m.group() );
}
}
}
|
As briefly mentioned previously, since find() picks up where the last match ended, you can easily use a while loop to call it until it returns false. group() returns the input subsequence matched by the previous match. It is already a String, so no toString() call is needed.
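If you want to keep the matches instead of printing them, the same loop can fill a list. This is only a sketch: AllMatches and findAll are names I invented, and the "text@start" format is just a convenient way to show that start() reports where each match began.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllMatches {
    // Collects every match as "text@start", using group() and start().
    static List<String> findAll(String regex, String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(input);
        while (m.find()) {
            found.add(m.group() + "@" + m.start());
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(findAll("\\wello", "HelloBelloWelloHalloYello"));
        // prints [Hello@0, Bello@5, Wello@10, Yello@20]
    }
}
```

Note that "Hallo" is skipped, since it does not end in "ello".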
Finally, the Perl version:
#!/usr/bin/perl
$str = "HelloBelloWelloHalloYello";
print $1, "\n" while ( $str =~ /(\wello)/g );
|
The only thing that changed is that there is a /g modifier. When you use it, the regular expression engine will keep track of where it finished, and start right where you left off the next time around. Another thing that may be unfamiliar is the $1. What is it? It's capturing, of course, which brings us to the next topic:
Capturing and Groups
Now, what if we want only part of the match? For example, you want to parse a series of money values in the following format: $x.yz, where x can be any non-negative number and y and z are both single digits (to make up the cents).
Of course, we'll write something like this to match it:
\$\d+\.\d\d
(The above matches a dollar sign, then one or more digits, then a period, then two more digits. It does not account for negative dollar values, which we will ignore for the sake of simplicity.)
And the following code:
using System.Text.RegularExpressions;
class Reg{
public static void Main(){
string str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
string strMatch = @"\$\d+\.\d\d";
for ( Match m = Regex.Match( str, strMatch ); m.Success; m = m.NextMatch() ){
System.Console.WriteLine( m );
}
}
}
|
(All this should be second nature hopefully - and do note that we need to escape the $ and the . because they are special characters.)
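To see what goes wrong without the escapes, here is a small Java sketch. EscapeDemo and count are made-up names, and the input string is invented for the demonstration. Pattern.quote is a real method (added in Java 5, so newer than the 1.4 API this tutorial was written against) that escapes a whole literal for you.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapeDemo {
    // Counts how many times the regex is found in the input.
    static int count(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        int n = 0;
        while (m.find()) n++;
        return n;
    }

    public static void main(String[] args) {
        String input = "$1.57 $1x57";
        System.out.println(count("\\$1.57", input));              // 2: "." matches any character
        System.out.println(count("\\$1\\.57", input));            // 1: "\." matches only a period
        System.out.println(count("$1", input));                   // 0: "$" means end of input
        System.out.println(count(Pattern.quote("$1.57"), input)); // 1: quote() escapes everything
    }
}
```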
Of course, this is all fine and dandy, but suppose we want to do something with the values, for example, add 5 dollars. Then we need to extract them. The way we would do this is by using captures and groups.
"Capturing groups are numbered by counting their opening parentheses from left to right." (Java 1.4 API)
So if we want to capture the dollars (and, as an example, the cents), all we have to do is enclose them in parentheses, which gives us the pattern:
\$(\d+)\.(\d\d)
and here is how we would write the code:
using System.Text.RegularExpressions;
class Reg{
public static void Main(){
string str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
string strMatch = @"\$(\d+)\.(\d\d)";
for ( Match m = Regex.Match( str, strMatch ); m.Success; m = m.NextMatch() ){
GroupCollection gc = m.Groups;
System.Console.WriteLine( "The number of captures: " + gc.Count );
for ( int i = 0; i < gc.Count; i++ ){
Group g = gc[i];
System.Console.WriteLine( g.Value );
}
}
}
}
|
So basically, we use m.Groups to get a GroupCollection of the captured groups, then iterate through it and print the value of each one. Note that group 0 is always the entire match, so the count printed here is 3. The above is just a way of listing all the groups. To solve our problem of adding 5 dollars, we would do the following:
using System.Text.RegularExpressions;
class Reg{
public static void Main(){
string str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
string strMatch = @"\$(\d+)\.(\d\d)";
for ( Match m = Regex.Match( str, strMatch ); m.Success; m = m.NextMatch() ){
GroupCollection gc = m.Groups;
System.Console.WriteLine( "${0}.{1}", int.Parse(gc[1].Value) + 5, gc[2].Value );
}
}
}
|
As we can see above, we can directly access each Group in the GroupCollection using array notation.
If you understand the concepts so far, the Java implementation is pretty trivial, and you can probably figure it out from the code. I will just show the Java versions of both C# programs here:
import java.util.regex.*;
public class RegEx{
public static void main( String args[] ){
String str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
Pattern strMatch = Pattern.compile( "\\$(\\d+)\\.(\\d\\d)" );
Matcher m = strMatch.matcher( str );
while ( m.find() ){
System.out.println("The number of captures: " + ( m.groupCount() + 1 ) );
for ( int i = 0; i <= m.groupCount(); i++ )
System.out.println( m.group(i) );
}
}
}
|
And to add 5 dollars:
import java.util.regex.*;
public class RegEx{
public static void main( String args[] ){
String str = "$1.57 $316.15 $19.30 $0.30 $0.00 $41.10 $5.1 $.5";
Pattern strMatch = Pattern.compile( "\\$(\\d+)\\.(\\d\\d)" );
Matcher m = strMatch.matcher( str );
while ( m.find() ){
System.out.println( "$" + ( Integer.parseInt( m.group(1) ) + 5 )
+ "." + m.group(2) );
}
}
}
|
The code looks very similar, with group( int ) taking the group number as its argument. This will throw an exception if the argument is out of bounds. A very notable difference is that groupCount() returns the number of capturing groups, not including group 0 (the entire match). Therefore, in the above example, groupCount() will return 2. (That is why I added one in the first statement, and why the for loop uses less than or equal '<=' rather than the strictly less than '<' that appeared in C#.)
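The off-by-one is easy to trip over, so here is a hedged sketch that packs group 0 and the capturing groups into one array. GroupDemo and groupsOfFirstMatch are hypothetical names; the pattern is the same money pattern as above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupDemo {
    static final Pattern MONEY = Pattern.compile("\\$(\\d+)\\.(\\d\\d)");

    // Returns group 0 (the whole match) followed by each capturing group
    // for the first match in the input, or an empty array if nothing matches.
    static String[] groupsOfFirstMatch(String input) {
        Matcher m = MONEY.matcher(input);
        if (!m.find()) return new String[0];
        String[] groups = new String[m.groupCount() + 1]; // +1 for group 0
        for (int i = 0; i <= m.groupCount(); i++) {
            groups[i] = m.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        for (String g : groupsOfFirstMatch("$316.15")) {
            System.out.println(g); // $316.15, then 316, then 15
        }
    }
}
```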
The Perl versions are very simple if you have followed along up to here and have an understanding of Perl:
#!/usr/bin/perl
$str = "\$1.57 \$316.15 \$19.30 \$0.30 \$0.00 \$41.10 \$5.1 \$.5";
print $1, " ", $2, "\n" while ( $str =~ /\$(\d+)\.(\d\d)/g );
|
In Perl, $ is the sigil that marks scalar variables, and double-quoted strings interpolate variables, so we have to escape it even inside the string. Also, some variable names are reserved: you refer to the captured groups in numerical order with a $ followed by the group number ($1, $2, $3...). Otherwise, it is very easy. Likewise, to add 5 dollars:
#!/usr/bin/perl
$str = "\$1.57 \$316.15 \$19.30 \$0.30 \$0.00 \$41.10 \$5.1 \$.5";
print $1 + 5, " ", $2, "\n" while ( $str =~ /\$(\d+)\.(\d\d)/g );
|
There. Literally just added 4 characters.
Replacement
Now that we have a decently strong foundation, we can move on! Matching is just half the fun, as now we can play around with Strings!
It is very straightforward to replace something in C#. We can use the method Regex.Replace( string, string, string ). This takes in a string to be manipulated, a pattern string, and a replacement string. For example:
using System.Text.RegularExpressions;
class Reg{
public static void Main(){
string str = "Hello World";
string search = "Hello";
string replace = "Goodbye";
str = Regex.Replace( str, search, replace );
System.Console.WriteLine( str );
}
}
|
or a more condensed version:
using System.Text.RegularExpressions;
class Reg{
public static void Main(){
string str = Regex.Replace( "Hello World", "Hello", "Goodbye");
System.Console.WriteLine( str );
}
}
|
This will output
Goodbye World
after replacing Hello with Goodbye.
Java uses the replaceAll( String ) method of a Matcher object. It takes the replacement string as its argument and returns a new String with the replacements made. Here's how to do it:
import java.util.regex.*;
public class RegEx{
public static void main( String args[] ){
String str = "Hello World";
Pattern strMatch = Pattern.compile( "Hello" );
Matcher m = strMatch.matcher( str );
System.out.println( m.replaceAll( "Goodbye" ) );
}
}
|
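Matcher also has a replaceFirst( String ) method, which stops after the first match. A minimal sketch comparing the two follows; ReplaceDemo and its method names are invented for illustration, and the input with two Hellos is mine.

```java
import java.util.regex.Pattern;

public class ReplaceDemo {
    static final Pattern HELLO = Pattern.compile("Hello");

    // Replaces every occurrence of the pattern.
    static String replaceEvery(String input) {
        return HELLO.matcher(input).replaceAll("Goodbye");
    }

    // Replaces only the first occurrence of the pattern.
    static String replaceOnlyFirst(String input) {
        return HELLO.matcher(input).replaceFirst("Goodbye");
    }

    public static void main(String[] args) {
        System.out.println(replaceEvery("Hello Hello World"));     // Goodbye Goodbye World
        System.out.println(replaceOnlyFirst("Hello Hello World")); // Goodbye Hello World
    }
}
```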
Perl uses s///g to do the replacement. Basically, the syntax is: s/<pattern>/<replacement>/g (We keep the /g because we want the replacement to be global.) So here is the Perl code:
#!/usr/bin/perl
$str = "Hello World";
$str =~ s/Hello/Goodbye/g;
|
Of course, all this is fun, but we can also use patterns!
Pattern Replacement
The concept here is nothing new: use a pattern as the search argument. For example, if I want to convert all four-letter words to asterisks, I would simply do this (C#):
using System.Text.RegularExpressions;
class Reg{
public static void Main(){
string a = "abcd hello ello yellow";
string str = Regex.Replace( a, @"\b\w{4}\b", "****");
System.Console.WriteLine( str );
}
}
|
And very similar in Java and Perl, respectively:
import java.util.regex.*;
public class RegEx{
public static void main( String args[] ){
String str = "abcd hello ello yellow";
Pattern strMatch = Pattern.compile( "\\b\\w{4}\\b" );
Matcher m = strMatch.matcher( str );
System.out.println( m.replaceAll( "****" ) );
}
}
|
#!/usr/bin/perl
$str = "abcd hello ello yellow";
$str =~ s/\b\w{4}\b/\*\*\*\*/g;
print $str;
|
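For one-off replacements in Java, the String class also offers a replaceAll( regex, replacement ) shortcut that compiles the pattern for you. The sketch below uses it to generalize the example to any word length; MaskWords and mask are hypothetical names, not part of any library.

```java
public class MaskWords {
    // Replaces every whole word of exactly `length` word characters
    // with the same number of asterisks.
    static String mask(String input, int length) {
        StringBuilder stars = new StringBuilder();
        for (int i = 0; i < length; i++) {
            stars.append('*');
        }
        // The pattern is built as \b\w{length}\b; asterisks are not special
        // in a replacement string, so no escaping is needed there.
        return input.replaceAll("\\b\\w{" + length + "}\\b", stars.toString());
    }

    public static void main(String[] args) {
        System.out.println(mask("abcd hello ello yellow", 4)); // **** hello **** yellow
        System.out.println(mask("abcd hello ello yellow", 5)); // abcd ***** ello yellow
    }
}
```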
Replacement and Capturing
Suppose we have pairs of numbers formatted as <int>-<int>, and we want to swap each pair. Basically, you can capture them, and then replace them using what you captured! The syntax for the replacement string is very much like the capturing you've seen above with Perl. Here's the C# example:
using System.Text.RegularExpressions;
class Reg{
public static void Main(){
string str = "30-50 100-200 123-647 952-142 5-1231";
string search = @"(\d+)-(\d+)";
string replace = "$2-$1";
str = Regex.Replace( str, search, replace );
System.Console.WriteLine( str );
}
}
|
So in this example, the pattern
(\d+)-(\d+)
captures the two numbers into two groups, then the replacement string
$2-$1
puts the second group in front, followed by a dash, then the first group. The result is that every pair of numbers is swapped:
50-30 200-100 647-123 142-952 1231-5
as planned.
$2 refers to the second captured group, while $1 refers to the first. The same numbering applies to any further groups ($3, $4, and so on).
Similar code can be used in Java and Perl:
import java.util.regex.*;
public class RegEx{
public static void main( String args[] ){
String str = "30-50 100-200 123-647 952-142 5-1231";
Pattern strMatch = Pattern.compile( "(\\d+)-(\\d+)" );
Matcher m = strMatch.matcher( str );
System.out.println( m.replaceAll( "$2-$1" ) );
}
}
|
#!/usr/bin/perl
$str = "30-50 100-200 123-647 952-142 5-1231";
$str =~ s/(\d+)-(\d+)/$2-$1/g;
print $str;
|
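A replacement string like $2-$1 can only rearrange what was captured. If the replacement has to be computed, such as adding 5 dollars to each amount as we did earlier, one option in Java is the appendReplacement/appendTail pair on Matcher. This is a sketch under my own assumptions: AddFiveDollars and addFive are made-up names, and Matcher.quoteReplacement (Java 5 and later) is used so the literal "$" is not read as a group reference.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AddFiveDollars {
    static final Pattern MONEY = Pattern.compile("\\$(\\d+)\\.(\\d\\d)");

    // Rewrites each $x.yz in the input with 5 added to the dollar part.
    static String addFive(String input) {
        Matcher m = MONEY.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int dollars = Integer.parseInt(m.group(1)) + 5;
            String replacement = "$" + dollars + "." + m.group(2);
            // Copies the text before the match, then the computed replacement.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out); // copies whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(addFive("$1.57 $316.15 $5.1")); // $6.57 $321.15 $5.1
    }
}
```

The "$5.1" at the end is left alone because it does not have two digits after the period, so the pattern never matches it.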
And this concludes the Introduction to Regular Expressions. The key to knowing regex is to keep on practicing!
Links
- Contains the API for the Pattern and the Matcher classes. The Pattern class also contains a summary of regular-expression constructs, and is very useful.
- The MSDN documentation for the Regex class.
Book
- by Jeffrey E. F. Friedl - perhaps the best book ever published on regular expressions, and considered a bible by many people.
King Mak, 2002