What Are Regular Expressions And What Are They Good For?
In a nutshell, regular expressions identify patterns in text.
Manipulating strings can be one of the more complex topics to deal with. Regular expressions make it easy and efficient to perform complex searches and/or replacements of text in a string, given pretty much any criteria for the search. The following example demonstrates how easy it is to count the number of tags in an HTML file using regular expressions in Perl, as compared to equivalent C code using a character array.
Perl:
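A minimal sketch of such a tag counter, assuming the page is held in a hypothetical $html variable. It relies on the /g modifier and the +? sequence mentioned under Modifiers; the sample markup is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical sample input; any HTML string would do.
my $html = "<html><body><p>Hello, <b>world</b>!</p></body></html>";

# Each pass through the loop finds the next "<...>" and bumps the counter.
my $tags = 0;
while ($html =~ m/<.+?>/g) {
    $tags++;
}

print "Found $tags tags\n";
```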
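Equivalent in C (using character arrays):
int i = 0, tags = 0;
while (string[i] != '\0') {
    /* Skip ahead to the next '<', which opens a tag. */
    while (string[i] != '<' && string[i] != '\0') i++;
    if (string[i] == '\0') break;
    /* Skip ahead to the next '>', which closes the tag. */
    while (string[i] != '>' && string[i] != '\0') i++;
    if (string[i] == '\0') break;
    tags = tags + 1;
}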
Granted, the C code is fairly easy to understand and wouldn't be hard to write, but string manipulation can get far more complex than counting the tags in an HTML file. Say you want to search a Web page for all sentences that contain a word that ends with "ing" and doesn't contain the letter "m", with the added stipulation that the word must be preceded by either the name "Romeo" or "Juliet" in the sentence. I don't even want to attempt this in C, but the following Perl regular expression will do the trick. (Note: it assumes that sentences end with a period, but it could easily be modified to handle ?, ;, :, etc.)
/\.?\s*([^\.]*(Romeo|Juliet)[^\.]*\b[^\WmM]*ing\b[^\.]*\.)/;
"Aaack! That is way too complicated!"
Don't worry! I don't expect you to understand that. I was simply demonstrating that regular expressions can handle complex searches without a lot of code. And once you become accustomed to using regular expressions it shouldn't be too difficult to understand what that code is doing.
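If you'd like to see it run anyway, here is a minimal sketch that applies the expression, written with [^\WmM] for the letters of the word and an escaped final period, to some made-up sample text. The variable names and the sample sentences are assumptions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical sample text: three sentences, each ending with a period.
my $text = "Juliet was humming. Romeo kept singing loudly. Mercutio was laughing.";

# Collect every sentence that satisfies the search.
my @hits;
while ($text =~ /\.?\s*([^\.]*(Romeo|Juliet)[^\.]*\b[^\WmM]*ing\b[^\.]*\.)/g) {
    push @hits, $1;    # $1 holds the whole matching sentence
}

print "$_\n" for @hits;
```

Only the second sentence is kept: "humming" contains an "m", and the third sentence mentions neither name.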
Getting Started
To begin with, we must have some text to search. We'll place this in a string:
my $string = "This is the text Phil wishes to search.";
Now, we need to tell Perl what string we're looking for. We "bind" our search terms to the string with the =~ operator. The search terms appear to the right of this as follows:
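A minimal sketch, assuming we are looking for the word "Phil" (an illustrative choice of search term); $found is a name introduced here only to hold the result:

```perl
use strict;
use warnings;

my $string = "This is the text Phil wishes to search.";

# Bind the pattern /Phil/ to $string with =~ and keep the result.
my $found = $string =~ m/Phil/;

print "Match found\n" if $found;
```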
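The 'm' stands for match. The two forward slashes are delimiters that set off the search string. In place of the two slashes //, we could use [], {}, ##, <>, !!, or almost any other non-word, non-whitespace character. (The pair ?? also works, but m?...? has a special meaning: it matches only once between calls to reset.)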
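As a quick sketch of the alternative delimiters, each of the following searches is equivalent to m/Phil/. The sample string and the @results array are assumptions for illustration.

```perl
use strict;
use warnings;

my $string = "This is the text Phil wishes to search.";

# The same search written with six different delimiter styles.
# scalar() forces each match to return a plain true/false value.
my @results = (
    scalar($string =~ m/Phil/),
    scalar($string =~ m{Phil}),
    scalar($string =~ m[Phil]),
    scalar($string =~ m<Phil>),
    scalar($string =~ m!Phil!),
    scalar($string =~ m#Phil#),
);

print "All delimiter styles matched\n" unless grep { !$_ } @results;
```

Alternative delimiters are handy when the pattern itself contains slashes, since something like m{/usr/bin} avoids escaping each one.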
The above code returns true (1) if a match is found, and false (an empty string, which counts as 0 in numeric context) if it is not. Thus, the following will print "The condition is false":
if ($string =~ m/Bill/) {
    print "The condition is true";
} else {
    print "The condition is false";
}
The contents of the search string can also be a string variable. Thus the following will return true:
my $string = "This is the text Phil wants to do the searching on.";
my $strToFind = "Phil";
$string =~ m/$strToFind/;
Metacharacters
The above technique is good if we know exactly what we're trying to find. But it won't work if the text we want to find varies. We may want to find a five-digit number, the first word of a particular sentence, words ending in "ing", etc. We can do this with metacharacters. Metacharacters represent individual characters or combinations of characters.
An example of a metacharacter is the wildcard character, which is the period (.). This represents any single character (including whitespace, though by default not a newline). So the following will return true:
my $word = "fall";
$word =~ m/fa.l/;
In this case, the regular expression would have returned true if $word had been "falling", "failed", "unfailing", etc. However, it would not have returned true if $word was "fal", "fill", "faail", etc. An (incomplete!) list of some other common metacharacters follows:
[abc] - Any one of a, b, or c. Example:
#Returns true if $word is "ton", "cone", "prison", etc, but not if it is "non", "won", etc.
$word =~ m/[sct]on/;
[a-d], [4-9] - Any one letter or digit in the range. Example:
#Returns true if $string contains the strings "513", "514", or "515"
$string =~ m/51[3-5]/;
? - The preceding character or group may or may not be present. Note: Groups are enclosed in parentheses (). Example:
#Returns true if $name is "Phil Lanier" or "Philip Lanier"
$name =~ m/Phil(ip)? Lanier/;
* - The preceding character or group is present 0 or more times.
+ - The preceding character or group is present 1 or more times. Example:
#Returns true if $word is "bal", "ball", "balll", etc, but not "ba"
$word =~ m/bal+/;
\s, \S - A whitespace character (including \n, \t, \r); a non-whitespace character. Example:
#Returns true if $string is "two words", but not if it is "two  words" (two spaces), "twowords", etc
$string =~ m/two\swords/;
#Returns true if $string is "one-word", "oneXword", etc but not if it is "one word"
$string =~ m/one\Sword/;
\d, \D - A digit; a non-digit.
\w, \W - A word character ([a-z], [A-Z], [0-9], or the underscore _); a non-word character.
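Example (a minimal sketch; the sample strings and variable names are made up for illustration):

```perl
use strict;
use warnings;

# Five digits in a row, such as a five-digit number inside a longer string.
my $address  = "Springfield 12345";
my $has_five = $address =~ m/\d\d\d\d\d/;

# \w matches the underscore as well as letters and digits.
my $identifier = "user_42";
my $has_word   = $identifier =~ m/user\w\d\d/;

# \W matches anything that is not a word character, such as a hyphen.
my $has_nonword = "one-word" =~ m/one\Wword/;

# \D matches anything that is not a digit.
my $has_nondigit = "a1" =~ m/\D\d/;

print "All four patterns matched\n"
    if $has_five && $has_word && $has_nonword && $has_nondigit;
```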
For a complete list of metacharacters, see a Perl regular expression reference.
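One way to experiment with the metacharacters listed above is a small table of checks. The sketch below is only an illustration: the @checks table and its sample words are assumptions, and qr// is simply Perl's way of storing a pattern in a variable.

```perl
use strict;
use warnings;

# Each entry: a pattern, a string that should match, and one that should not.
my @checks = (
    [ qr/fa.l/,             "unfailing",     "fill"           ],
    [ qr/[sct]on/,          "prison",        "won"            ],
    [ qr/51[3-5]/,          "call 514 now",  "call 516"       ],
    [ qr/Phil(ip)? Lanier/, "Philip Lanier", "Phillip Lanier" ],
    [ qr/bal+/,             "balll",         "ba"             ],
    [ qr/two\swords/,       "two words",     "twowords"       ],
    [ qr/one\Sword/,        "oneXword",      "one word"       ],
);

# Count every case where a pattern behaves differently than expected.
my $failures = 0;
for my $check (@checks) {
    my ($pattern, $should_match, $should_not) = @$check;
    $failures++ unless $should_match =~ $pattern;
    $failures++ if     $should_not   =~ $pattern;
}

print $failures
    ? "$failures unexpected results\n"
    : "All patterns behave as described\n";
```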
Reserved Characters
You should note that the following special characters are reserved, and you must use the escape character (\) if you want to use them literally in a regular expression: . * ? + [ ] ( ) { } ^ $ | \
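To see why escaping matters, here is a minimal sketch that compares an escaped and an unescaped period. The sample strings and variable names are made up for illustration.

```perl
use strict;
use warnings;

my $price = "Total: 3.50";
my $room  = "Room 3150";

# Unescaped, the period is a wildcard, so /3.50/ also matches "3150".
my $loose_hit = $room =~ m/3.50/;

# Escaped, \. matches only a literal period.
my $strict_hit  = $price =~ m/3\.50/;
my $strict_miss = $room  =~ m/3\.50/;

print "Escaped pattern matched only the real price\n"
    if $strict_hit && !$strict_miss;
```

When the search text comes from a variable, Perl's \Q...\E sequence can do the escaping for you, as in m/\Q$strToFind\E/.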
Modifiers
The search strings are case sensitive. Therefore, if $string contains "Phil", $string =~ /phil/ will return false. We can turn the case sensitivity off, however, using a modifier. Modifiers are single characters that go to the right of the last delimiter. The modifier to make the search case-insensitive is /i. So if $string = "DeVhoOD", the following will return true:
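A minimal sketch, assuming the pattern is the all-lowercase "devhood" (an illustrative choice); the two result variables are names introduced here only for comparison:

```perl
use strict;
use warnings;

my $string = "DeVhoOD";

# Without /i the lowercase pattern does not match the mixed-case string.
my $sensitive = $string =~ m/devhood/;

# With /i, letter case is ignored and the match succeeds.
my $insensitive = $string =~ m/devhood/i;

print "Only the /i version matched\n" if $insensitive && !$sensitive;
```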
The example I gave you at the beginning of this tutorial (for counting the number of HTML tags in a string) uses the modifier /g. This is for a global match, which means that each time the regular expression is evaluated, the "cursor" does not go back to the beginning of the string; the search resumes where the previous match ended. If /g were left out of that example, the loop would find the same first tag every time and never end. The /g modifier is also useful for substitution, but I will cover this in a later tutorial. Additionally, the +? is a special sequence of metacharacters that I will cover in the next tutorial. Consider it a teaser.
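To watch the "cursor" move, here is a minimal sketch that uses Perl's built-in pos function, which reports where the next /g search on a string will begin. The sample text and array names are assumptions for illustration.

```perl
use strict;
use warnings;

my $text = "bat cat hat";

# In a while loop, each /g match resumes where the previous one ended.
my @positions;
while ($text =~ m/at/g) {
    push @positions, pos($text);
}

# In list context, /g returns every match at once.
my @words = $text =~ m/[bch]at/g;

print "@positions\n";    # 3 7 11
print "@words\n";        # bat cat hat
```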
There are several other modifiers as well. As with the metacharacters, I would suggest you look at a Perl reference for a complete list.
Conclusion
That's a basic introduction to regular expressions in Perl. It should be enough to get you started writing a few regular expressions on your own and reading some of the simpler expressions that others have written (now you finally know what that crazy =~ thing is!). The next tutorial will cover some of the other main topics you need to know to become comfortable using regular expressions for all of your text manipulation needs.