Simple Regular Expressions

Regular Expression Tutorial

Welcome to the PJ Net Solutions Ltd Regular Expression tutorial. It explains what regular expressions are and how to use them, and ends with a short test where you can put your newfound knowledge into practice.
The tutorial was pretty much a running joke between me and the ColdFusion community, so I wrote it to help ColdFusion programmers write better Regular Expressions. Many languages use regular expressions, Perl being the best known and most widely used, but for ColdFusion programmers they can be a little daunting, to say the least. I cannot guarantee that the regular expressions here will work unchanged in any other language (although the way to build Regular Expressions is pretty much the same whatever you use).

What are Regular Expressions?
Simple Regular Expressions are a powerful tool and, although not vital to producing good code, can become invaluable once a coder realises what can be done with them. They are a way of matching a pattern in a string, either to find it or to do something with it. An example might be where you want to find the word "and" in a string:

There was a banging and a crashing in the house.
                    ^^^

Or you might want to replace all occurrences of the word "television" with the word "TV":

He went over to put the television on.

So that the line then becomes:

He went over to put the TV on.

It is quite possible to create regular expressions that remove tags from an HTML document, check for malicious code in an uploaded file, or make code unreadable while still working (obfuscation).
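To make the two examples above concrete, here is a minimal sketch using Python's `re` module as a stand-in for the ColdFusion engine (that substitution is my assumption, the tutorial itself targets ColdFusion). It finds the word "and" and then swaps "television" for "TV". Python counts positions from 0, so 1 is added to get the position you would count by hand.

```python
import re

sentence = "There was a banging and a crashing in the house."

# Find the first "and". Python positions start at 0, so add 1
# to get the position counted from the start of the string.
match = re.search("and", sentence)
position = match.start() + 1 if match else 0
print(position)  # 21

# Replace every occurrence of "television" with "TV".
line = "He went over to put the television on."
replaced = re.sub("television", "TV", line)
print(replaced)  # He went over to put the TV on.
```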

Why use them?
Some (hard core coders) would say "because you can" and that's almost good enough for me. They are seen as being for those people who really know what they are going on about. However, my personal opinion is that anyone can use them; they just need to be taught how and when it's appropriate.

They do produce really powerful pieces of code that can easily be re-used. This allows for additional functionality to be easily added into an application. For ColdFusion people, there are several code libraries on cflib.org (ColdFusion 5 and up) that use Regular Expressions and are all ready to be slotted into code.

You will find that as you become familiar with Regular Expressions, they will increasingly become an integral part of your coding. Regular Expressions work the same every time (when written well) and can do anything from the very general, e.g. removing all line breaks, to the very specific, e.g. removing all specified swear words in a post to a website.
The first thing to learn about is simple matching of letters and words.

Your First Regular Expression
The simplest form of a regular expression is to say "Find me all a's in a string". So here is a string:

I think Regular Expressions are great!

A little simple I know, but a regular expression to find all a's would be:

"a" or /a/

NOTE: There are delimiters around the regular expression. In ColdFusion a Regular Expression is a string, so quotes are used. In JavaScript and Perl it is common to use a / as a delimiter. In all future examples I will use quotes (as it makes it easier to read), and I will use ColdFusion as the example Regular Expression engine.


So what does this Regular Expression return when matching against the string above? It returns this:

14

The reason for this is that the first occurrence of the letter a is at position 14 in the string (count from the start of the string if you are unsure!).

POINT 1: Inserting a single character into a Regular Expression will match the first occurrence of that character

Simple regular expressions work with letters and numbers. Regular Expressions are case sensitive unless otherwise specified.
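If you want to try this outside ColdFusion, the sketch below mimics the "position of the first match, or 0" behaviour in Python. The helper name `refind` is hypothetical (it is not part of Python's `re` module), and Python's engine is only assumed to agree with ColdFusion's for patterns this simple.

```python
import re

def refind(pattern: str, text: str) -> int:
    """Hypothetical helper: 1-based position of the first match, or 0 if none."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else 0

text = "I think Regular Expressions are great!"

print(refind("a", text))  # 14 (the "a" in "Regular")
print(refind("A", text))  # 0  (case sensitive: there is no upper case "A")
```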

Matching the Start and End of a string
If Regular Expressions only matched letters in a string, then I think it would be pointless me telling you anything further. However, it is about pattern matching, not string matching and so how do you begin to make a pattern? The first thing to know about is matching the start and end of strings. A Regular Expression that includes a ^ at the start says "Find the following Regular Expression at the start of the string":

I think Regular Expressions are great!

The Regular Expression

"^I"

returns a value of 1 whereas the Regular Expression

"^R"

returns a value of 0 because even though there is an "R" in the string, it is not at the start of the string. The ^ must occur at the start of the string to match the start of the string.

POINT 2: ^ matches the start of a string (when placed at the start of the Regular Expression)

The ^ operator has an equivalent operator to match the end of the string. That operator is a $. When a $ is encountered at the end of the Regular Expression, it means that the preceding Regular Expression must be at the end of the string that it is being matched against.

I think Regular Expressions are great!

The Regular Expression

"!$"

returns a value of 38 whereas the Regular Expression

"n$"

returns a value of 0 because the letter n does not occur at the end of the string being matched against even though it does occur in the string itself.

POINT 3: $ matches the end of a string (when placed at the end of the Regular Expression)
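The four results above can be checked with a short Python sketch. As before, `refind` is a hypothetical helper that returns a 1-based position (or 0), and Python's `re` module stands in for the ColdFusion engine.

```python
import re

def refind(pattern: str, text: str) -> int:
    """Hypothetical helper: 1-based position of the first match, or 0 if none."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else 0

text = "I think Regular Expressions are great!"

print(refind("^I", text))  # 1  ("I" is at the start of the string)
print(refind("^R", text))  # 0  (there is an "R", but not at the start)
print(refind("!$", text))  # 38 ("!" is the last character)
print(refind("n$", text))  # 0  (there is an "n", but not at the end)
```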

NOTE: I realise that most regular expression engines allow you to create regular expressions that are not case sensitive, but relying on this does not teach best practice as far as Regular Expressions are concerned. It can lead to poorly written Regular Expressions and can also cause incorrect matches to occur. Once Regular Expressions are understood, using case insensitive matches becomes a lot easier and it is far less likely that the coder will produce poor code. From now on, case sensitivity will be assumed on all regular expressions.

Grouping of characters
Regular Expressions can match the first occurrence of a specified character and you can match with both the start and the end of a string. However, how do you match multiple characters?
This is where Regular Expressions begin to get complicated, so stick with it. The easiest way to match multiple characters is by putting specific characters in a row:

"re"

This will match exactly what you would expect, which is the first occurrence of the substring "re" in the string and return the position at which it occurs (or 0 if it doesn't):

I think Regular Expressions are great!
20

Which turns out to be the "re" in "Expressions". However, this might not be exactly as you expected, because the first "re" is actually the "Re" in "Regular". How do you check for the first occurrence of "Re" or "re"? There are two very distinct but completely correct ways of doing this. Either you say:

[1] Find an upper case "R" OR a lower case "r" and then an "e"
...or...
[2] Find me any occurrence of "Re" or "re"

While these may both look the same, they are different in Regular Expression terms. The first case is basically saying "Find me one of a specified group of characters and then find me another specified character". The second case says "Find me these specific characters in this order, or these specific characters in this order". The way they are written in Regular Expressions is like this:

[1] "[Rr]e"
[2] "(Re|re)"

Now these may look slightly confusing, but they are actually quite simple to understand. The square brackets [] specify a group of characters for a single pattern match. So what is being said in the first case is "Find me one occurrence of an R or an r (character group) and then follow it with an e".

POINT 4: Square Brackets [ ] specify a group of characters of which you are looking for one to match

In the second case there are two new pieces of syntax. The first is the round brackets (). These group what is inside them into a sequence of characters that must be found in that order.

POINT 5: Using () specifies a grouping of characters to be found in that order

The second piece of syntax is the | operator. This is an OR operator and allows a Regular Expression to group sub expressions together and find one or another of them. You can chain any number of | operators together like this:

"(re|ing|pr)"

What the above regular expression says is find me the first occurrence of either "re" or "ing" or "pr".

POINT 6: Using | specifies the Regular Expression to find what is on the Left Side OR the Right side of the bar (an OR operator)

The last example should help you to understand when to use the | operator and when to use character groups using [] brackets.
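Here is a minimal sketch of the patterns from this section run against the same sentence, again using Python's `re` module in place of ColdFusion and a hypothetical `refind` helper that returns a 1-based position (or 0 for no match).

```python
import re

def refind(pattern: str, text: str) -> int:
    """Hypothetical helper: 1-based position of the first match, or 0 if none."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else 0

text = "I think Regular Expressions are great!"

print(refind("re", text))           # 20 (the "re" in "Expressions")
print(refind("[Rr]e", text))        # 9  (the "Re" in "Regular")
print(refind("(Re|re)", text))      # 9  (same match, written with | instead)
print(refind("(re|ing|pr)", text))  # 19 (the "pr" in "Expressions" comes first)
```

Note that the last pattern does not return the position of "re" just because "re" is listed first: the earliest match in the string wins, and here that is "pr".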

Combining what we've learned so far
Regular Expressions can be combined to make a further regular expression. If one regular expression is next to another, then they make a third regular expression which is a combination of both:

[1] Regular Expression: "r"
[2] Regular Expression: "e"
[1] + [2] = "re"

This may not seem like rocket science, but combining character groups using [], groupings of ordered characters using () and the | operator can produce some simple but very powerful regular expressions. Here are some examples:

"[ab][cd]" - matches "ac", "ad", "bc", "bd"
"([ab]c|d)" - matches "ac", "bc", "d"
"a(d|ef|gh[ijk])!" - matches "ad!", "aef!", "aghi!", "aghj!", "aghk!"

While these may look confusing, they are only combinations of smaller, simpler Regular Expressions which are combined. Even the bottom Regular Expression is actually pretty simple! Imagine if you replaced the starting "a" with a character group like "[abcd]", what would happen? It would match all of the matches given, but also match all of those matches with "a" replaced by "b", "c" or "d"! It would give 20 matches instead of 5! Powerful stuff!

POINT 7: Combining simple (), [] and | Regular Expressions can produce new and very powerful Regular Expressions
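One way to convince yourself of these match lists is to test a set of candidate strings against each pattern. The sketch below does that in Python (my choice of language, not the tutorial's); the `matching` helper is hypothetical and uses `re.fullmatch` so that a candidate only counts when the whole string fits the pattern.

```python
import re
from itertools import product

def matching(pattern, candidates):
    """Hypothetical helper: the candidates that the pattern matches in full."""
    return [c for c in candidates if re.fullmatch(pattern, c)]

# Every two-letter combination of a, b, c and d.
pairs = ["".join(p) for p in product("abcd", repeat=2)]
print(matching("[ab][cd]", pairs))  # ['ac', 'ad', 'bc', 'bd']

print(matching("([ab]c|d)", ["ac", "bc", "d", "cc", "ad"]))  # ['ac', 'bc', 'd']

# Build all 20 strings: one of a, b, c, d followed by each possible ending.
tails = ["d!", "ef!", "ghi!", "ghj!", "ghk!"]
widened = [first + tail for first in "abcd" for tail in tails]

print(len(matching("a(d|ef|gh[ijk])!", widened)))       # 5
print(len(matching("[abcd](d|ef|gh[ijk])!", widened)))  # 20
```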

One of the most important things about Regular Expressions is being able to write them. This may seem incredibly obvious, but what I mean by "write" is writing them in English (or whatever language you choose). For example, suppose I wanted to find the first occurrence of a <p> or <br> HTML tag in a document. Here is how I would go about it in words:

[1] Find an "<"
[2] After the "<" must come either a "p" or a "br"
[3] After the "p" must be either a space, " ", or a ">" (assume not XHTML)
[4] After the "br" must come either a space, " ", or a ">" (assume not XHTML)

Why must you have a space after the "p"? Because otherwise you may have found a "<param>" tag and not a "<p>" tag! Why have a space after the "br"? Because you may have a style or class reference in there! So to create the Regular Expression I would do this:

[1] "<"
[2] "<(p|br)"
[3/4] "<(p|br)[ >]"

Note the space in there! A space is interpreted as a character, and so will match a " " in a string. Here is the HTML to test against:

<html>
  <head>
   <title>My Page</title>
  </head>
  <body>

  <h3>My Page</h3>
  <p style="color: red;">This is Red Text<br>This is more Red Text</p>
  </body>
</html>

It should find the first "<p" tag in there (note the style in there), which is at position 55 in the string. And the Regular Expression returns:

55

Which is correct!

POINT 8: Writing Regular Expressions is easiest when fully written down in simple language first
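To see why the final step matters, here is a small Python sketch run against a shorter, made-up HTML fragment (not the page above) that begins with a "<param>" tag. The `refind` helper is hypothetical and returns a 1-based position, or 0 if nothing matches.

```python
import re

def refind(pattern: str, text: str) -> int:
    """Hypothetical helper: 1-based position of the first match, or 0 if none."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else 0

# Made-up test fragment: a <param> tag followed by the <p> tag we actually want.
html = '<param name="x"><p style="color: red;">Red Text<br>More Red Text</p>'

# Step [2] alone is fooled by the "<p" at the start of "<param".
print(refind("<(p|br)", html))      # 1

# Steps [3/4] insist on a space or ">" straight after the tag name.
print(refind("<(p|br)[ >]", html))  # 17
```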

Let's put your newfound knowledge to the test!
Test
The test is to find the first match of any XML tag in a string of XML (assume no attributes) where the tagname:

  • ends with "a" or "h" followed by
    • the letters "or"
    • or "er"
    • or "me"

An example of the tag that this would find would be an <author> tag as the tagname ends in "hor".

Answer
There are several ways to do this. Two possible ways are:

(a|h)([eo]r|me)> and [ah]([oe]r|me)>
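Both answers can be checked against a few sample tags with the Python sketch below. The XML strings are made up for illustration, and `refind` is a hypothetical helper returning a 1-based position (0 when there is no match).

```python
import re

def refind(pattern: str, text: str) -> int:
    """Hypothetical helper: 1-based position of the first match, or 0 if none."""
    match = re.search(pattern, text)
    return match.start() + 1 if match else 0

answers = ["(a|h)([eo]r|me)>", "[ah]([oe]r|me)>"]
xml = "<book><author>Paul</author></book>"

for pattern in answers:
    # Both patterns find the "hor>" that ends the opening <author> tag.
    print(refind(pattern, xml))        # 11
    # "ame>" ends the tag name "name", so this matches too.
    print(refind(pattern, "<name>"))   # 3
    # "title" does not end in one of the required letter groups.
    print(refind(pattern, "<title>"))  # 0
```

Because the patterns only describe the end of the tag name, the position returned is where that ending starts, not where the tag itself opens.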

Next Tutorial
Going deeper into the complexities of Regular Expressions by using special characters and occurrence matching.

All ColdFusion Tutorials By Author: Paul Johnston
  • Simple Regular Expressions
    It's a first in a three part tutorial on Regular Expressions, what they are, simple uses for them and also a test at the end. It goes through simple matching, start and end of strings, grouping and the | (OR) operator.
    Author: Paul Johnston
    Posted Date: Friday, May 16, 2003