Thinking in Regular Expressions

The beauty of working with text is that once you write or read a lot of it, you quickly start to detect underlying patterns. These patterns may have to do with the words themselves, but more often they have to do with how the text is written.

Try an ordinary find operation in a word processor: you will soon be frustrated when you know for certain that the word you want is there, but the search fails because of a misspelling or a difference in capitalisation. Of course you can tick the Ignore Case option, but then an extra space or a run of white space may still prevent the search from finding a phrase which you know is present in the document.

Regular Expressions to the Rescue!

Regular expressions are a language for describing text patterns. Using a text editor that supports regular expressions will make your life easier and make you wonder why this feature does not come by default in every word processor.

Anyway, to think in regular expressions, you start to think about how your words or phrases are composed:

  • Ordinary words are written using the alphabetical characters, either in lower or uppercase.
  • Usually the first character of a sentence or of a proper noun is capitalised.
  • Punctuation marks are used to structure a sentence. For example, a comma indicates a pause, an exclamation mark indicates a command or high pitch, whereas a full stop indicates the end of a sentence.

Starting from this premise, we then need to be familiar with the structure of the document in which we are trying to find something. With a page full of text, or even volumes of it, you need to know where and how to pick out the text before you carry out a search operation.

We usually refer to the needle in a haystack: what you are looking for is the needle, and what you are searching in is a haystack.

How Regular Expressions Work

Regular expression syntax works by classifying text:

Literal characters
These are characters that represent themselves. If you are looking for the term hay, then the letters h, a, and y represent themselves and nothing else.
Special characters
Symbols like the period and other punctuation marks acquire a new meaning; see the section below.
Character classes
A convenient classification of characters based on common text patterns. See the section below.
Quantifiers
Symbols used to specify how many times the preceding character is to be matched. Useful for repetition or restriction of matches.
The backslash
The backslash is the workhorse of regex: placed before certain ordinary characters, it gives them a special meaning; placed before a character that is already special, it takes that meaning away. Thus \A anchors the match to the start of the text, \b means a word boundary, \d means a digit, \s means any whitespace character, \t means a tab, \v means a vertical tab, and so on. The exact set of escapes varies a little from one regex engine to another.
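The two directions of the backslash can be sketched as follows. This is a minimal illustration using Python's re module, which is my choice of engine and not one the article prescribes; the sample haystack is made up.

```python
import re

# A made-up haystack; "\t" in an ordinary Python string is a real tab.
haystack = "Room 42\tis open. Cost: 3.50"

# \d gives the ordinary letter d a special meaning: any digit.
digits = re.findall(r"\d", haystack)

# \t in the pattern stands for a tab character.
has_tab = re.search(r"\t", haystack) is not None

# \. takes the special meaning away from the period: a literal full stop.
literal_dots = re.findall(r"\.", haystack)

print(digits)        # ['4', '2', '3', '5', '0']
print(has_tab)       # True
print(literal_dots)  # ['.', '.']
```

The patterns are written as raw strings (the r prefix) so that Python passes the backslashes to the regex engine untouched.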

The full-stop or period

The period (.) means that any character is matched (in most engines, any character except a line break). This is the first point of confusion if you are new to regex: trying to find a word that ends with a period may throw up surprising results.

So, for example, typing “fo.d” in a regex engine will find both food and fold, because the dot stands for any single character between o and d. It will also match inside longer words such as folder or foods. It will not match formed, however, because the dot stands for exactly one character and formed has three characters between o and d.

If you want to match the period itself, you have to type a backslash before it. Thus, to find the actual period after the e in “lime.”, type lime\.
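Here is a small sketch of the dot and the escaped dot, again assuming Python's re module as the engine. The word list is made up for illustration.

```python
import re

words = ["food", "fold", "ford", "formed", "fod"]

# fo.d: f, o, exactly one character of any kind, then d.
# "formed" and "fod" fail because the dot stands for one character only.
dot_matches = [w for w in words if re.search(r"fo.d", w)]

# An unescaped dot matches any character; an escaped dot only a full stop.
loose = re.findall(r"lime.", "limes and lime.")
strict = re.findall(r"lime\.", "limes and lime.")

print(dot_matches)  # ['food', 'fold', 'ford']
print(loose)        # ['limes', 'lime.']
print(strict)       # ['lime.']
```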

Quantifiers

These are symbols that say how many times the preceding character may be matched. The following are some quantifiers and their meanings:

  • The question mark ? means match the previous character zero or one times. In ordinary English, the question mark means to match at most once.
  • The star * means to match the preceding character zero or more times. In other words, the match succeeds whether or not the character is present. The star also matches as many times as possible, which is why it is often called a greedy quantifier: it keeps consuming characters for as long as they match, so a pattern such as .* can swallow far more of the haystack than you intended unless you limit it in some other way.
  • The plus sign + means to match the preceding character at least once. A plus can be thought of as a star, except that where a star succeeds even with zero occurrences, the plus needs at least one. If there is none, a plus will not match.
  • Numeric quantifiers {}: use these when you need to specify the number of matches. If you want to find God and not good (ignoring case), you want a single o in the string, not two. To do that, you type go{1}d.

      The punctuation quantifiers above can also be written with braces, as the table below shows. To say at least, enter the number followed by a comma before the closing brace. For example, to say at least three times you type {3,}

To say at most, you add a second number after the comma. The full brace syntax is therefore x{n,m}, where n is the minimum and m the maximum number of times x may repeat.

So to find at least two characters, but no more than five, we type .{2,5} (see footnote 1).

Getting back to our punctuation quantifiers, the table below shows their equivalents using the number quantifiers:

Symbol   Meaning                                 Numeric equivalent
*        At least zero times, at most any number {0,}
?        At least zero times, at most one time   {0,1}
+        At least one time                       {1,}
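The quantifiers and their brace equivalents can be tried out with a short sketch like the one below. It assumes Python's re module; the helper whole_words and the sample words are hypothetical names of my own, chosen so that only the number of o's varies.

```python
import re

haystack = "gd god good goood gooood"

def whole_words(pattern):
    """Return the words in haystack that the pattern matches completely."""
    return [w for w in haystack.split() if re.fullmatch(pattern, w)]

print(whole_words(r"go?d"))      # ['gd', 'god']
print(whole_words(r"go*d"))      # ['gd', 'god', 'good', 'goood', 'gooood']
print(whole_words(r"go+d"))      # ['god', 'good', 'goood', 'gooood']
print(whole_words(r"go{1}d"))    # ['god']
print(whole_words(r"go{3,}d"))   # ['goood', 'gooood']
print(whole_words(r"go{2,3}d"))  # ['good', 'goood']

# Each punctuation quantifier selects the same words as its brace form.
print(whole_words(r"go*d") == whole_words(r"go{0,}d"))   # True
print(whole_words(r"go?d") == whole_words(r"go{0,1}d"))  # True
print(whole_words(r"go+d") == whole_words(r"go{1,}d"))   # True
```

re.fullmatch is used here so that each word has to match from start to finish, which makes the effect of each quantifier easy to read off.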

Character Classes

Besides literal characters and quantifiers, regular expression syntax also recognises character classes. Character classes make it possible to find text based on common groupings of characters. Before we look at the built-in ones, note that we can also create our own character classes in regular expressions:

  • Simply use square brackets [] and place whichever characters you like inside them. These characters form a new class of your own. The good news is that most of the special symbols lose their meaning inside the square brackets.
  • Bear in mind that inside these brackets, certain positions are special to the regex engine, so some characters change their meaning depending on where they are. For example, directly after the opening bracket, the caret symbol ^ (which you get by shifting the 6 on the number row) means that the characters in this class are not to be matched. So if you type Z[^ao]mb you are telling the engine that it may find words such as “Zimbabwe” but not “Zambia” or “Zomba”.

    The dash (or hyphen), written as -, means a range whenever it sits between two characters. A range, as in statistics, means from this point to that point. So typing [a-z] means any lowercase character from a to z, and typing [2-4] means any digit from 2 up to 4. If you put the dash at the beginning or end of the character class, it loses that meaning and stands for a literal dash.

    The closing bracket ] is an interesting case. If you want to find it, you have to place it at the very beginning of the character class (many engines also let you escape it with a backslash instead).

  • A key point to note about the character class is that it matches at only a single position in the string. If you type a[fmu]rica, you placed three characters inside the brackets, but you are telling the regex engine to look for just one character, which could be f, m or u, between a and r. Ignoring case, this search will find Africa but not America.

So let me underscore this point once again: on its own, a character class matches exactly one character. It matches more than one only when a quantifier follows it, and then exactly as many as the quantifier allows.
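The bracket rules above can be sketched as follows, again assuming Python's re module. The list contents and variable names are my own; "Amrica" is a made-up string added only to show that a class fills one position.

```python
import re

places = ["Zimbabwe", "Zambia", "Zomba"]
# [^ao] directly after Z: any one character except a or o.
not_a_or_o = [p for p in places if re.search(r"Z[^ao]mb", p)]

# A dash between two characters is a range; first or last, it is literal.
two_to_four = re.findall(r"[2-4]", "123456")
literal_dash = re.findall(r"[-x]", "a-x-b")

# A closing bracket placed first in the class is taken literally.
brackets = re.findall(r"[]x]", "a]x")

# One class fills exactly one position between a and r.
names = ["Africa", "America", "Amrica"]
one_position = [n for n in names if re.search(r"a[fmu]rica", n, re.IGNORECASE)]

print(not_a_or_o)    # ['Zimbabwe']
print(two_to_four)   # ['2', '3', '4']
print(literal_dash)  # ['-', 'x', '-']
print(brackets)      # [']', 'x']
print(one_position)  # ['Africa', 'Amrica']
```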

Besides these classes you can craft using the square brackets, you can use the other classes that are built in. These include:

Name of class   Meaning                             Equivalent in bracket notation
alnum           Alphabetic and numeric characters   [a-zA-Z0-9]
alpha           Alphabetic characters               [a-zA-Z]
digit           Digits                              [0-9]

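To use one of these named classes, you surround its name with colons and square brackets and place the result inside another pair of square brackets. For example, you write [[:digit:]]. This notation comes from POSIX and is understood by tools such as grep, sed and Vim, but not by every regex engine.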
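Python's re module is one engine that does not accept the [[:name:]] notation, so the sketch below falls back on the bracket equivalents from the table above. The dictionary name EQUIVALENTS and the sample haystack are my own, and the classes cover ASCII letters and digits only.

```python
import re

# Bracket equivalents of the named classes (ASCII only).
EQUIVALENTS = {
    "alnum": r"[a-zA-Z0-9]",
    "alpha": r"[a-zA-Z]",
    "digit": r"[0-9]",
}

haystack = "Flat 7b, 2nd floor!"

# Join the single-character matches so the result is easy to read.
found = {name: "".join(re.findall(pattern, haystack))
         for name, pattern in EQUIVALENTS.items()}

print(found["alnum"])  # Flat7b2ndfloor
print(found["alpha"])  # Flatbndfloor
print(found["digit"])  # 72
```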

Anchors

The regex syntax also supports finding characters based on where they are in a string. It is possible to look only for whole words and not for parts of words. If you want to find Zim as a word and not as part of a word like Zimbabwe, you need word boundaries.

Of interest to us are word boundaries and line anchors.

Word boundaries

A backslash followed by the letter b marks a word boundary. So if you enter \bzim\b with case ignored, you will find Zim only as a whole word.

If you are using either Vim or Emacs, you can also use the backslash followed by a less than symbol for start of a word, or a greater than symbol to indicate the end of a word. For example, \<main\> will match “main” on its own rather than as part of a word like “maintain”.
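A minimal sketch of the \b boundary, assuming Python's re module (which supports \b but not the \< and \> forms). The sample sentence is made up.

```python
import re

haystack = "Zim is short for Zimbabwe; zim also appears in zimmer."

# Without boundaries, zim is found inside longer words as well.
anywhere = re.findall(r"zim", haystack, re.IGNORECASE)

# With \b on both sides, only the stand-alone word is found.
whole_word = re.findall(r"\bzim\b", haystack, re.IGNORECASE)

print(anywhere)    # ['Zim', 'Zim', 'zim', 'zim']
print(whole_word)  # ['Zim', 'zim']
```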

Line anchors

If your string has several lines, you can tell the regex engine to match only at the beginning or at the end of a line. The caret symbol (^) means the start of a line, while the dollar symbol ($) means the end of a line. Note that text editors and engines are not entirely consistent here: some treat ^ and $ as the start and end of the whole text unless a multi-line option is switched on. Either way, these symbols match a position rather than a character. In a search and replace operation you can therefore use them to insert text at the start or end of a line, and replacing them does not remove the line's start or end.
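The behaviour of the line anchors can be sketched as follows, assuming Python's re module, where ^ and $ apply to every line only when the re.MULTILINE flag is given. The three-line haystack is made up.

```python
import re

haystack = "hay needle\nneedle hay\nhay"

# Without re.MULTILINE, ^ only matches at the very start of the text.
start_of_text = re.findall(r"^needle", haystack)

# With re.MULTILINE, ^ and $ match at the start and end of every line.
line_starts = re.findall(r"^needle", haystack, re.MULTILINE)
line_ends = re.findall(r"hay$", haystack, re.MULTILINE)

# Replacing ^ inserts text at each line start and removes nothing.
quoted = re.sub(r"^", "> ", haystack, flags=re.MULTILINE)

print(start_of_text)  # []
print(line_starts)    # ['needle']
print(line_ends)      # ['hay', 'hay']
print(quoted)
```

The last call prints the three original lines, each prefixed with "> ", which shows that an anchor stands for a position and not for any character in the line.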

Conclusion

The subject of regular expressions is vast, and it is not possible to cover it in one post. This was an attempt to encourage you to think in regex and see its possibilities. In the next instalment of this category, we will look at search and replace operations.

Footnotes:

1. Remember that the dot here means any character.