What's all that About File Globbing and Regular Expressions?
Text patterns are among the most fascinating and illuminating things one comes across: being able to decode them makes it possible to find whatever you are looking for. In this post, I am going to show the difference between globbing and regex.
We often mix the two up because they use largely the same set of characters, sometimes with different meanings.
Text patterns are known as templates. Thus, a text pattern makes it possible to:
- Think of what you want to look for;
- Generalise it in terms of its spelling.
This way, it becomes possible to classify your search in terms of patterns.
This process is known as searching for a needle in a haystack, and is employed by both the shell and regular expressions.
Wildcard characters and Text Patterns
To make sense of this discussion, I think we need to talk about wildcard characters. These are symbols (such as punctuation marks) that have special meaning to the shell or a regular expression text processor.
- By default, whichever engine is carrying out the search uses the ordinary alphanumeric characters you enter to match against whatever it is going through. For the shell, this is the filesystem; for the regex engine, it is the string.
- Other than the alphanumeric characters, the following characters are special for both globbing and regex: *, ?, [, ], \. (The + character is also special in regex, but in plain shell globbing it is an ordinary character.)
- As you will see below, the meaning of these characters is the same except for a few cases that this post looks at.
Let us start by looking at what file globbing is.
The way your shell interprets wildcard characters to match filenames may simply be thought of as globbing. Thus,
If you type a filename as part of a command, usually when passing it to a utility that expects a file, the shell looks at the characters in that filename argument and tries to match them against the files in the directory.
The match is tested character by character, and if every character agrees exactly, the file is found.
However, there are some characters that mean something else: think of these characters as creating a particular class of your alphanumeric characters.
With file globbing, the following wildcard characters create special character classes. A character class here means that one character (the wildcard) can stand for any of the characters in the class. Thus,
*
Zero or more characters. Every character falls in this class.
?
One missing character. Exactly one character, which can be any of the available characters.
[ and ]
Create a character class: look for exactly one character, as long as it is in this class.
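The behaviour of * can be tried out without touching the filesystem. The sketch below uses Python's fnmatch module, which applies shell-style wildcard rules to plain strings; it is a stand-in for the shell rather than the shell itself, and the filenames are hypothetical, invented for the illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical filenames, invented for this illustration.
names = ["notes.txt", "a.txt", "report.pdf", "txt"]

# "*" stands for zero or more characters, so "*.txt" keeps
# every name that ends in ".txt", whatever comes before it.
matches = [name for name in names if fnmatchcase(name, "*.txt")]

print(matches)
```

Note that "txt" is rejected: the period in the pattern is an ordinary character and must be present in the name.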
For example, a pattern that begins with * and is passed to ls means that the ls utility is to list only those files with any letters and digits, of whatever length, that end in whatever follows the *.
On the other hand, meeting-??.txt means that we are looking for a file that has the characters meeting- followed by exactly two characters, then a .txt at the end.
So a file like “meeting-01.txt” or “meeting-bt.txt” will match, but neither “meeting01.txt” nor “meeting-abc.txt” will.
So you can understand a question mark as being a placeholder for the number of missing characters to be filled in by the shell. If there is one question mark, only one character will be filled in.
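The four filenames above can be checked mechanically. This is a minimal sketch, assuming the pattern being described is written meeting-??.txt, and again using Python's fnmatch module as an approximation of the shell's matching rules.

```python
from fnmatch import fnmatchcase

# Assumed spelling of the pattern described in the text:
# "meeting-", then exactly two characters, then ".txt".
pattern = "meeting-??.txt"

candidates = [
    "meeting-01.txt",
    "meeting-bt.txt",
    "meeting01.txt",    # no hyphen before the two characters
    "meeting-abc.txt",  # three characters where two are expected
]

# Map each candidate name to whether the pattern accepts it.
results = {name: fnmatchcase(name, pattern) for name in candidates}

for name, matched in results.items():
    print(f"{name}: {'match' if matched else 'no match'}")
```

Each ? has to be filled by exactly one character, which is why the last two names are rejected.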
Of interest are the [ and ] square brackets: these behave almost the same as they do under regular expressions. They create a class.
So let us say we are looking for a file called “cats.txt” or “rats.txt”. The difference is in the first letter, either c or r.
We need to create a character class of two letters, c and r. To do that, we must place them inside the square brackets like this: [cr].
So we have to type: [cr]ats.txt
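To see a character class working against real files, the sketch below creates a throwaway directory and globs inside it with Python's glob module. The pattern [cr]ats.txt is assumed from the description above, and the extra files (bats.txt, crats.txt) are hypothetical decoys added for the illustration.

```python
import glob
import os
import tempfile

# Create a temporary directory holding a few empty, hypothetical files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ["cats.txt", "rats.txt", "bats.txt", "crats.txt"]:
        open(os.path.join(tmp, name), "w").close()

    # "[cr]" matches exactly one character, which must be "c" or "r".
    # glob.escape() keeps the directory path itself from being read as a pattern.
    pattern = os.path.join(glob.escape(tmp), "[cr]ats.txt")
    found = sorted(os.path.basename(path) for path in glob.glob(pattern))

print(found)
```

bats.txt is skipped because b is not in the class, and crats.txt is skipped because the class stands for one character, not two.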
While globbing works with filenames and file paths, regular expressions work with any text, usually the text inside files.
Most text editors support regular expressions. A regular expression also uses a pattern, but what matters for this post is how it differs from globbing:
- Support for more wildcard characters
- Regular expressions have more wildcard characters than globbing. For instance, while the period character, ., has no special meaning in shell globbing, it stands for any character (except the newline) in regex.
- Different meaning for the question mark
- In regex, a question mark has a different meaning from the one it has in shell globbing. In file globbing, the question mark is simply a placeholder: it just means “in place of this exact number of missing characters”, whereas in regular expressions it means zero or one occurrence of the character before it. Thus, if we are looking for the word colour spelt in either American or British English, we would type colou?r, which means: find colour with or without a u in it.
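Both regex differences can be checked with Python's re module. This is a small sketch: the pattern colou?r is assumed from the description above, and the sample sentence and the c.t pattern are invented for the illustration.

```python
import re

# In regex, "?" makes the character before it optional: zero or one "u".
spelling = re.compile(r"colou?r")
text = "The color red and the colour blue."
spellings_found = spelling.findall(text)
print(spellings_found)

# In regex, "." stands for any single character except a newline.
dot = re.compile(r"c.t")
words = ["cat", "cut", "c.t", "ct", "c\nt"]
dot_hits = [word for word in words if dot.fullmatch(word)]
print(dot_hits)
```

"ct" fails because the period must consume one character, and "c\nt" fails because that character may not be a newline.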
Although globbing is different from regexes, it is important, in my opinion, to think of their similarities rather than their differences:
- Both are meant for building a text pattern to use during a search;
- Both use some wildcard characters to denote a class of characters.
- Their differences lie in the fact that globbing is more for files than their contents. As a result, some characters (such as the period) that often form part of filenames do not take on a new meaning as they do with regexes.
- At the end of the day, it is important to master the syntax of both shell globing and that of regular expressions as a way to maximise your productivity.
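One way to feel the contrast is to hand the same pattern text to both engines. The sketch below is an illustration under assumptions: the filename file1.txt and the pattern file?.txt are hypothetical, and Python's fnmatch module stands in for the shell.

```python
import re
from fnmatch import fnmatchcase, translate

name = "file1.txt"  # hypothetical filename

# Read as a glob: "?" is one required character, "." is a literal period.
glob_result = fnmatchcase(name, "file?.txt")

# Read as a regex: "e?" is an optional "e" and "." is any character,
# so the very same text no longer matches "file1.txt".
regex_result = re.fullmatch(r"file?.txt", name) is not None

# fnmatch.translate() rewrites a glob as an equivalent regex.
# The exact string it returns can differ between Python versions.
equivalent = re.compile(translate("file?.txt"))
translated_result = equivalent.fullmatch(name) is not None

print(glob_result, regex_result, translated_result)
```

The translation step shows that the two notations describe the same kind of pattern; only the spelling of the wildcards changes.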
Thanks for reading through this post, and happy searching!