Learn the basics of regular expressions in a simple and practical way.
Regular expressions are coded text patterns that are used to find matches in text. For example, we can use a regular expression to find a phone number in a text file. (We’ll see how to do this shortly.)
Many programming languages, such as Java, JavaScript, and C#, have regex implementations, but for this tutorial we’ll use a web application called RegEx Pal.
In this article, we’ll use the most basic concepts of regular expressions to find phone numbers in the format (XX)XXXXX-XXXX. We’ll see that there are many ways to do this. We’ll start with the simplest methods and work our way up to more complex ones as we learn more concepts.
How to use RegEx Pal
The figure below shows the RegEx Pal website (https://www.regexpal.com) with two fields highlighted. The first, Regular Expression, is where we write our expression; and the second, Test String, is the text in which we will test our expression.
String literal
Open RegEx Pal and write the phone number, initially without the area code, in the lower field (Test String):
97265-8610
Now let’s write our first regular expression that matches this number. Write the exact same number in the top field (Regular Expression):
97265-8610
We just used something called a string literal in this expression, that is, a literal representation of the string we want to find.
Now replace the expression with just the number 6. This time only the sixes are selected, giving us two matches.
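The same experiment can be reproduced outside RegEx Pal. The following is a minimal sketch that assumes Python's built-in re module as the regex engine (the article itself uses RegEx Pal); the variable names are illustrative only.

```python
import re

# The test string from the tutorial (no area code).
test_string = "97265-8610"

# A string literal matches only the exact same sequence of characters.
full_match = re.findall(r"97265-8610", test_string)

# The single-character literal 6 matches every 6 in the test string.
sixes = re.findall(r"6", test_string)

print(full_match)  # ['97265-8610']
print(len(sixes))  # 2
```

Here re.findall returns every non-overlapping match, which corresponds to the highlighted matches in RegEx Pal.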
But what if we want our expression to match other phone numbers?
Matching digits with a character class
Try this expression:
[0-9]
All digits are highlighted (9 matches). What this expression tells the regex processor is, “Find any digit in the range 0 to 9.”
An expression of this form is called a character class or sometimes a character set.
It is important to understand that in this example the brackets are not treated literally because they are metacharacters. In regular expressions, a metacharacter has a special meaning and is reserved. We will look at other metacharacters shortly.
You can change the range to match only the digits between 1 and 5, for example:
[1-5]
Or you can list the digits explicitly. Say you want to find only the digits 1, 5, 6, and 9:
[1569]
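To see what each of these character classes selects, here is a short sketch that assumes Python's re module in place of RegEx Pal. It runs the three classes against the tutorial's test string; the variable names are hypothetical.

```python
import re

test_string = "97265-8610"

# A range: any digit from 0 to 9.
all_digits = re.findall(r"[0-9]", test_string)

# A narrower range: only the digits 1 through 5.
one_to_five = re.findall(r"[1-5]", test_string)

# An explicit list: only the digits 1, 5, 6, and 9.
listed = re.findall(r"[1569]", test_string)

print(len(all_digits))  # 9
print(one_to_five)      # ['2', '5', '1']
print(listed)           # ['9', '6', '5', '6', '1']
```

The hyphen between the two halves of the number is never returned, because it is not a member of any of these classes.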
To find any phone number separated by a hyphen, like the one in our example from the previous section, we can use this expression:
[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]
While it works, this expression is long and repetitive, and there is a much better way to do what we want.
Using a character shorthand
Another way to find digits is to use the expression \d, which means, “Find any digit” and, in most regex engines, is equivalent to [0-9]. This type of regular expression is called a character shorthand.
This is the equivalent expression to the one used at the end of the previous section to find any phone number:
\d\d\d\d\d-\d\d\d\d
Notice that this character shorthand is formed by a backslash followed by a lowercase d. Here the backslash serves as an escape character, that is, it changes the meaning of the character that follows it. Without the backslash, d would be a literal that matches only the letter d. Also notice that we are using the hyphen literally. But what if we wanted to find phone numbers with a separator other than the hyphen? In that case, we can use \D, which means, “Find any character that is not a digit.”
This expression uses \D instead of the literal hyphen:
\d\d\d\d\d\D\d\d\d\d
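The three variants seen so far can be compared side by side. This is a sketch under the assumption that Python's re module stands in for RegEx Pal; the sample strings other than 97265-8610 are made up for illustration. Note that in Python, \d on text strings also matches non-ASCII digits unless the re.ASCII flag is passed, so the flag is used here to keep \d identical to [0-9].

```python
import re

# Character-class form, shorthand form, and the form with \D as separator.
long_form = re.compile(r"[0-9][0-9][0-9][0-9][0-9]-[0-9][0-9][0-9][0-9]")
shorthand = re.compile(r"\d\d\d\d\d-\d\d\d\d", re.ASCII)
any_separator = re.compile(r"\d\d\d\d\d\D\d\d\d\d", re.ASCII)

# Illustrative samples: hyphen, period, and space as separators.
samples = ["97265-8610", "97265.8610", "97265 8610"]

for sample in samples:
    print(
        sample,
        bool(long_form.fullmatch(sample)),
        bool(shorthand.fullmatch(sample)),
        bool(any_separator.fullmatch(sample)),
    )
```

Only the hyphenated sample satisfies the first two expressions, while the \D version accepts all three separators. It rejects a number with no separator at all, because \D must consume exactly one non-digit character.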
Matching any character
In regular expressions, the period (.) is a kind of wildcard and will find any character (except, in some cases, a newline character, such as the line feed). This is what our expression using the period would look like:
\d\d\d\d\d.\d\d\d\d
This expression finds phone numbers that have a hyphen or other type of separator, such as the period, but it also finds numbers with @, #, %, etc. Do some tests in RegEx Pal.
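The tests suggested above can also be scripted. The sketch below assumes Python's re module rather than RegEx Pal, and the sample strings are invented for illustration. It also shows the newline exception: in Python the period skips the line feed unless the re.DOTALL flag is set.

```python
import re

wildcard = re.compile(r"\d\d\d\d\d.\d\d\d\d")

# Illustrative samples with different characters in the separator position.
samples = ["97265-8610", "97265.8610", "97265@8610", "97265#8610", "9726508610"]
for sample in samples:
    print(sample, bool(wildcard.fullmatch(sample)))

# By default the period does not match a line feed...
newline_default = wildcard.fullmatch("97265\n8610")
# ...but it does when the DOTALL flag is passed.
newline_dotall = re.fullmatch(r"\d\d\d\d\d.\d\d\d\d", "97265\n8610", re.DOTALL)

print(newline_default is None)     # True
print(newline_dotall is not None)  # True
```

The last sample in the list matches as well, because the period is just as happy to consume a digit as a separator.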
Capturing groups and back references
Now we’ll see how to create capturing groups and refer to these groups using a back reference. These concepts are a bit confusing, so it’s important that you follow the examples for a complete understanding. In the bottom field of RegEx Pal, type the number 505; now type this regular expression in the top field:
(\d)\d\1
The processor found a match. Let’s take a closer look at what this expression does:
- (\d) is our capture group. It was created by wrapping \d in parentheses (note that parentheses are also metacharacters) and what it does is find and capture the first digit, which in our example is the number 5;
- \d finds the next digit (number 0);
- \1 refers to the first captured digit (number 5). We can have more groups in an expression; the number refers to the order in which the group appears. If we want to refer to a second group we use \2, to a third group we use \3…
Now, change the number 505 to 507. No match is found because the digit captured by the group is 5, so \1 must also be 5; for this input the expression behaves like 5\d5.
Let’s do one last example. Change 507 to 8080 and use this expression:
(\d)(\d)\1\2
A match is found. Notice that we have two capturing groups and we reference both of them.
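The capturing-group examples above can be checked programmatically. This is a minimal sketch that assumes Python's re module as the engine instead of RegEx Pal; it prints what each group captured so the role of \1 and \2 is visible.

```python
import re

one_group = re.compile(r"(\d)\d\1")
two_groups = re.compile(r"(\d)(\d)\1\2")

# 505: group 1 captures 5, and \1 requires the third digit to be 5 again.
match_505 = one_group.search("505")
print(match_505.group(0), match_505.group(1))  # 505 5

# 507: the third digit is not the captured 5, so there is no match.
match_507 = one_group.search("507")
print(match_507)  # None

# 8080: group 1 captures 8, group 2 captures 0, then both are repeated.
match_8080 = two_groups.search("8080")
print(match_8080.groups())  # ('8', '0')
```

group(0) is the whole match, while group(1) and group(2) hold the text captured by the first and second pair of parentheses.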
Quantifiers
Try this other way to find our phone number:
\d{5}-?\d{4}
A number enclosed in curly brackets defines a quantifier. The number states exactly how many times the preceding item, here \d, must repeat, and the curly brackets are metacharacters.
The question mark (?) after the literal hyphen is also a quantifier and indicates that its predecessor, in this example the hyphen, may appear only once or not at all (zero or one). There are two more quantifiers: the plus sign (+), which means “one or more,” and the asterisk (*), which means “zero or more.”
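The four quantifiers described above can be tried out in a few lines. The sketch below assumes Python's re module in place of RegEx Pal; the inputs other than the tutorial's phone number are illustrative.

```python
import re

phone = re.compile(r"\d{5}-?\d{4}")

# {5} and {4} fix the digit counts; -? allows zero or one hyphen.
with_hyphen = phone.fullmatch("97265-8610")
without_hyphen = phone.fullmatch("972658610")
double_hyphen = phone.fullmatch("97265--8610")

print(with_hyphen is not None)     # True
print(without_hyphen is not None)  # True
print(double_hyphen is None)       # True

# + needs at least one digit, while * also accepts an empty string.
plus_on_empty = re.fullmatch(r"\d+", "")
star_on_empty = re.fullmatch(r"\d*", "")

print(plus_on_empty is None)       # True
print(star_on_empty is not None)   # True
```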
See this other expression using quantifiers:
(\d{4,5}[.-]?)+
In short, what this expression does is search for between four and five digits, followed or not by a period or hyphen, and the plus sign outside the parentheses indicates that this entire set may appear one or more times.
To avoid any doubts, let’s analyze each of the characters in this expression again:
- ( opens a capture group;
- \ start of the character shorthand (the backslash is an escape character and as such it changes the meaning of the character to its right);
- d end of the character shorthand (\d searches for any digit between 0 and 9);
- { opening quantifier;
- 4 minimum quantity;
- , separates the minimum and maximum quantities;
- 5 maximum quantity;
- } closes the quantifier;
- [ opens the character class;
- . literal period;
- - literal hyphen;
- ] closes the character class;
- ? quantifier of zero or one;
- ) closes the capture group;
- + quantifier of one or more.
Note that this expression matches any run of four or five digits, which may or may not represent a phone number. We can be a little more specific:
(\d{5}[.-]?)\d{4}
This will find five digits followed or not by a period or hyphen and finally the last four digits.
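The difference between the loose and the more specific expression shows up clearly on inputs that are not phone numbers. The following sketch assumes Python's re module rather than RegEx Pal, and the non-phone samples (1234 and 2024-2025) are made up for illustration.

```python
import re

loose = re.compile(r"(\d{4,5}[.-]?)+")
strict = re.compile(r"(\d{5}[.-]?)\d{4}")

# Both expressions accept the tutorial's phone number.
print(loose.fullmatch("97265-8610") is not None)   # True
print(strict.fullmatch("97265-8610") is not None)  # True

# The loose expression also accepts a lone group of four digits
# and a pair of years, which are clearly not phone numbers.
print(loose.fullmatch("1234") is not None)         # True
print(loose.fullmatch("2024-2025") is not None)    # True

# The strict expression rejects both.
print(strict.fullmatch("1234") is None)            # True
print(strict.fullmatch("2024-2025") is None)       # True
```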
Quoting literals
Finally, we will see an expression to find a phone number that may or may not have an area code. This code, when present, can be enclosed in parentheses, and we must make sure that the expression defines them literally. Here are some examples of phones we want to find:
97265-8610
97265.8610
(11)97265-8610
1197265-8610
11972658610
Test these numbers in RegEx Pal with this expression:
^(\(\d{2}\)|^\d{2})?\d{5}[.-]?\d{4}$
This expression is quite complex, so let’s analyze the role of each character:
- ^ (caret) at the beginning of the regular expression or after the vertical bar (|) means that the phone number must be at the beginning of a line;
- ( opens a capturing group;
- \( is a literal opening parenthesis (note the backslash so that ( is recognized as a literal instead of a metacharacter);
- \d searches for a digit;
- {2} is a quantifier that, after \d, indicates that exactly two digits must be found;
- \) is a literal closing parenthesis;
- | (vertical bar) indicates a choice of alternatives. In our example, the vertical bar together with the two expressions that surround it mean “find an area code with or without parentheses”;
- ^ indicates the beginning of a line;
- \d searches for a digit;
- {2} is a quantifier that searches for exactly two digits;
- ? indicates that the previous group is optional (zero or one), that is, the area code is not required;
- \d searches for a digit;
- {5} is a quantifier that searches for exactly five digits;
- [.-]? searches for an optional period or hyphen;
- \d searches for a digit;
- {4} is a quantifier that searches for exactly four digits;
- $ finds the end of a line. This means that there should be no more characters after the phone number.
Try different forms of phone number to see what comes up.
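As a starting point for these experiments, here is a sketch that assumes Python's re module as the engine instead of RegEx Pal. It runs the final expression against the five sample numbers from this section, plus three deliberately malformed numbers that are invented here for illustration.

```python
import re

# The final expression from this section, unchanged.
PHONE = re.compile(r"^(\(\d{2}\)|^\d{2})?\d{5}[.-]?\d{4}$")

# The five sample numbers listed in the article.
should_match = [
    "97265-8610",
    "97265.8610",
    "(11)97265-8610",
    "1197265-8610",
    "11972658610",
]

# Hypothetical malformed inputs: too few digits, an unclosed
# parenthesis, and one digit too many at the end.
should_not_match = [
    "7265-8610",
    "(11 97265-8610",
    "97265-86100",
]

for number in should_match + should_not_match:
    print(f"{number!r}: {bool(PHONE.search(number))}")
```

Because the expression is anchored with ^ and $, search only succeeds when the whole string is a phone number, so the extra trailing digit in the last sample is enough to reject it.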