Python Regular Expression Basics

Regular expressions are one of those topics that confuse even advanced programmers. Many people have programmed professionally for years without getting to grips with them. Too often, people just copy and paste regexes from Stack Overflow or other websites without really understanding what’s going on. In this article, I’m going to explain regular expressions from scratch and introduce you to Python’s implementation of them in the re module.

Regular expressions describe sets of strings

A regular expression is a description of a set of strings. Regular expression matching is a method of finding out if a given string is in the set defined by a certain regular expression. Regular expression search is a method of finding occurrences of strings belonging to that set inside a larger string. Python’s re module provides facilities for search, matching and replacing matched substrings with something else.

The simplest regular expression is just a sequence of ordinary characters. Ordinary characters are those characters that do not have a special meaning in the regular expression syntax.

The re.match function returns a match object if the pattern matches at the start of the text. Otherwise it returns None.
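As a minimal sketch, here is what matching the word “cheese” might look like. The sample strings are my own illustration.

```python
import re

# A pattern made up only of ordinary characters
pattern = r"cheese"

# The text matches the pattern, so we get a match object back
match = re.match(pattern, "cheese")
print(match)

# The text does not match, so we get None
no_match = re.match(pattern, "biscuits")
print(no_match)
```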

Notice how I put the r prefix before the pattern string. Sometimes the regular expression syntax involves backslash-escaped characters that coincide with escape sequences. To prevent those portions of the string from being interpreted as escape sequences, we use the raw r prefix. We don’t actually need it for this pattern, but I always put it in for consistency.

To search in a larger string, we use re.search.
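Here is a small sketch of re.search finding a word in the middle of a longer string, where re.match finds nothing. The sample sentence is an assumption of mine, chosen only for illustration.

```python
import re

text = "I like cheese on toast"  # illustrative sample text

# re.match only looks at the start of the string, so it finds nothing here
at_start = re.match(r"cheese", text)

# re.search scans the whole string for the first occurrence
found = re.search(r"cheese", text)
print(found.group())  # the matched substring
print(found.span())   # (start, end) positions of the match
```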

So far this is not very useful. We are just matching strings against other strings, which can be achieved more easily with == and in.

However, regular expressions really come into their own when we start using sets of characters and repetitions.

Sets of characters and repetitions

Let’s say we don’t just want to match the string “cheese”, but any string of lowercase alphabetic characters that is six characters long. In that case we can use the pattern “[a-z]{6}”. The bit in the square brackets – [a-z] – means that we should match any lowercase alphabetic character from a to z. The bit in the curly brackets – {6} – means that the match should repeat six times.
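A quick sketch of that pattern in action. I am using re.fullmatch here rather than re.match, which is my own choice: it requires the whole string to fit the pattern, so longer strings are rejected too. The test words are illustrative.

```python
import re

pattern = r"[a-z]{6}"

six_lower = re.fullmatch(pattern, "cheese")  # matches
also_six = re.fullmatch(pattern, "butter")   # matches
has_upper = re.fullmatch(pattern, "Cheese")  # None: "C" is not in [a-z]
too_short = re.fullmatch(pattern, "brie")    # None: only four characters
```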

The dot character . matches any character except a newline.

By the way, if you want to match the dot character itself, you will have to escape it. Special characters can be escaped and made to match their ordinary equivalents by putting a backslash \ before them.
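To make the difference concrete, here is a small sketch comparing the unescaped dot with the escaped one. The strings are made up for illustration, and re.fullmatch is my choice so that the whole string has to fit.

```python
import re

# Unescaped dot: matches any single character except a newline
any_letter = re.fullmatch(r"c.t", "cat")    # matches
any_symbol = re.fullmatch(r"c.t", "c.t")    # matches
newline = re.fullmatch(r"c.t", "c\nt")      # None: dot skips newlines

# Escaped dot: matches only a literal "."
literal = re.fullmatch(r"c\.t", "c.t")      # matches
not_literal = re.fullmatch(r"c\.t", "cat")  # None
```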

Any other restricted set of characters can be defined, such as the set of all digits – [0-9] – and the set of all alphanumeric characters – [a-zA-Z0-9]. There are some shorthand ways of specifying common sets of characters too. For instance, for ASCII text \w is equivalent to the set [a-zA-Z0-9_], i.e. the set of every alphanumeric character and the underscore. (On Python 3 strings, \w also matches non-ASCII letters and digits unless the re.ASCII flag is set.)
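Here is a short sketch of those sets, using the {n} repetition syntax from earlier. The sample strings are my own.

```python
import re

# Four digits in a row
year = re.fullmatch(r"[0-9]{4}", "2024")

# Six alphanumeric characters
alnum = re.fullmatch(r"[a-zA-Z0-9]{6}", "abc123")

# The underscore is not in [a-zA-Z0-9], so this fails
no_underscore = re.fullmatch(r"[a-zA-Z0-9]{7}", "abc_123")

# \w includes the underscore, so this succeeds
word = re.fullmatch(r"\w{7}", "abc_123")
```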

For a full list of the special character classes supported by re, use the help function.

* and +

So far we have learned how to match a set of characters a specific number of times, but what if we want to match it an indeterminate number of times? That’s where * and + come in.

* is known as the Kleene star, after Stephen Kleene, who invented this notation to describe finite state automata. It means “match the previous character or set of characters zero or more times”.

For instance, the regex a* matches the following strings:

  • “”
  • “a”
  • “aa”
  • “aaa”
  • etc.

The + character, on the other hand, means “match the previous character or set of characters one or more times”.

So, the regex b+ matches the following strings:

  • “b”
  • “bb”
  • “bbb”
  • etc.

Unlike the previous pattern, this one does not match the empty string.
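The difference between the two is easiest to see on the empty string. A minimal sketch, again using re.fullmatch (my choice) so the whole string must fit the pattern:

```python
import re

# * allows zero repetitions, so the empty string matches
star_empty = re.fullmatch(r"a*", "")
star_many = re.fullmatch(r"a*", "aaa")

# + needs at least one repetition, so the empty string does not match
plus_empty = re.fullmatch(r"b+", "")
plus_many = re.fullmatch(r"b+", "bbb")
```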

Repeating matches between x and y times

We’ve seen how to match characters either a definite number of times or an unlimited number of times, but we can also restrict the length of the match using the {x,y} syntax (with no space after the comma), where x is the lower limit and y is the upper limit.

The pattern a{3,5} will match strings composed of the character a repeated between three and five times.

The strings below match the pattern:

  • “aaa”
  • “aaaa”
  • “aaaaa”

However, the strings “aa” and “aaaaaa” do not match.
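We can check the list above with a few lines of Python. This is only a sketch; the candidates list is my own, and I use re.fullmatch so that “aaaaaa” is not accepted on the strength of its first five characters.

```python
import re

pattern = r"a{3,5}"
candidates = ["aa", "aaa", "aaaa", "aaaaa", "aaaaaa"]

# Keep only the strings that fit the pattern in their entirety
matching = [s for s in candidates if re.fullmatch(pattern, s)]
print(matching)  # ['aaa', 'aaaa', 'aaaaa']
```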

Excluding characters

Until now, we’ve been defining sets of characters by the characters that are included in them, but we can also define sets of excluded characters. That is done using the caret ^ character inside the square brackets.

For example, the pattern [^abc]+ matches any string of length one or more that does not contain the characters a, b or c.
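A small sketch of the [^abc]+ pattern, with test strings of my own choosing:

```python
import re

pattern = r"[^abc]+"

allowed = re.fullmatch(pattern, "xyz")      # matches: no a, b or c
contains_b = re.fullmatch(pattern, "xyzb")  # None: "b" is excluded
empty = re.fullmatch(pattern, "")           # None: + needs one or more
```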

Matching the start and end of strings

Regular expressions also support another feature called “anchors”. The caret ^ and the dollar sign $ are the two most common anchors, used to match the start and end of strings respectively. This feature is relevant for searching within strings rather than matching the whole string.

Consider the following example:
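A minimal sketch, assuming a search string that contains “cheese” at both the start and the end (the string itself is my own illustration):

```python
import re

text = "cheese and more cheese"

# ^ anchors the pattern to the start of the string
first = re.search(r"^cheese", text)

# $ anchors the pattern to the end of the string
second = re.search(r"cheese$", text)

print(first.span())   # (0, 6)
print(second.span())  # (16, 22)
```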

The first pattern – ^cheese – matches the first occurrence of the substring “cheese” within the search string. The second pattern – cheese$ – matches the second occurrence.

By using the two anchors together, we can match the whole string. Here is a pattern that will match any string starting and ending with “cheese”.
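One way to write such a pattern is sketched below. The .* in the middle is my assumption for “anything in between”, and the sample strings are illustrative. Note that this particular pattern needs two separate occurrences, so the lone string “cheese” would not match it.

```python
import re

pattern = r"^cheese.*cheese$"

both_ends = re.search(pattern, "cheese and more cheese")  # matches
start_only = re.search(pattern, "cheese and biscuits")    # None
end_only = re.search(pattern, "more cheese")              # None
```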

Matching this or that

Sometimes we want to build a pattern that says “match this string, or match that string”. For that, we use the pipe | character.
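As a rough sketch, with two alternatives of my own choosing:

```python
import re

pattern = r"cheese|biscuits"

first_option = re.fullmatch(pattern, "cheese")     # matches
second_option = re.fullmatch(pattern, "biscuits")  # matches
neither = re.fullmatch(pattern, "toast")           # None
```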

We can confine it to a certain region using round brackets.
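The sketch below shows why the brackets matter. Without them, the pipe splits the entire pattern in two. The sentences are made up for illustration.

```python
import re

# Without brackets, the alternatives are "I like cheese" and "biscuits"
unconfined = r"I like cheese|biscuits"

# With brackets, only the last word varies
confined = r"I like (cheese|biscuits)"

lone_word = re.fullmatch(unconfined, "biscuits")           # matches
lone_word_confined = re.fullmatch(confined, "biscuits")    # None
full_sentence = re.fullmatch(confined, "I like biscuits")  # matches
```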

Optional items

What if we wanted to match, say, both the American and English spellings of the word “harbour”? The American version has no “u” in it. Here’s where the optional character ? comes in useful.

The ? in the pattern matches the preceding character u zero or one times.
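A minimal sketch of a pattern that accepts both spellings. The pattern harbou?r is my reconstruction of what such a pattern would look like.

```python
import re

pattern = r"harbou?r"

british = re.fullmatch(pattern, "harbour")   # matches: one "u"
american = re.fullmatch(pattern, "harbor")   # matches: no "u"
doubled = re.fullmatch(pattern, "harbouur")  # None: ? allows at most one
```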

Groups and named groups

Parts of a regex pattern bounded by round brackets are called “groups”.
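Here is a small sketch of two groups being pulled out of a match. The pattern and the sample string are hypothetical, chosen only to show the mechanics.

```python
import re

# Two groups separated by an @ sign
m = re.fullmatch(r"([a-z]+)@([a-z]+)", "alice@example")

print(m.group(0))  # the whole match: alice@example
print(m.group(1))  # the first group: alice
print(m.group(2))  # the second group: example
print(m.groups())  # all groups as a tuple
```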

These groups are numbered and can be accessed using indexes, but it is also possible to create named groups. These are accessible by name rather than just by an index.
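Named groups use the (?P<name>...) syntax. A minimal sketch, where the group names user and host and the sample string are my own:

```python
import re

named = re.fullmatch(r"(?P<user>[a-z]+)@(?P<host>[a-z]+)", "alice@example")

print(named.group("user"))  # access by name
print(named.group(1))       # named groups still have a number
print(named.groupdict())    # all named groups as a dictionary
```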

Greedy and non-greedy matching

The normal way for regex searches to work is greedily, i.e. matching as much of the search string as possible. Here’s an example.
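A minimal sketch of greedy matching, assuming a short HTML snippet of my own as the search string:

```python
import re

html = "<h1>Title</h1>"  # illustrative search string

# .* grabs as much as it can, so the match runs to the last ">"
greedy = re.search(r"<.*>", html)
print(greedy.group())  # <h1>Title</h1>
```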

The pattern <.*> matched the whole string, right up to the second occurrence of >. However, if we only wanted to match the first <h1> tag, we can use the non-greedy qualifier *?, which matches as little text as possible.
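The non-greedy version can be sketched like this, using the same kind of illustrative HTML snippet:

```python
import re

html = "<h1>Title</h1>"  # illustrative search string

# .*? grabs as little as it can, so the match stops at the first ">"
lazy = re.search(r"<.*?>", html)
print(lazy.group())  # <h1>
```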

Now we’re only matching the first tag.

The end

We certainly haven’t covered everything there is to know about regular expressions in this post, but we’ve covered enough to decipher the vast majority of patterns found in the wild, and to invent our own without falling back on cargo-cult copying and pasting.

However, there’s no need to reinvent the wheel, so if you find a good regex that does what you need, you may as well swipe it. Before you do, though, test it out with a tool like RegExr or similar.
