Using RegEx Subexpressions in MySQL
As I have discussed, regular expressions can be a powerful tool for developers. When we use regular expressions to match a pattern and need to replace part of that pattern, subexpressions make it much easier. Subexpressions are specific parts of a pattern that we can reference elsewhere in the process.
The Requirement
The requirement for this exercise is simple. If the word ‘ipsum’ exists in our string, and there is a word before the instance of ‘ipsum’, replace that word with four asterisks. For example, if we had a string that contained the text ...Scott ipsum...
, or solution must return ...**** ipsum...
.
The Setup
This exercise will use the following query as our starting point.
select
id,
name,
substr(sample, 1, 50) sample
from ipsum
where id in (1,2,6,10,11);
I chose those specific rows because ipsum
is the second word in each of them. The result of this query resembles the following:
+----+--------------+----------------------------------------------------+
| id | name | sample |
+----+--------------+----------------------------------------------------+
| 1 | Lorem Ipsum | Lorem ipsum dolor sit amet, consectetur adipiscing |
| 2 | Zombie Ipsum | Zombie ipsum reversus ab viral inferno |
| 6 | Cat Ipsum | Cat ipsum dolor sit amet, who's the baby yet bury |
| 10 | Dog Ipsum | Doggo ipsum thicc puggo yapper heck bork borkdrive |
| 11 | Pizza Ipsum | Pizza ipsum dolor amet string cheese bbq rib onion |
+----+--------------+----------------------------------------------------+
5 rows in set (0.0033 sec)
The Solution
As I said above, a subexpression is an annotated portion of our pattern in regex. In our regular expression, we separate subexpressions by wrapping them in parentheses ( )
. Each subexpression can then be referenced using a dollar sign $
followed by the subexpression number you want (numbered from left to right).
Let’s take a look at a query that satisfies our requirements.
select
id,
name,
substr(regexp_replace(sample, '(\\b.+\\b)(ipsum)', '**** $2'), 1, 50) snippet
from ipsum
where id in (1,2,6,10,11);
The second argument of regexp_replace()
shows that two regex parts are wrapped in parentheses. This annotation means that we will have two subexpressions. Let’s break these down individually. We will start with (\\b.*\\b)
.
(
: The left parenthesis indicates the start of our subexpression. It is NOT used as part of the pattern match.\\b
: This part of the regex says we want to match a word boundary..+
: Next, we specify we want to match one or more characters.- The
.
indicates any character. - The
+
indicates one or more.
- The
\\b
: This part of the regex says we want to match a word boundary.)
: The right parenthesis indicates the end of our subexpression. It is NOT used as part of the pattern match.
The second subexpression, (ipsum)
, is easier to read.
(
: The left parenthesis indicates the start of our subexpression. It is NOT used as part of the pattern match.ipsum
: Indicates we want to match the literal string ‘ipsum’.)
: The right parenthesis indicates the end of our subexpression. It is NOT used as part of the pattern match.
When we combine these two, our pattern translates to a word boundary followed by one or more characters, followed by another word boundary, followed by the text ‘ipsum’.
We reference the second subexpression using $2
in the third argument. Now, regexp_repalce()
will replace our matching pattern with the literal string ****
and the value of our second subexpression.
Here are the results of this query:
+----+--------------+----------------------------------------------------+
| id | name | snippet |
+----+--------------+----------------------------------------------------+
| 1 | Lorem Ipsum | **** ipsum dolor sit amet, consectetur adipiscing |
| 2 | Zombie Ipsum | **** ipsum reversus ab viral inferno |
| 6 | Cat Ipsum | **** ipsum dolor sit amet, who's the baby yet bury |
| 10 | Dog Ipsum | **** ipsum thicc puggo yapper heck bork borkdrive |
| 11 | Pizza Ipsum | **** ipsum dolor amet string cheese bbq rib onions |
+----+--------------+----------------------------------------------------+
5 rows in set (0.0447 sec)
The word before ipsum
in each row has been replaced with ****
.
Having Some Fun
The solution above satisfies our requirements, but let’s see how else we can use these subexpressions.
If we wanted to replace ipsum
with ****
, we could use this query:
select
id,
name,
substr(regexp_replace(sample, '(\\b.+\\b)(ipsum)', '$1 ****'), 1, 50) snippet
from ipsum
where id in (1,2,6,10,11);
The only change we made was to the text we want to use as a replacement. We are using the value of the first subexpression followed by ****
. The results of this query look like:
+----+--------------+----------------------------------------------------+
| id | name | snippet |
+----+--------------+----------------------------------------------------+
| 1 | Lorem Ipsum | Lorem **** dolor sit amet, consectetur adipiscing |
| 2 | Zombie Ipsum | Zombie **** reversus ab viral inferno |
| 6 | Cat Ipsum | Cat **** dolor sit amet, who's the baby yet bury |
| 10 | Dog Ipsum | Doggo **** thicc puggo yapper heck bork borkdrive |
| 11 | Pizza Ipsum | Pizza **** dolor amet string cheese bbq rib onion |
+----+--------------+----------------------------------------------------+
5 rows in set (0.0599 sec)
This example shows how to use subexpressions, but we would not need regex to replace the text ipsum
with ****
. But what if we wanted to swap the words that match our pattern?
We can accomplish that simply by updating our replacement text to $2 $1
like in the query below:
select
id,
name,
substr(regexp_replace(sample, '(\\b.+\\b)(ipsum)', '$2 $1'), 1, 50) snippet
from ipsum
where id in (1,2,6,10,11);
The result of this query would be:
+----+--------------+----------------------------------------------------+
| id | name | snippet |
+----+--------------+----------------------------------------------------+
| 1 | Lorem Ipsum | ipsum Lorem dolor sit amet, consectetur adipiscin |
| 2 | Zombie Ipsum | ipsum Zombie reversus ab viral inferno |
| 6 | Cat Ipsum | ipsum Cat dolor sit amet, who's the baby yet bury |
| 10 | Dog Ipsum | ipsum Doggo thicc puggo yapper heck bork borkdriv |
| 11 | Pizza Ipsum | ipsum Pizza dolor amet string cheese bbq rib onio |
+----+--------------+----------------------------------------------------+
5 rows in set (0.0559 sec)
Now, each snippet starts with ipsum
and is followed by the word that preceded it in the original text.
If this were a real-life scenario, I would probably add logic to ensure the capitalization made sense, but I think this is good enough for this example.
Wrap Up
Subexpressions in regex can help isolate parts of a pattern so we can reuse them as part of a replacement process. In this post, I showed how we can use each of two subexpressions in different configurations. For more information on regular expressions in MySQL, check out the documentation.
Photo by Ferdinand Stöhr on Unsplash