Guided: Advanced Regex Strategies and Optimization

This Guided Code Lab will walk you through some advanced features of regular expressions in JavaScript. In particular, you will learn how to match complex patterns, extract specific portions of matched text, and avoid typical performance pitfalls. This lab will level up your regular expression skills!

Path Info

Level: Intermediate
Duration: 1h 34m
Published: Feb 16, 2024

Introduction
Welcome to Advanced Regex Strategies and Optimization! In this lab, you'll be using advanced regular expressions to assist in searching for information related to financial transactions.

Lab Structure

The files you will interact with are in two folders:
- /src folder: This folder contains the code that you will be writing and modifying. You should see four existing files, one for each step in this lab.
- /solutions folder: This folder contains solutions for each step. Feel free to refer to these if you get stuck.

Prerequisites

You should already have some familiarity with regular expressions in JavaScript. In particular, you should be comfortable with:
- Matching specific characters or ranges of characters with square brackets. Examples:
  - [0-9] matches any digit.
  - [abxy] matches any of the characters a, b, x, or y.
- Quantifiers
  - [abc]* matches any of the characters a, b, or c zero or more times.
  - [abc]+ matches any of the characters a, b, or c one or more times.
  - [abc]? matches any of the characters a, b, or c zero or one times.
- Alternation
  - [abc]|(warning) matches any of the characters a, b, or c or the specific string "warning".
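As a quick refresher, the constructs listed above behave roughly as follows in JavaScript. This is only an illustrative sketch; the variable names and sample strings are made up for this example and are not part of the lab files.

```javascript
// Character classes: [0-9] matches a digit, [abxy] matches one of a, b, x, or y.
const hasDigit = /[0-9]/.test("room 7");
const startsWithAbxy = /^[abxy]/.test("xylophone");

// Quantifiers: *, +, and ? control how many times the class may repeat.
const starMatch = "ccc".match(/^[abc]*/)[0];       // zero or more
const plusOnEmpty = /^[abc]+$/.test("");           // one or more, so "" fails
const optionalOnEmpty = /^[abc]?$/.test("");       // zero or one, so "" passes

// Alternation: a single a, b, or c, or the literal word "warning".
const alternatives = "a warning".match(/[abc]|(warning)/g);

console.log(hasDigit, startsWithAbxy);             // true true
console.log(starMatch);                            // ccc
console.log(plusOnEmpty, optionalOnEmpty);         // false true
console.log(alternatives);                         // [ 'a', 'warning' ]
```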
Now you'll get started!

Step 1: Using Lookaround Operators
Lookaround operators allow you to match text based on whether it is adjacent to text that matches some other expression. This allows you to fine-tune what text is matched based on the text around it. The lookaround operators you'll be exploring in this step are the lookahead operator, (?=...), and the lookbehind operator, (?<=...).

Lookahead

The lookahead operator allows you to require that specific text comes immediately after the current position in the input, without making that text part of the match. For example, consider this regular expression:
```
\w+(?= and fries)
```
The lookahead operator here indicates that whatever this expression matches must be followed by the string " and fries"; that is, it looks ahead to match the contents of the lookahead expression. Here are some examples of how this expression works:

| Input | Matched Text |
|-|-|
| burger and fries | burger |
| sandwich and fries | sandwich |
| fish and chips | <no match> |

Note that the matched text does not include the contents of the lookahead expression.
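The lookahead behavior described above can be checked with a short JavaScript sketch. The variable names are illustrative only and do not come from the lab files.

```javascript
// The lookahead requires " and fries" to follow, but does not consume it.
const lookahead = /\w+(?= and fries)/;

const lookaheadInputs = ["burger and fries", "sandwich and fries", "fish and chips"];

// match() returns null when nothing matches, so guard before reading index 0.
const lookaheadResults = lookaheadInputs.map((input) => {
  const result = input.match(lookahead);
  return result ? result[0] : null;
});

console.log(lookaheadResults); // [ 'burger', 'sandwich', null ]
```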

Lookbehind

You'll now explore using a lookbehind to match text based on what comes before it. You'll revisit the example from earlier, except now, you're interested in the text that comes after the word "and":
```
(?<=and )\w+
```
This lookbehind expression matches text like this:

| Input | Matched Text |
|-|-|
| burger and fries | fries |
| sandwich and fries | fries |
| fish and chips | chips |
| sandwich | <no match> |
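A similar sketch shows the lookbehind in JavaScript. Lookbehind requires an engine that supports ES2018 regular expression features; the variable names here are illustrative assumptions.

```javascript
// The lookbehind requires "and " to precede the match, but does not include it.
const lookbehind = /(?<=and )\w+/;

const lookbehindInputs = ["burger and fries", "sandwich and fries", "fish and chips", "sandwich"];

const lookbehindResults = lookbehindInputs.map((input) => {
  const result = input.match(lookbehind);
  return result ? result[0] : null;
});

console.log(lookbehindResults); // [ 'fries', 'fries', 'chips', null ]
```

Note that the "and" inside "sandwich" does not trigger a match, because it is not followed by a space.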

Step 2: Using Capture Groups and Backreferences
Now, you're going to use capture groups to extract specific portions of text. A capture group is surrounded by parentheses. A regular expression can contain many capture groups. For instance, here is an expression that will match two words separated by " and ":
```
(\w+) and (\w+)
```
The two capture groups are numbered 1 and 2, respectively. When matching text with the match(...) method in JavaScript (using an expression without the g flag), a successful match is returned as an array. When no capture groups are present, the array contains only one entry: the entire matched text itself. When capture groups are present, they follow the initial entry at index 0.

For example, this JavaScript code matches a variable input with a regular expression that uses capture groups:
```
const result = input.match(/(\w+) and (\w+)/);
const matchedText = result[0];
const captureGroup1 = result[1];
const captureGroup2 = result[2];
```
This will match text according to the following examples:

| Input | matchedText | captureGroup1 | captureGroup2 |
|-|-|-|-|
| Burger and Fries | Burger and Fries | Burger | Fries |
| Fish and Chips | Fish and Chips | Fish | Chips |
| Beans and Corn Bread | Beans and Corn | Beans | Corn |

info> When the text does not match, the result of calling match(...) will be null.
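Because match(...) can return null, reading result[0] directly would throw for non-matching input. One way to guard against that is sketched below; splitPair is a hypothetical helper name introduced only for this example.

```javascript
// Hypothetical helper: returns the matched text and both capture groups,
// or null when the input does not contain "<word> and <word>".
function splitPair(input) {
  const result = input.match(/(\w+) and (\w+)/);
  if (result === null) {
    return null;
  }
  // Index 0 is the whole match; indexes 1 and 2 are the capture groups.
  const [matchedText, captureGroup1, captureGroup2] = result;
  return { matchedText, captureGroup1, captureGroup2 };
}

console.log(splitPair("Beans and Corn Bread"));
// { matchedText: 'Beans and Corn', captureGroup1: 'Beans', captureGroup2: 'Corn' }
console.log(splitPair("Just beans")); // null
```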

Extracting Currency and Amount

Now, you'll continue working to identify and extract information from financial transactions. You need to extract both the amount and currency of a financial transaction and return this information as a JavaScript object. The following table shows the expected results of various kinds of input:

| Input | Result |
|-|-|
| $25 | { amount: '25', currency: '$' } |
| ¥ 45 | { amount: '45', currency: '¥' } |
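One possible way to produce the results in the table is sketched below. This is not the lab's official solution: parseAmount is a hypothetical name, and the sketch assumes the currency is a single symbol that is neither a digit nor whitespace, optionally followed by one space.

```javascript
// Hypothetical helper, shown only as an illustration.
// Group 1 captures the currency symbol, group 2 captures the digits.
function parseAmount(input) {
  const result = input.match(/([^\d\s]) ?(\d+)/);
  if (result === null) {
    return null;
  }
  return { amount: result[2], currency: result[1] };
}

console.log(parseAmount("$25"));  // { amount: '25', currency: '$' }
console.log(parseAmount("¥ 45")); // { amount: '45', currency: '¥' }
```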

info> There may be a space between the currency symbol and the numeric value.

Now, you're going to use capture groups to match the same text multiple times. In order to do this, you will use a backreference. A backreference allows you to match the same text as a capture group in another location. For instance, if you wanted to match text of the form "spam, spam, eggs, and spam", you could use a regular expression like this:
```
(\w+), \1, (\w+), and \1
```
This regular expression only has two capture groups, but it uses a backreference to match the same text in three separate places. Here are some examples of how this expression works:

| Input | Group 1 | Group 2 |
|-|-|-|
| spam, spam, eggs, and spam | spam | eggs |
| beans, beans, pork, and beans | beans | pork |
| bacon, eggs, hash, and pancakes | <no match> | <no match> |
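The backreference examples in the table can be reproduced with a sketch like the following. The helper name menuGroups is an assumption made for illustration.

```javascript
// \1 must repeat exactly the text captured by the first group.
const menuPattern = /(\w+), \1, (\w+), and \1/;

// Hypothetical helper: returns both groups, or null when there is no match.
function menuGroups(input) {
  const result = input.match(menuPattern);
  return result ? { group1: result[1], group2: result[2] } : null;
}

console.log(menuGroups("spam, spam, eggs, and spam"));      // { group1: 'spam', group2: 'eggs' }
console.log(menuGroups("beans, beans, pork, and beans"));   // { group1: 'beans', group2: 'pork' }
console.log(menuGroups("bacon, eggs, hash, and pancakes")); // null
```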

Now, you will use backreferences to find payment plans and extract the relevant details. Some example payment plans are:
- $3450 down, $208 per month
- ¥11500 down, ¥550 per month
Both monetary amounts must use the same currency.

Now, you will use both backreferences and lookarounds to match arbitrage transactions. In currency arbitrage, traders exchange one currency for another with the intent of generating profit by exploiting mismatches in exchange rates. Here are some examples of transactions you'd like to search for:
- from $20 to ¥101
- from ¥1892 to €205
Notice that the currency symbols do not match. You do not want to identify text as an arbitrage transaction if the currency symbols are the same.

You can accomplish that by using a negative lookahead. Negative lookaheads are of the form (?!<expression>), where <expression> is the expression you want to prevent matching.
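The two ideas from this step, requiring the same currency with a backreference and rejecting the same currency with a negative lookahead, might be sketched as follows. This is not the lab's official solution; the helper names are hypothetical, and the sketch assumes only the symbols $, ¥, and € occur.

```javascript
// Assumption: only these three currency symbols appear; extend the class as needed.
// \1 forces the second symbol to equal the first.
const paymentPlan = /([$¥€])(\d+) down, \1(\d+) per month/;
// (?!\1) rejects a second symbol that equals the first.
const arbitrage = /from ([$¥€])(\d+) to (?!\1)([$¥€])(\d+)/;

// Hypothetical helper: returns plan details, or null when the currencies differ.
function parsePaymentPlan(input) {
  const result = input.match(paymentPlan);
  if (result === null) {
    return null;
  }
  return { currency: result[1], down: result[2], monthly: result[3] };
}

// Hypothetical helper: true only when the two currency symbols are different.
function isArbitrage(input) {
  return arbitrage.test(input);
}

console.log(parsePaymentPlan("$3450 down, $208 per month")); // { currency: '$', down: '3450', monthly: '208' }
console.log(parsePaymentPlan("$3450 down, ¥208 per month")); // null
console.log(isArbitrage("from $20 to ¥101"));                // true
console.log(isArbitrage("from $20 to $101"));                // false
```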

Step 3: Anchors

Now, you're going to use anchors to ensure that a match occurs at a specific position in the input. The anchors that you'll be using are the beginning-of-input anchor (^) and the end-of-input anchor ($).

A beginning anchor (^) ensures that the expression matches only at the beginning of the input. For instance, the expression ^Once upon a time matches the input "Once upon a time there was a hero" since the input begins with the given phrase. Input like "Well, Once upon a time" will not match.

Similarly, an ending anchor ($) matches at the end of the input. An expression like happily ever after.$ will match input like "they lived happily ever after.", but not "they lived happily ever after. The end".

Anchors and Performance

Anchors not only make regular expression matching more precise, but can also help ensure that regular expressions perform well. For large inputs, anchors can help prevent the regular expression engine from searching through large portions of the input where matches are impossible.

For example, consider the difference between the expressions ^Once upon a time and Once upon a time. For small inputs, the performance difference may be inconsequential. For larger inputs, however, the latter expression (without the anchor) will search for the phrase "Once upon a time" everywhere. If you were searching the text of a book, for instance, the expression without the anchor would search the text on every page of the book. The expression with the anchor will quickly succeed or fail without needing to examine anything beyond the first few dozen characters.
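The anchored and unanchored expressions from this step can be compared directly in JavaScript. In this sketch the final period is escaped (\.) so that it matches only a literal period; that escaping is a choice made here, not part of the expression as written earlier. Without the m flag, ^ and $ refer to the start and end of the whole input.

```javascript
const anchoredStart = /^Once upon a time/;
const unanchored = /Once upon a time/;
const anchoredEnd = /happily ever after\.$/;

// The anchored expression only accepts the phrase at the very beginning.
console.log(anchoredStart.test("Once upon a time there was a hero")); // true
console.log(anchoredStart.test("Well, Once upon a time"));            // false

// The unanchored expression finds the phrase anywhere in the input.
console.log(unanchored.test("Well, Once upon a time"));               // true

// The end anchor only accepts the phrase at the very end.
console.log(anchoredEnd.test("they lived happily ever after."));         // true
console.log(anchoredEnd.test("they lived happily ever after. The end")); // false
```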

In the next step, you will look at further performance optimization techniques for regular expressions.

Step 4: Avoiding Performance Pitfalls
Greedy and Lazy Quantifiers

Quantifiers in regular expressions (like * and +) are greedy by default, meaning that they match as much text as possible before allowing the next portion of the regular expression to evaluate. This can have some surprising effects both in text matching and in performance. For instance, consider the following regular expression:
```
This is a .+ story
```
Now, consider this input:
```
This is a fun and exciting story but maybe not the story you expect.
```
The expression will match the text "This is a fun and exciting story but maybe not the story". If you were expecting it to only match "This is a fun and exciting story", this reveals why greedy quantifiers can be troublesome.

Greedy quantifiers can not only cause too much text to be matched unexpectedly, but can also cause performance problems. Imagine if the input was something like "This is a fun and exciting story" followed by dozens (or even hundreds) of megabytes of more text, followed by the word "story". The above regular expression would match the entire input. If the input is large enough, this can cause memory problems, because a copy of such a large segment of the input is kept in memory, as well as CPU problems, because the engine searches through far more of the input than necessary.

You can change the behavior of the above regular expression significantly by using a lazy quantifier. Lazy quantifiers match as little as possible before deferring control to later portions of the regular expression. You can change the behavior of a quantifier to lazy by following it with a question mark symbol. For instance, * is greedy, but *? is lazy.

If you modify your expression above to use a lazy quantifier:
```
This is a .+? story
```
This will match only "This is a fun and exciting story" from the example above, eliminating not only the unexpectedly large match, but also many potential performance issues.
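The difference between the greedy and lazy versions can be seen by running both against the sample input. The variable names in this sketch are illustrative.

```javascript
const storyText = "This is a fun and exciting story but maybe not the story you expect.";

// Greedy: .+ grabs as much as possible, so the match ends at the LAST " story".
const greedyMatch = storyText.match(/This is a .+ story/)[0];

// Lazy: .+? grabs as little as possible, so the match ends at the FIRST " story".
const lazyMatch = storyText.match(/This is a .+? story/)[0];

console.log(greedyMatch); // This is a fun and exciting story but maybe not the story
console.log(lazyMatch);   // This is a fun and exciting story
```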

Now, you will use a lazy quantifier to help ensure that you can identify financial transactions processed through a brokerage.

Numeric Quantifiers

Another way to avoid matching too much text is to use a numeric quantifier. Numeric quantifiers specify the minimum and maximum number of times that a given expression should match. For instance, the expression (hello){2,4} will match the string "hello" repeated at least two times and at most four times. Note that there must be no space inside the braces.
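A small sketch of a numeric quantifier in JavaScript follows. The anchors are added here so that the whole input must consist of the repeated word. Note that in JavaScript the braces must not contain spaces: without the u flag, a form such as {2, 4} is treated as literal characters rather than as a quantifier.

```javascript
// Accepts "hello" repeated between two and four times, and nothing else.
const repeatedHello = /^(hello){2,4}$/;

console.log(repeatedHello.test("hello"));           // false (only one repetition)
console.log(repeatedHello.test("hellohello"));      // true
console.log(repeatedHello.test("hello".repeat(4))); // true
console.log(repeatedHello.test("hello".repeat(5))); // false (too many repetitions)

// With a space inside the braces, the braces are matched literally instead.
const spacedBraces = /(hello){2, 4}/;
console.log(spacedBraces.test("hellohello"));       // false
console.log(spacedBraces.test("hello{2, 4}"));      // true
```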

Numeric quantifiers can help serve as a safeguard against matching too much text. If you know that a particular portion of valid input has a maximum size, a numeric quantifier can quickly reject invalid input.

Non-capturing Groups

Sometimes it's useful to use a capture group as a way to treat a portion of an expression all as one unit; say, to apply a quantifier. However, there are times when you're not interested in the text matched by a group. In this case, you can use a non-capturing group to evaluate that the text matches, but without capturing the matched text. For example, if you wanted to match typical English names, you could use an expression like this:
```
(\w+) (?:[A-Z]\. )?(\w+)
```
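To see the expression above in action, the following JavaScript sketch runs it against two sample names. The variable names are illustrative assumptions.

```javascript
// The middle group uses (?: ... ), so it is matched but not captured.
const namePattern = /(\w+) (?:[A-Z]\. )?(\w+)/;

const withInitial = "John Q. Smith".match(namePattern);
const withoutInitial = "Jane Doe".match(namePattern);

// Each result holds the full match plus exactly two capture groups.
console.log(withInitial.length);                    // 3
console.log(withInitial[1], withInitial[2]);        // John Smith
console.log(withoutInitial[1], withoutInitial[2]);  // Jane Doe
```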
This expression will match inputs like "Jane Doe" and "John Q. Smith". However, it only contains two capture groups. The group that matches the optional middle initial, (?:[A-Z]\. ), is a non-capturing group. As you can see, this can be useful when you need to add a quantifier, but have no interest in the matched text. Because the engine does not need to record the text matched by a non-capturing group, using one can also improve regular expression performance somewhat, especially in cases where inputs are large.

Catastrophic Backtracking

One of the most notorious performance problems that one might encounter in dealing with regular expressions is a scenario called catastrophic backtracking, where the regular expression engine attempts to process an exponential number of combinations of two or more quantifiers.

In order to understand catastrophic backtracking, you must first understand backtracking. Given this regular expression:
```
((spam|bacon|eggs), )+eggs, and spam
```
Consider what happens when matching the input "spam, bacon, eggs, and spam":

| Step | Matched Text | Remaining Text | Notes |
|-|-|-|-|
| 1 | spam, | bacon, eggs, and spam | + quantifier will continue greedily matching |
| 2 | spam, bacon, | eggs, and spam | + quantifier will continue greedily matching |
| 3 | spam, bacon, eggs, | and spam | + quantifier has encountered non-matching text and has stopped |
| 4 | spam, bacon, eggs, | and spam | pattern eggs, and spam does not match |
| 5 | spam, bacon, | eggs, and spam | + quantifier backtracks |
| 6 | spam, bacon, | eggs, and spam | pattern eggs, and spam matches |

Now, consider the input "spam, spam, bacon, and beans":

| Step | Matched Text | Remaining Text | Notes |
|-|-|-|-|
| 1 | spam, | spam, bacon, and beans | + quantifier will continue greedily matching |
| 2 | spam, spam, | bacon, and beans | + quantifier will continue greedily matching |
| 3 | spam, spam, bacon, | and beans | + quantifier has encountered non-matching text and has stopped |
| 4 | spam, spam, bacon, | and beans | pattern eggs, and spam does not match |
| 5 | spam, spam, | bacon, and beans | + quantifier backtracks |
| 6 | spam, spam, | bacon, and beans | pattern eggs, and spam does not match |
| 7 | spam, | spam, bacon, and beans | + quantifier backtracks |
| 8 | spam, | spam, bacon, and beans | pattern eggs, and spam does not match |

At this final step, the + quantifier cannot backtrack further since it requires at least one instance of the pattern it matches. Therefore, the input does not match. Notice how it requires more steps to determine non-matching text.
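A quick way to check the two outcomes traced above is to run both inputs through the expression. This sketch writes the pattern with no space between the + quantifier and eggs, so that the trailing portion lines up with the remaining text shown in the tables. The variable names are illustrative.

```javascript
const breakfast = /((spam|bacon|eggs), )+eggs, and spam/;

const matched = "spam, bacon, eggs, and spam".match(breakfast);
const unmatched = "spam, spam, bacon, and beans".match(breakfast);

// The first input matches after the + quantifier gives back "eggs, ".
console.log(matched[0]); // spam, bacon, eggs, and spam
// Group 1 holds the last repetition that was kept after backtracking.
console.log(matched[1] === "bacon, "); // true

// The second input fails after every backtracking option is exhausted.
console.log(unmatched); // null
```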

Catastrophic backtracking occurs when two nested or adjacent quantifiers backtrack repeatedly, causing the engine to attempt an exponential combination of two or more quantified expressions. This behavior is usually triggered when encountering text that cannot match. Signs to watch out for to identify the potential for catastrophic backtracking include:
- Adjacent sub-expressions with quantifiers where some given text could satisfy either sub-expression.
  - Example: ((spam|eggs), )+((eggs|bacon), )+; the string "eggs, " can be matched by either of the two sub-expressions with + quantifiers
  - Potential fix: Try to make the sub-expressions mutually exclusive, where only one can match any given portion of text.
- Nested quantifiers
  - Example: (((spam|eggs), )+)*; Note how the expression with the + quantifier is nested inside an expression with a * quantifier.
  - Potential fix: Find a way to rewrite the expression to remove the nested quantifiers.
Thankfully, even if it's not clear how or why catastrophic backtracking occurs, the two fixes above can often solve performance issues.
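As an illustration of the second fix, the nested example can be rewritten with a single quantifier. The sketch below assumes the whole input should consist of repeated "spam, " or "eggs, " items, adds anchors for that purpose, and checks that both versions accept and reject the same sample inputs. The non-matching sample is kept deliberately short, since the nested version may slow down dramatically as such input grows.

```javascript
// Nested quantifiers: a + inside a *. Many ways to split the same text.
const nestedPattern = /^(((spam|eggs), )+)*$/;
// Rewritten with one quantifier: only one way to split the text.
const flatPattern = /^(?:(?:spam|eggs), )*$/;

const samples = [
  "",
  "spam, ",
  "spam, eggs, spam, ",
  "spam, eggs, bacon, ",
  "spam, ".repeat(10) + "!", // cannot match; forces backtracking in the nested version
];

const nestedResults = samples.map((s) => nestedPattern.test(s));
const flatResults = samples.map((s) => flatPattern.test(s));

console.log(nestedResults); // [ true, true, true, false, false ]
console.log(flatResults);   // [ true, true, true, false, false ]
```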

Conclusion
Well done! You've demonstrated your ability to use some advanced regular expression features and to avoid some typical performance pitfalls. In particular, here's what was covered:
1. Using lookaround operators to match text based on other nearby text
2. Using capture groups to extract specific portions of matched expressions
3. Using backreferences to match (or prevent matching) the same expression in multiple places
4. Using lazy quantifiers to prevent matching too much text
5. Using numeric quantifiers to limit the size of text matches
6. Preventing typical cases of catastrophic backtracking
For more in-depth coverage of advanced regular expression usage, see https://www.regular-expressions.info/.

Author

Floyd May

Developer. Craftsman. Leader. Architect. Mentor. Teacher. Author. Floyd is a veteran software craftsman with broad experience and a passion for teaching.
