Regular Expressions (RegExp) in Power Query

If you are at least slightly familiar with regular expressions, then you do not need to advertise them. If you are not quite in the subject, then regular expressions (Regular Expressions = RegExp = “regexps” = “regulars”) is a language where, using special characters and rules, the necessary substrings are searched for in the text, they are extracted or replaced with other text . This is a very powerful and beautiful tool, an order of magnitude superior to all other ways of working with text.

I have already described in detail and with a bunch of examples from life how you can add regular expression support to Excel using simple macros – if you have not read this article, I highly recommend that you read it before continuing. You will discover a lot of new things, I guarantee 🙂

However, the question remains open – how to add the ability to use regular expressions in Power Query? Power Query, of course, is good on its own and can do a lot with text (cutting, gluing, cleaning, etc.), but if you could cross it with the power of regular expressions, it would be just a bomb.

Unfortunately, there are no built-in functions for working with RegExps in Power Query, and official Microsoft help and technical support answer this question in the negative. However, there is a way around this limitation 🙂

The essence of the method

The main idea is simple to disgrace.

In the list of built-in Power Query capabilities, there is a function Web.Page. The description of this function on the official Microsoft help site is extremely concise:

Regular Expressions (RegExp) in Power Query

Translated, this would be: “Returns the contents of the HTML document broken down into its component structures, as well as a representation of the complete document and its body after the tags have been removed.” So-so description, frankly.

Usually this function is used when importing data from the web and is automatically substituted, for example, when we select on the tab Data Command From the Internet (Data — From web). We give the function a web page as an argument, and it returns its contents to us in the form of tables, having previously cleared all the tags.

What the help does NOT say is that in addition to the HTML markup language function Web.Page supports JavaScript scripts, which is now ubiquitous on websites on the Internet. And JavaScript, in turn, has always been able to work with regular expressions and has built-in functions for RegExps! So to implement regular expressions in Power Query, we will need to feed the Web.Page functions as an argument to a small JavaScript program that will do all the work for Power Query.

What it looks like in pure JavaScript

There are a lot of detailed tutorials on working with regular expressions in JavaScript on the Internet (for example, one, two).

In short and simplified, the JavaScript code will look like this:

Regular Expressions (RegExp) in Power Query

Here:

  • var str = ‘Pay bills 123 and 789 for sausage’; – create a variable str and assign it the source text that we will analyze.
  • var pattern = /d+/gi; – create a regular expression and put it in a variable pattern.

    The expression starts with a slash (/).

    The expression itself here, for example, is d+ stands for any sequence of digits.

    Through the fraction after the expression, there are additional search parameters (modifiers) – they can be specified in any order:

    • g – means global search, i.e. after finding a match, you should not stop, but continue the search until the end of the text. If this modifier is not set, then our script will only return the first match (123)
    • i – search without regard to case of letters
    • m – multi-line search (used when the source text is divided into several lines)
  • var result = str.match(pattern).join(‘;’); – perform a search in the source text (str) by the given regular expression (pattern) and put the results in a variable result, concatenating them with a semicolon using the command join
  • document.write(result); – display the contents of the result variable

Also note that text strings (excluding regular expressions) in JavaScript are enclosed in apostrophes, not quotes as they are in Power Query or VBA.

At the output, this script will give us as a result all the numbers found in the source text:

123, 789

JavaScript short course is over, thank you all. Hope you get the logic 🙂

It remains to transfer this construction to Power Query.

Search and Extract Text Function by Regular Expression in Power Query

We do the following:

1. Open Excel and create a new empty Power Query in the tab Data – Get data / Create request – From other sources – Empty request (Data — Get data / New query — From other sources — Blank query). If you have an old version of Excel 2010-2013 and Power Query you do not have built-in, but was installed as a separate add-in, then all this will be on the tab Power QueryAnd not Data.

2. In the empty window of the query editor that opens, in the right panel, immediately enter the name of our future function (for example, fxRegExpExtract)

Regular Expressions (RegExp) in Power Query

3. Let’s go to the tab View – Advanced Editor (View — Advanced Editor), we erase the entire M-code of the empty request and paste the code of our superfunction there:

Regular Expressions (RegExp) in Power Query

Watch your hands:

In the first line, we say that our function will have three text arguments: txt – the original text being analyzed, regex – regular expression pattern, delim — delimiter character for displaying results.

Next we call the function Web.Page, forming the JavaScript code described above in its argument. We paste and substitute our variable arguments into the code.

Fragment:

[Data]{0}[Children]{0}[Children]{1}[Text]{0}

… is needed to “fall through” into the table with the results we need. The point is that the function Web.Page as a result, it produces several nested tables that repeat the structure of a web page. Without this piece of M-code, our function would output this:

Regular Expressions (RegExp) in Power Query

… and we would have to click the word several times Table, successively “falling through” into child nested tables in columns Children:

Regular Expressions (RegExp) in Power Query

Instead of all this quotation, we immediately indicate in the code of our function which nested table and column (Text) we need.

Here, in fact, are all the secrets. It remains to press the button Finish in the window advanced editor, where we inserted our code, and you can proceed to the most delicious – try our function at work.

Here are a couple of seed examples.

Example 1. Retrieving the account number and date from the payment description

We have a bank statement with a description (purpose) of payments, where you need to pull out the numbers and dates of paid invoices into separate columns:

Regular Expressions (RegExp) in Power Query

We load the table into Power Query in the standard way through Data – From Table/Range (Data — From Table/Rangel).

Then we add a calculated column with our function via Add Column – Call Custom Function (Add Column — Invoke Custom Function) and enter its arguments:

Regular Expressions (RegExp) in Power Query

As a regular expression (argument regex) template we use:

(d{3,5}|d{2}.d{2}.d{4})

… translated into human language meaning: 

numbers from 3 to 5 digits (account numbers)

or

fragments of the form “2-bit number – point – 2-bit number – point – 4-bit number”, that is, dates of the form DD.MM.YYYY.

As a delimiter character (argument delim) enter a semicolon.

After clicking on OK our magic function analyzes all the initial data according to our regular expression and forms a column for us with the found numbers and dates of the invoices:

Regular Expressions (RegExp) in Power Query

It remains to separate it by semicolon using the command Home — Split Column — By Delimiter (Home — Split column — By delimiter) and we get what we wanted:

Regular Expressions (RegExp) in Power Query

Beauty!

Example 2: Extract email addresses from text

Suppose we have the following table as initial data:

Regular Expressions (RegExp) in Power Query

… from where we need to pull out the email addresses found there (for clarity, I highlighted them in red in the text).

As in the previous example, we load the table into Power Query in the standard way via Data – From Table/Range (Data — From Table/Rangel).

Then we add a calculated column with our function via Add Column – Call Custom Function (Add Column — Invoke Custom Function) and enter its arguments:

Regular Expressions (RegExp) in Power Query

Parsing email addresses is a more difficult task and there are a bunch of regular expressions of varying degrees of nightmare to solve it. I used one of the simple options – not ideal, but quite working in most cases:

[w|.|-]*@w*.[w|.]*

As separator (delim) you can enter a semicolon and a space.

Click on OK and we get a column with e-mail addresses extracted from the original text “porridge”:

Regular Expressions (RegExp) in Power Query

Magic!

PS

As the saying goes: “There is no such good thing that could not be made even better.” Power Query is cool on its own, and when combined with regular expressions, it gives us completely unrealistic power and flexibility in processing any text data. I hope Microsoft will someday add RegExp support in Power Query and Power BI updates and all the above dances with a tambourine will become a thing of the past. Well, for now, yes.

I also want to add that it is convenient to play with regular expressions on the site https://regexr.com/ – right in the online editor. There in the section Community Patterns There are a huge number of ready-made regular seasons for all occasions. Experiment – all the power of regular expressions is now at your service in Power Query!

  • What are regular expressions (RegExp) and how to use them in Excel
  • Fuzzy text search in Power Query
  • Assembling tables from different files using Power Query

Leave a Reply