Fuzzy text search in Power Query

I once wrote a detailed review of the free Fuzzy Lookup add-in from Microsoft, which lets you find matches between two lists when the data does not match exactly. With the latest Office 365 updates, similar functionality has arrived in Power Query in Excel. By the way, it has made its way into Power BI Desktop as well.

Let’s see how this tool works, along with its pros, cons and nuances.

We will practice on a slightly updated example from the earlier article about the Fuzzy Lookup add-in: two lists that need to be combined into one by matching addresses.

Before we start, pay attention to the following points:

  • Only one address matches exactly in these lists: “Pushkino, Naberezhnaya St., 61”. All the other addresses differ to a greater or lesser degree.
  • Some addresses have their words rearranged, for example “Ulyanovsk, Lermontova St., 63” and “63 Lermontova street, Ulyanovsk”.
  • Some data is missing in places, for example there is no city in “Sirenevaya str. d.90” in the second table.
  • In some places the city is written with a “g.” prefix, and in others without it. The same goes for streets.
  • There are unique addresses that are unlike anything else and match nothing (Paris and Rio de Janeiro at the end of each list).
  • There are addresses with typos or misspellings inside words (for example, Chilyabinsk).

Separately, I want to note the problem with St. Petersburg: this city can be written in many different ways. To account for this when linking, we will have to prepare a special conversion table in advance. The columns in this table must be named exactly From and To and contain all the possible spellings (column From) and their correct counterparts (column To).
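
For illustration, a conversion table of this shape can also be sketched directly in M. This is only a sketch: the spellings below are made-up sample rows, not the contents of the table used in this article. The only thing taken from the text is that the columns are called From and To.

```powerquery
let
    // Hypothetical sample rows; replace them with the spellings found in your own data
    Conversion = #table(
        type table [From = text, To = text],
        {
            {"SPb", "St. Petersburg"},
            {"Saint Petersburg", "St. Petersburg"},
            {"Piter", "St. Petersburg"}
        }
    )
in
    Conversion
```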

Step 1. Load the source data into Power Query

First, of course, we need to load all three of our source tables into Power Query. There are several ways to do this (named range, print area, entire sheet), but the most convenient is probably to convert them to “smart” tables with the keyboard shortcut Ctrl+T or the command Home – Format as Table.

By default, each smart table gets a standard name such as Table1, Table2, and so on, which can be changed if desired (I will not do so here).

After that, the smart table can be loaded into Power Query with the From Table button on the Data tab, or on the Power Query tab if you have Excel 2010-2013 with Power Query installed as a separate add-in.
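
Behind the scenes, the From Table button creates a short M query. A minimal sketch of it is shown here, assuming the smart table kept its default name Table1. The editor usually appends a Changed Type step as well, which is omitted.

```powerquery
let
    // "Table1" is an assumed table name; use the name of your own smart table
    Source = Excel.CurrentWorkbook(){[Name = "Table1"]}[Content]
in
    Source
```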

In the Power Query editor window that opens you can, in principle, clean up the data if necessary and then save the resulting table as a connection via Home – Close & Load – Close & Load To…

In the next window, select the option Only Create Connection.

Do this in turn for all three tables, so that in the end all three appear as connections in the queries pane on the right.

That’s it. The most boring part is over. Now let’s move on to merging.

Step 2. Perform the merge

On the Data tab (or on the Power Query tab), choose the command Get Data / New Query – Combine Queries – Merge.

The Merge window will open.

In this window you need to:

1. Select Table1 and Table2, the tables we want to merge, from the drop-down lists.

2. Select in both tables the columns by which we link our lists (columns Address and Place, respectively).

3. To see later not only the matches but also the differences, and to understand clearly what was found and what was not, select the join kind Full Outer.

4. Most importantly, enable the checkbox Use fuzzy matching to perform the merge. It is what makes Power Query look not only for exact matches, but also for approximate ones.

The Fuzzy matching options link hides a whole block of additional settings for the fuzzy merge.

Here:

  • Similarity Threshold is a fractional value (from 0 to 1) that determines how strict a match you require. When it is set to 1, Power Query will effectively look for exact matches only. At values close to zero, the probability of false matches increases greatly. It makes sense to spend 2-3 attempts finding the largest possible value (i.e. the strictest search) at which all (or most) of the results are still found.
  • Ignore case – by default, Power Query is case-sensitive when searching, i.e. it distinguishes Moscow and MOSCOW, for example. Enabling this checkbox removes case sensitivity when merging.
  • Match by combining text parts – translated into human language, this means that when searching for matches, a check will be made for the rearrangement of words within the text (remember Ulyanovsk and Lermontova St.?).
  • Maximum number of matches – if one address in the first table corresponds to several similar addresses in the second (this is especially likely at low values of the similarity threshold), this parameter limits the number of options found.
  • Transformation table – to take the different spellings of St. Petersburg into account, we specify our third table here.

After completing all the settings, click OK and, in the Power Query window that appears, expand the second table using the button in its column header (the checkbox Use original column name as prefix can be cleared).
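
The same merge and expand can be written in M with Table.FuzzyNestedJoin and Table.ExpandTableColumn. Below is a minimal sketch rather than the exact code the dialog generates. It makes several assumptions: the three inline tables are tiny made-up stand-ins for the real lists, the threshold of 0.5 is an arbitrary starting value, and the mapping of the dialog checkboxes to option names (in particular IgnoreSpace for Match by combining text parts) is my reading of the documentation.

```powerquery
let
    // Hypothetical stand-ins for the three source tables
    Table1 = #table(
        type table [Address = text],
        {{"Pushkino, Naberezhnaya St., 61"}, {"Ulyanovsk, Lermontova St., 63"}, {"Paris"}}
    ),
    Table2 = #table(
        type table [Place = text],
        {{"Pushkino, Naberezhnaya St., 61"}, {"63 Lermontova street, Ulyanovsk"}, {"Rio de Janeiro"}}
    ),
    Conversion = #table(
        type table [From = text, To = text],
        {{"SPb", "St. Petersburg"}}
    ),

    // Full outer fuzzy join on Address = Place; matching rows of Table2 land in a nested column
    Merged = Table.FuzzyNestedJoin(
        Table1, {"Address"},
        Table2, {"Place"},
        "Table2",
        JoinKind.FullOuter,
        [
            Threshold = 0.5,                  // Similarity Threshold (arbitrary value for this sketch)
            IgnoreCase = true,                // Ignore case
            IgnoreSpace = true,               // assumed to correspond to Match by combining text parts
            NumberOfMatches = 1,              // Maximum number of matches
            TransformationTable = Conversion  // Transformation table with From and To columns
        ]
    ),

    // Expand the nested table without prefixing the column name
    Expanded = Table.ExpandTableColumn(Merged, "Table2", {"Place"})
in
    Expanded
```

Raising Threshold step by step and re-running the query is a quick way to look for the strictest value that still finds most of the pairs.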

As a result, all addresses have found their counterparts, except for the unique Paris and Rio de Janeiro, which are paired with null cells, i.e. with nothing.

Step 3. Write your own similarity function in M

In principle, we could stop here, but one thing bothers me personally in this whole story: how do we tell how well Power Query matched each address? Imagine that you need to combine tables of several thousand rows this way; with that much data, the probability of an error is already tangible. How do we know where Power Query did the fuzzy merge well (the text matches almost exactly) and where it is worth checking the match manually and, possibly, correcting it?

If only (let’s dream!) our data had a column showing the similarity coefficient of the matched addresses, clearly illustrating the accuracy of each match! How much easier it would be to spot suspicious pairs!

Unfortunately, I did not find a built-in tool for this in Power Query 🙁 However, we can implement something similar ourselves by writing our own function for the similarity of two text strings in M, the language built into Power Query (many thanks for the idea to Andrey VG from our forum).

We do the following:

1. On the Data tab, choose the command Get Data / New Query – From Other Sources – Blank Query.

2. In the Query Editor window that opens, click the Advanced Editor button on the Home tab or on the View tab.

3. In the window that appears, delete everything that is there by default and paste in the M code of our function:

(text1 as text, text2 as text) as number =>
let
    text1 = Text.Upper(text1),
    text2 = Text.Upper(text2),
    matching_chars = List.Count(List.Intersect({Text.ToList(text1), Text.ToList(text2)})),
    average_length = (Text.Length(text1) + Text.Length(text2)) / 2,
    coef = matching_chars / average_length
in
    coef

If you are interested in the details, this function:

  • converts both text strings to upper case with Text.Upper to avoid case sensitivity
  • splits the source strings into individual characters with Text.ToList
  • counts the characters the two strings have in common with List.Intersect and List.Count and stores the result in the variable matching_chars
  • calculates the average length of the two text strings using Text.Length and stores the result in the variable average_length
  • divides the number of matches by the average length to get the similarity coefficient
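
To try the function on a single pair of strings before applying it to a whole table, it can be called from a blank query. The sketch below repeats the same logic inline so that it runs on its own; the name SimilarityCoefficient, the intermediate names upper1 and upper2, and the test pair are my assumptions for the example.

```powerquery
let
    // Same logic as the function above, defined inline so this query is self-contained
    SimilarityCoefficient = (text1 as text, text2 as text) as number =>
        let
            upper1 = Text.Upper(text1),
            upper2 = Text.Upper(text2),
            // Characters the two strings have in common
            matching_chars = List.Count(List.Intersect({Text.ToList(upper1), Text.ToList(upper2)})),
            average_length = (Text.Length(upper1) + Text.Length(upper2)) / 2
        in
            matching_chars / average_length,

    // Hypothetical test pair with rearranged words
    Result = SimilarityCoefficient("Ulyanovsk, Lermontova St., 63", "63 Lermontova street, Ulyanovsk")
in
    Result
```

Note that the function compares characters without regard to their order, so it is a rough indicator of similarity rather than a strict measure.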

Of course, this logic differs from the one Power Query uses when searching for matches (and only the developers at Microsoft know exactly how Power Query does it). However, in the vast majority of real cases our function does its job well; experience has shown this.

After clicking Done, we can rename our function in the right pane of the Power Query window, giving it a more descriptive name (for example, Similarity Coefficient instead of Query1).

Now it remains to apply it to our data. On the Add Column tab, choose the command Invoke Custom Function and enter its arguments in the window that opens.

After clicking OK we finally get what we wanted: a column with a numerical similarity coefficient that clearly shows the quality of each address match.
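
In M, invoking a custom function for every row comes down to a Table.AddColumn step. Here is a minimal sketch under a few assumptions: the function is defined inline under the hypothetical name SimilarityCoefficient, the two-row table stands in for the real merge result, and the new column is called Similarity.

```powerquery
let
    // Same similarity logic as in this article, defined inline for a self-contained example
    SimilarityCoefficient = (text1 as text, text2 as text) as number =>
        let
            upper1 = Text.Upper(text1),
            upper2 = Text.Upper(text2),
            matching_chars = List.Count(List.Intersect({Text.ToList(upper1), Text.ToList(upper2)})),
            average_length = (Text.Length(upper1) + Text.Length(upper2)) / 2
        in
            matching_chars / average_length,

    // Hypothetical stand-in for the merged table with both address columns
    Merged = #table(
        type table [Address = text, Place = text],
        {
            {"Pushkino, Naberezhnaya St., 61", "Pushkino, Naberezhnaya St., 61"},
            {"Ulyanovsk, Lermontova St., 63", "63 Lermontova street, Ulyanovsk"}
        }
    ),

    // Call the function once per row, passing the two address columns
    WithCoef = Table.AddColumn(
        Merged,
        "Similarity",
        each SimilarityCoefficient([Address], [Place]),
        type number
    )
in
    WithCoef
```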

By right-clicking the header of the resulting column, you can choose the command Replace Errors and easily replace the Error values for Paris and Rio de Janeiro with zero. Then sort the table in descending order by the coefficient column and load it back to Excel with the already familiar command Home – Close & Load.
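
These two clean-up steps correspond to Table.ReplaceErrorValues and Table.Sort. The sketch below is an illustration only: the small table and the simulated error for the unmatched row are made up, and the column name Similarity is an assumption.

```powerquery
let
    // Hypothetical table with a coefficient column; the unmatched row
    // raises an error, as happens when one of the addresses is null
    WithCoef = Table.AddColumn(
        #table(
            {"Address", "Place"},
            {
                {"Paris", null},
                {"Pushkino, Naberezhnaya St., 61", "Pushkino, Naberezhnaya St., 61"}
            }
        ),
        "Similarity",
        each if [Place] = null then error "No match" else 1
    ),

    // Replace errors in the coefficient column with zero
    NoErrors = Table.ReplaceErrorValues(WithCoef, {{"Similarity", 0}}),

    // Best matches first, suspicious ones at the bottom
    Sorted = Table.Sort(NoErrors, {{"Similarity", Order.Descending}})
in
    Sorted
```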

Exact matches at the top of the list will not cause problems, but the rows at the bottom may need your attention and some manual polishing. In any case, this variant of the merge seems more reliable to me.

P.S. But I don’t have this in my Excel!

For everyone who, after reading this article, immediately rushes to their Excel to look for fuzzy matching in Power Query, I want to clarify once again:

  • You must have an Office 365 subscription, not Office 2013, 2016, 2019, etc. Unfortunately, Microsoft’s policy at the moment is that only Office 365 subscribers get all the latest goodies and innovations such as fuzzy matching, dynamic arrays, the new XLOOKUP function (the successor to VLOOKUP), and so on.
  • You must have all the latest updates installed. Keep in mind that in some companies, IT departments deliberately delay the installation of Office updates because they may interfere with other programs (Excel integrations with ERP systems, etc.).
  • Updates reach Office 365 users in waves over several weeks. Fuzzy matching appeared on one of my computers after the updates at the end of last year, and on another at the beginning of this one. If you do not have this feature yet, wait and it will arrive 🙂

  • The free Fuzzy Lookup add-in and fuzzy text search
  • What is Power Query, Power Pivot, Power BI and why do they need an Excel user
  • Finding exact matches in a case-sensitive way with the EXACT function
