One of the most time-consuming and frustrating tasks when working with text in Excel is parsing: breaking alphanumeric “porridge” into components and extracting the fragments we need from it. For example:
- extracting the zip code from an address (fine if the zip code is always at the beginning, but what if it is not?)
- finding the invoice number and date in a payment description from a bank statement
- extracting the TIN (taxpayer identification number) from assorted company descriptions in a list of counterparties
- finding a car plate number or an article number in a description, etc.
Usually, after half an hour of dreary manual picking through the text, you start thinking about how to automate the process (especially if there is a lot of data). There are several solutions, with varying trade-offs between complexity and efficiency:
- Use the built-in Excel text functions to search, cut and glue text: LEFT, RIGHT, MID, CONCATENATE and its analogues, TEXTJOIN, EXACT, etc. This method works well if the text follows a clear logic (for example, the zip code is always at the beginning of the address). Otherwise the formulas become much more complicated and sometimes turn into array formulas, which slow down noticeably on large tables.
- Use the Like text comparison operator from Visual Basic, wrapped in a custom macro function. This allows a more flexible search with wildcard characters (*, #, ?, etc.). Unfortunately, this tool cannot extract the desired substring from the text; it can only check whether the text contains it.
In addition to the above, there is another approach that is well known among professional programmers, web developers and other techies: regular expressions (Regular Expressions = RegExp = “regexps”). Simply put, RegExp is a language in which special characters and rules are used to find the required substrings in text, extract them, or replace them with other text. Regular expressions are a very powerful and elegant tool that outperforms the other ways of working with text by an order of magnitude. Many programming languages (C#, PHP, Perl, JavaScript…) and text editors (Word, Notepad++…) support regular expressions.
Unfortunately, Microsoft Excel does not support RegExp out of the box, but this is easily fixed with VBA. Open the Visual Basic Editor from the Developer tab or with the keyboard shortcut Alt+F11. Then insert a new module through the menu Insert, Module and copy the text of the following macro function there:
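Public Function RegExpExtract(Text As String, Pattern As String, Optional Item As Integer = 1) As Variant
    ' Returns the Item-th match of Pattern in Text, or #VALUE! if there is none
    Dim regex As Object, matches As Object
    On Error GoTo ErrHandl
    Set regex = CreateObject("VBScript.RegExp")
    regex.Pattern = Pattern
    regex.Global = True                 ' find every match, not only the first
    If regex.Test(Text) Then
        Set matches = regex.Execute(Text)
        RegExpExtract = matches.Item(Item - 1).Value
        Exit Function
    End If
ErrHandl:
    ' A Variant return type is needed to pass an error value back to the sheet
    RegExpExtract = CVErr(xlErrValue)
End Function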
We can now close the Visual Basic Editor and return to Excel to try out our new function. Its syntax is the following:
=RegExpExtract( Text ; Pattern ; Item )
where
- Text: the cell with the text that we are checking and from which we want to extract the substring we need
- Pattern: the mask (pattern) for the substring search
- Item: the sequence number of the substring to extract if there are several of them (if omitted, the first occurrence is returned)
The argument separator depends on the regional settings of Excel: it is a semicolon here, but a comma in many other locales.
The most interesting thing here, of course, is Pattern – a template string of special characters “in the language” of RegExp, which specifies what exactly and where we want to find. Here are the most basic ones to get you started:
Pattern | Description |
. | The simplest pattern is a dot. It matches any character at the given position. |
\s | Any whitespace character (space, tab, or line break). |
\S | The opposite of the previous pattern, i.e. any non-whitespace character. |
\d | Any digit |
\D | The opposite of the previous one, i.e. any character that is NOT a digit |
\w | Any Latin letter (A-Z, a-z), digit, or underscore |
\W | The opposite of the previous one, i.e. not a Latin letter, not a digit and not an underscore. |
[characters] | In square brackets, you can specify one or more characters allowed at the given position in the text. For example, ст[уо]л matches either of the Russian words стул (chair) or стол (table). Instead of listing characters, you can also give a range separated by a hyphen, i.e. write [A-F] instead of [ABCDEF], or [4-7] instead of [4567]. For example, all Cyrillic characters can be written as [а-яА-ЯёЁ]. |
[^characters] | If you add the caret ^ right after the opening square bracket, the set takes on the opposite meaning: any character is allowed at the given position except those listed. So the pattern [^ЖМ]уть will find Путь or Суть, but not Жуть or Муть, for example. |
| | Logical OR operator, used to check for any of the given alternatives. For example, (счт|счет|invoice) will search the text for any of the listed words. A set of alternatives is usually enclosed in parentheses. |
^ | Beginning of line |
$ | End of line |
\b | Word boundary (the beginning or the end of a word) |
If we are looking for a certain number of characters, for example a six-digit postal code or all three-letter product codes, then quantifiers come to the rescue. These are special expressions that specify how many characters to look for. A quantifier applies to the character that comes immediately before it:
Quantifier | Description |
? | Zero or one occurrence. For example, .? means any single character or its absence. |
+ | One or more occurrences. For example, \d+ means one or more digits in a row (i.e. any whole number from 0 upwards). |
* | Zero or more occurrences, i.e. any quantity. So \s* means any number of whitespace characters, or none at all. |
{number} or {number1,number2} | If you need a strictly defined number of occurrences, specify it in curly braces. For example, \d{6} means exactly six digits, and the pattern \s{2,5} means two to five whitespace characters. |
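The quantifiers above can be tried out directly in the Immediate window of the Visual Basic Editor. The following is only a minimal sketch: the helper name AllMatches and the sample string are made up for illustration, and the code assumes Windows, because the VBScript.RegExp component it relies on is not available elsewhere.

```vba
' Sketch only: returns every match of a pattern, each wrapped in [brackets].
' Needs Windows, because VBScript.RegExp is a Windows COM component.
Function AllMatches(ByVal text As String, ByVal pattern As String) As String
    Dim re As Object, m As Object, result As String
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = pattern
    re.Global = True                       ' collect all matches, not just the first
    For Each m In re.Execute(text)
        result = result & "[" & m.Value & "]"
    Next m
    AllMatches = result
End Function

Sub DemoQuantifiers()
    ' Hypothetical sample text; note the two spaces after "Box"
    Const SAMPLE As String = "Box  A7, lot 2024, zip 101000"
    Debug.Print AllMatches(SAMPLE, "\d+")       ' [7][2024][101000]
    Debug.Print AllMatches(SAMPLE, "\d{6}")     ' [101000]
    Debug.Print AllMatches(SAMPLE, "\d{4}")     ' [2024][1010]
    Debug.Print AllMatches(SAMPLE, "\s{2,5}")   ' [  ]
    Debug.Print AllMatches(SAMPLE, "[A-Z]\d?")  ' [B][A7]
End Sub
```

Note how \d{4} also bites four digits off the six-digit number: a quantifier limits the length of a match, not the length of the surrounding run of digits.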
Now let’s move on to the most interesting part – an analysis of the application of the created function and what we learned about patterns on practical examples from life.
Extracting numbers from text
To begin with, let’s look at a simple case: you need to extract the first number from an alphanumeric jumble, for example the power rating of uninterruptible power supplies from a price list:
The logic behind the regular expression is simple: \d means any digit, and the quantifier + says that there should be one or more of them. The double minus in front of the function converts the extracted characters “on the fly” from a number stored as text into a real number.
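On the worksheet, the formula described above would presumably look like =--RegExpExtract(A2;"\d+"), where the cell reference A2 is an assumption. The same logic as a standalone VBA sketch, with a made-up function name and sample text, and with CDbl playing the role of the double minus:

```vba
' Hypothetical helper: returns the first run of digits in a string as a number,
' or -1 if the string contains no digits. Requires Windows (VBScript.RegExp).
Function ExtractFirstNumber(ByVal text As String) As Double
    Dim re As Object, found As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\d+"                 ' one or more digits in a row
    re.Global = False                  ' the first match is enough
    Set found = re.Execute(text)
    If found.Count > 0 Then
        ' Convert number-as-text to a real number, like the double minus does
        ExtractFirstNumber = CDbl(found(0).Value)
    Else
        ExtractFirstNumber = -1
    End If
End Function

Sub DemoExtractFirstNumber()
    Debug.Print ExtractFirstNumber("UPS Back-UPS 650VA black")   ' 650
End Sub
```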
Postcode
At first glance, everything is simple here: we are looking for exactly six digits in a row. We use the special character \d for a digit and the quantifier {6} for the number of characters:
However, the line may contain another long run of digits (phone number, TIN, bank account, etc.) to the left of the postcode. Then our regular expression will pull the first 6 digits out of that run, i.e. it will not work correctly:
To prevent this, we need to add the \b anchor, which denotes a word boundary, at both edges of our regular expression. This tells Excel that the fragment we need (the postcode) must be a separate word and not part of another fragment (the phone number):
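The difference between \d{6} and \b\d{6}\b can be sketched as follows. The function name, its wholeWord switch and the sample address are all hypothetical, and the code again assumes Windows with VBScript.RegExp available.

```vba
' Hypothetical helper: finds a six-digit postcode, optionally as a whole word.
Function ExtractPostcode(ByVal text As String, ByVal wholeWord As Boolean) As String
    Dim re As Object, found As Object
    Set re = CreateObject("VBScript.RegExp")
    If wholeWord Then
        re.Pattern = "\b\d{6}\b"       ' six digits with a word boundary on each side
    Else
        re.Pattern = "\d{6}"           ' any six digits in a row
    End If
    Set found = re.Execute(text)
    If found.Count > 0 Then ExtractPostcode = found(0).Value
End Function

Sub DemoExtractPostcode()
    Dim address As String
    address = "tel. 84951234567, 101000, Moscow"
    Debug.Print ExtractPostcode(address, False)   ' 849512 (cut out of the phone number)
    Debug.Print ExtractPostcode(address, True)    ' 101000
End Sub
```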
Phone
The problem with finding a phone number in text is that there are so many ways of writing one: with and without hyphens, with spaces, with or without the area code in brackets, etc. Therefore, in my opinion, it is easier to first strip all these characters from the source text with several nested SUBSTITUTE functions, so that the number is glued together into a single run of digits, and then pull out 11 digits in a row with the primitive regular expression \d{11}:
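The “clean first, then match” idea might look like this in VBA, where the built-in Replace function stands in for the nested SUBSTITUTE calls. The list of characters to strip, the function name and the sample text are assumptions made for the sketch.

```vba
' Hypothetical helper: strips separators, then takes the first 11 digits in a row.
' Requires Windows (VBScript.RegExp).
Function ExtractPhone(ByVal text As String) As String
    Dim cleaned As String, ch As Variant
    Dim re As Object, found As Object
    cleaned = text
    For Each ch In Array(" ", "-", "(", ")", "+")
        cleaned = Replace(cleaned, ch, "")   ' same role as nested SUBSTITUTE calls
    Next ch
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\d{11}"
    Set found = re.Execute(cleaned)
    If found.Count > 0 Then ExtractPhone = found(0).Value
End Function

Sub DemoExtractPhone()
    Debug.Print ExtractPhone("Call +7 (495) 123-45-67 after lunch")   ' 74951234567
End Sub
```

Keep in mind that removing spaces can also glue unrelated numbers together, so this approach suits cells that contain a single phone number.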
TIN
It’s a little more complicated here, because a TIN (in Russia) can have 10 digits (for legal entities) or 12 digits (for individuals). If you are not too picky, you can make do with the regular expression \d{10,12}, but, strictly speaking, it will pull out every number from 10 to 12 digits long, including an erroneously entered 11-digit one. It would be more correct to use two patterns joined by the logical OR operator | (vertical bar):
Please note that the pattern looks for 12-digit numbers first, and only then for 10-digit ones. If we write the regular expression the other way around, it will pull out only the first 10 characters in every case, even from long 12-digit TINs. That is, once the first alternative has matched, no further checks are made:
This is the fundamental difference between the | operator and the standard Excel logical function OR, where rearranging the arguments does not change the result.
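The effect of the order of alternatives can be checked with a short sketch. The two patterns, \d{12}|\d{10} and \d{10}|\d{12}, are reconstructed from the description above rather than copied from the original formula, and the function name and sample text are hypothetical.

```vba
' Hypothetical helper: extracts a TIN, trying the alternatives in the chosen order.
' Requires Windows (VBScript.RegExp).
Function ExtractTin(ByVal text As String, ByVal longFirst As Boolean) As String
    Dim re As Object, found As Object
    Set re = CreateObject("VBScript.RegExp")
    ' Alternatives are tried left to right; the first one that matches wins
    re.Pattern = IIf(longFirst, "\d{12}|\d{10}", "\d{10}|\d{12}")
    Set found = re.Execute(text)
    If found.Count > 0 Then ExtractTin = found(0).Value
End Function

Sub DemoExtractTin()
    Const SAMPLE As String = "Ivanov I.I., TIN 770123456789"
    Debug.Print ExtractTin(SAMPLE, True)    ' 770123456789
    Debug.Print ExtractTin(SAMPLE, False)   ' 7701234567 (truncated)
End Sub
```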
Product SKUs
In many companies, unique identifiers are assigned to goods and services – articles, SAP codes, SKUs, etc. If there is logic in their notation, then they can be easily pulled out of any text using regular expressions. For example, if we know that our articles always consist of three capital English letters, a hyphen and a subsequent three-digit number, then:
The logic behind the pattern is simple. [A-Z] means any capital letter of the Latin alphabet. The quantifier {3} that follows says that we need exactly three such letters. After the hyphen we expect three digits, so we add \d{3} at the end.
Cash amounts
In a similar way, you can also pull prices (costs, VAT…) out of product descriptions. If monetary amounts are written with a hyphen, for example, then:
The pattern \d with the quantifier + matches any number up to the hyphen, and \d{2} then matches the kopecks (two digits) after it.
If you need to extract not the price but the VAT, you can use the third, optional argument of our RegExpExtract function, which specifies the ordinal number of the match to extract. And, of course, you can use the SUBSTITUTE function to replace the hyphen in the result with the standard decimal separator, and add a double minus at the beginning so that Excel interprets the extracted VAT as a normal number:
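A sketch of the same steps in VBA: find all amounts of the form “digits, hyphen, two digits”, pick one by its ordinal number, and turn it into a number. The pattern \d+-\d{2} follows from the description above; the function name and the sample text are assumptions. The function is meant for an Excel module, since it uses CVErr(xlErrValue), and requires Windows.

```vba
' Hypothetical helper: returns the item-th amount written as "1250-50" as a number.
Function ExtractAmount(ByVal text As String, Optional ByVal item As Long = 1) As Variant
    Dim re As Object, found As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "\d+-\d{2}"           ' whole part, hyphen, two digits
    re.Global = True                   ' needed to reach the 2nd, 3rd... match
    Set found = re.Execute(text)
    If item >= 1 And item <= found.Count Then
        ' Swap the hyphen for a dot; Val always reads a dot as the decimal separator
        ExtractAmount = Val(Replace(found(item - 1).Value, "-", "."))
    Else
        ExtractAmount = CVErr(xlErrValue)
    End If
End Function

Sub DemoExtractAmount()
    Const SAMPLE As String = "Cable, price 1250-50, incl. VAT 208-42"
    Debug.Print ExtractAmount(SAMPLE)       ' 1250.5
    Debug.Print ExtractAmount(SAMPLE, 2)    ' 208.42
End Sub
```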
Car plate numbers
If we leave aside special vehicles, trailers and motorcycles, a standard car plate number follows the scheme “letter, three digits, two letters, region code”. The region code can have 2 or 3 digits, and only those Cyrillic letters that look like Latin ones are used. Thus, the following regular expression will help us extract plate numbers from text:
Time
To extract the time in the HH:MM format, the following regular expression is suitable:
The fragment after the colon, [0-5]\d, matches any number in the range 00-59, as is easy to see. Before the colon, two patterns in parentheses are joined by a logical OR (pipe):
- [0-1]\d: any number in the range 00-19
- 2[0-3]: any number in the range 20-23
You can additionally apply the standard Excel function TIME (or TIMEVALUE, which accepts a text string directly) to the result in order to convert it into a real time value that Excel understands and that is suitable for further calculations.
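Putting the fragments described above together gives the pattern ([0-1]\d|2[0-3]):[0-5]\d, which is a reconstruction from the explanation rather than the original formula. A minimal sketch, with a made-up function name, that also converts the match into a real time value (TimeSerial is used here instead of a worksheet function; Windows and an Excel module are assumed):

```vba
' Hypothetical helper: finds the first HH:MM time and returns it as a Date value.
Function ExtractTime(ByVal text As String) As Variant
    Dim re As Object, found As Object, hhmm As String
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "([0-1]\d|2[0-3]):[0-5]\d"    ' 00-19 or 20-23, colon, 00-59
    Set found = re.Execute(text)
    If found.Count > 0 Then
        hhmm = found(0).Value                   ' always 5 characters, e.g. "14:30"
        ExtractTime = TimeSerial(CInt(Left$(hhmm, 2)), CInt(Right$(hhmm, 2)), 0)
    Else
        ExtractTime = CVErr(xlErrValue)
    End If
End Function

Sub DemoExtractTime()
    Debug.Print Format$(ExtractTime("Meeting moved to 14:30, room 5"), "hh:nn")   ' 14:30
    Debug.Print IsError(ExtractTime("Invalid time 24:75"))                        ' True
End Sub
```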
Password check
Suppose that we need to check the list of passwords invented by users for correctness. According to our rules, passwords can only contain English letters (lowercase or uppercase) and numbers. Spaces, underscores and other punctuation marks are not allowed.
Checking can be organized using the following simple regular expression:
In effect, with such a pattern we require that only characters from the set given in square brackets appear between the beginning (^) and the end ($) of our text. If you also need to check the length of the password (for example, at least 6 characters), the quantifier + can be replaced by the interval “six or more” in the form {6,}:
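A check of this kind does not need to extract anything, so the Test method of the regex object is enough. The sketch below assumes the character set [a-zA-Z0-9], which matches the stated rules but is not copied from the original formula; the function name and the minLength parameter are hypothetical. Windows is required.

```vba
' Hypothetical helper: True if the password contains only Latin letters and digits
' and is at least minLength characters long.
Function IsValidPassword(ByVal password As String, Optional ByVal minLength As Long = 6) As Boolean
    Dim re As Object
    Set re = CreateObject("VBScript.RegExp")
    ' ^ and $ force the whole string, not just a part of it, to fit the set
    re.Pattern = "^[a-zA-Z0-9]{" & minLength & ",}$"
    IsValidPassword = re.Test(password)
End Function

Sub DemoIsValidPassword()
    Debug.Print IsValidPassword("Abc123")      ' True
    Debug.Print IsValidPassword("abc 123")     ' False (space)
    Debug.Print IsValidPassword("abc_1234")    ' False (underscore)
    Debug.Print IsValidPassword("Ab1")         ' False (too short)
End Sub
```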
City from address
Let’s say we need to pull the city out of an address line. A regular expression that extracts the text from “g.” (the transliterated Russian abbreviation for “city”) to the next comma will help:
Let’s take a closer look at this pattern.
If you have read the text above, you already know that some characters in regular expressions (periods, asterisks, dollar signs, etc.) have a special meaning. If you need to search for these characters themselves, put a backslash in front of them (this is called escaping). Therefore, when searching for the fragment “g.” we have to write g\. in the regular expression; if we are looking for a plus sign, we write \+, and so on.
The next two characters in our pattern, the dot and the asterisk quantifier, stand for any number of any characters, i.e. any city name.
There is a comma at the end of the pattern, because we are looking for the text from “g.” up to a comma. But there can be several commas in the text, right? Not only after the city, but also after the street, the house number, etc. At which of them will our search stop? That is what the question mark is for. Without it, our regular expression would pull out the longest string possible:
In regular expression terms, such a pattern is “greedy”. The question mark fixes this: it makes the quantifier it follows “lazy”, and our search takes the text only up to the first comma encountered after “g.”:
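Greedy and lazy matching can be compared side by side. The patterns g\..*, and g\..*?, are reconstructed from the explanation above, with the Latin “g.” standing in for the Cyrillic abbreviation; the function name and the sample address are assumptions. Windows is required.

```vba
' Hypothetical helper: returns the first match of a pattern, or "" if none.
Function FirstMatch(ByVal text As String, ByVal pattern As String) As String
    Dim re As Object, found As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = pattern
    Set found = re.Execute(text)
    If found.Count > 0 Then FirstMatch = found(0).Value
End Function

Sub DemoGreedyVsLazy()
    Const SAMPLE As String = "g. Moscow, Lenina st., 5, apt. 12"
    ' Greedy: .* grabs as much as it can, so the match ends at the LAST comma
    Debug.Print FirstMatch(SAMPLE, "g\..*,")     ' g. Moscow, Lenina st., 5,
    ' Lazy: .*? grabs as little as it can, so the match ends at the FIRST comma
    Debug.Print FirstMatch(SAMPLE, "g\..*?,")    ' g. Moscow,
End Sub
```

In both cases the match still includes the leading “g.” and the trailing comma, so they may need to be trimmed afterwards.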
Filename from full path
Another very common task is to extract the file name from a full path. A simple regular expression of the following form will help here:
The trick is that the search effectively runs in the opposite direction, from the end to the beginning: our pattern ends with $, and we are looking for everything before it up to the first backslash from the right. The backslash is escaped, like the dot in the previous example.
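One pattern that fits this description is [^\\]+$, i.e. “one or more characters that are not a backslash, up to the end of the string”. This is an assumption about the original formula, and the function name and sample path below are made up. Windows is required.

```vba
' Hypothetical helper: returns everything after the last backslash in a path.
Function ExtractFileName(ByVal fullPath As String) As String
    Dim re As Object, found As Object
    Set re = CreateObject("VBScript.RegExp")
    ' [^\\] is "any character except a backslash"; $ pins the match to the end
    re.Pattern = "[^\\]+$"
    Set found = re.Execute(fullPath)
    If found.Count > 0 Then ExtractFileName = found(0).Value
End Function

Sub DemoExtractFileName()
    Debug.Print ExtractFileName("C:\Reports\2024\sales.xlsx")   ' sales.xlsx
End Sub
```

A path that ends with a backslash yields an empty string, because there is no file name after the last separator.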
PS
In closing, I want to point out that all of the above is only a small part of what regular expressions can do. There are many more special characters and rules for using them, and entire books have been written on the topic (I recommend at least this one for a start). In a way, writing regular expressions is almost an art. An existing regular expression can almost always be improved or extended, making it more elegant or able to handle a wider range of input data.
To analyze and parse other people’s regular expressions or to debug your own, there are several convenient online services: RegEx101, RegExr and others.
Unfortunately, not all features of classic regular expressions are supported in VBA (for example, lookbehind or POSIX classes), and not all of them work with Cyrillic, but I think what is available is enough to make you happy at first.
If you are not new to the topic and have something to share, leave regular expressions that are useful for working in Excel in the comments below. One mind is good, but two boots are a pair!
- Replacing and cleaning up text with the SUBSTITUTE function
- Search and highlighting of Latin characters in text
- Search for the nearest similar text (Ivanov = Ivonov = Ivanof, etc.)