If your tables have only a few dozen rows, this article is most likely not relevant to you. On such small amounts of data, any method works fast enough that you will not notice the difference. But if your lists run to thousands of rows and there are more than one or two tables, then the painful wait for Excel to recalculate formulas can stretch to several minutes.
In that case, the choice of function used to link the tables plays a decisive role – the difference in performance between them, as we will see below, can be more than 20 times!
When I wrote my first book five years ago, I already ran a comparative speed test of various ways to look up and substitute data with VLOOKUP, INDEX + MATCH, SUMIF, etc. Since then, three versions of Office have come and gone, and the Power Query and Power Pivot add-ins have appeared and radically changed the whole process of working with data. And last year the Excel calculation engine was updated as well, gaining support for dynamic arrays and new functions such as XLOOKUP and FILTER.
So it’s time to pick up the stopwatch again and find out who is the fastest. And, while we are at it, to check how many ways of looking up and substituting data in Excel you know 🙂
Guinea pig
We will test on the following example:
This is an Excel workbook with one sheet that holds two tables: shipments (500,000 rows) and a price list (about 600 rows). Our task is to pull the prices from the price list into the shipments table. For each method, we will enter the formula in cell C2, copy it down the entire column, and measure the time Excel takes to calculate the whole column of half a million cells. The values obtained depend, of course, on many factors (processor generation, amount of RAM, current system load, Office version, etc.), but we are interested not in the specific numbers but in how they compare with each other. What matters is understanding how resource-hungry each method is and what its limitations are.
Method 1. VLOOKUP
The classic comes first 🙂 The legendary vertical lookup function, VLOOKUP, is the one that first comes to mind in such situations:
The following arguments are involved here:
- B2 – the lookup value, i.e. the name of the product we want to find in the price list
- $G$2:$H$600 – an absolute reference to the price list, fixed with dollar signs so that it does not slide down when the formula is copied down
- 2 – the number of the column in the price list from which we want to take the price
- 0 or FALSE – switches to exact-match mode, in which any misspelled product name in column B of the shipments table (for example, FONERA) makes the function return the #N/A error.
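Put together, the four arguments listed above give a formula along these lines. This is a sketch that assumes the semicolon argument separator used elsewhere in this article; with US regional settings the separator would be a comma.

```excel
=VLOOKUP(B2;$G$2:$H$600;2;0)
```

Entered in row 2 of the shipments table and copied down, only the relative reference B2 changes from row to row, while the price list range stays fixed.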
Calculation time = 4.3 sec.
Method 2. VLOOKUP with selection of entire columns
Many users, when filling in the second argument of VLOOKUP, where the lookup table (the price list) is specified, select not a bounded range ($G$2:$H$600) but the entire columns G:H. It is easier and faster, and it lets you forget about F4 and about the fact that tomorrow the price list may be several rows longer. The formula in this case also looks more compact:
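A sketch of the whole-column variant, assuming the same layout (product names in column G, prices in column H) and the semicolon separator:

```excel
=VLOOKUP(B2;G:H;2;0)
```

No dollar signs are needed here, because a whole-column reference does not shift when the formula is copied down.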
In older versions of Excel, such a selection did not greatly affect calculation speed, but now (unexpectedly for me, I confess) the result turned out to be several times worse than the previous one.
Calculation time = 14.5 sec.
Method 3. INDEX and MATCH
The next evolutionary step after VLOOKUP for many Microsoft Excel users is usually the switch to a combination of the INDEX and MATCH functions. This formula looks like this:
Here:
The INDEX function retrieves the contents of the cell with a given position from the range specified in its first argument (the column $H$2:$H$600 with prices in the price list). That position, in turn, is determined by the MATCH function, which has three arguments:
- What to find – the name of the product from B2
- Where to look for it – the column with product names in the price list ($G$2:$G$600)
- Search mode: 0 – exact match; 1 or -1 – approximate match, returning the nearest smaller or the nearest larger value, respectively.
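Assembled from the description above, the combination might be written as follows (a sketch, with semicolons as argument separators). MATCH finds the position of the product in column G, and INDEX returns the price at the same position in column H:

```excel
=INDEX($H$2:$H$600;MATCH(B2;$G$2:$G$600;0))
```

Both ranges must start on the same row and have the same height, otherwise the position returned by MATCH points at the wrong price.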
The formula comes out a little more complicated, but at the same time it has several tangible advantages over the classical VLOOKUP, namely:
- No need to count off the column number (as in the third argument of VLOOKUP).
- You can extract data that is located to the left of the column where the search occurs.
In terms of speed, however, this method is almost twice as slow as VLOOKUP:
Calculation time = 7.8 sec.
If, on top of that, you get lazy and select entire columns instead of bounded ranges:
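With entire columns, the same combination would look roughly like this (a sketch, assuming the same column layout):

```excel
=INDEX(H:H;MATCH(B2;G:G;0))
```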
… then the result is quite sad:
Calculation time = 28.5 sec.
28 seconds, Carl! More than 6 times slower than VLOOKUP!
Method 4. SUMIF
If you need to find numeric rather than text data (as in our case, the price), then instead of VLOOKUP you can use the SUMIF function. It was originally conceived as a tool for summing data selectively by a condition (find and add up all sales of cables, for example), but you can also make it look up the product we need in the price list. If the products in the list are not repeated, there is nothing to add up, and the function simply returns the desired value:
Here:
- The first argument to SUMIF is the range of cells to check, i.e. product names in the price list ($G$2:$G$600).
- The second argument (B2) is what we are looking for.
- The third argument is the range of cells with prices $H$2:$H$600, the numbers of which we want to sum up, if the neighboring cells of the checked range contain the desired value.
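The three arguments above combine into a formula of this shape (a sketch, with the semicolon separator):

```excel
=SUMIF($G$2:$G$600;B2;$H$2:$H$600)
```

Note that a product missing from the price list produces 0 here rather than #N/A, since a sum over no matching rows is zero.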
The obvious disadvantage of this approach is that it only works with numbers. It is also inconvenient if the price list is in a separate file: you will have to keep that file open all the time, because SUMIF cannot read data from closed workbooks, unlike VLOOKUP, for which this is not a problem.
On the plus side, this approach is convenient when you need to search by several columns at once; the more advanced version of this function, SUMIFS, is ideal for that. The calculation speed, however, is very mediocre:
Calculation time = 12.8 sec.
When entire columns are selected, i.e. with a formula like =SUMIF(G:G; B2; H:H), it gets even worse:
Calculation time = 41.7 sec.
This is the worst result in our test.
Method 5. SUMPRODUCT
This approach is not common nowadays, but it still turns up fairly regularly. It is usually old-school users who like to indulge in it, the ones who still remember the days when Excel had only 256 columns and 56 colors 🙂
The essence of this method is to use the SUMPRODUCT function, originally intended for multiplying several ranges element by element and then summing the resulting products. In our case, one of the arrays is replaced by a condition, and the other holds the prices:
The expression ($G$2:$G$600=B2) checks each product name in the price list against the lookup value (PLYWOOD PR). The result of each comparison is the Boolean value TRUE or FALSE, which Excel treats as 1 and 0, respectively, in arithmetic. Multiplying these zeros and ones by the prices then keeps only the price of the product we need.
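A sketch of such a formula, built from the condition described above multiplied by the price column (semicolon-separator convention assumed, although this particular formula has a single argument):

```excel
=SUMPRODUCT(($G$2:$G$600=B2)*$H$2:$H$600)
```

If the product occurs once in the price list, every product in the sum is zero except one, so the result is that product's price.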
This formula is, in essence, an array formula, but it does not require the usual Ctrl+Shift+Enter shortcut, because SUMPRODUCT handles arrays natively. Perhaps for the same reason (array formulas are always slower than ordinary ones), its recalculation speed is not very good:
Calculation time = 11.8 sec.
The advantages of this approach are:
- Compatibility with any version of Excel, even the most ancient.
- The ability to set complex conditions (and several of them).
- The ability to work with data from closed files if you add a double unary negation (two consecutive minus signs) to it. SUMIFS cannot boast of that.
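One common way to write the double negation is to place it in front of the comparison and pass the prices as a separate argument. This is only a sketch of that placement: it is shown on the same-sheet ranges used throughout the article, and for a closed file the two ranges would instead be external references to the other workbook.

```excel
=SUMPRODUCT(--($G$2:$G$600=B2);$H$2:$H$600)
```

The two minus signs convert TRUE and FALSE into 1 and 0, which SUMPRODUCT needs when the condition is passed as an argument of its own rather than multiplied by the prices.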
Method 6. LOOKUP
Another relatively exotic way of looking up and substituting data, alongside VLOOKUP, is the LOOKUP function. Just do not confuse it with the new, very recently introduced XLOOKUP function, which we will talk about later. LOOKUP has existed in Excel since the earliest versions and can also solve our problem:
Here:
- B2 – the name of the cargo we are looking for
- $G$2:$G$600 – one-dimensional range-vector (column or row) where we are looking for a match
- $H$2:$H$600 – the same size range from which to return the found result (price)
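With the three arguments above, the formula would take roughly this form (a sketch, with semicolons as separators):

```excel
=LOOKUP(B2;$G$2:$G$600;$H$2:$H$600)
```

There is no argument for choosing the match mode here: LOOKUP always performs an approximate match.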
At first glance, everything looks very convenient and logical, but two non-obvious points spoil the whole picture:
- This function requires the price list to be sorted in ascending (alphabetical) order and does not work correctly without it.
- If the lookup value is written with a typo in the shipments table (for example, AGEDOL instead of AGIDOL), LOOKUP returns not the #N/A error but the price of the nearest preceding item.
When working with imperfect real-world data, this is guaranteed to cause problems, as you can imagine.
The calculation speed of LOOKUP is pretty decent:
Calculation time = 7.6 sec.
Method 7. The new XLOOKUP function
This function arrived with one of the recent updates, so far only for Office 365 users, and is not available in any other version (Excel 2010, 2013, 2016, 2019). Compared to the classic VLOOKUP, it has a lot of advantages (simpler syntax, the ability to search not only from top to bottom, the ability to set a value to return instead of #N/A right away, etc.). The formula for our problem looks like this:
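A sketch of the basic three-argument form, assuming the same ranges as in the earlier methods and the semicolon separator. The second line uses the optional fourth argument to return a value instead of #N/A; the text "not found" is just an illustrative placeholder.

```excel
=XLOOKUP(B2;$G$2:$G$600;$H$2:$H$600)
=XLOOKUP(B2;$G$2:$G$600;$H$2:$H$600;"not found")
```

Unlike LOOKUP, XLOOKUP searches for an exact match by default, so the price list does not have to be sorted.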
If you leave aside the optional fourth, fifth and sixth arguments, the syntax of this function is exactly the same as that of its predecessor, LOOKUP. The calculation speed on our 500,000 rows also turned out to be the same:
Calculation time = 7.6 sec.
That is almost twice as slow as VLOOKUP, the very function Microsoft now suggests replacing with XLOOKUP. It’s a pity.
And, again, if you get lazy and select the price list ranges as entire columns:
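The whole-column version would look something like this (a sketch, same column layout assumed):

```excel
=XLOOKUP(B2;G:G;H:H)
```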
… then the speed drops to completely indecent values:
Calculation time = 28.3 sec.
What about dynamic arrays?
Last year’s (fall 2019) update of the Microsoft Excel calculation engine added support for dynamic arrays, which I have already written about. It is a groundbreaking approach to working with data that can be used with almost any classic Excel function. For example, VLOOKUP would look like this:
The difference from the classic version is that the first argument of VLOOKUP here is not a single lookup value (with the formula then copied down to the remaining rows) but the entire array of half a million products, B2:B500000, whose prices we want to find. The formula then spills down by itself, occupying as many cells as needed.
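A sketch of the dynamic-array form described above, entered once in the top cell of the result column (semicolon separator assumed):

```excel
=VLOOKUP(B2:B500000;$G$2:$H$600;2;0)
```

The cells below the formula must be empty for the results to spill; otherwise Excel shows a #SPILL! error instead.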
The recalculation speed in this variant, frankly, stunned me: there was almost no pause between pressing Enter after typing the formula and getting the results.
Calculation time = 1 sec.
Interestingly, the new XLOOKUP, the old LOOKUP and the INDEX + MATCH combination were all very fast in this mode as well: the calculation time was no more than 1 second! Fantastic.
But the old-school approaches based on SUMPRODUCT and SUMIF(S) refused to work with dynamic arrays 🙁
What about smart tables?
Delighted by the fantastic results on dynamic arrays, I decided to test the difference in speed between regular and “smart” tables. I mean those “beautiful tables” into which you can convert a range with the Format as Table command on the Home tab or with the keyboard shortcut Ctrl+T.
If we first turn our shipments and price list into “smart” tables (by default they get the names Table1 and Table2, respectively), then the same VLOOKUP formula will look like this:
Here:
- [@Cargo] – a reference to cell B2, meaning, in this case, that the value should be taken from the same row of the Cargo column of the current smart table.
- Table2 – a reference to the price list
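With structured references, the formula might be written like this. It is a sketch that assumes the default table names Table1 and Table2, a column named Cargo in the shipments table, and the semicolon separator:

```excel
=VLOOKUP([@Cargo];Table2;2;0)
```

Neither reference needs dollar signs or row numbers, which is why the formula keeps working when either table grows.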
A big plus of this approach is that data can easily be added to our tables later. When new rows are added to the shipments or to the price list, our “smart” tables stretch automatically.
The speed, as it turned out, also improves very significantly and is roughly equal to the speed on dynamic arrays:
Calculation time = 1 sec.
I suspect that the reason is not the “smart” tables themselves but the same update of the calculation engine, because I do not remember such a speed-up on smart tables in older versions of Excel.
Bonus. Power Query
Wait, wait! For the sake of completeness, let’s compare the methods listed above with a Power Query query, which can also solve our problem. Some will say that it is unfair to compare formula recalculation with query refresh, but frankly, I was simply curious: which is faster?
So:
- We turn both of our tables into “smart” ones using the Format as Table command on the Home tab or the keyboard shortcut Ctrl+T.
- We load the tables into Power Query one by one using the command Data – From Table/Range.
- After loading each table into Power Query, we return to Excel, leaving the loaded data as a connection only. To do this, in the Power Query window, select Home – Close & Load – Close & Load To… – Only Create Connection.
- Once both source tables are loaded as connections, we create one more, third query, which will join them, substituting prices from the price list into the shipments. To do this, on the Data tab choose Get Data / New Query – Combine Queries – Merge.
- In the window that opens, select the source tables in the drop-down lists and select the columns on which to join them.
- After clicking OK, we return to the Power Query window, where we see our shipments table with an added column in which each cell holds the fragment of the price list that matches this product. Expand the nested tables using the button with double arrows in the column header, selecting the data we need (the prices).
- It remains to load the finished table back onto the sheet using the already familiar command Home – Close & Load.
Unlike formulas, Power Query queries are not updated automatically on the fly; you have to right-click the table (or the query in the right pane) and select the Refresh command. You can also use the Refresh All command on the Data tab.
Update time = 8.2 sec.
Final table and conclusions
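If you have honestly read this far, you have probably already drawn some conclusions of your own. If you skipped all the details and jumped straight to the results, here are the timings of all the methods in one place:
- VLOOKUP – 4.3 sec. (entire columns: 14.5 sec.)
- INDEX + MATCH – 7.8 sec. (entire columns: 28.5 sec.)
- SUMIF – 12.8 sec. (entire columns: 41.7 sec.)
- SUMPRODUCT – 11.8 sec.
- LOOKUP – 7.6 sec.
- XLOOKUP – 7.6 sec. (entire columns: 28.3 sec.)
- VLOOKUP on dynamic arrays – 1 sec.
- VLOOKUP on smart tables – 1 sec.
- Power Query – 8.2 sec. (refresh time)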
Of course, each of us has our own preferences, tasks and quirks, but for myself I formulated the conclusions from this testing as follows:
- VLOOKUP is still the main workhorse. After last year’s updates that sped up VLOOKUP and the autumn update of the calculation engine, this function has taken on a new lease of life and is going at full blast.
- Do not be lazy and select entire columns: in every method tested that way, it made the results more than 3 times worse.
- Exotic methods from the past such as SUMPRODUCT and SUMIF can go into the furnace. They are very slow and, in addition, do not support dynamic arrays.
- Dynamic arrays and smart tables are the future.
Unfortunately, I did not have the opportunity to fully test these methods on older versions of Excel and on Excel for Mac (running Office emulation in a virtual machine and testing speed is not right). I would be grateful if you can take the time to run these methods on your PCs and versions and share the results and your thoughts in the comments so that together we can get the whole picture.
- How to use the VLOOKUP function to substitute values in Excel
- The XLOOKUP function as the successor to VLOOKUP
- 5 ways to use the INDEX function