Parsing non-table data from websites

Contents

Post navigation
Leave a comment
- Отменить ответ
Latest news
Recent comments
records
Categories

Problem with non-table data

With download to Excel tabular Internet data is no problem. The Power Query add-in in Excel makes this task easy in seconds. It is enough to select on the tab Data Command From the Internet (Data — From internet), paste the address of the desired web page (for example, key indicators of the Central Bank) and click OK:

Power Query will automatically recognize all the tables on the web page and list them in a window. Navigator:

Then it remains to select the desired table by typing and load it into Power Query for further processing (button Convert Data) or directly to the Excel sheet (button Download).

If data is loaded from the site you need according to the above scenario, consider yourself lucky.

Unfortunately, quite often there are sites where, when attempting such a load, Power Query “does not see” tables with the necessary data, i.e. in the window Navigator there simply aren’t any Table 0,1,2… or among them there is no table with the information we need. There may be several reasons for this, but most often this happens because the web designer used a non-standard tag construction in the HTML code of the page when creating the table.

, and its analogue is nested container tags

. This is a very common technique when laying out websites, but, unfortunately, Power Query is not yet able to recognize such markup and load such data into Excel.

However, there is a way around this limitation 😉

As a practice, let’s try to download prices and product descriptions from the Wildberries marketplace – for example, books from the Detectives section:

Loading HTML code instead of a web page

First, we use the same approach – select the command From the Internet tab Data (Data — From internet) and enter the address of the page we need:

https://www.wildberries.ru/catalog/knigi/hudozhestvennaya-literatura/detektivy

After clicking on OK a window will appear Navigator, where we will no longer see any useful tables, except for an incomprehensible Document:

Then the most interesting begins. Click on the button Convert Data (Transform Data)to still load the contents of the table Documentto the Power Query Query Editor. In the window that opens, delete the step Navigation (Navigation) red cross:

… and then click on the gear icon to the right of the step Source (Source)to open its options:

In drop down list Open file as (Open file as) instead of the default HTML page selected there, select Text file (Text file). This will cause Power Query to interpret the downloaded data not as a web page, but as plain text, i.e. Power Query will not try to recognize HTML tags and their attributes, links, pictures, tables, but will simply process the page source code as text.

After clicking on OK we will just see this HTML code (it can be very voluminous – do not be alarmed):

Looking for something to grab onto

Now we need to understand which tags, attributes or labels in the code we can focus on in order to extract the data we need about the products from this heap of text. Of course, it all depends on the specific site and the web programmer who wrote it, and you will have to improvise already.

In the case of Wildberries, scrolling this code down to the products, you can easily find simple logic:

Lines with prices always contain the label lower-price
Lines with the brand name – always with a label brand-name c-text-sm
The name of the product can be found by the tag goods-name c-text-sm

Sometimes the search process can be greatly simplified by using the code debugging tools that are now available in any modern browser. By right-clicking on any element of a web page (for example, a price or product description), you can select the command from the context menu Inspect (Inspect) and then view the code in a convenient window right next to the content of the site:

Filtering the required data

Now, in a completely standard way, let’s filter the lines we need in the page code by detected tags. To do this, select in the Power Query window in the filter [1] an option Text Filters – Contains (Text filters — Contains), switch to the mode Details (Advanced) [2] and enter our criteria:

Adding conditions is done by a button with a funny name Add offer [3]. And do not forget to set a logical connective for all conditions Or (OR) instead И (And) in the drop down lists on the left [4] Otherwise, the filtering simply won’t work.

After clicking on OK only lines with the information we need will remain on the screen:

We clean the trash

It remains to clean all this from debris in any way that is suitable and convenient for you personally (there are many of them). For example, like this:

Remove start tag by replacing with empty: via command Home – Replacing Values (Home — Replace values).
Divide the resulting column by the first delimiter “>» left command Home — Split Column — By Delimiter (Home — Split column — By delimiter) and then divide the resulting column again by the first occurrence of the delimiter “<» on the left to separate the payload from the tags:
Delete extra columns, and in the rest, replace the standard HTML construction “ to normal quotes.

As a result, we get our data in a much more presentable form:

Parsing blocks into columns

If you look closely, the information about each individual product in the resulting list is grouped into blocks of three cells. Of course, it would be much more convenient for us to work with this table if these blocks were turned into separate columns: price, brand (publisher) and name.

You can perform such a transformation very easily – with the help of, literally, one line of code in the M language built into Power Query. To do this, click on the button fx in the formula bar (if you can’t see it, then enable it on the tab Review (View)) and enter the following construction:

= Table.FromRows(List.Split(#”Replaced value1″[Column1.2.1],3))

Here the function List.Split splits the column with the name Column1.2.1 from our table from the previous step #”Replaced value1″ into pieces by 3 cells and then a function Table.FromRows converts the resulting nested lists back into a table – already from three columns:

Well, then it’s a matter of technology – to set up the number formats of the columns, rename them and place them in the right order. And upload the resulting beauty back to the Excel sheet with the command Home — Close and download (Home — Close & Load…)

That’s all the tricks 🙂

Importing a bitcoin rate from a website using Power Query
Parsing text with regular expressions (RegExp) in Power Query
Parameterizing Data Paths in Power Query

Отменить ответ

Your email address will not be published. Required fields are marked *

Comment *

First name *

Email *

Site