Parsing non-table data from websites

Problem with non-table data

With download to Excel tabular Internet data is no problem. The Power Query add-in in Excel makes this task easy in seconds. It is enough to select on the tab Data Command From the Internet (Data — From internet), paste the address of the desired web page (for example, key indicators of the Central Bank) and click OK:

Parsing non-table data from websites

Power Query will automatically recognize all the tables on the web page and list them in a window. Navigator:

Parsing non-table data from websites

Then it remains to select the desired table by typing and load it into Power Query for further processing (button Convert Data) or directly to the Excel sheet (button Download).

If data is loaded from the site you need according to the above scenario, consider yourself lucky.

Unfortunately, quite often there are sites where, when attempting such a load, Power Query “does not see” tables with the necessary data, i.e. in the window Navigator there simply aren’t any Table 0,1,2… or among them there is no table with the information we need. There may be several reasons for this, but most often this happens because the web designer used a non-standard tag construction in the HTML code of the page when creating the table.

, and its analogue is nested container tags

. This is a very common technique when laying out websites, but, unfortunately, Power Query is not yet able to recognize such markup and load such data into Excel.

However, there is a way around this limitation 😉

As a practice, let’s try to download prices and product descriptions from the Wildberries marketplace – for example, books from the Detectives section:

Parsing non-table data from websites

Loading HTML code instead of a web page

First, we use the same approach – select the command From the Internet tab Data (Data — From internet) and enter the address of the page we need:

https://www.wildberries.ru/catalog/knigi/hudozhestvennaya-literatura/detektivy

After clicking on OK a window will appear Navigator, where we will no longer see any useful tables, except for an incomprehensible Document:

Parsing non-table data from websites

Then the most interesting begins. Click on the button Convert Data (Transform Data)to still load the contents of the table Documentto the Power Query Query Editor. In the window that opens, delete the step Navigation (Navigation) red cross:

Parsing non-table data from websites

… and then click on the gear icon to the right of the step Source (Source)to open its options:

Parsing non-table data from websites

In drop down list Open file as (Open file as) instead of the default HTML page selected there, select Text file (Text file). This will cause Power Query to interpret the downloaded data not as a web page, but as plain text, i.e. Power Query will not try to recognize HTML tags and their attributes, links, pictures, tables, but will simply process the page source code as text.

After clicking on OK we will just see this HTML code (it can be very voluminous – do not be alarmed):

Parsing non-table data from websites

Looking for something to grab onto

Now we need to understand which tags, attributes or labels in the code we can focus on in order to extract the data we need about the products from this heap of text. Of course, it all depends on the specific site and the web programmer who wrote it, and you will have to improvise already.

In the case of Wildberries, scrolling this code down to the products, you can easily find simple logic:

Parsing non-table data from websites

  • Lines with prices always contain the label lower-price
  • Lines with the brand name – always with a label brand-name c-text-sm
  • The name of the product can be found by the tag goods-name c-text-sm

Sometimes the search process can be greatly simplified by using the code debugging tools that are now available in any modern browser. By right-clicking on any element of a web page (for example, a price or product description), you can select the command from the context menu Inspect (Inspect) and then view the code in a convenient window right next to the content of the site:

Parsing non-table data from websites

Filtering the required data

Now, in a completely standard way, let’s filter the lines we need in the page code by detected tags. To do this, select in the Power Query window in the filter [1] an option Text Filters – Contains (Text filters — Contains), switch to the mode Details (Advanced) [2] and enter our criteria:

Parsing non-table data from websites

Adding conditions is done by a button with a funny name Add offer [3]. And do not forget to set a logical connective for all conditions Or (OR) instead И (And) in the drop down lists on the left [4] Otherwise, the filtering simply won’t work.

After clicking on OK only lines with the information we need will remain on the screen:

Parsing non-table data from websites

We clean the trash

It remains to clean all this from debris in any way that is suitable and convenient for you personally (there are many of them). For example, like this:

  1. Remove start tag by replacing with empty: via command Home – Replacing Values (Home — Replace values).
  2. Divide the resulting column by the first delimiter “>» left command Home — Split Column — By Delimiter (Home — Split column — By delimiter) and then divide the resulting column again by the first occurrence of the delimiter “<» on the left to separate the payload from the tags:

    Parsing non-table data from websites

  3. Delete extra columns, and in the rest, replace the standard HTML construction to normal quotes.

As a result, we get our data in a much more presentable form:

Parsing non-table data from websites

Parsing blocks into columns

If you look closely, the information about each individual product in the resulting list is grouped into blocks of three cells. Of course, it would be much more convenient for us to work with this table if these blocks were turned into separate columns: price, brand (publisher) and name.

You can perform such a transformation very easily – with the help of, literally, one line of code in the M language built into Power Query. To do this, click on the button fx in the formula bar (if you can’t see it, then enable it on the tab Review (View)) and enter the following construction:

= Table.FromRows(List.Split(#”Replaced value1″[Column1.2.1],3))

Here the function List.Split splits the column with the name Column1.2.1 from our table from the previous step #”Replaced value1″ into pieces by 3 cells and then a function Table.FromRows converts the resulting nested lists back into a table – already from three columns:

Parsing non-table data from websites

Well, then it’s a matter of technology – to set up the number formats of the columns, rename them and place them in the right order. And upload the resulting beauty back to the Excel sheet with the command Home — Close and download (Home — Close & Load…)

Parsing non-table data from websites

That’s all the tricks 🙂

  • Importing a bitcoin rate from a website using Power Query
  • Parsing text with regular expressions (RegExp) in Power Query
  • Parameterizing Data Paths in Power Query
Written by the authoradminWritten in10000

Leave a comment

Отменить ответ

Your email address will not be published. Required fields are marked *

Latest news

  • How to do multiplication in Excel
  • Finding the Last Occurrence (Inverted VLOOKUP)
  • Benefits of Pivot by Data Model
  • What to do if lattices are displayed instead of numbers in Excel
  • Problems with formulas in an Excel spreadsheet

Recent comments

There are no comments to view.

records

  • August 2022

Categories

  • 10000
  • 20000

mid-floridaair.com, Proudly powered by WordPress.

Leave a Reply