webscraping ads.txt banner

I worked most of my career as an actuary at various insurance companies. Most of the time they had systems that were built before I was born, so I spent a good portion of time getting and transforming (see what I just did?) data into a format ready for analysis.

Transforming data with formulas or VBA can be very time consuming, and there is a lot of potential to unknowingly screw it up.

Then I used Power Query and the game totally changed. Now suddenly things that would take hours to create with formulas and VBA take minutes with power query and it’s all intuitive point and click for the most part.

If you’ve never used power query before (now called Get & Transform in Excel 2016), it absolutely needs to be in your arsenal. You can read an introductory power query guide here if you’re not familiar with it yet.

I work in ad tech now, but I’m still finding useful places for power query in my job.

In this post, we are going to take a look at how I used power query to scrape ads.txt information from multiple websites.

What is ads.txt?

Ok, what the heck is ads.txt? It’s an initiative to decrease fraud in online advertising.

Ads.txt is a text file that website owners can put on the root domain of their website to list the companies that are authorized to sell advertising on that website. Buyers can then use this to check the validity of the advertising inventory they purchase.

ads.txt file for washington post

For example, if you visit washingtonpost.com/ads.txt, you’ll see a text file like above. It lists the domain name of the
advertising system, the publisher’s account ID, the type of account, the certification authority ID, and has comments indicated with a hash (#) sign.

Table of Site URL’s

site list

I’ve created a list of all the sites I’m going to gather ads.txt information from and created a table named SiteList. They are all in the same format example.com and the ads.txt will be found at example.com/ads.txt.

We will come back to this table later.

Create a From Web Query for One Site

The first step is to create a query to pull the ads.txt from one of the sites on my list. Let’s start with the washingtonpost.com.

Go to the Data tab and select the From Web command in the Get & Transform Data section.

This will open the From Web dialog box and we can copy and paste in washingtonpost.com/ads.txt into the URL input box and press the OK button.

We don’t need to worry about the https://www part of the URL, power query will figure all that stuff out.

Preview of ads.txt

You should now see a preview of the ads.txt data and we can proceed to the power query editor with the Edit button.

Turn the Query into a Function

Now we are in the query editor with our first ads.txt query, we want to turn it into a function.

Advanced Editor

Go to the Home tab and open the Advanced Editor.

let
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents("washingtonpost.com/ads.txt"), null, null, 1252)})
in
    Source

You should see something like this in the advanced editor. This is the M code for a query that will return the ads.txt from the washingtonpost.com. If we switched washingtonpost.com for nypost.com then it would return the ads.txt for nypost.com.

Instead of having a hardcode site URL in the query, we could turn it into a parameter so we can pass the query any site’s URL. To do this we are going to need to edit the M code a bit.

let GetAdsTXT=(URL) =>
let
    Source = Table.FromColumns({Lines.FromBinary(Web.Contents(URL & "/ads.txt"), null, null, 1252)})
in
    Source
in GetAdsTXT

Change the code like above. Notice the Web.Contents function now references a parameter called URL which we then concatenate /ads.txt onto.

Press the Done button in the advanced editor to keep the changes we made.

Enter Parameter

We should now see the Enter Parameter input box like above in the query editor. We can leave this blank.

Name the query fGetAdsTXT, this is how we will call the query later on. We can now Close & Load the query to save it.

Query Table of Site URL’s

Now we are ready to pull our list of sites into power query. Make sure the list is in an Excel Table. I’ve named the table SiteList.

With the active cell inside the SiteList table, go to the Data tab and use the From Table/Range command in the Get & Transform Data section.

Add a Custom Column with our fGetAdsTXT Function

Custom Column

Now we can add in a column, go to the Add Column tab in the query editor and press the Custom Column button.

=fGetAdsTXT([Site])

Now enter the formula =fGetAdsTXT([Site]) and press the OK button. Remember, this was the query function we created earlier that took a URL and returned the ads.txt information.

Data Privacy

After creating this function, you might see a warning about data privacy. When you click the Continue button, you will be prompted to set the privacy levels for each site.

Privacy Levels

To avoid individually setting these, you can go to Query Options found in the Data tab under the Get Data command. Go to the Current Workbook Privacy settings and select the Ignore the Privacy Levels and potentially improve performance option. We should now not see any privacy warnings.

Expand Tables

We can now expand out the Ads TXT column which contains tables of the ads.txt for each site. Click on the expand icon in the column heading and choose Expand then press the OK button.

Cleaning the Resulting Ads Data

We now have all the ads.txt information for all the sites in our list, but the information will need a bit of cleaning. There are comments in the file which are proceeded by # symbols and the other information is comma delimited.

Split Column

First, we can Split the data out based on the # character to get all the comments into a separate column. Right-click on the ads.txt column and select Split Column then By Delimiter. Now select a Custom delimiter and enter the # mark and press the Ok button.

Split Column by #

Next, we can Split the remaining data based on a comma delimiter. Right-click on the middle column and select Split Column then By Delimiter. Now select a Comma delimiter press the Ok button.

Since the data is actually separated by a comma and space, we will also trim each column. Select all the columns and right-click on any of the column heading and choose Transform then Trim.

We’ll also rename the new columns as Domain, Account ID, Account Type and Certification ID respectively. Double left-click on a column heading to rename the column.

Cleaned Table

Now we can Close & Load the query to a table. We’re done now and have our ads text data in a nice clean table.

We’ve managed to scrape dozens of websites in a matter of minutes. This is the power of “power query”. They should really change the name back!

Did you find this article easier to understand than the average “tech” article?

We like to think that is no accident.  We’re different.  First of a new breed – the kind who can speak tech, biz, and human all at the same time.

Want this kind of readily-absorbable, human-oriented Power BI instruction for your team? Hire us for a private training at your facility, OR attend one of our public workshops!