In the wake of the 2018 Cambridge Analytica scandal, Facebook took a number of steps to try to repair some of the reputational damage that had been done. One of the most publicised was the introduction of the ad library, a public database of the ads being run on Facebook at any one time.
What is the ad library?
The ad library was designed as a tool to offer greater transparency into the ads which are being served on Facebook. It lets you see, for a given page, all the ads that that page has run in the last 90 days. It shows you their start date, their creative assets, their copy, and even provides a unique ID for each ad.
The ad library was a good initial step by Facebook to regain user trust after the Cambridge Analytica and Russian ads fiasco, but it doesn’t go far enough. As mentioned above, the ad library only lets you see ads for a particular page.
If you want to monitor the ads which a specific page is running, then the ad library is perfectly fit for purpose. This is only a minority of cases though. Typically we don’t know exactly which pages we want to see ads for. We might want to see all ads that reference a certain term, but won’t know in advance which pages these ads are being run from.
If we want to see all of the ads which mention ISIS, or which use a picture of Donald Trump, the ad library is more or less useless unless we already know which pages those ads are being run from.
This is also a particular shame for advertisers who want to use the ad library for competitor research. If you were launching a new perfume, you might want to inform your creative strategy by viewing all ads which contain the word “perfume”. As the ad library currently stands, however, there’s no way to achieve this.
Facebook can’t claim that this isn’t possible to implement. I say this as Facebook has a separate version of the ad library just for political ads. You can easily search for political ads which reference a particular term, and see just how much has been spent on these.
If Facebook is able to build a searchable ad library for political ads, it’s hard to believe that it can’t do the same for regular, non-political ads.
The only objection I can see here is one of scale. Political ads are just a fraction of the ads that are served on Facebook. The argument would be that Facebook can build a searchable database of political ads, sure, but that building a searchable database of all ads would be too computationally expensive.
It’s true that it would be more challenging to build a database of every ad currently active on Facebook, but it’s naïve to think that this is a reason not to do it. When you consider the revenue that Facebook makes from each ad that it serves, it’s hard to imagine that the cost of such a database would come anywhere close to Facebook’s ad revenue.
So, if Facebook won’t give us a searchable ad database, how can we do it ourselves?
Scraping the ad library
Scraping is the process of extracting and storing information that’s displayed on a website. If you go to your local weather site and note down today’s weather conditions, you’ve just (in a very manual sense) carried out an act of scraping.
Of course, this is a very rudimentary example of scraping, in large part because it’s so manual. When people talk about scraping, they are typically referring to the use of bots to collect information. With only basic programming knowledge, it’s easy to build bots which can scrape huge troves of online information and store it in databases.
Given that scraping is such an effective way to store data that’s available online, it shouldn’t come as a surprise that it could offer a way to rebuild the database behind the ad library, one page at a time.
Theoretically speaking, if you could scrape the ad library for every page that was advertising on Facebook, and store this information in a database, you’d have access to every ad on Facebook. You’d then be able to search these ads by keyword, finally letting you surface all ads that contain a particular phrase.
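To make the idea concrete, here is a minimal sketch of what such a keyword search could look like, using Python’s built-in sqlite3 module. The table layout (page_name, ad_id, copy), the sample rows, and the build_ad_db and search_ads helpers are all made up for illustration; none of this reflects how Facebook stores its ads.

```python
import sqlite3


def build_ad_db(rows):
    """Create an in-memory database holding (page_name, ad_id, copy) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE ads (page_name TEXT, ad_id TEXT, copy TEXT)")
    conn.executemany("INSERT INTO ads VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search_ads(conn, keyword):
    """Return (page_name, ad_id) for every ad whose copy contains keyword."""
    # LIKE is case-insensitive for ASCII text in SQLite by default.
    pattern = f"%{keyword}%"
    cursor = conn.execute(
        "SELECT page_name, ad_id FROM ads WHERE copy LIKE ? ORDER BY ad_id",
        (pattern,),
    )
    return cursor.fetchall()


# Hypothetical sample data standing in for scraped ads.
sample_ads = [
    ("Scent House", "1001", "Our new perfume is here"),
    ("Bike Barn", "1002", "Cycling gear for every rider"),
    ("Maison Fleur", "1003", "A PERFUME inspired by spring"),
]

conn = build_ad_db(sample_ads)
matches = search_ads(conn, "perfume")
print(matches)
```

A plain LIKE query is the simplest option that works; for a database of any real size, a full-text index would probably be a better fit.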
Sounds cool, how do we do it?
Sadly, scraping the ad library for every page on Facebook is easier said than done.
One of the many problems you’d face if you tried to do this comes from the fact that Facebook isn’t going to give you a list of every page available on the platform. In fact, aside from manually writing down page after page, it takes a bit of work to be able to find pages in the first place.
One way that you could find a list of Facebook pages is, drumroll please… scraping. If you’re a regular Facebook user you might be aware that you can search for pages relevant to a particular phrase. By scraping the results page of these searches, you could build up a list of Facebook pages, whose ad libraries you could then go on to scrape.
How do you scrape pages?
Let’s say you go onto Facebook and search for pages related to the term cycling. You do this by typing cycling into the search bar at the top of the page, and clicking on ‘pages’ to restrict your results to pages.
The URL you’re taken to should look something like:
By scraping this page, and storing a list of each page (and its URL), you’re able to build up a database of pages which are relevant to cycling.
Sadly, scraping a page like the one mentioned above isn’t the simplest of tasks. The first reason is that you have to be logged in to Facebook to actually see this page. If you just fire a GET request to this URL (a technical way of saying: Hey Facebook, tell me what’s on this page!) you’ll come back empty-handed.
To be able to get the content of this page, we first need to log into Facebook. There are a couple of different ways we can do this, but my method of choice is to use a combination of Python, a scripting language, and Selenium, a browser automation library.
We’ll first navigate to the Facebook homepage. The following code creates an automated browser and navigates to the Facebook homepage URL.
from selenium import webdriver

# Launch an automated Chrome browser and open the Facebook homepage.
searchDriver = webdriver.Chrome()
searchDriver.get('https://facebook.com/')
Once we’ve navigated there, we should (given that we’re not logged in) be presented with email and password fields which we need to complete in order to log in. We can fill in these fields as follows:
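from selenium.webdriver.common.by import By

# Fill in the email and password fields on the login form.
usernameBox = searchDriver.find_element(By.NAME, 'email')
usernameBox.send_keys("your Facebook email")
passwordBox = searchDriver.find_element(By.NAME, 'pass')
passwordBox.send_keys("your Facebook password")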
This will fill your Facebook details into the two boxes. To log in, we need to find the Facebook login button and click it. We can tell Selenium to do this by writing:
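from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By

try:
    loginBox = searchDriver.find_element(By.ID, 'loginbutton')
except NoSuchElementException:
    loginBox = searchDriver.find_element(By.NAME, 'login')
loginBox.click()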
The reason that we have to use the try/except block is that Facebook often runs A/B tests on its homepage, and in each variant it calls the login button something different. The block above lets us handle both cases.
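If Facebook adds a third or fourth variant, the same fallback idea can be generalised. The sketch below is one possible way to do it, not something taken from Selenium itself: find_first is a hypothetical helper that tries a list of lookups in order, and the not_found parameter is an assumption standing in for whichever exception your lookups raise when an element is missing (with Selenium that would be NoSuchElementException). The two stand-in lookups exist only so the example runs without a browser.

```python
def find_first(lookups, not_found=Exception):
    """Try each zero-argument lookup in order and return the first result.

    `lookups` is a list of callables; `not_found` is the exception type that
    signals "element missing" (an assumption: with Selenium this would be
    NoSuchElementException).
    """
    last_error = None
    for lookup in lookups:
        try:
            return lookup()
        except not_found as error:
            last_error = error
    if last_error is None:
        raise ValueError("no lookups were supplied")
    # Every lookup failed, so surface the most recent failure.
    raise last_error


# Stand-in lookups so the sketch runs without a browser.
def missing_button():
    raise LookupError("loginbutton not on this variant")


def present_button():
    return "login-element"


found = find_first([missing_button, present_button], not_found=LookupError)
print(found)
```

In a real script, each lookup would be a small lambda wrapping one of the element searches shown above, so supporting a new homepage variant means adding one more entry to the list.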
Okay, we’re logged in. Now let’s head to the search results page for the term cycling:
This should open up the search results page that we were on earlier. At this point, we can download the HTML of the page by calling:
html = searchDriver.page_source
print(html)
Now that we’ve printed the HTML of the page, we can start writing functions which crawl through it and store names and URLs for each of the pages which appear in the results. I won’t cover this as it’s not super interesting, but if you’re stuck then I’d recommend looking at the “Extracting the required elements” section of this article.
Once you’ve written something that can crawl through the HTML and pull out all the page IDs or URLs within, you should save this data as a CSV.
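For readers who want a starting point anyway, here is a rough sketch of that extract-and-save step using only Python’s standard library. It assumes, purely for illustration, that each result is a plain link whose text is the page name; Facebook’s real markup is more complicated and changes often, so the PageLinkParser class, the write_pages_csv helper, and the sample HTML are all hypothetical.

```python
import csv
import io
from html.parser import HTMLParser


class PageLinkParser(HTMLParser):
    """Collect (name, url) pairs from every <a href=...> tag in a document."""

    def __init__(self):
        super().__init__()
        self.pages = []
        self._current_url = None
        self._current_text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._current_url = href
                self._current_text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._current_url is not None:
            self._current_text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_url is not None:
            name = "".join(self._current_text).strip()
            if name:
                self.pages.append((name, self._current_url))
            self._current_url = None


def write_pages_csv(pages, out_file):
    """Write a header row followed by one row per (name, url) pair."""
    writer = csv.writer(out_file)
    writer.writerow(["name", "url"])
    writer.writerows(pages)


# Made-up markup standing in for the real search results HTML.
sample_html = """
<div><a href="https://www.facebook.com/examplecyclingclub/">Example Cycling Club</a></div>
<div><a href="https://www.facebook.com/examplebikeshop/">Example Bike Shop</a></div>
"""

parser = PageLinkParser()
parser.feed(sample_html)

buffer = io.StringIO()
write_pages_csv(parser.pages, buffer)
print(buffer.getvalue())
```

The example writes to an in-memory buffer so that it runs anywhere. To save to disk instead, pass a file opened with open("pages.csv", "w", newline="") in place of the buffer.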
There are potentially a couple of problems with the HTML we’ve just printed though. One is that it only contains data for the pages which Facebook has decided to place at the top of the results page. You might have noticed that if you scroll down the results page, Facebook will load more results. So, how can you scrape these results using Selenium?
You can actually write a function to simulate a user scrolling down the results page. Facebook will provide more search results each time you hit the bottom of the page, and so in this way you can keep increasing the number of pages that you’ll be able to scrape. The function that you’ll want is this:
import time

def scrollDown(driver):
    # Get the initial scroll height.
    lastHeight = driver.execute_script("return document.body.scrollHeight")
    while True:
        # Scroll down to the bottom.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Wait for the page to load more results.
        time.sleep(2)
        # Calculate the new scroll height and compare it with the last one.
        newHeight = driver.execute_script("return document.body.scrollHeight")
        # If the page hasn't grown (i.e. we've reached the end), stop.
        if newHeight == lastHeight:
            break
        lastHeight = newHeight
With the function above, we’re able to keep scrolling through the results page, generating more and more results. Be warned though that this function may take several minutes to complete, as it can take a second or so for Facebook to load each new batch of results when the browser scrolls.
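Because a long results page can keep the browser scrolling for minutes, one variation worth considering is to cap the number of scrolls. The sketch below is my own adaptation rather than part of the original function: scroll_until_stable, its max_scrolls and pause parameters, and the FakeDriver class are all hypothetical names. FakeDriver imitates only the execute_script call so that the logic can be tried out without launching a browser.

```python
import time


def scroll_until_stable(driver, max_scrolls=50, pause=2.0):
    """Scroll to the bottom until the height stops growing or the cap is hit.

    Returns the number of scrolls that loaded new content.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    loaded = 0
    for _ in range(max_scrolls):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more results
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            break  # nothing new loaded, so we have reached the end
        last_height = new_height
        loaded += 1
    return loaded


class FakeDriver:
    """Stand-in for a Selenium driver: the page grows on each scroll, up to a limit."""

    def __init__(self, heights):
        self._heights = heights
        self._index = 0

    def execute_script(self, script):
        if script.startswith("window.scrollTo"):
            # Move to the next height, stopping at the last one.
            self._index = min(self._index + 1, len(self._heights) - 1)
            return None
        return self._heights[self._index]


driver = FakeDriver([1000, 2000, 3000])
print(scroll_until_stable(driver, pause=0))
```

With a real Selenium driver you would keep the default pause, or lengthen it on a slow connection, and choose max_scrolls based on how many results you actually need.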
This solves one problem with trying to scrape search results pages. The other problem is simply that a lot of the pages which appear in search results can be quite spammy.
Fortunately you can filter for pages which have been verified by Facebook (and so are less likely to be spammy) by appending the following filter to your original search URL:
Your search results will now all be from verified pages, and so more likely to have served ads that you’d want to scrape.
Let’s take stock
Above I’ve run through what the Facebook ad library is, why it’s potentially powerful, but also why it’s ineffective in its current form. By scraping individual pages though, you can access the underlying data behind the ad library and build a searchable database from it.
But in order to scrape the ad library for a selection of pages, you need to first figure out which pages you want to scrape. We’ve looked at a way you can in fact scrape pages from Facebook’s search feature, using Python and Selenium.
In the next article, we’re going to look at how you can use the page data that you’ve scraped to start scraping ads from those pages.