Fixing The Facebook Ad Library (Part 2): Recreating The Ad Library
In the previous article we looked at what the Facebook ad library is, and where it falls short. While it’s a good first attempt at providing greater transparency on the ads that run on Facebook, the inability to search it by anything but a particular page makes it close to useless.
As we touched on though, there are ways around these deficiencies. One of these is scraping. By scraping the ad library for individual pages, or at least as much of it as it’s possible to scrape, we can build our own ads database. As this will give us access to all the underlying data, we’ll be able to search this to find ads that contain a particular term, or whatever other kind of search we can think of.
Back into the code
If you followed part I, by this point you should have access to a list of Facebook page IDs whose ad library page you want to scrape. If you’ve got this saved in a CSV you can open this up with:
```python
import csv

with open(filename, 'r', encoding='utf-8') as f:
    # Each CSV row comes back as a list; take the first column as the page ID
    pages = [row[0] for row in csv.reader(f)]
```
Having loaded a list of page IDs into pages, you can then call the following to loop through the pages:
```python
from selenium import webdriver

for pageID in pages:
    # Convert pageID into a string if it isn't already
    if type(pageID) == int:
        pageID = str(pageID)

    url = ("https://www.facebook.com/ads/library/"
           "?active_status=all&ad_type=all&country=ALL"
           "&impression_search_field=has_impressions_lifetime"
           "&view_all_page_id=" + pageID)

    driver = webdriver.Chrome()
    driver.get(url)

    # Call the scroll_down function from Part I, to load all ads in the ad library
    scroll_down(driver)

    # Retrieve the HTML of the fully loaded ad library page
    html = driver.execute_script("return document.getElementsByTagName('html')[0].innerHTML")
```
This will get you the HTML of the ad library page for each pageID that you scraped in part I. At this point you’ll need to write another function which can crawl through the HTML and pick up on all the fields which you’ll want to store.
Some of the most important fields to capture are:
- The ad’s primary text, which is the main piece of copy that appears above the ad.
- The ad’s asset, i.e. the image or videos that appear inside the ad. You’ll be able to find a URL for where these are stored on Facebook’s CDN.
- Other secondary pieces of copy, such as the ad headline, the call to action and the ad description.
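As a sketch of what such a crawler might look like, here is one way to pull those fields out with BeautifulSoup. The selectors below are placeholders: Facebook's markup is obfuscated and changes regularly, so you'll need to inspect the live page and substitute the class names you actually find there.

```python
from bs4 import BeautifulSoup

def parse_ads(html):
    """Extract the key fields from a loaded ad library page.

    The CSS selectors here are hypothetical stand-ins for whatever
    class names Facebook is currently using.
    """
    soup = BeautifulSoup(html, "html.parser")
    ads = []
    for card in soup.select("div.ad-card"):            # hypothetical selector
        primary = card.select_one("div.primary-text")  # hypothetical selector
        asset = card.select_one("img, video")
        headline = card.select_one("div.headline")     # hypothetical selector
        ads.append({
            "primary_text": primary.get_text(strip=True) if primary else None,
            "asset_url": asset.get("src") if asset else None,
            "headline": headline.get_text(strip=True) if headline else None,
        })
    return ads
```

Returning a list of dictionaries keeps each ad's fields together, which makes it easy to write the results out to a CSV later on.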
Before moving on, it’s worth noting a few things about how to handle ad assets.
If you’ve followed the above, and have written something capable of scraping the URL for each asset, you might have noticed something interesting about these URLs.
The URLs point to Facebook’s CDN, or content delivery network: a domain solely responsible for serving asset requests for Facebook’s services. The URLs all contain a signed timestamp, essentially a tamper-proof way of recording when the URL was issued.
Facebook’s CDN is set up to only allow access to a URL for a certain amount of time after the timestamp contained within it. This prevents long-term access to files stored on the CDN.
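As a rough illustration: in the CDN URLs I've seen, the expiry appears as a hex-encoded Unix timestamp in the oe query parameter. This behaviour isn't documented anywhere, so treat the following as a heuristic sketch rather than a supported API.

```python
import time
from urllib.parse import urlparse, parse_qs

def cdn_url_expired(url):
    """Best-effort check of whether a scraped fbcdn URL has expired.

    Assumes the 'oe' query parameter is a hex-encoded Unix timestamp;
    this is an undocumented observation, not a guarantee.
    """
    params = parse_qs(urlparse(url).query)
    oe = params.get("oe")
    if not oe:
        return False  # no expiry parameter found; assume still accessible
    return int(oe[0], 16) < time.time()
```

A check like this is handy if there's any delay between scraping a batch of URLs and downloading the files behind them.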
This is obviously a problem if we want to store the assets that we’ve scraped. We can’t just store the URL that we’ve scraped, as this will eventually expire. To get around this, we need to download the assets shortly after we’ve scraped them.
The easiest way to do this in Python is to first download them by calling:
```python
import requests, shutil

URL = "..."              # the Facebook CDN URL that you've scraped
filename = "image.jpg"   # the name of the file you want to save your image as

response = requests.get(URL, stream=True)
with open(filename, 'wb') as out_file:
    shutil.copyfileobj(response.raw, out_file)
```
The above will save the asset whose URL you’ve scraped to whatever file name you specified. At this point you have a choice of what to do with this file. You can either store it locally, or you can store it somewhere online and generate a new URL for it.
I personally think the latter is a better option, and would recommend using Google Cloud Storage for hosting these images.
Uploading to Google Cloud Storage
To get started on Google Cloud Storage, you’ll need to set up a Google Cloud developer account here. Once your account is created, create a project, and follow the steps here for generating a service account key. Finally, head to the Cloud Storage console and create a bucket, e.g. scraping-bucket (bucket names must be all lowercase).
Once you’ve downloaded the key, make a note of its path and use it to access Google Cloud from Python:
```python
from google.cloud import storage

keyPath = 'C:/project-secret.json'  # path to your downloaded key
bucketName = "scraping-bucket"

storage_client = storage.Client.from_service_account_json(keyPath)
bucket = storage_client.get_bucket(bucketName)
```
(If you’re having trouble with the Google Cloud library, you’ll want to run pip install google-cloud-storage.)
You’re now able to access the bucket you just created. To interact with it, we need to create what’s called a blob. What exactly a blob is, is slightly technical. Think of it as something that helps you get files from your computer onto Google Cloud Storage.
To create a blob, we call:
```python
blob = bucket.blob(filename)
```
Where filename is the path of the asset file that we saved earlier. To upload your asset file using this blob, you can call:
```python
blob.upload_from_filename(filename)
blobURL = blob.public_url
```
Where we’ve saved the URL to access this file as blobURL. If you’re not able to see the image that you just uploaded by visiting this URL, I’d recommend checking this page, specifically the section titled Making groups of objects publicly readable.
You can then save blobURL in place of the Facebook CDN URL that you had earlier, as the latter will eventually expire.
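One wrinkle worth handling: every fresh scrape of the same ad yields a newly signed CDN URL, so if you name your uploaded files after the full URL you'll end up storing duplicates. A small helper (a sketch of my own, not part of the pipeline above) can derive a stable name from the URL's path, which doesn't change between scrapes:

```python
import hashlib
import posixpath
from urllib.parse import urlparse

def blob_name_for(cdn_url):
    """Derive a stable blob name from a CDN URL, ignoring the query
    string (which carries the expiring signature and changes on
    every scrape)."""
    path = urlparse(cdn_url).path
    ext = posixpath.splitext(path)[1] or ".jpg"  # fall back to .jpg if no extension
    digest = hashlib.sha1(path.encode("utf-8")).hexdigest()[:16]
    return digest + ext
```

Passing the result to bucket.blob() means re-scraping an ad simply overwrites the same object instead of creating a new one.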
Now that you’ve scraped the various different parts of the ad that you want to store, and you’ve saved the asset on Google Cloud Storage for permanent viewing, you can start to think about how you want to store your database.
Storing the data
If you’re looking to store ad library data from anything less than a couple hundred pages, simple database solutions, like even just storing the data in a CSV, will work perfectly well. If you want to really scale the amount of data you’re storing, then you’ll have to look at more complex solutions.
One solution that I think works well for this, and which I’ve been using myself, is to store the ad data in another Google Cloud product: BigQuery.
BigQuery is an extremely fast and scalable cloud database solution. It allows you to upload and query vast amounts of data at relatively low costs.
For example, a terabyte of data stored on BigQuery would cost $20/month. Prices scale linearly, meaning that 100 gigabytes would be $2. You even get the first 10 gigabytes free, meaning that you can play around with databases of a decent size, before you even have to think about paying anything.
How do you get data into BigQuery?
Thanks to BigQuery’s Python libraries, getting data in is relatively straightforward. You’ll want to make sure you have the BigQuery library, by calling pip install google-cloud-bigquery at the command line.
Once you have this installed, you’ll want to save your scraped data into a CSV, ready to be uploaded to BigQuery. After doing this, you’ll want to head to the BigQuery console and, with the right project selected, create a dataset and table.
A table is a collection of rows with a shared header, and is analogous to a CSV file. A dataset is a collection of tables within a project.
I’d recommend creating your table with an empty schema (the column names can come from the CSV that you upload in a second), but you’re free to change all other settings as you see fit. Once you’ve created both a dataset and a table, you’re ready to upload your CSV from earlier into that table:
```python
from google.cloud import bigquery

# Initialise a BigQuery client
client = bigquery.Client()

# Specify the dataset and table names
dataset_id = "your_dataset"  # the name of your dataset
table_id = "your_table"      # the name of your table

# Tell the BigQuery client the names of our dataset + table
dataset_ref = client.dataset(dataset_id)
table_ref = dataset_ref.table(table_id)

# Create a BigQuery job capable of uploading a CSV
job_config = bigquery.LoadJobConfig()
job_config.source_format = bigquery.SourceFormat.CSV

# Skip the header row, but use it to detect column names
job_config.skip_leading_rows = 1
job_config.autodetect = True

# Open the CSV saved earlier, and execute the job/upload the CSV to BigQuery
filename = "ads.csv"  # the name of the CSV you're uploading, or its path if not in the working directory
with open(filename, "rb") as source_file:
    job = client.load_table_from_file(source_file, table_ref, job_config=job_config)
job.result()  # wait for the upload to finish
```
This should successfully upload your CSV into the BigQuery table you created earlier. You should see that the table now has column headers based on the headers that exist in your CSV.
Now that your data is in BigQuery, you have a variety of options in front of you. You can query the data in the BigQuery console, or you can keep using the Python libraries to write your queries in Python. Either way, having your data in BigQuery will mean that you can search through it extremely quickly, and in a way that will scale as you add more rows.
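For example, a search for ads mentioning a particular keyword might look like the following. The project, dataset, table and column names are all hypothetical, so substitute the names from your own upload.

```python
def keyword_query(table, keyword):
    """Build a case-insensitive substring search over the primary_text
    column. The table and column names are placeholders -- match them
    to the schema of the CSV you uploaded."""
    return (
        "SELECT page_id, primary_text, asset_url "
        f"FROM `{table}` "
        f"WHERE LOWER(primary_text) LIKE '%{keyword.lower()}%'"
    )

# Usage with the BigQuery client from earlier:
# from google.cloud import bigquery
# client = bigquery.Client()
# sql = keyword_query("my_project.ads_dataset.ads_table", "dog")
# for row in client.query(sql).result():
#     print(row.page_id, row.primary_text)
```

Building the SQL in a small helper like this makes it easy to reuse the same search across the console and your Python scripts.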
Where does this leave us?
If you’ve gotten this far, you now know everything that you need to in order to start building your own version of the Ad Library. The difference though is that you’ll be able to search your version based on all of the fields that you’ve scraped.
You don’t just have to search the Ad Library by page, which is what Facebook’s version forces you to do. Instead, you can search it by much more useful fields. You can search it for ads whose text contains a certain keyword or phrase, or for ads that use a specific call to action.
In fact, you can go much further than this. You can augment the data that you’ve scraped by adding additional fields which derive from it. Some examples would be writing code which can transcribe the audio from video assets, or a machine learning model which classifies the images that feature in ads.
If you can generate these fields, you can search for ads based on these fields. With the machine learning example, this means you could search for all ads which contain an image of a dog.
In the next (and final) article, I look at how you can build some of these fields and add them to your data.