Crawling Optimizely World using Chrome and C#

Needed to synchronize my OMVP Webperf Leaderboard with the current OMVP list and decided to automate it a little bit.

A while back I set up a site showing Webperf ratings for blogs of Optimizely MVPs.

I recently wanted to synchronize the URLs with the current list of OMVPs and found out that there wasn't a report available where I could get just the home URL of each member's blog.

Initially, I wrongly assumed that I needed to browse Optimizely World with JavaScript enabled to find the members and their latest blog posts, but it turned out that I could have solved this task with just a C# HttpClient, fetching the HTML responses from the server.

Crawling in that manner would've been a lot faster than the setup I'm about to describe, but with bare requests like that, the risk of getting flagged as synthetic traffic or caught in bot protection layers is higher.
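For comparison, the plain HttpClient route could look roughly like the sketch below. It is not the code from my solution: the class and method names are made up for illustration, and it simply downloads the server-rendered HTML without running any JavaScript.

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

public static class PlainFetch
{
    // Hypothetical helper: returns the raw HTML the server sends for pageUrl.
    // No JavaScript runs and no cookies are kept between calls by default.
    public static async Task<string> GetHtmlAsync(HttpClient client, Uri pageUrl)
    {
        using HttpResponseMessage response = await client.GetAsync(pageUrl);

        // Throws HttpRequestException on 4xx/5xx, for example when a bot filter blocks the request.
        response.EnsureSuccessStatusCode();

        return await response.Content.ReadAsStringAsync();
    }
}
```

A call would be something like `string html = await PlainFetch.GetHtmlAsync(new HttpClient(), new Uri("https://world.optimizely.com/"));`, after which you still need an HTML parser or some string work to pick out the links.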

When acting through a real browser (I chose Chrome), you browse with a typical user-agent string and you store cookies. This, among other things, makes your requests look more human and legit.

I've been using the NuGet package Selenium WebDriver ChromeDriver for quite a while in various QA setups. Normally I add it to an xUnit test project and let it "do stuff", but for this I put my code in a Spectre.Console project.

A couple of things I usually implement in these solutions are this way of matching the ChromeDriver NuGet package with the Chrome version, and this handy method of waiting for a page to load before interacting further.

using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OpenQA.Selenium.Support.UI;

namespace OmvpCrawler;

public static class WebDriverTasks
{
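    // Blocks until the browser reports document.readyState == "complete".
    // WebDriverWait throws a WebDriverTimeoutException if that takes more than 60 seconds.
    public static void WaitForReadyState(ChromeDriver webDriver)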
    {
        new WebDriverWait(webDriver, TimeSpan.FromSeconds(60))
          .Until(d => ((IJavaScriptExecutor)d)
            .ExecuteScript("return document.readyState;")
            .Equals("complete"));
    }
}

I then looked for an easy way of selecting each person's profile page link, let the driver navigate there, and selected the most recent blog post's A-element in the list of blog posts.
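The crawl loop described above could be sketched like this. Treat it as an illustration rather than the exact code in the repository: the class name, method name and both CSS selector parameters are hypothetical, and the real selectors depend on the current markup of Optimizely World. It reuses the WaitForReadyState helper shown earlier.

```csharp
using System.Collections.Generic;
using System.Linq;
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OmvpCrawler;

public static class OmvpLinks
{
    // Hypothetical helper: visits every profile linked from listUrl and
    // returns the href of the first (most recent) blog post link on each profile.
    public static List<string> FindLatestPostUrls(
        ChromeDriver driver,
        string listUrl,
        string profileLinkSelector,   // assumption: matches the A-elements pointing to profile pages
        string blogPostLinkSelector)  // assumption: matches blog post A-elements, newest first
    {
        driver.Navigate().GoToUrl(listUrl);
        WebDriverTasks.WaitForReadyState(driver);

        // Copy the hrefs before navigating: element references go stale once the page changes.
        var profileUrls = driver.FindElements(By.CssSelector(profileLinkSelector))
            .Select(a => a.GetAttribute("href"))
            .OfType<string>()               // drops null hrefs
            .Where(href => href.Length > 0)
            .Distinct()
            .ToList();

        var latestPosts = new List<string>();
        foreach (var profileUrl in profileUrls)
        {
            driver.Navigate().GoToUrl(profileUrl);
            WebDriverTasks.WaitForReadyState(driver);

            // FindElements returns an empty collection (instead of throwing) when nothing matches.
            var latest = driver.FindElements(By.CssSelector(blogPostLinkSelector)).FirstOrDefault();
            var href = latest?.GetAttribute("href");
            if (!string.IsNullOrWhiteSpace(href))
            {
                latestPosts.Add(href);
            }
        }

        return latestPosts;
    }
}
```

Collecting the hrefs as plain strings first is the important design choice here, since navigating away invalidates the elements found on the list page.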

After that, it was just a little Uri work to skip blogs hosted on World and to get the root URL of each blog post found.
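That Uri work might look something like the following. This is a minimal sketch and not necessarily how Program.cs does it: the class name, method name and the assumption that World-hosted blogs all live on the host world.optimizely.com are mine.

```csharp
using System;

public static class BlogUrls
{
    // Assumption: blogs hosted on World are served from this host.
    private const string WorldHost = "world.optimizely.com";

    // Returns the root URL (scheme, host and any non-default port) of an external blog post,
    // or null when href is not an absolute http(s) URL or points at World itself.
    public static string? GetExternalBlogRoot(string? href)
    {
        if (!Uri.TryCreate(href, UriKind.Absolute, out Uri? uri))
        {
            return null;
        }

        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
        {
            return null;
        }

        if (uri.Host.Equals(WorldHost, StringComparison.OrdinalIgnoreCase))
        {
            return null;
        }

        // GetLeftPart(Authority) drops the path, query string and fragment.
        return uri.GetLeftPart(UriPartial.Authority) + "/";
    }
}
```

With this, a post URL such as `https://blog.example.com/2024/post?x=1` is reduced to `https://blog.example.com/`, while anything under World comes back as null and can be skipped.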

Solution repository on GitHub

If you follow along in Program.cs I've added comments for:

  • Hide the browser window.
  • Visit local URLs with dodgy and insecure SSL certs.
  • Execute custom JS in the browser console.
  • Save an HTML file of the entire DOM after JS has completed.
  • Find A-elements by CSS selector and navigate to each HREF-value.
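As a rough idea of what the first four points involve, here is a condensed sketch. It is not a copy of Program.cs; the local URL, the output file name and the script being executed are placeholders I picked for the example, and it relies on the WaitForReadyState helper from above.

```csharp
using System;
using System.IO;
using OpenQA.Selenium.Chrome;
using OmvpCrawler;

var options = new ChromeOptions();

// Hide the browser window. Assumption: a recent Chrome; older versions use plain "--headless".
options.AddArgument("--headless=new");

// Allow pages served with self-signed or otherwise invalid certificates.
options.AcceptInsecureCertificates = true;

using var driver = new ChromeDriver(options);

// Hypothetical local site with a dodgy certificate.
driver.Navigate().GoToUrl("https://localhost:5001/");
WebDriverTasks.WaitForReadyState(driver);

// Run custom JavaScript in the page and read the result back into C#.
object? linkCount = driver.ExecuteScript("return document.querySelectorAll('a').length;");
Console.WriteLine($"Links on page: {linkCount}");

// PageSource is the browser's current serialization of the page, saved here after scripts have run.
File.WriteAllText("page.html", driver.PageSource);
```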

Just a very basic example of how to get started. You can do pretty much anything using Selenium and you can snap a screenshot of the viewport at any time.
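Snapping that screenshot takes only a couple of lines. A minimal sketch, where the start URL and the file name are just examples:

```csharp
using OpenQA.Selenium;
using OpenQA.Selenium.Chrome;
using OmvpCrawler;

using var driver = new ChromeDriver();
driver.Navigate().GoToUrl("https://world.optimizely.com/");
WebDriverTasks.WaitForReadyState(driver);

// Captures what is currently visible in the viewport and writes it as a PNG.
Screenshot shot = driver.GetScreenshot();
shot.SaveAsFile("viewport.png");
```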

When running as a GitHub Action it's easy to run a project like this with an OS matrix to see that something works in Chrome on Ubuntu, Windows and macOS.

With little effort, you can also switch to Firefox instead of Chrome.
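A sketch of what that switch might look like, assuming Firefox is installed and a matching geckodriver is available to Selenium. Since the WaitForReadyState helper above takes a ChromeDriver, the wait is written inline here against the general IWebDriver interface instead.

```csharp
using System;
using OpenQA.Selenium;
using OpenQA.Selenium.Firefox;
using OpenQA.Selenium.Support.UI;

using var driver = new FirefoxDriver();
driver.Navigate().GoToUrl("https://world.optimizely.com/");

// Same readyState check as before, but not tied to ChromeDriver.
new WebDriverWait(driver, TimeSpan.FromSeconds(60))
    .Until(d => "complete".Equals(
        ((IJavaScriptExecutor)d).ExecuteScript("return document.readyState;")));

Console.WriteLine(driver.Title);
```

Changing the helper's parameter type from ChromeDriver to IWebDriver would let the same method serve both browsers.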


Published and tagged with these categories: Optimizely