
Improving performance: caching remote requests

Posted by Konstantin Kovshenin

Lead Platform Engineer


On the modern interconnected web, displaying content on a website often involves talking to third-party integrations. From open job positions pulled from hiring software to videos from digital asset management tools, data needs to flow back and forth using HTTP requests to these remote servers.

But remote requests are nobody’s favorite. They force our site to pay whatever time penalty the remote server needs to generate a response, and even if responses are mostly fast, occasionally they are slow.

Just like our own sites, these remote integrations also employ techniques to protect themselves from high traffic, including rate limiting, throttling and other peculiarities, which can cause engineers some headaches.

Isn’t this why we have transient caching?

I was woken up by an alarm last week due to some fatal errors. One of our applications was throwing an Exception in PHP, because the Google Sheets API was responding with a “rate limited” error.

It was a “recommended posts” section on a site, the IDs for which were populated from an external spreadsheet. While I’m not questioning the solution, the implementation (slightly simplified from the original) caught my eye:

function get_recommended_posts() {
    $ids = get_transient( 'recommended' );
    if ( false === $ids ) {
        $ids = get_ids_from_remote_spreadsheet();
        set_transient( 'recommended', $ids, 86400 );
    }

    // Do stuff with $ids
}

At first glance, there was nothing wrong with this approach: we check a transient, we read the spreadsheet, we cache the results for a day – makes total sense. Indeed, when working with remote requests, it is always advisable to cache the results to avoid repeating the same request over and over again. The WordPress Transients API is perfect for this exact use case.

However, looking deeper into the get_ids_from_remote_spreadsheet() custom function, I noticed that it could return false, or even throw an Exception. In both cases, our transient would never be set (or would be set to false), resulting in a fairly useless cache.

This failure caused the site to continue hammering the API while returning fatal errors to the end user. I can think of three solutions to this problem: a good one, a really good one, and a great one.

A good solution: cache the failure

Caching that failure, perhaps for a shorter period of time – a “debounce”, if you will – would allow external services to recover, rate limits to be reset, and would avoid the site going down because of an external service disruption.

$ids = get_transient( 'recommended' );
if ( false === $ids ) {
    try {
        $ids = get_ids_from_remote_spreadsheet();
        if ( empty( $ids ) ) {
            throw new Exception( 'Empty IDs' );
        }

        // Cache success.
        set_transient( 'recommended', $ids, 86400 );

    } catch ( Exception $e ) {

        // Cache failure, and flag the current request as failed too.
        $ids = 'error';
        set_transient( 'recommended', 'error', 600 );
    }
}

if ( $ids === 'error' ) {
    return 'No recommended posts.';
}

// Do stuff with $ids

The catch block accounts for our own Exception, thrown when the IDs are empty or false, as well as any exception thrown inside the get_ids_from_remote_spreadsheet() function. This approach also handles things like missing data (an empty sheet) or even a deleted spreadsheet.

A small improvement would be to display a hard-coded list of recommended posts on failure, or swap the entire block for a “latest” posts section instead.
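For instance, instead of returning a plain “No recommended posts.” string, the error branch could fall back to something more useful. This is a minimal sketch: the hard-coded IDs are placeholders, and get_posts() is just one way to build a “latest posts” stand-in.

if ( 'error' === $ids || empty( $ids ) ) {
    // Option 1: a hard-coded fallback list (placeholder IDs).
    $ids = [ 12, 34, 56 ];

    // Option 2: fall back to the latest published posts instead.
    // $ids = get_posts( [ 'numberposts' => 5, 'fields' => 'ids' ] );
}

// Do stuff with $ids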

A really good solution: serve stale on failure

If we can’t obtain freshly recommended posts from a spreadsheet right now, wouldn’t it be neat if we could serve the previous batch? This does require a bit more effort and a slightly different data structure.

Here’s an example of a really good solution:

$recommended = get_transient( 'recommended' );

if ( false === $recommended ) {
    $recommended = [
        'ids' => [],
        'expires' => 0,
    ];
}

// Fresh and valid.
if ( $recommended['expires'] > time() ) {
    return $recommended['ids'];
}

// Expired.
try {
    $ids = get_ids_from_remote_spreadsheet();
    if ( empty( $ids ) ) {
        throw new Exception( 'Empty IDs' );
    }

    // Cache success.
    $recommended['ids'] = $ids;
    $recommended['expires'] = time() + 86400;
    set_transient( 'recommended', $recommended );

} catch ( Exception $e ) {

    // Cache failure.
    $recommended['expires'] = time() + 600;
    set_transient( 'recommended', $recommended );
}

return $recommended['ids'];

We’re still relying on a transient here, but instead of asking WordPress for a timeout, we’re storing it without an expiration and managing the expiry ourselves, retaining our post IDs even during failure.

There is still a condition where this function would return an empty array (when the cache is flushed, for example), but in most other cases it will continue to return our recommended posts.

During failure this will not throw any exceptions that would end up in the error log, nor will it display any noticeable user-facing “error condition”, so it might be a good idea to log the failure as well. This would account for a case where, for example, the authentication credentials were rotated.
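A minimal way to do that, assuming PHP’s error_log() is an acceptable destination in your environment, is to record the exception message inside the existing catch block:

} catch ( Exception $e ) {

    // Log the failure so engineers can see it, even though users never do.
    error_log( 'Recommended posts refresh failed: ' . $e->getMessage() );

    // Cache failure.
    $recommended['expires'] = time() + 600;
    set_transient( 'recommended', $recommended );
}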

A great solution: revalidate async

Even with the above improvements over the original code, we’re still paying the third-party time penalty, at least once every 24 hours (don’t forget to use time constants!) under normal conditions, and more often than that during failures or cache flushes.
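Speaking of time constants, WordPress defines readable constants such as MINUTE_IN_SECONDS and DAY_IN_SECONDS that can replace the magic numbers used in the snippets above:

set_transient( 'recommended', $ids, DAY_IN_SECONDS );            // instead of 86400
set_transient( 'recommended', 'error', 10 * MINUTE_IN_SECONDS ); // instead of 600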

Also worth keeping in mind that WordPress transients are not a guarantee, especially in the world of persistent object caching like Altis Cloud. In such environments, transients are really just an alias for wp_cache_* functions. This means that if a cache key isn’t requested frequently enough, it may be evicted.
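In those environments the transient functions behave, roughly and simplified from WordPress core, like thin wrappers around the object cache:

// With a persistent object cache, transients never touch the database.
set_transient( 'recommended', $ids, DAY_IN_SECONDS );
// ... behaves roughly like:
wp_cache_set( 'recommended', $ids, 'transient', DAY_IN_SECONDS );

get_transient( 'recommended' );
// ... behaves roughly like:
wp_cache_get( 'recommended', 'transient' );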

A great solution to this problem is to do things asynchronously and persistently. The code for something like this would be a fair bit more complex than the two solutions above. Note that I used short option and function names for readability in these snippets, but you should always remember to prefix and/or namespace your PHP functions, WordPress actions, option and transient names.
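As an illustration, prefixed versions of the names used below might look like this, where the mysite prefix is purely a placeholder:

// Hypothetical prefixed names; pick something unique to your project.
add_action( 'mysite_update_recommended', 'mysite_update_recommended' );
wp_schedule_event( time(), 'daily', 'mysite_update_recommended' );
$ids = get_option( 'mysite_recommended', [] );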

The user-facing part will be significantly simpler:

$ids = get_option( 'recommended', [] );
// Do stuff with $ids

We’ll need to schedule our custom events and attach functions to these events:

add_action( 'update_recommended', 'update_recommended' );
add_action( 'update_recommended_retry', 'update_recommended' );

add_action( 'altis.migrate', function() {
    if ( ! wp_next_scheduled( 'update_recommended' ) ) {
        wp_schedule_event( time(), 'daily',
            'update_recommended' );
    }
} );

Finally, we’ll need to implement the function that runs async:

function update_recommended() {
    try {
        $ids = get_ids_from_remote_spreadsheet();
        if ( empty( $ids ) ) {
            throw new Exception( 'Empty IDs' );
        }

        // Cache success.
        update_option( 'recommended', $ids, false );

    } catch ( Exception $e ) {

        // Schedule retry on failure.
        wp_schedule_single_event( time() + 600,
            'update_recommended_retry' );
    }
}

Using this approach, the time penalty will never be paid by an end-user request. It will always run in a background process.

Even if the remote request ultimately times out (remember those cURL operation timeout 5002ms errors?), it will not impact user-facing requests, will not block PHP workers, and will help engineers to hate remote requests a little bit less.

I also used an option here instead of a transient. This will store our results in the database, as well as in the object cache, making our recommended posts less vulnerable to cache flushes or evictions. You may have noticed I set the $autoload argument to false for the option. That’s in line with what a transient would do, though having it autoload would make more sense if it were served on the majority of user-facing requests.

Another advantage of refreshing data asynchronously via a cron event is that it avoids a cache stampede when the value expires. We took this approach on a client site which included several pieces of information from external APIs (weather reports, school closing warnings, marketing tracking IDs, etc.). Moving them out of the page load not only removed the blocking requests for end users, but also reduced the heavy but short load spikes that their services saw each time a transient expired.

Final thoughts

I’m not suggesting you always use the “great” approach; its complexity may not fit every use case, time constraint, and so on. All four solutions are viable, including the original one. However, if you do stumble upon problems due to third-party request failures, you know you have plenty of options.

Our focus on performance never ends: we’re constantly working side by side with our customers to help find the right solutions for them. Find out how Altis could work for you by booking a demo below.