What do you do when something does nothing? This was the scenario I faced recently while working on a support ticket.
Race conditions are among the hardest bugs to detect. These occur when two or more threads or processes attempt to change shared data at the same time. There is no guaranteed order of execution in these parallel threads, hence the outcome can be unpredictable and tricky to reproduce.
I stumbled upon one of these recently while helping out an Altis customer with their WordPress rewrite rules. These rules are usually stored and updated in the WordPress database, as well as an external memory cache (such as Redis) for better performance.
The problem occurred when these rules had to be updated, and updating didn’t seem to work at all. This caused some links throughout the site to not work as expected.
I deduced that wp rewrite flush seemed to have no effect. It didn’t generate new rules, it didn’t delete old rules – it did absolutely nothing.
Upon further digging I discovered that the rewrite_rules option was missing from the database, but was present in Redis:
    # MySQL
    > SELECT * FROM wp_options WHERE option_name = 'rewrite_rules';
    Empty set (0.01 sec)

    # WP-CLI
    $ wp option get rewrite_rules
    # lots of rules
    $ wp cache get alloptions options
    # lots of stuff in rewrite_rules
This was quite an unusual case for sure, so obviously I wanted to know how this happened, but also what the consequences were, and why rewrite rules weren’t being flushed correctly.
How did this happen?
There are two possible scenarios I could think of.
- The customer logged into a database shell and deleted that row specifically
- A race condition
I didn’t completely rule out #1. Even the most experienced WordPress developers may overlook the fact that persistent object caching is on while making changes directly to the production database. Luckily it’s not as easy to do on Altis, but still doable.
However, the second scenario is far more interesting. I wasn’t able to pinpoint exactly where, when and how this happened, mostly because WordPress as a whole isn’t thread-safe. It is vulnerable to race conditions and so, inherently, are most of its derivatives. But I do have some ideas which I think are pretty close.
Consider the following code:
    delete_option( 'foo' );
    update_option( 'foo', microtime() );
The first line will delete the option from MySQL, then delete it from Redis. The next line will insert a new row in MySQL, and then set a value in Redis. As a result, both MySQL and Redis will have the same timestamp stored. Nothing too complicated or inconsistent.
Next, we’re going to launch the same code twice in parallel, but before we do that, let’s expand the operations involved in these two simple function calls:
    delete_option( 'foo' );
    $wpdb->get_row();
    $wpdb->delete();
    wp_cache_delete();

    update_option( 'foo', microtime() );
    $old_value = get_option();
    add_option();
    $wpdb->query();
    wp_cache_set();
Now we’ll fire off two of these in separate threads, at roughly the same time.
    thread 1: delete_option( 'foo' );
    thread 1: $wpdb->get_row();
    thread 1: $wpdb->delete();
    thread 1: wp_cache_delete();

    // So far so good, nothing in Redis, nothing in MySQL.

    thread 1: update_option( 'foo', microtime() );
    thread 1: $old_value = get_option();
    thread 1: add_option();
    thread 1: $wpdb->query(); // INSERT

    // Yeah we're good. Just inserted a row into MySQL, all
    // we have to do now is set in Redis and we're done.
    // But here comes the parallel thread.

    thread 2: delete_option( 'foo' );
    thread 2: $wpdb->get_row();
    thread 2: $wpdb->delete();
    thread 2: wp_cache_delete();

    // Ugh, okay, nothing in Redis, nothing in MySQL again.

    thread 1: wp_cache_set(); // Sure, why not?

    // Something in Redis, nothing in MySQL. But we
    // have an update option coming up from thread 2
    // which should resolve this, right?

    thread 2: update_option( 'foo', microtime() );
    thread 2: $old_value = get_option(); // FROM CACHE!

    // The get_option() call will give results from Redis
    // So update_option now thinks we have an option, and
    // will attempt to do an UPDATE instead of an INSERT:

    thread 2: $wpdb->update(); // FAILS!

    // The update fails because MySQL has no such row, and
    // update_option() simply returns false.
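To make the interleaving easier to poke at, here is a minimal sketch that replays the same sequence deterministically, with no real threads involved. It does not use WordPress at all: the two arrays and the sim_* helpers are hypothetical stand-ins for the options table, the object cache and the relevant core calls, and the values 't1' and 't2' stand in for each thread's microtime().

```php
<?php
// Deterministic replay of the two-thread interleaving. Plain arrays act as
// hypothetical stand-ins for the MySQL options table and the Redis cache.
$db    = []; // stand-in for wp_options
$cache = []; // stand-in for the object cache

// Stand-in for $wpdb->delete() followed by wp_cache_delete().
function sim_delete( array &$db, array &$cache, string $name ): void {
    unset( $db[ $name ] );
    unset( $cache[ $name ] );
}

// Stand-in for get_option(): the cache wins, the database is the fallback.
function sim_get( array $db, array $cache, string $name ) {
    return $cache[ $name ] ?? $db[ $name ] ?? false;
}

// Stand-in for $wpdb->update(): only succeeds when the row exists.
function sim_db_update( array &$db, string $name, string $value ): bool {
    if ( ! array_key_exists( $name, $db ) ) {
        return false; // zero rows matched
    }
    $db[ $name ] = $value;
    return true;
}

// Thread 1: delete_option(), then update_option() up to the INSERT.
sim_delete( $db, $cache, 'foo' );
$t1_old    = sim_get( $db, $cache, 'foo' ); // false, so the add_option() path
$db['foo'] = 't1';                          // INSERT

// Thread 2 cuts in: delete_option() removes the row thread 1 just inserted.
sim_delete( $db, $cache, 'foo' );

// Thread 1 resumes and finishes with wp_cache_set().
$cache['foo'] = 't1';

// Thread 2: update_option() reads the old value, which now comes from cache.
$t2_old = sim_get( $db, $cache, 'foo' ); // 't1'
if ( false === $t2_old ) {
    $db['foo'] = 't2';                   // would have been an INSERT
    $t2_result = true;
} else {
    $t2_result = sim_db_update( $db, 'foo', 't2' ); // UPDATE on a missing row
}

var_dump( $t2_result );          // bool(false)
var_dump( isset( $db['foo'] ) ); // bool(false)
var_dump( $cache['foo'] );       // string(2) "t1"
```

The final state matches the trace: the stand-in database has no row, while the stand-in cache still holds thread 1's value.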
Now we’re left in a state where Redis has a value for foo locked in time, which leads us to the consequences.
If you’re not too concerned about time, the consequences are fairly harmless:
    get_option( 'foo' ); // 0.35974300 1660217222
    update_option( 'foo', microtime() ); // false
    update_option( 'foo', 0 ); // false
    update_option( 'foo', 'please!' ); // false
    delete_option( 'foo' ); // false
    update_option( 'foo', 'and now?' ); // false
    get_option( 'foo' ); // 0.35974300 1660217222
In the context of rewrite_rules, however, this becomes a bit more severe: you’re unable to add, remove, flush or otherwise modify these rules. In the past, I’ve seen this happen to a plugin’s _version variable, causing upgrade processes to run on every request but never complete.
How likely is this to actually happen?
Not very likely under normal conditions. However, if you’re doing this on init on every single page load on a high-traffic site, then it is much more likely to happen. I was able to reliably reproduce similar behaviour locally with a simple load testing tool, running 1000 requests at a concurrency of 100.
Is this a WordPress problem?
Yes. There are plenty of open issues in core, some dating back 6 or more years. This is not an easy problem to fix at its root.
What can I do to avoid this?
Be mindful when using the Options API to write or delete data on high-traffic endpoints. Offloading such things to low-traffic endpoints or CLI commands is a great way to reduce the risk. Doing database operations directly and explicitly following up with object cache functions may also be an acceptable option in cases where you know the risk of a race condition is high.
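As a rough illustration of that last suggestion, the sketch below writes straight to the options table and then invalidates the related cache entries instead of setting them. The function name example_force_set_option is hypothetical, the code is meant to run inside WordPress (it relies on the global $wpdb, maybe_serialize() and wp_cache_delete()), and the 'yes' autoload value is an assumption that suits an autoloaded option such as rewrite_rules.

```php
<?php
/**
 * Hypothetical helper: write an option straight to the database, then drop
 * the cache entries that could mask it. Intended to run inside WordPress.
 */
function example_force_set_option( string $name, $value ): bool {
    global $wpdb;

    // Upsert, so the outcome does not depend on a cached "old value".
    // The 'yes' autoload flag is an assumption; adjust it for your option.
    $result = $wpdb->query(
        $wpdb->prepare(
            "INSERT INTO {$wpdb->options} (option_name, option_value, autoload)
             VALUES (%s, %s, %s)
             ON DUPLICATE KEY UPDATE option_value = VALUES(option_value)",
            $name,
            maybe_serialize( $value ),
            'yes'
        )
    );

    // Invalidate rather than set: the next read repopulates from MySQL.
    wp_cache_delete( $name, 'options' );
    wp_cache_delete( 'alloptions', 'options' );
    wp_cache_delete( 'notoptions', 'options' );

    // query() returns false on error; 0 affected rows is still a success.
    return false !== $result;
}
```

Keep in mind that this bypasses the filters and actions that update_option() normally fires, and it narrows the window for a race rather than closing it, so it is only worth considering where that trade-off is acceptable.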
How can I fix it?
The same way you fix all other problems on the internet: clear the cache.