Snipe.Net Geeky, sweary things.

Quick and Dirty PHP Caching


Caching your database-driven website pages has a plethora of benefits, not the least of which being improved speed and reduced server loads. This article will explain how to set up a simple caching system, and will also address when and where caching might not be appropriate.

For me, the impetus to switch to a caching method for one of my database-driven sites was sparked by Mosso, since they bill by CPU cycle, and I have one site that is, well, humongous (60k+ pages), and it happens to be the highest-traffic site on the account. While the database queries were all very efficient, and each page had, on average, no more than 6 queries, performance and CPU cycles would both be helped quite a lot by implementing a cache. This caching solution was a temporary fix, while we switched to a new CMS that was already using a robust caching system. It’s quick, it’s dirty, but it got the job done for the interim.

We’ll walk through how to execute a simple PHP cache, and then I’m going to explain how doing so without a little forethought will screw you right in the ear. Note that this is called a Quick and Dirty solution for a reason. There are more complex, more efficient methods available, but this covers some basics.

Using output buffering, caching pages is incredibly easy. Simply put, output buffering allows you to control when output is sent from the script. This is particularly handy if you’re using cookies or sessions or some other process that sends headers to the browser before the page loads (as anyone who has gotten those pesky “headers already sent” errors can tell you.)
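
To make that concrete, here is a minimal sketch of output buffering on its own, with no caching involved yet. The helper name capture_output() is made up for this example; ob_start() and ob_get_clean() are standard PHP functions.

[source=php]<?php
// Hypothetical helper: run $render and return whatever it echoed as a
// string instead of sending it to the browser.
function capture_output(callable $render): string
{
    ob_start();                     // start buffering; echo no longer reaches the browser
    $render();                      // anything echoed in here lands in the buffer
    return (string) ob_get_clean(); // hand back the buffer contents and stop buffering
}

$html = capture_output(function () {
    echo '<h1>Hello</h1>';
});

// $html now holds '<h1>Hello</h1>' and nothing has been sent yet,
// so we get to decide when (and in what form) it goes out.
echo strtoupper($html);[/source]

Once the page is sitting in a string like this, saving a copy of it to disk is only a couple more lines, which is all the caching in this article really is.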

Please note that this article assumes your cache files will be created in a directory called ‘cache’ – and that this cache directory must be writable by the webserver.

Please also note: the syntax highlighter was made of fail for this article and was double, sometimes triple converting HTML entities. I have fixed it a dozen times, and then every time I edit the post, I have to fix it all over again. So if you notice any funky characters that don’t look like they belong in the script snippets, they probably don’t. Let me know and I’ll fix them, yet *again*.

The basic stuff

In all its 6 lines of glory, this is actual, working caching code.

[source=php]// TOP of your script
ob_start(); // start the output buffer
$cachefile = "cache/cachefile.html";
// Your normal PHP script and HTML content here
// BOTTOM of your script
$fp = fopen($cachefile, 'w'); // open the cache file for writing
fwrite($fp, ob_get_contents()); // save the contents of output buffer to the file
fclose($fp); // close the file
ob_end_flush(); // Send the output to the browser[/source]

There are, of course, two major flaws with just using the script above. First, we’re always writing to the same cachefile.html file, which would only be useful to you if your website was only one page. And second, notice that the script writes to the cache, but never actually retrieves the cache file – it’s still running through the whole script every time. But, this is just the beginning. That’s all there is to the actual caching part – the rest of this article will deal with the when/where of caching, but the how is that right there.

Which brings us to the next step… adding the ability to check whether or not a cache file exists, and use that instead of running through the normal script. We’re going to keep using the one-page website model for now, but I’ll get into creating cache files for different pages later.

Checking for a cache file

Creating the cache file from database-driven content is easy, as we’ve seen – but it’s only useful if we actually check if a cache file exists and serve that instead of live database output. Using the modification below, we check to see if a cache file already exists and, if it does, include it and exit instead of running through the normal PHP script.

[source=php]// TOP of your script
ob_start(); // start the output buffer
$cachefile = 'cache/cachefile.html';
if (file_exists($cachefile)) {
// the page has been cached from an earlier request
include($cachefile); // include the cache file
exit; // exit the script, so that the rest isn’t executed
} [/source]

This is marginally more useful, since it actually prevents the script from executing if a cache file exists. However, the way this is currently written, it will serve that file indefinitely and never execute your full script again. Normally, in a cache situation, we want the ability to “expire” content after a certain time, so an updated version will be displayed and cached. You could force a new cache file to be generated by setting a cron job to delete your cache files every hour/day/week/whatever – or you could handle this on the script level.

Setting cache urls

In our examples, we’ve been using cache/cachefile.html as the filename for the cache file that is generated. As I mentioned, this is great if your site is only one page, but otherwise every page this script is run on will create the same cache file, so you’ll end up serving the same cached file as content for every page on your site. Not awesome.

The easiest way to create individual cache files for each specific page is to do something like this:

[source=php]$cachefile = 'cache/'.basename($_SERVER['SCRIPT_URI']);[/source]

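This takes the last part of the requested page’s url (the file name) and uses that as the cached filename. Note that SCRIPT_URI is only populated on some server setups (Apache with mod_rewrite enabled, for instance), so check that it actually exists on yours.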

But, there’s a gotcha. If your site has pages that take GET parameters, such as a search page, the SCRIPT_URI won’t include them as part of the url, so once someone does a search, all subsequent search requests will serve that same cached file unless you make the file name unique to each GET request.

In other words, if your search is located at yoursite.com/search.php, and when someone performs a search, the url looks something like yoursite.com/search.php?q=foo, PHP sees that url as search.php, regardless of the query string. So basically, it will break your search, big time.

NOTE: It may not be worth caching every GET request if your site doesn’t get a lot of traffic to files that use this. Or if disk space is a concern. Since there are a potentially unlimited number of GET strings that could be passed to your script (even bogus ones that don’t return valid results on your site), you may want to evaluate whether or not caching search pages is appropriate. In my case, it was – but it may not be for everyone. At the very least, if you opt to do this, make sure you’ve got some sanity checking in there so some asshole with a grudge can’t just sit there creating new, bogus query strings to eat up your disk space.
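
As a rough idea of what that sanity checking could look like, here is a minimal sketch that only allows caching when the query string is short and uses parameter names you expect. The function name is_cacheable_query(), the list of allowed keys and the length limit are all assumptions for illustration; pick values that match your own site.

[source=php]<?php
// Hypothetical helper: decide whether a query string is worth caching.
// $allowedKeys and $maxLength are assumptions; tune them for your own site.
function is_cacheable_query(string $queryString, array $allowedKeys, int $maxLength = 100): bool
{
    if ($queryString === '') {
        return true;                      // no GET data, nothing to check
    }
    if (strlen($queryString) > $maxLength) {
        return false;                     // suspiciously long, serve it live
    }
    parse_str($queryString, $params);     // turn "q=foo&page=2" into an array
    foreach (array_keys($params) as $key) {
        if (!in_array($key, $allowedKeys, true)) {
            return false;                 // unknown parameter, do not cache
        }
    }
    return true;
}

// Example: a search page that only understands "q" and "page"
$query = $_SERVER['QUERY_STRING'] ?? '';
$useCache = is_cacheable_query($query, ['q', 'page']);[/source]

This only filters on parameter names and overall length. It does nothing to stop someone cycling through endless values for an allowed key, so if disk space is a real worry you would still want some kind of cap on the number of cache files, which isn’t shown here.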

If you decide to cache query string data, you could do something like this:

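[source=php]$cachefile = 'cache/'.basename($_SERVER['SCRIPT_URI']);
if ($_SERVER['QUERY_STRING'] != '') {
    // swap the base64 characters that are unsafe in file names (+ and /)
    $cachefile .= '_'.strtr(base64_encode($_SERVER['QUERY_STRING']), '+/', '-_');
}[/source]

This basically just grabs the file name, checks to see if there is any GET data passed and, if there is, generates a base64-encoded string that you can use as part of your cache file name. Plain base64 output can contain + and / characters, which is why the snippet swaps them for - and _ to get the url-safe (and file-name-safe) variant.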

Setting an expiration

You have three basic options for expiring your cache:

  1. Set up a cron job to automatically delete all of your cache files at specified intervals
  2. Check the data source file for modification, and expire it if the source file is newer than the cache file
  3. Check the timestamp of the cache file and delete+regenerate if it is older than x

Cron Job: Setting up a cron job to delete your entire cache at specific intervals is arguably the easiest solution, but not really the most efficient, especially with very large websites. Rather than just deleting the page that’s been determined to be expired, you’re deleting (and then subsequently regenerating) a large number of files in one shot.
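
If you do go the cron route, the script it runs can be tiny. Below is a minimal sketch, assuming the cache lives in a flat directory of files; the function name purge_cache() and its $maxAge parameter are made up for this example. Passing a $maxAge softens the “delete everything in one shot” problem a little, since only files older than that many seconds are removed.

[source=php]<?php
// Hypothetical cron helper: delete every regular file in $dir whose
// modification time is at least $maxAge seconds old. A $maxAge of 0
// clears the whole cache. Returns the number of files removed.
function purge_cache(string $dir, int $maxAge = 0): int
{
    $removed = 0;
    $files = glob(rtrim($dir, '/') . '/*') ?: [];   // note: skips dotfiles
    foreach ($files as $file) {
        if (is_file($file) && (time() - filemtime($file)) >= $maxAge) {
            if (unlink($file)) {
                $removed++;
            }
        }
    }
    return $removed;
}

// In the script that cron calls, you would then do something like:
// purge_cache(__DIR__ . '/cache');         // wipe everything
// purge_cache(__DIR__ . '/cache', 3600);   // only files older than an hour[/source]

How often cron runs it, and where the script lives, is up to you and your host.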

Data Source: Checking the data source file for modification is potentially the smartest way to handle caching, since it means the cache would never be expired if the data didn’t change. That certainly makes sense to do, since a page that hasn’t been updated doesn’t need to be regenerated, so you’re really getting the most bang for your caching buck.

The problem arises when you’re caching pages that are dynamically generated based on database records. The actual script that generates the data may not have been changed for quite some time, but the data records you’re fetching from the database may have been changed, so just checking the cache file date against the date the script was last modified will not give you what you need.

A workaround there would be that you could do a quick db query at the top of every page to find out when the record was last modified and compare that to the modification time on the cache file, but that means that every page, even your cached pages, will be performing a database hit on every page load. This may be perfectly acceptable to you, but it’s something to consider. Perhaps a better way of handling this would be to modify the content management system by which you publish content, so that the cache file is only deleted when you publish edits. This method would be the most thorough and efficient way, since your cache file would only be updated when you update something, and would be left to be served statically unless the data has changed. Although that’s outside the scope of this quick and dirty article, extending the code below to accommodate that wouldn’t take much work.
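
Both ideas in that last paragraph boil down to a line or two each. Here is a minimal sketch, assuming you can get a “last modified” Unix timestamp for the record from your own database; the function names cache_is_fresh() and invalidate_cache() are hypothetical, and the database query itself is left out.

[source=php]<?php
// Hypothetical helper: the cache is fresh if it exists and was written
// at or after the moment the underlying record last changed.
// $lastModified is a Unix timestamp you would fetch from your own database.
function cache_is_fresh(string $cachefile, int $lastModified): bool
{
    return file_exists($cachefile) && filemtime($cachefile) >= $lastModified;
}

// Hypothetical hook for the CMS "publish" action: throw the cached copy
// away so the next request rebuilds it from live data.
// Returns true if the file is gone afterwards (or was never there).
function invalidate_cache(string $cachefile): bool
{
    return !file_exists($cachefile) || unlink($cachefile);
}[/source]

With the first function you still pay for one small query per page load. With the second, the public pages never touch the database to check freshness, because the CMS does the cleanup at publish time.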

Cache Timestamp: We’re going to address the third option, since it’s the most commonly used and would serve as the foundation for the second option anyway.

[source=php]// TOP of your script
$cachefile = 'cache/'.basename($_SERVER['SCRIPT_URI']);
$cachetime = 120 * 60; // 2 hours
// Serve from the cache if it is younger than $cachetime
if (file_exists($cachefile) && (time() - $cachetime < filemtime($cachefile))) {
    include($cachefile);
    // Tack on an HTML comment showing when this copy was cached
    echo "<!-- Cached ".date('jS F Y H:i', filemtime($cachefile))." -->";
    exit;
}
ob_start(); // start the output buffer[/source]

This script gets the file name, sets a cache time, checks to see if the cache file exists, and if it does, it checks if the cache file is younger than the cachetime. If the cache is still valid, it includes the file and exits the script. If not, it will continue on to execute the script and create a new cache file. It also tacks on an HTML comment at the very end of the cached output that tells you when the file was cached. This can be helpful in debugging, and helping you verify that the page you’re seeing is in fact a cached version, not a live version. (You can see this in action if you view the source of this page and look down at the very bottom of the source code.)

The script, the whole script and nothing but the script

Put all together, this is what our caching script looks like:

[source=php]// TOP of your script
$cachefile = 'cache/'.basename($_SERVER['SCRIPT_URI']);
$cachetime = 120 * 60; // 2 hours
// Serve from the cache if it is younger than $cachetime
if (file_exists($cachefile) && (time() - $cachetime < filemtime($cachefile))) {
    include($cachefile);
    // Tack on an HTML comment showing when this copy was cached
    echo "<!-- Cached ".date('jS F Y H:i', filemtime($cachefile))." -->";
    exit;
}
ob_start(); // start the output buffer
// Your normal PHP script and HTML content here
// BOTTOM of your script
$fp = fopen($cachefile, 'w'); // open the cache file for writing
fwrite($fp, ob_get_contents()); // save the contents of output buffer to the file
fclose($fp); // close the file
ob_end_flush(); // Send the output to the browser[/source]

Gee… Oh… Cache challenges

I know. Going to hell for that awful joke. Moving on…

Caching is a great way to speed things up on dynamic sites and save on server resources – however if your site has any kind of more advanced features, you need to be selective about where you apply it. The cache is not smart, so you have to be. Ideally, you’ll be building your caching system into the site as you develop the site and the content administration system – but if you end up having to add caching later, you really have to think everything through.

Examples of things that WILL break if you use caching unless you specifically work around them:

User login: “Welcome, user” logged-in functionality (the first user who logs in will create the cache, and everyone else logging in will see that user’s name instead of their own!)

Voting: If you have any kind of voting functionality built into your pages, new votes will not be captured and old ratings will be displayed

Anything requiring a POST request: Same as above. The first person submitting the form will get correct results, but anyone submitting it after them will get the first user’s cached results.

Geo-IP lookup: If you’re displaying geographically relevant information to the user based on their IP address, the same rules apply. The first user hitting your site will create the cache file and everyone else accessing it will see that first user’s geographic results instead of their own.

And so on…

That said, all hope is not lost. Depending on the situation and what functionality I’m trying to preserve, I usually handle this one of three ways:

Only serve cached files to users who are NOT logged in. This takes care of a lot of the issues right there – if a user has a profile preferences page, email preferences page, or whatever – all of these will be cached by the first user accessing them. The easy way around this is simply to serve live data to the user if they are logged in, cached pages if they are not. This will reduce the effectiveness of your caching system to some degree, but many users never bother logging in, so you’re still getting significant savings. (If 90% of your site’s content is only available to logged-in users, you may need to rethink your caching system though.)

Use AJAX. This is one of the few situations where AJAX really can be 100% appropriate. Since AJAX requests are made separately from the page load and can hit a script that isn’t behind your page cache, this is a great solution for your voting script situations. Mind you, you should make sure your solution degrades gracefully for users who have JavaScript turned off.

Only cache parts of your page instead of the whole thing. With a little more work, you can set up your caching system to only cache parts of your page, and not the entire page. This may reduce the effectiveness of the caching system, but may be necessary depending on your situation.
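
The first and third of those workarounds can be sketched in a few lines. This is a rough illustration, not a drop-in solution: the function names should_use_cache() and cached_fragment() are made up, and the 'user_id' session key is an assumption standing in for whatever your own login code sets.

[source=php]<?php
// Hypothetical check: only anonymous GET requests are served from the cache.
// The 'user_id' session key is an assumption; use whatever your login code sets.
function should_use_cache(string $method, array $session): bool
{
    return $method === 'GET' && empty($session['user_id']);
}

// Hypothetical fragment cache: return the cached HTML for one part of a
// page, regenerating it with $render when the file is missing or older
// than $ttl seconds.
function cached_fragment(string $cachefile, int $ttl, callable $render): string
{
    if (file_exists($cachefile) && (time() - $ttl < filemtime($cachefile))) {
        return (string) file_get_contents($cachefile); // still fresh, reuse it
    }
    ob_start();
    $render();                                         // build the fragment from live data
    $html = (string) ob_get_clean();
    file_put_contents($cachefile, $html, LOCK_EX);     // save it for the next request
    return $html;
}[/source]

In a page you would wrap the cache lookup at the top of the script in something like if (should_use_cache($_SERVER['REQUEST_METHOD'], $_SESSION)), and use cached_fragment() around the expensive, same-for-everyone bits (a sidebar of popular posts, say) while leaving the “Welcome, user” parts live.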

One final gotcha

You should consider a graceful way of handling database failures as well. Say you have your cache time set for 3 days – a long time by some standards, but not at all unreasonable if you have disk space to spare and your content doesn’t update that often. If your database throws an error when your cache file is being regenerated, that error will continue to be displayed for 3 days, even if the database error has been corrected. You should consider how to handle that gracefully, even if it’s a cheap and dirty method. For example, you could set up a website monitoring service that notifies you when your content has changed. If your page isn’t loading properly, you’ll be notified by text or email, and that will give you the opportunity to fix the error and manually blow out your cache so it can regenerate.
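
Another cheap option, in addition to monitoring, is to simply not write the cache file when you know the page was built badly. Here is a minimal sketch of that idea. The function name save_cache_unless_error() is hypothetical, and $hadError is a flag your own database code would have to set; PHP doesn’t provide it for you.

[source=php]<?php
// Hypothetical helper: write the buffered page to the cache only when the
// page was built without a database error. $hadError is a flag your own
// database code would set; it is not something PHP provides.
function save_cache_unless_error(string $cachefile, string $html, bool $hadError): bool
{
    if ($hadError || $html === '') {
        return false;   // leave any existing cache file untouched
    }
    return file_put_contents($cachefile, $html, LOCK_EX) !== false;
}[/source]

At the bottom of the script this would stand in for the fopen/fwrite/fclose lines: pass it $cachefile, ob_get_contents() and your error flag, then call ob_end_flush() as before. The visitor who triggered the failure still sees the error, but it doesn’t get frozen into the cache for everyone after them.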

A note for WordPress users

If you’re using WordPress and are looking for a way to reduce server load and speed your blog up, you’re in luck. WP Super Cache is an unparalleled caching solution for WordPress that is basically plug-and-play, no coding required.

Caching Libraries

Thanks to the fabulous comments to this article (and I genuinely do mean that), I am reminded to remind you that this method is exactly what it says it is – quick and dirty – and it makes NO attempt to be the best solution to your caching needs. It is as much an exercise in considering where caching is appropriate (and inappropriate) as it is anything else.

For more sophisticated (and certainly more elegant) solutions, check out PEAR’s Cache_Lite, XCache (lighttpd), eAccelerator and Zend_Cache, and read up on APC and memcached.

About the author

snipe

I'm a tech nerd from NY/CA now living in Lisbon, Portugal. I run Grokability, Inc, and run several open source projects, including Snipe-IT Asset Management. Tweet at me @snipeyhead, skeet me at @snipe.lol, or read more...

By snipe