Adding reverse-proxy caching to PHP applications

Note: This is a cross-post of documentation I am writing about Lazy Sessions.

Why use reverse-proxy caching?

For most public-facing web applications, the large majority of traffic comes from anonymous, unauthenticated users. Even with internal data caches and other good optimizations, generating a page takes a large amount of PHP execution, even when the content of that page will be the same for many users. Code and query optimization are very important to improving the experience for all users of a web application, but even the most basic “Hello World” script will top out at about 3k requests/second due to the overhead of Apache and PHP, and many real applications top out at less than 200 requests/second. Varnish, a lightweight proxy server that can run on the same host as the web server, can cache pages in memory and serve them at rates of more than 10k requests/second with thousands of concurrent connections.

While the point of web-applications is to have content be dynamic and easily changeable, for most applications and most of the anonymous users, receiving content that is slightly stale (cached for 5 minutes or something similar) isn’t a big deal. Sure, visitors to your blog might not see the latest post for a few minutes, but they will get their response in 4 milliseconds rather than 2 seconds.

Should your site get posted on Slashdot, a caching reverse-proxy server will give anonymous visitor #2 and up the same page from cache (until expiration), while authenticated users continue to have their requests passed through to the Apache/PHP back-end. Everyone wins.

Caveats

Before we get into how to set this up, you should be aware of a few caveats (in addition to increased complexity) that come with this scheme.

1. Stale Content

Ideally, pages would always be served from the cache for as long as they don’t change, and the application would expire pages when they are changed on the back-end. Varnish has an API that supports this behavior, and the Drupal Varnish module is being developed to do this dynamic cache-clearing for Drupal sites, but overall, dynamic cache clearing is much more difficult to set up than time-based cache expiration.

When using time-based cache expiration, the challenge is to balance the need for content freshness (shorter cache lifetimes) against the efficiency of cache hits (longer cache lifetimes will result in more clients using the cached versions). For content that doesn’t need to be up-to-the-minute fresh, a cache lifetime of around 5 minutes might be a good starting point. If the content only changes daily at a certain time, a fixed expiration time (shortly after the data sync) might be appropriate.
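
As a rough sketch of the fixed-expiration idea, the max-age can be computed as the number of seconds until the next daily expiry. The function name seconds_until_daily_expiry() and the 03:15 UTC expiry time are assumptions made up for this example, not part of Lazy Sessions:

```php
<?php
// Hypothetical helper: seconds from $now until the next daily expiry (UTC).
// $expiryTime is an assumed "HH:MM" string chosen to fall shortly after the data sync.
function seconds_until_daily_expiry(int $now, string $expiryTime): int
{
    $current = (new DateTimeImmutable('@' . $now))->setTimezone(new DateTimeZone('UTC'));
    [$hour, $minute] = array_map('intval', explode(':', $expiryTime));

    // Today's expiry moment; if it has already passed, use tomorrow's.
    $next = $current->setTime($hour, $minute, 0);
    if ($next <= $current) {
        $next = $next->modify('+1 day');
    }
    return $next->getTimestamp() - $now;
}

// Example use: let caches keep the page until 03:15 UTC.
$maxAge = seconds_until_daily_expiry(time(), '03:15');
if (!headers_sent()) {
    header('Cache-Control: public, max-age=' . $maxAge, true);
}
```

With this approach every cached copy expires at the same moment, so expect a short burst of back-end requests right after the expiry time.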

2. Cookie Use

If your application only uses the cookie set by PHP’s session_start() function, then lazy_sessions.php should work transparently without modification of either that include file or your application (other than including the file). If your application sets other cookies, then responses that carry them will not be cached by the reverse-proxy unless you specifically exclude those cookies in the reverse-proxy server’s configuration.

3. Data Caching in the $_SESSION

If you use the $_SESSION array as a data cache on anonymous requests, then these anonymous requests will be given a session cookie and their requests won’t be served from the reverse-proxy’s cache. Rather than using the $_SESSION array for non-user-specific data, cache such data with APC or memcached. This also has the advantage of such non-user-specific data not having to be rebuilt for every new client.
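
A minimal sketch of moving such data out of $_SESSION, assuming the APCu extension (the successor to APC’s user cache) is available. The helper name cached_value() and the cache key are invented for this example; if APCu is not loaded, the helper simply rebuilds the value each time:

```php
<?php
// Hypothetical helper: fetch a shared, non-user-specific value from APCu,
// building and storing it on a cache miss. Nothing here touches $_SESSION,
// so anonymous requests stay cookie-free.
function cached_value(string $key, int $ttl, callable $build)
{
    if (function_exists('apcu_fetch')) {
        $value = apcu_fetch($key, $hit);
        if ($hit) {
            return $value;
        }
        $value = $build();
        apcu_store($key, $value, $ttl);
        return $value;
    }
    // No APCu available: fall back to rebuilding the value on every request.
    return $build();
}

// Example use: a navigation menu shared by all visitors, cached for 5 minutes.
$menu = cached_value('site_menu', 300, function () {
    return ['Home', 'About', 'Contact'];
});
```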

4. flush() and output buffering

The default PHP session handling mechanism adds the session cookie to the response headers right when session_start() is called and writes the data off to the file-system after the script exits and the data has been sent. This default behavior ensures that users will always get a session cookie and saves the session data as the final processing step after all class destructors have been called.

Since we don’t want to always set a session cookie, we need to remove the Set-Cookie header before headers are sent to the client. Output buffering with ob_start() will ensure that we have a chance to decide to clear the Set-Cookie header at script shutdown.

In some cases (such as incrementally sending large binary files) we want to send the content body (and therefore also the headers) before the script exits using the flush() function. To ensure that the session cookie is properly removed, session_write_close() must be called before flush() or any other code that causes headers to be sent.

Implementation

Implementing reverse-proxy caching has three steps: PHP changes to enable lazy sessions, PHP changes to set cache-controlling headers, and finally the reverse-proxy server setup. For this example I’ll use the Varnish reverse-proxy server, but others could be used instead.

1. PHP: Lazy Sessions

The first thing that needs to happen to make anonymous requests cache-able in an application that uses sessions is to ensure that sessions are only started when there is session data to be stored. By default, PHP’s session handling mechanisms add a session cookie to the response header and store a session data file on the server on every page load that calls session_start(). While this behavior makes it easy to write applications that use sessions, it effectively means that there is no way to differentiate between responses that are for a particular user and those that could be for many users.

Including the lazy_sessions.php file before session_start() is called will override the default session-handling mechanism with one that checks to see if there is any data in the $_SESSION array before sending the user a Set-Cookie header and storing a session file:

<?php

// Include files or other pre-session_start code

require_once('lazy_sessions/lazy_sessions.php');
session_start();

// The rest of the application code.
?>

If your application needs to flush content and thereby send headers before script shutdown (such as incrementally sending file data), call session_write_close() if session_start() has been called for that script:

<?php

// Include files or other pre-session_start code

require_once('lazy_sessions/lazy_sessions.php');
session_start();

// other application code.

// If session_write_close() is not called before flushing, then the Set-Cookie
// header will be sent before our custom session handler has a chance to determine
// if a session is even needed.
session_write_close();


print "Hello";
flush();
print " World.";
flush();

?>

2. PHP: Cache-Control headers

Now that we have our cookies straightened out, we need to ensure that our PHP scripts respond with HTTP headers that indicate that downstream clients such as our reverse-proxy and the user’s browser are allowed to cache anonymous pages. There are a number of different Cache-Controlling Headers that may affect whether a particular cache may store a given response. By default, once a session is started, PHP sets all of these headers to indicate that no caches may store any pages, ensuring that they are dynamic.

<?php

// If the session data is empty, then we can assume that there is no per-user data
// and that the response can be cached. empty() also covers the case where
// $_SESSION was never set.
if (empty($_SESSION)) {

    // Alternatively, we could check an application-specific value (such as a user-id)
    // to determine if the response is for a particular user.
    // if (!isset($_SESSION['user_id'])) {

    // Cache for 5 minutes
    $maxAge = 300;

    header('Expires: '.gmdate('D, d M Y H:i:s', time() + $maxAge).' GMT', true);
    header('Cache-Control: public, max-age='.$maxAge, true);
    // Blank out the "Pragma: no-cache" header that PHP's session handling sets.
    header('Pragma: ', true);
}

// Always tell caches which request headers the response depends on.
header('Vary: Cookie,Accept-Encoding', true);

The two most important headers with regard to caching with Varnish are the following:

The Cache-Control header

The Cache-Control: public, max-age=300 header indicates to any clients (such as the Varnish caching proxy) that this response can be cached in public caches valid for many downstream clients. The max-age portion of the header indicates that the cache may store this response for 300 seconds.

As I understand it (possibly wrong), Varnish uses the max-age (or s-maxage) portion of the Cache-Control header when determining how long to store a response. When one of these is present, it ignores the Expires header for its cache-expiration purposes, though this header is still passed on to downstream clients.
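
To make the precedence concrete, here is a minimal sketch of how a shared cache might pick a lifetime, assuming the usual HTTP caching rule that s-maxage wins over max-age, which in turn wins over Expires. The function name shared_cache_ttl() is made up for illustration, and this is not Varnish’s actual code:

```php
<?php
// Simplified lifetime calculation for a shared cache.
// $headers is an assumed array of response headers keyed by header name.
// Returns the lifetime in seconds, or null if no explicit lifetime is given.
function shared_cache_ttl(array $headers, int $now): ?int
{
    $cacheControl = $headers['Cache-Control'] ?? '';

    // Prefer s-maxage (shared caches only), then max-age.
    foreach (['s-maxage', 'max-age'] as $directive) {
        if (preg_match('/(?:^|[\s,])' . $directive . '=(\d+)/i', $cacheControl, $m)) {
            return (int) $m[1];
        }
    }

    // Fall back to Expires only when Cache-Control gave no lifetime.
    if (isset($headers['Expires'])) {
        $expires = strtotime($headers['Expires']);
        if ($expires !== false) {
            return max(0, $expires - $now);
        }
    }
    return null;
}
```

For the headers sent by the PHP snippet above, the sketch yields 300 seconds from max-age and never consults Expires.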

The Vary header

The Vary: Cookie,Accept-Encoding header tells Varnish (and in-browser caches) that the response depends on the Cookie and Accept-Encoding request headers. A cached response should only be reused for a request whose values for those headers match the request that originally produced it, so a response cached for a request without cookies should not be sent to a request that includes a cookie (or a different cookie). Similarly, if one client says that it accepts gzip encoding via an Accept-Encoding: gzip request header, then the cached response may be compressed with gzip and should not be sent in response to requests from clients that do not state that they accept gzip encoding.

While Varnish’s behavior is to never cache or respond from cache when cookies are present, without the Vary: Cookie response header, browsers or other downstream caches may respond with a cached response valid for only anonymous users even though a cookie is now present.
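
One way to picture what Vary asks of a cache is as an extra component of the cache key. The sketch below is a simplified illustration under that assumption; vary_cache_key() is a hypothetical name, and real caches normalize header values more carefully:

```php
<?php
// Build a cache key from the URL plus the request headers named in Vary.
// Two requests may share a cached response only if their keys are equal.
function vary_cache_key(string $url, string $vary, array $requestHeaders): string
{
    // Header names are case-insensitive, so compare them in lower case.
    $normalized = array_change_key_case($requestHeaders, CASE_LOWER);
    $parts = [$url];

    foreach (explode(',', $vary) as $name) {
        $name = strtolower(trim($name));
        if ($name === '') {
            continue;
        }
        // A missing header counts as an empty value.
        $parts[] = $name . '=' . ($normalized[$name] ?? '');
    }
    return implode('|', $parts);
}

$vary = 'Cookie,Accept-Encoding';
$anonymous = vary_cache_key('/blog', $vary, ['Accept-Encoding' => 'gzip']);
$loggedIn  = vary_cache_key('/blog', $vary, ['Accept-Encoding' => 'gzip', 'Cookie' => 'PHPSESSID=abc']);
// $anonymous and $loggedIn differ, so the anonymous page is never reused
// for the request that carries a session cookie.
```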

See my notes on Cache-Controlling Headers for more details about other headers and how they affect the Varnish cache and in-browser caches.

3. Varnish (Reverse-Proxy) Configuration

The /etc/varnish/default.vcl config file controls how Varnish handles requests and responses, in particular whether or not it should cache them. Below are the contents of my default.vcl file.

Notes:

  1. The backend portion is the default; you will probably want to modify this to point at your correct backend hosts and ports.
  2. The vcl_recv and vcl_hash sections come directly from the Pressflow wiki and are set up to allow requests that include Google Analytics cookies to be cached while not caching requests that include other cookies.
  3. The vcl_fetch section is the default with my addition of the lines to unset empty Set-Cookie headers that can’t be removed from within PHP < 5.3.

backend default {
    .host = "127.0.0.1";
    .port = "80";
}

sub vcl_recv {
    // Remove has_js and Google Analytics __* cookies.
    set req.http.Cookie = regsuball(req.http.Cookie, "(^|;\s*)(__[a-z]+|has_js)=[^;]*", "");
    // Remove a ";" prefix, if present.
    set req.http.Cookie = regsub(req.http.Cookie, "^;\s*", "");
    // Remove empty cookies.
    if (req.http.Cookie ~ "^\s*$") {
        unset req.http.Cookie;
    }

    // Cache all requests by default, overriding the
    // standard Varnish behavior.
    // if (req.request == "GET" || req.request == "HEAD") {
    //   return (lookup);
    // }
}

sub vcl_hash {
    if (req.http.Cookie) {
        set req.hash += req.http.Cookie;
    }
}

sub vcl_fetch {
    if (!beresp.cacheable) {
        return (pass);
    }

    // If using PHP < 5.3 there is no way to fully delete headers, so empty
    // Set-Cookie headers may be in the response. Ignore these empty headers.
    if (beresp.http.Set-Cookie ~ "^\s*$") {
        unset beresp.http.Set-Cookie;
    }

    if (beresp.http.Set-Cookie) {
        return (pass);
    }
    return (deliver);
}
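
To check which cookies survive the vcl_recv rules before deploying them, the same regular expressions can be tried out in PHP. This is only a sketch that mirrors the cookie handling above; strip_ignored_cookies() is a hypothetical helper, and it assumes PCRE behaves like Varnish’s regex engine for these patterns:

```php
<?php
// Mirror of the vcl_recv cookie handling: drop has_js and __* cookies,
// then return null when nothing is left (the equivalent of "unset").
function strip_ignored_cookies(?string $cookieHeader): ?string
{
    if ($cookieHeader === null) {
        return null;
    }
    // Same pattern as the regsuball() call: remove has_js and __* cookies.
    $cookie = preg_replace('/(^|;\s*)(__[a-z]+|has_js)=[^;]*/', '', $cookieHeader);
    // Same pattern as the regsub() call: remove a leading ";".
    $cookie = preg_replace('/^;\s*/', '', $cookie);
    // An empty header is treated as no Cookie header at all.
    return preg_match('/^\s*$/', $cookie) ? null : $cookie;
}

$onlyAnalytics = strip_ignored_cookies('__utma=1; has_js=1');      // null: request is cacheable
$withSession   = strip_ignored_cookies('__utma=1; PHPSESSID=abc'); // "PHPSESSID=abc": passed to the back-end
```

Note that the pattern only matches names starting with two underscores, so a cookie such as _ga would be left in place and would prevent caching unless the expression is extended.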
