When I talk to people about their start-ups, most engineers fall into one of two camps: “We’re too small (or broke) to worry about scale yet” or “We have eleventy-billion servers, now we just need users.” I tend to see the latter in well-seeded startups, or those that have already secured angel investors, where cost isn’t as much of an issue. But what do you do if you’re the former?
Perhaps surprisingly, this is less of a technical post and more of a philosophical one, where I hope to help you get into the mindset of setting yourself up for success if and when you do need to scale. That means neither prematurely optimizing to the point where your code becomes overly complex before you have your first thousand users, nor blowing all of your seed funding on servers and infrastructure.
I’m not a fan of overly-complicated stacks, but careful planning can help you scale more easily while not breaking the bank and without turning your codebase into something from the Necronomicon.
Back in December, my startup had the honor of being actively tweeted about by Amanda Palmer (with over one million Twitter followers) and shortly after by her husband Neil Gaiman (who has over two million followers).
We were running everything on a single t1.micro instance at AWS, using the smallest RDS instance available (because un-funded start-up, that’s why).
It happened out of the blue on an otherwise boring Friday afternoon, and while it was incredibly exciting, we weathered that storm remarkably well because of the steps we had taken beforehand to ensure we would be able to scale quickly if needed. We had a load balancer already in place (although it was only balancing one server; we only had 800 or so users before this happened), but we hadn’t set up an autoscaler yet, so we did have a service interruption of about 15 minutes while I finished shitting my pants, scrambled to spin up a few more servers, and configured the autoscaler.
What we started with was lean enough to still be on the free tier with AWS. We were effectively running the tech in our startup for free, and only needed to throw money at it when the need arose.
We had received an AWS startup grant which gave us more than enough AWS credits to be able to build out a really complex stack, but we decided not to. Why did we do that? Because eventually the AWS credits would run out (they only last a year), and if we weren’t yet making money when they did, we’d now have a more expensive stack to have to pay for out of pocket.
The heavy traffic lasted about a week and a half straight, and at the end of the month, our server bill was well under $100 (or it would have been, without the AWS credits). Once the rush was over, the autoscaler did its job and brought us back down to one server.
[tweetable alt=”‘Build for what you need now, but do it in such a way that scaling up (and down) is easier’.”]Build for what you need now, but do it in such a way that scaling up (and down) is easier.[/tweetable]
No Puppy Servers
Whenever possible, you want cow servers and not puppy servers. What are cow servers and puppy servers? Joshua McKenty, co-founder of Piston Cloud, is quoted as saying:
“Piston Enterprise OpenStack is a system for managing your servers like cattle — you number them, and when they get sick and you have to shoot them in the head, the herd can keep moving. It takes a family of three to care for a single puppy, but a few cowboys can drive tens of thousands of cows over great distances, all while drinking whiskey.”
While I don’t like the analogy (because cows are actually pretty cool and do have their own personalities), it serves as a useful (if tacky) metaphor.
A “cow” server is one that is not special in any way: it holds no unique data, files, or configurations that the next cow server doesn’t also have. It can be deprovisioned or replaced without notice. The world will keep turning, and your app will keep churning.
That sounds pretty simple, but there are lots of things you need to consider when you’re building out your system. Do your users upload files? No? Not even profile photos? Ohhhh, right. You forgot about those. If those files are being served from the same server the app sits on, that turns it into a puppy server.
If that server gets hoarked, if it gets terminated (with or without notice), all of those user uploads are gone. If you need to scale up and add more servers, you’ve now got the problem that user files only exist on whichever server the user uploaded them to, which is going to result in a boat-load of broken images. If you need to scale down, you’ll be taking those user files with you when you terminate that instance.
Same thing if you’re running your database on the same box as your web server.
Or if your deployment solution requires IP addresses or boxes to stay consistent for all time.
Or if sessions are stored in the filesystem.
Or if your mission-critical cron jobs run on one particular server.
All of these scenarios mean terminating a server or adding new ones will cause very real problems for you and very likely your users. And as we all know, when you’re a start-up, every single user counts.
[tweetable alt=”While your tech stack, business needs & budget will vary, spend the time to think through your infrastructure up front”]While your tech stack, business needs, and budget will vary, spend the time to think through your infrastructure up front[/tweetable]. It’s not sexy (and it can be difficult to convey the value to non-technical stakeholders who just want to ship features), but you don’t stand a chance of making it through a scaling event without it, and as your user base and codebase grow, it becomes exponentially harder to retrofit good infrastructure. Sometimes you don’t have any way of knowing when traffic will spike, so plan ahead.
Let’s quickly go through how to handle some of the de-puppification (yes, that’s totally a word now) scenarios mentioned above.
CDN All the Things
As I mentioned, puppy servers can be a real problem with file uploads. The easiest way to handle this is to upload user files to a file storage solution via API. AWS S3 and some other file storage services have SDKs that make it pretty easy to do this. You can choose to upload directly to the storage solution, or if you need to do some image manipulation (cropping, resizing, etc.) beforehand, you can upload files to your filesystem first, manipulate them, and then immediately move them to S3, deleting the version on your server. (It’s important to delete them off of your server, or else you could end up eating your disk space without realizing it.)
AWS makes this very easy with their SDK. This is what it looks like to move your files to S3:
$local_path = 'path/on/your/server/filename.jpg';

$s3 = AWS::get('s3');
$s3->putObject(array(
    'Bucket'       => 'YOUR_AWS_BUCKET_NAME',
    'Key'          => 'path/on/s3/filename.jpg',
    'SourceFile'   => $local_path,
    'CacheControl' => 'max-age=172800',
    'Expires'      => gmdate('D, d M Y H:i:s T', strtotime('+3 years')),
));

// Remove the local copy so uploads don't quietly eat your disk space.
unlink($local_path);
That’s it. About ten lines of code, and your files are happy in S3, where you can set them as the origin for CloudFront and get all the good stuff that comes with a CDN.
Your CacheControl and Expires times may vary based on how often images might change, but do be sure to set those headers, or you’ll be missing out on some of the most helpful features of using a CDN.
Deployments

Deployments are a little trickier if you’re using autoscalers, since your deployment solution needs to be aware of whatever brand new servers have just been spun up and added to your autoscaler, so that the code can be deployed on them before they’re put into rotation. You could set up a custom init script in AWS to do a git checkout right after server provisioning, have it call a custom script to trigger a third-party service like DeployHQ, or use an out-of-the-box solution like AWS CodeDeploy. You don’t want to rely on your stored AWS images, or every time you deploy, you have to create a new snapshot. Blech.
Sessions

If your user sessions are stored on the filesystem, then when a box goes down (or a new one comes up and a user ends up connecting to it mid-session for some reason), those users get logged out. This is not exactly what I would consider a hard failure, but it’s something that can be easily avoided by storing user sessions in memcached or an RDBMS. Depending on your resources, all active users getting logged out when a server goes down may be a failure you can tolerate, but it’s at least something you should be thinking about and planning for.
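As a sketch of the idea (not production code), PHP lets you swap the filesystem session store for a shared one via SessionHandlerInterface. The handler below persists sessions to a database through PDO; I’m using SQLite here purely for illustration, and the table name and DSN are made up. In practice you would point the PDO DSN at your shared RDBMS (or use the memcached session handler), so every web server reads and writes the same session store.

```php
<?php
// Minimal database-backed session handler (illustrative sketch).
// Point the PDO DSN at a shared database so any web server can
// serve any user's session; table name 'sessions' is invented here.
class DbSessionHandler implements SessionHandlerInterface
{
    private $pdo;

    public function __construct(PDO $pdo)
    {
        $this->pdo = $pdo;
        $this->pdo->exec(
            'CREATE TABLE IF NOT EXISTS sessions (
                id TEXT PRIMARY KEY,
                data TEXT NOT NULL,
                updated_at INTEGER NOT NULL
            )'
        );
    }

    public function open($savePath, $sessionName) { return true; }
    public function close() { return true; }

    public function read($id)
    {
        $stmt = $this->pdo->prepare('SELECT data FROM sessions WHERE id = ?');
        $stmt->execute([$id]);
        $row = $stmt->fetch(PDO::FETCH_ASSOC);
        return $row ? $row['data'] : '';
    }

    public function write($id, $data)
    {
        $stmt = $this->pdo->prepare(
            'REPLACE INTO sessions (id, data, updated_at) VALUES (?, ?, ?)'
        );
        return $stmt->execute([$id, $data, time()]);
    }

    public function destroy($id)
    {
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE id = ?');
        return $stmt->execute([$id]);
    }

    public function gc($maxLifetime)
    {
        $stmt = $this->pdo->prepare('DELETE FROM sessions WHERE updated_at < ?');
        return $stmt->execute([time() - $maxLifetime]);
    }
}

// Register it before session_start():
// session_set_save_handler(new DbSessionHandler(new PDO($dsn)), true);
```

Once registered, the rest of your code keeps using `$_SESSION` as usual; nothing else changes, which is exactly why this is such a cheap de-puppification win.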
Cron Jobs

Crons can still be a bit of a pain in the ass, but solutions exist. Imagine you start off with one server, and you run your cron just like we did back in the days of bare metal servers. That all works fine until you have to scale up to 100 servers. If you haven’t planned for it, you’ll now be executing that cron 100 times. With a cron that is data-destructive or, for example, emails users, that could be a real disaster.
Some folks like to use a third-party job queuing system like Iron.io to handle cron jobs across multiple servers. I’m not a fan of that solution, as it introduces new dependency issues with your code, their uptime, and their servers. I prefer to have the cron running on all of the servers, and have a database handle the idempotence, but it really depends on the application and what the crons do. You’ll need to use your best judgment here and weigh all of the risks and benefits.
These are just some examples, but hopefully you’re starting to understand how to think about your infrastructure in a way that lets you more easily identify and solve for things that could potentially become bottlenecks during a scaling event.
In the incubator/start-up culture, there is a bizarre status achieved by having your servers melt because of a traffic spike. It’s dumb and it makes me angry. Failing to architect your system to handle traffic isn’t something to be proud of.
[tweetable alt=”Your tech stack is not a burrito. More layers isn’t always better”]Your tech stack is not a burrito. More layers isn’t always better.[/tweetable] Making the most out of the layers you do have, and making them more tolerant of high-traffic events will go a long way.