I’ve seen it a few times now, the operations team transition their platform through various phases, and each transition incurs a heavy time cost of redevelopment, restructuring, moving data and testing. This post is about a pattern for a platform that works well for range of web platforms. It involves some patterns and ideas that I think fit really well together and when used in symphony allow developers to focus on writing code, and operations to scale out using a tried and tested plan.
Many people are currently at some stage of refactoring towards the next platform for their service, and will probably be facing one or more of the following problems: separating a monolithic service into functional components, managing database schema changes without downtime, deploying applications across multiple hosts without race conditions, hand editing configuration for infrastructure changes, a lack of tooling, or environments distributed across regions. Developers are worrying about configuration, syncing up with platform on deployment strategies, handling failure, and investigating outages caused by bad deployments (the highest level of risk). Deployments are happening with huge features and carry a lot of risk. We want to move away from these issues with this pattern.
Disclaimer: I’m aiming at the median in this blueprint, and not the whole distribution of what different platforms require. This does not fit every platform’s needs, take what you want from it, keep it open, talk. I have also not implemented everything I’ve spoken about here, but I want to, and I hope it inspires other people.
I’m going to focus on a specific architecture that I think works really well: Service Orientated Architecture. The concept is that the web front-end controls just the view passed through to the end user. The back-end is responsible for generating user-specific content and making that available to the front-end (rendered HTML, JSON, etc). This back-end will be constructed from a number of separate services that each provide an area of focus for the application. The benefit of separating services out is that we can horizontally scale services behind load balancers, version and deploy services independently, write services in different languages so your team isn’t locked into one option, move services into clusters, and more.
One important feature to mention here is that these services should be stateless where possible. This means that we can scale horizontally here simply by putting them behind a load balancer or reverse proxy of your choosing. If you do need state, I would look at moving the state onto a store of some kind that can be shared such as a memory cache, or database.
Before we can think about configuring systems, we need to look at the provisioning process and answer questions about what hardware we need and how the hardware and OS should be configured.
Those using PaaS (such as Amazon EC2 or Rackspace) will not need to pay much attention to the hardware provisioning here, but its worth knowing what the foundations underpinning your application are. One day you may need to revisit this if you hit the limitations of virtualized PaaS solutions.
Lets talk hardware.
For machines running services, the distilled idea here will be to maximize CPU and RAM where needed; disk speed and size is almost irrelevant for stateless services that are not bound to IO.
Database servers are far more interesting. In the case of Relational databases, you will require high spec machines. For clustered solutions, high spec machines may not be as critical, but will almost always help.
For RDMS, the ideal memory setup is to have the database in memory the entire time, even during schema changes. Swapping data to disk takes a long time, even if it’s infrequent. You should try have enough to be able to account for operating load, schema changes (where temporary tables may be created), memory for connections and overhead, and also the OS. I recommend disabling the swap file/disk as well – if you’re swapping, you’ve already failed. For disk, SSD’s offer significantly better performance, even without configuring them as a RAID array. That said, they also generally have a shorter lifetime (~ 1 year). If you go the disk route, make sure to get the fastest available, usually you’ll be looking at a 15k RPM SAS drive. Whatever the choice in disk, the drives should be configured in RAID 1+0 which improves the fault tolerance and speed. CPU for database machines has been shown to have a relatively low impact on performance. Having 8 cores isn’t going to give you much over 4 cores. That said, I wouldn’t go lower than quad-core.
I suggest Linux, but the flavor is really up to you. Conservative groups tend to use CentOS and a lot of Red Hat Certified admins will be familiar with it. Ubuntu is popular in other circles because it has long term releases as well as intermediary shorter release cycles. It also has community driven package’s (via launchpad) which stay relatively up to date. Packaging is relatively easy, and launchpad offers a private repository space with a continuous build system (PPA’s).
For filesystems, ext4 is the prevalent default in use at the moment, and its a good option to stick with. There are some people who use XFS and so it could be worth looking into that as well. For database machines we want to turn off swap drives, and also make sure to create a separate partition for the database to store data. The reason for the disabling swap is because we never want to be in a situation where the DB is swapping, if you get to this point, you DB will grind to a halt anyway while trying to swap out huge chunks of RAM. The reason we go with a separate partition for data is so that we can tune the filesystem parameters. Depending on the filesystem, you may want to turn off access time updates to prevent updating access time after every read. Some situations also allow for changing the buffer flushing mechanism to be write-back so that data is not validated as stored before returning to the process. This has the obvious drawback that data could be corrupted during a critical failure.
Machine Inventory Service
At this point we have some machines up with some OS installed. Something that’s going to be useful as we go on, is a very basic inventory service that we can build off and use for tooling. When we look at automation later, we’ll see just how much easier it can make things to have an API into the list of machines that you have available.
The basic idea of the service will be to store/retrieve the following kind of facts about your machines: private ips, public ips, mac addresses, model, make, timestamps, rack, cluster, region, and hardware information such as disk sizes, raid configuration, memory size.
It should be pretty easy to populate this information with a basic provisioning script, maybe make use of some open source software like ohai or facter to help out. There may already be some open source implementations out there.
These days, configuration management is pretty central to a lot of web companies. The idea of storing your system configuration as code is pretty cool. You can dynamically create machines and play with various configurations without spending massive amounts of time rebuilding and installing applications.
So the general flow for configuration management is that a host polls the server for a new configuration, the server then looks up the host (usually by hostname) and delivers the configuration as code which can install applications, and configure subsystems. So, one area which I like to change in this system is the host identification. Usually a hostname match is used to push the right recipes, but what I find works better in a dynamic environment is using the Hardware Inventory service we wrote earlier, and pair it with a new Service Inventory Service (very meta), and we can use these to push the appropriate configuration back down. In a sense we keep the information about machines externally, and that data source directs how hosts are built and configured. This service is going to have a huge impact later as well.
Service Inventory Service
So the Service Inventory Service is a pretty simple API as well. We want to store a list of Services (application service, a database, a message queue, etc) and with each service we store the machine and port that it’s attached to, and some dynamic properties that we can change. So as an example, supposing we have a master/slave db setup, we would configure the master to be on hostA (using id from Hardware Inventory), and the slave to be on hostB. We would then create a property called ‘mysql.master’ and set to True/False respectively. If we wanted to, we could take it further and store arbitrary properties such a ‘mysql.conn_timeout’. In general we want to store properties that differentiate services across hosts. The API for this service should be pretty simple, offering functions such as ‘get hosts for service X’, ‘get services on host X’, ‘get hosts with property Y=X’.
Actual Configuration Management
So we have a way to identify machines and what services are running on them. With this information we can take a configuration management tool (most prevalent: chef, puppet, cfengine3) and put together some patterns. Whether home-brewed or from the community, you’ll be looking at getting packages installed and configurations setup dynamically. At this point, you’ll want to be looking at some of the ancillary tools that these engines give us and integrate them with your services. Hopefully what you’ll find is that you need to be less reliant on some assumptions about the way the platform is structured (because you have an api to the source of truth now).
This is pattern for operations, but I’d like to mention how the work done so far impacts the developers, or development in general. As an example issue that is fairly common, a service is configured to use other services at addresses which are set to use something like a VIP or hard-coded to a specific host and managed as a virtual host. This can make things difficult if you decide to move a service to another VIP: You would need to change the application that has moved, and redeploy that, and then change the configuration for all applications that reference that application. So, the pattern helps to solve this by allowing a service to dynamically find services on startup, or as needed, perhaps maintaining a connection pool. This means that a setting can be changed in just a single place, the service inventory. This is all enabled through some small functions and the use of our API’s; it will have to added benefit of simplifying code and configuration, letting the developer connect to the right service based on environment, region, or purpose without worrying about the underlying platform. As an example:
get_host(‘mysql’, master=True, environment=self.environment)
This pattern would not be complete without something on how to choose a database for the right purpose. NoSQL has exploded recently with options for document based, key-value, graph databases, clustering, and more. In all cases, there are two things to think about: Don’t just choose a database because there is a hype around it, and think about the CAP theorem. RDMS together with partitioning/sharding will be enough for a lot of people out there, but can be complex to manager and code against. Clustered architectures offer solutions to a lot of the complexity, but also bring in a number of their own complexities that are still not well understood by a lot of people (automatic data balancing and distribution, automatic master promotion, failure modes). Evaluate the expected growth for you data and choose according to the data-set at hand.
Database Schema Changes
The database is usually the most critical and risk prone areas of a product. Restoring from backups can take hours and getting something wrong here can be detrimental. I’m going to talk around relational databases here. The worst kind of changes are operations that put locks on an entire table or database. To avoid those, some people simply try to only ever append columns instead of trying to keep a clean schema. When that’s not possible, or not wanted, the best approach is to use a form of online schema change where changes are done on a read replica first, and then switched on as a master, and finally traffic is redirected there and the same change is made on the old master. With this in mind, it’s essential to ensure that code always makes explicit reference to column names, otherwise you will get strange behavior. Automating schema changes is also error prone and I would avoid it.
Database slaving gets us scalability of database reads fairly cheaply at the cost of some availability of data (see CAP theorem), but how do we scale writes. This is where partitioning can help us out. A basic partitioning schema to setup is to separate tables onto different machines, often co-locating close to the service that owns the data. This approach has the advantage of segmenting traffic, load, and data across multiple machines. It fits well with the Service Orientated Architecture since each functional unit is likely to own a set of tables, or even a whole database. The next step in partitioning, especially for large tables such as the User table, is to use a sharding. For sharding, we create a number of logical databases such as db0001-db9999 and distribute the table amongst them using a hash algorithm or a look-up table. The look-up table is more scalable but does incur an extra look-up per user.
Caching may be one of the most critical part of scaling an application. Without it, you are likely to be dead in the water.
Database Caching: databases manage their cache automatically. The OS buffer cache can often get in the way here because its essentially double caching the data (in the OS and the DB). There should be a setting to bypass the OS buffer caching (InnoDB allows this), if its not done already.
Application Caching: Applications in the very least should cache locally on the machine, but if you’re working with a clustered environment, it probably makes more sense to run some memcache instances to centralize common cache. If its a large farm, you will also need to come up with a way of sharding the key so that you can hit the right cache for the right data.
Deployment is not simple if you do not want any down time. Its more common than not to see a deployment process of simply pulling the latest code from the repository. This may work well for small deployments, but has some serious drawbacks. It has no mechanism to manage dependent libraries, orchestrate with external dependent actions (such as taking the machine out of the load balancer), and its difficult to scale past a few machines.
We want to build an artifact of some sort that contains the final binary, stores dependencies (or info about them), meta information such as tags, commit messages, release notes, vcs reference, timestamps, and potentially related db schema changes. With this all in one neat package, we can upload it somewhere and make it available to the infrastructure. We can query the packages and find out what has been deployed, where, when, by whom, what’s in the package, etc. There are also some neat tricks to tie this into your bug/task tracking software so that you can automatically see what features/bugs were fixed, etc.
Once we have this package in hand, we also need a way to unroll it, run any pre and post scripts, handle orchestration, and more. I’ve done some work with ssh-based frameworks but I’m generally not a fan of the way in which it limits the communication between hosts. I think a message queue based Command and Control application fits better and can allow hosts to communicate with each other. Where you put your logic is up to you, I find that putting deploy logic with the application makes sense, and the deploy controller will be responsible for higher level orchestration (load balancers, cache invalidation), reporting back, and probing machines for information.
Building packages manually is a pain, a continuous build system makes this easier. All your services can be made available to everyone including major feature branches and trunk. Once you have a continuous build system up and running, you can start moving it towards a continuous deployment system where unit tests, performance tests, and ultimately deployments themselves are done automatically. Some companies find this controversial, but the biggest reason behind a system like this is reducing pain. Deployments are often painful because they go wrong, they have a lot of risk and so tend to be large feature releases that are not done often. The continuous deployment concept is to iterate often with smaller releases. This means less risk, and it means finding pain points fast, and automating them, or improving them.
Once the system is operational, monitoring is the next must-have. From an operational point of view, we’re interested in logs and metrics.
For logging, essentially what we want is for each host to run a daemon which passes logs through the network to an aggregator which can pull out metrics, index and archive them. There are some open source and commercial options available here. With this setup, a prevalent idea is to structure transactional logs into a schema-less format such as JSON so that the loading and analysis can be done fairly trivially. With the logs being indexed centrally somewhere, we can pull audit information for a user, or analyze them for patterns, exceptions, errors and warnings. Another recommendation is to use something called a Trace-ID. Essentially, when a request hits a front-end, a trace-id is generated that gets sent with every service request so that a single request can be traced through various services. This trace-id should be logged so that a faulty request can be debugged, analyzed for performance. Adding meta information to logging opens up a lot of options for adding debugging tools to the tool-chain.
For metrics, one should definitely be collecting both system metrics, and application metrics. Without metrics, its difficult to analyze root issues, debug historic events, and know what’s happening on your system in general. Application metrics could be built on the back of logging, or they can emit them directly. Build dashboards, metrics are of no use if nobody is looking at them.
We want to be able to send alerts based on system state, and metrics is where’s its at. Most monitoring systems come with some basic alerting stuff, but I think we need something more powerful. Ideally we need a proxy which processes logs and metrics, filters them, and routes them to indexers, alerters, and storage. If the system can keep track of some history in a dynamic way, it means that we can get features like digest emails for exceptions and history based alerting for some metrics.
I am not an all knowing being, if you spot errors, or want to discuss something that you think I’ve said in error, omitted, or could improve – let me know. All trolls will be ignored.