25 Sep 12

Nginx service scripts

This is just a really quick note that will surely help some folks out there.

The Nginx service scripts (including Upstart and init.d) use a PID file to track the application’s process ID. Nginx, however, forks itself on startup, so the PID changes and the service scripts have no way of knowing the new one. This can break the script actions that need to find the process (status and stop). If you are hitting this issue, the most likely cause is that Nginx is writing its PID to a different file from the one the service scripts read. Make sure the PID file specified in the Nginx configuration file (the pid directive) matches the one specified in the service scripts. Then, when Nginx forks, it updates the PID file, everything stays in sync, and the service scripts work as advertised.

05 Jul 12

Web Operations Pattern

Introduction

I’ve seen it a few times now: an operations team transitions its platform through various phases, and each transition incurs a heavy time cost of redevelopment, restructuring, moving data and testing. This post is about a pattern for a platform that works well for a range of web platforms. It involves some patterns and ideas that I think fit really well together and, when used in symphony, allow developers to focus on writing code and operations to scale out using a tried and tested plan.

Many people are currently at some stage of refactoring towards the next platform for their service, and will probably be facing one or more of the following problems: separating a monolithic service into functional components, managing database schema changes without downtime, deploying applications across multiple hosts without race conditions, hand-editing configuration for infrastructure changes, a lack of tooling, or environments distributed across regions. Developers are worrying about configuration, syncing up with the platform team on deployment strategies, handling failure, and investigating outages caused by bad deployments (the highest level of risk). Deployments bundle huge features and carry a lot of risk. We want to move away from these issues with this pattern.

Disclaimer: I’m aiming at the median in this blueprint, and not the whole distribution of what different platforms require. This does not fit every platform’s needs, so take what you want from it, keep it open, and talk. I have also not implemented everything I’ve spoken about here, but I want to, and I hope it inspires other people.

Web/Service Pattern

I’m going to focus on a specific architecture that I think works really well: Service-Oriented Architecture. The concept is that the web front-end controls just the view passed through to the end user. The back-end is responsible for generating user-specific content and making that available to the front-end (rendered HTML, JSON, etc.). This back-end is constructed from a number of separate services that each provide an area of focus for the application. The benefit of separating services out is that we can horizontally scale services behind load balancers, version and deploy services independently, write services in different languages so your team isn’t locked into one option, move services into clusters, and more.

One important feature to mention here is that these services should be stateless where possible. This means that we can scale horizontally here simply by putting them behind a load balancer or reverse proxy of your choosing. If you do need state, I would look at moving the state onto a store of some kind that can be shared such as a memory cache, or database.

Server Provisioning

Before we can think about configuring systems, we need to look at the provisioning process and answer questions about what hardware we need and how the hardware and OS should be configured.

Those using a cloud provider (such as Amazon EC2 or Rackspace) will not need to pay much attention to the hardware provisioning here, but it’s worth knowing what the foundations underpinning your application are. One day you may need to revisit this if you hit the limitations of virtualized solutions.

Let’s talk hardware.

For machines running services, the distilled idea here is to maximize CPU and RAM where needed; disk speed and size are almost irrelevant for stateless services that are not bound to IO.

Database servers are far more interesting. In the case of relational databases, you will require high spec machines. For clustered solutions, high spec machines may not be as critical, but will almost always help.

For an RDBMS, the ideal memory setup is to have the database in memory the entire time, even during schema changes. Swapping data to disk takes a long time, even if it’s infrequent. You should try to have enough memory to account for operating load, schema changes (where temporary tables may be created), connections and overhead, and the OS itself. I recommend disabling the swap file/disk as well: if you’re swapping, you’ve already failed. For disk, SSDs offer significantly better performance, even without configuring them as a RAID array. That said, they also generally have a shorter lifetime (~1 year). If you go the spinning-disk route, make sure to get the fastest available; usually you’ll be looking at a 15k RPM SAS drive. Whatever the choice of disk, the drives should be configured in RAID 1+0, which improves both fault tolerance and speed. CPU for database machines has been shown to have a relatively low impact on performance: having 8 cores isn’t going to give you much over 4 cores. That said, I wouldn’t go lower than quad-core.

Operating System

I suggest Linux, but the flavor is really up to you. Conservative groups tend to use CentOS, and a lot of Red Hat Certified admins will be familiar with it. Ubuntu is popular in other circles because it has long-term support releases as well as intermediary shorter release cycles. It also has community-driven packages (via Launchpad) which stay relatively up to date. Packaging is relatively easy, and Launchpad offers a private repository space with a continuous build system (PPAs).

For filesystems, ext4 is the prevalent default in use at the moment, and it’s a good option to stick with. There are some people who use XFS, so it could be worth looking into that as well. For database machines we want to turn off swap, and also make sure to create a separate partition for the database to store data. The reason for disabling swap is that we never want to be in a situation where the DB is swapping; if you get to this point, your DB will grind to a halt anyway while trying to swap out huge chunks of RAM. The reason we go with a separate partition for data is so that we can tune the filesystem parameters. Depending on the filesystem, you may want to turn off access time updates to avoid a metadata write after every read. Some situations also allow for changing the buffer flushing mechanism to write-back, so that a write returns to the process before the data is confirmed to be on disk. This has the obvious drawback that data could be corrupted during a critical failure.

Machine Inventory Service

At this point we have some machines up with an OS installed. Something that’s going to be useful as we go on is a very basic inventory service that we can build on and use for tooling. When we look at automation later, we’ll see just how much easier it makes things to have an API into the list of machines that you have available.

The basic idea of the service is to store and retrieve the following kinds of facts about your machines: private IPs, public IPs, MAC addresses, model, make, timestamps, rack, cluster, region, and hardware information such as disk sizes, RAID configuration, and memory size.

It should be pretty easy to populate this information with a basic provisioning script, perhaps making use of open source software like Ohai or Facter to help out. There may already be some open source implementations out there.

Configuration Management

These days, configuration management is pretty central to a lot of web companies. The idea of storing your system configuration as code is pretty cool. You can dynamically create machines and play with various configurations without spending massive amounts of time rebuilding and installing applications.

So the general flow for configuration management is that a host polls the server for a new configuration; the server then looks up the host (usually by hostname) and delivers the configuration as code, which can install applications and configure subsystems. One area which I like to change in this system is the host identification. Usually a hostname match is used to push the right recipes, but what I find works better in a dynamic environment is using the Machine Inventory Service described earlier and pairing it with a new Service Inventory Service (very meta). We can use these two to push the appropriate configuration back down. In a sense we keep the information about machines externally, and that data source directs how hosts are built and configured. This service is going to have a huge impact later as well.

Service Inventory Service

So the Service Inventory Service is a pretty simple API as well. We want to store a list of services (an application service, a database, a message queue, etc.), and with each service we store the machine and port that it’s attached to, plus some dynamic properties that we can change. As an example, supposing we have a master/slave DB setup, we would configure the master to be on hostA (using its id from the Machine Inventory Service), and the slave to be on hostB. We would then create a property called ‘mysql.master’ and set it to True and False respectively. If we wanted to, we could take it further and store arbitrary properties such as ‘mysql.conn_timeout’. In general we want to store properties that differentiate services across hosts. The API for this service should be pretty simple, offering functions such as ‘get hosts for service X’, ‘get services on host X’, and ‘get hosts with property Y=X’.
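To make the three queries above concrete, here is a minimal in-memory sketch in Python. It is only an illustration under my own assumptions: the class and method names (ServiceInventory, register, hosts_for_service and so on) are hypothetical, and a real service would sit behind an API with persistent storage.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceInstance:
    """One service attached to one host and port (illustrative structure)."""
    service: str
    host: str
    port: int
    properties: dict = field(default_factory=dict)


class ServiceInventory:
    """In-memory stand-in for the Service Inventory Service API."""

    def __init__(self):
        self._instances = []

    def register(self, service, host, port, properties=None):
        self._instances.append(
            ServiceInstance(service, host, port, dict(properties or {}))
        )

    def hosts_for_service(self, service):
        # 'get hosts for service X'
        return [i.host for i in self._instances if i.service == service]

    def services_on_host(self, host):
        # 'get services on host X'
        return [i.service for i in self._instances if i.host == host]

    def hosts_with_property(self, key, value):
        # 'get hosts with property Y=X'
        return [i.host for i in self._instances if i.properties.get(key) == value]


inventory = ServiceInventory()
inventory.register("mysql", "hostA", 3306, {"mysql.master": True})
inventory.register("mysql", "hostB", 3306, {"mysql.master": False})
inventory.register("rabbitmq", "hostB", 5672)
```

With this data, asking for hosts with ‘mysql.master’ set to True yields only hostA, which is the kind of answer configuration management and tooling can build on.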

Actual Configuration Management

So we have a way to identify machines and what services are running on them. With this information we can take a configuration management tool (most prevalent: Chef, Puppet, CFEngine 3) and put together some patterns. Whether home-brewed or from the community, you’ll be looking at getting packages installed and configurations set up dynamically. At this point, you’ll want to look at some of the ancillary tools that these engines give us and integrate them with your services. Hopefully what you’ll find is that you need to rely less on assumptions about the way the platform is structured (because you now have an API to the source of truth).

Coding

This is a pattern for operations, but I’d like to mention how the work done so far impacts the developers, or development in general. As an example of a fairly common issue, a service is configured to reach other services at addresses that point to something like a VIP, or that are hard-coded to a specific host and managed as a virtual host. This can make things difficult if you decide to move a service to another VIP: you would need to change the application that has moved and redeploy it, and then change the configuration of every application that references it. The pattern helps to solve this by allowing a service to find other services dynamically on startup, or as needed, perhaps maintaining a connection pool. This means that a setting can be changed in just a single place, the service inventory. This is all enabled through some small functions and the use of our APIs; it has the added benefit of simplifying code and configuration, letting the developer connect to the right service based on environment, region, or purpose without worrying about the underlying platform. As an example:
get_host('mysql', master=True, environment=self.environment)
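One way such a helper could look is sketched below in Python. This is an assumption-laden illustration rather than a finished client: the SERVICES list stands in for whatever the Service Inventory Service would return, and the record fields and host names are hypothetical.

```python
# Hypothetical snapshot of records returned by the Service Inventory Service.
SERVICES = [
    {"service": "mysql", "host": "hostA", "port": 3306,
     "master": True, "environment": "production"},
    {"service": "mysql", "host": "hostB", "port": 3306,
     "master": False, "environment": "production"},
    {"service": "mysql", "host": "hostC", "port": 3306,
     "master": True, "environment": "staging"},
]


def get_host(service, records=SERVICES, **properties):
    """Return (host, port) of the first record matching the service and all properties."""
    for record in records:
        if record["service"] != service:
            continue
        if all(record.get(key) == value for key, value in properties.items()):
            return record["host"], record["port"]
    raise LookupError(f"no {service!r} instance matches {properties}")


# The application asks for a role, not for an address.
host, port = get_host("mysql", master=True, environment="production")
```

Moving the production master to another machine then only requires changing the inventory record; the calling code stays the same.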

Database Choice

This pattern would not be complete without something on how to choose a database for the right purpose. NoSQL has exploded recently with options for document-based, key-value, and graph databases, clustering, and more. In all cases, there are two things to think about: don’t just choose a database because there is hype around it, and think about the CAP theorem. An RDBMS together with partitioning/sharding will be enough for a lot of people out there, but can be complex to manage and code against. Clustered architectures offer solutions to a lot of that complexity, but also bring in a number of their own complexities that are still not well understood by a lot of people (automatic data balancing and distribution, automatic master promotion, failure modes). Evaluate the expected growth of your data and choose according to the data-set at hand.

Database Schema Changes

The database is usually one of the most critical and risk-prone areas of a product. Restoring from backups can take hours, and getting something wrong here can be very damaging. I’m going to talk about relational databases here. The worst kind of changes are operations that put locks on an entire table or database. To avoid those, some people simply try to only ever append columns instead of trying to keep a clean schema. When that’s not possible, or not wanted, the best approach is to use a form of online schema change where the change is made on a read replica first, the replica is then promoted to master, traffic is redirected there, and finally the same change is made on the old master. With this in mind, it’s essential to ensure that code always makes explicit reference to column names, otherwise you will get strange behavior while the two schemas coexist. Automating schema changes is also error-prone and I would avoid it.

Database Partitioning

Database slaving gets us scalability of database reads fairly cheaply at the cost of some availability of data (see CAP theorem), but how do we scale writes? This is where partitioning can help us out. A basic partitioning scheme to set up is to separate tables onto different machines, often co-locating them close to the service that owns the data. This approach has the advantage of segmenting traffic, load, and data across multiple machines. It fits well with the Service-Oriented Architecture since each functional unit is likely to own a set of tables, or even a whole database. The next step in partitioning, especially for large tables such as the User table, is to use sharding. For sharding, we create a number of logical databases such as db0001-db9999 and distribute the table amongst them using a hash algorithm or a look-up table. The look-up table is more scalable but does incur an extra look-up per user.
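The two ways of placing a user on a shard can be sketched as follows in Python. This is a minimal illustration: the shard count, the use of MD5 as the stable hash, and the names shard_by_hash and ShardLookupTable are my assumptions, not a prescribed design.

```python
import hashlib

NUM_SHARDS = 16  # assumed number of logical databases for this example


def shard_by_hash(user_id, num_shards=NUM_SHARDS):
    """Map a user to a logical database name using a stable hash."""
    digest = hashlib.md5(str(user_id).encode("utf-8")).hexdigest()
    return f"db{int(digest, 16) % num_shards + 1:04d}"


class ShardLookupTable:
    """Look-up table variant: each user is pinned to a shard explicitly."""

    def __init__(self):
        self._table = {}

    def assign(self, user_id, shard):
        self._table[user_id] = shard

    def shard_for(self, user_id):
        # This is the extra look-up per user.
        return self._table[user_id]


table = ShardLookupTable()
table.assign(42, "db0007")
```

The hash approach needs no extra storage, but changing the number of shards remaps most users. The look-up table lets individual users be moved between shards at the cost of one more query.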

Caching

Caching may be one of the most critical parts of scaling an application. Without it, you are likely to be dead in the water.

Database Caching: databases manage their cache automatically. The OS buffer cache can often get in the way here because it’s essentially double caching the data (in the OS and the DB). There should be a setting to bypass the OS buffer cache (InnoDB allows this by setting innodb_flush_method to O_DIRECT), if it’s not done already.

Application Caching: Applications should at the very least cache locally on the machine, but if you’re working with a clustered environment, it probably makes more sense to run some memcached instances to centralize the common cache. If it’s a large farm, you will also need to come up with a way of sharding the key so that you hit the right cache for the right data.
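One possible way of sharding cache keys across a farm is a consistent-hash ring, sketched below in Python. Consistent hashing is my choice for the example, not something the pattern requires, and the node addresses and the CacheRing name are hypothetical.

```python
import bisect
import hashlib


class CacheRing:
    """Consistent-hash ring that maps cache keys to cache nodes."""

    def __init__(self, nodes, replicas=100):
        # Each node is placed on the ring many times to even out the spread.
        self._ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self._hashes = [point for point, _ in self._ring]

    @staticmethod
    def _hash(value):
        return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)

    def node_for(self, key):
        # First ring point at or after the key's hash, wrapping around at the end.
        index = bisect.bisect_left(self._hashes, self._hash(key)) % len(self._ring)
        return self._ring[index][1]


nodes = ["cache1:11211", "cache2:11211", "cache3:11211"]
ring = CacheRing(nodes)
target = ring.node_for("user:42:profile")
```

The appeal of this scheme over a plain modulo is that removing a node only moves the keys that lived on that node; keys on the remaining nodes stay where they were.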

Web Caching: If you’re serving static content in any way, use a caching proxy such as Varnish. If you’re serving dynamic content, the caching in your services should save you the majority of the time spent. The latest HTML specs also give a lot of leeway for dynamic content loading, which has contributed to the huge popularity of frameworks like Node.js. In these systems, the content is mostly static, but contains JavaScript which makes calls to backend APIs that perform the necessary logic and return data used to render an update.

Deployment

Deployment is not simple if you do not want any downtime. It’s more common than not to see a deployment process that simply pulls the latest code from the repository. This may work well for small deployments, but has some serious drawbacks. It has no mechanism to manage dependent libraries or to orchestrate external dependent actions (such as taking the machine out of the load balancer), and it’s difficult to scale past a few machines.

We want to build an artifact of some sort that contains the final binary, its dependencies (or information about them), meta information such as tags, commit messages, release notes, VCS reference, and timestamps, and potentially related DB schema changes. With this all in one neat package, we can upload it somewhere and make it available to the infrastructure. We can query the packages and find out what has been deployed, where, when, by whom, what’s in the package, etc. There are also some neat tricks to tie this into your bug/task tracking software so that you can automatically see which features were added, which bugs were fixed, etc.

Once we have this package in hand, we also need a way to unroll it, run any pre and post scripts, handle orchestration, and more. I’ve done some work with SSH-based frameworks, but I’m generally not a fan of the way they limit communication between hosts. I think a message queue based Command and Control application fits better and can allow hosts to communicate with each other. Where you put your logic is up to you. I find that putting deploy logic with the application makes sense, with the deploy controller responsible for higher level orchestration (load balancers, cache invalidation), reporting back, and probing machines for information.

Building packages manually is a pain; a continuous build system makes this easier. All your services can be made available to everyone, including major feature branches and trunk. Once you have a continuous build system up and running, you can start moving it towards a continuous deployment system where unit tests, performance tests, and ultimately deployments themselves are done automatically. Some companies find this controversial, but the biggest reason behind a system like this is reducing pain. Deployments are often painful because they go wrong; they carry a lot of risk and so tend to be large feature releases that are not done often. The continuous deployment concept is to iterate often with smaller releases. This means less risk, and it means finding pain points fast and then automating or improving them.

Monitoring

Once the system is operational, monitoring is the next must-have. From an operational point of view, we’re interested in logs and metrics.

For logging, essentially what we want is for each host to run a daemon which passes logs over the network to an aggregator which can pull out metrics, and index and archive the logs. There are some open source and commercial options available here. With this setup, a prevalent idea is to structure transactional logs in a schema-less format such as JSON so that loading and analysis can be done fairly trivially. With the logs being indexed centrally somewhere, we can pull audit information for a user, or analyze them for patterns, exceptions, errors and warnings. Another recommendation is to use something called a Trace-ID. Essentially, when a request hits a front-end, a trace-id is generated that gets sent with every service request so that a single request can be traced through the various services. This trace-id should be logged so that a faulty request can be debugged or analyzed for performance. Adding meta information to logging opens up a lot of options for adding debugging tools to the tool-chain.
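A minimal sketch of JSON-structured log lines carrying a trace-id might look like this in Python. The field names, the use of a random UUID as the trace-id format, and the service names are assumptions made for the example.

```python
import json
import time
import uuid


def new_trace_id():
    """Generate a trace-id at the front-end (a random UUID is an assumed format)."""
    return uuid.uuid4().hex


def log_event(service, message, trace_id, **fields):
    """Build one JSON log line carrying the trace-id and any extra fields."""
    record = {
        "timestamp": time.time(),
        "service": service,
        "trace_id": trace_id,
        "message": message,
    }
    record.update(fields)
    return json.dumps(record, sort_keys=True)


def handle_front_end_request(user_id):
    """Simulate one request passing from the front-end to a back-end service."""
    trace_id = new_trace_id()
    lines = [log_event("front-end", "request received", trace_id, user_id=user_id)]
    # The same trace-id travels with every downstream service call.
    lines.append(log_event("user-service", "profile loaded", trace_id, user_id=user_id))
    return lines


log_lines = handle_front_end_request(user_id=7)
```

Once the aggregator has indexed these lines, filtering on a single trace_id value returns every log entry for that one request, across all services.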

For metrics, one should definitely be collecting both system metrics and application metrics. Without metrics, it’s difficult to analyze root issues, debug historic events, and know what’s happening on your system in general. Application metrics can be built on the back of logging, or applications can emit them directly. Build dashboards; metrics are of no use if nobody is looking at them.

We want to be able to send alerts based on system state, and metrics are where it’s at. Most monitoring systems come with some basic alerting, but I think we need something more powerful. Ideally we need a proxy which processes logs and metrics, filters them, and routes them to indexers, alerters, and storage. If the system can keep track of some history in a dynamic way, it means that we can get features like digest emails for exceptions and history-based alerting for some metrics.
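As a rough illustration of history-based alerting, the Python sketch below keeps a short rolling history per metric and flags values well above the recent average. The class name, the threshold rule, and the in-memory lists standing in for an alerter and a storage backend are all assumptions for the example.

```python
from collections import defaultdict, deque
from statistics import mean


class MetricRouter:
    """Stores every metric and alerts when a value far exceeds its recent history."""

    def __init__(self, history_size=20, min_samples=5, threshold=2.0):
        self._history = defaultdict(lambda: deque(maxlen=history_size))
        self._min_samples = min_samples
        self._threshold = threshold  # alert when value > threshold * recent mean
        self.alerts = []  # stand-in for an alerter
        self.stored = []  # stand-in for metric storage

    def handle(self, name, value):
        history = self._history[name]
        # Only compare against history once enough samples have been seen.
        if len(history) >= self._min_samples and value > self._threshold * mean(history):
            self.alerts.append((name, value))
        history.append(value)
        self.stored.append((name, value))


router = MetricRouter()
for latency in [10, 11, 9, 10, 10]:
    router.handle("api.latency_ms", latency)
router.handle("api.latency_ms", 50)
```

The same routing point is where digest emails could be assembled, since it already sees every event and keeps some history.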

RFC

I am not an all-knowing being. If you spot errors, or want to discuss something that you think I’ve said in error, omitted, or could improve, let me know. All trolls will be ignored.

31 Aug 11

Chef Recipes: ruby and resource execution order

Chef is really powerful because it is written as a layer on top of an interpreted language (Ruby). So when you come up against a limitation in the DSL, you are able to dig into the layer beneath the DSL and manipulate objects directly. I want to focus on some of the things I’ve found useful when writing recipes that have let me force execution order, or at least understand its boundaries.

Execution order of ruby code

The first thing to understand about writing a recipe is how it is interpreted. When a resource is declared (package, file, …), the resource is tacked onto a list, and the list grows as the cookbooks are evaluated. When all cookbooks have been evaluated, the resources in the list are executed in order. So how will code that is outside of a resource block run? What we find is that the resource’s base class does the job of appending itself to this list, so if you put Ruby code outside of a resource block, it will run before any resource in the list is executed. Just to make sure this is clear, let’s take a look at the following recipe:

 
package 'apache2' do
  action :install
end

Chef::Log.info("Apache2 installed")
 

The initial thought might be that the package will be installed and then you will see a log message “Apache2 installed”. What really happens is that the package resource is evaluated and appended to the resource list. The log statement then runs and outputs “Apache2 installed”, and finally the package is installed when the resource list is executed.

Ruby blocks

So how do we make sure our Ruby is run after a specific resource? Ruby blocks are maybe a little under-utilized. They simply let us execute Ruby code as a resource in the list, taking advantage of the base resource class methods.

 
ruby_block 'ruby-logstuff' do
  block do
    Chef::Log.info("write some stuff out")
  end
end

There is another way, but I haven’t done much looking into it: simply put the Ruby inside the resource block. It works, but the body of a resource block is evaluated when the resource is declared rather than when its action runs, so the code executes before the resource’s actions.

 
package 'apache2' do
  action :install
  Chef::Log.info("Apache2 installed")
end

Forcing a resource to run immediately

We may want a resource to run before anything else in any cookbook. Usually you would separate this kind of thing out into its own recipe so that you can ensure it runs first. However, sometimes we may want to keep resources grouped together because it’s logical. Every resource declaration returns a reference to the resource itself, and this lets us call run_action on it straight away! The following is a silly impractical example…


# this will run 2nd
package 'haproxy' do
  action :install
end

# this will run first: run_action executes the action immediately,
# while the recipe is still being evaluated
package 'apache2' do
  action :nothing
end.run_action(:install)
 

03 Jun 11

First Impressions on Chef

I’m going to start with a quick bit of background on my experience because it’s important. I am not a Chef evangelist. I arrived on the scene at a company that already used Puppet and had invested some time there. It had been a bit slow moving, and so over a few months I helped to pick it up. The devops team brought Puppet to a state where it was usable and fitted with our systems and our idea of how data and code should be managed. We developed ideas around provisioning machines, what should be managed, and how. We integrated with products like MCollective to help us manage servers, and integrated with some of our own internal systems. During the last 10-12 months I have been learning both the intricacies of Puppet and the principles of Configuration Management. A lot of this time was spent evolving the Puppet architecture and the supporting systems into a state that worked in context.

Over the past week and a bit I have installed a Chef server and brought up a fairly complex machine. I’m going to touch on a few areas of Chef that I’ve worked with…

Installation

The Chef installation process is good. For a first time installer, the documentation is there, but you have to jump around the concepts and terminology a bit before you realize what needs to be done. I also hit a bug in RabbitMQ where the whole install fails if your Chef domain name doesn’t resolve. I had set up /etc/hosts, but the resolution doesn’t use it, so you need to use a domain name that actually resolves.

They’ve put a whole lot of thought into automatically registering nodes. Anything special you want to do, like creating self-classifying nodes, is easy enough considering you have the flexibility of Ruby and all the node information at your disposal, including the Ohai data.

Once you understand the authorization mechanisms and how to use the tools, you can install Chef on client and server in under 5 minutes.

Writing Recipes

I’ll go into tools later, but when it comes to a new cookbook, you can just issue a simple knife command to create a template to work from. Remove a few directories you don’t need, and start writing recipes. Recipes are pretty simple to write, and are basically just configuring a resource (packages, files, links, etc).

Chef recipes have two really strong points. They are deterministic, and they can use Ruby code.

A deterministic recipe makes a world of difference in some situations. Configuration management systems can use dependencies to specify run order, but I find this can be complex, hard to read, and difficult to debug. I like determinism because it is simple, implicit dependency management.

Ruby code is also a big win. You can use data from anywhere to make decisions about what should run. You can make decisions based on whether a file exists on the node, or whether some data has a specific value on another node somewhere. You can construct variables that can be used globally, and really, whatever you want. It opens you up to the mistake of side-stepping best practices, but I don’t think Chef lacks features that would make you want to do this.

Managing Data

Data comes from 4 sources: database, ohai, attributes, recipes.

The database, AKA data-bags, is the most removed data source, for your more miscellaneous kind of data. It has a very clear place in my mind and I know exactly the sort of data to store here: information that is relevant to multiple nodes, such as usernames and passwords for services. Relevant to that is the added benefit of encrypted data bags. This adds a level of security and is a great feature. For anyone concerned about keeping information in the database, a best practice is to keep the stored JSON in Subversion, and simply upload it to the database. Nice.

Ohai is the node-level information reported in by the node, such as number of CPUs, IP address, EC2 AMI id, load, etc. I find the information to be very good, and one of my favourite things about Ohai is that the format is nested (like JSON). This way it’s easy to keep the namespace clean and be clear about your data. I found it to be easily extensible as well, but more on that further down.

Attributes are recipe-level variables. You would set up something like the default Apache port here and use it in your template. There has been a lot of thought on how recipes interact, and so there is a nice multi-tier mechanism to override variables. Often you will end up overriding these variables using information from other data sources or just hard-coding them (as an override, not in the cookbook).

Finally, recipes can set variables, change them, etc. These variables are persistent and can be used by other recipes.

These data sources together provide a great deal of flexibility in how you manage your data and ensure separation of data and code.

Extending

I haven’t played a lot with extending Chef. I dabbled with an Ohai plugin, and wrote a quick Lightweight Resource and Provider (LWRP). I found the documentation to be a bit lacking for Ohai plugin writing, and the examples given were all very simple. My Ruby isn’t amazing by any standards, and so I struggled to figure out a few things that might come more naturally to others. Distributing the Ohai plugin was easy: you just need to put it high up in the run-list, and any cookbooks after that will have access to the data from the plugin.

Writing an LWRP was pretty simple given the documentation and examples. An LWRP lets you abstract away a resource (e.g. file, symbolic link, package) and a provider (something that manages the resource on the platform). LWRPs let you quickly encapsulate common code into a single resource, which helps reduce code size and improve readability. It’s nice to be able to abstract resources and providers away even at a basic level; it makes for much neater and more understandable code.

There are more avenues for extending, but I haven’t explored them yet.

Tools

Chef comes with two main tools that I know of: Knife and Shef.

Knife is, well, the Swiss army knife of Chef. It lets you manage Chef from anywhere, and includes commands for everything you will need to do. The basic commands cover nodes, clients, cookbooks, data-bags, environments and roles. It also has commands to configure clients through a Q&A interface, which is used as part of the install process. Knife connects to the server, so its strength is that everything you need to control Chef can be done remotely from any machine. You do not need to be on the central server to upload new cookbooks, modify data, change parameters, etc…

Shef is a ruby shell running in the Chef client context. You run this from an end node and use it to debug cookbooks and recipes step by step. You can assume roles, run any cookbook, set any variables, etc. It really lets you get involved in the debugging of a cookbook. I haven’t spent much time on Shef yet, but I plan to get to know it very well.

Community

The Chef community is good. I’ve been on IRC asking a lot of questions, and mostly got back really good answers. Opscode and 37signals have open-sourced a whole lot of cookbooks, which is really helpful in getting up and running quickly. They are all excellent and I didn’t need to modify anything at all for my purposes. They also helped with learning, as often you have to read the code to figure out what they do, how they do it, and what data they need. Big thanks to both teams for their hard work in this area.

What next…

I will be looking at bringing up a few more machines and seeing what challenges come up. I have been impressed on all fronts, and personally, Chef suits a lot of the ways that I think about problems and how to solve them. I still need to draw my own conclusions on whether Puppet or Chef is right for the systems and architecture I work with, and I advise anyone looking into Configuration Management to do the same and explore their options.

10 May 11

Puppet: Lessons Learned

I’ve been a Puppet user for almost a year, and, after embracing every aspect (as much as I had time for), I’ve learned a few lessons. If you’re just starting out with Puppet, skim over the following…

Configuration Management, not Machine Provisioning
Puppet is designed for managing your systems’ configurations across hosts to keep them working the way they are intended to. In this light, Puppet is the tool of choice to manage things like SSH keys, Apache configs, or the hosts file. Puppet is not the tool of choice when you’re thinking about provisioning a machine with the push of a button that can install your packages, deploy applications, configure dependencies and create databases. While Puppet is very capable of doing these tasks, it requires huge amounts of effort to ensure that all the operations are idempotent. While the button may call a one-off bootstrap process, the process must be able to run again and leave the system in the same state. If you try to do this, you’re putting a lot of time and effort into something that is better suited to another tool. Use Puppet to manage configurations, not to create a server in a working role from scratch.

Learn the language
Puppet uses an extensive language in its manifests, which includes things like lists and hashes, automatic documentation, global parameters, parameterized classes, node definitions, and more. All of these components help create the core of puppet and give you the power to achieve what you want. At the very least, you need to familiarize yourself with what is available, so that when you come to a problem you can recognize the right commands to use. There are a lot of best practices which will help to minimize the time spent restructuring manifests and code when you find an exception to what you think is the best way.

Define relationships explicitly
Puppet can be thought of as a stateful machine where classes are mainly independent. As such, there is no deterministic ordering of when classes are called. This can create issues if you don’t define your class relationships explicitly. Defining order (for example with the require and before metaparameters) will ensure that requirements are met before proceeding to the next step. There has been a lot of work on this recently, and it’s definitely worth the time to order your dependencies properly from the start.
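
As an illustration only (this is not how Puppet is implemented, and the Catalog class and resource names are made up), the sketch below uses Ruby's standard TSort module to show why explicit relationships matter: once each resource states what it requires, a valid order falls out, and a dependency loop is reported instead of being silently applied in some arbitrary order.

```ruby
require "tsort"

# Hypothetical catalog: each resource maps to the resources it requires.
class Catalog
  include TSort

  def initialize(requires)
    @requires = requires
  end

  # TSort needs to know how to walk every node...
  def tsort_each_node(&block)
    @requires.each_key(&block)
  end

  # ...and how to walk the dependencies of a single node.
  def tsort_each_child(node, &block)
    @requires.fetch(node, []).each(&block)
  end
end

catalog = Catalog.new(
  "service[apache]"  => ["file[httpd.conf]"],
  "file[httpd.conf]" => ["package[apache]"],
  "package[apache]"  => []
)

# Dependencies are listed before the resources that require them.
puts catalog.tsort
```

A circular requirement raises TSort::Cyclic here, which is roughly the kind of error you want to see early rather than debug on a live node.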

Don’t be scared to extend puppet
Puppet is designed to be extensible. It has supporting applications (like mcollective) and built-in plugin architecture which allows for various enhancements. You can create your own system to tell puppet which manifest or classes a node should get based on whatever parameters you want by developing your own node classifier. You can create functions which can be used within manifests to modify variables, or fetch information. You can create your own resource types to handle a custom resource in the way you want it handled. What’s great is that none of this development is particularly difficult, and the information is available. There are also 3rd party extensions available, so be sure to search to find out if anyone has already done what you’re looking to do.
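
To give a feel for how small a node classifier can be, here is a rough sketch of an external node classifier in Ruby. Puppet runs the script with the node name as its first argument and reads YAML from standard output. The hostname convention (role, number, then platform), the ROLE_CLASSES table and the classify helper are assumptions made up for this example.

```ruby
require "yaml"

# Hypothetical convention: hostnames look like "<role><nn>.<platform>.example.com".
ROLE_CLASSES = {
  "web" => ["roles::web"],
  "app" => ["roles::application"]
}.freeze

# Build the hash that Puppet expects back from a classifier.
def classify(node_name)
  host, platform = node_name.split(".")
  role = host.to_s[/\A[a-z]+/] # "web01" -> "web"
  {
    "classes"    => ROLE_CLASSES.fetch(role, []),
    "parameters" => { "platform" => platform.to_s }
  }
end

# Only print when run as a script with a node name.
puts classify(ARGV[0]).to_yaml if __FILE__ == $PROGRAM_NAME && ARGV[0]
```

Unknown roles simply get an empty class list in this sketch. A real classifier would more likely look the node up in a database or inventory.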

Read blogs, get involved
The puppet community is very friendly and helpful. A lot of the best puppet ideas have come from blogs by people talking about how they configured puppet to achieve their goal. Chances are that your organization will find familiarity with some of these patterns. Don’t spend time designing your own pattern when there is already one available that solves for your system. There is a caveat to this though: don’t use someone else’s pattern unless you know it will suit you. Restructuring puppet manifests is slow and you don’t want to be there if you don’t have to be. Find puppet users on twitter, sign up for the puppet mailing list, join #puppet on irc, read blogs.

Separate data from code
Puppet has data, like the IP address of your load balancer for example. You could include data straight in your manifests, but this would be a mistake. Separate your code from your data. Currently, extlookup is a good solution for this, but I believe puppet is looking at restructuring the way code and data are separated, so keep a lookout for the best practice on this.

Document your manifests
Puppet has built in documentation generation. Use it. Six months down the line when someone is trying to figure out where a manifest is used, or what the class parameters do, it sucks having to read the class and follow custom definitions or types, work out what variables are used or even where they come from. Document your manifests with who wrote the class, parameters used, summary, externally sourced parameters, and so on.

Missed anything, let me know… @nuknad on twitter, or comment

01
Mar
11

Puppetdoc css alternative, little easier on the eyes

I wasn’t very keen on the standard skin for the puppet rdocs, and so updated a few colors and padding to make it a little more readable. You just need to replace the rdoc-style.css file in the rdoc generated folder, or serve it separately using apache. I haven’t tested it on many browsers, but I’m sure it will look fine as it’s fairly standards-compliant css.

Pastie’s of the new css or the patch


Edit: I have updated this slightly – 2nd March
Pastie’s of the new css or the patch

/* Reset */
html,body,div,span,applet,object,iframe,h1,h2,h3,h4,h5,h6,p,blockquote,pre,a,abbr,acronym,address,big,cite,code,del,dfn,em,font,img,ins,kbd,q,s,samp,small,strike,strong,sub,sup,tt,var,dl,dt,dd,ol,ul,li,fieldset,form,label,legend,table,caption,tbody,tfoot,thead,tr,th,td{margin:0;padding:0;border:0;outline:0;font-weight:inherit;font-style:inherit;font-size:100%;font-family:inherit;vertical-align:baseline;}
:focus{outline:0;}
body{line-height:1;color:#282828;background:#fff;}
ol,ul{list-style:none;}
table{border-collapse:separate;border-spacing:0;}
caption,th,td{text-align:left;font-weight:normal;}
blockquote:before,blockquote:after,q:before,q:after{content:"";}
blockquote,q{quotes:"" "";}

body {
	font-family: Verdana,Arial,Helvetica,sans-serif;
	font-size:12px;
}

pre {
	background: none repeat scroll 0 0 #F7F7F7;
	border: 1px dashed #DDDDDD;
	color: #555555;
	font-family: courier;
	margin: 10px 19px;
	padding: 10px;
}

h1,h2,h3,h4 { margin: 0; color: #efefef; background: transparent; }
h1 { font-size: 1.2em; }
h2,h3,h4 { margin-top: 1em; color:#558;}

a { color: #037; text-decoration: none; }
a:hover { color: #04d; }

/* Override the base stylesheet's Anchor inside a table cell */
td > a {
  background: transparent;
  color: #039;
  text-decoration: none;
}

/* and inside a section title */
.section-title > a {
  background: transparent;
  color: #eee;
  text-decoration: none;
}

/* === Structural elements =================================== */

/* next two lines a bit of a hack because there is nothing
   to identify this first link which isn't part of index-entries
   could use child selectors, but not available on all browsers */

div#index a {
	display:inline-block;
	padding:10px 10px;
}

div#index div#index-entries a {
	padding:2px 10px;
}


div#index .section-bar {
	background: #ffe;
	padding:10px;
}

div.name-list { line-height: 1.4em; }

div#classHeader, div#fileHeader, div#nodeHeader {
	border-bottom: 1px solid #ddd;
	padding:10px;
	font-size:0.9em;
}

div#classHeader a, div#fileHeader a, div#nodeHeader a{
	background: inherit;
	color: white;
}

div#classHeader td, div#fileHeader td, div#nodeHeader td {
	color: white;
	padding:3px;
	font-size:0.8em;
}


div#fileHeader {
	background: #057;
}

div#classHeader {
	background: #048;
}

div#nodeHeader {
	background: #082;
}

.class-name-in-header {
  font-weight: bold;
}


div#bodyContent, div#section {
	padding: 10px;
}

div#index-entries { padding:10px 0; }

div#description {
	padding: 10px;
	background: #f5f5f5;
	border: 1px dotted #ddd;
	line-height:1.2em;
}

div#description h1,h2,h3,h4,h5,h6 {
	color: #125;
	background: transparent;
}

div#validator-badges {
	text-align: center;
}
div#validator-badges img { border: 0; }

div#copyright {
	color: #333;
	background: #efefef;
	font: 0.75em sans-serif;
	margin-top: 5em;
	margin-bottom: 0;
	padding: 0.5em 2em;
}


/* === Classes =================================== */

table.header-table {
	color: white;
	font-size: small;
}

.type-note {
	font-size: small;
	color: #DEDEDE;
}

.xxsection-bar {
	background: #eee;
	color: #333;
	padding: 3px;
}

.section-bar {
	color: #333;
	border-bottom: 1px solid #ddd;
	padding:10px 0;
	margin-bottom: 3px;
}

div#class-list, div#methods { padding:10px; line-height: 1.4em; }
div#section div#class-list, div#section div#methods { padding:10px 0; line-height: 1.4em; }

.section-title {
	background: #79a;
	color: #eee;
	padding: 3px;
	margin-top: 2em;
	border: 1px solid #999;
}

.top-aligned-row {  vertical-align: top }
.bottom-aligned-row { vertical-align: bottom }

/* --- Context section classes ----------------------- */

.context-row { }
.context-item-name { font-family: monospace; font-weight: bold; color: black; }
.context-item-value { font-size: small; color: #448; }
.context-item-desc { color: #333; padding-left: 2em; }

/* --- Method classes -------------------------- */
.method-detail {
	background: #f5f5f5;
}
.method-heading {
	color: #333;
	font-style:italic;
	background: #ddd;
	padding:5px 10px;
}
.method-signature { color: black; background: inherit; }
.method-name { font-weight: bold; }
.method-args { font-style: italic; }
.method-description { padding: 10px 10px 20px 10px; }

/* --- Source code sections -------------------- */

a.source-toggle { font-size: 90%; }
div.method-source-code {
	background: #262626;
	color: #ffdead;
	margin: 1em;
	padding: 0.5em;
	border: 1px dashed #999;
	overflow: hidden;
}

div.method-source-code pre { color: #ffdead; overflow: hidden; }

/* --- Ruby keyword styles --------------------- */

.standalone-code { background: #221111; color: #ffdead; overflow: hidden; }

.ruby-constant  { color: #7fffd4; background: transparent; }
.ruby-keyword { color: #00ffff; background: transparent; }
.ruby-ivar	{ color: #eedd82; background: transparent; }
.ruby-operator  { color: #00ffee; background: transparent; }
.ruby-identifier { color: #ffdead; background: transparent; }
.ruby-node	{ color: #ffa07a; background: transparent; }
.ruby-comment { color: #b22222; font-weight: bold; background: transparent; }
.ruby-regexp  { color: #ffa07a; background: transparent; }
.ruby-value	{ color: #7fffd4; background: transparent; }

11
Feb
11

Self-Classifying Puppet Nodes

Puppet has a very cool node classification system which pretty much lets you do what you want (by writing your own one) if the default classifier doesn’t work for you. There are already a couple of good posts around this, and it’s worth reading some of the following: Jordan Sissel, Gary Larizza, as well as the official docs on external node classifiers.

So, from the above posts, I’m going to take a few of the ideas, mix them up, and go through the steps to reproduce on your own system. The goal of the end configuration is to have a node come online, identify itself using its Role, Platform, and Environment, and then issue it the relevant classes. What’s important here is that nodes must be classified, before they reach puppet, into Roles and Platforms (as well as Environment, but this is already handled by puppet). Dividing nodes by their Platform/Role gives us the simplicity needed when you’re managing a large number of machines across different clusters. It’s easier to group machines than it is to individually assign classes to each node. Of course, not all your puppetized nodes need to belong to a group, as they might just be one machine performing a specific action. In cases like this, we must be able to add exceptions easily.

For the purpose of this post, let’s assume we have 2 clusters in Europe and USA, and each cluster has several Application and Web Servers. I’m also assuming you’re following the recommended puppet-mcollective-facter setup, because it works well.

From a high level overview, we want to write a facter plugin for mcollective which will read a facts file on the host. This facts file will contain the Role and Platform information that can be used from Puppet. We then need an mcollective agent so we can update this file if we need to at a later stage. Finally, we look at how to create a node classification system that can use these facts to hand out the right manifest.

Identification

Facter is the game, and we need a new fact. Facter gives puppet access to information about a host at run-time, like what country a host is in or what distribution of linux it’s running. We’re going to add 2 new facts, and for the sake of best practices, we’ll make it extensible. I don’t like polluting the existing facter namespace with odd names, and so I’m going to prefix all facts with a name (use your company name or whatever you want).

The following is a facter plugin that will parse the file /etc/company.facts and append its entries to the existing facts.


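require 'facter'

# Read "Key = Value" lines from the facts file and register each one
# as a fact named company_<key>.
if File.exist?("/etc/company.facts")
    File.readlines("/etc/company.facts").each do |line|
        if line =~ /^(.+)=(.+)$/
            var = "company_" + $1.strip
            val = $2.strip

            # val is captured by the block, so each fact keeps its own value
            Facter.add(var) do
                setcode { val }
            end
        end
    end
end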

Given the following facts file /etc/company.facts:

Role = Web
Platform = USA

We will get the following from facter

...
company_role = Web
company_platform = USA
...

These variables are now available straight away in your puppet manifests.
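
If you want to check what a facts file will turn into without installing the plugin, the parsing can be tried on its own. This is a stand-alone sketch using the same regex and prefix as the plugin. The function name parse_company_facts is made up, and the downcase is only there to mirror the lower-case names in the facter output above.

```ruby
# Turn "Key = Value" lines into a hash of prefixed fact names.
# Lines without an "=" are ignored, just as in the plugin.
def parse_company_facts(text, prefix = "company_")
  text.each_line.each_with_object({}) do |line, facts|
    next unless line =~ /^(.+)=(.+)$/

    facts[prefix + $1.strip.downcase] = $2.strip
  end
end

sample = "Role = Web\nPlatform = USA\nthis line has no key/value pair\n"
p parse_company_facts(sample)
```

Note that the regex is greedy, so a line such as a=b=c is split at the last equals sign.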

Updating the Facts
Before I continue on using these facts in puppet, it’s important to have a way to update the facts. Equally important is that you build the facts into your server deploy process. So, we have a script that installs mcollective and puppet when we commission a new server, and one of the first things it does is create this file and automatically populate the Role and Platform based on the commissioning parameters.

Apart from server deploy-time, we can write a small mcollective RPC agent which will get/set/delete values from our facts file. The file has a simple key-value structure and so the following should do the job

module MCollective
	module Agent
		class Companyfact<RPC::Agent
			metadata	:name		=> "Company Fact Agent",
					:description	=> "Key/values in a text file",
					:author		=> "Puppet Master Guy",
					:license	=> "GPL",
					:version	=> "Version 1",
					:url		=> "www.company.com",
					:timeout	=> 10
			
			companyfile = "/etc/company.facts"
	
			def parse_facts(fname)
				begin
					if File.exist?(fname)
						kv_map = {}
						File.readlines(fname).each do |line|
							if line =~ /^(.+)=(.+)$/	
								key = $1.strip
								val = $2.strip
								kv_map[key] = val
							end						 
						end					 
						return kv_map
					else
						f = File.open(fname,'w')
						f.close
						return {}
					end 			
				rescue
					logger.warn("Could not access company facts file. There was an error in companyfacts.rb:parse_facts")
					return {}
				end
			end

			def write_facts(fname, facts)

				unless File.exist?(File.dirname(fname))
					Dir.mkdir(File.dirname(fname))
				end

				begin
					f = File.open(fname,"w+")
					facts.each do |k,v|
						f.puts("#{k} = #{v}")
					end
					f.close
					return true
				rescue
					return false
				end
			end

			action "get" do
				validate :key, String
				
				kv_map = parse_facts(companyfile)
				if kv_map[request[:key]] != nil
					reply[:value] = kv_map[request[:key]]
				end
			end

			action "put" do
				validate :key, String
				validate :value, String

				kv_map = parse_facts(companyfile)
				kv_map.update({request[:key] => request[:value]})

				if write_facts(companyfile,kv_map)
					reply[:msg] = "Settings Updated!"
				else
					reply.fail!  "Could not write file!"
				end

			end
			action "delete" do
				validate :key, String

				kv_map = parse_facts(companyfile)	
				kv_map.delete(request[:key])

				if write_facts(companyfile,kv_map)
					reply[:msg] = "Setting deleted!"
				else
					reply.fail!  "Could not write file!"
				end

			end
		end
	end
end
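
The put and delete actions are really just read, modify and write on a small key/value file. Here is a stand-alone sketch of that round trip that runs without MCollective. The helper names read_facts and write_facts are simplified stand-ins for the agent's methods, and a temporary file stands in for /etc/company.facts.

```ruby
require "tempfile"

# Parse "key = value" lines into a hash; a missing file gives an empty hash.
def read_facts(path)
  return {} unless File.exist?(path)

  File.readlines(path).each_with_object({}) do |line, facts|
    facts[$1.strip] = $2.strip if line =~ /^(.+)=(.+)$/
  end
end

# Rewrite the whole file from the hash, one "key = value" per line.
def write_facts(path, facts)
  File.open(path, "w") { |f| facts.each { |k, v| f.puts("#{k} = #{v}") } }
end

Tempfile.create("company.facts") do |f|
  write_facts(f.path, read_facts(f.path).merge("Role" => "Web"))     # like "put"
  write_facts(f.path, read_facts(f.path).merge("Platform" => "USA")) # like "put"

  facts = read_facts(f.path)
  facts.delete("Role")                                               # like "delete"
  write_facts(f.path, facts)

  puts File.read(f.path)
end
```

Because the whole file is rewritten on every change, two updates arriving at the same moment could overwrite each other. For a file that changes as rarely as this one, I would not expect that to matter.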

We also need the ddl:

metadata        :name           => "Company Fact Agent",
		:description    => "Key/values in a text file",
		:author         => "Puppet Master Guy",
		:license        => "GPL",
		:version        => "Version 1",
		:url            => "www.company.com",
		:timeout        => 10

action "get",	:description => "fetches a value from a file" do
	display :failed

	input :key,
		:prompt		=> "Key",
		:description	=> "Key you want from the file",
		:type		=> :string,
		:validation	=> '^[a-zA-Z0-9_]+$',
		:optional	=> false,
		:maxlength	=> 90
	
	output :value,
		:description	=> "Value",
		:display_as	=> "Value" 
end

action "put", :description => "Value to add to file" do
	display :failed

	input :key,
		:prompt		=> "Key",
		:description	=> "Key you want to set in the file",
		:type 		=> :string,
		:validation	=> '^[a-zA-Z0-9_]+$',
		:optional	=> false,
		:maxlength	=> 90

	input :value,
                :prompt         => "Value",
                :description    => "Value you want to set in the file",
                :type           => :string,
                :validation     => '^[a-zA-Z0-9_]+$',
                :optional       => false,
                :maxlength      => 90

	output :msg,
		:description	=> "Status",
		:display_as	=> "Status"
end

action "delete", :description => "Delete a key/value pair if it exists" do
        display :failed

        input :key,
                :prompt         => "Key",
                :description    => "Key you want to delete from the file",
                :type           => :string,
                :validation     => '^[a-zA-Z0-9_]+$',
                :optional       => false,
                :maxlength      => 90

        output :msg,
                :description    => "Status",
                :display_as     => "Status"
end

For a quick refresh on using your mc-rpc agent, we can set a key using the following:
mc-rpc -v --agent companyfact --action put --argument key=role --argument value=Web

And we can get a key using the following
mc-rpc -v --agent companyfact --action get --argument key=role

And we can delete a key using the following
mc-rpc -v --agent companyfact --action delete --argument key=role

Self-Classifying Nodes
This is where we want to be. A node comes in and says to puppet, I’m a Web machine on platform USA.

The default basic setup is to use a node definition for each node, or plug in some sort of external classifier. I’m going to build on Jordan Sissel’s blog that I mentioned at the start. Essentially, every node goes through the ‘default’ node definition, which then goes to the ‘truth enforcer’. This truth enforcer will look at the facts of the node and hand off the relevant classes accordingly. Note that if you want to add exceptions, just create a node definition for the exception node. Simple.

So the enforcer node is a very basic definition:

node default {
  include truth::enforcer
}

From here, we create a truth enforcer class like so (using our example). Naturally this is just an example of how it might be used:

class truth::enforcer {

        $groupname = "${company_platform}:${company_role}"
        case $groupname {
                "USA:Web" : {
                        include roles::web
                }
        }

        case $company_role {
                "Application" : {
                        include roles::application
                }
        }       
}

That’s pretty much it as far as getting a self-classifying puppet node goes. One more thing that’s worth mentioning is that this also ties in well with Extlookup to manage your parameters. You can use something like the following configuration which I find works well:

$extlookup_precedence = ["fqdn_%{fqdn}", "role_%{company_role}-%{company_platform}", "platform_%{company_platform}", "common"]
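
To show what that precedence list does, here is a simplified model of the lookup in Ruby. It is not extlookup's actual code. It assumes each entry names a data source (a CSV file in the real thing), that the %{...} placeholders are filled in from the node's facts, and that the first source containing the key wins. The expand and lookup helpers and the sample data are made up.

```ruby
PRECEDENCE = [
  "fqdn_%{fqdn}",
  "role_%{company_role}-%{company_platform}",
  "platform_%{company_platform}",
  "common"
].freeze

# Replace each %{name} placeholder with the matching fact value.
def expand(template, facts)
  template.gsub(/%\{(\w+)\}/) { facts.fetch($1, "") }
end

# data maps a source name (the file name without ".csv") to its key/value pairs.
# Sources are tried in precedence order and the first one with the key wins.
def lookup(key, facts, data, precedence = PRECEDENCE)
  precedence.each do |template|
    source = data[expand(template, facts)]
    return source[key] if source && source.key?(key)
  end
  nil
end

facts = { "fqdn" => "web01.example.com", "company_role" => "Web", "company_platform" => "USA" }
data = {
  "common"       => { "ntp_server" => "ntp.example.com", "lb_ip" => "10.0.0.1" },
  "platform_USA" => { "lb_ip" => "10.1.0.1" }
}

puts lookup("lb_ip", facts, data)      # found in the platform source first
puts lookup("ntp_server", facts, data) # falls through to common
```

The nice part of this layout is that a one-off machine can override anything with its own fqdn file while everything else inherits from the role, platform and common levels.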

Comments or questions welcome.



