This past week, Target open-sourced our first project, a Chef cookbook for Apache Cassandra. This cookbook is the exact version we use to manage our production Cassandra environment. I want to go into some more detail about how the cookbook is used and how we automate our Cassandra deployment.

Cookbook Layout

The cookbook we open-sourced is the main cookbook we use to install Cassandra. To manage multiple clusters with this cookbook, I created a wrapper cookbook to use on top of the dse cookbook. A wrapper cookbook is usually used on top of a community cookbook so you don’t have to edit the upstream code. I needed a way to manage our cluster (and its different environments) that was versioned and didn’t change the community cookbook. This meant I couldn’t use roles, which are not versioned (at least at the time of writing; I believe role versioning is being added in Chef 12).

I created a wrapper cookbook specifically for each cluster. Let’s call the cluster we want to manage Cluster1. I created a cookbook called Cluster1-wrapper and then used this cookbook to manage the different environments for Cluster1. Let’s look at an example:

In Cluster1-wrapper, I created a recipe named cassandra.rb. Since we use DataStax Enterprise, I also created one named solr.rb and one named hadoop.rb. These distinguish the Cassandra nodes from the Hadoop and Solr nodes. Each of these recipes contains only a few lines:

include_recipe "Cluster1-wrapper::#{node.chef_environment}"
include_recipe "dse::cassandra"

You’ll notice I include a recipe named for the environment the node is in. This is how I set environment-specific settings for the cluster. So for the development environment of Cluster1, I have a recipe named development.rb.
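One consequence of this pattern is that a node in an environment with no matching recipe will fail when the include is resolved. As a rough plain-Ruby sketch of that lookup (the helper name and the list of environments are assumptions for illustration, not part of the cookbook):

```ruby
# Hypothetical helper, not part of the dse or wrapper cookbooks.
# It builds the "cookbook::recipe" name the wrapper includes and fails
# early, with a readable message, if no recipe exists for the environment.
KNOWN_ENVIRONMENTS = %w[development ci production].freeze # assumed list

def environment_recipe(cookbook, chef_environment, known = KNOWN_ENVIRONMENTS)
  unless known.include?(chef_environment)
    raise ArgumentError,
          "no #{cookbook} recipe for environment '#{chef_environment}'"
  end
  "#{cookbook}::#{chef_environment}"
end

puts environment_recipe("Cluster1-wrapper", "development")
```

In a real recipe the environment name comes from node.chef_environment, as in the include_recipe line above.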

Let’s take a look in development.rb:

node.default["java"]["jdk_version"] = "7"
node.default["cassandra"]["dse_version"] = "4.0.3-1"
node.default["cassandra"]["cluster_name"] = "Cluster1"
node.default["cassandra"]["seeds"] = "192.168.1.1"

You can see all the settings for the development cluster. We also put things like our repair jobs and other environment-specific things in here. One thing to note: you’ll need to pick your seeds and run this recipe on them first, otherwise the other nodes won’t start up since they can’t find the seeds.
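The seed-first ordering can be sketched in plain Ruby. This is only an illustration: the helper name is made up, the host addresses are examples, and I am assuming the seeds attribute may hold a comma-separated list.

```ruby
# Hypothetical helper: order hosts so that seed nodes converge first.
# Assumes the seeds attribute is a comma-separated string of addresses.
def converge_order(hosts, seeds_attribute)
  seeds = seeds_attribute.split(",").map(&:strip)
  seed_hosts, other_hosts = hosts.partition { |host| seeds.include?(host) }
  seed_hosts + other_hosts
end

hosts = ["192.168.1.3", "192.168.1.1", "192.168.1.2"] # example addresses
p converge_order(hosts, "192.168.1.1")
```

Whatever tool kicks off the first Chef run on a new cluster could walk the hosts in this order.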

Now, on our nodes, each node has a simple runlist of recipe[Cluster1-wrapper::cassandra].

Cookbook Workflow

Now that you know how we manage the cluster, let’s talk about how we manage the cookbooks themselves. We keep all of our private cookbooks in our own Git, but do not keep our community cookbooks there. We want community cookbooks to stay separate, and not have to manage them at all.

To make sure we don’t run into conflicts with community cookbooks, we use Berkshelf for our uploads. Each cookbook then has a Berksfile and a Berksfile.lock that are checked in to Git. Let’s take a look at the Berksfile for the Cluster1-wrapper cookbook we talked about earlier.

source "http://api.berkshelf.com"
metadata

The first line in the Berksfile tells Berks where to get cookbooks from. Since dse is now an open-source cookbook on the Supermarket, Berks knows how to get it, and we have a simple Berksfile. If there were other cookbooks you included in the wrapper (say a base cookbook to set up user access) then that would need to be specifically added to the Berksfile like this:

cookbook 'users', git: 'git://git.url.com/cookbooks/users.git'

Now we’re ready to make changes and check in code to the Cluster1-wrapper cookbook. We use Jenkins as the orchestrator for all our cookbooks. Any time new code is checked in to the master repo (in this case the Cluster1-wrapper repo), Jenkins sees that change and runs three jobs. Each job is downstream of the job before it in the list. This makes sure a cookbook doesn’t get uploaded if the jobs before it fail.
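The gating itself is configured in Jenkins, not in code, but the behavior is easy to sketch in plain Ruby. Everything here is a stand-in: the method name is hypothetical and the lambdas take the place of the real jobs.

```ruby
# Hypothetical stand-in for the Jenkins chain: each stage runs only if the
# one before it succeeded. The lambdas simulate the real jobs.
def run_pipeline(stages)
  completed = []
  stages.each do |name, step|
    # Stop at the first failing stage; later stages never run.
    return { ok: false, failed: name, completed: completed } unless step.call
    completed << name
  end
  { ok: true, failed: nil, completed: completed }
end

stages = {
  "lint"        => -> { true },  # stands in for the lint job
  "integration" => -> { false }, # stands in for kitchen test (simulated failure)
  "upload"      => -> { true }   # stands in for the berks upload job
}
p run_pipeline(stages)
```

With the simulated integration failure, the result reports "integration" as the failed stage and the upload stage is never called.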

The first Jenkins job to run is called our Lint job. This runs three things:

  • knife cookbook test
  • foodcritic
  • rspec

The second job is called our Integration job. This uses Test Kitchen to test the cookbook on a VM. It runs one command, kitchen test. We have a separate .kitchen.ci.yaml in which we describe the suite Jenkins should tell kitchen to run. Let’s look at an example of our .kitchen.ci.yaml:

---
driver:
  name: vagrant

provisioner:
  name: chef_zero

platforms:
  - name: Cluster1
    driver_config:
      vm_hostname: false
      box: centos-6.5

suites:
  - name: ci
    provisioner:
      client_rb:
        environment: ci
    run_list: "recipe[Cluster1-wrapper::cassandra]"

This has Jenkins run the cookbook in a virtual ci environment to ensure it runs properly.

The last job is our upload job. This only runs if the first two succeed. This job does a berks install and then a berks upload to push the cookbook to Chef server. An important thing to note: when Berks uploads the cookbooks, it freezes them automatically in Chef server. We require every change that goes to Chef server, no matter how small, to have a version bump and a corresponding changelog update.

This causes some confusion: if you forget to change the version number, the Berks upload will still run, but the output will say “skipping Cluster1-wrapper (frozen).” This is something to watch for if you use the Berks commands like we do.
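One way to catch a forgotten bump before the upload job runs is a small check like the one below. This is a sketch, not something from our Jenkins setup: the method names are hypothetical, and the list of already-uploaded versions is passed in as plain strings (it could come from something like knife cookbook show).

```ruby
require "rubygems" # for Gem::Version (loaded by default in most setups)

# Hypothetical pre-upload check: read the version from metadata.rb source.
def metadata_version(metadata_source)
  match = metadata_source.match(/^\s*version\s+["']([^"']+)["']/)
  raise ArgumentError, "no version line found in metadata" unless match
  Gem::Version.new(match[1])
end

# True only if the metadata version is newer than every uploaded version.
def version_bumped?(metadata_source, uploaded_versions)
  current = metadata_version(metadata_source)
  uploaded_versions.all? { |v| Gem::Version.new(v) < current }
end

# Example metadata.rb content; the version numbers are made up.
metadata = <<~METADATA
  name    "Cluster1-wrapper"
  version "1.2.0"
METADATA

puts version_bumped?(metadata, ["1.0.0", "1.1.0"]) # bump present
puts version_bumped?(metadata, ["1.1.0", "1.2.0"]) # bump forgotten
```

A job could fail the build when this returns false, instead of relying on someone spotting the "frozen" message in the output.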

Automating Production Changes

With the above information, I hope it’s clear how we manage our clusters. I will go through some simple changes that we might do on our cluster and how they work.

Settings change (cassandra.yaml)

Let’s say we want to decrease the heap in production. I would make the change to the proper file in the wrapper, update the version, and document that version change in the changelog. I would then make a pull request to the master repo in Git. Once that’s merged, Jenkins will run and push that latest version to Chef. Currently we also have one more step: an environment file needs to be updated to pin this latest version of the wrapper. This is the pull request that would require approvals, since on merge it goes “live” and the nodes will get the new change.

We don’t currently control exactly when Chef runs on each node. We have our Chef clients set up to run on a 30-minute interval with a 5-minute splay. This means a change will slowly trickle out across the cluster within 35 minutes at most. We trust the randomness of this window to keep all the nodes from restarting at once.
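The 35-minute figure is just the interval plus the largest possible splay. A minimal plain-Ruby sketch of that arithmetic, assuming the delay before a run is the interval plus a random amount up to the splay (the constant and method names are mine, not Chef's):

```ruby
# Sketch of the timing arithmetic only; this is not Chef's own code.
INTERVAL_SECONDS = 30 * 60 # 30-minute interval
SPLAY_SECONDS    = 5 * 60  # up to 5 minutes of random extra delay

# Delay before a node's next run: the interval plus a random splay.
def next_run_delay(rng = Random.new)
  INTERVAL_SECONDS + rng.rand(0..SPLAY_SECONDS)
end

rng = Random.new(1) # seeded only so the example is repeatable
delays = Array.new(5) { next_run_delay(rng) }
puts delays.map { |seconds| (seconds / 60.0).round(1) }.inspect
puts "maximum possible: #{(INTERVAL_SECONDS + SPLAY_SECONDS) / 60} minutes"
```

Every delay falls between 30 and 35 minutes, which is why a pinned change reaches the whole cluster within that window.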

Adding a New Node

Adding a new node is pretty straightforward. The only prerequisites are that Chef is installed and the node has the proper settings for environment, etc. Then we just add Cluster1-wrapper::cassandra to the runlist and let it go. It will get the proper seed list and join the cluster. Since we use the GossipingPropertyFileSnitch setting, none of the other nodes need to restart when a new node is added.

Future Ideas

We do have some ideas for future improvements we’d like to make. One is removing the randomness of the Chef runs by using something like Chef Push Jobs to spread the convergence out in a managed fashion.