Project Purpose

Continuous integration for a large enterprise can become convoluted very quickly as membership grows. We have hundreds of teams across the organization, each with slightly different needs for their build processes, and we needed to strike a delicate balance between ease of use and the autonomy given to teams that manage their own build environments.

Our purpose was to find a better way to deliver and manage Jenkins for the enterprise.

Where were we in 2016?

  • Four large shared monolithic masters with tons of admins who had too much power, including the ability to alter anyone’s configurations.
  • More than 20,000 configured jobs (active and inactive) across all masters, and growing.
  • Several shared build agents (formerly called slaves) across these environments that had every tool under the sun installed and configured on them. This led to configuration drift and collisions between different teams’ needs on the agents, and we were constantly breaking things across the environment. Our team had to manage all of the tools installed on these shared build agents and configure them for anyone who happened to run a job on one.
  • Lack of accountability for maintaining build environments meant that few people understood the build environment and all of the dependencies required to run specific jobs.
  • Due to the sheer size of the masters, the hundreds of plugins installed and configured, and the number of build agents connected, upgrading the operating system and Jenkins at one time required an outage of several hours.
  • Teams were storing credentials in an exposed manner on the shared masters.
  • Shared environments meant that one team doing something wrong could create a poor experience for all users if it resulted in a service interruption.

Where are we now?

  • We shut down the entire shared Jenkins environment and have been running all production masters inside of a Docker Swarm cluster for more than 6 months.
  • Onboarded more than 150 teams onto their own masters, empowering teams to create their own build agents.
  • Heavily utilizing the Kubernetes plugin along with our custom build agent images for quick builds that benefit from a consistent build environment.
  • Master upgrades take seconds (as long as it takes for a Docker image to be pulled and run).
  • The entire infrastructure is deployed on top of OpenStack and codified using HashiCorp Terraform and Chef. Redeploying the whole cluster takes under an hour.
  • Persistent NAS storage for the cluster where each team is given a qtree with a specific space allocation.
  • Monitoring and logging were included in the design of the architecture from the beginning, with the aim of empowering and informing teams about the internal workings of their masters and builds without needing direct access to the individual container their master runs in.
  • For metrics, we use the Telegraf Docker plugin running on our Docker Swarm cluster members in tandem with the Graphite metrics plugin to ship metrics to a Graphite host.
  • For logging, we use the Docker gelf logging driver at Docker service creation time to ship logs to an Elasticsearch cluster (see the sketch after this list).
  • There is now a very limited ‘blast radius’ when things go wrong or if something is misconfigured. If somebody changes an important setting or installs a plugin that causes an unexpected crash or outage, it only negatively impacts that team’s master and not the entire environment.
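
As a rough sketch of what that logging setup looks like at service-creation time (the gelf endpoint, service name, and tag below are placeholders, not our actual configuration):

    # Hypothetical example: ship a master's logs via the gelf driver when the
    # service is created. Replace the address with your log ingest endpoint.
    docker service create \
      --name team-a-jenkins \
      --replicas 1 \
      --log-driver gelf \
      --log-opt gelf-address=udp://logstash.example.com:12201 \
      --log-opt tag=team-a-jenkins \
      target/jenkins-docker-master:2.107.1-1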

How did we get here?

We deploy each individual team’s Jenkins master as a single Docker service running inside of a Docker Swarm cluster. We create Docker services, and Docker Swarm schedules these services to run on a Swarm worker node. We also set up an NGINX router container inside the cluster, which allows users to reach each Jenkins master at its own DNS address.
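
For illustration, a single team’s master boils down to something like the following service definition. Our gelvedere CLI (described below) handles this for us, so the service name, mount path, home directory, and published agent port here are hypothetical:

    # Sketch of one team's master as a single-replica Swarm service on the
    # shared overlay network, with its home directory bind-mounted from NAS.
    docker service create \
      --name team-a \
      --replicas 1 \
      --network jenkins \
      --mount type=bind,source=/jenkins/team-a,target=/var/jenkins_home \
      --publish 50123:50123 \
      target/jenkins-docker-master:2.107.1-1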

Why Docker Swarm?

The original plan was to run all masters in Target’s Kubernetes cluster, but that cluster was built for deploying apps in an opinionated way that was not quite compatible with running persistent Jenkins instances. Docker Swarm was our second choice and ultimately the best option, due to the ease of setup and management from an admin perspective. Docker Swarm has proved to be a fairly stable platform for container orchestration. However, we have experienced some issues, including:

  • Intermittent service discovery failures
  • Unexplained network blips

We were able to resolve the majority of these issues by recreating the Docker network or restarting the Docker daemon on the affected hosts. Even with these issues, this setup results in comparatively fewer outages for users and is much easier to manage than our previous environment.
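
For reference, that remediation amounted to commands along these lines, run on the affected hosts (the subnet is a placeholder, and an overlay network can only be removed once no services are attached to it):

    # Restart the Docker daemon on an affected host (systemd assumed)
    sudo systemctl restart docker

    # Or recreate the overlay network after detaching/removing the services that use it
    docker network rm jenkins
    docker network create --driver overlay --subnet 10.0.9.0/24 jenkins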

Why use Single Replicas?

Docker Swarm allows multiple replicas of a service to run at the same time, but Jenkins is a stateful application, so only a single instance of a master may run at a time. This reduces the HA capabilities of a single master, but Docker Swarm can reschedule a service onto a different host, so you don’t need to worry as much about hardware failure inside your cluster. As long as you have enough hosts with adequate resources in your swarm cluster, containers that fail on one node will be scheduled on a different node in a healthy state.
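
You can watch this rescheduling happen through a service’s task history; for example, for a hypothetical master named test:

    # Lists the service's tasks, including failed tasks and the replacement
    # task Swarm scheduled on a healthy node.
    docker service ps test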

Containerized VMs?

  • It seems like you are just creating a bunch of containerized VMs inside of Docker Swarm. Isn’t that against cloud best practices?

Not necessarily; we are just creating stateful containers that run a single process. This architecture allows for easier management, tolerance of physical hardware loss, and faster restart times compared to a typical installation of Jenkins on top of a VM.

What are the Growing Pains of Teams Running their Own Master?

  • Master Configuration/Plugins
    • This was probably the biggest pain point in migrating to this new architecture. Our legacy enterprise instances did not allow standard users to configure top-level Jenkins settings or install plugins, so very few of our users had the knowledge or expertise to do this themselves.
    • We reduced this pain by doing most of the basic configuration up front for teams, mainly by running Groovy scripts against the Jenkins API during the creation process of each master. We also install what we consider to be base plugins on each master upon creation. Any plugin management beyond that is handled by the team itself.
  • Build Agent Management
    • In the same vein as the master configuration, many users were not well versed in creating/maintaining their own build agents, so we tried to make the process of running basic jobs as easy as possible.
    • We have created several Docker images for use by the Kubernetes plugin to run the basic jobs that any standard user would need. These images are based on very lightweight Linux images, with only the tools necessary to run very basic jobs installed (JDK, git, ChefDK, etc.). We use another Groovy script to configure a couple of these base images by default on each master.
  • Access Control
    • By default, we create each master to grant access to a GitHub Enterprise team within a specific organization of the requestor’s choice. Users can specify two groups: one for overall admin access to the master and one with general job management permissions. Team admins are given the leeway to manage the permissions on the master however they like after that.
  • Job Management
    • We encourage users to define their long-lived jobs entirely in code, using Jenkins Pipeline or Jenkins Job DSL. This was a major transition for users who had previously only used the freestyle option to create jobs. Large, unwieldy freestyle jobs are much more difficult to migrate to a new Jenkins instance, so we opted not to migrate jobs from our four Enterprise environments to the new masters in order to discourage this behavior.

Architecture Diagram

Jenkins Infrastructure Docker Images

To replicate the setup that we run in production, we have released several Docker images that we use in house to make this all work. Instructions are included in the README of each repo on how to use the image independently. There is a step-by-step process outlined below on how to deploy everything together.

Master Deployment CLI (gelvedere)

This CLI will be used after the infrastructure and network components are stood up to create a Jenkins master using JSON config files.

  • Jenkins Master Deployment CLI called gelvedere
  • We initially took inspiration for our implementation from a post on the eBay tech blog called Delivering eBay’s CI Solution with Apache Mesos. We used some different tools to get the job done, but share the commonality of running Jenkins masters inside of containers.
  • We also really liked the several helpful posts by the Riot Games Engineering team. A collection of their works can be found here. They provide insight on running Jenkins inside of a Docker container.
  • We use a lot of plugins and examples provided by GitHub user carlossg. He makes working with the Kubernetes plugin a breeze for us.

Walkthrough

Prerequisites

  • Basic Docker, Jenkins knowledge
  • Network-connected Virtual Machines (three or more for best results) with Docker Community Edition installed.
  • Wildcard DNS & Certs setup for *.jenkins.<YOUR_DOMAIN>.com
  • Mountable Storage Volume
  • gelvedere CLI binary installed, see instructions here.
  1. Create a Docker Swarm Cluster

    There is a good walkthrough of how to set up a three-node swarm cluster here.
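
    If you’d rather not follow the linked walkthrough, a minimal three-node cluster boils down to something like the following (the manager IP is a placeholder):

     # On the node that will act as the Swarm manager
     docker swarm init --advertise-addr 10.0.0.10

     # The init command prints a join token; run the printed join command on each worker, e.g.
     docker swarm join --token <WORKER_JOIN_TOKEN> 10.0.0.10:2377

     # Back on the manager, verify that all nodes have joined
     docker node ls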

  2. Mount the persistent NFS to all nodes across your swarm cluster

    We mount on /jenkins across all of our cluster members, so we’ll use that as the example mount path for the rest of this walkthrough. This volume will hold all of the JSON configuration files used in the creation of masters, along with a separate directory for each master deployed to your cluster. This is where Jenkins places all of the persistent files it generates at startup and whenever configurations change inside the app.
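
    As a rough example of that mount on each cluster member (assuming an NFS export; your NAS host and export path will differ):

     # Example only: replace nas.example.com:/export/jenkins with your NFS export
     sudo mkdir -p /jenkins
     sudo mount -t nfs nas.example.com:/export/jenkins /jenkins

     # Optionally persist the mount across reboots
     echo 'nas.example.com:/export/jenkins /jenkins nfs defaults 0 0' | sudo tee -a /etc/fstab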

  3. Create a Docker Swarm overlay network called jenkins

     docker network create --driver overlay --subnet <Subnet in CIDR format> jenkins
    
  4. Create Docker secrets that contains SSL certificates

     docker secret create cert.crt <path to certificate>
     docker secret create cert.key <path to key>
    
  5. Stand up your router container inside the cluster.

    Create a Docker Stack using this file. Before you run the command below, ensure you replace the values inside the Docker stack file with the domains where you want to reach your masters. If your DNS/certs are registered for *.jenkins.acme.com, your Docker stack values will look like:

     environment:
       - DOMAIN=acme.com
       - SUB_DOMAIN=jenkins
    

    After you replace those values in your stack file, run:

     docker stack deploy -c docker-compose.yml service
    
  6. Volume Management

    At Target we use qtrees under a large allocated storage volume that holds all of the files Jenkins creates to run. If you want to use the gelvedere deployment CLI to create masters mounted to a volume on the filesystem of your VM, just make sure you create a separate directory under /jenkins/ for each master you create. Ensure those directories are allocated enough storage to function; we use 2GB as a default starting allocation for each master. In the examples below, we’ll create a master called test.

     mkdir -p /jenkins/test
    
  7. Create two directories also under /jenkins/

     mkdir /jenkins/user-configs
     mkdir /jenkins/admin-configs
    
  8. Create two JSON files with the following contents.

    /jenkins/user-configs/test.json:

     {
       "name": "test",
       "admins": "githubOrg*githubTeam",
       "members": ""
     }
    

    /jenkins/admin-configs/test.json:

     {
       "ghe_key": "OAUTH_CLIENT_ID",
       "ghe_secret": "OAUTH_CLIENT_SECRET",
       "port": "RANDOM_NUM_BETWEEN_50000-60000",
       "admin_ssh_pubkey": "ADMIN_SSH_PUBLIC_KEY",
       "size": "small",
       "image": "target/jenkins-docker-master:2.107.1-1"
     }
    
  9. Use the gelvedere CLI to deploy the master

    The following command is an example; make sure to replace --subdomain jenkins and --domain acme.com with the subdomain and domain you specified in your Docker stack file when creating your router above. If not defined, the subdomain defaults to jenkins:

    gelvedere --user-config /jenkins/user-configs/test.json --admin-config /jenkins/admin-configs/test.json --mount-path /jenkins/test --subdomain jenkins --domain acme.com
    
  10. Ensure everything was started successfully

     $ docker service ls
    
     ID                  NAME                MODE                REPLICAS            IMAGE                                               PORTS
     0um8mp2uejda        service_router      replicated          1/1                 target/jenkins-docker-nginx:latest        *:80->80/tcp,*:443->443/tcp
     kbz60bo74hh3        test                replicated          1/1                 target/jenkins-docker-master:2.107.1-1     *:RANDOM_PORT->RANDOM_PORT/tcp
    
  11. That’s it! You have a working Docker Swarm cluster running a single Jenkins master. You should be able to visit your master at: https://test.<YOUR_SUBDOMAIN>.<YOUR_DOMAIN>.com

Visit the repo for the master image for instructions on modifying it to your liking and for advanced usage.
