Target Technology Services (TTS) Operations has a massive responsibility to our guests and team members to enable great experiences when they use Target technology. Automation drives faster incident recovery times, lowers resolution costs, and frees engineers to spend more time on root cause analysis with their solutions-level partners, driving even further reliability. The Guest Reliability Engineering Automated Service Engine (GREASE) is Target’s implementation of automation in operations. We’ve already saved 700+ hours of our store team members’ time, and 10,000+ hours of IT operations time!

The Inspiration

Multiple groups within Target have been attempting to build an automation solution for their specific systems, with varying levels of success. I am still fairly new to Target, but on my second day, I was introduced to a tool built by our Point Of Sale Guest Reliability Engineering (GRE) support group for automating some recoveries of common issues in the POS space. After meeting the developer, Grant Gordon, I found he had a vision of a general purpose automation solution to be used across the organization. I tasked myself with building that platform.

Building GREASE

We began spending many hours after the regular work day writing the initial concept of GREASE. Soon GRE held its quarterly hackathon, allowing us to expand our team, identify features not found in other tools, and begin testing GREASE’s functionality. We came away from the hackathon with our first automated recoveries and our newly learned requirements:

  1. General Purpose & Extensible: The pitfall of most earlier internal automation solutions was the tight coupling of the automation engine to a specific team’s domain. GREASE needed to be different in order to solve the problem of standardizing our operational workloads.
  2. Distribute the Compute: Target operates at truly enterprise scale, and we needed an automation solution that could handle those workloads. Additionally, we operate in many different environments, from stores to data centers, which meant that any system we wrote needed to exist in multiple network segments.
  3. Write it Open First: A key takeaway from that hackathon was the power of cross-team collaboration. We needed to leverage the experience and tooling other teams developed, so we made development visible to the organization very early.

Coming back from the hackathon, it was time to start turning this side project into an enterprise system. We placed our initial workload on the engine back in July 2017. We found great success and began rapidly expanding our automation engine’s workload. On September 22, 2017, we released GREASE into the open source landscape. We continued development and created a volunteer-based inner-source team.

Work began on Version II to create an even better experience for the automation engineer and remove some dependencies. We eventually found how to remove PostgreSQL from our backing services and finally guaranteed Python 3 compatibility. After release, we began migrating workloads to version two. Currently, we run workloads across the Target retail environment from stores, to our distribution network, all the way back to HQ here in Minneapolis.

How GREASE Works

GREASE is built with Python, using MongoDB as a backing service, with support for Windows 7+, macOS, and Linux (CentOS 6+ or similar RPM-based distros, Ubuntu 12.10+ or similar DEB-based distros). There are four types of “roles” for a node (bare metal machine, VM, or Docker/K8s) in a GREASE cluster:

  1. Sourcing Server: Reading information from your environments
    • Sourcing is the process in GREASE of sampling systems from your environment (think of help desk tickets, SQL databases, HTTP endpoints) or being pushed information (think of Kafka Streams).
    • Once data has been received in a sourcing server, this data is de-duplicated and stored in MongoDB.
    • A detection node will be selected (via round robin) and scheduled to detect conditions in the source data.
  2. Detection Server: Inspecting Information from your environments
    • Data is “picked up” from the sourcing system and then parsed for conditions defined in a special .config.json file.
    • This “query language” allows for operations to focus on root cause analysis work by identifying the conditions of an incident once, then letting computers handle the repeatable bits while engineers focus on prevention and patches.
    • The parsed data is then assigned to a scheduling server if conditions are met to run a job.
  3. Scheduling Server: Determining where jobs should run in your environment
    • Once data has been sourced and detected, scheduling takes over figuring out where to run your jobs and what nodes to run them on.
    • The nodes selected must be in the configured “execution environment,” meaning the server must be registered in that environment. This enables GREASE to operate in multiple network segments, since these execution servers simply query a central database for their execution schedules.
  4. Execution Server: Executing your jobs to affect production
    • Technically, every node in a GREASE cluster is an execution server, because they all run jobs; sourcing, detection, and scheduling are themselves just scheduled/persistent jobs.
    • This is where execution actually occurs, calling subclasses of the base GREASE command class.
    • You can run normal python code in your class, call Powershell/Shell scripts, talk to databases, communicate to REST endpoints - it’s up to you!
    • Once completed, execution statistics are reported back to the database.
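The sourcing and scheduling behavior described above can be sketched in a few lines of Python. This is an illustrative example only, not the actual GREASE API: the hash-based de-duplication and the in-memory round robin stand in for what GREASE does with MongoDB and its node registry.

```python
import hashlib
import json
from itertools import cycle

def dedupe(records, seen):
    """Drop records whose content hash has already been seen.
    `seen` is a set standing in for the de-duplication collection in MongoDB."""
    fresh = []
    for record in records:
        digest = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            fresh.append(record)
    return fresh

# Round-robin selection of a detection node, as described in step 1
detection_nodes = cycle(["detect-01", "detect-02", "detect-03"])

seen = set()
batch = [{"ticket": 1, "state": "open"}, {"ticket": 1, "state": "open"}]
unique = dedupe(batch, seen)                                   # duplicate dropped
assignments = [(rec, next(detection_nodes)) for rec in unique]
```

De-duplicating on a content hash rather than a record ID means the same incident re-sourced twice never schedules two recoveries, which is exactly the guarantee you want before automation touches production.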
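Detection and execution can be sketched the same way. The configuration keys (`conditions`, `job`) and the `Command` base class below are hypothetical stand-ins for GREASE’s `.config.json` format and its real command base class; they only show the shape of the flow: identify the conditions of an incident once, then let code handle the repeatable bits.

```python
# Hypothetical condition config; real GREASE .config.json fields differ.
CONFIG = {
    "source": "ticket_queue",
    "conditions": {"state": "open", "service": "pos"},
    "job": "RestartRegister",
}

def conditions_met(record, config):
    """True when every configured key/value pair matches the sourced record."""
    return all(record.get(k) == v for k, v in config["conditions"].items())

class Command:
    """Stand-in for a base command class; not GREASE's actual class."""
    def execute(self, context):
        raise NotImplementedError

class RestartRegister(Command):
    """An execution job: normal Python, shell calls, REST calls - up to you."""
    def execute(self, context):
        # Real recovery logic would go here; we just report execution stats.
        return {"success": True, "ticket": context["ticket"]}

record = {"ticket": 42, "state": "open", "service": "pos"}
if conditions_met(record, CONFIG):
    result = RestartRegister().execute(record)
```

The returned stats dictionary mirrors the last bullet above: once a job completes, its results are reported back to the central database.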

GREASE itself is just the engine, though. Internally, we develop an add-on to this engine containing Target-specific recoveries and configurations; that plugin is what we use to automate our efforts. But the core engine, GREASE, will be developed externally from now on, to continue our commitment to the open source way.

Recently, we released version two! We removed our dependency on PostgreSQL, added support for Python 3, and much more. Click here to learn more.

Future Efforts

We are continuing to develop GREASE. I, James Bell, serve as the principal project leader and architect. Make sure to stay up to date with us by following our progress and hitting that star button! If you feel up to helping out, feel free to sign our contribution agreement and git to forking our repo.

To learn more about GREASE or try it out, check out our documentation site here!


About the Author

James Bell is a guest reliability engineer working with Target Technology Services Operations. He is passionate about open source, software architecture and empowering developers of all ages. Keep up with him via GitHub or Twitter.