Target’s ECCO Platform: Achieving Resiliency and High Availability

Dec 1, 2022

Shesadri Parthasarathy

Principal Engineer

Target’s purpose of delivering joy is at the center of our decision making. We strive to offer guests a delightful experience at every interaction, and the checkout process is no different. With millions of guests shopping Target in both physical and digital channels, it’s crucial to provide a similar checkout workflow no matter how and where they choose to shop: a uniform, omnichannel experience.

Enterprise Cart & Checkout (ECCO) is the underlying technology platform that powers our checkout experience across Target.com, the Target mobile app, and nearly 2,000 stores. Target is somewhat unique for a large-scale, brick-and-mortar retailer in that all transactions flow through this same platform which enables quick building and scaling of new features and allows guests to seamlessly switch between shopping channels if they choose.

ECCO is built on a mission-critical, cloud-native microservices architecture that equips us to deliver this experience while ensuring our technology systems remain robust, highly available, and resilient. Here are a few practices that have helped us keep ECCO in top shape and should help other platforms and microservices API with uptime as well.

Health Monitoring

APIs served by microservices offer high scalability and easy service management. However, they also have a lot of interrelated service calls. Monitoring microservices in a scalable way thus becomes important. Health checks are one of the most useful methods to anticipate and resolve application issues before they assume greater consequence. These checks are typically used to verify the health of the application – to determine if the performance is up-to-standard and whether the API call leads to the intended outcome.

Health checks help us understand the availability of any microservice – whether the service is functional and able to take guest requests. It can also provide health information that can be leveraged by the orchestrator’s cluster, so that the cluster can act accordingly. For example, Kubernetes has Liveness and Readiness probes that we can use to address health check URLs.

simple diagram on a light blue background, titled 'health monitoring', showing a column with the words 'aggregator,' 'orchestrator,' and 'tenant' on the left side, and a column with 'cart service' and 'order service' on the right, both joined in the middle with the word 'healthchecks'

Having said that, a certain degree of unpredictability needs to be accounted for and we should build protocols that can deal with unexpected situations such as network or container challenges. With periodic health checks, engineering teams can identify any deviations in performance and build apps that can sustain incoming requests. Identifying potential issues early on can help alert users of possible impact.

Services Observability with Distributed Logging Using Elasticsearch & Kibana

Microservices architecture application logs are generated in containers with complex dependencies. It becomes important for these logs to be stored in centralized locations, especially for analyzing application failures.

Distributed logging is a method used to make these logs available for debugging and analysis. It primarily enables microservices observability, crucial in gaining operational insights into application performance and stability. The most widely used stack is the ELK stack (Elastic, Logstash, and Kibana) that helps teams aggregate and analyze logs stored across systems.

Logging is a critical component in identifying problems in distributed architecture. Centralized logging using ELK makes debugging/logging statements captured by the platform easily accessible and searchable using simple query language coupled with the Kibana interface.

diagram titled 'backend' on a light blue background, with two categories: the left titlted 'cart microservice' shows a rectangle titled 'service' and an arrow pointing to a cylinder titled 'database' with a 'log with id' icon below, and the right diagram titlted 'checkout microservice' with the same 'service' to 'database' diagram and 'log with id', both sides pointing to a central icon titled 'centralized log store'

Metrics & Visualization

Data, measures, and information on API performance and availability help engineers build features that enhance application resilience. These metrics across systems and the applications captured using metric frameworks are published to a centralized store that data points in a time-series database like InfluxDB. InfluxDB helps aggregate data based on application need, be it in a service host group or time granularity of hour or day or more.

Dashboards and alerts equipped with Grafana help us visualize this metric data while also helping us set up alerts based on thresholds appropriate for each service method. These alerts set as conditions within the Grafana dashboard use metrics that are published and get notified when conditions are met. This alerting system is primarily used to inform engineers of any service issues.

Certain critical alerts that are configured include:

  • Inter-data center failover in case of database services
  • Service failure rate beyond set threshold value (for e.g., 4xx alerts if count (service request) is > 10%)
  • Response time alerts (Response time > 1 sec for 5 minute duration)

diagram titled 'metrics and visualisation' on a light blue background with three white boxes titled 'cart microservice,' 'checkout microservice,' and 'centralised measurement store.' the first two boxes have arrows pointing down to the third which sits at the bottom of the diagram. 'cart microservice' and 'checkout microservice' boxes are identical with a rectangle titled 'service' pointing to a cylinder titled 'database as cache.' the 'service' box points to a trio of rectangles in brackets named 'system,' 'business,' and 'application.' The bottom centralised measurement store box has a cylinder called 'time series database' that points to a box called 'Grafana'

Summary

Ensuring high resilience and availability of a platform, periodic health checks to alert teams, distributed logging providing relevant information for debugging and API metrics with alerts at required threshold have been our keys to success. These practices help achieve uptime and availability on our services, translating into a more stable and seamless checkout experience for the millions of Target guests shopping with us each week.