Better Insight, Smarter Usage: How we helped teams better manage costs with a public cloud provider
The information below was also presented at Google Cloud Next ’19. Presentation here.
When Target first started moving workloads to Google Cloud Platform (GCP) several years ago, it wasn’t in a “big bang” kind of way. We migrated applications in small groups. The goal was to get our infrastructure and applications into GCP as fast as possible. As we increased our footprint at Google, application teams wanted to understand how to better control spending. They wanted past, current, and forecasted costs, and we needed to figure out how best to get teams that information. However, Google doesn’t provide an out-of-the-box way to track spend at the most basic compute level, so Target needed a way to group resources together and tie that information back to a single application team, all within a shared environment.
We followed hosting patterns similar to those we had used before and put everything into a single project. We had also established that applications would be named according to an ‘application-component’ combination. We built our platform around the idea that the app-component combo would also provide security around platform “startup” activities like secrets management, key/value pair management, service registration, etc. Our billing data is exported into a GCS bucket and then flows into BigQuery. From there, data feeds into another Google product, Data Studio, which is a visualization tool.
The first thing we did was apply labels using that ‘application-component’ convention. Labels are a very popular way to add more information to your resources, and they end up in your billing exports! But not all of Google’s resources can be labeled. Remember, we started with a single project and just kept moving new applications in. The more apps we had, the more important it became to label everything appropriately.
However, teams weren’t applying labels when they deployed at first, so we created a script that retroactively applied a label based on the instance name. Using filters in Data Studio, we created a handful of dashboards that allowed teams to filter on different ‘application-component’ values. We show the breakdown between projects, the breakdown per component, and a time series of each component, all for a specific application.
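As a rough illustration of the retroactive labeling idea, here is a minimal sketch that derives an ‘application-component’ label value from an instance name. The `<application>-<component>-<suffix>` naming format and the function name `label_from_instance_name` are assumptions made for this example; the real script and naming scheme may have looked quite different. The only GCP-specific fact relied on is that label values are limited to lowercase letters, digits, underscores, and dashes, up to 63 characters.

```python
import re
from typing import Optional

# GCP label values may contain lowercase letters, digits, underscores,
# and dashes, and are limited to 63 characters.
_INVALID_LABEL_CHARS = re.compile(r"[^a-z0-9_-]")


def label_from_instance_name(name: str) -> Optional[str]:
    """Derive an 'application-component' label value from an instance name.

    Assumes a hypothetical naming format of
    '<application>-<component>-<suffix>'. Returns None when the name
    does not fit that shape, so the caller can report it instead of guessing.
    """
    parts = name.lower().split("-")
    if len(parts) < 3 or not parts[0] or not parts[1]:
        return None
    value = _INVALID_LABEL_CHARS.sub("_", f"{parts[0]}-{parts[1]}")
    return value[:63]


if __name__ == "__main__":
    for instance in ["Price-API-x7k2", "standalone"]:
        print(instance, "->", label_from_instance_name(instance))
```

Returning `None` for names that don’t match keeps the script from writing misleading labels; those instances can be listed for manual follow-up instead.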
But the downside to this was how we grouped the data together: everything was a manual filter. For example, we would create a filter and group all the application-components that we knew belonged to the Price application. This worked at the beginning, but then more teams started onboarding. And then ownership of a specific application would change teams. Or teams would change their app’s name or component name. This quickly became unsustainable.
We also brought our own Kubernetes clusters to Google Cloud, and tracking who used what became hard. We were going to need a way to apportion our Kubernetes bill according to how a namespace used that resource. When our script tagged those Kubernetes clusters, which are actually VMs, their label value appeared as the cluster name, instead of the namespace.
We built an API that pulled usage information from our clusters. It assigned an estimated cost to a namespace based on the CPU and memory requests of that namespace. We used labels indicating namespace to find the total cost of the cluster to understand what the relative cost of a namespace would be. All of this data was then sent to a tool called Grafana. We showed roughly the same thing as we did in Data Studio: total namespace costs, per environment and over time.
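To make the apportioning idea concrete, here is a minimal sketch of splitting a cluster’s total cost across namespaces by their CPU and memory requests. The function name, the input shape, and the even 50/50 weighting between CPU and memory are all assumptions for illustration; the post does not say how the real API weighted the two.

```python
from typing import Dict, Tuple


def apportion_cluster_cost(
    cluster_cost: float,
    requests: Dict[str, Tuple[float, float]],
    cpu_weight: float = 0.5,
) -> Dict[str, float]:
    """Split a cluster's cost across namespaces by resource requests.

    `requests` maps namespace -> (cpu cores requested, memory GiB requested).
    `cpu_weight` is a hypothetical knob: the fraction of the bill attributed
    to CPU, with the remainder attributed to memory.
    """
    total_cpu = sum(cpu for cpu, _ in requests.values())
    total_mem = sum(mem for _, mem in requests.values())

    costs: Dict[str, float] = {}
    for namespace, (cpu, mem) in requests.items():
        cpu_share = cpu / total_cpu if total_cpu else 0.0
        mem_share = mem / total_mem if total_mem else 0.0
        share = cpu_weight * cpu_share + (1 - cpu_weight) * mem_share
        costs[namespace] = cluster_cost * share
    return costs


if __name__ == "__main__":
    example = {"price": (2.0, 8.0), "search": (6.0, 8.0)}
    print(apportion_cluster_cost(1000.0, example))
```

Because each namespace’s share is relative to the total requested, the whole cluster bill is distributed, including capacity nobody requested. Whether idle capacity should instead be reported separately is a design choice this sketch does not try to settle.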
At this point, we were capturing most of our Compute Engine costs and were able to break down our instance costs, but we were still missing other cost breakdowns, such as networking charges, bucket costs, Cloud SQL instances, and more. We were also building this data flow off a single project, so we also needed to address the one-off projects and proofs of concept that were also in Google Cloud. Teams now knew how much they were spending, but we had no way of showing them where they were using resources inefficiently.
In July of 2018, we migrated other applications and had a chance to change the way we did things. Each application received its own test and production projects, which connected to other service and host projects via Service Accounts.
Here were some of the benefits:
- The platform components that we deploy are isolated and in their own project. We now get the same benefits that the app teams get. And we all get project level IAM for access.
- App teams can consume Google-specific services and their usage will be attached to that project. Load balancer logs? Doesn’t matter that they can’t be labeled. Everything in the project belongs to the app.
- Quota management! So. Much. Better. We’re sure Google appreciated the cessation of our constant quota increase requests.
There’s still a problem, though. Each app team could be putting more than one application in a project. This is where we needed better application identification. We needed some sort of fingerprint.
Introducing the Configuration Item ID or “CI”: an 8-digit, unique number across all of Target’s business units. CIs are a Target identifier. They are our lowest common denominator. What if we could apply them to everything deployed to GCP?
Great idea, right? It’s not a new one either, but it’s only part of the solution. How do you enforce the presence of a CI wherever possible? If an application team has no reason to want to voluntarily label their resources, how can we leverage the tech to do it for us? How do we enforce data consistency? How can we ensure complete coverage of all resources at GCP?
Hello Cloud Functions! Functions sanitize and standardize the data as teams’ instances launch, and they also apply labels to those instances so we can account for these resources in our billing. But we have a problem when teams don’t supply a CI at all. If we really cared to, we could let the functions prevent instances from ever spinning up in the first place, but if our goal is to make it easy for app teams to consume infrastructure, that seems extreme.
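The sanitizing step might look something like the sketch below. It is only an illustration: the label key `ci`, the accepted input variations (surrounding whitespace, an optional `CI` prefix), and both function names are assumptions, and the trigger wiring that would call this code when an instance launches is left out entirely.

```python
import re
from typing import Dict, Optional

CI_LABEL_KEY = "ci"  # hypothetical label key, not taken from the post


def sanitize_ci(raw: Optional[str]) -> Optional[str]:
    """Normalize a user-supplied CI to its 8-digit form, or return None.

    Tolerates surrounding whitespace and an optional 'CI' prefix followed
    by separators (assumed input variations, for illustration only).
    """
    if raw is None:
        return None
    cleaned = re.sub(r"^ci[\s_:-]*", "", raw.strip().lower())
    return cleaned if re.fullmatch(r"[0-9]{8}", cleaned) else None


def labels_with_ci(labels: Dict[str, str], raw_ci: Optional[str]) -> Dict[str, str]:
    """Return a copy of `labels` with a standardized CI label added.

    When no valid CI is supplied, the labels are returned unchanged and
    the instance is still allowed to launch, matching the choice not to
    block deployments.
    """
    updated = dict(labels)
    ci = sanitize_ci(raw_ci)
    if ci is not None:
        updated[CI_LABEL_KEY] = ci
    return updated


if __name__ == "__main__":
    print(labels_with_ci({"env": "test"}, " CI-12345678 "))
    print(labels_with_ci({"env": "test"}, None))
```

Leaving the labels untouched when the CI is missing or malformed means the gap has to be caught later by some kind of reporting, rather than at launch time.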
Hello Forseti! It’s pitched as a security tool, but it’s really just about finding deviations from a desired standard. We run reports to identify things that can have labels but don’t. Notification is manual for now, but there’s a future possibility of identifying an owner of the application and notifying them automatically. Now we can use CIs across all of our compute infrastructure, and they’re applied upon deployment.
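The kind of check such a report performs can be sketched in a few lines. This is not Forseti’s API or rule format; it is a plain-Python illustration over a hypothetical inventory of resource records, and the set of labelable resource types is an assumption you would replace with your own.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical set of resource types that accept labels; adjust to match
# whatever your own inventory export calls them.
LABELABLE_TYPES = {"compute_instance", "storage_bucket", "cloudsql_instance"}


def find_unlabeled(
    resources: Iterable[Dict[str, Any]], required_label: str = "ci"
) -> List[str]:
    """Return names of labelable resources missing the required label.

    Each resource is assumed to be a dict with 'name', 'type', and an
    optional 'labels' mapping (an illustrative shape, not a real export).
    """
    missing: List[str] = []
    for resource in resources:
        if resource["type"] not in LABELABLE_TYPES:
            continue  # cannot be labeled, so it is not a deviation
        if not resource.get("labels", {}).get(required_label):
            missing.append(resource["name"])
    return missing


if __name__ == "__main__":
    inventory = [
        {"name": "price-api-1", "type": "compute_instance", "labels": {"ci": "12345678"}},
        {"name": "search-web-1", "type": "compute_instance", "labels": {}},
        {"name": "lb-logs", "type": "load_balancer_log"},
    ]
    print(find_unlabeled(inventory))
```

Resources that cannot carry labels are skipped rather than flagged, which mirrors the distinction in the text between things that can have labels but don’t and things that can’t be labeled at all.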
So, what’s next? We hadn’t yet come up with a way to show costs for applications in our own data centers. Instead we focused on things like utilization and allocation of our compute resources. But if you think about finding waste or inefficiencies, you can’t do that with the reports that we provided in GCP. We started building a single, explorable dashboard that teams with resources deployed across all our platforms can use to make better decisions. It includes cost information, utilization, and allocation, all by configuration item. By combining all of this information, teams can see where the ‘waste’ is, identify where they can improve efficiency, and ultimately lower their operating costs.