Practical Rightsizing for AWS EC2 based applications

AWS

EC2

CloudHealth

BY Prabhu Barathi

Mar 29 2019

8 Min

Over the past few months, as I have met with customers to discuss the VMware Cloud Services portfolio, I have often ended up discussing rightsizing pretty extensively. As a team, we have covered rightsizing in a few prior blog posts, but I think it is worth exploring in more detail as a stand-alone topic. One of the discussions that led me here was lift and shift vs. re-factor. Personally, I see rightsizing as a key aspect of planning your move to the public cloud. It is often de-prioritized in favor of speed, performance, and time to market, which results in oversized, overprovisioned instances and a lot of wasted spend on unused resources. Don’t get me wrong, those factors are perfectly valid and are key to the business. A well-tuned, perfectly rightsized application that is two years behind schedule is of no use to anyone. The primary reason we’ve found for oversized resources is that operators have historically planned their on-premises infrastructure needs based on peak usage rather than steady-state or average usage. Cloud-based infrastructure, by nature of its elasticity, gives operators the ability to “flex” as needed based on increased usage, seasonality, and other factors. The key to rightsizing is to understand the steady state and usage patterns, and to know how to take advantage of the elasticity of the public cloud to respond to them.

Right-sizing is also an ongoing operation. It’s good practice to do the majority of it upfront, but continuing to do it on an ongoing basis is equally critical. This matters because the nature of the workload might change: the seasonality that we anticipated may have changed to a more even spread through the year or, worst case, the workload is not being used at all! AWS, Azure, and others provide several ways to continuously save on instance costs. As an example, if the application is not time sensitive, consider using Spot Instances, or take advantage of the larger discounts that come with Reserved Instances.

In this blog, we’ll focus primarily on a few areas:

Rightsizing using Performance Data
Choosing the Right Instance Family
Leveraging Policy and Collaborative approach for right-sizing

This is not an exhaustive methodology by any means. There are some aspects that we’ve already explored in the previous blog on eliminating zombie instances and optimizing EBS volumes.

Rightsizing using Performance Data

The first step in rightsizing is to gain a holistic understanding of the application metrics, instance metrics, and usage patterns. To gather enough data to use as a benchmark, we monitor the resources for at least two weeks (ideally a month) to capture the peak and average usage of the instances and associated metrics. There are several tools at your disposal for the rightsizing exercise. In this blog, we’ll look at an approach based on a combination of Amazon CloudWatch metrics and CloudHealth by VMware, augmented with application metrics captured from Wavefront.
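To make the idea of "peak and average usage" concrete, here is a minimal Python sketch that summarizes a list of CPU utilization samples collected over the monitoring window. The function name `summarize_utilization` and the sample values are hypothetical, and the 95th percentile is an extra statistic added here for illustration (the approach above only calls for peak and average). In practice the samples would come from a metrics source such as CloudWatch or Wavefront.

```python
from statistics import mean


def summarize_utilization(samples):
    """Summarize utilization samples (in percent) from a monitoring window.

    Returns the average, the peak, and a nearest-rank 95th percentile.
    This is an illustrative helper, not part of any vendor API.
    """
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    n = len(ordered)
    # Nearest-rank percentile using integer math: rank = ceil(0.95 * n).
    rank = (95 * n + 99) // 100
    return {
        "average": mean(ordered),
        "peak": ordered[-1],
        "p95": ordered[rank - 1],
    }


# Hypothetical samples: mostly idle with one short spike.
window = [10, 20, 30, 40, 100]
summary = summarize_utilization(window)
print(summary)
```

A large gap between the average and the peak is the typical sign of an instance that was sized for its peak rather than its steady state.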

So in the first step, let’s focus on the application and its associated instances. Here we’ll leverage a Perspective that maps to AWS tags. We’ve covered Perspectives in other blogs, so we’ll skip the details here. If you’re interested in reading more, please see the link here: https://www.cloudhealthtech.com/solutions/improve-cloud-cost-management/cloudhealth-perspectives

Once we have one or more of these Perspectives set up, we go through an exercise similar to what was described in the previous blog to eliminate instances that are not being used. We can set up a policy-based approach to identify those instances and terminate them. If you haven’t read through it, the blog is available here: https://www.beyondvirtual.io/blog/ch-zombie-pb/

So let’s dig deeper into how we can rightsize using performance data. Earlier this year, the CloudHealth platform was integrated with Wavefront, which allows us to use Wavefront as a metrics source and leverage that information to get actionable recommendations and insights. I do want to highlight that CloudHealth as a platform also provides the option to bring in other platforms such as Datadog and New Relic. In the absence of Wavefront or a similar metrics platform, CloudHealth will collect these metrics from Amazon CloudWatch. The main metrics that CloudHealth uses to determine how the system resources are being used are:

CPU - CPU utilization scaled to the vCPUs being utilized
Memory - Memory utilization
Network - Combined IO (Input/Output)
Storage - Type and Size of attached storage

Based on the above metrics, CloudHealth provides cross-family recommendations for AWS instances. There are some exceptions, such as accelerated computing family instances and other corner cases, for which rightsizing recommendations are not provided. The recommendation is the least costly instance type that is able to handle the workload. It is critical to note that CloudHealth takes instance-specific attributes into account, so we may not see the same instance type recommended across the board for the same application. As an example, the application that we spun up has t2.large instances across the board. Based on the usage and performance data, we’re actually seeing recommendations for t3.micro and t3.nano instance types.
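The "least costly instance type that is able to handle the workload" rule can be sketched in a few lines of Python. This is only an illustration of the idea, not CloudHealth’s actual algorithm, which is not described here. The `InstanceType` class, the `cheapest_fit` function, and the cost values are all hypothetical; the costs are placeholder units for ordering only and are NOT real AWS prices. The vCPU and memory figures match the published specs for these three instance types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InstanceType:
    name: str
    vcpus: int
    memory_gib: float
    relative_cost: float  # placeholder units for ranking, NOT real AWS pricing


# Small illustrative catalog; a real one would cover many more types.
CATALOG = [
    InstanceType("t3.nano", 2, 0.5, 1.0),
    InstanceType("t3.micro", 2, 1.0, 2.0),
    InstanceType("t2.large", 2, 8.0, 18.0),
]


def cheapest_fit(catalog, needed_vcpus, needed_memory_gib):
    """Return the lowest-cost instance type that covers the observed needs,
    or None if nothing in the catalog is large enough."""
    candidates = [
        itype
        for itype in catalog
        if itype.vcpus >= needed_vcpus and itype.memory_gib >= needed_memory_gib
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda itype: itype.relative_cost)


# Two instances of the same application with different observed memory needs
# can end up with different recommendations.
print(cheapest_fit(CATALOG, 1, 0.8).name)
print(cheapest_fit(CATALOG, 1, 0.3).name)
```

Because each instance is evaluated on its own observed needs, two instances from the same application can land on different types, which is consistent with the mixed recommendations described above.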

Setting up Wavefront inside CloudHealth is pretty straightforward. We can find Wavefront under Setup > Accounts > Wavefront.

Once in there, we provide an Account Name, API Token, and Endpoint address as shown below. It takes about 15 minutes for the sources to become available inside CloudHealth.

When the assets start populating inside CloudHealth, we’re able to start using those discovered assets if we choose, and the platform also provides in-context switching to the Wavefront UI. Pretty slick! Stay tuned as we’ll be adding more functionality in this space in the coming months.

The key benefit of using a platform like CloudHealth for this exercise is that it helps massively in identifying the right resources, analyzing their usage needs and patterns, and providing a recommendation that the end user can consume. For temporary workloads with flexible start times, AWS Spot Instances might be a good option to consider. We’ll cover that in a separate blog post.

Choosing the Right Instance Family

In most cases, we find that developers and operators pick EC2 or RDS instances based on what most closely matches their needs. The main factors in choosing the instance type are usually CPU and memory. However, once the application is built, there is an opportunity to look for obvious cases where the family just doesn’t fit your workload. For example, maybe the instance is compute heavy but uses relatively little memory. In that case, switching to a compute optimized instance of the same size would save money and may also provide an opportunity to downsize.
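One rough way to spot a family mismatch is to look at the ratio of memory used to vCPUs used. The sketch below is a simplified heuristic, not a CloudHealth feature: the function name `suggest_family` is hypothetical, and the thresholds are assumptions based on the approximate memory-per-vCPU ratios of recent AWS families (roughly 2 GiB per vCPU for compute optimized, 4 for general purpose, and 8 for memory optimized).

```python
def suggest_family(peak_vcpus_used, peak_memory_gib_used):
    """Suggest an instance family from observed peak usage.

    Thresholds are illustrative assumptions; tune them for your workloads.
    """
    if peak_vcpus_used <= 0:
        raise ValueError("peak_vcpus_used must be positive")
    gib_per_vcpu = peak_memory_gib_used / peak_vcpus_used
    if gib_per_vcpu <= 2.0:
        return "compute optimized"
    if gib_per_vcpu <= 4.0:
        return "general purpose"
    return "memory optimized"


# Compute heavy, low memory: about 1.1 GiB per vCPU in use.
print(suggest_family(3.5, 4.0))
# Balanced usage: 3.5 GiB per vCPU in use.
print(suggest_family(2, 7))
```

A workload that consistently lands in a different bucket than the family it runs on is a good candidate for a closer look.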

In both the examples above, we can see CloudHealth making a recommendation to move from a general purpose m3.xlarge down to a compute optimized c3.large. Another key recommendation here is to get to new instance families as quickly as possible. While AWS does maintain support for older generation instances, moving to the latest generation offers significant savings along with the best performance. You can get a list of all older generation instances by navigating through the AWS assets to generate a report.

A few other observations that I have come across in the last few years: if the instance spikes at regular intervals, the burstable T2 instance family is a good option. As a simple example, if we deployed m4 or other general purpose instances, a t2.medium is likely a good option to move to. Burstable instances are great because they allow for bursts at a much lower rate. This is a great option for dev/test type instances.
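To check whether a spiky workload is a plausible fit for a burstable instance, we can model CPU credits in a very simplified way: time spent below the baseline earns credit, and time spent above it spends credit. The sketch below is a toy model under stated assumptions, not AWS’s actual credit accounting (which includes launch credits and a maximum credit balance). The function name `stays_within_credits` and the baseline value of 20 percent used in the example are hypothetical; check the AWS documentation for the real baseline of the instance type you are considering.

```python
def stays_within_credits(samples, baseline_pct, starting_credits=0.0):
    """Return True if the workload never exhausts its credit balance.

    samples: CPU utilization (percent) at equal intervals.
    baseline_pct: utilization the instance can sustain without using credits.
    Credits are tracked in "percent-intervals", a simplified unit.
    """
    balance = starting_credits
    for utilization in samples:
        # Below baseline earns credit; above baseline spends it.
        balance += baseline_pct - utilization
        if balance < 0:
            return False
    return True


# Mostly idle with one periodic spike: the earned credit covers the burst.
print(stays_within_credits([5, 5, 5, 5, 60], baseline_pct=20))
# Sustained load above the baseline: credits run out immediately.
print(stays_within_credits([50, 50, 50], baseline_pct=20))
```

If the second pattern describes your workload, a fixed-performance general purpose or compute optimized instance is likely the better choice.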

Leveraging Policy and Collaborative approach for right-sizing

One question I tend to get frequently is “how often should one go through a rightsizing exercise?” It’s best to revisit this on a monthly basis; any sooner than that and we’re optimizing based on partial data at best. Once you have gone through the exercise, it’s your choice how soon to revisit it. My personal opinion: going through this multiple times in a week or month might be counterproductive.

The most productive approach to right-sizing involves empowering the finance and cloud operations teams to collaborate with the DevOps teams directly. This means enabling a workflow that allows the right-sizing recommendations to be distributed directly to them.

On the Rightsizing recommendations page, click the New Subscription option on the far right.

Provide a Name/Description and choose whether to make this a publicly accessible report.

Once we’ve created the subscription, we have the option to email the application owners a rightsizing report and a list of associated recommendations.

Lastly, to make it easy to consume Instance Rightsizing as a Governance policy, CloudHealth provides an out-of-the-box experience with an Instance Rightsizing Template. Once we have these in place, we have a framework for a more closed-loop approach that involves the cloud operations, finance, and developer teams in the rightsizing recommendations.

Conclusion and Summary

Rightsizing is an effective way to control costs, in addition to eliminating rogue or underutilized instances. It is a closed-loop process of continuously analyzing each workload’s performance, usage needs, and patterns to either turn off idle instances or resize the ones that are oversized. Matching the workload with an appropriate instance family is equally important. This can be made into a smooth process by establishing the right schedule, working with a Cloud Center of Excellence team, and taking the collaborative approach described above. What do you use in your organization? Any ideas or thoughts you’d like to share? Drop us a line below! Till next time…
