Proactive Policy Management of RDS Usage in Dev/Test Environments

BY Bill Shetti
Feb 7 2019
8 Min

As a DevOps individual, managing the deployment of an application is paramount. Managing performance, cost, and security are important components of day-2 operations. But while production environments are tightly managed and monitored, what about dev/test environments?

Unlike production environments, dev/test environments need to be secure while still providing a level of flexibility for developers. This flexibility allows developers to innovate and release applications to production faster. AWS provides this flexibility and innovation speed. However, it is much harder to manage due to the rapid change in services, features, and capabilities on AWS.

Unlike on-prem, where there is ultimate control, AWS has to be properly managed through policies governing the developer’s use, cost, and security.

There are two types of policies that need managing:

  1. IAM policies on AWS, through:
    • Policies
    • Roles
    • Users
    • Keys
    • etc

Proper configuration of these policies enables the management of resource usage, cost, and application, resource, and user security. IAM allows for tight or loose control of AWS usage BEFORE deployment or usage (consider it a “preventative” policy).

These policies are generally not set once, but rather iterated on a regular basis due to company needs, AWS service/feature changes, etc. Hence this layer is not perfect, and a second layer of policy and governance is needed.

  2. Since IAM doesn’t always catch and prevent deployment of an application with the right settings and resources, a second layer of “proactive” policy is needed. This policy requires an ability to remediate violations and a strong knowledge of services on AWS.
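As an illustration, a “preventative” IAM policy for dev/test might deny creation of large RDS instance classes outright. This is only a sketch; the instance classes listed are assumptions you would adapt to your own accounts (the `rds:DatabaseClass` condition key is part of AWS IAM support for RDS):

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DenyLargeRDSClasses",
                "Effect": "Deny",
                "Action": "rds:CreateDBInstance",
                "Resource": "*",
                "Condition": {
                    "StringNotEquals": {
                        "rds:DatabaseClass": ["db.t2.micro", "db.t2.small", "db.t2.medium"]
                    }
                }
            }
        ]
    }

Anything this policy doesn’t anticipate (new instance classes, new features) is exactly what the “proactive” layer below has to catch.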

In this blog, we will cover how to build a “proactive” policy to manage RDS, one of the more widely used services in AWS and one that can run up costs quickly. We will cover “preventative” policies in another blog.

We will explore:

  1. What a usage policy is and how it’s developed
  2. How to alert on and remediate a violation
  3. Some tools (e.g. CloudHealth by VMware) that help simplify the solution

RDS Usage & Remediation Policy

In a previous blog we showed how expensive AWS RDS is compared to deploying a database on EC2 (in particular, we explored MySQL).

As a quick recap — the previous blog showed how RDS is upwards of 2x the cost of using EC2 for MySQL. While this is costly, there are many benefits that can outweigh the cost:

  1. Simplicity in configuration — whether it’s the use of the UI or the AWS API, creation and management is easy, can be automated, and is much simpler than a standard MySQL deployment on a VM.
  2. High availability — simple features enabling clusters, master-slave replication, proxying, and cross-region replication
  3. Built-in backup and redundancy mechanisms (vs. programming this manually)

All these capabilities are highly useful and generally used in production environments and testing environments (where it’s necessary to test for scale and performance).

However, each capability listed above can add to the cost; e.g. Multi-AZ deployments can double the cost of RDS.

Hence RDS use can skyrocket costs quickly.

Here are some examples of the costs related to RDS:

  1. Production RDS MySQL-based deployments start at around $900/month, vs. $132/month for a dev/test use case on AWS, vs. <$100/month using EC2
  2. Depending on the instance type, costs can double or triple (a T2.large with 20G is only $132.70/month, vs. an M4.large with 20G at $300/month)
  3. As database storage size increases, cost increases (500G vs. 2T is $200/month more)
  4. etc.
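To see roughly where such numbers come from, monthly RDS cost is approximately the hourly instance rate times ~730 hours, plus provisioned storage. A minimal sketch — the rates below are illustrative placeholders, not current AWS prices:

```python
HOURS_PER_MONTH = 730  # common monthly-hours approximation

def estimate_monthly_cost(hourly_rate, storage_gb, storage_rate_per_gb):
    """Rough monthly RDS cost: instance hours plus provisioned storage."""
    return round(hourly_rate * HOURS_PER_MONTH + storage_gb * storage_rate_per_gb, 2)

# Illustrative rates only -- look up real prices on the AWS pricing page.
small = estimate_monthly_cost(0.145, 20, 0.115)  # a t2.large-class example
large = estimate_monthly_cost(0.350, 20, 0.115)  # an m4.large-class example
print(small, large)
```

Plugging in real on-demand rates for your region quickly shows how an instance-class change alone can double or triple the monthly bill.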

In a standard dev/test environment, you want to limit the use of RDS to minimize the cost without restricting innovation, speed of delivery or flexibility.

How to achieve this?

a) Set a usage policy:

  1. For the general use case — cost is the simplest parameter to check. Check whether any RDS instance exceeds $XX/month. This allows for any configuration without limiting innovation/testing, but restricts the usage.
  2. For exceptions — create a procedure to ask for exceptions. This will be required.

b) Create a procedure to review and react to policy violations:

  1. Notifications — how should notifications be sent? Slack, email, etc.
  2. Remediation procedure — depending on what your organization allows, there can be many options here. One action could be to right-size if there is wastage. Another action could be to disable specific configuration components (e.g. shut off backups).
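Putting (a) and (b) together, the core check is simple: compare each instance’s estimated monthly spend against the threshold and skip anything on the exception list. A minimal sketch with hypothetical data — the field names and instance names are assumptions, not the CloudHealth or AWS API schema:

```python
THRESHOLD = 200               # $/month policy limit
EXCEPTIONS = {"loadtest-db"}  # instances granted a policy exception

def find_violations(instances, threshold=THRESHOLD, exceptions=EXCEPTIONS):
    """Return the names of non-exempt instances over the monthly spend limit."""
    return [
        inst["name"]
        for inst in instances
        if inst["monthly_cost"] > threshold and inst["name"] not in exceptions
    ]

# Hypothetical inventory -- in practice this comes from your cost tooling.
inventory = [
    {"name": "prodrds", "monthly_cost": 310.0},
    {"name": "devdb", "monthly_cost": 95.0},
    {"name": "loadtest-db", "monthly_cost": 420.0},  # exempt
]
print(find_violations(inventory))  # -> ['prodrds']
```

The exception set is what keeps the policy from blocking legitimate scale or performance testing.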

In the rest of this blog, we will explore how to manage a policy that shuts down any non-exempt instance in dev/test that surpasses $200/month. Understandably this is extreme, but I am using it to highlight an “action”.

How do I implement this policy and the procedure? Two great options:

  1. Use AWS APIs and build a set of code/scripts to check and remediate
  2. Use tools like CloudHealth by VMware to simplify your operations

Depending on the depth of experience of your operations staff, creating a simple remediation mechanism might be easy or extremely hard. The majority of the time, this experience is NOT in house, and it is expensive to hire consultants.

CloudHealth by VMware provides an easy and simple mechanism to create and manage the policy and notification/remediation procedure.

How to implement policy and act on policy violations

In our example us-east-1 is the dev/test location. In CloudHealth by VMware, we create a simple policy targeted at us-east-1 to list any RDS instance that exceeds $200/month spend.

Managing RDS usage

We’ve also configured the system to:

  1. Notify by email
  2. React by executing a Lambda function to shut off the violating RDS instances in dev/test
  3. Check for violations every day

While I chose to use Lambda to shut down the RDS instance, I also enabled a CHECK: it must ask for permission before executing the action.

This allows for validation with the developer before the action is taken and/or gives time to the developer to backup or ask for an exception.

Email notification on RDS instances in violation

As noted in the image above — three RDS instances are in violation:

  1. prodrds
  2. prod2
  3. exampledb

The lambda function also executed to stop all three of these instances.

RDS instances being stopped

The lambda function created to stop the RDS instance uses specific IAM policies allowing it to stop any instance in the us-east-1 region.
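For reference, the Lambda’s execution role needs permission to stop instances. A hedged sketch of such a policy — the Resource scope is a placeholder you should tighten to your own account and naming conventions:

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "rds:DescribeDBInstances",
                    "rds:StopDBInstance"
                ],
                "Resource": "arn:aws:rds:us-east-1:*:db:*"
            }
        ]
    }

Scoping the Resource to `us-east-1` is what confines the action to the dev/test region.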

The lambda function is as follows (please use as needed):

    import boto3
    from botocore.exceptions import ClientError
    from arnparse import arnparse  # third-party package: pip install arnparse

    def lambda_handler(event, context):
        rds = boto3.client('rds')
        if event.get('resource_arns'):
            # Iterate over the violating RDS ARNs passed in by CloudHealth
            for item in event['resource_arns']:
                dbitem = arnparse(item)
                print('resource found:', dbitem.service, dbitem.resource_type, dbitem.resource)
                if dbitem.resource_type == 'db':
                    try:
                        rds.stop_db_instance(DBInstanceIdentifier=str(dbitem.resource))
                        print('Success in shutting down:', dbitem.resource)
                    except ClientError as e:
                        print(e)
        else:
            print('No resources found')
        return {'message': 'Script execution completed. See CloudWatch logs for complete output'}

As noted in the function above, it iterates through the entire list of violating RDS instances that CloudHealth passes. This list is a set of RDS ARNs that meet the policy:

    {
      "resource_arns": [
        "arn:aws:rds:us-east-1:123456789:db:exampledb",
        "arn:aws:rds:us-east-1:123456789:db:prodrds"
      ],
      "function_name": "e2e-testing-lambda",
      "region": "us-east-1"
    }
Finally — what policies are needed to enable CloudHealth to execute the Lambda function? Here is a screenshot of the policies attached to the CH_ROLE in AWS.

Policies needed to run lambda for RDS shut down

As shown in the image there are a few extra policies added to the standard CH_ROLE:

  1. AWSLambdaExecute
  2. AWSLambdaVPCAccessExecutionRole ← this specific policy gives the Lambda function access to ALL VPCs.
  3. StopOverUsedRDS — specific access to the lambda function that shuts down RDS in our example

    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "lambda:GetFunctionConfiguration",
                "Resource": "arn:aws:lambda:us-east-1:123456789:function:StopOverUsedRDS"
            }
        ]
    }
    
    
  4. RDS_start_stop — another version of the Lambda function (we will talk about this in another blog)

Process Next Steps

As a cloud ops/DevOps individual, this “proactive” policy helps catch “holes” in the “preventative” policy. Once these are found and are not exceptions, the IAM-based preventative policy needs to be updated.

This will be a continuous process of iterating on the “preventative” policy.

In the long run, this helps save costs on the application, in AWS, and specifically on RDS.

Summary

As described in this blog:

  1. There are two types of policies, “preventative” and “proactive”. Both are needed, and we described developing and implementing a “proactive” policy.
  2. We detailed how a Lambda function can be used to manage “actions” on AWS.
  3. We described how CloudHealth simplifies the creation, notification, and action execution of a “proactive” policy.