A few years ago, it was a common misconception that security in the cloud was the sole responsibility of the Cloud Provider. That’s not true. It has always been, and will remain, a shared responsibility model, where the Cloud Provider is accountable for “Securing the infrastructure that runs the services offered in the cloud” and you are responsible for managing and securing user access, network and resource configurations. (You can find more information on that here)

In a multi-cloud world, these responsibilities just multiply.

Security in the cloud is multi-faceted. The risks can be grouped into 3 major categories:

1. Network Risks - Related to DDoS, Intrusion
2. Application Risks - Related to Host Vulnerabilities
3. Compliance and Governance Risks - Related to Identity and Access Management, resource configurations and standards like CIS, NIST, PCI etc.

The first 2, Network and Application risks, can be addressed with traditional methods: a firewall (or a SaaS offering) for the former, and an agent within the host instance or container that monitors for vulnerabilities for the latter.

Compliance and Governance risks, on the other hand, are more complicated in the public cloud for 3 primary reasons:

1. Ever Growing Cloud Native Services - Cloud Providers are releasing new services every quarter and adding features to existing services.

2. Service Dependencies - As the number of services grows, the inter-connectivity between them increases, and the relationships between existing services evolve as well. For example: In 2013, you could create an EC2 instance outside of a VPC (EC2-Classic). That relationship has now changed. You CANNOT have your EC2 instance deployed without a VPC.

3. Identity and Access Management - IAM in the cloud allows for assigning Policies to “Users” as well as “Roles” to the instances themselves. Assigning “Roles” to instances is a good feature because the developer does not have to worry about storing their keys within the instance.

While this is great, it also means that not just your users but also the application instances have access to services (like databases, S3 buckets or even IAM itself). This is totally different from traditional on-prem data centers. It requires looking at cloud security through a different lens.

Irrespective of the type of incident/violation, and whether it occurs in a public or a private cloud, you should always try to follow these steps:

Response Process

In this two-part blog, we will address some of the most frequently asked questions and also configure the entire Response Process as shown above.

  1. What are the most common violations in the public cloud?

  2. How to detect these violations?

  3. How to design compliance and governance policies to evaluate violations?

  4. Okay, I can detect violations. How to fix/remediate these violations?

  5. The Continuous Security Model

  6. How to select Cloud Governance and compliance tools that fit in the Continuous Security Model?

  7. What are some of the potential options for such tools? Open Source and Enterprise.

  8. How to automate this process using Jenkins and Terraform? (Blog: Part 2)

This blog will cover the concepts and provide a deep dive on the overall “Continuous Security Model”.

Violation Types and Examples

There are common and complex violations in the public cloud.

Here are some of the common violations:

1. S3 Bucket or Blob Storage is open and accessible publicly on the internet. (Unprotected Resource)

2. Instances have Administrator Role assigned to them. (Privileged Access)

3. User has active keys but has not rotated them, or has pushed the keys to a public “git” repository. (Compromised Credentials)

4. Database instances are not encrypted. (Un-encrypted resources)

5. SSH port 22 is publicly accessible.

Complex violations are harder to find and are not evident at first glance.

For example: A publicly routable and addressable instance has the same SSH key as an instance with an administrative policy.

This complex violation is made up of several separate conditions:
1. Instance 1 is publicly routable and accessible.
2. Instance 2 is private but has an IAM role attached to it with an “admin” policy. (With this access the instance can perform any action on the account)
3. Instance 1 and 2 share the same SSH key.

In this scenario, the user, who has the SSH-Key but NO privileged access, can ssh into the public instance (Instance 1) and then access the instance with admin privileges (Instance 2) and potentially make any number of changes within the account. So even if you had logging enabled, the logs would show that the “instance” was the one which made changes with no trace of the user.
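The scenario above can be checked mechanically once you have an inventory of instances. The following is a minimal Python sketch, not a definitive implementation: the `Instance` record, its field names and the sample inventory are assumptions made for illustration. In practice the data would come from your cloud provider's APIs or from a governance tool.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical inventory record; field names are illustrative."""
    name: str
    ssh_key: str          # name or fingerprint of the key pair
    is_public: bool       # publicly routable and reachable
    has_admin_role: bool  # a role with an admin-level policy is attached


def find_shared_key_escalations(instances):
    """Return (public, admin, key) triples where a public instance shares
    its SSH key with a different instance that holds admin privileges."""
    by_key = defaultdict(list)
    for inst in instances:
        by_key[inst.ssh_key].append(inst)

    findings = []
    for key, group in by_key.items():
        for public in group:
            if not public.is_public:
                continue
            for admin in group:
                if admin.has_admin_role and admin.name != public.name:
                    findings.append((public.name, admin.name, key))
    return findings


# Sample data mirroring the three conditions listed above.
inventory = [
    Instance("instance-1", "team-key", is_public=True, has_admin_role=False),
    Instance("instance-2", "team-key", is_public=False, has_admin_role=True),
    Instance("instance-3", "other-key", is_public=False, has_admin_role=False),
]
findings = find_shared_key_escalations(inventory)
print(findings)
```

None of the three conditions is alarming on its own row of the inventory; the finding only appears when the rows are grouped by key, which is why this class of violation needs correlation rather than per-resource checks.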

This is dangerous!

This is exactly what makes complex violations harder to detect. In the next section, we will discuss detection techniques for such violations.

Detecting Violations

While some of the violations mentioned in the previous section (like an open S3 bucket or an unencrypted DB) can be detected with simple scripts running in VMs or Lambda functions (or Azure Functions), others need careful analysis: correlating data from various native services and overlaying that information on top of the cloud data model (how the services/resources are connected to each other).
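To give a feel for what such a “simple script” might look like, here is a hedged Python sketch that checks a snapshot of resources for three of the common violations listed earlier. The record layout (`type`, `public`, `encrypted`, `ingress`) and the sample data are assumptions for illustration; a real script would first fetch this information through the cloud provider's SDK.

```python
def find_common_violations(resource):
    """Return the common violations found on one resource record.

    `resource` is a plain dict; the keys used here are illustrative.
    """
    violations = []
    kind = resource.get("type")
    if kind == "bucket" and resource.get("public"):
        violations.append("Unprotected Resource")
    if kind == "database" and not resource.get("encrypted"):
        violations.append("Un-encrypted Resource")
    if kind == "security_group":
        for rule in resource.get("ingress", []):
            if rule.get("port") == 22 and rule.get("cidr") == "0.0.0.0/0":
                violations.append("SSH port 22 is publicly accessible")
    return violations


# Hypothetical snapshot of deployed resources.
snapshot = [
    {"id": "logs-bucket", "type": "bucket", "public": True},
    {"id": "orders-db", "type": "database", "encrypted": False},
    {"id": "web-sg", "type": "security_group",
     "ingress": [{"port": 22, "cidr": "0.0.0.0/0"},
                 {"port": 443, "cidr": "0.0.0.0/0"}]},
    {"id": "assets-db", "type": "database", "encrypted": True},
]

# Keep only the resources that have at least one violation.
report = {}
for resource in snapshot:
    found = find_common_violations(resource)
    if found:
        report[resource["id"]] = found
print(report)
```

Each check looks at a single resource in isolation, which is what keeps the script simple and also what prevents it from catching violations that span several resources.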

Some of these native services are:

1. AWS VPC Flow Logs / Azure Network Security Group Flow Logs - “What’s happening in my cloud network?”

2. AWS CloudTrail / Azure Activity Log - “Who, when and what CRUD operations were performed on which services in the account?”

3. Amazon GuardDuty / Azure Advanced Threat Protection - “That looks sketchy!”

NOTE: It's always a best practice to enable these services within your cloud environment.

These tools combined with configuration change data from your deployed resources can help you identify violations in the public cloud.

You must also remember that these services should be enabled for every cloud account and every region within your organization. This may cause sprawl of dashboards (across different cloud providers) and alerts being generated from all of the different services, which makes it difficult to manage.

An alternative is to leverage multi-cloud tools (Open Source or Enterprise) that work across AWS, Azure and GCP environments and provide a single point to manage all of the data (events, violations) and alerts. We will look at some features and potential tools that you can use at the end of this blog.

In the next section, we will learn how to design policies that these tools can use to evaluate for violations.

Designing Effective Policies

Policies in the private cloud (on-prem DC) have traditionally been designed with narrower guardrails. This worked fine in an on-prem environment, but in the public cloud, where the infrastructure is so dynamic and elastic, it often results in rigidity around the usage of cloud native services. (Not to forget developers complaining that ‘Security (Team) is slowing down their product release!’ Sounds familiar?)

To overcome these challenges, policies should be designed to give developers enough flexibility to configure and deploy applications as they need to (Fences), while maintaining points of enforcement (Gates) to ensure secure operation of your infrastructure.

Here is a good article about the thought-process behind Fences and Gates that encapsulates operations in a multi-cloud world.

A policy consists of one or more permission(s) and constraints for resources, objects or services.

Remember, permissions can be assigned either to “Users” or to the “Instances” (Functions).

Let’s look at a few examples:

Policy 1 -
    Permissions:
        - Create, Read, Update, Delete
    Resources:
        - EC2
    Constraints:
        - EC2 Instances should have unique SSH-KEY pair

Policy 2 -
    Permissions:
        - Create, Read, Update, Delete
    Resources:
        - RDS
    Constraints:
        - RDS Snapshots must be encrypted

Policy 3 -
    Permissions:
        - Create, List
    Resources:
        - S3
    Constraints:
        - S3 bucket should be encrypted and accessible only by authenticated users

Policy 1 - Provides CRUD permissions to operate on EC2 instances. It also has a constraint that EC2 instances should have unique SSH keys. If, while deploying, the instances end up sharing the same SSH key, a violation alert will be triggered.

With Policy 2 and 3, the alerts are triggered when RDS snapshots are NOT encrypted or the S3 bucket is publicly accessible, respectively.
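One way to make such policies machine-readable is to model each constraint as a named check on a resource. The Python sketch below expresses Policy 2 and Policy 3 that way. It is an illustration under assumptions, not the format of any particular tool: the `Policy` and `Constraint` classes, the resource dict keys (`snapshot_encrypted`, `encrypted`, `public`) and the sample resources are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Constraint:
    description: str
    is_satisfied: Callable[[Dict], bool]  # True when the resource complies


@dataclass
class Policy:
    name: str
    permissions: List[str]
    resource_type: str
    constraints: List[Constraint] = field(default_factory=list)

    def evaluate(self, resource: Dict) -> List[str]:
        """Return descriptions of the constraints this resource violates."""
        if resource.get("type") != self.resource_type:
            return []  # the policy does not apply to this resource type
        return [c.description for c in self.constraints
                if not c.is_satisfied(resource)]


policy_2 = Policy(
    name="Policy 2",
    permissions=["Create", "Read", "Update", "Delete"],
    resource_type="RDS",
    constraints=[Constraint("RDS snapshots must be encrypted",
                            lambda r: r.get("snapshot_encrypted", False))],
)
policy_3 = Policy(
    name="Policy 3",
    permissions=["Create", "List"],
    resource_type="S3",
    constraints=[
        Constraint("S3 bucket should be encrypted",
                   lambda r: r.get("encrypted", False)),
        # A missing "public" key is treated as public, i.e. as a violation.
        Constraint("S3 bucket should be accessible only by authenticated users",
                   lambda r: not r.get("public", True)),
    ],
)

# Hypothetical deployed resources.
resources = [
    {"type": "RDS", "id": "db-1", "snapshot_encrypted": False},
    {"type": "S3", "id": "bucket-1", "encrypted": True, "public": True},
    {"type": "S3", "id": "bucket-2", "encrypted": True, "public": False},
]
alerts = {}
for r in resources:
    alerts[r["id"]] = policy_2.evaluate(r) + policy_3.evaluate(r)
print(alerts)
```

Keeping the constraint description next to its check means every alert can be traced back to the exact policy wording that produced it.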

It's important to note that the "Policy" design is NOT a one-time initiative. It's an iterative process.

For example: In Policy 3, none of the S3 buckets can be accessed publicly. What if you have a marketing team that wants to host external facing content for all the customers? This will require updating the constraints in the policy. It can be achieved by tagging those resources [“Content = Marketing”] and deciding, based on the tags, which constraints will be applied.

You might also notice a trend on some of the resources. For example: the instances with public access have seen SSH brute force attacks from the same source for over a week. This means that the policy needs to be modified such that public access is allowed only from “Known IP Addresses”. This should be reflected in Security Groups.
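A “Known IP Addresses” constraint boils down to checking that the source range of every SSH rule sits inside an approved network. Here is a small sketch using Python's standard `ipaddress` module. The allow-list (taken from documentation-only address ranges) and the rule layout are assumptions for illustration.

```python
import ipaddress

# Hypothetical allow-list; replace with your organization's ranges.
KNOWN_NETWORKS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.16/28"),
]


def is_allowed_source(cidr, known_networks=KNOWN_NETWORKS):
    """True if the rule's source range lies entirely inside a known network."""
    source = ipaddress.ip_network(cidr)
    for net in known_networks:
        # subnet_of() raises TypeError across IP versions, so compare first.
        if net.version == source.version and source.subnet_of(net):
            return True
    return False


# Hypothetical SSH ingress rules pulled from Security Groups.
ssh_rules = [
    {"port": 22, "source": "0.0.0.0/0"},
    {"port": 22, "source": "203.0.113.10/32"},
    {"port": 22, "source": "198.51.100.32/28"},
]
violating = []
for rule in ssh_rules:
    if not is_allowed_source(rule["source"]):
        violating.append(rule["source"])
print(violating)
```

Requiring the whole source range to be contained in a known network, rather than merely overlapping one, is the stricter choice: a rule like 0.0.0.0/0 overlaps everything but is contained in nothing on the list.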

The “continuous feedback” will enable you to refine the policies as you add more services/resources.

Once the violations have been detected and the interested “teams/people” have been informed, the next logical step is to remediate them.

4 Ways to Remediate Violations

The Remediation action can either be the responsibility of SecOps teams or they might inform the development (DevOps) team with a possible solution for the issue. Whatever may be the process at your organization, any alert/action should always be tied to a “Policy”, as described in the previous section.

The constraint in a policy defines the desired state of configuration of the resource, object or service. The remediation action should modify the current state to meet the desired state. For example: If encryption is disabled on an S3 bucket, then according to Policy 3 from the previous section, the remediation action would enable encryption.
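The “current state versus desired state” idea can be sketched in a few lines of Python. This is only an illustration: the setting names are hypothetical, and applying the resulting changes would be done through your deployment tooling or the provider's API.

```python
def plan_remediation(current, desired):
    """Return the settings that must change for `current` to match `desired`.

    Settings that the policy does not mention are left untouched.
    """
    changes = {}
    for setting, wanted in desired.items():
        if current.get(setting) != wanted:
            changes[setting] = wanted
    return changes


# Desired state derived from the S3 constraint; names are illustrative.
desired_state = {"encrypted": True, "public": False}
current_state = {"encrypted": False, "public": False, "versioning": True}

changes = plan_remediation(current_state, desired_state)
print(changes)  # {'encrypted': True}
```

An empty result means the resource already meets the policy, so the same function doubles as a compliance check.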

There is no “One Size Fits All” way of remediating the violations. It all depends on the policy and resource type. But in general they fall under these 4 buckets:

1. Modify the configuration of the resource - Ex: Remove the rule from Security Group (or Azure Network Security Group) that allowed SSH access to the instance from anywhere (0.0.0.0/0).

2. Re-deploy the resource with new configuration - In some cases, a configuration change can result in terminating the existing resource and re-deploying it. Ex: If incorrect SSH key pairs are used on instances, the only possible action is to re-deploy them.

3. Terminate the resource - You may have an action that will trigger the resource to be deleted if it is part of a violation. This might be useful in heavily regulated organizations.

4. Suppress the violation on a resource - This implies that the violation won’t trigger notifications in future for the selected resource.

Ex: S3 bucket which has marketing content for customers, can be open to public. Hence, the violation does not apply to the resource.

Caution must be exercised whenever you choose to "suppress" any violation.
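The four buckets above can be captured as a small decision function. The sketch below is one possible arrangement, assuming hypothetical violation names and a default mapping that you would tailor to your own policies. Suppressions are kept as an explicit, reviewable list rather than being inferred.

```python
from enum import Enum


class Action(Enum):
    MODIFY = "modify the configuration"
    REDEPLOY = "re-deploy with new configuration"
    TERMINATE = "terminate the resource"
    SUPPRESS = "suppress the violation"


# Hypothetical violation names mapped to a default action.
DEFAULT_ACTIONS = {
    "ssh_open_to_world": Action.MODIFY,
    "shared_ssh_key": Action.REDEPLOY,
    "unapproved_resource": Action.TERMINATE,
    "public_bucket": Action.MODIFY,
}

# Each suppression is an explicit (violation, tag key, tag value) entry
# so that it can be reviewed and audited.
SUPPRESSIONS = {("public_bucket", "Content", "Marketing")}


def choose_action(violation, tags):
    """Pick a remediation action, or None when a human should decide."""
    for key, value in tags.items():
        if (violation, key, value) in SUPPRESSIONS:
            return Action.SUPPRESS
    return DEFAULT_ACTIONS.get(violation)


print(choose_action("public_bucket", {"Content": "Marketing"}))
print(choose_action("public_bucket", {"Team": "Payments"}))
```

Because a suppression is tied to a specific violation, the marketing tag does not silence anything else on the same bucket.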

The remediation can be performed by the compliance and governance tools mentioned in previous sections, by your own script(s) integrated with your DevOps workflow, or by a combination of both.

Once the remediation action is taken, the infrastructure should still be monitored continuously for any violations.

Now that we have an understanding of aspects around compliance and governance in cloud, let’s look at how we can implement some of these practices.

The Continuous Security Model

In a “Continuous Security Model”, the goal is to build “security checks” into the CI/CD tool itself. It enables continuous monitoring for violations and uses the trend as continuous feedback to refine the policies and security posture of the infrastructure. This, in turn, allows both DevOps and SecOps teams to collaborate without stepping on each other’s toes.

Here is an overview of the implementation.

Pipeline Overview

This implementation comprises various pipelines (Jenkins Pipeline). There is also feedback from the “Post-deployment” pipeline back into the “Deployment Pipeline” (Re-deployment Pipeline), which enables sharing of data between these 2 pipelines.

Let’s dig a bit deeper to understand these pipelines:

1. Pre-Deployment Pipeline - These pipeline(s) might already exist within your CI/CD tool (Jenkins). They are responsible for performing builds, tests and static code analysis.

2. Deployment Pipeline - These pipeline(s) are responsible for deploying the application(s) into your choice of public cloud. Usually, they integrate with tools like Terraform, Ansible, CloudFormation, Spinnaker etc.

3. Post-Deployment Pipeline - This pipeline is triggered right after the deployment pipeline. It’s responsible for monitoring any kind of violations and vulnerabilities. This pipeline should also be triggered periodically, to ensure that no one has gone around the system to modify any configuration that might cause adverse effects.

There are 2 types of Post-Deployment Pipelines -

a. Host Vulnerability Monitoring Pipeline - It scans for vulnerabilities within the hosts, including package and application related vulnerabilities. A host vulnerability scanning tool should be used here to gather information.

b. Governance and Compliance Monitoring Pipeline - It scans for any “Policy Violations” caused by resources provisioned by the “Deployment pipeline”. We leverage one of the Cloud Compliance and Governance tools to gather all the violations based on the rules/constraints within our policy.

This pipeline may re-trigger the deployment pipeline with new inputs (Remediation Step). These inputs can either be provided by the user or, for some scenarios, be automated entirely. For example: If an “instances are deployed with shared ssh-keys” violation occurs, then on re-triggering the deployment pipeline, the user could provide unique ssh-keys as input.

If, on the other hand, an “S3 bucket encryption is disabled” violation occurs, the re-triggered deployment pipeline can automatically provide the input to enable it.
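The split between automated inputs and user-provided inputs could look roughly like the Python sketch below. The violation identifiers and variable names (`enable_bucket_encryption`, `ssh_key_name`) are hypothetical stand-ins for whatever your deployment pipeline actually accepts, and the step that re-triggers the pipeline is left out.

```python
# Hypothetical violation identifiers and deployment input names.
AUTOMATED_FIXES = {
    "s3_encryption_disabled": {"enable_bucket_encryption": True},
}
NEEDS_USER_INPUT = {
    "shared_ssh_key": ["ssh_key_name"],
}


def build_redeploy_inputs(violations, user_inputs=None):
    """Return (inputs, missing) for re-triggering the deployment pipeline.

    `inputs` holds the values that can be passed on right away; `missing`
    lists the inputs a user still has to supply. Violations that appear in
    neither table are skipped in this sketch.
    """
    user_inputs = user_inputs or {}
    inputs, missing = {}, []
    for violation in violations:
        if violation in AUTOMATED_FIXES:
            inputs.update(AUTOMATED_FIXES[violation])
        elif violation in NEEDS_USER_INPUT:
            for name in NEEDS_USER_INPUT[violation]:
                if name in user_inputs:
                    inputs[name] = user_inputs[name]
                else:
                    missing.append(name)
    return inputs, missing


detected = ["s3_encryption_disabled", "shared_ssh_key"]
print(build_redeploy_inputs(detected))
print(build_redeploy_inputs(detected, {"ssh_key_name": "web-key-2"}))
```

A pipeline built along these lines could re-deploy immediately when `missing` is empty and pause for input otherwise.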

This gives an overall picture of the “Continuous Security Model”. Let’s look at some key features of tools that you can employ within your DevSecOps process.

Selecting the Right Tool

Whether you choose to build your own in-house tool (Not Recommended. Do this only if the requirements below can be satisfied) or use an Open Source or Enterprise Solution for detecting violations, there are some key characteristics to look for in these tools.

Must Have

1. It should work across multiple cloud providers like AWS, Azure. The last thing you want is to use disparate tools for every cloud. This also implies that the tool should keep up with the new services being released by these providers.

2. It must provide out of the box support for standards like CIS, NIST, PCI etc. which are very critical in evaluating compliance.

3. It should provide real-time alerts.

4. It should leverage cloud-native metrics as an additional data source to detect violations. This is important because, unlike in an on-prem data center, you won’t have access to all the API data/logs/metrics/flows in the Public Cloud.

5. Last but not least, it should be scalable, allowing the ability to manage multiple accounts across multiple cloud providers.

Must Avoid

Apart from looking out for these features, you must avoid one major pitfall described below.

Making frequent API calls at short, regular intervals to get the current configuration and taking a diff with the previous state sounds like the easiest way to get real-time info about changes in your infra.

However, it results in the following issues:

a. The number of API requests that can be made within a specific interval is limited. If this limit is exceeded, you can expect an error such as “RequestLimitExceeded” in the case of AWS, and Azure enforces a similar API call limit. The problem arises when tools make continuous API requests to get the state of resources/objects. If the request limit is exceeded, the end user won’t be able to deploy, modify or query the cloud environment until that time interval has passed. This can result in unnecessary delays in critical situations.

b. The “fix” (hack) to this problem would be to increase the interval between API requests, i.e. to poll less often. Again, this is problematic because the alerts WON’T be real-time anymore. Not a good architecture in the security realm.
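A quick back-of-the-envelope calculation shows how the two issues pull against each other. The Python sketch below assumes one list call per resource type per region for each polling sweep and ignores pagination. The numbers, including the request budget, are made up for illustration and are not real provider limits.

```python
import math


def polling_calls_per_hour(regions, resource_types, interval_seconds):
    """API calls per hour when every sweep lists each resource type in each region."""
    calls_per_sweep = regions * resource_types
    sweeps_per_hour = 3600 / interval_seconds
    return calls_per_sweep * sweeps_per_hour


def min_interval_seconds(regions, resource_types, max_calls_per_hour):
    """Shortest polling interval that stays within an hourly request budget."""
    calls_per_sweep = regions * resource_types
    return math.ceil(3600 * calls_per_sweep / max_calls_per_hour)


# Hypothetical account: 10 regions, 20 resource types, polled every minute.
calls = polling_calls_per_hour(regions=10, resource_types=20, interval_seconds=60)
print(calls)

# Hypothetical budget of 3000 calls per hour reserved for the polling tool.
interval = min_interval_seconds(regions=10, resource_types=20, max_calls_per_hour=3000)
print(interval)
```

Under these assumptions, polling every minute costs 12,000 calls per hour, while staying inside the budget forces a sweep only every 240 seconds, so a change can go unnoticed for up to four minutes.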

Now that we understand both the key features and pitfalls, let’s look at a few potential tools in the next section.

Potential Tools for Cloud Governance and Compliance

Some of the potential solutions out there are: Cloud Custodian, VMware Secure State, Dome9, RedLock and a few others.

One of the OpenSource Tools is:

Cloud Custodian is an Open Source tool that unifies dozens of tools and scripts most organizations use for detecting violations and taking actions on them.

Key Features:

1. It provides a simple, lightweight (written in Python) rules engine that allows users to define policies in “YAML” format.

2. It can evaluate cloud objects for violations based on the rule-set. For example: a Security Group has port 22 open to the world (0.0.0.0/0) or an S3 bucket is not encrypted.

3. Integrates with serverless services from cloud providers to provide real-time enforcement.

4. It has a CLI that executes all these commands. (No UI yet)

If you are looking for an Enterprise Grade Solution then,

VMware Secure State is a SaaS Offering.

Key Features:
1. In addition to the features provided by Cloud Custodian, VMware Secure State can detect and remediate complex policy violations, provide a view of how resources are “connected” (dependencies) as well as support various standards like PCI, NIST, CIS out of the box.

2. It also provides RBAC (Role based Access Control), Team management and Report generation, which is very useful to report overall security and compliance posture to the management (like CSOs).

3. It integrates with cloud native services like CloudTrail, GuardDuty and others to correlate and analyze the data and provide the most valuable insights.

4. It provides a Dashboard as well as a CLI.

5. It provides real-time alerts while making API calls to your infrastructure only on an “as needed” basis, thereby avoiding the pitfall from the previous section.

In my next blog (Part 2), we will focus on step by step configuration and setup of the “Response Process” starting from deployment pipelines (using Jenkins and Terraform) and post-deployment pipelines (governance and compliance monitoring).

Conclusion

Security in public cloud is quite complicated. There is no “Silver Bullet” when it comes to getting the policy right in the first run. Designing your strategy and policies is an iterative process, and the policies will change as the services in cloud change.

In all of this, the methodology should always be a “Continuous Security Model”. It enables DevOps and SecOps teams to work well together.