Security breaches aren’t usually polite, well-behaved events. They show up unannounced, do things you don’t want them to do and can leave an unpleasantly large amount of damage in their wake. This is as true in DevOps as it is in more traditional methods of software development and delivery.
In order to manage security-related incidents and to limit the amount of damage they cause, you need to set up a framework for incident response. Vulnerabilities can exist at any point in the continuous integration and continuous delivery (CI/CD) process, so this framework should be fully integrated into your CI/CD pipeline.
The first step in setting up an incident response framework is to identify the potential points of vulnerability in your system. What points of attack or weakness does a CI/CD pipeline typically present?
External connections associated with pipeline infrastructure: These include external resources used in development, links to services used by the application and links to cloud and other platforms.
Container repositories: Inadequately managed repositories can be compromised, typically with malicious payloads.
Development and pipeline management tools: These may have vulnerabilities of their own, or may carry malicious payloads.
The application code itself: It would be nice to deliver perfect code 100% of the time, but the truth is that your application may contain undetected vulnerabilities even after it goes live.
Create a map of possible points of attack in your pipeline, rating each one by its degree of vulnerability and its potential for damage. The map doesn’t have to be precise. In practice, it probably won’t be much more than approximate, and that’s OK: an approximate picture of your system’s vulnerabilities is likely to be more useful than one with an unrealistic degree of precision, because security-related issues by their nature involve a large number of unknowns.
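A rough map like this can be kept as plain data. The sketch below is only an illustration: the `AttackPoint` class, the 1-to-5 scales, the multiplication used for the risk score and every rating shown are assumptions made for the example, not values taken from any real pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AttackPoint:
    """One entry in an approximate map of pipeline attack points (hypothetical)."""
    name: str
    vulnerability: int  # rough estimate, 1 (low) to 5 (high)
    damage: int         # rough estimate, 1 (low) to 5 (high)

    @property
    def risk(self) -> int:
        # Assumed scoring rule: likelihood times impact.
        return self.vulnerability * self.damage


def rank_attack_points(points):
    """Return the points ordered from highest to lowest risk score."""
    return sorted(points, key=lambda p: p.risk, reverse=True)


# Illustrative ratings only; replace them with your own estimates.
PIPELINE_MAP = [
    AttackPoint("external connections", vulnerability=3, damage=4),
    AttackPoint("container repositories", vulnerability=4, damage=4),
    AttackPoint("pipeline management tools", vulnerability=2, damage=5),
    AttackPoint("application code", vulnerability=3, damage=3),
]

for point in rank_attack_points(PIPELINE_MAP):
    print(f"{point.name}: {point.risk}")
```

Coarse integer ratings are deliberate here: they keep the map approximate and easy to revise as you learn more about your system.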
A large part of any incident response framework should be centered around prevention. Prevention is forward defense and as such, it is as much a part of your response system as the reactive measures that are required if and when your system’s defenses are breached.
Strict control of access is a key element of any forward defense strategy. Use role-based access wherever it’s practical. You should always limit a given role’s privileges and access to those required for that specific role to perform the tasks assigned to it – this includes development and infrastructure resources. Look at what each tool needs to be able to do and give it a role with the required privileges, and nothing more.
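As a minimal sketch of that idea, the check below denies everything that a role has not been explicitly granted. The role names, permission strings and the `is_allowed` helper are all hypothetical; a real pipeline would normally rely on the access controls built into its own tools rather than a hand-rolled table.

```python
# Hypothetical role table: each role gets only what its tasks require.
ROLE_PERMISSIONS = {
    "ci-runner": frozenset({"repo:read", "artifact:write"}),
    "deployer": frozenset({"artifact:read", "cluster:deploy"}),
    "developer": frozenset({"repo:read", "repo:write"}),
}


def is_allowed(role: str, permission: str) -> bool:
    """Deny by default: unknown roles and unlisted permissions are refused."""
    return permission in ROLE_PERMISSIONS.get(role, frozenset())


print(is_allowed("ci-runner", "artifact:write"))   # the build role may publish artifacts
print(is_allowed("ci-runner", "cluster:deploy"))   # but it may not deploy
```

The important design choice is the default: a role that is missing from the table gets no access at all, rather than falling back to some broad set of privileges.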
Always change default admin account passwords. Even a single admin password left unchanged is a possible point of entry, and in a system with inadequate or inconsistently applied role-based access, it presents a potential opportunity for privilege escalation.
You should routinely scan container images for vulnerabilities, malware and outdated components. Scan and monitor infrastructure, dependencies and third-party resources. A single tool, infrastructure element, or container image with a vulnerability or a malicious payload can compromise your entire pipeline.
Scan your application code as well and include security testing at all stages of development. Your test regime should ideally give security at least as high a priority as key functional and performance issues. This is true for testing at the level of both individual components and the system as a whole.
Your inner lines of defense should include whitelist-based management to control which container sources, dependencies and tools are allowable. For all of your tools and infrastructure elements, you need to keep track of known and newly-reported vulnerabilities, and apply the appropriate security measures. You should also understand the built-in security features for key infrastructure tools (such as Jenkins and Kubernetes) and use them where appropriate.
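A whitelist check for container sources might look something like the sketch below. The registry hostnames are placeholders, and `registry_of` and `is_image_allowed` are hypothetical helpers. The parsing follows the common Docker convention that the first path component of an image reference is a registry host only if it contains a dot or a colon, or is `localhost`; anything else is assumed to come from Docker Hub.

```python
# Placeholder hostnames; list the registries your team actually trusts.
ALLOWED_REGISTRIES = {"registry.example.com", "mirror.example.com"}


def registry_of(image_ref: str) -> str:
    """Return the registry host an image reference points at."""
    first, sep, _ = image_ref.partition("/")
    # With no "/" at all (e.g. "ubuntu:22.04") there is no registry component.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first
    return "docker.io"  # assumed default when no registry is given


def is_image_allowed(image_ref: str) -> bool:
    """Allow an image only if its registry is on the whitelist."""
    return registry_of(image_ref) in ALLOWED_REGISTRIES


print(is_image_allowed("registry.example.com/team/app:1.4"))
print(is_image_allowed("ubuntu:22.04"))
```

A check like this only controls where images come from. It does not replace scanning the images themselves.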
Ultimately, your best in-depth defense is to shift the culture of your entire pipeline from DevOps to DevSecOps – building security into the CI/CD pipeline, from early design stages through code design, testing, deployment and ongoing operations.
As important as prevention is, you can’t count on outer defenses alone. If and when those are breached, you need to mount a strong response.
The response part of your incident response framework starts with monitoring. Monitor logs, registries and any other key points where system information is recorded or where system and infrastructure settings are stored. This should include the control documents for development, deployment, container management, and infrastructure tools.
Your monitoring system should pick up suspicious patterns of user activity, traffic, page access, failed logins, errors or other anomalies, as well as obvious system failures or break-ins. If you can detect an attempted breach before it is successful, you are in a much better position to maintain your system’s integrity than if you find out about it only after the fact.
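To make one of those patterns concrete, here is a small sketch that flags a user after repeated failed logins inside a sliding time window. The `FailedLoginMonitor` class, the threshold and the window length are assumptions chosen for the example, and the code assumes events arrive in time order. A production monitoring tool would track many more signals than this.

```python
from collections import defaultdict, deque


class FailedLoginMonitor:
    """Flag users with too many failed logins in a sliding window (hypothetical)."""

    def __init__(self, threshold: int = 5, window_seconds: float = 60.0):
        self.threshold = threshold
        self.window = window_seconds
        self._failures = defaultdict(deque)  # user -> timestamps of recent failures

    def record_failure(self, user: str, timestamp: float) -> bool:
        """Record one failed login; return True if the user now looks suspicious."""
        events = self._failures[user]
        events.append(timestamp)
        # Drop failures that have aged out of the window.
        while events and timestamp - events[0] > self.window:
            events.popleft()
        return len(events) >= self.threshold


monitor = FailedLoginMonitor(threshold=3, window_seconds=60)
for t in (0, 10, 20):
    suspicious = monitor.record_failure("alice", t)
print(suspicious)  # True: three failures within 60 seconds
```

Because the check runs on every failure, it can raise a flag while an attempt is still in progress, which is the point of detecting a breach before it succeeds.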
A security breach is a real-time event, and in order to respond effectively in real-time, you need a first-rate alert system – which is much more than just a phone service with an on-call list. Alerting should start with automated, rule-based filtering and triage.
The alert system needs to be able to filter the monitoring data, identify possible incidents, determine their basic nature, and apply rules based on that information. These rules should include initial prioritization and basic alert routing so the system knows which teams, roles or individuals should be alerted and which alerts should generate a call, rather than simply being reported for later action.
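Rule-based triage of this kind can be sketched as an ordered list of rules, where the first match decides the priority, the team to notify and whether anyone gets called. Everything in the example is hypothetical: the event fields, the rule set, the team names and the `triage` function are illustrations, not the behavior of any particular alerting product.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass(frozen=True)
class Rule:
    """One triage rule: a match test plus the routing decision it implies."""
    matches: Callable[[dict], bool]
    priority: str
    team: str
    page: bool  # True: call someone now; False: report for later action


# Illustrative rules, checked in order. The first match wins.
RULES = [
    Rule(lambda e: e.get("type") == "breach", "critical", "security", True),
    Rule(lambda e: e.get("type") == "failed_login" and e.get("count", 0) >= 10,
         "high", "security", True),
    Rule(lambda e: e.get("type") == "error", "low", "platform", False),
]


def triage(event: dict) -> Optional[dict]:
    """Return a routing decision, or None if the event is filtered out."""
    for rule in RULES:
        if rule.matches(event):
            return {"priority": rule.priority, "team": rule.team, "page": rule.page}
    return None


print(triage({"type": "failed_login", "count": 25}))
print(triage({"type": "failed_login", "count": 2}))  # None: below the threshold
```

Keeping the rules as data makes it easy to review and adjust prioritization and routing without touching the triage logic.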
A first-rate alert system should also include a dashboard and reporting system allowing responders to provide and view real-time incident information during the response process, along with facilities for such things as on-demand conference calls and one-to-one communication.
The investigation and remediation stages also need to be part of the basic incident response framework. To a degree, each incident dictates the nature of its own response. But, in general, investigation and remediation should include the following elements:
Identifying the source of the breach: Depending on the nature of the incident, this step may have lower priority than handling immediate damage. Its priority should still be high, however, and it’s likely to be the key to stopping an ongoing attack.
Assessing the damage: This includes finding out whether live service operations have been affected, and to what extent: performance and functionality only, compromised user data, compromised user systems, etc. It also includes determining the sensitivity of any compromised data. These factors can have a major impact on your immediate response priorities.
Deciding which systems to shut down: Determine which systems need to be shut down or disconnected, and which can be kept running and connected, including which can be kept public-facing. If parts of your live operation do need to be shut down, switch to uncompromised backup/alternate systems whenever possible.
Fixing the vulnerability and repairing the damage: Don’t just patch the hole and undo the visible damage. You need to understand the nature of the breach, fix all associated vulnerabilities, and repair all of the damage. An ad-hoc fix is technical debt, and the price may come in the form of additional security problems. Post-incident reviews can help you expose where to dedicate resources and repair the real core issue(s) of application delivery and upkeep.
If you can’t fix the underlying problems as part of the response, the incident report should include a full description and analysis of those problems, and the appropriate teams should be notified.
Recording the details of the incident, the response, any repairs, and any unpatched vulnerabilities is an intrinsic part of the incident response framework. Incident records provide valuable information for later responders, they alert designers and developers to vulnerabilities and ad-hoc fixes that need attention, and they allow engineering and IT teams to map out potential system-wide, structural problems.
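One way to keep such records consistent is to give them a fixed shape. The sketch below is an assumption about what that shape could be: the `IncidentRecord` class and its field names simply mirror the items listed above, and the sample incident is invented for illustration.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class IncidentRecord:
    """Hypothetical record covering the incident, the response and what remains open."""
    summary: str
    response_steps: list = field(default_factory=list)
    repairs: list = field(default_factory=list)
    unpatched_vulnerabilities: list = field(default_factory=list)

    def needs_follow_up(self) -> bool:
        # Anything left unpatched should be routed to the appropriate teams.
        return bool(self.unpatched_vulnerabilities)

    def to_json(self) -> str:
        return json.dumps(asdict(self), indent=2)


# Invented example data.
record = IncidentRecord(
    summary="Unrecognized image pushed to the container repository",
    response_steps=["Revoked push credentials", "Removed the image"],
    repairs=["Rebuilt affected services from known-good images"],
    unpatched_vulnerabilities=["Repository still accepts unsigned images"],
)
print(record.needs_follow_up())
print(record.to_json())
```

Storing records as structured data rather than free text makes it easier to search past incidents and to spot the same unpatched problem showing up more than once.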
Security breaches are chaos in action – and your best defense against chaos is a strong, flexible framework for response. You need to start putting that framework in place right now.
Michael Churchman started out in the early years of the game industry as a scriptwriter, editor, and producer. During the 90s, he worked in the high-pressure bundled software industry, where the move from waterfall to faster release was well under way, and near-continuous release cycles and automated deployment were already de facto standards. There, he developed a semi-automated system for managing localization in over fifteen languages. For over a decade, he has been involved in the analysis of software development processes and related engineering management issues. He is a regular Fixate.io contributor.