Technical Remedies for Patching Toil

Patching is toil? Here are some technical solutions that can make you a security architecture hero in your organization.

Utilize Prioritization to Gain Time for Automation

In recent discussions and articles, there has been a growing focus on strategies that help teams rank the priority of patching tasks. Many of these tasks are best deferred, allowing teams to concentrate on the patches that genuinely put your system at risk, such as vulnerabilities that are known to be actively exploited. Prioritization is the name of the game here: it enables you to zero in on what truly matters and helps you make the most of your resources.

If you're a security analyst or engineer, prioritization frameworks are invaluable in your work. Prioritization becomes necessary when patching is difficult: when there are patch conflicts, when testing is hard, when deployments are risky, and in several other scenarios. Automating prioritization is a win, and decision trees have been presented as a potential solution by several vendors and teams. Prioritization decisions are nonetheless difficult in many instances. We should also be aware, however, that a large share of patching problems can be eliminated outright by making patching so easy that only a few cases need prioritization intervention at all.

You may even find two different user groups within security are served here. Security managers and security analysts hugely appreciate the benefits of automated prioritization for the very complex task of managing patch priorities in a large system (one with hundreds of applications and hundreds of individual teams involved, say). Similarly, engineering leaders with infrastructure and application teams will often value anything that eases the burden of patching.

Both of these perspectives, security leadership and engineering leadership, are vital to the life of a large infrastructure and to the success of the security mission in facilitating the objectives of the business.

If automation means you can avoid directing a team to patch, or being directed to patch, that's a win: the talk never has to happen. What most of us would really prefer is relatively easy operational excellence for our teams, making them successful so that only exceptions need to be ranked and discussed in detail, and when these are discussed they are likely to be genuinely relevant risk items.

Don't Overlook the Simplest Solution - Lean Towards Simple Automated Patching Unless Prevented - Rank the Priority of Exceptions to this Rule

A crucial question to ask is, "Is this a low-risk system with redundant backups in place?" If the answer is yes, it's often more prudent to implement basic, automated patching at staggered intervals across redundant copies of your application, perhaps even by means of a simple cron script. Don't underestimate the power of simplicity. You will, however, need a monitoring plan in case a patch disrupts a redundant copy of your application. In low-risk scenarios, a straightforward, imperfect process is superior to waiting for a more sophisticated implementation of automated patching.

Biasing toward automated patching in the absence of significant risk can save systems from real threats. Because lower-risk systems often end up neglected, they can become a painful source of vulnerabilities that lead to lateral movement. This approach may not work for high-risk applications or systems lacking reliable redundancy. In our experience, imaging systems every four hours effectively safeguards low-risk systems and allows for easy rollback of problematic patches. Neglecting the security of a low-risk system with a known exploitable vulnerability can lead to dire consequences, while rolling back a patch is often straightforward when you keep regular system images or continuous backups. Some Linux distros provide automated update capabilities (e.g., dnf-automatic or yum-cron), and numerous approaches let you build update packs that are consistent across your fleet of services.
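To make the staggering concrete, here is a minimal sketch of how redundant copies might be assigned non-overlapping patch windows. The replica numbering, the 24-hour interval, and the command name are all illustrative assumptions, not a prescription:

```python
def patch_window(replica_index: int, replicas: int, interval_hours: int = 24) -> int:
    """Hour-of-day offset at which this replica should run automated
    patching, so redundant copies never patch at the same time."""
    return (replica_index % replicas) * (interval_hours // replicas)

def cron_line(replica_index: int, replicas: int, command: str) -> str:
    """Render the staggered schedule as a crontab entry."""
    return f"0 {patch_window(replica_index, replicas)} * * * {command}"
```

With three replicas, the windows land at hours 0, 8, and 16, so a bad patch disrupts at most one copy at a time while your monitoring catches it.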

Sometimes the direct approach is highly beneficial, particularly when redundancy and backups already give you an advantage against potential risks and bugs.

In many production cases you will want multiple redundant copies of your application running in multiple availability zones, so that you keep high availability during patching campaigns and retain the ability to deploy more than one version of your code at any time.

Who Am I to Speak of Risk, Priorities and Complex Systems

Before delving further, it's worth asking whether the author has the experience required to make potentially risky patch-management recommendations and trade-offs between scenarios, such as biasing toward automated patching. The author has held responsibilities in both developing and operating security applications, as well as overseeing vulnerability management technicalities for a government infrastructure with an annual financial impact exceeding $100 billion. Similar responsibilities were undertaken in the financial sector. This firsthand exposure has given the author a perspective on both successful and challenging scenarios, from various angles. That said, you should weigh this advice against how your infrastructure works and the risks you face.

Decision-Making and Prioritization

Recent discussions, technological advancements, and conferences have explored the management and prioritization of vulnerabilities for patching. Here are some insights I consider critical:

Opportunities for automating decision-making can be found in new technologies that use Bayesian decision trees to determine when a patch is vital to the safety of the system, or conversely when a vulnerability can be deferred to routine periodic patching cycles.
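As a toy illustration of the idea (not any vendor's actual tree), a decision function might interrupt teams only when the signals justify it and defer everything else to the routine cycle. The fields and thresholds below are assumptions for the sketch; real frameworks such as SSVC use richer inputs:

```python
from dataclasses import dataclass

@dataclass
class Vuln:
    actively_exploited: bool   # e.g., listed in a known-exploited catalog
    internet_facing: bool      # affected service reachable from outside
    cvss: float                # severity score

def patch_decision(v: Vuln) -> str:
    """A toy decision tree: patch immediately only when the signals
    justify interrupting teams; otherwise defer to the routine cycle."""
    if v.actively_exploited:
        return "patch-now"
    if v.internet_facing and v.cvss >= 9.0:
        return "patch-now"
    return "defer-to-routine-cycle"
```

The value of automating this is that the common case produces no meeting and no ticket triage; only the "patch-now" outcomes consume human attention.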

No-Decision Patching

In certain cases, you can create an architecture where patching is so straightforward and low-risk that many difficult decisions and challenging prioritization tasks become unnecessary. Instead, you can channel your decision-making effort toward difficult, high-value patching opportunities where your decision matrix or tree yields the most significant security benefits, for instance a patch that is known to conflict or produce errors, or that requires changes to upstream code. In these situations, ranking priorities can add huge value by simplifying the problems we face.

Where possible, making patching easy, with no decision involved, is a win. Here are some of the ways patching can be made simple and automatic.

The peregrine falcon is the fastest bird in nature, and speed helps software teams too. I have seen teams improve their ability to ship new business features faster with the help of integration and deployment automation that was originally built, in part, to aid vulnerability management. Building this ability to go faster takes some effort, but that additional speed really helps when you uncover a zero-day vulnerability and need to test and patch, or recover, quickly.

Architectural Data Protection - Separating Front-End and Application Patching from Data - 12 Factor Apps

First, protect your data and make your application server stateless. Stateless process design, one of the widely recognized 12-factor app guidelines (see 12factor.net, stateless processes), can be a valuable ally in this endeavor. These principles can help transform your organization into one with a highly adaptable architecture that can be easily patched. Additionally, it's essential to maintain high-quality, immutable backups of your data (these backups also serve as a safeguard against ransomware). Incremental progress toward this goal holds great value: don't feel discouraged if your architecture or application does not yet meet all of the recommended factors. It is normal for systems to take effort and time to reach these objectives, and steady progress should be your goal.
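The heart of the stateless-process idea is that nothing a request needs lives in the app's process memory, so any replica can be patched and restarted at will. A minimal sketch, with an in-memory class standing in for an external store such as Redis or a database:

```python
class ExternalSessionStore:
    """Stands in for Redis, memcached, or a database: session state lives
    outside the app process, so any replica can be patched and restarted
    without losing it."""
    def __init__(self):
        self._data = {}

    def put(self, session_id, value):
        self._data[session_id] = value

    def get(self, session_id):
        return self._data.get(session_id)

def handle_request(store, session_id):
    """A stateless handler: everything it needs comes from the request
    and the external store, never from process memory."""
    session = store.get(session_id) or {"visits": 0}
    session["visits"] += 1
    store.put(session_id, session)
    return session["visits"]
```

Because the handler holds no state between calls, a patched replica can take over mid-session: the second request succeeds even if a different process serves it.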

Make sure you store your data separately from the application, which needs to be regularly patched and updated. Separating data and application is a principle of 12-factor apps.

Simplify - Remove Operating System and Application Components that you Don't Use

If you can avoid deploying a component, you won't have to patch it or discuss its priority ever again. For instance, if your system doesn't use dnsmasq, remove it from your base machine image or Docker container.

Testing Automation

The primary risk associated with patching often stems from patch compatibility issues with existing code or customer code. To mitigate this risk, routinely build tests covering key functionality. While testing automation is continuously improving, it's worth noting that the value of a test suite can surpass that of the production code itself, since tests can serve to advance or rebuild the production code when needed for new technology stacks or approaches. Although testing code is valuable, it's also expensive to develop and maintain, so strive for appropriately sized and targeted testing wherever possible.
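A post-patch smoke check can be very small and still catch the common breakages. The sketch below checks a few critical endpoints; the endpoint paths are invented for illustration, and the status map would come from whatever HTTP client your environment uses:

```python
CRITICAL_ENDPOINTS = ("/healthz", "/login", "/api/v1/orders")  # illustrative

def smoke_check(responses: dict) -> list:
    """Return the critical endpoints that failed a post-patch smoke test.

    `responses` maps endpoint path -> HTTP status code, as gathered by
    whatever client your environment provides (requests, curl, etc.).
    An endpoint missing from the map counts as a failure."""
    return [path for path in CRITICAL_ENDPOINTS
            if responses.get(path) != 200]
```

Wiring this into the same automation that applies the patch means a failed check can trigger rollback before anyone files a ticket.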

Continuous Integration and Continuous Deployment

Minimize the human effort required for deployment decisions by embedding automation into your deployment process, enabling continuous integration and continuous deployment wherever feasible. Don't forget that the continuous integration service is itself often a high-value target, so patching and protecting it are of high value for your team.

Crows, like humans, feel more optimistic once they have built a tool to tackle a problem. So if you are facing challenges with patching on a large system, you might consider building or reusing some tooling to help you. Commonplace tools include Jenkins, CircleCI, and GitHub Actions, and each major cloud platform offers its own continuous integration and continuous deployment options. Various vendors offer implementations of Stakeholder-Specific Vulnerability Categorization (SSVC) to help automate assessing the priority of vulnerabilities.

Incremental Deployment, Feature Flags, and Production Code Path Validation

In systems serving a large audience, incremental feature deployment is often achievable and allows new features to ship in code before the new functionality is exposed. A method we endorse is deploying a version of the application to production and selectively directing traffic to it for validation before rolling it out to all users.

To execute this, you'll need the ability to control traffic routing to specific versions of your application. Numerous technologies can aid in this process; your current application load balancer may offer support, or you may need to introduce new routing capabilities for incremental deployment and A/B testing of deployments. Implementing feature flags can significantly simplify the process of enabling or disabling code behaviors for smoother code deployment. We highly recommend the integration of feature flags into your systems.
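Feature flags need not be elaborate to be useful. One common pattern, sketched here with invented flag names, hashes the user into a stable bucket so a newly patched code path can be ramped from 0% to 100% without redeploying:

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket users so a feature (or newly patched
    code path) can be ramped gradually. The same user always lands in
    the same bucket for a given flag, so their experience is stable."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Raising `rollout_percent` in your flag store (rather than in code) is what makes enabling or disabling behavior a configuration change instead of a deployment.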

Automate and script your deployment process, use feature flags.

Comprehensive Monitoring and Tracing

Maintaining detailed monitoring and tracing is crucial to resolving any issues related to code deployment. One thing that may not be obvious is the value of being able to mark particular traffic for routing to a particular version of your application, for instance by setting a cookie. This lets you trace a test of a new version of a service deep within your infrastructure.
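The cookie-marking idea can be sketched in a few lines. The cookie name and version labels below are assumptions for illustration; the key design point is that the marker is validated against known deployments and forwarded on internal calls, so test traffic stays pinned to the new version all the way down the call graph:

```python
CANARY_COOKIE = "app-version"  # assumed cookie name, purely illustrative

def route_version(cookies: dict, known_versions: set, default: str = "stable") -> str:
    """Return the deployment a request should be routed to.

    A request carrying the marker cookie is pinned to that version;
    anything unrecognized falls back to the default so a stale or
    malicious cookie can't route traffic to a nonexistent build."""
    requested = cookies.get(CANARY_COOKIE, default)
    return requested if requested in known_versions else default
```

Your tracing system can then tag spans with the resolved version, making it easy to compare a canary's behavior against the stable path request by request.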

Make sure your monitoring and tracing is in place so you can detect problems and recover easily.

Retries

Enhance the resilience of your system by ensuring your client code supports retrying failed attempts. This not only aids in safely testing patching incompatibilities but also safeguards against various errors not related to patching. The incremental backoff of retries can further enhance system reliability by reducing retry load on a service that is struggling.
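A minimal sketch of retries with exponential backoff follows. Production clients usually add jitter on top of this schedule to avoid synchronized retry storms, and the attempt count and base delay here are illustrative defaults:

```python
import time

def backoff_schedule(attempts: int, base: float = 0.5, cap: float = 30.0) -> list:
    """Deterministic exponential backoff delays in seconds: each failed
    attempt waits up to twice as long as the last, capped so a long
    outage doesn't produce hour-long sleeps."""
    return [min(cap, base * (2 ** i)) for i in range(attempts)]

def call_with_retries(op, attempts: int = 4, sleep=time.sleep):
    """Call `op`, retrying on exception per the backoff schedule.
    Re-raises the last exception once attempts are exhausted."""
    delays = backoff_schedule(attempts)
    for i, delay in enumerate(delays):
        try:
            return op()
        except Exception:
            if i == len(delays) - 1:
                raise
            sleep(delay)
```

Injecting `sleep` as a parameter keeps the helper testable; in production you leave the default and let real time pass between attempts.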

In many cases, your organization may not yet have these capabilities in place. Building and testing these features is not always straightforward or cost-free. However, the value they bring to stable systems in terms of operational excellence is immense. The key is to celebrate and acknowledge the teams that achieve operational excellence.

Roll Back Your Patches when you Find Errors

In some instances, the risk incurred by not patching far exceeds the risk of patching. Implementing a robust imaging strategy can facilitate seamless rollbacks when errors occur. Automating your rollback and/or traffic-management strategy makes the political and operational process far easier. When you know you have a routine, easy rollback path, it eases the pushback you may otherwise face against patching when the process is hard or carries risks that are difficult to control.
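Automating the rollback decision itself can be as simple as comparing the patched copy's error rate against the pre-patch baseline. The tolerance value below is an invented placeholder; you would tune it from your own monitoring data:

```python
def should_roll_back(baseline_error_rate: float,
                     canary_error_rate: float,
                     tolerance: float = 0.01) -> bool:
    """Roll a patch back when the patched (canary) copy's error rate
    exceeds the pre-patch baseline by more than `tolerance`.

    Rates are fractions of requests (0.02 == 2% errors), as reported
    by whatever monitoring system you already run."""
    return canary_error_rate > baseline_error_rate + tolerance
```

Hooking a check like this to your traffic router turns "should we roll back?" from a meeting into an alert that has already been acted on.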

Centralize

You may find benefits from centralizing your base image creation so that it is regularly updated, using a central standard set of docker images that have been hardened. This can spread the effort among similar teams. 

Centralize your base image creation capability where possible so that similar teams can reuse the effort involved in base image creation and patching.

Conclusion

In conclusion, prioritization, automation, and thoughtful architectural choices can significantly enhance the security and efficiency of your systems. By applying these principles and leveraging modern technologies, you can minimize risks, streamline processes, and ultimately achieve operational excellence. 

Don't be discouraged: your system may not have, or need, all of these approaches, but with time you can improve the ease with which you keep your infrastructure up to date and avoid needless toil.

Acknowledgments

The author acknowledges the guidance and inspiration received from prominent figures in the field, including Adrian Cockcroft (Architect at Netflix), Erik Maland (Twitter and USDS), Adam Wiggins (Heroku, 12-factor app), and numerous colleagues and coworkers who have played pivotal roles in operating large-scale government systems over the years, including Stephen Shaffer, James Connor (Corbalt), Jen Leech (Truss), Ketan Patel (Archesys), and John Booth (Leadership).

Thanks to all who have helped and contributed to operational patching efforts over the years!