It's a good feeling to look at the security reports for a large infrastructure and see good results, or progress toward them. Having the capability to detect and remediate security problems in near real time is a huge benefit in getting there.
Here are some tips I learned while working as a security technical lead on very large infrastructures. By large I mean between 100 million and 500 million dollars in annual cloud spend in these cases, but these tips could also help you become a standout contributor to many types of infrastructures at many scales.
I personally worked on code that performed hundreds of thousands of remediations in production systems, and together with the team, including many security staff and engineers, we built a significant, measured improvement in security posture. This work is not without resistance; you shouldn't expect continuous remediation to be easy or to go unopposed. I don't want to lead you to believe there are no political, personal, or technical hurdles to overcome, but I think it is both feasible and extremely valuable for security. The rewards are large, in my opinion.
fig: You can be as at home hunting and remediating complex security issues as a cat in a jungle. These principles, tips, and tricks will make you a formidable force in hunting down issues and protecting your systems alongside your colleagues.
The organization in these cases had many cloud products and development efforts spread across hundreds of teams. Security remediation at this scale takes sensitivity to the objectives of those teams, as well as to the vital protective need for security engineers as a specialization. Fortunately, we found there were big wins to be had in helping teams remediate. The help provided through automation often relieved a burden on teams and allowed them to focus on the unique security needs of their product, while our central team targeted common issues seen in larger volumes.
Some teams were very keen to have hands on every change made, though these were a smaller set of individual teams. Getting to know your teams and building a friendly connection to security is one of the wins that comes with being able to offer broad, quick, safe, real-time fixes to security misconfiguration problems. To give you an idea of the scale at which this was done, one of our fixes reconfigured about 15 thousand S3 storage buckets to meet a new standard. There are good wins to be had from engineering automation that helps your teams, since the volume of work would have been significant if individual engineers had performed it by hand. As you might imagine, this set of buckets contained large volumes of data (many petabytes) and backed live production systems, so it had to be handled with care.
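To give a feel for what a bulk pass like that can look like, here is a minimal sketch in Python using boto3. It assumes the standard being enforced was default bucket encryption, which is purely illustrative since the actual standard isn't detailed above, and it includes a dry-run mode because changes to live production buckets deserve that kind of care.

```python
"""Minimal sketch of a bulk S3 remediation pass (illustrative only).

Assumes the standard being enforced is default server-side encryption;
the real campaign's standard and targeting logic are not shown here.
"""
import boto3
from botocore.exceptions import ClientError

DESIRED_RULES = [{"ApplyServerSideEncryptionByDefault": {"SSEAlgorithm": "AES256"}}]


def remediate_buckets(dry_run: bool = True) -> None:
    s3 = boto3.client("s3")
    for bucket in s3.list_buckets()["Buckets"]:
        name = bucket["Name"]
        try:
            # Inspect the actual deployed configuration first.
            s3.get_bucket_encryption(Bucket=name)
            continue  # a default-encryption rule already exists, skip
        except ClientError:
            pass  # no default encryption configured yet
        if dry_run:
            print(f"[dry-run] would set default encryption on {name}")
        else:
            s3.put_bucket_encryption(
                Bucket=name,
                ServerSideEncryptionConfiguration={"Rules": DESIRED_RULES},
            )
            print(f"updated {name}")


if __name__ == "__main__":
    remediate_buckets(dry_run=True)  # flip to False only after review and approval
```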
fig: There are big opportunities in security engineering teams helping fix problems alongside application security teams, and we have seen big value in building the security engineering specialty into our staffing. Security engineering teams can complement app development teams in tackling security remediations that are faced by many teams.
Here are some of the tips and tricks gained in the process of exploring and engineering many campaigns of security fixes:
Automated fixes are the way to go - with some exceptions
Event-oriented automated fixes are what I came to prefer after experience with this
Real time fixes are a win - we used a subscription model in which teams could subscribe to particular automated remediations
Divide your target systems into risk categories and redundancy categories so that you can run on low risk systems first
We developed a tagging and selector system that could flexibly target particular systems and cloud accounts. Your tools and code may offer various options (feel free to reach out to me if you would like details). A minimal sketch of how these pieces can fit together follows this list.
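As a concrete illustration of the event-oriented approach described in this list, here is a minimal sketch of a real-time remediation handler, assuming an AWS Lambda function triggered by S3 configuration-change events. The subscription registry, risk tiers, and the specific fix are hypothetical stand-ins; our actual tagging and selector system is not reproduced here.

```python
"""Minimal sketch of an event-driven remediation handler (illustrative only).

Assumes a Lambda function subscribed to S3 configuration-change events; the
subscription registry and risk tiers below are hypothetical stand-ins for a
real tagging and selector system.
"""
import boto3

# Hypothetical: accounts whose teams subscribed to this remediation,
# mapped to the risk tier each account was placed in.
SUBSCRIBED_ACCOUNTS = {"111111111111": "low-risk", "222222222222": "low-risk"}
ENABLED_TIERS = {"low-risk"}  # widen to higher-risk tiers only after monitoring


def lambda_handler(event, context):
    account = event.get("account")
    bucket = event.get("detail", {}).get("requestParameters", {}).get("bucketName")

    tier = SUBSCRIBED_ACCOUNTS.get(account)
    if not bucket or tier not in ENABLED_TIERS:
        return {"status": "skipped", "account": account}

    # Re-assert the desired configuration as soon as drift is detected.
    boto3.client("s3").put_public_access_block(
        Bucket=bucket,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
    return {"status": "remediated", "account": account, "bucket": bucket}
```

Because the handler runs on each change event rather than on a schedule, a risky configuration is corrected within seconds of appearing, and the subscription and tier checks keep the rollout limited to teams and systems that opted in.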
Fig: Automation is vital. Real time automated remediations allow you to quickly respond to problems.
Manual fixes can be a good idea for low-volume items that do not yet have automation. It helps to have a great SOC team that can apply some fixes for you, which we did.
If a fix is applied manually, it is likely to revert as developers make changes.
When fixes are applied manually, know that you will have to manage the morale of your crew, because you are going to see regressions as teams deploy new code or capabilities.
Nonetheless, manual fixes are a good bridge.
Build documentation when fixing manually, and ask teams to contribute to it when they apply fixes themselves.
Divide your effort into campaigns
Campaigns of specific remediations can help focus efforts and build strong teaming with developers and CISOs.
This allows you to report progress and celebrate the teams that have contributed to that progress with their management and the overall infra management.
You may not have enough engineering resources to quickly implement all the fixes your particular infrastructure needs, so pick a smaller subset initially to gauge the challenge, mobilize teams, build communication channels, and so forth.
Campaign targets - in general we recommend picking high-risk, high-volume security issues to remediate quickly via automation, and higher-risk, very low-volume items for initial manual remediation.
Fig: Tackle a large infrastructure by dividing issues into campaigns. This allows you to make reportable progress without getting bogged down or overwhelming your team or the application teams.
Communicate often and via multiple channels
The various developer teams are vital to your success. In most cases they'll be happy not to have to deal with some fixes so they can focus on their deliverables. In some cases, however, they will want more control over what changes are made. Automated fixes require great communication and trust-building. If your group of teams is big enough, someone is going to miss your communications and warnings that you will be taking action to resolve their security issues.
Fig: Make sure you are using multiple channels to communicate with your teams. Understand that people are busy and may not see your email; use your ticketing system, instant messaging, and so on to make sure you have given your teams an opportunity to understand what needs to be fixed.
Fix via the cloud API directly. You can, but don't necessarily have to, fix the Terraform or CloudFormation code, as long as you can always catch the actual deployed configuration and correct it.
You don't need to fix infrastructure as code, e.g. by applying a fix directly to Terraform or CloudFormation. Over time I came to prefer direct cloud API fixes, mainly because I was working with hundreds of teams that had diverse methods of interacting with cloud and on-prem infrastructure. Your mileage may vary; you may need to apply your fixes in Terraform code if your company uses a single infrastructure-as-code library.
Chances are good devs are modifying infrastructure manually in development systems.
Terraform and CloudFormation lag behind the direct cloud APIs in features and functionality, meaning that in rare edge cases you may not be able to fix items you would otherwise hope to resolve. This is just another reason to use the cloud APIs when nothing prevents you from doing so. A sketch of a direct API fix follows below.
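To make the contrast concrete, here is a minimal sketch of correcting the actual deployed state through the cloud API, using a hypothetical example of a security group left open to the world on SSH; nothing about it depends on whether the resource came from Terraform, CloudFormation, or a console click.

```python
"""Minimal sketch of fixing deployed configuration via the cloud API directly
(illustrative only; the security-group example is hypothetical)."""
import boto3


def close_open_ssh(security_group_id: str) -> None:
    ec2 = boto3.client("ec2")
    # Read the *actual* deployed state, not the IaC definition.
    group = ec2.describe_security_groups(GroupIds=[security_group_id])[
        "SecurityGroups"
    ][0]
    for perm in group["IpPermissions"]:
        world_open = any(
            r.get("CidrIp") == "0.0.0.0/0" for r in perm.get("IpRanges", [])
        )
        if perm.get("FromPort") == 22 and perm.get("ToPort") == 22 and world_open:
            # Revoke only the offending rule; other ingress rules are untouched.
            ec2.revoke_security_group_ingress(
                GroupId=security_group_id,
                IpPermissions=[
                    {
                        "IpProtocol": perm["IpProtocol"],
                        "FromPort": 22,
                        "ToPort": 22,
                        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
                    }
                ],
            )
```

A fix applied this way can, of course, be reverted the next time a team's pipeline re-applies its old definition, which is one reason real-time, event-driven re-remediation (as in the earlier sketch) pairs well with this approach.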
Fig: If you take on too many representations of infrastructure as code to remediate, you may end up creating an insurmountable volume of work and locations to change. It is enough to remediate the most embodied form of the infrastructure, which in my opinion is the direct cloud API, if you have built automation that can do that in real time.
Automate your workflow for approvals where possible, but don't let waiting for that automation stop you from taking action. Start fixing even if you don't yet have full workflow and approval automation and must instead get approvals manually from stakeholders in tickets.
A general principle of minimum surprise is worth noting:
get ticketed approvals that communicate very clearly what changes you will make
test thoroughly in monitored lower risk environments before promoting a fix to high impact environments.
if possible, use feature flags so you can deploy your code safely and selectively enable a remediation to reduce the risks of a big deploy (a minimal sketch follows this list)
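As an illustration of the feature-flag point in the last item, here is a minimal sketch of gating a remediation behind a flag so the code can be deployed dark and enabled selectively. The flag name and the environment-variable mechanism are assumptions for the example, not a reference to any particular feature-flag product.

```python
"""Minimal sketch of gating a remediation behind a feature flag (illustrative).

Flags are read from environment variables for simplicity; a real deployment
might use a parameter store or a feature-flag service instead.
"""
import os


def flag_enabled(name: str) -> bool:
    # e.g. setting REMEDIATION_S3_PUBLIC_ACCESS=on enables that one remediation
    return os.environ.get(name, "off").lower() in {"on", "true", "1"}


def maybe_remediate(resource_id: str) -> str:
    if not flag_enabled("REMEDIATION_S3_PUBLIC_ACCESS"):
        # Code is deployed but inert: you can ship, monitor, then enable later.
        return f"flag off, skipping {resource_id}"
    # apply_fix(resource_id)  # the actual remediation call would go here
    return f"remediated {resource_id}"
```

Deploying the handler with the flag off, watching it in lower environments, and then turning it on per environment or per account keeps each step small and reversible.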
fig: Minimum surprise is best. Workflow automation for approvals from teams, if you need them, is well worth the investment in a large infrastructure. Similarly, testing and working in lower environments first, along with careful monitoring, contributes to minimum surprise once you get to more impactful, higher environments. On the other hand, you should not become unable to take action on issues even if that causes some surprises. This is an issue that requires balance.
Give teams a way to provide evidence of compliance and obtain an exception when an automated fix would be inappropriate for their configuration.
Appreciate that investigating and documenting an exception is, and likely should be, vastly more difficult than applying an automated fix.
Bear in mind that this can be a difficult process and may take a skilled application or security engineer to assess, investigate, and approve, so make sure it leads to a material reduction in risk for your teams and that the effort is preserved for future audits so it can be reused.
If a finding calls out an issue that is not a material risk, don't hesitate to retire that finding when it isn't applicable in your infrastructure, or to develop prepackaged advice and documentation for teams for whom the finding is not material. The burden of documenting shouldn't be treated lightly: if you are the one assessing the attestations of a team claiming a finding doesn't apply to them, that burden also falls on you or your analysts, since assessing the veracity of those claims and advising teams is itself a significant effort. One lightweight way to respect approved exceptions in automation is shown in the sketch below.
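The sketch below assumes a simple JSON registry of approved exceptions keyed by resource and finding ID with an expiry date, which the remediation consults before touching a resource; the format and field names are hypothetical.

```python
"""Minimal sketch of honoring documented exceptions before remediating
(illustrative only; the registry format and field names are hypothetical)."""
import json
from datetime import date


def load_exceptions(path: str = "exceptions.json") -> dict:
    # Example entry:
    # {"arn:aws:s3:::team-bucket|S3_PUBLIC_ACCESS":
    #  {"approved_by": "appsec", "expires": "2026-01-01"}}
    with open(path) as f:
        return json.load(f)


def has_valid_exception(exceptions: dict, resource_arn: str, finding_id: str) -> bool:
    record = exceptions.get(f"{resource_arn}|{finding_id}")
    if not record:
        return False
    # Expired exceptions fall back to normal remediation and re-review.
    return date.fromisoformat(record["expires"]) >= date.today()
```

Keeping these records durable and reviewable is what lets teams reuse the investigation on the next audit instead of repeating it.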
Fig: You'll need to be able to document exceptions and compensating controls with teams. We feel teams should own and retain this documentation, while the security team maintains access to it, archives copies of these collaboratively edited documents, and reviews and signs off on them as an expert advisory service to the risk owners of the system (usually the system's CISO). The most important thing is that your teams can leverage past time invested in documenting why a control would not achieve its security objectives, so they don't have to repeat that effort on each audit. Don't underestimate the challenge of responding to requests for exceptions; analyzing a large infrastructure in this way can be a real challenge for a small team.
Hopefully this helps with some of the general overview and architecture questions you may run into. My intent here is not to sell anything, but to lower the barrier to entry for other teams facing this sort of challenge, and in doing so to leverage the hard work done on our various projects for further benefit to citizens where nothing prevents it.
I would like to thank the many dedicated security colleagues, engineers, and teams that have helped resolve hundreds of thousands of findings over the years. The dedication and determination of these individuals may be why you or a loved one enjoys protection from identity theft today or retains privacy choices. In my experience, seeing a team of engineers, CISOs, and others come together to solve a seemingly very challenging security task is very inspiring; much good has been done by these teams to protect a great many citizens.