Security Operations Framework
Security operations is the set of technology, controls, and processes that allows a security organization to prevent, detect, and identify cyber intrusions and to respond to and recover from them accurately and efficiently.
After building a few security operations teams and programs, I was inspired to write down my basic framework. Your team may look and be structured differently due to different business risks or a higher tolerance for risk, but I believe everything listed below is necessary for a successful and informed security operations team.
The first thing you may notice about this framework is that it’s not focused around a security operations center (SOC) and it’s not focused around any individual technology like a SIEM. Modern security operations teams are responsible for much more than alerts and investigations. A security operations team should use their technology to enable identification, detection, response, and recovery, without relying on their technology stack. Always be willing to build or purchase new technology if your current tools and capabilities are not meeting your goals and requirements.
You should describe a set of goals or a set of risks you want to be able to respond to and recover from. In order to serve these goals and risks, you may need a variety of engineers, analysts, and practitioners performing a variety of tasks. You may need 24/7/365 coverage, a focus on internal threat or fraud risk, or a focus on physical security and employee safety. You may need digital forensics, malware reversing, or vulnerability research expertise on your team. You may prioritize different projects and parts of the team depending on the different risks and the way the organization is structured.
Your security operations team will work closely with many different parts of the business, including the entire security team. Security operations should work closely with product security to make sure all products are being monitored and all infrastructure base images have log collection agents on them. Any known vulnerabilities from product security should be shared with security operations for proactive monitoring. Security operations should share insights about scans, tests, and exploits with product security.
Security operations should also work closely with enterprise security on programs like endpoint protection, deception technology, e-mail monitoring, brand monitoring, executive protection, and many other programs for effective controls design, response and recovery, and telemetry collection.
Log Management and Telemetry Collection
Security operations should be working with engineering and IT to make sure that all logs and relevant telemetry are being sent to a central location for security operations to perform analysis and conduct investigations.
Some basic features a good log management system must have include: structured and unstructured log ingestion, integrations with common systems and technology, ability to write complex detection and alerting rules, and integrations into investigation management systems and orchestration and automation systems. Depending on your team and their requirements, this system may be part of existing engineering infrastructure, may be a commercial security product, or may be managed by an external vendor.
Log and telemetry management is a continuous process that will get better iteratively as operations uncovers new places to collect telemetry and develops new types of telemetry to collect.
Security operations should be continuously writing new detections and alerting rules to increase their ability to find new intrusions and more sophisticated adversaries.
While you may be building most of your detection engineering in your log management system, you shouldn’t be constrained by it if your engineers are creating detections that the system doesn’t support. If a detection needs to span a large amount of time or a large amount of data or requires external data sources or correlation or statistics functions that the log management system doesn’t support, utilize external systems to make those detections possible.
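As a rough illustration of a detection that might live outside the log management system, here is a minimal sketch of a statistical check on a per-user daily count of failed logins. The function name, the sample data, and the three-standard-deviation threshold are all assumptions made for this example, not recommendations from any particular product.

```python
from statistics import mean, stdev


def is_anomalous(history, today, threshold=3.0):
    """Return True if today's count is far above the historical baseline.

    `history` is a list of past daily counts pulled from wherever your
    telemetry lives (hypothetical here). The z-score threshold is an
    assumption you would tune for your own data.
    """
    if len(history) < 2:
        return False  # not enough data to estimate a baseline
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any increase is unusual.
        return today > baseline
    return (today - baseline) / spread > threshold


# Hypothetical daily failed-login counts for one user.
failed_logins = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2]
print(is_anomalous(failed_logins, 40))  # True
print(is_anomalous(failed_logins, 3))   # False
```

A job like this can run on a schedule against exported data and write its findings back as events, so the rest of your alerting and investigation workflow stays the same.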
Your team needs an investigation management system to track all ongoing investigations and unresolved events. This system can also be a place where your analysts record all actions and evidence for an investigation and your source of truth for incident timelines, evidence, investigation results, and legal decisions. From a technology perspective, this system can be as complicated as something custom or purpose-built for tracking investigations or as simple as a generic ticket management system.
Most of your investigations may close automatically as your automation and orchestration technology takes care of any items in your response and recovery runbooks. Some of your investigations may stay open while waiting for automation to record a response from an analyst or an employee. Your investigation management system can be a place where open and recently closed investigations are reviewed by analysts for accuracy and completeness.
Response and Recovery
Have clear rules and runbooks for when an event or alert becomes an incident or a breach. At this point, you should declare an incident commander (usually the person conducting the investigation) and you should start to collect forensic evidence, including building a timeline and performing remote imaging when applicable. Ideally, your investigation management system will assist in all of these processes.
If you have a runbook for the type of event you are responding to and recovering from, follow it until it does not apply. If you don’t have a runbook, gather experts from your organization and use their experience and knowledge to investigate this event to resolution.
Every type of event or alert that happens more than once should be runbooked and every type of event or alert that happens more than twice should be automated. Utilize modern techniques like chatops and orchestration to prevent security operations analysts from having to get involved in every event.
For example, if you raise an alert for an employee logging in from a new device or country, you might have your automated runbook start by firing off a chatops-style message to that employee asking if the login was legitimate. If they answer no or don’t answer in a reasonable amount of time, you can fire off an account lockout via orchestration and have the event flagged for review by security operations. If they answer yes and there’s no other correlated evidence of intrusion, you might close out the event without human interaction.
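The flow above can be sketched in code. This is a minimal sketch, not a real integration: `ask_employee`, `lock_account`, and `flag_for_review` are hypothetical callables standing in for your chatops and orchestration tooling, and `ask_employee` is assumed to return `True`, `False`, or `None` when the employee does not answer in time. The branch where the employee says yes but correlated evidence exists is my assumption: it sends the event to a human instead of closing it.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    CLOSED_AUTOMATICALLY = "closed_automatically"
    FLAGGED_FOR_REVIEW = "flagged_for_review"
    LOCKED_AND_FLAGGED = "locked_and_flagged"


@dataclass
class LoginAlert:
    employee: str
    country: str
    correlated_evidence: bool = False  # other signs of intrusion


def run_new_login_runbook(
    alert: LoginAlert,
    ask_employee: Callable[[LoginAlert], Optional[bool]],
    lock_account: Callable[[str], None],
    flag_for_review: Callable[[LoginAlert], None],
) -> Outcome:
    """Automated runbook for a login from a new device or country."""
    answer = ask_employee(alert)  # True, False, or None on timeout
    if answer is True:
        if not alert.correlated_evidence:
            # Confirmed and nothing else suspicious: no human needed.
            return Outcome.CLOSED_AUTOMATICALLY
        # Assumption: a confirmed login with other evidence gets a human look.
        flag_for_review(alert)
        return Outcome.FLAGGED_FOR_REVIEW
    # "No" or no answer in time: lock the account and flag for review.
    lock_account(alert.employee)
    flag_for_review(alert)
    return Outcome.LOCKED_AND_FLAGGED


# Usage with in-memory stand-ins for the real integrations.
locked, review_queue = [], []
outcome = run_new_login_runbook(
    LoginAlert(employee="alice", country="BR"),
    ask_employee=lambda alert: None,  # simulate a timeout
    lock_account=locked.append,
    flag_for_review=review_queue.append,
)
print(outcome.name, locked)
```

Passing the integrations in as callables keeps the runbook logic testable in a tabletop or unit test without touching real accounts.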
Your security operations team is responsible for documenting and analyzing security failures. Write careful, accurate, and honest postmortems that influence how security controls and processes are improved and prioritized.
Beyond your regular postmortems, your security operations team should also be reviewing tape of manual and automatic response and recovery incidents and running tabletop exercises and drills. The primary goals are to improve efficacy, build camaraderie, reduce mistakes, and normalize different high-stress situations across the team. Reviewing incidents in detail can also help uncover new sources of telemetry, new detections, and new controls that can improve the overall security of the organization.
Threat Feeds and Information Sharing
Part of your detection engineering process should include trusted information feeds. You may ingest a variety of private, public, and shared feeds into your log management system to reduce false positives and increase true positive detections.
Make sure that you’re using trustworthy threat feeds that go through strict quality assurance processes and that the context around untrustworthy feeds is well understood for the detections they are used in. The last thing you want is some automated process to block legitimate users from legitimate systems because of a shitty threat feed.
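One way to keep a low-quality feed from locking out legitimate users is to tie the allowed action to how much you trust the feed. Here is a minimal sketch of that idea. The feed names, the two trust levels, and the `block` / `alert_only` actions are all hypothetical, and unknown feeds are treated as low trust by assumption.

```python
from dataclasses import dataclass

# Hypothetical feeds and trust levels; maintain these wherever your
# detection content lives.
FEED_TRUST = {
    "internal_incident_response": "high",
    "vetted_commercial_feed": "high",
    "public_blocklist": "low",
}


@dataclass
class IndicatorMatch:
    indicator: str  # e.g. an IP address or domain seen in your logs
    feed: str       # name of the feed the indicator came from


def action_for(match: IndicatorMatch) -> str:
    """Only high-trust feeds may trigger an automated block."""
    trust = FEED_TRUST.get(match.feed, "low")  # unknown feeds are low trust
    return "block" if trust == "high" else "alert_only"


print(action_for(IndicatorMatch("203.0.113.7", "vetted_commercial_feed")))  # block
print(action_for(IndicatorMatch("203.0.113.7", "public_blocklist")))        # alert_only
```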
In stark contrast to threat feeds, your security operations team should also have a strong pulse on who your adversaries are, how they plan and operate, and what their motivations, goals, resources, and constraints look like.
Once your security operations team is large enough, you should have a dedicated person or team in your security operations team for collection, analysis, and dissemination of threat intelligence for security partners and business partners within the organization.
The security operations team should also be responsible for organization-wide threat modeling since they are responsible for organization-wide threat intelligence and intrusion response and recovery. Security operations will be intimately familiar with the organization’s adversaries, past intrusions, and evolving adversary operations, all of which will be valuable in building an accurate and complete threat model.
Security operations usually also works closely with different teams and employees from all over the organization and needs a way to communicate with them. This can happen officially or organically over Slack and e-mail. Sometimes, security operations may need a way for employees to get in touch with an analyst on call in an emergency via e-mail, Slack, or phone and sometimes security operations may need to allow employees to open a ticket manually.
Security operations should have clear, explicit lines for when legal needs to be notified of an investigation or intrusion. These lines may be different depending on your vertical, your regulatory obligations, and your reporting obligations. Your team should be intimately familiar with how legal defines a breach. If security and legal have different definitions for intrusion, incident, and breach, make sure they are well documented and the team is aware of them. Your team should also understand your breach notification requirements from legal and any security obligations from customers, investors, or regulators.
You may have to report breaches to regulatory organizations, customers, or partners and you may choose to publicly disclose breaches. As part of your runbooking, you should have language drafted for these scenarios before they happen, like holding statements and breach notification press releases. Work with your marketing, public relations, and social media teams on what reporting and press looks like in these scenarios. Make sure your CISO, or whoever is going to be customer- and press-facing, goes through media training and knows what to say and more importantly what not to say. Run these scenarios internally as part of your tabletop exercises and review.
Business Continuity and Disaster Recovery
I like when my security operations team is responsible for business continuity planning and disaster recovery, because security operations already knows where things live, what things are redundant, and what the realities of recovery are in different scenarios. Planning and recovery designed by your security operations team will also be more accurate and pragmatic than if they were designed by compliance. Security operations should also be responsible for execution if a plan needs to be implemented.
I am not a fan of quantitative metrics. I think they are misleading and easily manipulated. I prefer qualitative metrics that can be defended with logic and reason. That said, your program needs some quantitative way to track itself. Here are my favorite security operations metrics:
Mean Time To Triage — How long until an event or alert is deemed an incident or not. This metric should include events that have automated runbooks (once an automated runbook has started you may consider the event triaged). The more accurate and complete your automation and runbooking is, the lower this metric will go. If automation makes a mistake and it needs to be corrected by an analyst, then the metric should take into account the second time an event is triaged.
The faster events are triaged, the faster response and recovery can begin, hopefully resulting in fewer intrusions turning into breaches. The less time your team spends triaging events, the more time they have to work on new controls and strategic projects, hopefully resulting in fewer intrusions.
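Here is a minimal sketch of how Mean Time To Triage could be computed from exported event records. The record shape (`raised_at` plus a list of `triaged_at` timestamps) is an assumption for this example. Using the last triage timestamp is how this sketch accounts for an analyst correcting a mistake made by automation.

```python
from datetime import datetime, timedelta


def mean_time_to_triage(events):
    """Average time from an event being raised to its final triage.

    Each event is a dict with `raised_at` (datetime) and `triaged_at`
    (list of datetimes, one per triage decision). Events that have not
    been triaged yet are skipped. Returns None if nothing was triaged.
    """
    durations = [
        max(e["triaged_at"]) - e["raised_at"]
        for e in events
        if e["triaged_at"]
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


# Hypothetical events: one triaged by automation in a minute, one
# triaged by automation and then corrected by an analyst.
events = [
    {
        "raised_at": datetime(2024, 1, 1, 9, 0),
        "triaged_at": [datetime(2024, 1, 1, 9, 1)],
    },
    {
        "raised_at": datetime(2024, 1, 1, 10, 0),
        "triaged_at": [datetime(2024, 1, 1, 10, 2), datetime(2024, 1, 1, 10, 32)],
    },
    {"raised_at": datetime(2024, 1, 1, 11, 0), "triaged_at": []},
]
print(mean_time_to_triage(events))  # 0:16:30
```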
Automation and Runbook Coverage — How many of your incident response and recovery events and alerts have written runbooks and how many are completely automated. As the team and scope grow, the number of events the team responds to will grow and you want to make sure that the team is continuously writing runbooks and automation for more events.
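A minimal sketch of tracking this metric, assuming you keep a simple inventory of event types with how often each has occurred. The `EventType` fields and the sample inventory are hypothetical. The backlog function applies the earlier rule of thumb: runbook anything seen more than once and automate anything seen more than twice.

```python
from dataclasses import dataclass


@dataclass
class EventType:
    name: str
    occurrences: int
    has_runbook: bool
    automated: bool


def coverage(event_types):
    """Fraction of event types with a runbook, and fraction fully automated."""
    total = len(event_types)
    if total == 0:
        return {"runbook": 0.0, "automated": 0.0}
    # An automated event type counts as having a runbook.
    with_runbook = sum(1 for e in event_types if e.has_runbook or e.automated)
    automated = sum(1 for e in event_types if e.automated)
    return {"runbook": with_runbook / total, "automated": automated / total}


def backlog(event_types):
    """Event types that are overdue for a runbook or for automation."""
    needs_runbook = [
        e.name for e in event_types
        if e.occurrences > 1 and not (e.has_runbook or e.automated)
    ]
    needs_automation = [
        e.name for e in event_types
        if e.occurrences > 2 and not e.automated
    ]
    return needs_runbook, needs_automation


# Hypothetical inventory.
inventory = [
    EventType("phishing_report", 50, has_runbook=True, automated=True),
    EventType("new_device_login", 12, has_runbook=True, automated=False),
    EventType("usb_malware", 2, has_runbook=False, automated=False),
    EventType("lost_laptop", 1, has_runbook=False, automated=False),
]
print(coverage(inventory))  # {'runbook': 0.5, 'automated': 0.25}
print(backlog(inventory))   # (['usb_malware'], ['new_device_login'])
```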
Intrusions Not Detected — Any incident or breach that wasn’t caught by detection technology is a failure of the security operations team. Any intrusions not detected should be documented, given a postmortem, and then have controls built, telemetry collected, and detections written for them.