7 Proven ICS Threat Hunting Strategies to Prevent Downtime
Executive summary – why ICS threat hunting matters now
Industrial operators no longer have the luxury of reactive security. Adversaries that target ICS (industrial control systems) aim for persistence, stealth, and physical disruption – not just data theft. In 2024–2026 we’ve seen attackers weaponize supply chains, vendor access, and unmanaged remote connectivity to move from IT footholds into OT environments. That means detection alone is not enough: threat hunting – disciplined, adversary-focused investigation – is essential to find dwelling adversaries before they cause downtime.
A modern ICS threat hunting program must be OT-aware: it has to preserve safety and availability, use passive/flow-based telemetry where necessary, and map hunts to ICS tactics and techniques. This article gives a practical playbook for CISOs, OT managers, and security architects: seven proven hunting strategies, how to operationalize them safely, what telemetry you must collect, and the KPIs that show you’re winning. Where useful, I tie recommendations to well-known standards and guidance so your program is defensible and auditable.
Background – the OT hunting problem (short)
ICS environments are different: devices are deterministic, downtime is expensive and risky, and many field assets were never designed for hostile networks. Traditional IT hunting techniques (active scans, agent-heavy telemetry) can break control loops or are impractical on constrained devices. Instead, ICS hunting must rely on high-value, OT-specific telemetry and threat models informed by ICS adversary behaviors (e.g., the ATT&CK for ICS matrix). Getting the right telemetry, translating the telemetry into process context, and aligning hunts with safety procedures are the three most common failings of immature programs.
What good ICS threat hunting looks like
Before the strategies: a quick checklist of capabilities you must have to hunt effectively in ICS:
- Canonical OT asset inventory (device model, firmware, process role, owner).
- Passive network telemetry (flow logs, deep protocol awareness for Modbus, DNP3, OPC UA, IEC 61850).
- Host/edge telemetry where safe (gateway logs, edge attestations, HBAT/heartbeat).
- High-fidelity time sync across OT and SOC (accurate timestamps for correlating events).
- Threat intelligence mapped to ICS behaviors (TTPs mapped to ATT&CK for ICS).
- Safety-aware playbooks and escalation paths that include engineering and plant operations.
- Secure vendor access controls and recorded sessions to examine during hunts.
If you don’t have those, build them first – hunting without context produces noise and risk. National guidance and ICS playbooks make this explicit; align to them to demonstrate due diligence.
The seven proven ICS threat hunting strategies
Below are the strategies (one per section). Each contains: why it works, the telemetry you need, safe tactics to run in production, and example hypotheses or queries to try.
1) Hunt for abnormal process-originated commands (safety-first control-plane checks)
Why it works
ICS attacks often attempt to directly manipulate actuators or PLC registers. Rather than looking for malware signatures, hunting for unusual control commands (for example, setpoint writes at odd times, unexpected sequence numbers, or commands that bypass human signoffs) flags malicious intent before physical impact.
Telemetry required
Passive DPI for ICS protocols (Modbus, DNP3, OPC UA, IEC 61850), PLC/HMI command logs, historian write events, and engineering change records.
Safe tactics
- Use passive monitoring only – never inject traffic or query PLCs.
- Collate write commands and compare to baseline control flows (who normally issues them, from which host, during what hours).
- Enrich with maintenance schedules: a legitimate firmware/logic change should have a maintenance ticket.
Example hunt hypotheses & queries
- Hypothesis: “A host outside the usual engineering VLAN issued a series of PLC write commands that changed safety setpoints.”
- Query: show all Modbus function code 16 (write multiple registers) events from the last 7 days where the source IP is not on the engineering allowlist and the target register falls in a safety-critical range.
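The query above can be sketched as a filter over parsed passive-DPI records. This is a minimal illustration, not a product query language: the event schema (src_ip, function_code, register), the allowlist, and the register range are all assumptions you would replace with your own DPI tool's output and asset context.

```python
# Hypothetical engineering hosts and safety-critical register range --
# placeholders for your real allowlist and asset inventory.
ENGINEERING_ALLOWLIST = {"10.10.5.11", "10.10.5.12"}
SAFETY_REGISTER_RANGE = range(40001, 40101)

def suspicious_writes(events):
    """Return Modbus 'write multiple registers' (function code 16) events
    issued from non-allowlisted sources against safety-critical registers."""
    return [
        e for e in events
        if e["function_code"] == 16
        and e["src_ip"] not in ENGINEERING_ALLOWLIST
        and e["register"] in SAFETY_REGISTER_RANGE
    ]

events = [
    {"src_ip": "10.10.5.11", "function_code": 16, "register": 40010},    # allowlisted host
    {"src_ip": "192.168.7.44", "function_code": 16, "register": 40010},  # outside allowlist
    {"src_ip": "192.168.7.44", "function_code": 3, "register": 40010},   # a read, not a write
]
print(suspicious_writes(events))  # flags only the second event
```

The key design point is that the hunt enriches each event with context (allowlist membership, register criticality) rather than matching signatures.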
Metrics to track
Time-to-detect unauthorized write, number of unauthorized write attempts blocked/isolated, number of false positives from hunts.
2) Hunt for telemetry provenance loss and sensor spoofing
Why it works
Adversaries who want stealth will falsify telemetry rather than directly manipulate actuators. Spoofed sensor data can mask destructive actions or cause bad operator decisions. Detecting discrepancies between independent measurements (e.g., flow rate vs. pressure vs. pump RPM) is a reliable way to find spoofing.
Telemetry required
Multi-sensor telemetry, edge/PLC heartbeat, metadata (firmware hash, device attestation), and correlated IT-side metrics (network path, TLS session metadata).
Safe tactics
- Build physics-based or rules-based cross-checks (if pump RPM drops but flow remains constant, flag inconsistency).
- Use edge gateways to perform lightweight provenance checks (signed telemetry, device attestations).
- Prioritize invariant signals (e.g., energy consumption vs. output) that are hard to spoof at scale.
Example hunt hypotheses & queries
- Hypothesis: “Multiple sensors report nominal values after a suspected intrusion, yet substation power draw is inconsistent with those readings.”
- Query: correlate sensor X readings with historian-derived aggregated outputs; flag sustained deviations beyond model thresholds.
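A rules-based cross-check of the kind described above can be sketched as follows. The field names, thresholds (50% RPM drop, 10% flow tolerance), and sample values are illustrative assumptions; a real deployment would derive thresholds from the process model and historian baselines.

```python
def inconsistent(samples, rpm_drop=0.5, flow_tolerance=0.1):
    """Flag timestamps where pump RPM falls by more than `rpm_drop`
    from baseline while reported flow stays within `flow_tolerance`
    of baseline -- a physics-inconsistent pairing that may indicate
    sensor spoofing."""
    base = samples[0]
    flags = []
    for s in samples[1:]:
        rpm_ratio = s["rpm"] / base["rpm"]
        flow_ratio = s["flow"] / base["flow"]
        if rpm_ratio < (1 - rpm_drop) and abs(flow_ratio - 1) < flow_tolerance:
            flags.append(s["ts"])
    return flags

samples = [
    {"ts": 0, "rpm": 1800, "flow": 120.0},    # baseline
    {"ts": 60, "rpm": 1750, "flow": 118.0},   # normal drift
    {"ts": 120, "rpm": 700, "flow": 119.5},   # RPM halved, flow "nominal": suspicious
]
print(inconsistent(samples))  # [120]
```

Pairing two independent measurements like this is exactly why cross-sensor invariants are hard to spoof at scale: the attacker must falsify every correlated signal consistently.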
Metrics to track
Number and severity of provenance anomalies, time from anomaly to investigation, % anomalies confirmed malicious.
3) Hunt for suspicious vendor and maintenance access patterns
Why it works
Vendor access is a frequent initial access vector in ICS incidents. Hunting for irregular vendor sessions – unusual durations, off-hours access, remote IP changes, or failed MFA attempts – finds attackers using compromised vendor credentials.
Telemetry required
Bastion logs, VPN logs, PAM session recordings, jump-host session metadata, and vendor identity attributes.
Safe tactics
- Centralize vendor access via a controlled bastion (do not allow direct device VPNs).
- Hunt for deviations: same vendor account used from two distant geolocations within a short period, or vendor session with unexpected commands.
- Combine with asset context: vendor session to a safety-critical PLC without an active ticket is suspicious.
Example hunt hypotheses & queries
- Hypothesis: “A vendor account was used to access multiple substations outside scheduled maintenance windows.”
- Query: list vendor account sessions from the past 30 days that started after 2:00 AM local time or originated from unexpected ASNs/netblocks; surface sessions without matching change tickets.
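The ticket-matching half of that query can be sketched as a join between bastion session logs and open change tickets. The session schema, the off-hours window, and the ticket IDs are assumptions stood in for your PAM/bastion export and change-management system.

```python
OFF_HOURS = set(range(0, 6))  # 00:00-05:59 local; adjust per site policy

def flag_sessions(sessions, open_tickets):
    """Flag vendor sessions that are off-hours or lack a matching
    change ticket; returns (account, was_off_hours, missing_ticket)."""
    flagged = []
    for s in sessions:
        off_hours = s["start_hour"] in OFF_HOURS
        no_ticket = s.get("ticket") not in open_tickets
        if off_hours or no_ticket:
            flagged.append((s["account"], off_hours, no_ticket))
    return flagged

sessions = [
    {"account": "vendor-acme", "start_hour": 14, "ticket": "CHG-1001"},  # normal
    {"account": "vendor-acme", "start_hour": 3, "ticket": "CHG-1001"},   # off-hours
    {"account": "vendor-beta", "start_hour": 10, "ticket": None},        # no ticket
]
print(flag_sessions(sessions, open_tickets={"CHG-1001"}))
```

Returning the reason alongside the account matters operationally: an off-hours session with a valid ticket and a daytime session with no ticket get triaged differently.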
Metrics to track
Number of unauthorized vendor sessions, vendor access policy violations, time to revoke vendor credentials after suspicious access.
4) Hunt for early-stage lateral movement using IT–OT bridging signals
Why it works
Most sophisticated ICS intrusions start in IT and pivot to OT. Early hunting should therefore focus on bridge points: jump servers, DMZs, application servers that mediate between IT and OT, and authentication anomalies bridging these zones.
Telemetry required
Proxy/DMZ logs, AD authentication logs, workstation EDR alerts, syslog from jump hosts, forwarder logs into SIEM.
Safe tactics
- Instrument bridging hosts with high-fidelity logging; watch for changes in access patterns (new accounts, service account misuse).
- Hunt for tools and behaviors common to lateral movement (credential dumping attempts, unusual PowerShell or WMI usage on jump hosts).
- Use ATT&CK mappings to prioritize hunts (map suspicious upstream behaviors to likely OT pivot techniques).
Example hunt hypotheses & queries
- Hypothesis: “After a spearphish in IT, an operator workstation is used to access the DMZ and then the HMI network.”
- Query: find accounts that authenticated to both the enterprise VPN and an OT jump host on the same day, where the access originated from a workstation other than the account’s typical maintenance machines.
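That correlation can be sketched as a same-day join across authentication logs. The log schema (account, day, target, host) and the per-account "usual hosts" mapping are assumptions; in practice they come from your SIEM and a baseline of historical access.

```python
from collections import defaultdict

def bridging_accounts(auth_events, usual_hosts):
    """Return (account, day) pairs that authenticated to both the
    enterprise VPN and an OT jump host on the same day from a
    workstation outside that account's usual set."""
    targets_by_key = defaultdict(set)
    unusual = set()
    for e in auth_events:
        targets_by_key[(e["account"], e["day"])].add(e["target"])
        if e["host"] not in usual_hosts.get(e["account"], set()):
            unusual.add((e["account"], e["day"]))
    return [
        key for key, targets in targets_by_key.items()
        if {"vpn", "ot_jump"} <= targets and key in unusual
    ]

events = [
    {"account": "jdoe", "day": "2025-03-01", "target": "vpn", "host": "ws-77"},
    {"account": "jdoe", "day": "2025-03-01", "target": "ot_jump", "host": "ws-77"},
    {"account": "eng1", "day": "2025-03-01", "target": "ot_jump", "host": "eng-ws-01"},
]
print(bridging_accounts(events, usual_hosts={"jdoe": {"ws-12"}, "eng1": {"eng-ws-01"}}))
```

Here `jdoe` is flagged (both zones, unfamiliar workstation) while `eng1` is not, which is the asymmetry this hunt is built to surface.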
Metrics to track
Mean time to acknowledge (MTTA) lateral-movement alerts, number of IT-originated alerts correlated to OT events, reduction in unsanctioned cross-zone access incidents.
5) Hunt for firmware/update-chain compromise indicators
Why it works
Supply-chain and update-chain attacks (compromised signing keys, rogue firmware images) can give attackers deep persistence. Hunting for abnormal update behaviors, unrecognized signing certificates, or unusual distribution patterns of firmware helps detect such compromises before they’re widely installed.
Telemetry required
Firmware repository logs, update server logs, gateway/endpoint firmware hashes, SBOM metadata, and cryptographic validation logs.
Safe tactics
- Maintain a fingerprinted firmware inventory and monitor for unexpected version rollouts or mismatched hashes.
- Alert on update servers pushing updates to devices without matching change tickets or out of cadence.
- Validate signature chains in automation and hunt for unknown certificate chains.
Example hunt hypotheses & queries
- Hypothesis: “A vendor’s update server pushed a new firmware image signed by a certificate not previously associated with that vendor.”
- Query: list firmware update events in the last 90 days where signing certificate != authoritative certificate stored in CMDB.
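The certificate-mismatch query above amounts to a set comparison between observed signing certificates and the authoritative certificate recorded per vendor (for example, in a CMDB). The vendor names, image names, and truncated fingerprints below are illustrative placeholders.

```python
# Hypothetical vendor -> known-good signing cert fingerprint mapping,
# standing in for the authoritative record in your CMDB.
AUTHORITATIVE_CERTS = {
    "acme-plc": "sha256:aa11",
    "beta-rtu": "sha256:bb22",
}

def unexpected_signers(update_events):
    """Return firmware update events whose signing certificate differs
    from the authoritative certificate stored for that vendor (or whose
    vendor has no authoritative certificate on record)."""
    return [
        e for e in update_events
        if AUTHORITATIVE_CERTS.get(e["vendor"]) != e["signing_cert"]
    ]

updates = [
    {"vendor": "acme-plc", "image": "fw-2.1", "signing_cert": "sha256:aa11"},
    {"vendor": "acme-plc", "image": "fw-2.2", "signing_cert": "sha256:ff99"},  # unknown signer
]
print(unexpected_signers(updates))  # flags fw-2.2
```

Note that an absent CMDB entry is treated as a mismatch: an update from a vendor with no authoritative certificate on record should surface for review rather than pass silently.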
Metrics to track
Number of unsigned or unexpected firmware pushes detected, time to isolate/rollback malicious firmware, percent of devices with verified firmware signatures.
6) Hunt using behavioral baselines with ML but keep human-in-the-loop
Why it works
Anomaly detection powered by ML can surface subtle deviations at scale (timing shifts, small but consistent command anomalies). However, ML models often produce false positives in OT unless trained with process-aware features and reviewed by OT experts.
Telemetry required
Flow statistics, protocol commands, sequence/time-of-day patterns, device performance metrics, and labeled past incidents for model training.
Safe tactics
- Use ML for prioritization, not automatic isolation. Humans must validate high-impact alerts.
- Train models on process-aware features (e.g., control-command sequences) and update continuously with feedback.
- Protect models from poisoning by restricting training data sources and auditing model changes.
Example hunt hypotheses & queries
- Hypothesis: “Anomalous but low-volume command patterns indicate reconnaissance.”
- Query: find devices with increased variance in command inter-arrival times compared to 90-day baseline.
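The variance comparison above can be sketched with nothing but the standard library. This is deliberately ML-lite: a production model would use process-aware features and continuous retraining, but the shape of the hunt (per-device statistic vs. 90-day baseline, threshold for human review) is the same. Device names, timestamps, and the threshold are assumptions.

```python
import statistics

def variance_ratio(recent_ts, baseline_variance):
    """Ratio of recent command inter-arrival variance to the baseline."""
    gaps = [b - a for a, b in zip(recent_ts, recent_ts[1:])]
    return statistics.pvariance(gaps) / baseline_variance

def flag_devices(device_windows, baselines, threshold=3.0):
    """Surface devices whose inter-arrival variance ratio exceeds the
    threshold -- candidates for human review, never automatic isolation."""
    return [
        dev for dev, ts in device_windows.items()
        if variance_ratio(ts, baselines[dev]) > threshold
    ]

windows = {
    "plc-01": [0, 10, 20, 30, 40],  # steady command cadence
    "plc-02": [0, 2, 30, 31, 80],   # erratic cadence
}
baselines = {"plc-01": 1.0, "plc-02": 1.0}  # hypothetical 90-day baseline variances
print(flag_devices(windows, baselines))  # ["plc-02"]
```

The threshold is the human-in-the-loop dial: raising it trades recall for precision, which is exactly the metric pair tracked below.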
Metrics to track
Precision and recall of ML alerts, ratio of ML alerts escalated to investigations, reduction in time-to-detect for subtle deviations.
7) Hunt with live tabletop-driven scenarios and purple-team feedback loops
Why it works
Tools and telemetry are only useful if people know how to use them. Regular purple-team exercises – where red team emulates ICS adversaries and blue team hunts and responds – improve detection rules, telemetry coverage, and response playbooks.
Telemetry required
Same as above; additionally, red-team telemetry to compare simulated TTPs with detection signals.
Safe tactics
- Run tabletop and live emulation in controlled environments (digital twins or segmented test cells) before applying to production.
- Convert red-team observations into new hunts and detection rules; validate rules against baseline noise.
- Establish a continuous improvement loop: test → hunt → measure → refine.
Example hunt hypotheses & queries
- Hypothesis: “Simulated lateral movement techniques produce specific sequences of logins and file access across bridging hosts.”
- Query: validate whether detection rules fired for simulated sequences and tune thresholds.
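The validation step can be sketched as a diff between the techniques the red team emulated and the detections that actually fired, yielding a coverage-gap list to drive tuning. The technique IDs below are real ATT&CK for ICS identifiers used illustratively; the detection-log schema is an assumption.

```python
def coverage_gaps(emulated, detections):
    """Return emulated ATT&CK for ICS techniques for which no
    detection rule fired during the exercise."""
    detected = {d["technique"] for d in detections}
    return sorted(t for t in emulated if t not in detected)

# Techniques the red team emulated vs. detections the blue team logged.
emulated = {"T0886", "T0859", "T0843"}  # Remote Services, Valid Accounts, Program Download
detections = [{"technique": "T0886", "rule": "jump-host-anomaly"}]
print(coverage_gaps(emulated, detections))  # ['T0843', 'T0859']
```

Each gap in the output becomes a new hunt or detection rule, closing the test → hunt → measure → refine loop described above.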
Metrics to track
Number of detection rules validated by exercises, time to remediate gaps discovered in purple-team events, increase in detection coverage for mapped ATT&CK techniques.
Operationalizing the program: playbook, data, and governance
Short 90-day ramp plan
- Days 0–14 – Visibility baseline: inventory critical OT assets, enable passive DPI at chokepoints, centralize vendor access through a bastion.
- Days 15–45 – Foundational hunts: run Strategy 1 & 3 hunts (unauthorized writes; vendor sessions), instrument jump hosts and DMZs for lateral movement hunts.
- Days 46–90 – Advanced hunts & automation: implement provenance checks, firmware integrity hunts, ML-assisted baselining, and schedule purple-team exercises.
Data priorities (first 6 months)
- Passive network flows with ICS protocol parsing (top priority).
- Time-synced historian events and PLC/HMI command logs.
- Bastion/PAM session logs and jump-host recordings.
- Firmware hashes and SBOM pointers.
- AD and DMZ authentication logs for bridging detection.
Governance & cross-team alignment
- Threat hunting must be jointly chartered by the SOC and OT engineering teams.
- Create joint incident playbooks that escalate to operations/safety leads for any potential physical impact.
- Ensure legal/compliance and procurement teams require telemetry, SBOM, and signed firmware evidence from vendors.
KPIs and how to measure success
- Mean time to detect (MTTD) for ICS-specific adversary behaviors.
- Mean time to remediate (MTTR) for hunts that discover malicious persistence.
- Number of actionable hunts per month (not raw alerts).
- Percent of critical assets with attested firmware and identity.
- Reduction in unapproved vendor access incidents.
Organizations that treat hunting as an engineering discipline – instrumented, measurable, and repeatable – move the needle on these KPIs within 6–12 months.
Standards, frameworks and references
Map your hunting program to authoritative guidance:
- NIST SP 800-82 (OT security guidance) for architecture and safe monitoring practices.
- IEC 62443 for supplier/zone modeling and secure lifecycle requirements.
- MITRE ATT&CK for ICS for TTP mapping and threat-informed hunts.
- CISA ICS resources and advisories for current vulnerabilities and sector-specific guidance.
- Industry reports (Dragos, Claroty, vendor MDR reports) for trending adversary behaviors and validation of hunting hypotheses.
Common pitfalls and how to avoid them
- Doing noisy active scans in production. Use passive discovery and schedules for any active tests in controlled windows.
- Over-reliance on ML without OT validation. ML must be process-aware and validated by engineers.
- Treating hunting as a one-off project. Make hunting a continuous program with measurable objectives.
- Ignoring vendor access governance. Centralize and monitor vendor sessions; require recorded bastion access.
Final thoughts – hunting as an operational capability
Threat hunting in ICS is not an academic exercise: it’s a pragmatic, safety-first discipline that turns telemetry into early warning. The seven strategies above give you a layered approach: from direct control-plane observation to provenance validation, vendor access scrutiny, and model-driven anomaly detection – all tied into red-team validation and standards-based governance.
Start small, measure aggressively, and scale what gives you reliable, low-noise detection that respects safety constraints. Do that, and you’ll convert limited visibility into actionable hunts that prevent downtime, protect people, and keep operations resilient.
