Berkner Tech

Detecting Fault Injection at Runtime

Runtime fault-injection detection on an embedded device using sensors, counters, and integrity checks

Surviving a single glitch is one defense. Noticing that someone is glitching you is another, and a powerful one. Fault injection takes many tuning attempts, so a device that detects faults and responds can turn the attacker’s necessary search into a self-defeating process. Here is how runtime fault detection works and what to do when it fires.

Detection Versus Resistance

Glitch-resistant code makes a single fault insufficient. Fault detection goes further: it notices that a fault occurred and lets the device react, by locking down, wiping secrets, or counting the event toward an escalating response. The two are complementary, and high-assurance devices use both.

The reason detection is so valuable is the economics of fault injection. An attacker rarely lands the right glitch on the first try; they sweep timing and parameters across many attempts to find the window. A device that recognizes those attempts and responds can deny the attacker the patient, repeated access the attack depends on, which raises the cost far more than passive resistance alone.

Hardware Fault Sensors

Many secure microcontrollers and secure elements include sensors aimed squarely at fault injection: voltage monitors that flag out-of-range supply, frequency monitors that catch clock anomalies, temperature sensors, and light or mesh sensors that detect decapsulation and probing. When a sensor trips, it raises a hardware tamper event that the part can route to an interrupt, a reset, or a key wipe.

// enable the security sensors a part provides, and route them to a fault handler
SEC->TAMPER_CTRL = VOLT_MON_EN | CLK_MON_EN | TEMP_MON_EN | MESH_EN;
SEC->TAMPER_RESP = RESP_RESET | RESP_WIPE_KEYS;   // action on trip
NVIC_EnableIRQ(TAMPER_IRQn);

These sensors exist precisely because glitching manipulates voltage, clock, temperature, or the physical package. Enabling them, and configuring what happens when they trip, is often a matter of setting the right registers, yet they ship disabled by default on many parts, which is a common gap an assessment finds.

Software Fault Detection

Not every fault trips a hardware sensor, so firmware adds its own detection through the redundancy it already uses to resist glitches. An impossible value in a multi-bit flag, a control-flow path that should be unreachable, or a mismatch between two independent checks are all signals that a fault corrupted execution.

// a control-flow integrity check: a counter that must equal the steps taken
volatile uint32_t flow = 0;
flow += step_a();   // each returns a known increment
flow += step_b();
flow += step_c();
if (flow != EXPECTED_TOTAL) fault_detected();   // a skipped step is caught

A running control-flow counter that must reach a known total catches a glitch that skipped a step, because the total comes out wrong. These software checks cost a few instructions and turn the corruption that fault injection causes into something the device can see and act on, even when no hardware sensor noticed.
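The impossible-value check on a multi-bit flag can be sketched in a few lines. This is a minimal illustration rather than a vetted implementation: the constants `AUTH_TRUE` and `AUTH_FALSE`, the `flag_result_t` type, and `check_auth_flag` are hypothetical names chosen for the example.

```c
#include <stdint.h>

/* Hypothetical multi-bit encodings. The two valid values are bitwise
   complements, so no single flipped bit turns one into the other. */
#define AUTH_TRUE  0x3CA5965Au
#define AUTH_FALSE 0xC35A69A5u

typedef enum { FLAG_DENY, FLAG_ALLOW, FLAG_FAULT } flag_result_t;

/* Classify a flag word: valid true, valid false, or an impossible value
   that can only come from corruption. */
flag_result_t check_auth_flag(volatile const uint32_t *flag)
{
    uint32_t v = *flag;
    if (v == AUTH_TRUE)  return FLAG_ALLOW;
    if (v == AUTH_FALSE) return FLAG_DENY;
    return FLAG_FAULT;   /* neither encoding: treat as a detected fault */
}
```

The caller handles three outcomes instead of two. A plain boolean has no room for the third, which is why a fault that corrupts it goes unseen.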

The Fault Counter

Because fault injection requires many attempts, a persistent fault counter is one of the strongest responses. Each detected fault increments a counter in non-volatile memory, and crossing a threshold triggers escalating action, from delays to lockout to key erasure. The attacker’s own repeated attempts drive the device toward shutting them out.

// persistent, escalating response to repeated faults
uint16_t fc = nv_read(FAULT_COUNT);
if (fc < UINT16_MAX) fc++;          // saturate so the count can never wrap to zero
nv_write(FAULT_COUNT, fc);
if (fc > 50)  wipe_keys();          // sustained attack -> destroy secrets
else if (fc > 10) lock_for(60);     // suspicious -> cool-down lockout

The counter must persist across resets, because attackers reset constantly while tuning, and a counter that zeroed on reset would be useless. Stored in non-volatile memory, it accumulates across the whole attack campaign, so the search an attacker needs becomes the very thing that locks the device or destroys its secrets.

Integrity Checks on Critical Data

Faults can corrupt data as well as control flow, so security-critical values deserve integrity protection. A checksum or a duplicated copy of a key, a configuration word, or a permission flag lets the device detect when a glitch flipped a bit in something that matters and refuse to act on the corrupted value.

This pairs naturally with the redundant encodings used for glitch resistance. A flag stored twice, once normal and once inverted, is checked for consistency before use, and any mismatch is both a failed check and a detected fault. The data integrity check and the fault detector are the same mechanism viewed two ways.
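A flag stored with its inverted copy might look like the sketch below. The `protected_word_t` layout and the `pw_store` and `pw_load` helpers are assumptions made for illustration, not a specific library's API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical layout: the value and its bitwise complement side by side. */
typedef struct {
    uint32_t value;
    uint32_t inverse;   /* always written as ~value */
} protected_word_t;

/* Store a value together with its complement. */
void pw_store(protected_word_t *pw, uint32_t v)
{
    pw->value   = v;
    pw->inverse = ~v;
}

/* Read the value back. Returns false, leaving *out untouched, when the
   two copies disagree, which signals corruption. */
bool pw_load(const protected_word_t *pw, uint32_t *out)
{
    uint32_t v = pw->value;
    uint32_t i = pw->inverse;
    if ((v ^ i) != 0xFFFFFFFFu) return false;
    *out = v;
    return true;
}
```

A `false` return from `pw_load` serves both purposes at once: the caller refuses to use the value, and the same branch can report the event to the fault counter.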

Responding Proportionately

What a device does on detection should match the threat and the product. The options run from gentle to severe: log the event, insert a delay, lock out for a cooldown, require re-authentication, or in the strongest response, erase the keys that protect the device’s secrets. The right choice depends on what is at stake and what false positives would cost.

Proportionality matters because environmental noise (a brownout, a hot day, a marginal supply) can trip a sensitive detector. A device that wiped its keys on a single voltage dip would be a support nightmare. The usual pattern is to tolerate isolated events and escalate only when they repeat, which distinguishes a real attack from ordinary noise.
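One way to express that ladder is a function that maps the accumulated fault count to an action. The sketch below is illustrative only: `action_t`, `choose_action`, and the threshold of 3 for inserting a delay are assumptions, while the 10 and 50 thresholds mirror the fault-counter snippet earlier in this article.

```c
#include <stdint.h>

typedef enum {
    ACT_LOG_ONLY,   /* isolated event: record it and carry on */
    ACT_DELAY,      /* a few events: slow the device down */
    ACT_LOCKOUT,    /* suspicious pattern: cool-down lockout */
    ACT_WIPE_KEYS   /* sustained attack: destroy secrets */
} action_t;

/* Illustrative thresholds, not recommendations. Real values come from
   calibration against the product's false-positive rate. */
#define DELAY_AFTER    3u
#define LOCKOUT_AFTER 10u
#define WIPE_AFTER    50u

/* Pick the response for a given persistent fault count. */
action_t choose_action(uint16_t fault_count)
{
    if (fault_count > WIPE_AFTER)    return ACT_WIPE_KEYS;
    if (fault_count > LOCKOUT_AFTER) return ACT_LOCKOUT;
    if (fault_count > DELAY_AFTER)   return ACT_DELAY;
    return ACT_LOG_ONLY;
}
```

Keeping the policy in one small function makes it easy to review and to unit test separately from the code that touches non-volatile memory.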

The Wipe-on-Tamper Response

For devices holding high-value secrets, the strongest response is to destroy the keys when a sustained attack is detected, rendering the device’s protected data permanently unreadable. This is standard in payment hardware and secure elements, where the secret is worth more than the device.

The design requirement is that the wipe be fast and complete, clearing the actual key material and any derived copies before an attacker can freeze the state and extract it. A wipe that leaves a recoverable remnant, or that is slow enough to be interrupted, gives a false sense of protection, so the erase path itself deserves careful design and testing.
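As a rough sketch of an erase path for keys held in RAM, the code below overwrites the key material through a volatile pointer and then verifies the result. The `key_store_t` layout and the function names are hypothetical, and keys held in flash or a hardware key slot need the part's own erase mechanism instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Overwrite a buffer through a volatile pointer so the compiler cannot
   drop the stores as dead writes. */
static void secure_zero(void *buf, size_t len)
{
    volatile uint8_t *p = (volatile uint8_t *)buf;
    while (len--) *p++ = 0;
}

/* Hypothetical key store: the master key plus copies derived from it. */
typedef struct {
    uint8_t master_key[32];
    uint8_t session_key[16];
    uint8_t mac_key[16];
} key_store_t;

/* Clear the master key and every derived copy in one pass. */
void wipe_key_store(key_store_t *ks)
{
    secure_zero(ks, sizeof *ks);
}

/* Check that every byte is zero, so the caller can detect an
   interrupted or partial erase. Returns 1 when fully wiped. */
int keys_are_wiped(const key_store_t *ks)
{
    const volatile uint8_t *p = (const volatile uint8_t *)ks;
    uint8_t acc = 0;
    for (size_t i = 0; i < sizeof *ks; i++) acc |= p[i];
    return acc == 0;
}
```

Grouping the master key and its derived copies in one structure is a deliberate choice here: a single pass clears everything, and there is no separate copy to forget.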

Avoiding False Positives

A detector that fires on normal operation is worse than no detector, because it erodes trust and may be disabled in frustration. Calibrating thresholds against the device’s real operating envelope (supply variation, temperature range, and clock tolerance) is what keeps detection from triggering on legitimate conditions.

The balance is between sensitivity and reliability: tight enough to catch an attack, loose enough to ignore a noisy power supply. Testing across the full environmental range, and using the escalate-on-pattern approach rather than acting on every single event, is how production devices keep their fault detection both effective and quiet during normal use.
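Calibration against an operating envelope can be pictured as a window check with a guard band. The sketch below assumes a supply reading in millivolts. The `envelope_t` type, the `supply_is_anomalous` helper, and any numbers passed to it are illustrative, not values from a real datasheet.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical operating envelope for one monitored quantity. */
typedef struct {
    int32_t min_mv;     /* lowest supply seen in legitimate operation */
    int32_t max_mv;     /* highest supply seen in legitimate operation */
    int32_t guard_mv;   /* extra margin tolerated before flagging */
} envelope_t;

/* Flag a reading only when it falls outside the envelope plus guard band. */
bool supply_is_anomalous(const envelope_t *e, int32_t mv)
{
    return mv < e->min_mv - e->guard_mv || mv > e->max_mv + e->guard_mv;
}
```

The guard band is where the sensitivity trade-off lives: widen it and a noisy supply stays quiet, narrow it and shallower glitches are flagged.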

Testing the Detection

Detection that has never been tested is an assumption. The way to validate it is the attacker’s way: point a fault-injection rig at the device and confirm that the detectors fire, the counter escalates, and the response (lockout or wipe) actually happens. Anything that does not trigger is a blind spot to close.

# sweep glitch parameters and confirm the device detects and responds
for v in voltage_steps:
    inject_glitch(v)
    status = read_device_state()
    assert status in ('locked','wiped','fault_logged'), f"undetected at {v}"

A sweep that the device catches at every effective glitch setting confirms the detection works. A setting that produces a successful bypass without tripping any detector is exactly the gap an attacker would find, and finding it yourself, on the bench, is the entire point of testing the countermeasure rather than trusting it.

Detection as Part of a Layered Defense

Runtime fault detection is one layer, not a complete defense. It sits alongside glitch-resistant code that survives a single fault, hardware sensors that catch physical tampering, and the broader protections of encrypted firmware, secure key storage, and verified boot. Each layer covers what the others miss.

The combination is what frustrates a capable attacker. Glitch-resistant code means one fault is not enough; detection means many faults get noticed; the fault counter means the attempts needed to tune the attack are the attempts that lock the device. Together they turn fault injection from a patient afternoon into a race the device is designed to win.

What This Means for Your Design

For a product where physical attack is in the threat model, fault detection is worth designing in, because passive resistance alone lets an attacker keep trying indefinitely. Enable the hardware sensors the part provides, add software control-flow and data integrity checks, keep a persistent fault counter, and define a proportionate, tested response.

The throughline is that a device can do more than withstand a glitch: it can notice the attack and turn the attacker’s own persistence against them. A device that detects, counts, and responds has changed the question from whether a glitch is possible to whether the attacker can keep trying long enough to land one, which is a far better position to defend from.

Where This Fits

Assessing whether a device detects and responds to fault injection, and testing those countermeasures with a real glitch rig, is part of a hardware-focused product security assessment. If you want your fault-detection and tamper-response designed or validated, that is the kind of work we do at Berkner Tech.

