Functional Safety For An EV Battery Management System


ISO 26262:2018 · Road Vehicles — Functional Safety


A step-by-step engineering walkthrough of the full safety lifecycle for a battery management system — from defining the item, through hazard analysis, safety concepts and analyses, to verification, validation, and the safety case. With diagrams, worked examples, and reference tables throughout.

Item: EV BMS · Standard: ISO 26262 (2018) · Typical target: ASIL C / D · Coverage: Concept → Production
0

Orientation

Foundations — the V-model and ASIL

Before the individual work products make sense, two ideas hold the whole standard together: the safety lifecycle (drawn as a V), and the ASIL — the risk class that decides how much rigour each requirement gets.

How to read this guide — follow one hazard

The whole lifecycle is easiest to learn by watching a single hazard travel through it. Throughout, look for the ⛓ golden thread marker: it follows cell overcharge → SG1 from the moment it’s identified, through its safety goal, requirements, mechanisms, analyses, and finally the test that proves it’s handled. If you only follow one thing, follow that.

ISO 26262 organises automotive safety work as a V-model. The left arm goes top-down: you start with the whole item and refine it step by step into a concept, then technical requirements, then a design you can build. The bottom is implementation. The right arm comes back up: each level you specified on the left is verified by a matching level of integration and test on the right, ending in validation of the complete item in the vehicle. Every box on the left has a partner on the right.

[Diagram: the V-model. Left arm, refinement & specification (top-down): Item Definition → HARA · Safety Goals → Functional Safety Concept · FSRs → Technical Safety Concept · TSRs → System / HW / SW Architecture & Design. Bottom: Implementation (HW build · SW coding). Right arm, integration & verification (bottom-up): Unit & HW/SW Integration & Test → System Integration & Test → Safety Validation (vehicle level) → Release · Safety Case (production). Each right-arm level verifies its left-arm partner.]
FIG. 1  The ISO 26262 safety lifecycle. Left arm refines the item into a buildable design; right arm verifies each level against its specification, ending in vehicle-level validation.

The standard is split into parts

You don’t apply all of ISO 26262 at once — different parts own different phases. For a BMS, these are the ones you live in:

ISO 26262 parts most relevant to a BMS programme
Part | Title | What it gives the BMS work
Part 2 | Management of functional safety | The Safety Plan, roles, and the confirmation measures (reviews, audit, assessment) that close out the Safety Case.
Part 3 | Concept phase | Item Definition, HARA, Safety Goals, and the Functional Safety Concept (FSRs).
Part 4 | System level | Technical Safety Concept (TSRs), system integration, and safety validation.
Part 5 | Hardware level | Hardware architectural metrics (SPFM, LFM) and the PMHF: the quantitative evidence from FMEDA.
Part 6 | Software level | Software safety requirements, architecture, unit design and verification for the BMS firmware.
Part 8 | Supporting processes | Requirements management, configuration, change, tool qualification, and the DIA between partners.
Part 9 | ASIL-oriented analyses | ASIL decomposition, criteria for coexistence, and dependent-failure analysis (DFA).

ASIL — how much rigour a requirement earns

The Automotive Safety Integrity Level is the heart of the standard. Every hazardous event is rated, and that rating flows down to every requirement, mechanism, and test that addresses it. There are four ASILs, A (lowest) to D (highest), plus QM for events that need no ASIL, which makes five classes in all. A higher class means more risk, and with it more demanding methods, more independence in reviews, and tighter hardware metrics.

[Diagram: scale QM → A → B → C → D. QM: no safety requirement; A: lowest; D: most stringent. Increasing risk → increasing rigour.]
FIG. 2  The five integrity levels. QM (Quality Management) means standard engineering quality is enough — no ASIL requirements apply. A BMS overcharge hazard typically lands at the top end, ASIL C or D.
Key terms used throughout

Item: the system being assessed (here, the BMS).
Safety goal: a top-level safety requirement, carrying an ASIL.
Safe state: an operating mode with no unreasonable risk (for a BMS, often “high-voltage disconnected”).
Fault → Error → Failure: a fault is the cause, an error the wrong internal state, a failure the loss of correct function at the boundary.

How a fault becomes a hazard

This three-word chain — fault → error → failure — is the mental model behind every safety mechanism. A fault is a root cause (a broken wire). It produces an error, a wrong internal state (a cell reads false-low). The error becomes a failure when the function is lost at the boundary (a real overvoltage goes undetected). The entire job of a safety mechanism is to break this chain — to catch the error before it becomes a failure, and steer the system to a safe state.

[Diagram: FAULT (root cause, “broken cell-sense wire”) → ERROR (wrong internal state, “cell reads false-low”) → FAILURE (function lost at boundary, “overvoltage not detected”). A safety mechanism that detects the error routes the system to the SAFE STATE; if the error is not caught, the failure leads to the hazard.]
Fault → Error → Failure.  A safety mechanism’s whole purpose is to interrupt this chain — usually at the error stage — before the function fails at the system boundary. Everything in Steps 5–7 exists to make that interruption reliable.
1

ISO 26262-3 · Clause 5

Item Definition

Everything downstream is built on this. The Item Definition draws a precise boundary around the BMS — what it does, where it stops, what it connects to, and the conditions it must work in. Get the boundary wrong and the hazard analysis misses things.

What goes into it

The Item Definition answers four questions about the BMS: its functions, its boundary and interfaces, its operating modes and environment, and the assumptions and dependencies placed on the rest of the vehicle. The functions are the obvious starting point.

Core BMS functions and where the boundary sits
Function | Purpose | Boundary
Cell voltage monitoring | Measure every cell / cell-group voltage to keep each within its safe window. | Owns sensing; relies on the AFE ICs.
Temperature monitoring | Track cell and module temperatures to prevent thermal limits being exceeded. | Owns sensing & thresholds.
Current measurement | Measure pack current for protection, SoC, and power limits. | Owns the shunt / sensor reading.
SoC / SoH estimation | Estimate state of charge and health to inform limits and the VCU. | Owns the algorithm.
Contactor control | Open / close the main and precharge contactors to connect or isolate the HV bus. | Owns the command & diagnostics.
Cell balancing | Equalise cell charge to maximise usable capacity and life. | Owns balancing control.
Protection & limits | Enforce voltage / current / temperature limits and request a safe state on violation. | Core safety function.
Communication | Report status and limits to the VCU / charger over CAN. | Owns the BMS side of the bus.

The BMS in its system context

A boundary is easiest to see drawn out. Below is a representative BMS architecture — the signal/low-voltage domain in teal, the high-voltage power path in orange. This same diagram becomes the canvas later, when safety mechanisms get mapped onto the blocks.

[Diagram: Battery Pack (cell modules + thermistors) feeding the Cell Monitoring ICs (AFE) ×N (voltage · temp · balancing), which report over an isoSPI link to the BMS Master Controller (MCU · protection logic · SoC/SoH · state manager). A Current Sensor (shunt / hall) and an Isolation Monitor (insulation resistance) also feed the controller. The controller drives the Contactor Box (main + contactor, main − contactor, precharge + resistor; control + feedback) on the HV bus to the inverter, and talks over CAN to the VCU / charger (status · limits). An HV Interlock (HVIL) connector loop is also shown. Legend: high-voltage power path · signal / low-voltage & control · controller (the safety brain).]
FIG. 3  Representative BMS architecture. The master controller reads cell data (via the AFE ICs), current, and isolation, and commands the contactors that connect or isolate the high-voltage bus.

Operating modes, environment, assumptions

The definition also pins down the modes the BMS runs in — off / sleep, standby, drive, charging (AC/DC), balancing, fault / limp — and the environment: temperature range, vibration, EMC, and the supply it depends on. Finally it records assumptions on the rest of the vehicle (for example, that the VCU will reduce torque demand when the BMS signals a power limit). These assumptions become assumed requirements on other items and must be agreed, not just hoped for.

Why this step is load-bearing

The HARA can only find hazards for functions and failure modes that the Item Definition put on the table. A function left out here (say, an undocumented fast-charge mode) is a hazard that never gets analysed. An incomplete Item Definition is a common root cause of safety gaps, and it is among the cheapest to prevent.

2

ISO 26262-3 · Clause 6

Hazard Analysis & Risk Assessment

The HARA is where risk gets named and measured. You imagine the BMS failing in real driving situations, judge how bad each outcome could be, and turn that judgement into an ASIL and a safety goal. It is the single most consequential analysis in the whole programme.

The method in three moves

For each function you consider what happens when it fails — too high, too low, stuck, late, missing. You place that failure in an operational situation (driving on a motorway, charging overnight, parked) to form a hazardous event. Then you rate that event on three axes, and the three ratings together fix the ASIL.

[Diagram: Hazardous event (situation + failure) is rated on three axes: S, Severity, S0 … S3 (harm); E, Exposure, E0 … E4 (how often); C, Controllability, C0 … C3 (can the driver act?). The three ratings feed a lookup table, ASIL = ƒ(S, E, C), which yields QM, A, B, C or D.]
FIG. 4  The determination. Severity, Exposure and Controllability are each rated independently, then combined through the standard lookup to yield one ASIL (or QM) per hazardous event.

What the three axes mean

S — Severity

How bad is the harm?

S0 none · S1 light/moderate · S2 severe, survival probable · S3 life-threatening or fatal.

E — Exposure

How often is the situation faced?

E0 incredible · E1 very low · E2 low · E3 medium · E4 high (e.g. normal driving).

C — Controllability

Can the driver avoid harm?

C0 controllable in general · C1 simply · C2 normally · C3 difficult or uncontrollable.

Result

One ASIL per event

The worst realistic combination of S, E and C sets the level. The most severe BMS hazards reach ASIL D.

The ASIL determination table

This is the lookup behind FIG. 4. Read it as: pick the Severity block, then the Exposure row, then the Controllability column. Note how many combinations land at QM — risk has to be both likely and hard to control before it climbs.

ASIL determination — Severity × Exposure × Controllability
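Severity | Exposure | C1 | C2 | C3
S1 (light / moderate) | E1 | QM | QM | QM
S1 | E2 | QM | QM | QM
S1 | E3 | QM | QM | A
S1 | E4 | QM | A | B
S2 (severe, survivable) | E1 | QM | QM | QM
S2 | E2 | QM | QM | A
S2 | E3 | QM | A | B
S2 | E4 | A | B | C
S3 (life-threat. / fatal) | E1 | QM | QM | A
S3 | E2 | QM | A | B
S3 | E3 | A | B | C
S3 | E4 | B | C | D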

S0 (no harm) yields QM in every case and is omitted. E0 (incredible) likewise yields QM and is omitted.
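
The lookup can be written down directly as data. The sketch below is illustrative only: it encodes the table above as a Python dictionary and re-derives the ASILs used in this guide's worked HARA. The function name `determine_asil` and the integer encoding of S, E and C are assumptions made for the example, and treating C0 as QM follows the same logic as the S0 and E0 note.

```python
# ASIL determination lookup, transcribed from the table above.
# Key: (severity, exposure); value: results for C1, C2, C3 in that order.
ASIL_TABLE = {
    (1, 1): ("QM", "QM", "QM"),
    (1, 2): ("QM", "QM", "QM"),
    (1, 3): ("QM", "QM", "A"),
    (1, 4): ("QM", "A", "B"),
    (2, 1): ("QM", "QM", "QM"),
    (2, 2): ("QM", "QM", "A"),
    (2, 3): ("QM", "A", "B"),
    (2, 4): ("A", "B", "C"),
    (3, 1): ("QM", "QM", "A"),
    (3, 2): ("QM", "A", "B"),
    (3, 3): ("A", "B", "C"),
    (3, 4): ("B", "C", "D"),
}


def determine_asil(s: int, e: int, c: int) -> str:
    """Return 'QM', 'A', 'B', 'C' or 'D' for S0-S3, E0-E4, C0-C3."""
    if not (0 <= s <= 3 and 0 <= e <= 4 and 0 <= c <= 3):
        raise ValueError("expected S in 0..3, E in 0..4, C in 0..3")
    # S0, E0 and C0 are not in the table: they always give QM.
    if s == 0 or e == 0 or c == 0:
        return "QM"
    return ASIL_TABLE[(s, e)][c - 1]


# The S/E/C ratings used in this guide's illustrative HARA.
for goal, sec in {"SG1": (3, 3, 3), "SG2": (3, 4, 3), "SG4": (2, 2, 3)}.items():
    print(goal, determine_asil(*sec))
```

Keeping the table as data rather than as a formula makes it easy to review cell by cell against the standard.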

A worked BMS HARA

Putting it together for the BMS gives entries like these. Each row produces one safety goal, carrying the ASIL of its hazardous event — these goals drive everything from Step 3 onward.

Illustrative HARA entries for an EV BMS
Hazard (malfunction) | Situation | Potential effect | S / E / C | ASIL | Safety goal
Cell overvoltage (overcharge) | DC fast charging, high SoC | Cell damage → thermal runaway, fire | S3/E3/C3 | C | SG1 Prevent any cell exceeding its upper voltage limit.
Cell overtemperature | Driving / charging | Thermal runaway, fire, propagation | S3/E4/C3 | D | SG2 Prevent cell temperature exceeding its safe limit.
Pack overcurrent | Hard acceleration / external short | Overheating, arc, fire | S3/E3/C3 | C | SG3 Limit or interrupt pack current on overcurrent.
Cell over-discharge (undervoltage) | Driving at low SoC | Cell degradation, latent failure | S2/E2/C3 | A | SG4 Prevent any cell falling below its lower voltage limit.
Unintended contactor opening | Driving | Sudden loss of propulsion | S3/E4/C2 | C | SG5 Avoid unintended HV disconnection while driving.
Contactor stuck closed | Fault / crash needing isolation | HV present when it must be removed | S3/E2/C3 | B | SG6 Ensure HV can be disconnected on demand.
Reading note

The S/E/C values above are illustrative, to show the mechanics. A real HARA argues each rating with evidence (crash data, field exposure, controllability studies) and is reviewed independently — the ASIL it fixes will shape cost and effort for the entire project, so it is never assigned casually.

Worked example — reading two of the ratings

The table can feel like magic until you see the reasoning. Here is how two rows were argued, axis by axis — this is the kind of justification a real HARA records for every entry.

Cell overtemperature → ASIL D

S3 — a thermal runaway can cause fire and is potentially fatal, the top severity. E4 — the triggering situations (driving, charging) are part of everyday use, so exposure is high. C3 — a driver cannot sense or stop a cell-level thermal event, so it is uncontrollable. S3 + E4 + C3 sits in the single worst cell of the table → D.

Cell overcharge → ASIL C

S3 — overcharge can also lead to thermal runaway. E3 — it is tied to high-SoC fast charging, frequent but not constant, so medium rather than high exposure. C3 — again outside the driver’s control. Dropping exposure from E4 to E3 moves the result down one level → C. One axis changed the whole effort budget.

⛓ Golden thread — our hazard enters

From here on we follow cell overcharge. It has just become SG1 — “Prevent any cell exceeding its upper voltage limit,” ASIL C. Watch it pick up requirements, a mechanism, an analysis, and finally a test.

3

ISO 26262-3 · output of the HARA

Safety Goals, Safe States & FTTI

A safety goal is the top of the requirement tree — one per hazardous event, written in the vehicle’s language, not the component’s. To be testable it needs two companions: the safe state it drives toward, and the FTTI, the time budget it has to get there.

Safe state

The safe state is the condition the BMS moves to when it can no longer guarantee correct operation — a state with no unreasonable risk. For most BMS hazards the safe state is high-voltage disconnected (open the contactors). Sometimes it is gentler — limit current or limp-home — because an abrupt disconnect can create a new hazard (loss of propulsion). Choosing it is a safety decision in itself.

FTTI — the time budget

A fault does not become a hazard instantly. The Fault Tolerant Time Interval is the window from a fault occurring to the hazard happening, if nothing intervenes. The BMS must detect the fault and react to reach the safe state inside that window — with margin to spare.

[Diagram: timeline. Fault occurs → (FDTI, detect) → fault detected → (FRTI, react) → safe state reached → (safety margin) → hazard, if no action. The FTTI spans from the fault to the hazard; detection + reaction must finish before the hazard: FDTI + FRTI < FTTI.]
FIG. 5  The timing contract. The fault detection time interval and the fault reaction time interval together (FDTI + FRTI) must fit inside the FTTI, leaving margin. This budget drives how fast monitoring must run and how quickly contactors must open.
Safety goals with safe state and (indicative) FTTI
Goal | Safety goal | ASIL | Safe state | FTTI (indicative)
SG1 | Prevent cell overvoltage | C | Stop charge current, open contactors | ~ 1 s
SG2 | Prevent cell overtemperature | D | Open contactors, controlled shutdown | seconds
SG3 | Limit / interrupt overcurrent | C | Open contactors | ~ 10–100 ms
SG4 | Prevent cell over-discharge | A | Limit load, then disconnect | seconds
SG5 | Avoid unintended HV disconnect | C | Stay connected; warn & derate | driver-paced
SG6 | Ensure HV disconnect on demand | B | Redundant disconnect path | on demand

FTTI figures are illustrative — actual values come from physical analysis of how fast each fault propagates to harm.

FTTI with real numbers

The abstract timing diagram lands better with a concrete case. Take an external short (SG3, overcurrent), where the budget is tight. Suppose analysis shows the pack reaches a dangerous condition about 100 ms after the short. The BMS must detect and react well inside that:

[Diagram: timeline for the external-short case. t = 0: short occurs → 5 ms: overcurrent detected (FDTI 5 ms) → 25 ms: contactors open (FRTI 20 ms) → 100 ms: hazard. Margin 75 ms. FTTI = 100 ms · FDTI 5 ms + FRTI 20 ms = 25 ms ✓ comfortably inside budget.]
FTTI, made concrete.  Detection (5 ms) plus reaction (20 ms) uses a quarter of the 100 ms budget, leaving 75 ms of margin. If the monitoring loop were too slow (say it sampled current only every 50 ms and needed two or three consecutive samples to confirm a fault), detection alone could consume the whole budget. This is why safety monitoring runs fast. (Values illustrative.)
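
The same budget check is easy to automate. The sketch below is a minimal illustration, not a prescribed method: `TimingBudget` and `worst_case_detection_ms` are hypothetical names, and the rule that a fault must be seen on several consecutive samples before it is confirmed is an assumed debounce scheme.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimingBudget:
    ftti_ms: float  # fault tolerant time interval
    fdti_ms: float  # fault detection time interval
    frti_ms: float  # fault reaction time interval

    @property
    def handling_ms(self) -> float:
        """Total time from fault to safe state (FDTI + FRTI)."""
        return self.fdti_ms + self.frti_ms

    @property
    def margin_ms(self) -> float:
        return self.ftti_ms - self.handling_ms

    def is_met(self, min_margin_ms: float = 0.0) -> bool:
        """True if FDTI + FRTI fits inside the FTTI with more than the given margin."""
        return self.margin_ms > min_margin_ms


def worst_case_detection_ms(sample_period_ms: float, confirm_samples: int) -> float:
    """Worst-case detection latency for an assumed debounce scheme.

    The fault may appear just after a sample was taken, so the first
    observation can be a full period late; each extra confirmation
    sample adds one more period.
    """
    return sample_period_ms * confirm_samples


# The worked example: 5 ms detection, 20 ms reaction, 100 ms FTTI.
fast = TimingBudget(ftti_ms=100, fdti_ms=5, frti_ms=20)
print(fast.handling_ms, fast.margin_ms, fast.is_met())

# A slow loop: 50 ms sampling with 3 confirmation samples (assumed values).
slow = TimingBudget(100, worst_case_detection_ms(50, 3), 20)
print(slow.handling_ms, slow.is_met())
```

With the slow loop, detection alone takes 150 ms in the worst case, so the check fails before the reaction time is even counted.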

What the BMS actually does — its safety states

Safe states aren’t just “on” or “off”. A well-designed BMS degrades in stages: it keeps running normally until a limit is approached, warns and reduces power in a middle band, and only disconnects when a true safety limit is crossed. Drawing it as a state machine makes the logic obvious.

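[Diagram: state machine with three states. NORMAL (full power, within all limits) → WARNING / DERATE (reduce current & power, alert driver / VCU) when a limit is approached, and back to NORMAL when the condition clears. WARNING / DERATE → SAFE STATE (open contactors · HV off, latched) when a limit is exceeded. A severe fault skips straight from NORMAL to the SAFE STATE (within the FTTI). The SAFE STATE stays latched until service / power cycle.]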
BMS safety state machine.  Graceful degradation in the middle band avoids unnecessary disconnects (which are themselves a hazard — see SG5), while a severe fault routes straight to the safe state within the FTTI. The safe state is latched so the system can’t quietly re-enable itself.
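
The transitions can be sketched in a few lines. This is an illustration under stated assumptions, not production logic: the two voltage thresholds are made-up values, only the cell-voltage input is modelled, and a real design would add hysteresis and debounce around the warning band.

```python
from enum import Enum


class BmsState(Enum):
    NORMAL = "normal"
    DERATE = "warning / derate"
    SAFE_STATE = "safe state"


# Hypothetical thresholds in volts, for illustration only.
WARN_V = 4.15   # limit approached: start derating
LIMIT_V = 4.25  # safety limit: go to the safe state


def next_state(state: BmsState, max_cell_v: float, severe_fault: bool = False) -> BmsState:
    """One evaluation step of the safety state machine."""
    # The safe state is latched: only service or a power cycle leaves it.
    if state is BmsState.SAFE_STATE:
        return BmsState.SAFE_STATE
    # A severe fault or a crossed limit goes straight to the safe state.
    if severe_fault or max_cell_v > LIMIT_V:
        return BmsState.SAFE_STATE
    if max_cell_v >= WARN_V:
        return BmsState.DERATE
    return BmsState.NORMAL


state = BmsState.NORMAL
history = []
for v in (4.05, 4.18, 4.10, 4.30, 4.00):
    state = next_state(state, v)
    history.append(state.name)
print(history)
```

The last reading is back inside the limits, yet the state stays at the safe state: that is the latch at work.
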
4

ISO 26262-3 · Clause 7

Functional Safety Concept & FSRs

The safety goals say what must not happen. The Functional Safety Concept says how, in functional terms — without yet committing to specific hardware. It breaks each goal into functional safety requirements (FSRs) and allocates them to elements of the architecture.

From a goal to its FSRs

A functional safety concept typically covers four things per goal: detecting the fault, reacting to reach the safe state within the FTTI, warning the driver or degrading gracefully, and ensuring the detection itself is dependable (so a single sensing fault can’t blind it). Taking SG1 (prevent overcharge, ASIL C):

FSRs derived from SG1 — prevent cell overvoltage (ASIL C)
FSR | Functional safety requirement | Allocated to | ASIL
FSR 1.1 | Detect when any cell voltage exceeds the safe upper threshold. | Cell monitoring + controller | C
FSR 1.2 | On detection, command the safe state (stop charge / open contactors) within the FTTI. | Controller + contactors | C
FSR 1.3 | Notify the driver and charger, and record the fault. | Controller + CAN | C
FSR 1.4 | Ensure voltage detection tolerates a single measurement-chain fault (redundancy / diagnostics). | Cell monitoring path | C

ASIL decomposition — splitting the burden

A high ASIL is expensive to meet in one element. ISO 26262-9 allows an ASIL to be split across two sufficiently independent elements, each carrying a lower ASIL while the combination still meets the original. The catch is the word independent: you must show, with a dependent-failure analysis, that no common-cause or cascading failure can take out both elements at once, or the whole scheme is undone.

[Diagram: Safety goal (ASIL D) decomposed into Element A, ASIL B(D), and Element B, ASIL B(D); independence between the two is required (DFA).]
FIG. 6  One valid decomposition of ASIL D into two ASIL B(D) elements. The “(D)” records the origin. Other splits are allowed (e.g. C(D)+A(D), D(D)+QM(D)) — all require demonstrated independence between the two paths.
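
A decomposition proposal can be sanity-checked against the permitted schemes. The sketch below lists the pairings as commonly summarised from ISO 26262-9; the function names are hypothetical, and the check covers only the ASIL arithmetic. It says nothing about independence, which still has to be shown by analysis.

```python
# Permitted ASIL decomposition pairs, keyed by the original ASIL.
ALLOWED_DECOMPOSITIONS = {
    "D": [("C", "A"), ("B", "B"), ("D", "QM")],
    "C": [("B", "A"), ("C", "QM")],
    "B": [("A", "A"), ("B", "QM")],
    "A": [("A", "QM")],
}


def is_valid_decomposition(original: str, first: str, second: str) -> bool:
    """True if (first, second) is a permitted split of `original`, in either order."""
    pairs = ALLOWED_DECOMPOSITIONS.get(original, [])
    return (first, second) in pairs or (second, first) in pairs


def label(level: str, original: str) -> str:
    """Write a decomposed level with its origin, e.g. 'B(D)'."""
    return f"{level}({original})"


print(is_valid_decomposition("D", "B", "B"), label("B", "D"))
print(is_valid_decomposition("D", "B", "A"))  # B + A does not add up to D
```
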
5

ISO 26262-4 · Clause 6

Technical Safety Concept & TSRs

Now the concept meets silicon. The Technical Safety Concept refines each functional requirement into technical safety requirements (TSRs) on specific hardware and software, names the safety mechanisms that realise them, and defines the hardware–software interface (HSI) where the two meet.

From FSR to TSR

Where an FSR said “detect overvoltage, tolerant to a single fault”, a TSR says exactly how: two independent measurement paths, compared every cycle, with the AFE running internal self-tests at a defined interval. Continuing the SG1 / overcharge thread:

TSRs realising the SG1 functional concept
TSR | From | Technical safety requirement | Safety mechanism | ASIL
TSR 1.1 | FSR 1.1/1.4 | Provide two independent cell-voltage measurements per cell-group; compare each cycle; flag deviation beyond tolerance. | Redundant sensing + plausibility | C
TSR 1.2 | FSR 1.1 | AFE runs internal self-tests (reference, open-wire, ADC) each diagnostic cycle within the FDTI budget. | Diagnostic monitoring | C
TSR 1.3 | FSR 1.2 | On an overvoltage flag, the controller de-energises the main contactors within the reaction time so FDTI + FRTI < FTTI. | Safe-state actuation | C
TSR 1.4 | FSR 1.2 | Confirm contactor state via auxiliary feedback; on mismatch, escalate to the redundant disconnect path. | Contactor diagnostics | C
TSR 1.5 | all | An independent watchdog supervises controller liveness and timing; loss of service forces the hardware safe state. | Windowed watchdog | C
The HSI — easy to forget, expensive to miss

The hardware–software interface specification records exactly what each diagnostic register means, which signals are safety-relevant, and the timing the software may assume. When hardware and software are built by different teams (or different companies), the HSI is the contract that keeps a safety mechanism from falling through the gap between them.

6

Realising the TSRs

Safety Mechanisms — a deep dive

Safety mechanisms are the working parts of the concept — the things that actually catch faults and steer the system to safety. Think of them as layers of defence: a fault has to slip past every one to cause harm.

[Diagram: a fault passes through successive layers: Sensor inputs → Range & plausibility → Self-diagnostics → Watchdog supervision → Safe-state actuator → reach safe state. Any layer that catches the fault routes the system to its safe state.]
FIG. 7  Defence in depth. A fault is caught by whichever layer detects it first; detection then drives the safe-state actuator (open contactors). The more independent the layers, the lower the chance a fault passes through all of them.

The four mechanism families

Redundant sensing

Two eyes, not one

An independent second channel measures the same quantity. Agreement builds confidence; disagreement beyond tolerance is itself a detected fault.

Diagnostic monitoring

The system tests itself

Periodic self-tests — reference checks, open-wire detection, RAM/ROM tests, comms CRC — find internal faults before they can combine into a failure.

Plausibility checks

Does this reading make sense?

Values are checked against physics and against each other (current vs. voltage change vs. model). Implausible combinations flag a fault no single-signal range check would catch.

Watchdog supervision

Is the brain still alive?

An independent timer expects to be serviced within a time window. Too early or too late — a sign of a hung or runaway controller — forces a hardware safe state without the software’s help.

Mechanism → fault class → typical diagnostic coverage
Mechanism | Fault class it detects | How it works | Typical DC
Redundant sensing | Sensor / channel faults, drift, stuck values | Cross-check an independent second measurement | high (→99%)
Diagnostic monitoring | Internal IC, memory, communication faults | Scheduled self-tests & integrity checks | medium–high
Plausibility checks | Implausible or cross-inconsistent signals | Validate against physics & other signals | medium
Windowed watchdog | Controller hang, lockup, timing faults | Independent in-window service check | high (its class)

“DC” is diagnostic coverage — the fraction of a component’s dangerous failure rate that the mechanism detects. These figures feed directly into the FMEDA in the next step.

How redundant sensing actually catches a fault

“Redundant sensing” sounds abstract until you see the comparison at its heart. Two independent paths measure the same cell. If they agree, the value is trusted. If they drift apart by more than a set tolerance, that disagreement is the fault signal — no single broken path can quietly report a false-low voltage anymore.

[Diagram: one cell group measured by two independent paths: path A (Sensor A → ADC A) and path B (Sensor B → ADC B). Both feed a comparison of A vs B, |A − B|, and a plausibility gate: |A − B| ≤ tol? If they agree, trust the value; if they disagree, fault → safe state.]
Cross-checked redundancy.  The comparison turns a hidden sensing fault into a visible, detectable event. The same pattern (compare two independent things, treat disagreement as a fault) underlies most diagnostic coverage in a BMS.
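
In code the comparison is only a few lines. The sketch below is a minimal illustration: the function name, the status strings and the numeric values are assumptions, and taking the higher of two agreeing readings is one conservative choice for an overvoltage check, not the only one.

```python
from typing import Optional, Tuple


def cross_check(v_a: float, v_b: float, tol_v: float,
                upper_limit_v: float) -> Tuple[str, Optional[float]]:
    """Compare two independent readings of the same cell voltage.

    Returns (status, trusted_value). Disagreement beyond the tolerance is
    itself a detected fault, so no value is trusted in that case.
    """
    if abs(v_a - v_b) > tol_v:
        return ("FAULT_DISAGREE", None)
    # Readings agree: use the higher one so the overvoltage check errs on the safe side.
    value = max(v_a, v_b)
    if value > upper_limit_v:
        return ("OVERVOLTAGE", value)
    return ("OK", value)


# Hypothetical values: 20 mV tolerance, 4.25 V upper limit.
print(cross_check(4.10, 4.11, tol_v=0.02, upper_limit_v=4.25))  # paths agree
print(cross_check(3.20, 4.30, tol_v=0.02, upper_limit_v=4.25))  # path A reads false-low
print(cross_check(4.30, 4.31, tol_v=0.02, upper_limit_v=4.25))  # real overvoltage
```

In the second call a single-path design reading only path A would have seen a healthy 3.20 V; the cross-check turns that hidden fault into a detected one.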

How a windowed watchdog catches a hung controller

The watchdog is independent hardware that expects the controller to “check in” — but only inside a specific time window. Check in too late (the CPU has hung) or too early (the CPU is looping wildly) and the watchdog fires, forcing the safe state without trusting the software at all.

[Diagram: timeline of one watchdog period: closed (too early) → OPEN window → closed (too late). Service that arrives too early ✗ or is missed ✗ makes the watchdog force the hardware safe state; on-time service ✓ resets the timer.]
Windowed watchdog.  A simple “alive?” check isn’t enough — a runaway CPU could service it constantly. Requiring service inside a window catches both a stalled controller (too late) and a malfunctioning one (too early).
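
A small software model shows the window logic. It is a simulation for illustration only: a real windowed watchdog is independent hardware, and the class name, method names and window times below are assumptions.

```python
class WindowedWatchdog:
    """Toy model of a windowed watchdog; times are in milliseconds."""

    def __init__(self, window_open_ms: float, window_close_ms: float):
        if not 0 <= window_open_ms < window_close_ms:
            raise ValueError("need 0 <= window_open_ms < window_close_ms")
        self.window_open_ms = window_open_ms
        self.window_close_ms = window_close_ms
        self.last_service_ms = 0.0
        self.tripped = False  # True means the hardware safe state was forced

    def service(self, now_ms: float) -> bool:
        """Controller check-in. Returns True only if it lands inside the window."""
        if self.tripped:
            return False
        elapsed = now_ms - self.last_service_ms
        if elapsed < self.window_open_ms or elapsed > self.window_close_ms:
            self.tripped = True  # too early or too late
            return False
        self.last_service_ms = now_ms
        return True

    def poll(self, now_ms: float) -> bool:
        """Watchdog-side check: trips if the window closed without any service."""
        if now_ms - self.last_service_ms > self.window_close_ms:
            self.tripped = True
        return self.tripped


# Window open between 8 ms and 12 ms after the previous service (assumed values).
runaway = WindowedWatchdog(8, 12)
print(runaway.service(10), runaway.service(20), runaway.service(22))  # third is too early

hung = WindowedWatchdog(8, 12)
print(hung.service(10), hung.poll(25))  # no further service: the window closes
```
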
⛓ Golden thread — SG1 gets its mechanism

SG1’s “detect overvoltage” requirement is now realised by exactly the pattern above: two independent cell-voltage paths cross-checked, backed by AFE self-diagnostics. A single broken sense path becomes a detected disagreement — not a missed overvoltage. Next we prove that coverage with numbers.

7

ISO 26262-5 / -9

Safety Analyses — FMEA, FMEDA, FTA

Three analyses verify the concept from three directions. FMEA works upward from component failures, FMEDA turns that into hardware numbers, and FTA works downward from the hazard. Together they show the safety mechanisms actually cover what they claim to.

[Diagram: Hazard / safety-goal violation at the top of the tree; component failure modes (every part) at the bottom. FTA works top-down (deductive); FMEA works bottom-up (inductive); they meet in the middle to give a consistent picture of failure. FMEDA adds the numbers: rates · coverage → SPFM / LFM / PMHF.]
Three views of the same failures.  FTA reasons downward from the hazard; FMEA reasons upward from parts; where they agree you have confidence, where they don’t you’ve found a gap. FMEDA then attaches failure rates and coverage to turn that picture into the metrics an ASIL demands.

FMEA — bottom-up, qualitative

A Failure Mode and Effects Analysis walks every component, asks how it can fail, and traces each failure mode up to its effect on the system and the safety goal. Crucially it records the safety mechanism that catches each one — exposing any failure mode that has no defence.

FMEA extract — BMS measurement & actuation chain
Component | Failure mode | System effect | Detecting mechanism | Sev.
Cell voltage channel | Reads low (short / stuck) | True overvoltage masked → SG1 at risk | Redundant channel + open-wire diag. | S3
Temperature sensor | Open circuit | Overtemperature masked → SG2 at risk | Open-wire detection + redundancy | S3
Current sensor | Gain / offset error | Overcurrent under-read → SG3 at risk | Plausibility vs. voltage & model | S3
Main contactor | Welds closed | Cannot isolate HV → SG6 violated | Aux feedback + redundant disconnect | S3
Controller (MCU) | Software hang | Detection & reaction stop → many SGs | Independent windowed watchdog | S3

FMEDA — bottom-up, quantitative

The FMEDA extends the FMEA with numbers: each component’s failure rate, how its failures split between safe and dangerous, and what fraction of the dangerous ones the diagnostics catch. Every fault lands in one of a few classes, and those classes feed the hardware metrics.

[Diagram: fault classification tree. All hardware faults (failure rate λ) split into safe faults (no effect on safety goal) and “dangerous” faults that can violate a safety goal. Dangerous faults split into: single-point + residual (no / partial coverage → addressed by SPFM); multi-point detected / perceived (safe); multi-point latent (undetected → addressed by LFM). SPFM measures coverage of single-point & residual faults · LFM measures coverage of latent faults · PMHF is the residual dangerous-failure rate.]
FIG. 8  How the FMEDA classifies every fault. The split between detected and undetected dangerous faults is what the diagnostics buy you — and what the two architectural metrics measure.
Hardware architectural metric targets by ASIL (ISO 26262-5)
Metric | What it measures | ASIL B | ASIL C | ASIL D
SPFM | Single-Point Fault Metric: coverage of single-point & residual faults | ≥ 90% | ≥ 97% | ≥ 99%
LFM | Latent-Fault Metric: coverage of latent multi-point faults | ≥ 60% | ≥ 80% | ≥ 90%
PMHF | Probabilistic Metric for random HW Failures: residual dangerous rate | < 100 FIT | < 100 FIT | < 10 FIT

1 FIT = 1 failure per 10⁹ hours. ASIL A has no hardware architectural metric target. These are the numbers a BMS hardware design must hit to claim its ASIL.
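
A stripped-down calculation shows how FMEDA rows roll up into the three metrics. Everything here is a sketch under loud assumptions: the component rows and their rates are invented, the per-row model (one safe fraction, one coverage figure against goal violation, one against latency) is a simplification, and PMHF is approximated by the residual rate alone, ignoring multi-point contributions. The SPFM and LFM formulas follow the usual ISO 26262-5 definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FmedaRow:
    name: str
    fit: float            # total failure rate in FIT
    safe_fraction: float  # share of failures that cannot violate the safety goal
    dc_residual: float    # diagnostic coverage against goal violation (0..1)
    dc_latent: float      # diagnostic coverage against latent faults (0..1)

    @property
    def residual_fit(self) -> float:
        """Dangerous failures that slip past the diagnostics."""
        return self.fit * (1 - self.safe_fraction) * (1 - self.dc_residual)

    @property
    def latent_fit(self) -> float:
        """Covered dangerous failures whose presence stays hidden."""
        covered = self.fit * (1 - self.safe_fraction) * self.dc_residual
        return covered * (1 - self.dc_latent)


# Targets from the table above: (SPFM min, LFM min, PMHF max in FIT).
TARGETS = {"B": (0.90, 0.60, 100.0), "C": (0.97, 0.80, 100.0), "D": (0.99, 0.90, 10.0)}


def metrics(rows):
    total = sum(r.fit for r in rows)
    residual = sum(r.residual_fit for r in rows)
    latent = sum(r.latent_fit for r in rows)
    spfm = 1 - residual / total
    lfm = 1 - latent / (total - residual)
    pmhf_fit = residual  # first-order approximation only
    return spfm, lfm, pmhf_fit


def meets(rows, asil: str) -> bool:
    spfm, lfm, pmhf_fit = metrics(rows)
    t_spfm, t_lfm, t_pmhf = TARGETS[asil]
    return spfm >= t_spfm and lfm >= t_lfm and pmhf_fit < t_pmhf


# Invented example rows, not data from any real part.
rows = [
    FmedaRow("cell voltage channel", 50, 0.5, 0.99, 0.90),
    FmedaRow("controller (MCU)", 100, 0.5, 0.99, 0.90),
    FmedaRow("contactor driver", 20, 0.5, 0.90, 0.60),
]
spfm, lfm, pmhf_fit = metrics(rows)
print(f"SPFM {spfm:.2%}  LFM {lfm:.2%}  PMHF ~{pmhf_fit:.2f} FIT")
print("meets ASIL C:", meets(rows, "C"), " meets ASIL D:", meets(rows, "D"))
```

In this made-up example the weakest row, the contactor driver with 90% coverage, is what keeps the SPFM below the ASIL D threshold; that kind of pinpointing is the practical value of the FMEDA.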

FTA — top-down, deductive

A Fault Tree Analysis starts at the hazard — the top event — and works down through logic gates to the combinations of basic faults that could cause it. An OR gate means any input alone suffices; an AND gate means several must coincide. The smallest such combinations are the minimal cut sets, and a single-element cut set is a red flag: one fault, on its own, defeating a safety goal.

[Diagram: fault tree. TOP EVENT: cell overvoltage not prevented (SG1 violated) = OR (≥1) of “overvoltage not detected” (sensing chain) and “detected, not acted upon” (reaction chain). “Overvoltage not detected” = AND (&) of B1 (primary sense channel fault) and B2 (redundant channel + diagnostic fault). “Detected, not acted upon” = OR (≥1) of B3 (controller fault, with watchdog failure) and B4 (contactor welded). Legend: basic event · OR gate (any input) · AND gate (all inputs).]
FIG. 9  A fault tree for SG1. Minimal cut sets here are {B4}, {B3}, and {B1·B2}. B4 (contactor welded) is a single-element cut set — exactly why SG6 demands a redundant disconnect path, so that no lone fault removes the protection.
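
The cut sets can be derived mechanically from the gate structure. The sketch below encodes the SG1 tree as nested tuples and expands it; the representation and function names are assumptions made for the example, and the brute-force expansion is only suitable for small trees like this one.

```python
from itertools import product

# The SG1 fault tree: a node is a basic-event name or (gate, [children]).
SG1_TREE = ("OR", [
    ("AND", ["B1", "B2"]),  # overvoltage not detected
    ("OR", ["B3", "B4"]),   # detected, not acted upon
])


def cut_sets(node):
    """Return every cut set of the tree as a set of frozensets of basic events."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any child's cut set is enough on its own.
        return set().union(*child_sets)
    if gate == "AND":
        # One cut set from every child must occur together.
        return {frozenset().union(*combo) for combo in product(*child_sets)}
    raise ValueError(f"unknown gate: {gate}")


def minimal_cut_sets(node):
    """Drop any cut set that strictly contains another one."""
    all_sets = cut_sets(node)
    return {s for s in all_sets if not any(other < s for other in all_sets)}


mcs = minimal_cut_sets(SG1_TREE)
for s in sorted(mcs, key=lambda s: (len(s), sorted(s))):
    print(sorted(s))
single_points = sorted(next(iter(s)) for s in mcs if len(s) == 1)
print("single-element cut sets:", single_points)
```
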

The concept people find hardest — latent faults

Single-point faults are intuitive: one fault, straight to a hazard. Latent faults are subtler and are the whole reason the LFM exists. A latent fault is a hidden fault, usually in a safety mechanism itself, that causes no harm on its own. It only bites when a second fault arrives and finds the safety net already broken.

[Diagram: Fault 1 (redundant channel B fails silently) → LATENT, unnoticed: the system still appears safe, but protection is secretly gone → Fault 2 (channel A reads false-low) → HAZARD. LFM measures how well diagnostics reveal Fault 1 before Fault 2 ever arrives, closing the window.]
Why latent faults get their own metric.  With redundancy, no single fault causes harm — so the danger hides in an undetected first fault that quietly disables the backup. The Latent-Fault Metric (LFM) rewards diagnostics that test the safety mechanisms themselves, catching Fault 1 while it’s still harmless.
Putting FIT in perspective

A failure rate of 10 FIT (the ASIL D PMHF target) means ten dangerous failures per 10⁹ hours of operation, or one per 10⁸ hours: roughly one failure every 11,000+ years for a single unit. The metric isn’t a promise any one part lasts that long; it’s a statistical budget across the whole fleet, which is exactly why it has to be computed, not estimated by feel.
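
The arithmetic is worth checking once by hand. The sketch below converts a FIT rate into a mean time between failures for one unit and into an expected count across a fleet; the fleet size and operating hours are invented purely to show the scale.

```python
HOURS_PER_YEAR = 8760  # 365 days, leap years ignored


def fit_to_mean_years(fit: float) -> float:
    """Mean time between failures for one unit, in years of continuous operation."""
    hours_per_failure = 1e9 / fit  # 1 FIT = 1 failure per 1e9 hours
    return hours_per_failure / HOURS_PER_YEAR


def expected_fleet_failures(fit: float, vehicles: int, hours_each: float) -> float:
    """Expected number of failures over the fleet's total operating hours."""
    return fit * 1e-9 * vehicles * hours_each


years = fit_to_mean_years(10)
print(f"10 FIT is one failure per {1e9 / 10:.0e} hours, about {years:,.0f} years per unit")

# Hypothetical fleet: one million vehicles, 8,000 operating hours each.
fleet = expected_fleet_failures(10, vehicles=1_000_000, hours_each=8_000)
print(f"expected failures across the fleet: {fleet:.0f}")
```

A rate that looks negligible for one unit still yields tens of expected events over a large fleet, which is the point of treating it as a fleet-wide budget.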

8

ISO 26262-2 / -4

Verification, Validation & the Safety Case

Analysis shows the concept should be safe; this final phase shows it is — and packages the proof. Verification confirms each step was done right, validation confirms the whole item meets its safety goals in the vehicle, and the Safety Case ties the argument together. Independent confirmation measures sign it off.

Verification vs. validation

Verification asks “did we build the thing right?” at every level — requirement reviews, design analyses, and tests that each specified requirement is met (this is the right arm of the V). Validation asks “did we build the right thing?” — vehicle-level evidence that the safety goals genuinely hold, including fault-injection to confirm safe-state behaviour and timing within the FTTI.

⛓ Golden thread — SG1 is closed

Our hazard reaches the end: a fault-injection test deliberately forces one cell-voltage path to read false-low, and the bench confirms the cross-check flags the disagreement and the contactors open within the FTTI. That test traces back through TSR 1.1 → FSR 1.1 → SG1 → the overcharge hazard — a complete, evidenced loop. That loop, repeated for every hazard, is the Safety Case.

Traceability — the thread through everything

None of it counts unless it connects. Every hazard must trace forward to a safety goal, a requirement, a mechanism, an implementation, and a test — and every test must trace back to the hazard it ultimately protects against. Traceability is what lets you prove completeness and assess the impact of any change.

Hazard (HARA) → Safety Goal + ASIL → FSR (functional req.) → TSR (technical req.) → Safety Mechanism (redundancy · diag.) → Implementation (HW + SW) → Verification (review · test) → Validation (vehicle level). Every link is recorded and checked in both directions: forward for completeness, backward for impact.
FIG. 10  The traceability chain. A continuous thread from each hazard to the test that closes it is the backbone of the Safety Case — and the first thing an assessor follows.
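Both directions of the check can be sketched over a simple list of trace links. SG1, FSR 1.1 and TSR 1.1 are the identifiers from this guide's running example; every other identifier, and the deliberately incomplete second chain, is made up for illustration. Real projects keep these links in a requirements-management tool.

```python
from collections import defaultdict

# Forward trace links (parent -> child). Only SG1, FSR 1.1 and TSR 1.1 come
# from the guide; the remaining identifiers are hypothetical.
LINKS = [
    ("HAZ-overcharge", "SG1"),
    ("SG1", "FSR 1.1"),
    ("FSR 1.1", "TSR 1.1"),
    ("TSR 1.1", "SM-crosscheck"),
    ("SM-crosscheck", "IMPL-voltage-monitor"),
    ("IMPL-voltage-monitor", "TEST-fault-injection"),
    ("HAZ-overtemp", "SG2"),  # deliberately incomplete chain
]

forward = defaultdict(set)
backward = defaultdict(set)
for parent, child in LINKS:
    forward[parent].add(child)
    backward[child].add(parent)


def reachable(start, graph):
    """All items reachable from start by following the graph's edges."""
    seen, stack = set(), [start]
    while stack:
        for nxt in graph.get(stack.pop(), ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def untested_hazards():
    """Forward check: hazards whose chain never reaches a test."""
    hazards = [n for n in forward if n.startswith("HAZ-")]
    return sorted(
        h for h in hazards
        if not any(n.startswith("TEST-") for n in reachable(h, forward))
    )


def impacted_hazards(changed_item):
    """Backward check: hazards that a change to this item could affect."""
    return sorted(n for n in reachable(changed_item, backward) if n.startswith("HAZ-"))


print("Hazards with no closing test:", untested_hazards())
print("Hazards impacted by changing TSR 1.1:", impacted_hazards("TSR 1.1"))
```

The forward walk answers the completeness question and the backward walk answers the impact question, which are the two uses of traceability named above.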

The Safety Case and confirmation measures

The Safety Case is the structured argument, backed by all of the above as evidence, that the BMS is acceptably safe. It is judged by confirmation measures whose required independence rises with ASIL — the higher the risk, the further removed the reviewer must be from the people who did the work.

Confirmation measures and independence by ASIL (indicative)
| Measure | ASIL A | ASIL B | ASIL C | ASIL D |
| --- | --- | --- | --- | --- |
| Confirmation review (of work products) | recommended | independent person | independent dept. | independent org. |
| Functional safety audit (process implementation) | optional | recommended | required | required |
| Functional safety assessment (achieved functional safety) | optional | reduced scope | required | required |

Indicative summary — the precise independence levels and applicability are defined in ISO 26262-2. The principle is constant: more risk demands more independence.
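For planning purposes the table can be held as a small lookup, for example to list what a given ASIL calls for when drafting a safety plan. The values below are transcribed from this guide's indicative table, not from the standard itself, and the function name is an illustrative choice.

```python
# Indicative values transcribed from the table in this guide; ISO 26262-2 is
# the authoritative source for independence levels and applicability.
ASILS = ("A", "B", "C", "D")

CONFIRMATION_MEASURES = {
    "Confirmation review": (
        "recommended", "independent person", "independent dept.", "independent org.",
    ),
    "Functional safety audit": ("optional", "recommended", "required", "required"),
    "Functional safety assessment": ("optional", "reduced scope", "required", "required"),
}


def measures_for(asil):
    """Return {measure: indicative expectation} for one ASIL."""
    if asil not in ASILS:
        raise ValueError(f"unknown ASIL {asil!r}; expected one of {ASILS}")
    column = ASILS.index(asil)
    return {name: levels[column] for name, levels in CONFIRMATION_MEASURES.items()}


for name, level in measures_for("C").items():
    print(f"{name}: {level}")
```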

§

Reference

Glossary of terms & acronyms

Functional safety runs on acronyms. Here they are in one place, in plain language, roughly in the order they appear across the lifecycle.

Acronyms used in this guide
| Term | Stands for | In plain words |
| --- | --- | --- |
| ASIL | Automotive Safety Integrity Level | The risk class (QM, A–D) that sets how much rigour a requirement needs. |
| QM | Quality Management | “No ASIL required”: normal engineering quality is enough. |
| HARA | Hazard Analysis & Risk Assessment | The analysis that finds hazards and assigns each an ASIL. |
| S / E / C | Severity / Exposure / Controllability | The three axes that together determine the ASIL. |
| SG | Safety Goal | A top-level safety requirement, one per hazardous event. |
| FSC / FSR | Functional Safety Concept / Requirement | How a safety goal is met, in functional terms (not yet hardware). |
| TSC / TSR | Technical Safety Concept / Requirement | The same, refined onto specific hardware and software. |
| HSI | Hardware–Software Interface | The spec that pins down what each diagnostic signal means and its timing. |
| FTTI | Fault Tolerant Time Interval | Time from a fault to the hazard: the budget to detect and react. |
| FDTI | Fault Detection Time Interval | The portion of the FTTI spent detecting the fault. |
| FRTI | Fault Reaction Time Interval | The portion spent reaching the safe state after detection. |
| FMEA | Failure Mode & Effects Analysis | Bottom-up: trace each component failure to its effect. |
| FMEDA | …& Diagnostic Analysis | FMEA plus failure rates and diagnostic coverage → the hardware metrics. |
| FTA | Fault Tree Analysis | Top-down: from the hazard down to the combinations of causes. |
| SPFM | Single-Point Fault Metric | How well the design covers single-point and residual faults. |
| LFM | Latent-Fault Metric | How well diagnostics reveal hidden (latent) faults. |
| PMHF | Probabilistic Metric for random HW Failures | The residual rate of dangerous failure, in FIT. |
| FIT | Failures In Time | One failure per 10⁹ operating hours. |
| DC | Diagnostic Coverage | The fraction of a part’s dangerous failures a mechanism detects. |
| DFA | Dependent-Failure Analysis | Checks that “independent” elements really are, with no common cause. |
| DIA | Development Interface Agreement | Who-does-what safety contract between customer and supplier. |
| AFE | Analog Front End | The cell-monitoring ICs that measure cell voltage and temperature. |
| HVIL | High-Voltage Interlock Loop | A loop that detects an open HV connector and triggers disconnect. |
| SoC / SoH | State of Charge / State of Health | How full the battery is / how degraded it is. |
| VCU | Vehicle Control Unit | The supervisory controller the BMS reports limits to. |

Putting it together

The whole chain, in one breath

Define the BMS precisely. Find its hazards and rate them to set ASILs and safety goals. Refine those goals into functional and then technical requirements, realised by safety mechanisms. Prove the mechanisms work with FMEA, FMEDA, and FTA. Verify each level, validate the whole in the vehicle, and assemble the Safety Case — with independent sign-off scaled to the risk. Traceability holds every link together.

Experience summary

Performed Functional Safety Analysis for a Battery Management System (BMS) in compliance with ISO 26262 — covering Item Definition, HARA, Safety Goals, FSC / FSRs, TSC / TSRs, and safety analyses including FMEA, FMEDA, and FTA.

Identified hazards related to overvoltage, undervoltage, overcurrent, overtemperature, and contactor control, and defined safety mechanisms including redundant sensing, diagnostic monitoring, plausibility checks, and watchdog supervision.

Supported ASIL compliance, safety verification, validation, and requirements traceability across the safety lifecycle.

ISO 26262 · Functional Safety · EV Battery Management System
Worked examples and S/E/C, FTTI and metric figures are illustrative for teaching — production values come from project-specific analysis.