Understanding Hardware Error Handling in Linux: MCA Explained

Introduction to MCA

All modern x86-64 CPUs have a subsystem called Machine Check Architecture (MCA). It “provides a mechanism for detecting and reporting hardware (machine) errors, such as: system bus errors, ECC errors, parity errors, cache errors, and TLB errors” – that is the description Intel gives in SDM vol. 3. It consists of one set of model-specific registers (MSRs) that lets the system user configure how the mechanism works, and a different set of registers that records abnormal events. Such errors can be broadly put into two categories: corrected and uncorrected errors. The former are signalled to the operating system via the corrected machine-check error interrupt (CMCI), whilst the latter are signalled via the machine-check exception (#MC).
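
To make the register layout a bit more concrete, here is a minimal user-space sketch (my illustration, not part of the kernel's MCE code) that reads the number of reporting banks from IA32_MCG_CAP and dumps each bank's IA32_MCi_STATUS through the /dev/cpu/N/msr interface; it assumes the msr module is loaded and root privileges:

```c
/* mca_dump.c - illustrative only: dump the MCA bank status MSRs of CPU 0.
 * Requires the msr module (modprobe msr) and root privileges. */
#include <fcntl.h>
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define IA32_MCG_CAP        0x179             /* bits 7:0 = number of banks */
#define IA32_MCi_STATUS(i)  (0x401 + 4 * (i)) /* per-bank status register */

static uint64_t rdmsr(int fd, uint32_t reg)
{
	uint64_t val = 0;

	/* The msr device returns the MSR whose address equals the file offset. */
	if (pread(fd, &val, sizeof(val), reg) != sizeof(val))
		perror("pread");
	return val;
}

int main(void)
{
	int fd = open("/dev/cpu/0/msr", O_RDONLY);

	if (fd < 0) {
		perror("open /dev/cpu/0/msr");
		return 1;
	}

	unsigned int banks = rdmsr(fd, IA32_MCG_CAP) & 0xff;

	for (unsigned int i = 0; i < banks; i++) {
		uint64_t status = rdmsr(fd, IA32_MCi_STATUS(i));

		/* Bit 63 (VAL) tells us whether the bank holds a valid record. */
		printf("bank %2u: STATUS=0x%016" PRIx64 "%s\n", i, status,
		       (status >> 63) & 1 ? " (valid error logged)" : "");
	}

	close(fd);
	return 0;
}
```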

Understanding Uncorrected errors

The #MC is an abort-class exception, meaning that once it is encountered there is no way to reliably restart the executing program (in this case the kernel), so the only way to rectify it is to reboot the machine. Of course, before this is done the handler can collect relevant information from the various MSRs, which can serve as a starting point for properly debugging the issue. Handling of #MC by the Linux kernel is rather straightforward – once the exception is generated its handler is executed, which eventually leads to calling the exc_machine_check_kernel function that deals with interrogating the registers and generating the log entries.
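
The “abort-class” nature shows up in the architectural IA32_MCG_STATUS register the handler consults: if the RIPV bit is clear, the saved instruction pointer is not a safe place to resume execution. The fragment below is my simplified illustration of that check, not the kernel's actual handler logic:

```c
#include <stdbool.h>
#include <stdint.h>

/* Flags in IA32_MCG_STATUS (read by the #MC handler). */
#define MCG_STATUS_RIPV  (1ULL << 0)  /* restart IP on the stack is valid */
#define MCG_STATUS_EIPV  (1ULL << 1)  /* saved IP points at the faulting instruction */
#define MCG_STATUS_MCIP  (1ULL << 2)  /* a machine check is in progress */

/* Simplified version of the question the handler has to answer: is there a
 * valid point to return to at all?  The real code in
 * arch/x86/kernel/cpu/mce/core.c grades severity using far more state. */
static bool mce_restart_possible(uint64_t mcg_status)
{
	/* With RIPV clear the interrupted context cannot be resumed, which is
	 * what makes #MC an abort: the only safe action left is to panic. */
	return (mcg_status & MCG_STATUS_RIPV) != 0;
}
```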

Managing Corrected errors

From the point of view of a kernel engineer the handling of the CMCI is a lot more involved, hence this article focuses on it. The way CMCI is designed to work is that the user programs a threshold value; once that threshold is crossed, a CMCI interrupt is delivered to the CPU, and that interrupt is in turn handled by the Linux kernel. On Intel-based systems CMCI is delivered via the Advanced Programmable Interrupt Controller (APIC); in particular, Intel's MCA initialization code (in kernel land it is actually referred to as MCE) explicitly programs the APIC's LVT CMCI register to deliver the threshold interrupt. So once a user-specified (or rather, kernel-specified) number of corrected errors have occurred, the CMCI signals to the kernel that something abnormal has happened, and the kernel in turn queries the various registers, similarly to how #MC is handled.
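
To make “programming a threshold” concrete at the register level: each bank has an IA32_MCi_CTL2 MSR whose low 15 bits hold the corrected-error count threshold and whose bit 30 enables CMCI for that bank. The helper below is my sketch of building such a value; the actual programming is done by the kernel's Intel-specific MCE code, not by code like this:

```c
#include <stdint.h>

#define IA32_MCi_CTL2(i)              (0x280 + (i))  /* one CTL2 MSR per bank */
#define MCI_CTL2_CMCI_EN              (1ULL << 30)   /* enable CMCI for this bank */
#define MCI_CTL2_CMCI_THRESHOLD_MASK  0x7fffULL      /* threshold, bits 14:0 */

/* Build a CTL2 value that asks the bank to raise a CMCI once 'threshold'
 * corrected errors have been counted.  Writing it back would be done with
 * wrmsr in the kernel, or via /dev/cpu/N/msr from user space. */
static uint64_t mci_ctl2_with_threshold(uint64_t old, uint16_t threshold)
{
	old &= ~MCI_CTL2_CMCI_THRESHOLD_MASK;
	old |= threshold & MCI_CTL2_CMCI_THRESHOLD_MASK;
	old |= MCI_CTL2_CMCI_EN;
	return old;
}
```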

Handling of the CMCI

Note: For the sake of brevity I'll only describe the process of handling a reported CMCI. The actual setup that allows each CPU to receive CMCIs is a topic in and of itself and deserves a post of its own, but that will be left for some other time.

When a corrected machine-check interrupt (CMCI) is signalled, the handling process starts by calling the mce_threshold_vector function. This function is processor-specific; on Intel CPUs it points to intel_threshold_interrupt, which in turn calls machine_check_poll. This core kernel function reads the CPU registers to gather information about the error and stores it in a standardized mce structure. That structure is then added to an internal list (mce_event_llist) by the mce_log function. The list is necessary because CMCIs can originate from multiple sources, including the ACPI Platform Error Interface (APEI) mechanism, and this standardized approach allows the kernel to handle errors from different sources in a consistent manner. Additionally, processing the error might require interacting with hardware components to gain extra context, which can be time-consuming and should be done outside of interrupt-handling context.
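
Conceptually the polling step amounts to: walk every bank, skip the ones whose VAL bit is clear, snapshot the relevant MSRs into a record, hand the record off, and clear the bank. The following is a heavily condensed, standalone model of that loop; the struct fake_bank / struct simple_mce types and the helper names are mine, not the kernel's:

```c
#include <stdint.h>
#include <stdio.h>

#define MCI_STATUS_VAL    (1ULL << 63)  /* bank holds a valid error record */
#define MCI_STATUS_ADDRV  (1ULL << 58)  /* the ADDR register is populated */
#define MCI_STATUS_MISCV  (1ULL << 59)  /* the MISC register is populated */

struct fake_bank { uint64_t status, addr, misc; };   /* stand-in for the per-bank MSRs */
struct simple_mce { int bank; uint64_t status, addr, misc; };

static void log_record(const struct simple_mce *m)
{
	/* The kernel would append this record to mce_event_llist via mce_log(). */
	printf("bank %d status=%#llx addr=%#llx misc=%#llx\n",
	       m->bank, (unsigned long long)m->status,
	       (unsigned long long)m->addr, (unsigned long long)m->misc);
}

static void poll_banks(struct fake_bank *banks, int nr_banks)
{
	for (int i = 0; i < nr_banks; i++) {
		uint64_t status = banks[i].status;

		if (!(status & MCI_STATUS_VAL))
			continue;                   /* nothing recorded in this bank */

		struct simple_mce m = { .bank = i, .status = status };
		if (status & MCI_STATUS_ADDRV)
			m.addr = banks[i].addr;     /* faulting address, if recorded */
		if (status & MCI_STATUS_MISCV)
			m.misc = banks[i].misc;     /* extra model-specific context */

		log_record(&m);
		banks[i].status = 0;                /* clearing the bank rearms it */
	}
}

int main(void)
{
	struct fake_bank banks[3] = {
		{ 0, 0, 0 },
		{ MCI_STATUS_VAL | MCI_STATUS_ADDRV, 0xdeadbeef000, 0 },
		{ 0, 0, 0 },
	};
	poll_banks(banks, 3);
	return 0;
}
```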

At this point we have the relevant error information enqueued but not really processed. Since the enqueueing is done from interrupt context, no heavy handling can be done there. Because of this, the interrupt routine simply schedules the execution of the mce_gen_pool_process function in the context of system_wq.
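
The pattern at work here is a standard kernel one: do the bare minimum in interrupt context and defer everything heavy to a workqueue. Below is a minimal, self-contained module that illustrates the same deferral pattern; the names are illustrative (the MCE code wires its own work item to mce_gen_pool_process):

```c
// deferral_demo.c: minimal illustration of deferring heavy work out of
// interrupt context onto the shared system workqueue.
#include <linux/module.h>
#include <linux/workqueue.h>

static void heavy_processing(struct work_struct *work)
{
	/* Runs later in process context on system_wq: sleeping, MSR decoding,
	 * talking to firmware, etc. are all allowed here. */
	pr_info("deferral_demo: processing queued error records\n");
}

static DECLARE_WORK(demo_work, heavy_processing);

/* This is what an interrupt handler would do: note that there is work to be
 * done and return as quickly as possible. */
static void fake_interrupt_handler(void)
{
	schedule_work(&demo_work);   /* queue onto system_wq */
}

static int __init deferral_demo_init(void)
{
	fake_interrupt_handler();
	return 0;
}

static void __exit deferral_demo_exit(void)
{
	flush_work(&demo_work);      /* make sure the work item has finished */
}

module_init(deferral_demo_init);
module_exit(deferral_demo_exit);
MODULE_LICENSE("GPL");
```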

Processing enqueued errors in the Linux kernel

Processing the enqueued errors boils down to calling all the notifiers registered on the x86_mce_decoder_chain. In other words, there is a list of functions which register their interest in processing CMCs by adding their respective notifier block via a call to mce_register_decode_chain (a minimal sketch of such a notifier follows the list below). Currently there is a set of three default notifiers:

  • early_nb – responsible for producing trace events for a CMC as well as notifying userspace via /dev/mcelog
  • mce_uc_nb – handles only Action Optional and Action Deferred severities. It is mainly used to prevent speculative access to memory pages that generate errors by marking them NOT PRESENT.
  • mce_default_nb – the fallback notifier, used if neither of the previous two nor an additional one (such as a model-specific EDAC driver) handles the error.
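
As promised above, here is a minimal sketch of what registering on the decode chain looks like, in the same way an EDAC driver does it. mce_register_decode_chain, struct mce and MCE_PRIO_EDAC are the real kernel interfaces; the module itself and its printout are purely illustrative:

```c
// mce_notifier_demo.c: a minimal sketch of hooking into the
// x86_mce_decoder_chain, similar to what an EDAC driver does.
#include <linux/module.h>
#include <linux/notifier.h>
#include <asm/mce.h>

static int demo_mce_notify(struct notifier_block *nb, unsigned long val,
			   void *data)
{
	struct mce *m = data;

	if (!m)
		return NOTIFY_DONE;

	pr_info("demo: CPU %u bank %u status %#llx\n",
		m->extcpu, m->bank, m->status);

	/* Returning NOTIFY_STOP here would prevent later notifiers
	 * (e.g. mce_default_nb) from seeing this record. */
	return NOTIFY_DONE;
}

static struct notifier_block demo_nb = {
	.notifier_call = demo_mce_notify,
	.priority      = MCE_PRIO_EDAC,   /* priorities order the chain */
};

static int __init demo_init(void)
{
	mce_register_decode_chain(&demo_nb);
	return 0;
}

static void __exit demo_exit(void)
{
	mce_unregister_decode_chain(&demo_nb);
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");
```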

Vendor-Specific Error Handling: Understanding EDAC Drivers

Generally, notifiers are called sequentially until one of them returns the special value NOTIFY_STOP, indicating that no further processing is required, or until all notifiers have run.

Everything explained so far has essentially lived in the “generic” parts of the kernel, i.e. the parts that contain no vendor-specific code. The notifier chain is the place where device-specific code (think code for a specific CPU model) can live.

In the case of Intel CPUs the EDAC driver in use will likely be sb_edac, but that is not mandatory; depending on the CPU, a different driver may be loaded. For the sake of this article let's assume it is the sb_edac driver. The workhorse of sb_edac is the sbridge_mce_check_error function. It ensures that the error is indeed a memory error (since CPUs can generate non-memory errors as well) and then prints detailed information about the error.

The resulting log provides plenty of data about the event: which bank (think hardware component) generated the error, the type of access, the value of the address that triggered the error, and the raw values of the MISC/STATUS MSRs, which aid in decoding the other fields.

The first five lines of that output can be considered “standard” and are produced by sbridge_mce_check_error itself. The last three lines, however, are the result of a fairly long-winded parsing of various values from the MSRs – a chain of calls inside sb_edac that eventually hands the decoded error to the generic EDAC reporting code.
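
As a taste of what that parsing involves, the architectural part of an IA32_MCi_STATUS value can be decoded with plain bit masks; the model-specific part (which sb_edac uses to pinpoint the channel, DIMM and so on) additionally needs the per-CPU documentation. A standalone sketch of decoding the architectural fields:

```c
#include <stdint.h>
#include <stdio.h>

/* Architectural fields of IA32_MCi_STATUS (Intel SDM vol. 3). */
#define MCI_STATUS_VAL   (1ULL << 63)  /* the record is valid */
#define MCI_STATUS_OVER  (1ULL << 62)  /* an earlier error was overwritten */
#define MCI_STATUS_UC    (1ULL << 61)  /* uncorrected error */
#define MCI_STATUS_EN    (1ULL << 60)  /* error reporting was enabled */
#define MCI_STATUS_MISCV (1ULL << 59)  /* MISC register holds extra info */
#define MCI_STATUS_ADDRV (1ULL << 58)  /* ADDR register holds the address */
#define MCI_STATUS_PCC   (1ULL << 57)  /* processor context corrupted */
#define MCACOD(status)   ((uint16_t)((status) & 0xffff))          /* architectural error code */
#define MSCOD(status)    ((uint16_t)(((status) >> 16) & 0xffff))  /* model-specific error code */

static void decode_status(uint64_t status)
{
	if (!(status & MCI_STATUS_VAL)) {
		puts("no valid error recorded");
		return;
	}
	printf("mcacod=%#06x mscod=%#06x %s%s%s%s\n",
	       MCACOD(status), MSCOD(status),
	       status & MCI_STATUS_UC    ? "uncorrected " : "corrected ",
	       status & MCI_STATUS_PCC   ? "context-corrupt " : "",
	       status & MCI_STATUS_ADDRV ? "addr-valid " : "",
	       status & MCI_STATUS_OVER  ? "overflow " : "");
}

int main(void)
{
	/* Example: a corrected memory read error with a valid address. */
	decode_status(MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | 0x009f);
	return 0;
}
```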

Conclusion

All things considered, it turns out that handling a memory error actually involves three components:

  1. The generic MCE code in the kernel, which produces the generic struct mce holding a snapshot of the registers at the time the error was reported.
  2. The logged error is eventually produced by the function registered by the device-specific code.
  3. The device-specific EDAC code eventually calls into the generic EDAC code – the edac_-prefixed functions – to actually output a more detailed error report.

As with almost anything in the kernel, even a piece of technology that is conceptually simple (after all, MCA merely reports errors and provides a limited amount of information spread across two or three 64-bit words, i.e. MSRs) ends up requiring considerably more complex handling in software, because that handling has to account for a lot of variables and situations.
