Handling Non-Maskable Interrupts


Last updated 1 year ago

Non-maskable interrupt handler

This is the sixth part of the Interrupts and Interrupt Handling in the Linux kernel chapter. In the previous part we saw the implementation of some exception handlers: for the General Protection Fault exception, the divide exception, invalid opcode exceptions, etc. As I wrote in the previous part, in this part we will see the implementations of the remaining exceptions. We will look at the following handlers:

  • Non-Maskable interrupt;

  • BOUND Range Exceeded Exception;

  • Coprocessor exception;

  • SIMD coprocessor exception.

So, let's start.

Non-Maskable interrupt handling

A Non-Maskable interrupt is a hardware interrupt that cannot be ignored by standard masking techniques. In general, a non-maskable interrupt can be generated in either of two ways:

  • External hardware asserts the non-maskable interrupt pin on the CPU.

  • The processor receives a message on the system bus or the APIC serial bus with a delivery mode NMI.

When the processor receives an NMI from one of these sources, it handles it immediately by calling the NMI handler pointed to by interrupt vector number 2 (see the table in the first part). We already filled the Interrupt Descriptor Table with the vector number, the address of the nmi interrupt handler and the NMI_STACK Interrupt Stack Table entry:

set_intr_gate_ist(X86_TRAP_NMI, &nmi, NMI_STACK);

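in the trap_init function, which is defined in the arch/x86/kernel/traps.c source code file. In the previous parts we saw that the entry points of all interrupt handlers are defined with the:

.macro idtentry sym do_sym has_error_code:req paranoid=0 shift_ist=-1
ENTRY(\sym)
...
...
...
END(\sym)
.endm

macro from the arch/x86/entry/entry_64.S assembly source code file. But the handler of the Non-Maskable interrupt is not defined with this macro. It has its own entry point:

ENTRY(nmi)
...
...
...
END(nmi)

in the same arch/x86/entry/entry_64.S assembly file. Let's dive into it and try to understand how the Non-Maskable interrupt handler works. The nmi handler starts with a call of the:

PARAVIRT_ADJUST_EXCEPTION_FRAME

macro, but we will not dive into details about it in this part, because this macro is related to Paravirtualization, which we will see in another chapter. After this we save the content of the rdx register on the stack:

pushq	%rdx

and check whether cs was the kernel segment when the non-maskable interrupt occurred:

cmpl	$__KERNEL_CS, 16(%rsp)
jne	first_nmi

The __KERNEL_CS macro is defined in arch/x86/include/asm/segment.h and represents the second descriptor in the Global Descriptor Table:
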
#define GDT_ENTRY_KERNEL_CS	2
#define __KERNEL_CS	(GDT_ENTRY_KERNEL_CS*8)
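
The selector arithmetic in these two definitions can be checked with a small user-space sketch. The helper names below (selector_index, selector_ti, selector_rpl) and the KERNEL_CS_SELECTOR macro are hypothetical and exist only for illustration; the bit layout (index in bits 15..3, table indicator in bit 2, requested privilege level in bits 1..0) is the standard x86 segment selector format.

```c
#include <assert.h>
#include <stdint.h>

/* Value copied from the definitions quoted above. */
#define GDT_ENTRY_KERNEL_CS 2
/* Mirrors __KERNEL_CS; renamed here to stay out of the reserved namespace. */
#define KERNEL_CS_SELECTOR (GDT_ENTRY_KERNEL_CS * 8)

/* Hypothetical helpers, not kernel functions. */

/* Bits 15..3: index of the descriptor in the table. */
static unsigned selector_index(uint16_t sel)
{
	return sel >> 3;
}

/* Bit 2: table indicator (0 = GDT, 1 = LDT). */
static unsigned selector_ti(uint16_t sel)
{
	return (sel >> 2) & 1;
}

/* Bits 1..0: requested privilege level. */
static unsigned selector_rpl(uint16_t sel)
{
	return sel & 3;
}
```

Multiplying the index by 8 shifts it into bits 15..3 and leaves the table indicator and privilege level at zero, which is what a ring 0 selector into the GDT looks like.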
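
You can read more about the GDT in the second part of the Linux kernel booting process chapter. If cs is not the kernel segment, it means that this is not a nested NMI and we jump to the first_nmi label. Let's consider this case. First of all, in the first_nmi label we restore rdx from the top of the stack, where we have just saved it, and push 1 onto the stack:

first_nmi:
	movq	(%rsp), %rdx
	pushq	$1

Why do we push 1 on the stack? As the comment says: We allow breakpoints in NMIs. On x86_64, like on other architectures, the CPU will not execute another NMI until the first NMI is completed. An NMI finishes with the iret instruction, as other interrupts and exceptions do. But the NMI handler may trigger a page fault, a breakpoint or another exception, and those use the iret instruction too. If this happens while in NMI context, the iret used to return from the exception re-enables NMIs, so the CPU leaves NMI context and a new NMI may come in: we get nested non-maskable interrupts. The problem is that the NMI handler does not return to the state it was in when the exception was triggered; instead it returns to a state that allows new NMIs to preempt the running NMI handler. If another NMI comes in before the first NMI handler is complete, the new NMI will write all over the preempted NMI's stack, because the next NMI uses the top of the stack of the previous NMI. It means that we cannot simply execute it, because a nested non-maskable interrupt would corrupt the stack of the previous one.

That's why we have allocated space on the stack for a temporary variable. We check this variable to see whether a previous NMI is executing, and clear it if this is not a nested NMI. We push 1 here to denote that a non-maskable interrupt is currently executing. Remember that when an NMI or another exception occurs we have the following stack frame:

+------------------------+
|         SS             |
|         RSP            |
|        RFLAGS          |
|         CS             |
|         RIP            |
+------------------------+

and also an error code if the exception has one. So, after all of these manipulations our stack frame will look like this: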

+------------------------+
|         SS             |
|         RSP            |
|        RFLAGS          |
|         CS             |
|         RIP            |
|         RDX            |
|          1             |
+------------------------+

In the next step we allocate yet another 40 bytes on the stack:

subq	$(5*8), %rsp

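and push a copy of the original stack frame after the allocated space:

.rept 5
pushq	11*8(%rsp)
.endr

with the .rept assembly directive. We need copies of the original stack frame; in fact we need two of them: the saved stack frame and the copied stack frame. Here we push the original stack frame into the saved stack frame, which is located after the 40 bytes just allocated for the copied stack frame. The saved stack frame is used to fix up the copied stack frame, which a nested NMI may change. The second one, the copied stack frame, can be modified by nested NMIs to let the first NMI know that a second NMI was triggered and that the first NMI handler should be repeated. OK, we have made the first copy of the original stack frame; now it is time to make the second copy:

addq	$(10*8), %rsp

.rept 5
pushq	-6*8(%rsp)
.endr
subq	$(5*8), %rsp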

After all of these manipulations our stack frame will be like this:

+-------------------------+
| original SS             |
| original Return RSP     |
| original RFLAGS         |
| original CS             |
| original RIP            |
+-------------------------+
| temp storage for rdx    |
+-------------------------+
| NMI executing variable  |
+-------------------------+
| copied SS               |
| copied Return RSP       |
| copied RFLAGS           |
| copied CS               |
| copied RIP              |
+-------------------------+
| Saved SS                |
| Saved Return RSP        |
| Saved RFLAGS            |
| Saved CS                |
| Saved RIP               |
+-------------------------+
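
The layout above can be double-checked by replaying the instruction sequence on a toy stack. The following is only a sketch under stated assumptions: toy_stack, push, push_from and build_nmi_frames are hypothetical names, the register values are made-up markers, and the model relies on the x86 rule that a pushq with an rsp-relative source computes the source address before rsp is decremented.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { STACK_WORDS = 32 };	/* size of the toy stack in 8-byte words */

struct toy_stack {
	uint64_t mem[STACK_WORDS];
	int sp;			/* index of the word that rsp points to */
};

/* pushq value */
static void push(struct toy_stack *s, uint64_t value)
{
	s->mem[--s->sp] = value;
}

/* pushq k*8(%rsp): the source is read before rsp is decremented */
static void push_from(struct toy_stack *s, int k)
{
	uint64_t value = s->mem[s->sp + k];

	push(s, value);
}

/* Replays the sequence from the text; returns the index of the original RIP. */
static int build_nmi_frames(struct toy_stack *s)
{
	int i, orig;

	memset(s, 0, sizeof(*s));
	s->sp = STACK_WORDS;

	/* Frame pushed by the CPU (marker values, not real register contents). */
	push(s, 0x2b);		/* SS */
	push(s, 0x7ffd0000);	/* RSP */
	push(s, 0x246);		/* RFLAGS */
	push(s, 0x33);		/* CS */
	push(s, 0x400000);	/* RIP */
	orig = s->sp;

	push(s, 0xdddd);	/* pushq %rdx: temp storage for rdx */
	push(s, 1);		/* pushq $1: NMI executing variable */
	s->sp -= 5;		/* subq $(5*8), %rsp: room for the copied frame */
	for (i = 0; i < 5; i++)
		push_from(s, 11);	/* fills the saved frame */

	s->sp += 10;		/* addq $(10*8), %rsp */
	for (i = 0; i < 5; i++)
		push_from(s, -6);	/* fills the copied frame */
	s->sp -= 5;		/* subq $(5*8), %rsp */

	return orig;
}
```

After the replay, both five-word frames below the NMI executing variable hold the same values as the original frame, in the same order, and the stack pointer ends up at the lowest word of the saved frame, 12 words below the original RIP.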

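After this we push a dummy error code on the stack, as we already did in the previous exception handlers, and allocate space for the general purpose registers on the stack:

pushq	$-1
ALLOC_PT_GPREGS_ON_STACK

We already saw the implementation of the ALLOC_PT_GPREGS_ON_STACK macro in the third part of the interrupts chapter. This macro is defined in arch/x86/entry/calling.h and allocates another 120 bytes on the stack for the general purpose registers, from rdi to r15:

.macro ALLOC_PT_GPREGS_ON_STACK addskip=0
addq	$-(15*8+\addskip), %rsp
.endm

After the space for the general purpose registers is allocated, we can see the call of paranoid_entry:

call	paranoid_entry

We may remember this label from the previous parts. It saves the general purpose registers on the stack, reads the MSR_GS_BASE Model Specific register and checks its value. If the value of MSR_GS_BASE is negative, we came from kernel mode and just return from paranoid_entry; otherwise we came from user mode and need to execute the swapgs instruction, which swaps the user gs with the kernel gs:
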
ENTRY(paranoid_entry)
	cld
	SAVE_C_REGS 8
	SAVE_EXTRA_REGS 8
	movl	$1, %ebx
	movl	$MSR_GS_BASE, %ecx
	rdmsr
	testl	%edx, %edx
	js	1f
	SWAPGS
	xorl	%ebx, %ebx
1:	ret
END(paranoid_entry)
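
The sign test on MSR_GS_BASE can be sketched in C. This is a simplified model, not kernel code: paranoid_entry_ebx is a hypothetical helper that takes the 64-bit GS base as a parameter instead of reading the MSR, and returns the value that the assembly above would leave in ebx.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model of the check in paranoid_entry.
 * rdmsr returns the high 32 bits of the MSR in edx; "testl %edx, %edx; js 1f"
 * branches when the top bit of edx is set, i.e. when the GS base looks like
 * a kernel address.
 */
static int paranoid_entry_ebx(uint64_t gs_base)
{
	uint32_t edx = (uint32_t)(gs_base >> 32);

	if (edx & 0x80000000u)
		return 1;	/* kernel gs already loaded: no swapgs, ebx stays 1 */

	return 0;		/* user gs: swapgs is executed, ebx is zeroed */
}
```

The addresses used to exercise it are arbitrary examples: a value with the top bit set stands for a kernel GS base, a value with the top bit clear stands for a user one.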
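
Note that after the swapgs instruction we zero the ebx register. Later we will check the content of this register: if we executed swapgs, ebx must contain 0, and 1 otherwise. In the next step we store the value of the cr2 control register in the r12 register, because the NMI handler can cause a page fault and corrupt the value of this control register:

movq	%cr2, %r12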

Now it is time to call the actual NMI handler. We put the address of the pt_regs into rdi and the error code into rsi, and call the do_nmi handler:

movq	%rsp, %rdi
movq	$-1, %rsi
call	do_nmi

We will come back to do_nmi a little later in this part, but now let's look at what happens after do_nmi finishes its execution. After the do_nmi handler has finished we check the cr2 register, because we could have got a page fault while do_nmi was running. If cr2 has changed, we restore the original value; otherwise we jump to the label 1. After this we test the content of the ebx register (remember that it must contain 0 if we have used the swapgs instruction and 1 if we didn't use it). If it contains 0 we execute SWAPGS_UNSAFE_STACK; if it contains 1 we jump to the nmi_restore label. The SWAPGS_UNSAFE_STACK macro just expands to the swapgs instruction. In the nmi_restore label we restore the general purpose registers, release the space allocated on the stack for these registers, clear our temporary variable and exit from the interrupt handler with the INTERRUPT_RETURN macro, which is defined in arch/x86/include/asm/irqflags.h and just expands to the iret instruction:

	movq	%cr2, %rcx
	cmpq	%rcx, %r12
	je	1f
	movq	%r12, %cr2
1:
	testl	%ebx, %ebx
	jnz	nmi_restore
nmi_swapgs:
	SWAPGS_UNSAFE_STACK
nmi_restore:
	RESTORE_EXTRA_REGS
	RESTORE_C_REGS
	/* Pop the extra iret frame at once */
	REMOVE_PT_GPREGS_FROM_STACK 6*8
	/* Clear the NMI executing stack variable */
	movq	$0, 5*8(%rsp)
	INTERRUPT_RETURN
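
The two decisions on this exit path (restore cr2 only if it changed, and execute swapgs only if ebx is 0) can be modelled as follows. This is a sketch with hypothetical names: toy_cpu stands in for the processor state, and swapgs_count merely counts how many times SWAPGS_UNSAFE_STACK would run.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the processor state touched on the exit path. */
struct toy_cpu {
	uint64_t cr2;
	int swapgs_count;	/* number of swapgs executions */
};

/* saved_cr2 plays the role of r12, ebx the flag left by paranoid_entry. */
static void nmi_exit_path(struct toy_cpu *cpu, uint64_t saved_cr2, int ebx)
{
	/* movq %cr2, %rcx; cmpq %rcx, %r12; je 1f; movq %r12, %cr2 */
	if (cpu->cr2 != saved_cr2)
		cpu->cr2 = saved_cr2;

	/* testl %ebx, %ebx; jnz nmi_restore; otherwise SWAPGS_UNSAFE_STACK */
	if (ebx == 0)
		cpu->swapgs_count++;
}
```

Skipping the write when cr2 is unchanged mirrors the assembly, which only executes the move to the control register when the comparison fails.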

Now let's consider the case when another NMI occurs before the previous NMI has finished its execution. You may remember from the beginning of this part that we check whether we came from userspace and jump to first_nmi in this case:

cmpl	$__KERNEL_CS, 16(%rsp)
jne	first_nmi

Note that in this case it is always the first NMI, because if the first NMI caught a page fault, a breakpoint or another exception, that exception is handled in kernel mode. If we didn't come from userspace, first of all we test our temporary variable:

cmpl	$1, -8(%rsp)
je	nested_nmi

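and if it is set to 1 we jump to the nested_nmi label. If it is not 1, we test the IST stack. In the case of nested NMIs we check whether we are above repeat_nmi. In this case we ignore it; otherwise we check whether we are above end_repeat_nmi and jump to the nested_nmi_out label.

Now let's look at the do_nmi exception handler. This function is defined in the arch/x86/kernel/nmi.c source code file and takes two parameters:

  • address of the pt_regs;

  • error code.

as all exception handlers do. do_nmi starts with a call of the nmi_nesting_preprocess function and ends with a call of nmi_nesting_postprocess. The nmi_nesting_preprocess function checks whether we are on the debug stack (which is unlikely), and if we are, it sets the update_debug_stack per-cpu variable to 1 and calls the debug_stack_set_zero function from arch/x86/kernel/cpu/common.c. This function increases the debug_stack_use_ctr per-cpu variable and loads a new Interrupt Descriptor Table:

static inline void nmi_nesting_preprocess(struct pt_regs *regs)
{
        if (unlikely(is_debug_stack(regs->sp))) {
                debug_stack_set_zero();
                this_cpu_write(update_debug_stack, 1);
        }
}

The nmi_nesting_postprocess function checks the update_debug_stack per-cpu variable which we set in nmi_nesting_preprocess and resets the debug stack, or in other words it loads the original Interrupt Descriptor Table. After the call of nmi_nesting_preprocess, we can see the call of nmi_enter in do_nmi. nmi_enter increases the lockdep_recursion field of the interrupted process, updates the preempt counter and informs the RCU subsystem about the NMI. There is also the nmi_exit function, which does the same things as nmi_enter in reverse. After nmi_enter we increase __nmi_count in the irq_stat structure and call the default_do_nmi function. First of all, in default_do_nmi we compare the address of the previous NMI with the current one and update the address of the last NMI:
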
if (regs->ip == __this_cpu_read(last_nmi_rip))
    b2b = true;
else
    __this_cpu_write(swallow_nmi, false);

__this_cpu_write(last_nmi_rip, regs->ip);
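
The back-to-back detection in this snippet can be modelled outside the kernel. In the sketch below the per-cpu variables are replaced by an ordinary struct, and nmi_state and note_nmi are hypothetical names used only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Plain-struct stand-in for the per-cpu variables used by default_do_nmi. */
struct nmi_state {
	uint64_t last_nmi_rip;
	bool swallow_nmi;
};

/* Returns true when this NMI interrupted the same address as the last one. */
static bool note_nmi(struct nmi_state *st, uint64_t ip)
{
	bool b2b = false;

	if (ip == st->last_nmi_rip)
		b2b = true;
	else
		st->swallow_nmi = false;

	st->last_nmi_rip = ip;
	return b2b;
}
```

Two NMIs that interrupt the same instruction pointer are treated as back-to-back; any other address resets the swallow_nmi flag and becomes the new reference point.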

After this, we first need to handle CPU-specific NMIs:

handled = nmi_handle(NMI_LOCAL, regs, b2b);
__this_cpu_add(nmi_stats.normal, handled);

And then non-specific NMIs, depending on their reason:

reason = x86_platform.get_nmi_reason();
if (reason & NMI_REASON_MASK) {
	if (reason & NMI_REASON_SERR)
		pci_serr_error(reason, regs);
	else if (reason & NMI_REASON_IOCHK)
		io_check_error(reason, regs);

	__this_cpu_add(nmi_stats.external, 1);
	return;
}
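
The reason decoding can be sketched as a pure function. The bit values below are quoted from memory of arch/x86/include/asm/mach_traps.h and should be treated as assumptions; nmi_kind and classify_nmi_reason are hypothetical names, and the real handlers (pci_serr_error, io_check_error) are replaced by a return code.

```c
#include <assert.h>

/* Assumed values; check mach_traps.h in your kernel tree before relying on them. */
#define NMI_REASON_SERR		0x80
#define NMI_REASON_IOCHK	0x40
#define NMI_REASON_MASK		(NMI_REASON_SERR | NMI_REASON_IOCHK)

enum nmi_kind {
	NMI_KIND_UNKNOWN,	/* neither bit set: not handled by this branch */
	NMI_KIND_SERR,		/* would be passed to pci_serr_error */
	NMI_KIND_IOCHK		/* would be passed to io_check_error */
};

/* Same branching order as the snippet above: SERR is tested first. */
static enum nmi_kind classify_nmi_reason(unsigned char reason)
{
	if (reason & NMI_REASON_MASK) {
		if (reason & NMI_REASON_SERR)
			return NMI_KIND_SERR;
		else if (reason & NMI_REASON_IOCHK)
			return NMI_KIND_IOCHK;
	}
	return NMI_KIND_UNKNOWN;
}
```

Because of the else-if, a reason byte with both bits set is reported as SERR only, exactly as in the quoted code.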

That's all.

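Range Exceeded Exception

The next exception is the BOUND range exceeded exception. The BOUND instruction determines whether the first operand (array index) is within the bounds of an array specified by the second operand (bounds operand). If the index is not within bounds, a BOUND range exceeded exception, or #BR, occurs. The handler of the #BR exception is the do_bounds function, defined in arch/x86/kernel/traps.c. The do_bounds handler starts with a call of the exception_enter function and ends with a call of exception_exit:

prev_state = exception_enter();

if (notify_die(DIE_TRAP, "bounds", regs, error_code,
	           X86_TRAP_BR, SIGSEGV) == NOTIFY_STOP)
    goto exit;
...
...
...
exception_exit(prev_state);
return;

After we have got the state of the previous context, we add the exception to the notify_die chain, and if it returns NOTIFY_STOP we return from the exception. You can read more about notify chains and the context tracking functions in the previous part. In the next step we re-enable interrupts, if they were enabled in the interrupted context, with the conditional_sti function, which checks the IF flag and calls local_irq_enable depending on its value:

conditional_sti(regs);

if (!user_mode(regs))
	die("bounds", regs, error_code);

and check whether we came from user mode; if we did not, we send the SIGSEGV signal with the die function. After this we check whether MPX is enabled, and if this feature is disabled we jump to the exit_trap label:

if (!cpu_feature_enabled(X86_FEATURE_MPX)) {
	goto exit_trap;
}

where we execute the do_trap function (you can find more about it in the previous part):

exit_trap:
	do_trap(X86_TRAP_BR, SIGSEGV, "bounds", regs, error_code, NULL);
	exception_exit(prev_state);

If the MPX feature is enabled, we get the BNDSTATUS with the get_xsave_field_ptr function, and if it is zero, it means that MPX was not responsible for this exception: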

bndcsr = get_xsave_field_ptr(XSTATE_BNDCSR);
if (!bndcsr)
		goto exit_trap;

After all of this, there is still only one way when MPX is responsible for this exception. We will not dive into the details about Intel Memory Protection Extensions in this part, but will see it in another chapter.
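
To recap the check that the BOUND instruction itself performs, here is a minimal C sketch. The bounds struct and bound_raises_br are hypothetical names; the model assumes the 32-bit form of the instruction, where the bounds operand holds a signed lower bound followed by a signed upper bound, both inclusive.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of the memory operand of BOUND: lower bound, then upper bound. */
struct bounds {
	int32_t lower;
	int32_t upper;
};

/* Returns 1 when the index is outside [lower, upper], i.e. when #BR would be raised. */
static int bound_raises_br(int32_t index, const struct bounds *b)
{
	return index < b->lower || index > b->upper;
}
```

For an array of ten elements the bounds would be 0 and 9, so indexes 0 and 9 pass while -1 and 10 would raise the exception handled by do_bounds.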

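Coprocessor exception and SIMD exception

The next two exceptions are the x87 FPU Floating-Point Error exception, or #MF, and the SIMD Floating-Point Exception, or #XF. The first exception occurs when the x87 FPU has detected a floating point error, for example a divide by zero, a numeric overflow, etc. The second exception occurs when the processor has detected an SSE/SSE2/SSE3 SIMD floating-point exception. The causes can be the same as for the x87 FPU. The handlers for these exceptions, do_coprocessor_error and do_simd_coprocessor_error, are defined in arch/x86/kernel/traps.c and are very similar to each other. They both call the math_error function from the same source code file but pass different vector numbers. do_coprocessor_error passes the X86_TRAP_MF vector number to math_error: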

dotraplinkage void do_coprocessor_error(struct pt_regs *regs, long error_code)
{
	enum ctx_state prev_state;

	prev_state = exception_enter();
	math_error(regs, error_code, X86_TRAP_MF);
	exception_exit(prev_state);
}

and do_simd_coprocessor_error passes X86_TRAP_XF to the math_error function:

dotraplinkage void
do_simd_coprocessor_error(struct pt_regs *regs, long error_code)
{
	enum ctx_state prev_state;

	prev_state = exception_enter();
	math_error(regs, error_code, X86_TRAP_XF);
	exception_exit(prev_state);
}

First of all, the math_error function gets the current interrupted task, the address of its FPU state and a string that describes the exception, adds the exception to the notify_die chain and returns from the exception handler if notify_die returns NOTIFY_STOP:

	struct task_struct *task = current;
	struct fpu *fpu = &task->thread.fpu;
	siginfo_t info;
	char *str = (trapnr == X86_TRAP_MF) ? "fpu exception" :
						"simd exception";

	if (notify_die(DIE_TRAP, str, regs, error_code, trapnr, SIGFPE) == NOTIFY_STOP)
		return;

After this we check whether we came from kernel mode, and if so we try to fix up the exception with the fixup_exception function. If we cannot, we fill the task with the exception's error code and vector number and die:

if (!user_mode(regs)) {
	if (!fixup_exception(regs)) {
		task->thread.error_code = error_code;
		task->thread.trap_nr = trapnr;
		die(str, regs, error_code);
	}
	return;
}

If we came from user mode, we save the fpu state, fill the task structure with the vector number and error code of the exception, and fill siginfo_t with the signal number, errno, the address where the exception occurred and the signal code:

fpu__save(fpu);

task->thread.trap_nr	= trapnr;
task->thread.error_code = error_code;
info.si_signo		= SIGFPE;
info.si_errno		= 0;
info.si_addr		= (void __user *)uprobe_get_trap_addr(regs);
info.si_code = fpu__exception_code(fpu, trapnr);

After this we check the signal code and if it is zero we return:

if (!info.si_code)
	return;

Otherwise we send the SIGFPE signal at the end:

force_sig_info(SIGFPE, &info, task);
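
The branching of math_error described above can be condensed into a small decision function. This is a sketch only: math_action, math_error_action and math_error_name are hypothetical names, the notify_die step is left out, and the kernel calls are replaced by plain integer flags. The vector numbers 16 and 19 are the architectural numbers of #MF and #XF.

```c
#include <assert.h>
#include <string.h>

#define X86_TRAP_MF 16	/* x87 floating-point error */
#define X86_TRAP_XF 19	/* SIMD floating-point exception */

enum math_action {
	MATH_FIXED_UP,		/* kernel mode, fixup_exception succeeded */
	MATH_DIE,		/* kernel mode, no fixup: die() */
	MATH_IGNORE,		/* user mode, si_code is zero: just return */
	MATH_SEND_SIGFPE	/* user mode, si_code is non-zero: force_sig_info */
};

/* Same string selection as in math_error. */
static const char *math_error_name(int trapnr)
{
	return (trapnr == X86_TRAP_MF) ? "fpu exception" : "simd exception";
}

/* Hypothetical summary of the control flow, with flags instead of kernel calls. */
static enum math_action math_error_action(int from_user, int fixup_ok, int si_code)
{
	if (!from_user)
		return fixup_ok ? MATH_FIXED_UP : MATH_DIE;

	if (!si_code)
		return MATH_IGNORE;

	return MATH_SEND_SIGFPE;
}
```

Only a user-mode exception with a non-zero signal code ends in a SIGFPE; every other path returns earlier.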

That's all.

Conclusion

This is the end of the sixth part of the Interrupts and Interrupt Handling chapter. In this part we saw the implementation of some exception handlers, such as the non-maskable interrupt and the SIMD and x87 FPU floating point exceptions. We have finally finished with the trap_init function and will move on in the next part. Our next topic is external interrupts and the early_irq_init function from init/main.c.

If you have any questions or suggestions, write me a comment or ping me on twitter.

Please note that English is not my first language, and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links

  • General Protection Fault
  • opcode
  • Non-Maskable
  • BOUND instruction
  • CPU socket
  • Interrupt Descriptor Table
  • Interrupt Stack Table
  • Paravirtualization
  • .rept
  • SIMD
  • Coprocessor
  • x86_64
  • iret
  • page fault
  • breakpoint
  • Global Descriptor Table
  • stack frame
  • Model Specific register
  • percpu
  • RCU
  • MPX
  • x87 FPU
  • Previous part