The tick broadcast framework and dyntick

This is the third part of the chapter which describes timers and time management related stuff in the Linux kernel. We stopped at the clocksource framework in the previous part. We started to consider this framework because it is closely related to the special counters which are provided by the Linux kernel. One of these counters, which we already saw in the first part of this chapter, is jiffies. As I already wrote in the first part of this chapter, we will consider time management related stuff step by step during the Linux kernel initialization. The previous step was the call of the:

register_refined_jiffies(CLOCK_TICK_RATE);

function, which is defined in the kernel/time/jiffies.c source code file and initializes the refined_jiffies clock source for us. Recall that this function is called from the setup_arch function, which is defined in the arch/x86/kernel/setup.c source code file and executes architecture-specific (x86_64 in our case) initialization. Look at the implementation of setup_arch and you will note that the call of register_refined_jiffies is the last step before the setup_arch function finishes its work.

Many x86_64-specific things are already configured by the time setup_arch finishes. For example, some early interrupt handlers are already able to handle interrupts, memory space is reserved for the initrd, DMI is scanned, the Linux kernel log buffer is already set up (which means that the printk function is able to work), e820 is parsed so the Linux kernel already knows about available memory, and many other architecture-specific things are done (if you are interested, you can read more about the setup_arch function and the Linux kernel initialization process in the second chapter of this book).

Now setup_arch has finished its work and we can go back to the generic Linux kernel code. Recall that the setup_arch function was called from the start_kernel function, which is defined in the init/main.c source code file. So, we shall return to this function. You can see that many different functions are called right after setup_arch inside start_kernel, but since our chapter is devoted to timers and time management related stuff, we will skip all code which is not related to this topic. The first function which is related to time management in the Linux kernel is:

tick_init();

in start_kernel. The tick_init function is defined in the kernel/time/tick-common.c source code file and does two things:

  • Initialization of tick broadcast framework related data structures;

  • Initialization of full tickless mode related data structures.

We haven't seen anything related to the tick broadcast framework in this book yet, and we don't know anything about tickless mode in the Linux kernel. So the main point of this part is to look at these concepts and to learn what they are.

The idle process

First of all, let's look at the implementation of the tick_init function. As I already wrote, this function is defined in the kernel/time/tick-common.c source code file and consists of calls to the following two functions:

void __init tick_init(void)
{
	tick_broadcast_init();
	tick_nohz_init();
}

The cpu_idle_loop function is an infinite loop which checks the need for rescheduling on each iteration. After the scheduler finds something to execute, the idle process finishes its work and control moves to a new runnable task with the call of the schedule_preempt_disabled function:

static void cpu_idle_loop(void)
{
	while (1) {
		while (!need_resched()) {
			...
			...
			...
			/* the main idle function */
			cpuidle_idle_call();
		}
		...
		...
		...
		schedule_preempt_disabled();
	}
}

By default, the CONFIG_HZ_PERIODIC kernel configuration option is enabled in the Linux kernel, and it tells the kernel to handle each interrupt of the system timer. Delivering every tick to a processor that has nothing to do is wasteful, so the Linux kernel provides two additional ways of managing scheduling-clock interrupts:

The first is to omit scheduling-clock ticks on idle processors. To enable this behaviour, we need to enable the CONFIG_NO_HZ_IDLE kernel configuration option. This option allows the Linux kernel to avoid sending timer interrupts to idle processors. In this case periodic timer interrupts are replaced with on-demand interrupts. This mode is called dyntick-idle mode. But if the kernel does not handle interrupts of a system timer, how can the kernel decide if the system has nothing to do?

The second way is to omit scheduling-clock ticks on processors that are either idle or have only one runnable task, in other words busy processors. We can enable this feature with the CONFIG_NO_HZ_FULL kernel configuration option, and it reduces the number of timer interrupts significantly.
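To get a feeling for how many interrupts dyntick-idle mode can save, here is a small user-space sketch. It is a toy model and not kernel code: the toy_ names are made up for this illustration, and the model deliberately ignores the on-demand interrupts that still wake an idle processor, as well as the CONFIG_NO_HZ_FULL case.

```c
#include <assert.h>

/* Toy model, not kernel code: how many scheduling-clock ticks one CPU
 * receives over an interval that is split into busy and idle time. */
enum toy_tick_mode {
	TOY_HZ_PERIODIC,	/* the tick fires all the time */
	TOY_NO_HZ_IDLE,		/* the tick is stopped while the CPU is idle */
};

/* hz: ticks per second; busy_ms and idle_ms: time spent in each state. */
static unsigned long toy_count_ticks(enum toy_tick_mode mode,
				     unsigned long hz,
				     unsigned long busy_ms,
				     unsigned long idle_ms)
{
	unsigned long busy_ticks = hz * busy_ms / 1000;
	unsigned long idle_ticks = hz * idle_ms / 1000;

	assert(hz > 0);

	/* In dyntick-idle mode the idle time generates no periodic ticks. */
	if (mode == TOY_NO_HZ_IDLE)
		return busy_ticks;

	return busy_ticks + idle_ticks;
}
```

For example, with hz set to 250, one second of work and four seconds of idling, the periodic model counts 1250 ticks while the dyntick-idle model counts only 250.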

Besides running cpu_idle_loop, an idle processor can be in a sleeping state. The Linux kernel provides a special cpuidle framework for this. The main point of this framework is to put an idle processor into sleeping states. The set of these states is called C-states. But how will a processor be woken up if its local timer is disabled? The Linux kernel provides the tick broadcast framework for this. The main point of this framework is to assign a timer which is not affected by the C-states. This timer will wake a sleeping processor.

Now, after some theory, we can return to the implementation of our function. Let's recall that the tick_init function just calls the following two functions:

void __init tick_init(void)
{
	tick_broadcast_init();
	tick_nohz_init();
}

void __init tick_broadcast_init(void)
{
        zalloc_cpumask_var(&tick_broadcast_mask, GFP_NOWAIT);
        zalloc_cpumask_var(&tick_broadcast_on, GFP_NOWAIT);
        zalloc_cpumask_var(&tmpmask, GFP_NOWAIT);
#ifdef CONFIG_TICK_ONESHOT
        zalloc_cpumask_var(&tick_broadcast_oneshot_mask, GFP_NOWAIT);
        zalloc_cpumask_var(&tick_broadcast_pending_mask, GFP_NOWAIT);
        zalloc_cpumask_var(&tick_broadcast_force_mask, GFP_NOWAIT);
#endif
}

bool zalloc_cpumask_var(cpumask_var_t *mask, gfp_t flags)
{
        return alloc_cpumask_var(mask, flags | __GFP_ZERO);
}

Ultimately, the memory for the given cpumask is allocated with the given flags by the kmalloc_node function:

*mask = kmalloc_node(cpumask_size(), flags, node);

Now let's look at the cpumasks that are initialized in the tick_broadcast_init function. As we can see, the tick_broadcast_init function initializes six cpumasks, and the initialization of the last three depends on the CONFIG_TICK_ONESHOT kernel configuration option.

The first three cpumasks are:

  • tick_broadcast_mask - the bitmap which represents the list of processors that are in a sleeping mode;

  • tick_broadcast_on - the bitmap that stores the numbers of processors which are in a periodic broadcast state;

  • tmpmask - a bitmap for temporary usage.
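Each of these cpumasks is a bitmap with one bit per processor. The sketch below shows the idea in user space with a single 64-bit word. The toy_cpumask_ names are hypothetical stand-ins chosen for this example; the real kernel cpumask API works on bitmaps of arbitrary size and is not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical user-space stand-in for a cpumask:
 * one bit per CPU, at most 64 CPUs. */
typedef uint64_t toy_cpumask_t;

/* Mark the given CPU as present in the mask. */
static void toy_cpumask_set_cpu(unsigned int cpu, toy_cpumask_t *mask)
{
	assert(cpu < 64);
	*mask |= (uint64_t)1 << cpu;
}

/* Remove the given CPU from the mask. */
static void toy_cpumask_clear_cpu(unsigned int cpu, toy_cpumask_t *mask)
{
	assert(cpu < 64);
	*mask &= ~((uint64_t)1 << cpu);
}

/* Return true if the given CPU is present in the mask. */
static bool toy_cpumask_test_cpu(unsigned int cpu, const toy_cpumask_t *mask)
{
	assert(cpu < 64);
	return (*mask >> cpu) & 1;
}

/* Return true if no CPU is present in the mask. */
static bool toy_cpumask_empty(const toy_cpumask_t *mask)
{
	return *mask == 0;
}
```

In this model, a zero-initialized toy_cpumask_t plays the role of a mask that was just allocated with zalloc_cpumask_var: no processor is set in it yet.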

As we already know, the next three cpumasks depend on the CONFIG_TICK_ONESHOT kernel configuration option. Actually, each clock event device can be in one of two modes:

  • periodic - clock event devices that support periodic events;

  • oneshot - clock event devices that are capable of issuing events that happen only once.

#define CLOCK_EVT_FEAT_PERIODIC        0x000001
#define CLOCK_EVT_FEAT_ONESHOT         0x000002
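These two values are single bits, so a device can advertise both features at once and code can test for either of them with a bitwise AND. A minimal sketch follows; toy_has_features is a hypothetical helper written for this example and is not part of the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Values copied from the two defines above. */
#define CLOCK_EVT_FEAT_PERIODIC        0x000001
#define CLOCK_EVT_FEAT_ONESHOT         0x000002

/* Hypothetical helper: true if every bit of 'wanted' is set in 'features'. */
static bool toy_has_features(unsigned int features, unsigned int wanted)
{
	assert(wanted != 0);
	return (features & wanted) == wanted;
}
```

The same pattern appears in tick_install_broadcast_device, which tests dev->features against CLOCK_EVT_FEAT_ONESHOT.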

So, the last three cpumasks are:

  • tick_broadcast_oneshot_mask - stores the numbers of processors that must be notified;

  • tick_broadcast_pending_mask - stores the numbers of processors that have a pending broadcast;

  • tick_broadcast_force_mask - stores the numbers of processors with enforced broadcast.

We have initialized six cpumasks in the tick broadcast framework, and now we can proceed to the implementation of this framework.

The tick broadcast framework

struct tick_device {
        struct clock_event_device *evtdev;
        enum tick_device_mode mode;
};

The second field of the tick_device structure represents the mode of the tick_device. As we already know, the mode can be one of the following:

enum tick_device_mode {
        TICKDEV_MODE_PERIODIC,
        TICKDEV_MODE_ONESHOT,
};

tick_install_broadcast_device(newdev);

The tick_install_broadcast_device function checks whether the given clock event device can be a broadcast device and installs it if so. Let's look at the implementation of the tick_install_broadcast_device function:

void tick_install_broadcast_device(struct clock_event_device *dev)
{
	struct clock_event_device *cur = tick_broadcast_device.evtdev;

	if (!tick_check_broadcast_device(cur, dev))
		return;

	if (!try_module_get(dev->owner))
		return;

	clockevents_exchange_device(cur, dev);

	if (cur)
		cur->event_handler = clockevents_handle_noop;

	tick_broadcast_device.evtdev = dev;

	if (!cpumask_empty(tick_broadcast_mask))
		tick_broadcast_start_periodic(dev);

	if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
		tick_clock_notify();
}

static struct tick_device tick_broadcast_device;
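The first check in tick_install_broadcast_device decides whether the new device should replace the current broadcast device. The sketch below shows one way such a decision could look: reject unsuitable devices by feature flag, then prefer the device with the higher rating. It is a simplified assumption and not the kernel's tick_check_broadcast_device; the toy_ names and the TOY_FEAT_DUMMY value are invented for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_FEAT_DUMMY 0x1	/* hypothetical flag value, for this sketch only */

/* Hypothetical, stripped-down description of a candidate device. */
struct toy_bc_candidate {
	unsigned int features;
	int rating;
};

/* Return true if 'newdev' should become the broadcast device.
 * 'cur' may be NULL when no broadcast device is installed yet. */
static bool toy_check_broadcast_device(const struct toy_bc_candidate *cur,
				       const struct toy_bc_candidate *newdev)
{
	assert(newdev != NULL);

	/* A dummy device can never broadcast. */
	if (newdev->features & TOY_FEAT_DUMMY)
		return false;

	/* Otherwise take it if nothing is installed or it is rated higher. */
	return cur == NULL || newdev->rating > cur->rating;
}
```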

In the last step of the tick_install_broadcast_device function we check that the tick_broadcast_mask is not empty and, if so, start the given clock events device in periodic mode with the call of the tick_broadcast_start_periodic function:

if (!cpumask_empty(tick_broadcast_mask))
	tick_broadcast_start_periodic(dev);

if (dev->features & CLOCK_EVT_FEAT_ONESHOT)
	tick_clock_notify();

The tick_broadcast_mask is filled in the tick_device_uses_broadcast function, which checks a clock events device during its registration:

int cpu = smp_processor_id();

int tick_device_uses_broadcast(struct clock_event_device *dev, int cpu)
{
	...
	...
	...
	if (!tick_device_is_functional(dev)) {
		...
		cpumask_set_cpu(cpu, tick_broadcast_mask);
		...
	}
	...
	...
	...
}

The tick_broadcast_start_periodic function checks the given clock event device and calls the tick_setup_periodic function:

static void tick_broadcast_start_periodic(struct clock_event_device *bc)
{
	if (bc)
		tick_setup_periodic(bc, 1);
}

tick_set_periodic_handler(dev, broadcast);

This function checks the second parameter, which represents the broadcast state (on or off), and sets the event handler depending on its value:

void tick_set_periodic_handler(struct clock_event_device *dev, int broadcast)
{
	if (!broadcast)
		dev->event_handler = tick_handle_periodic;
	else
		dev->event_handler = tick_handle_periodic_broadcast;
}

static irqreturn_t hpet_interrupt_handler(int irq, void *data)
{
	struct hpet_dev *dev = (struct hpet_dev *)data;
	struct clock_event_device *hevt = &dev->evt;

	if (!hevt->event_handler) {
		printk(KERN_INFO "Spurious HPET timer interrupt on HPET timer %d\n",
				dev->num);
		return IRQ_HANDLED;
	}

	hevt->event_handler(hevt);
	return IRQ_HANDLED;
}
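The two snippets above fit together through a function pointer: one function installs the handler, and the interrupt handler later calls whatever is installed. Here is a self-contained user-space sketch of that pattern. All toy_ names are hypothetical, and the structure is a stripped-down stand-in, not the kernel's clock_event_device.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, stripped-down stand-in for struct clock_event_device. */
struct toy_clock_event_device {
	void (*event_handler)(struct toy_clock_event_device *dev);
	int periodic_calls;	/* calls of the per-CPU periodic handler */
	int broadcast_calls;	/* calls of the broadcast handler */
};

static void toy_handle_periodic(struct toy_clock_event_device *dev)
{
	dev->periodic_calls++;
}

static void toy_handle_periodic_broadcast(struct toy_clock_event_device *dev)
{
	dev->broadcast_calls++;
}

/* Same shape as tick_set_periodic_handler shown above. */
static void toy_set_periodic_handler(struct toy_clock_event_device *dev,
				     int broadcast)
{
	if (!broadcast)
		dev->event_handler = toy_handle_periodic;
	else
		dev->event_handler = toy_handle_periodic_broadcast;
}

/* Same dispatch idea as hpet_interrupt_handler: returns 0 for a spurious
 * interrupt (no handler installed) and 1 if the handler ran. */
static int toy_timer_interrupt(struct toy_clock_event_device *dev)
{
	assert(dev != NULL);

	if (!dev->event_handler)
		return 0;

	dev->event_handler(dev);
	return 1;
}
```

The interrupt path never needs to know whether the device is a broadcast device. That decision was made once, when the handler was installed.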

The tick_handle_periodic_broadcast function calls the

bc_local = tick_do_periodic_broadcast();

function, which stores the numbers of the processors that have asked to be woken up in the temporary cpumask and calls the tick_do_broadcast function:

cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask);
return tick_do_broadcast(tmpmask);

if (bc_local)
	td->evtdev->event_handler(td->evtdev);

This event handler is actually the interrupt handler of the local timer of a processor. After this the processor will wake up. That is all about the tick broadcast framework in the Linux kernel. We have skipped some aspects of this framework, for example the reprogramming of a clock event device and broadcast with a oneshot timer. But the Linux kernel is very big, and it is not realistic to cover all aspects of it. I think it will be interesting to dive into it yourself.
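The core of the periodic broadcast is a single mask intersection: only processors that are both online and present in the broadcast mask get signalled. The following user-space sketch models that step with plain 64-bit words. The toy_ names are hypothetical, and the model leaves out the special handling of the current processor that the bc_local value is used for.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_NR_CPUS 8

/* Hypothetical model: bit N of each mask stands for CPU N. */
static int toy_wakeups[TOY_NR_CPUS];	/* how often each CPU was woken */

/* Stand-in for the device's broadcast callback:
 * "sends an interrupt" to every CPU in the mask. */
static void toy_broadcast(uint64_t mask)
{
	assert((mask >> TOY_NR_CPUS) == 0);

	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
		if (mask & ((uint64_t)1 << cpu))
			toy_wakeups[cpu]++;
}

/* Same idea as cpumask_and(tmpmask, cpu_online_mask, tick_broadcast_mask):
 * returns the mask of CPUs that were actually signalled. */
static uint64_t toy_do_periodic_broadcast(uint64_t online_mask,
					  uint64_t broadcast_mask)
{
	uint64_t tmpmask = online_mask & broadcast_mask;

	toy_broadcast(tmpmask);
	return tmpmask;
}
```

With processors 0 to 3 online and processors 1, 2 and 5 in the broadcast mask, only processors 1 and 2 are woken: processor 5 asked for a broadcast but is not online.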

If you remember, we started this part with the call of the tick_init function. So far we have considered only the tick_broadcast_init function and the related theory, but the tick_init function contains another function call: tick_nohz_init. Let's look at the implementation of this function.

Initialization of dyntick related data structures

if (!tick_nohz_full_running) {
	if (tick_nohz_init_all() < 0)
		return;
}

If this mode is not running, we call the tick_nohz_init_all function, which is defined in the same source code file, and check its result. The tick_nohz_init_all function tries to allocate the tick_nohz_full_mask with a call of alloc_cpumask_var. The tick_nohz_full_mask will store the numbers of the processors that have full NO_HZ enabled. After successful allocation, we set all bits in the tick_nohz_full_mask, set tick_nohz_full_running to true, and return the result to the tick_nohz_init function:

static int tick_nohz_init_all(void)
{
        int err = -1;
#ifdef CONFIG_NO_HZ_FULL_ALL
        if (!alloc_cpumask_var(&tick_nohz_full_mask, GFP_KERNEL)) {
                WARN(1, "NO_HZ: Can't allocate full dynticks cpumask\n");
                return err;
        }
        err = 0;
        cpumask_setall(tick_nohz_full_mask);
        tick_nohz_full_running = true;
#endif
        return err;
}

In the next step we try to allocate a memory space for the housekeeping_mask:

if (!alloc_cpumask_var(&housekeeping_mask, GFP_KERNEL)) {
	WARN(1, "NO_HZ: Can't allocate not-full dynticks cpumask\n");
	cpumask_clear(tick_nohz_full_mask);
	tick_nohz_full_running = false;
	return;
}

static inline bool arch_irq_work_has_interrupt(void)
{
    return cpu_has_apic;
}

If a processor does not have an APIC, the Linux kernel prints a warning message, clears the tick_nohz_full_mask cpumask, copies the numbers of all possible processors in the system to the housekeeping_mask, and resets the value of the tick_nohz_full_running variable:

if (!arch_irq_work_has_interrupt()) {
	pr_warning("NO_HZ: Can't run full dynticks because arch doesn't "
		   "support irq work self-IPIs\n");
	cpumask_clear(tick_nohz_full_mask);
	cpumask_copy(housekeeping_mask, cpu_possible_mask);
	tick_nohz_full_running = false;
	return;
}

After this step, we get the number of the current processor by calling smp_processor_id and check this processor in the tick_nohz_full_mask. If the tick_nohz_full_mask contains the current processor, we clear the appropriate bit in the tick_nohz_full_mask:

cpu = smp_processor_id();

if (cpumask_test_cpu(cpu, tick_nohz_full_mask)) {
	pr_warning("NO_HZ: Clearing %d from nohz_full range for timekeeping\n", cpu);
	cpumask_clear_cpu(cpu, tick_nohz_full_mask);
}

We do this because this processor will be used for timekeeping. After this step we put the numbers of all processors that are in the cpu_possible_mask and not in the tick_nohz_full_mask into the housekeeping_mask:

cpumask_andnot(housekeeping_mask,
	       cpu_possible_mask, tick_nohz_full_mask);

After this operation, the housekeeping_mask contains every possible processor that is not in the tick_nohz_full_mask, including the processor used for timekeeping. In the last step of the tick_nohz_init function, we go through all processors that are set in the tick_nohz_full_mask and call the following function for each of them:

for_each_cpu(cpu, tick_nohz_full_mask)
	context_tracking_cpu_set(cpu);
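The mask arithmetic of the last few steps can be reproduced in user space with two bitwise operations. The sketch below is an illustration under simplifying assumptions (at most 64 processors, masks stored in plain integers); toy_nohz_setup and struct toy_nohz_masks are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: the two masks after the steps above. */
struct toy_nohz_masks {
	uint64_t nohz_full;	/* CPUs that may run without the tick */
	uint64_t housekeeping;	/* CPUs that keep the tick */
};

/* possible:  all possible CPUs
 * requested: CPUs that were asked to run in full dynticks mode
 * boot_cpu:  the CPU running the init code, kept for timekeeping */
static struct toy_nohz_masks toy_nohz_setup(uint64_t possible,
					    uint64_t requested,
					    unsigned int boot_cpu)
{
	struct toy_nohz_masks m;

	assert(boot_cpu < 64);

	/* Like cpumask_clear_cpu(cpu, tick_nohz_full_mask). */
	m.nohz_full = requested & ~((uint64_t)1 << boot_cpu);

	/* Like cpumask_andnot(housekeeping_mask, cpu_possible_mask,
	 * tick_nohz_full_mask). */
	m.housekeeping = possible & ~m.nohz_full;

	return m;
}
```

For four possible processors that were all requested (the CONFIG_NO_HZ_FULL_ALL case) and processor 0 running the init code, the full dynticks mask ends up as processors 1 to 3 and the housekeeping mask as processor 0 alone.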

That's all. This is the end of the tick_nohz_init function. After this, the NO_HZ related data structures are initialized. We didn't see the API of the NO_HZ mode, but we will see it soon.

Conclusion

This is the end of the third part of the chapter that describes timers and time management related stuff in the Linux kernel. In the previous part we got acquainted with the clocksource concept in the Linux kernel, which represents a framework for managing different clock sources in an interrupt and hardware characteristics independent way. In this part we continued to look at the Linux kernel initialization process in a time management context and got acquainted with two concepts that are new for us: the tick broadcast framework and tick-less mode. The first concept helps the Linux kernel to deal with processors which are in deep sleep, and the second represents the mode in which the kernel may work to improve power management of idle processors.

In the next part we will continue to dive into timer management related things in the Linux kernel and will see a new concept for us: timers.

As you can understand from the paragraph's title, we are interested only in the tick_broadcast_init function for now. This function is defined in the kernel/time/tick-broadcast.c source code file and initializes the tick broadcast framework related data structures. Before we look at the implementation of the tick_broadcast_init function and try to understand what it does, we need to know about the tick broadcast framework.

The main point of a central processor is to execute programs. But sometimes a processor may be in a special state when it is not being used by any program. This special state is called idle. When the processor has nothing to execute, the Linux kernel launches the idle task. We already saw a little about this in the last part of the Linux kernel initialization process chapter. When the Linux kernel finishes all initialization in the start_kernel function from the init/main.c source code file, it calls the rest_init function from the same source code file. The main point of this function is to launch the kernel init thread and the kthreadd thread, to call the schedule function to start task scheduling, and to go to sleep by calling the cpu_idle_loop function, which is defined in the kernel/sched/idle.c source code file.

Of course, we will not consider the full implementation of the cpu_idle_loop function and the details of the idle state in this part, because it is not related to our topic. But there is one interesting moment for us. We know that the processor can execute only one task at a time. How does the Linux kernel decide to reschedule and stop the idle process if the processor executes an infinite loop in cpu_idle_loop? The answer is system timer interrupts. When an interrupt occurs, the processor stops the idle thread and transfers control to an interrupt handler. After the system timer interrupt is handled, need_resched will return true and the Linux kernel will stop the idle process and transfer control to the current runnable task. But handling system timer interrupts is not good for power management, because if a processor is in the idle state, there is little point in sending it a system timer interrupt.

Whenever the idle task is selected to run, the periodic tick is disabled with the call of the tick_nohz_idle_enter function, which is defined in the kernel/time/tick-sched.c source code file, and enabled again with the call of the tick_nohz_idle_exit function. There is a special concept in the Linux kernel called clock event devices, which are used to schedule the next interrupt. This concept provides an API for devices which can deliver interrupts at a specific time in the future and is represented by the clock_event_device structure in the Linux kernel. We will not dive into the implementation of the clock_event_device structure now. We will see it in the next part of this chapter. But there is one interesting moment for us right now.

Let's consider the first function. The tick_broadcast_init function is defined in the kernel/time/tick-broadcast.c source code file and initializes the tick broadcast framework related data structures.

As we can see, the tick_broadcast_init function allocates different cpumasks with the help of the zalloc_cpumask_var function. The zalloc_cpumask_var function is defined in the lib/cpumask.c source code file and expands to a call of alloc_cpumask_var with the __GFP_ZERO flag added.

The Linux kernel defines two masks for such clock events devices in the include/linux/clockchips.h header file: CLOCK_EVT_FEAT_PERIODIC and CLOCK_EVT_FEAT_ONESHOT.

Hardware may provide some clock source devices. When a processor sleeps and its local timer is stopped, there must be an additional clock source device that handles the awakening of the processor. The Linux kernel uses these special clock source devices, which can raise an interrupt at a specified time. We already know that such timers are called clock events devices in the Linux kernel. Besides clock events devices, each processor in the system has its own local timer, which is programmed to issue an interrupt at the time of the next deferred task. These timers can also be programmed to do a periodic job, like updating jiffies. These timers are represented by the tick_device structure in the Linux kernel. This structure is defined in the kernel/time/tick-sched.h header file.

Note that the tick_device structure contains two fields. The first field, evtdev, is a pointer to the clock_event_device structure, which is defined in the include/linux/clockchips.h header file and represents the descriptor of a clock event device. A clock event device allows us to register an event that will happen in the future. As I already wrote, we will not consider the clock_event_device structure and the related API in this part, but will see it in the next part.

Each clock events device in the system registers itself by calling the clockevents_register_device function or the clockevents_config_and_register function during the initialization process of the Linux kernel. During the registration of a new clock events device, the Linux kernel calls the tick_check_new_device function, which is defined in the kernel/time/tick-common.c source code file and checks whether the given clock events device should be used by the Linux kernel. After all checks, the tick_check_new_device function calls the tick_install_broadcast_device function.

First of all we get the current clock event device from the tick_broadcast_device. The tick_broadcast_device is defined in the kernel/time/tick-broadcast.c source code file.

The tick_broadcast_device represents an external clock device that keeps track of events for a processor. The first step after we get the current clock device is the call of the tick_check_broadcast_device function, which checks whether the given clock events device can be used as a broadcast device. The main point of the tick_check_broadcast_device function is to check the value of the features field of the given clock events device. As we can understand from its name, the features field contains the features of a clock event device. The available values are defined in the include/linux/clockchips.h header file; one of them is CLOCK_EVT_FEAT_PERIODIC, which marks a clock events device that supports periodic events. So, the tick_check_broadcast_device function checks the feature flags (for example CLOCK_EVT_FEAT_DUMMY) and returns false if the given clock events device has a feature that makes it unsuitable for broadcasting. Otherwise the function compares the ratings of the given clock event device and the current clock event device and prefers the one with the higher rating.

After the tick_check_broadcast_device function, we can see the call of the try_module_get function, which takes a reference on the module that owns the clock events device. We need to do this to be sure that the given clock events device was correctly initialized. The next step is the call of the clockevents_exchange_device function, which is defined in the kernel/time/clockevents.c source code file and releases the old clock events device and replaces the previous functional handler with a dummy handler.

You can read more about the smp_processor_id macro in the fourth part of the Linux kernel initialization process chapter.

The tick_setup_periodic function is defined in the kernel/time/tick-common.c source code file and sets the broadcast handler for the given clock event device by calling the tick_set_periodic_handler function.

When a clock event device issues an interrupt, dev->event_handler will be called. For example, consider hpet_interrupt_handler, the interrupt handler of the high precision event timer, which is located in the arch/x86/kernel/hpet.c source code file.

The hpet_interrupt_handler gets the irq specific data and checks the event handler of the clock event device. Recall that we just set it in the tick_set_periodic_handler function. So the tick_handle_periodic_broadcast function will be called at the end of the high precision event timer interrupt handler.

The tick_do_broadcast function calls the broadcast function of the given clock events device, which sends an IPI to the set of processors. In the end we can call the event handler of the given tick_device.

We already saw some information about the dyntick concept in this part, and we know that this concept allows the kernel to disable system timer interrupts in the idle state. The tick_nohz_init function initializes the different data structures which are related to this concept. This function is defined in the kernel/time/tick-sched.c source code file and starts by checking the value of the tick_nohz_full_running variable, which represents the state of the tick-less mode for the idle state and for the state when system timer interrupts are disabled while a processor has only one runnable task.

This cpumask will store the numbers of the processors used for housekeeping; in other words, we need at least one processor that will not be in NO_HZ mode, because it will do timekeeping and so on. After this we check the result of the architecture-specific arch_irq_work_has_interrupt function. This function checks the ability to send inter-processor interrupts on the given architecture. We need to check this because the system timer of a processor will be disabled in NO_HZ mode, so there must be at least one online processor which can send an inter-processor interrupt to wake up an offline processor. This function is defined in the arch/x86/include/asm/irq_work.h header file for x86_64 and just checks that a processor has an APIC, based on CPUID.

The context_tracking_cpu_set function is defined in the kernel/context_tracking.c source code file, and the main point of this function is to set the context_tracking.active percpu variable to true. When the active field is set to true for a certain processor, all context switches will be ignored by the Linux kernel context tracking subsystem for this processor.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email, or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you find any mistakes, please send me a PR to linux-insides.

Links

  • x86_64
  • initrd
  • interrupt
  • DMI
  • printk
  • CPU idle
  • power management
  • NO_HZ documentation
  • cpumasks
  • high precision event timer
  • IRQ
  • IPI
  • CPUID
  • APIC
  • percpu
  • context switches
  • Previous part