LogoLogo
  • 简介
  • 引导
    • 从引导加载程序内核
    • 在内核安装代码的第一步
    • 视频模式初始化和转换到保护模式
    • 过渡到 64 位模式
    • 内核解压缩
  • 初始化
    • 内核解压之后的首要步骤
    • 早期的中断和异常控制
    • 在到达内核入口之前最后的准备
    • 内核入口 - start_kernel
    • 体系架构初始化
    • 进一步初始化指定体系架构
    • 最后对指定体系架构初始化
    • 调度器初始化
    • RCU 初始化
    • 初始化结束
  • 中断
    • 中断和中断处理第一部分
    • 深入 Linux 内核中的中断
    • 初步中断处理
    • 中断处理
    • 异常处理的实现
    • 处理不可屏蔽中断
    • 深入外部硬件中断
    • IRQs的非早期初始化
    • Softirq, Tasklets and Workqueues
    • 最后一部分
  • 系统调用
    • 系统调用概念简介
    • Linux 内核如何处理系统调用
    • vsyscall and vDSO
    • Linux 内核如何运行程序
    • open 系统调用的实现
    • Linux 资源限制
  • 定时器和时钟管理
    • 简介
    • 时钟源框架简介
    • The tick broadcast framework and dyntick
    • 定时器介绍
    • Clockevents 框架简介
    • x86 相关的时钟源
    • Linux 内核中与时钟相关的系统调用
  • 同步原语
    • 自旋锁简介
    • 队列自旋锁
    • 信号量
    • 互斥锁
    • 读者/写者信号量
    • 顺序锁
    • RCU
    • Lockdep
  • 内存管理
    • 内存块
    • 固定映射地址和 ioremap
    • kmemcheck
  • 控制组
    • 控制组简介
  • SMP
  • 概念
    • 每个 CPU 的变量
    • CPU 掩码
    • initcall 机制
    • Linux 内核的通知链
  • Linux 内核中的数据结构
    • 双向链表
    • 基数树
    • 位数组
  • 理论
    • 分页
    • ELF 文件格式
    • 內联汇编
    • CPUID
    • MSR
  • Initial ram disk
  • 杂项
    • Linux 内核开发
    • 内核编译方法
    • 链接器
    • 用户空间的程序启动过程
    • 书写并提交你第一个内核补丁
  • 内核数据结构
    • 中断描述符表
  • 有帮助的链接
  • 贡献者
由 GitBook 提供支持
在本页
  • Kernel initialization. Part 9.
  • RCU initialization
  • Initialization of the integer ID management
  • RCU initialization
  • Rest of the initialization process
  • Conclusion
  • Links
  1. 初始化

RCU 初始化

上一页调度器初始化下一页初始化结束

最后更新于1年前

Kernel initialization. Part 9.

RCU initialization

This is ninth part of the and in the previous part we stopped at the . In this part we will continue to dive to the Linux kernel initialization process and the main purpose of this part will be to learn about initialization of the . We can see that the next step in the after the sched_init is the call of the preempt_disable. There are two macros:

  • preempt_disable

  • preempt_enable

for preemption disabling and enabling. First of all let's try to understand what is preempt in the context of an operating system kernel. In simple words, preemption is ability of the operating system kernel to preempt current task to run task with higher priority. Here we need to disable preemption because we will have only one init process for the early boot time and we don't need to stop it before we call cpu_idle function. The preempt_disable macro is defined in the and depends on the CONFIG_PREEMPT_COUNT kernel configuration option. This macro is implemented as:

#define preempt_disable() \
do { \
        preempt_count_inc(); \
        barrier(); \
} while (0)

and if CONFIG_PREEMPT_COUNT is not set just:

#define preempt_disable()                       barrier()

Let's look on it. First of all we can see one difference between these macro implementations. The preempt_disable with CONFIG_PREEMPT_COUNT set contains the call of the preempt_count_inc. There is special percpu variable which stores the number of held locks and preempt_disable calls:

DECLARE_PER_CPU(int, __preempt_count);

In the first implementation of the preempt_disable we increment this __preempt_count. There is API for returning value of the __preempt_count, it is the preempt_count function. As we called preempt_disable, first of all we increment preemption counter with the preempt_count_inc macro which expands to the:

#define preempt_count_inc() preempt_count_add(1)
#define preempt_count_add(val)  __preempt_count_add(val)
preempt_disable();
foo();
preempt_enable();

Compiler can rearrange it as:

preempt_disable();
preempt_enable();
foo();

In the next step we can see following statement:

if (WARN(!irqs_disabled(),
	 "Interrupts were enabled *very* early, fixing it\n"))
	local_irq_disable();

That's all. Preemption is disabled and we can go ahead.

Initialization of the integer ID management

Let's look on the implementation of the idr_init_cache function:

void __init idr_init_cache(void)
{
        idr_layer_cache = kmem_cache_create("idr_layer_cache",
                                sizeof(struct idr_layer), 0, SLAB_PANIC, NULL);
}
  • name of the cache;

  • size of the object to store in cache;

  • offset of the first object in the page;

  • flags;

  • constructor for the objects.

static DEFINE_IDR(i2c_adapter_idr);

and then uses it for the declaration of the i2c adapter:

static int __i2c_add_numbered_adapter(struct i2c_adapter *adap)
{
  int     id;
  ...
  ...
  ...
  id = idr_alloc(&i2c_adapter_idr, adap, adap->nr, adap->nr + 1, GFP_KERNEL);
  ...
  ...
  ...
}

and id2_adapter_idr presents dynamically calculated bus number.

RCU initialization

  • CONFIG_TINY_RCU

  • CONFIG_TREE_RCU

void __init rcu_init(void)
{
         int cpu;

         rcu_bootup_announce();
         rcu_init_geometry();
         rcu_init_one(&rcu_bh_state, &rcu_bh_data);
         rcu_init_one(&rcu_sched_state, &rcu_sched_data);
         __rcu_init_preempt();
         open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

         /*
          * We don't need protection against CPU-hotplug here because
          * this is called early in boot, before either interrupts
          * or the scheduler are operational.
          */
         cpu_notifier(rcu_cpu_notify, 0);
         pm_notifier(rcu_pm_notify, 0);
         for_each_online_cpu(cpu)
                 rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);

         rcu_early_boot_tests();
}

In the beginning of the rcu_init function we define cpu variable and call rcu_bootup_announce. The rcu_bootup_announce function is pretty simple:

static void __init rcu_bootup_announce(void)
{
        pr_info("Hierarchical RCU implementation.\n");
        rcu_bootup_announce_oddness();
}

It just prints information about the RCU with the pr_info function and rcu_bootup_announce_oddness which uses pr_info too, for printing different information about the current RCU configuration which depends on different kernel configuration options like CONFIG_RCU_TRACE, CONFIG_PROVE_RCU, CONFIG_RCU_FANOUT_EXACT, etc. In the next step, we can see the call of the rcu_init_geometry function. This function is defined in the same source code file and computes the node tree geometry depends on the amount of CPUs. Actually RCU provides scalability with extremely low internal RCU lock contention. What if a data structure will be read from the different CPUs? RCU API provides the rcu_state structure which presents RCU global state including node hierarchy. Hierarchy is presented by the:

struct rcu_node node[NUM_RCU_NODES];

array of structures. As we can read in the comment of above definition:

The root (first level) of the hierarchy is in ->node[0] (referenced by ->level[0]), the second
level in ->node[1] through ->node[m] (->node[1] referenced by ->level[1]), and the third level
in ->node[m+1] and following (->node[m+1] referenced by ->level[2]).  The number of levels is
determined by the number of CPUs and by CONFIG_RCU_FANOUT.

Small systems will have a "hierarchy" consisting of a single rcu_node.
#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)

where levels values depend on the CONFIG_RCU_FANOUT_LEAF configuration option. For example for the simplest case, one rcu_node will cover two CPU on machine with the eight CPUs:

+-----------------------------------------------------------------+
|  rcu_state                                                      |
|                 +----------------------+                        |
|                 |         root         |                        |
|                 |       rcu_node       |                        |
|                 +----------------------+                        |
|                    |                |                           |
|               +----v-----+       +--v-------+                   |
|               |          |       |          |                   |
|               | rcu_node |       | rcu_node |                   |
|               |          |       |          |                   |
|         +------------------+     +----------------+             |
|         |                  |        |             |             |
|         |                  |        |             |             |
|    +----v-----+    +-------v--+   +-v--------+  +-v--------+    |
|    |          |    |          |   |          |  |          |    |
|    | rcu_node |    | rcu_node |   | rcu_node |  | rcu_node |    |
|    |          |    |          |   |          |  |          |    |
|    +----------+    +----------+   +----------+  +----------+    |
|         |                 |             |               |       |
|         |                 |             |               |       |
|         |                 |             |               |       |
|         |                 |             |               |       |
+---------|-----------------|-------------|---------------|-------+
          |                 |             |               |
+---------v-----------------v-------------v---------------v--------+
|                 |                |               |               |
|     CPU1        |      CPU3      |      CPU5     |     CPU7      |
|                 |                |               |               |
|     CPU2        |      CPU4      |      CPU6     |     CPU8      |
|                 |                |               |               |
+------------------------------------------------------------------+

So, in the rcu_init_geometry function we just need to calculate the total number of rcu_node structures. We start to do it with the calculation of the jiffies till to the first and next fqs which is force-quiescent-state (read above about it):

d = RCU_JIFFIES_TILL_FORCE_QS + nr_cpu_ids / RCU_JIFFIES_FQS_DIV;
if (jiffies_till_first_fqs == ULONG_MAX)
        jiffies_till_first_fqs = d;
if (jiffies_till_next_fqs == ULONG_MAX)
        jiffies_till_next_fqs = d;

where:

#define RCU_JIFFIES_TILL_FORCE_QS (1 + (HZ > 250) + (HZ > 500))
#define RCU_JIFFIES_FQS_DIV     256
static ulong jiffies_till_first_fqs = ULONG_MAX;
static ulong jiffies_till_next_fqs = ULONG_MAX;

In the next step of the rcu_init_geometry, we check that rcu_fanout_leaf didn't change (it has the same value as CONFIG_RCU_FANOUT_LEAF in compile-time) and equal to the value of the CONFIG_RCU_FANOUT_LEAF configuration option, we just return:

if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
    nr_cpu_ids == NR_CPUS)
    return;

After this we need to compute the number of nodes that an rcu_node tree can handle with the given number of levels:

rcu_capacity[0] = 1;
rcu_capacity[1] = rcu_fanout_leaf;
for (i = 2; i <= MAX_RCU_LVLS; i++)
    rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;

As we calculated geometry of the rcu_node tree, we need to go back to the rcu_init function and next step we need to initialize two rcu_state structures with the rcu_init_one function:

rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);

The rcu_init_one function takes two arguments:

  • Global RCU state;

  • Per-CPU data for RCU.

extern struct rcu_state rcu_bh_state;
DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);

function. This function registers a handler of the pending interrupt. Pending interrupt or softirq supposes that part of actions can be delayed for later execution when the system is less loaded. Pending interrupts is represented by the following structure:

struct softirq_action
{
        void    (*action)(struct softirq_action *);
};
$ cat /proc/softirqs
                    CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7
          HI:          2          0          0          1          0          2          0          0
       TIMER:     137779     108110     139573     107647     107408     114972      99653      98665
      NET_TX:       1127          0          4          0          1          1          0          0
      NET_RX:        334        221     132939       3076        451        361        292        303
       BLOCK:       5253       5596          8        779       2016      37442         28       2855
BLOCK_IOPOLL:          0          0          0          0          0          0          0          0
     TASKLET:         66          0       2916        113          0         24      26708          0
       SCHED:     102350      75950      91705      75356      75323      82627      69279      69914
     HRTIMER:        510        302        368        260        219        255        248        246
         RCU:      81290      68062      82979      69015      68390      69385      63304      63473

The open_softirq function takes two parameters:

  • index of the interrupt;

  • interrupt handler.

and adds interrupt handler to the array of the pending interrupts:

void open_softirq(int nr, void (*action)(struct softirq_action *))
{
        softirq_vec[nr].action = action;
}
cpu_notifier(rcu_cpu_notify, 0);
pm_notifier(rcu_pm_notify, 0);
for_each_online_cpu(cpu)
    rcu_cpu_notify(NULL, CPU_UP_PREPARE, (void *)(long)cpu);
void rcu_early_boot_tests(void)
{
        pr_info("Running RCU self tests\n");

        if (rcu_self_test)
                 early_boot_test_call_rcu();
         if (rcu_self_test_bh)
                 early_boot_test_call_rcu_bh();
         if (rcu_self_test_sched)
                early_boot_test_call_rcu_sched();
}

which runs self tests for the RCU.

That's all. We saw initialization process of the RCU subsystem. As I wrote above, more about the RCU will be in the separate chapter about synchronization primitives.

Rest of the initialization process

Ok, we already passed the main theme of this part which is RCU initialization, but it is not the end of the Linux kernel initialization process. In the last paragraph of this theme we will see a couple of functions which work in the initialization time, but we will not dive into deep details around this function for different reasons. Some reasons not to dive into details are following:

  • They are not very important for the generic kernel initialization process and depend on the different kernel configuration;

  • They have the character of debugging and not important for now;

  • We will see many of this stuff in the separate parts/chapters.

In the next step we can see the functions which are related to the interrupts handling subsystem, they are:

  • early_irq_init

  • init_IRQ

  • softirq_init

We will see explanation about this functions and their implementation in the special part about interrupts and exceptions handling. After this many different functions (like init_timers, hrtimers_init, time_init, etc.) which are related to different timing and timers stuff. We will see more about these function in the chapter about timers.

local_irq_enable();

Conclusion

Links

where preempt_count_add calls the raw_cpu_add_4 macro which adds 1 to the given percpu variable (__preempt_count) in our case (more about precpu variables you can read in the part about ). Ok, we increased __preempt_count and the next step we can see the call of the barrier macro in the both macros. The barrier macro inserts an optimization barrier. In the processors with x86_64 architecture independent memory access operations can be performed in any order. That's why we need the opportunity to point compiler and processor on compliance of order. This mechanism is memory barrier. Let's consider a simple example:

In this case non-preemptible function foo can be preempted. As we put barrier macro in the preempt_disable and preempt_enable macros, it prevents the compiler from swapping preempt_count_inc with other statements. More about barriers you can read and .

which check state, and disabling (with cli instruction for x86_64) if they are enabled.

In the next step we can see the call of the idr_init_cache function which defined in the . The idr library is used in a various in the Linux kernel to manage assigning integer IDs to objects and looking up objects by id.

Here we can see the call of the kmem_cache_create. We already called the kmem_cache_init in the . This function create generalized caches again using the kmem_cache_alloc (more about caches we will see in the chapter). In our case, as we are using kmem_cache_t which will be used by the allocator and kmem_cache_create creates it. As you can see we pass five parameters to the kmem_cache_create:

and it will create kmem_cache for the integer IDs. Integer IDs is commonly used pattern to map set of integer IDs to the set of pointers. We can see usage of the integer IDs in the drivers subsystem. For example which represents the core of the i2c subsystem defines ID for the i2c adapter with the DEFINE_IDR macro:

More about integer ID management you can read .

The next step is initialization with the rcu_init function and its implementation depends on two kernel configuration options:

In the first case rcu_init will be in the and in the second case it will be defined in the . We will see the implementation of the tree rcu, but first of all about the RCU in general.

RCU or read-copy update is a scalable high-performance synchronization mechanism implemented in the Linux kernel. On the early stage the Linux kernel provided support and environment for the concurrently running applications, but all execution was serialized in the kernel using a single global lock. In our days linux kernel has no single global lock, but provides different mechanisms including , data structures and other. One of these mechanisms is - the read-copy update. The RCU technique is designed for rarely-modified data structures. The idea of the RCU is simple. For example we have a rarely-modified data structure. If somebody wants to change this data structure, we make a copy of this data structure and make all changes in the copy. In the same time all other users of the data structure use old version of it. Next, we need to choose safe moment when original version of the data structure will have no users and update it with the modified copy.

Of course this description of the RCU is very simplified. To understand some details about RCU, first of all we need to learn some terminology. Data readers in the RCU executed in the . Every time when data reader get to the critical section, it calls the rcu_read_lock, and rcu_read_unlock on exit from the critical section. If the thread is not in the critical section, it will be in state which called - quiescent state. The moment when every thread is in the quiescent state called - grace period. If a thread wants to remove an element from the data structure, this occurs in two steps. First step is removal - atomically removes element from the data structure, but does not release the physical memory. After this thread-writer announces and waits until it is finished. From this moment, the removed element is available to the thread-readers. After the grace period finished, the second step of the element removal will be started, it just removes the element from the physical memory.

There a couple of implementations of the RCU. Old RCU called classic, the new implementation called tree RCU. As you may already understand, the CONFIG_TREE_RCU kernel configuration option enables tree RCU. Another is the tiny RCU which depends on CONFIG_TINY_RCU and CONFIG_SMP=n. We will see more details about the RCU in general in the separate chapter about synchronization primitives, but now let's look on the rcu_init implementation from the :

The rcu_node structure is defined in the and contains information about current grace period, is grace period completed or not, CPUs or groups that need to switch in order for current grace period to proceed, etc. Every rcu_node contains a lock for a couple of CPUs. These rcu_node structures are embedded into a linear array in the rcu_state structure and represented as a tree with the root as the first element and covers all CPUs. As you can see the number of the rcu nodes determined by the NUM_RCU_NODES which depends on number of available CPUs:

As we calculated these , we check that previous defined jiffies_till_first_fqs and jiffies_till_next_fqs variables are equal to the (their default values) and set they equal to the calculated value. As we did not touch these variables before, they are equal to the ULONG_MAX:

And in the last step we calculate the number of rcu_nodes at each level of the tree in the .

Both variables defined in the with its percpu data:

About these states you can read . As I wrote above we need to initialize rcu_state structures and rcu_init_one function will help us with it. After the rcu_state initialization, we can see the call of the __rcu_init_preempt which depends on the CONFIG_PREEMPT_RCU kernel configuration option. It does the same as previous functions - initialization of the rcu_preempt_state structure with the rcu_init_one function which has rcu_state type. After this, in the rcu_init, we can see the call of the:

which is defined in the and contains only one field - handler of an interrupt. You can check about softirqs in the your system with the:

In our case the interrupt handler is - rcu_process_callbacks which is defined in the and does the RCU core processing for the current CPU. After we registered softirq interrupt for the RCU, we can see the following code:

Here we can see registration of the cpu notifier which needs in systems which supports and we will not dive into details about this theme. The last function in the rcu_init is the rcu_early_boot_tests:

After we initialized RCU, the next step which you can see in the is the - trace_init function. As you can understand from its name, this function initialize subsystem. You can read more about Linux kernel trace system - .

After the trace_init, we can see the call of the radix_tree_init. If you are familiar with the different data structures, you can understand from the name of this function that it initializes kernel implementation of the . This function is defined in the and you can read more about it in the part about .

The next couple of functions are related with the events - perf_event-init (there will be separate chapter about perf), initialization of the profiling with the profile_init. After this we enable irq with the call of the:

which expands to the sti instruction and making post initialization of the with the call of the kmem_cache_init_late function (As I wrote above we will know about the SLAB in the chapter).

After the post initialization of the SLAB, next point is initialization of the console with the console_init function from the .

After the console initialization, we can see the lockdep_info function which prints information about the . After this, we can see the initialization of the dynamic allocation of the debug objects with the debug_objects_mem_init, kernel memory leak initialization with the kmemleak_init, percpu pageset setup with the setup_per_cpu_pageset, setup of the policy with the numa_policy_init, setting time for the scheduler with the sched_clock_init, pidmap initialization with the call of the pidmap_init function for the initial PID namespace, cache creation with the anon_vma_init for the private virtual memory areas and early initialization of the with the acpi_early_init.

This is the end of the ninth part of the and here we saw initialization of the . In the last paragraph of this part (Rest of the initialization process) we will go through many functions but did not dive into details about their implementations. Do not worry if you do not know anything about these stuff or you know and do not understand anything about this. As I already wrote many times, we will see details of implementations in other parts or other chapters.

It is the end of the ninth part about the Linux kernel . In this part, we looked on the initialization process of the RCU subsystem. In the next part we will continue to dive into linux kernel initialization process and I hope that we will finish with the start_kernel function and will go to the rest_init function from the same source code file and will see the start of the first process.

If you have any questions or suggestions write me a comment or ping me at .

Please note that English is not my first language, And I am really sorry for any inconvenience. If you find any mistakes please send me PR to .

Linux Kernel initialization process
scheduler initialization
RCU
init/main.c
include/linux/preempt.h
Per-CPU variables
here
here
IRQs
lib/idr.c
places
init/main.c
Linux kernel memory management
slab
i2c
drivers/i2c/i2c-core-base.c
here
RCU
kernel/rcu/tiny.c
kernel/rcu/tree.c
lock-free data structures
percpu
critical section
kernel/rcu/tree.c
kernel/rcu/tree.h
jiffies
ULONG_MAX
loop
kernel/rcu/tree.h
here
include/linux/interrupt.h
kernel/rcu/tree.c
CPU hotplug
init/main.c
tracing
here
Radix tree
lib/radix-tree.c
Radix tree
perf
SLAB
Linux memory management
drivers/tty/tty_io.c
Lock dependency validator
detector
NUMA
ACPI
linux kernel initialization process
RCU
initialization process
init/main.c
twitter
linux-insides
lock-free data structures
kmemleak
ACPI
IRQs
RCU
RCU documentation
integer ID management
Documentation/memory-barriers.txt
Runtime locking correctness validator
Per-CPU variables
Linux kernel memory management
slab
i2c
Previous part