Time related system calls in the Linux kernel

This is the seventh and last part of the chapter, which describes timers and time management related stuff in the Linux kernel. In the previous part, we discussed timers in the context of x86_64: High Precision Event Timer and Time Stamp Counter. Internal time management is an interesting part of the Linux kernel, but of course not only the kernel needs the concept of time. Our programs also need to know the time. In this part, we will consider the implementation of some time management related system calls. These system calls are:

  • clock_gettime;

  • gettimeofday;

  • nanosleep.

We will start from a simple userspace C program and follow the whole way from the call of the standard library function to the implementation of certain system calls. As each architecture provides its own implementation of certain system calls, we will consider only x86_64 specific implementations of system calls, as this book is related to this architecture.

Additionally, we will not consider the concept of system calls in this part, but only the implementations of these three system calls in the Linux kernel. If you are interested in what a system call is, there is a special chapter about this.

So, let's start from the gettimeofday system call.

Implementation of the gettimeofday system call

As we can understand from the name gettimeofday, this function returns the current time. First of all, let's look at the following simple example:

#include <time.h>
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    char buffer[40];
    struct timeval time;
        
    gettimeofday(&time, NULL);

    strftime(buffer, 40, "Current date/time: %m-%d-%Y/%T", localtime(&time.tv_sec));
    printf("%s\n",buffer);

    return 0;
}

As you can see, here we call the gettimeofday function, which takes two parameters. The first parameter is a pointer to the timeval structure, which represents an elapsed time:

struct timeval {
    time_t      tv_sec;     /* seconds */
    suseconds_t tv_usec;    /* microseconds */
};

The second parameter of the gettimeofday function is a pointer to the timezone structure, which represents a timezone; we pass NULL here because we do not need it. In our example, we pass the address of the timeval variable time to the gettimeofday function; the Linux kernel fills the given timeval structure and returns it back to us. Additionally, we format the time with the strftime function to get something more human readable than elapsed microseconds. Let's see the result:

~$ gcc date.c -o date
~$ ./date
Current date/time: 03-26-2016/16:42:02

The glibc implementation of gettimeofday tries to resolve a symbol, in our case __vdso_gettimeofday, by calling the _dl_vdso_vsym internal function. If the symbol cannot be resolved, _dl_vdso_vsym returns NULL and we fall back to the usual system call:

return (_dl_vdso_vsym ("__vdso_gettimeofday", &linux26)
  ?: (void*) (&__gettimeofday_syscall));

On the kernel side, gettimeofday is a weak alias of __vdso_gettimeofday:

int gettimeofday(struct timeval *, struct timezone *)
	__attribute__((weak, alias("__vdso_gettimeofday")));

The __vdso_gettimeofday function is defined in the arch/x86/entry/vdso/vclock_gettime.c source code file and calls the do_realtime function if the given timeval is not null:

notrace int __vdso_gettimeofday(struct timeval *tv, struct timezone *tz)
{
	if (likely(tv != NULL)) {
		if (unlikely(do_realtime((struct timespec *)tv) == VCLOCK_NONE))
			return vdso_fallback_gtod(tv, tz);
		tv->tv_usec /= 1000;
	}
	if (unlikely(tz != NULL)) {
		tz->tz_minuteswest = gtod->tz_minuteswest;
		tz->tz_dsttime = gtod->tz_dsttime;
	}

	return 0;
}

If do_realtime fails, we fall back to the real system call by executing the syscall instruction and passing the __NR_gettimeofday system call number together with the given timeval and timezone:

notrace static long vdso_fallback_gtod(struct timeval *tv, struct timezone *tz)
{
	long ret;

	asm("syscall" : "=a" (ret) :
	    "0" (__NR_gettimeofday), "D" (tv), "S" (tz) : "memory");
	return ret;
}

Inside do_realtime, we first try to access gtod (global time of day), the vsyscall_gtod_data structure, by calling gtod_read_begin, and we repeat the read until it is successful:

do {
	seq = gtod_read_begin(gtod);
	mode = gtod->vclock_mode;
	ts->tv_sec = gtod->wall_time_sec;
	ns = gtod->wall_time_snsec;
	ns += vgetsns(&mode);
	ns >>= gtod->shift;
} while (unlikely(gtod_read_retry(gtod, seq)));

ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
ts->tv_nsec = ns;

That's all about the gettimeofday system call. The next system call in our list is the clock_gettime.

Implementation of the clock_gettime system call

The clock_gettime function gets the time of the clock specified by the first parameter and stores it in the structure given by the second parameter. Generally, the clock_gettime function takes two parameters:

  • clk_id - clock identifier;

  • timespec - address of the timespec structure which represents elapsed time.

Let's look at the following simple example:

#include <time.h>
#include <sys/time.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    struct timespec elapsed_from_boot;

    clock_gettime(CLOCK_BOOTTIME, &elapsed_from_boot);

    printf("%ld - seconds elapsed from boot\n", (long)elapsed_from_boot.tv_sec);
    
    return 0;
}

which prints uptime information:

~$ gcc uptime.c -o uptime
~$ ./uptime
14180 - seconds elapsed from boot
~$ uptime
up  3:56

The elapsed_from_boot.tv_sec field represents elapsed time in seconds, so with integer division we get the minutes, the hours, and the minutes past the hour:

>>> 14180 // 60
236
>>> 14180 // 60 // 60
3
>>> 14180 // 60 % 60
56

The clk_id may be one of the following:

  • CLOCK_REALTIME - system wide clock which measures real or wall-clock time;

  • CLOCK_REALTIME_COARSE - faster but less precise version of CLOCK_REALTIME;

  • CLOCK_MONOTONIC - represents monotonic time since some unspecified starting point;

  • CLOCK_MONOTONIC_COARSE - faster but less precise version of CLOCK_MONOTONIC;

  • CLOCK_BOOTTIME - the same as CLOCK_MONOTONIC, but it also includes the time that the system was suspended;

  • CLOCK_PROCESS_CPUTIME_ID - per-process CPU time consumed by all threads in the process;

  • CLOCK_THREAD_CPUTIME_ID - thread-specific CPU time clock.

The implementation of clock_gettime depends on the clock id. If we have passed the CLOCK_REALTIME clock id, the do_realtime function will be called:

notrace int __vdso_clock_gettime(clockid_t clock, struct timespec *ts)
{
	switch (clock) {
	case CLOCK_REALTIME:
		if (do_realtime(ts) == VCLOCK_NONE)
			goto fallback;
		break;
    ...
    ...
    ...
fallback:
	return vdso_fallback_gettime(clock, ts);
}

In other cases, the do_{name_of_clock_id} function is called. The implementations of some of them are similar. For example, if we pass the CLOCK_MONOTONIC clock id:

...
...
...
case CLOCK_MONOTONIC:
	if (do_monotonic(ts) == VCLOCK_NONE)
		goto fallback;
	break;
...
...
...

the do_monotonic function will be called, which is very similar to the implementation of do_realtime:

notrace static int __always_inline do_monotonic(struct timespec *ts)
{
	unsigned long seq;
	u64 ns;
	int mode;

	do {
		seq = gtod_read_begin(gtod);
		mode = gtod->vclock_mode;
		ts->tv_sec = gtod->monotonic_time_sec;
		ns = gtod->monotonic_time_snsec;
		ns += vgetsns(&mode);
		ns >>= gtod->shift;
	} while (unlikely(gtod_read_retry(gtod, seq)));

	ts->tv_sec += __iter_div_u64_rem(ns, NSEC_PER_SEC, &ns);
	ts->tv_nsec = ns;

	return mode;
}

That's all.

Implementation of the nanosleep system call

The last system call in our list is nanosleep. As you can understand from its name, this function provides the ability to sleep. Let's look at the following simple example:

#include <time.h>
#include <stdlib.h>
#include <stdio.h>

int main (void)
{    
   struct timespec ts = {5,0};

   printf("sleep five seconds\n");
   nanosleep(&ts, NULL);
   printf("end of sleep\n");

   return 0;
}

If we compile and run it, the first line is printed immediately and the second line appears after five seconds:

~$ gcc sleep_test.c -o sleep
~$ ./sleep
sleep five seconds
end of sleep

On x86_64, the parameters of a system call are passed in the following registers:

  • rdi - first parameter;

  • rsi - second parameter;

  • rdx - third parameter;

  • r10 - fourth parameter;

  • r8 - fifth parameter;

  • r9 - sixth parameter.

The nanosleep system call has two parameters, both pointers to timespec structures. The system call suspends the calling thread until the given timeout has elapsed; it also finishes early if a signal interrupts its execution. The first parameter is the timespec which represents the timeout for the sleep. The second parameter receives the remaining time if the call of nanosleep was interrupted.

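nanosleep has two parameters:

int nanosleep(const struct timespec *req, struct timespec *rem);

so glibc has to load req and rem into registers before it executes the syscall instruction. This is done by the INTERNAL_SYSCALL macro:

# define INTERNAL_SYSCALL(name, err, nr, args...) \
  INTERNAL_SYSCALL_NCS (__NR_##name, err, nr, ##args)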
# define INTERNAL_SYSCALL_NCS(name, err, nr, args...)      \
  ({									                                      \
    unsigned long int resultvar;					                          \
    LOAD_ARGS_##nr (args)						                              \
    LOAD_REGS_##nr							                                  \
    asm volatile (							                                  \
    "syscall\n\t"							                                  \
    : "=a" (resultvar)							                              \
    : "0" (name) ASM_ARGS_##nr : "memory", REGISTERS_CLOBBERED_BY_SYSCALL);   \
    (long int) resultvar; })

The LOAD_ARGS_##nr macro calls the LOAD_ARGS_N macro where the N is number of arguments of the system call. In our case, it will be the LOAD_ARGS_2 macro. Ultimately all of these macros will be expanded to the following:

# define LOAD_REGS_TYPES_1(t1, a1)					   \
  register t1 _a1 asm ("rdi") = __arg1;					   \
  LOAD_REGS_0

# define LOAD_REGS_TYPES_2(t1, a1, t2, a2)				   \
  register t2 _a2 asm ("rsi") = __arg2;					   \
  LOAD_REGS_TYPES_1(t1, a1)
...
...
...

The system call handler for nanosleep is defined with the SYSCALL_DEFINE2 macro helper:

SYSCALL_DEFINE2(nanosleep, struct timespec __user *, rqtp,
		struct timespec __user *, rmtp)
{
	struct timespec tu;

	if (copy_from_user(&tu, rqtp, sizeof(tu)))
		return -EFAULT;

	if (!timespec_valid(&tu))
		return -EINVAL;

	return hrtimer_nanosleep(&tu, rmtp, HRTIMER_MODE_REL, CLOCK_MONOTONIC);
}

The timespec_valid function used by the handler looks like this:

static inline bool timespec_valid(const struct timespec *ts)
{
	if (ts->tv_sec < 0)
		return false;
	if ((unsigned long)ts->tv_nsec >= NSEC_PER_SEC)
		return false;
	return true;
}

The hrtimer_nanosleep function eventually reaches the following loop in do_nanosleep:

do {
	set_current_state(TASK_INTERRUPTIBLE);
	hrtimer_start_expires(&t->timer, mode);

	if (likely(t->task))
		freezable_schedule();

} while (t->task && !signal_pending(current));

__set_current_state(TASK_RUNNING);
return t->task == NULL;

This loop freezes the current task during sleep. After we set the TASK_INTERRUPTIBLE state for the current task, the hrtimer_start_expires function starts the given high-resolution timer on the current processor. When the given high-resolution timer expires, the task becomes runnable again.

That's all.

Additional notes on the implementations

A userspace application does not call a system call entry in the kernel directly. Before the actual system call entry is reached, we call a function from the standard library. In my case it is glibc, so I will consider this case. The glibc implementation of the gettimeofday function is located in the sysdeps/unix/sysv/linux/x86/gettimeofday.c source code file. As you may already know, gettimeofday is not a usual system call: it is located in the special area which is called vDSO (you can read more about it in the part which describes this concept).

The gettimeofday entry is located in the arch/x86/entry/vdso/vclock_gettime.c source code file, which is where gettimeofday is declared as a weak alias of __vdso_gettimeofday.

The do_realtime function gets the time data from the vsyscall_gtod_data structure, which is defined in the arch/x86/include/asm/vgtod.h header file and contains a mapping of the timespec structure and a couple of fields related to the current clock source in the system. This function fills the given structure with values from vsyscall_gtod_data, which contains time related data that is updated via the timer interrupt.

Once do_realtime has access to gtod, it fills ts->tv_sec with gtod->wall_time_sec, which stores the current time in seconds obtained from the real time clock during initialization of the timekeeping subsystem in the Linux kernel, and takes the same value in nanoseconds from gtod->wall_time_snsec. At the end, the given timespec structure is filled with the resulting values.

In the CLOCK_BOOTTIME example, the result can easily be checked with the help of the uptime util, which is what the last command of the example session does.

The list of clock ids has one more entry: CLOCK_MONOTONIC_RAW - the same as CLOCK_MONOTONIC but provides non NTP adjusted time.

clock_gettime is not a usual syscall either: like gettimeofday, this system call is placed in the vDSO area. The entry of this system call is located in the same source code file as for gettimeofday, arch/x86/entry/vdso/vclock_gettime.c.

We already saw most of the implementation of do_monotonic in the section about gettimeofday. There is only one difference: the sec and nsec of our timespec value are based on gtod->monotonic_time_sec and gtod->monotonic_time_snsec instead of the wall_time fields. The snsec value is derived from tk->tkr_mono.xtime_nsec, the number of nanoseconds elapsed.

nanosleep is not located in the vDSO area like the gettimeofday and clock_gettime functions, so the standard library has to call the real system call which is located in kernel space. It does so with the help of the syscall instruction. Before the syscall instruction is executed, the parameters of the system call must be put into processor registers in the order described in the System V Application Binary Interface: rdi, rsi, rdx, r10, r8 and r9.

To call the system call, we need to put req into the rdi register and rem into the rsi register. glibc does this job in the INTERNAL_SYSCALL macro, which is located in the sysdeps/unix/sysv/linux/x86_64/sysdep.h header file. The macro takes the name of the system call (all x86_64 system calls can be found in the system calls table), storage for a possible error during execution of the system call, the number of arguments (nr) and the arguments themselves. The INTERNAL_SYSCALL macro just expands to a call of the INTERNAL_SYSCALL_NCS macro, which prepares the arguments of the system call (puts them into the processor registers in the correct order), executes the syscall instruction and returns the result.

After the syscall instruction is executed, a context switch occurs and the kernel transfers execution to the system call handler. The system call handler for the nanosleep system call is located in the kernel/time/hrtimer.c source code file. More about the SYSCALL_DEFINE2 macro can be read in the chapter about system calls. The handler starts with a call of the copy_from_user function, which copies the given data from userspace to kernelspace. In our case, the timeout value is copied into a kernelspace timespec structure, which is then checked by the timespec_valid function.

timespec_valid just checks that the given timespec does not hold a negative number of seconds and that the nanoseconds do not overflow 1 second. The handler ends with a call of the hrtimer_nanosleep function from the same source code file. The hrtimer_nanosleep function creates a timer and calls the do_nanosleep function, which does the main job for us by running the sleep loop.

Conclusion

This is the end of the seventh part of the chapter that describes timers and timer management related stuff in the Linux kernel. In the previous part we saw x86_64 specific clock sources. As I wrote in the beginning, this part is the last part of this chapter. We saw important time management related concepts like the clocksource and clockevents frameworks, the jiffies counter, etc., in this chapter. Of course this does not cover all of the time management in the Linux kernel. Many parts of it are mostly related to scheduling, which we will see in another chapter.

If you have questions or suggestions, feel free to ping me on twitter 0xAX, drop me an email or just create an issue.

Please note that English is not my first language and I am really sorry for any inconvenience. If you found any mistakes, please send me a PR to linux-insides.

Links

  • system call

  • C programming language

  • standard library

  • glibc

  • real time clock

  • NTP

  • nanoseconds

  • register

  • System V Application Binary Interface

  • context switch

  • Introduction to timers in the Linux kernel

  • uptime

  • system calls table for x86_64

  • High Precision Event Timer

  • Time Stamp Counter

  • x86_64

  • previous part