System calls are the set of interfaces a modern operating system's kernel provides for user processes to interact with it. Through them a process can access hardware devices, perform inter-process communication, request operating-system resources, and so on. These interfaces exist chiefly to keep the system stable and reliable, preventing applications from running amok

Communicating with the Kernel

System calls add an intermediate layer between user-space processes and hardware devices. The layer's main purposes are:

  • First, it gives user space an abstract interface to the hardware. When reading or writing a file, for example, an application need not care about the disk type or medium, nor about the filesystem type
  • Second, system calls keep the system stable and secure. Acting as a middleman between hardware and applications, the kernel can arbitrate each requested access based on permissions, user type, and other rules
  • Third, every process runs in a virtual system, and providing a common interface between user space and the rest of the system serves that design as well

In Linux, system calls are the only means by which user space can reach the kernel; apart from exceptions and traps, they are the kernel's only legitimate entry point

System Calls

System calls are normally reached through function calls defined in the C library, each taking zero, one, or several parameters

A system call has a well-defined effect. The getpid() system call, for example, is defined to return the current process's PID; its kernel implementation is trivially simple:

SYSCALL_DEFINE0(getpid)
{
return task_tgid_vnr(current); //return current->tgid
}

SYSCALL_DEFINE0 is simply a macro that defines a system call taking no parameters; expanded, the code reads:

asmlinkage long sys_getpid(void)

Note the asmlinkage qualifier in the declaration: it is a compiler directive telling the compiler to look for the function's arguments on the stack only. All system calls require it. Next, the function returns long. To stay compatible between 32-bit and 64-bit systems, system calls have different return types in user space and kernel space: int in user space, long in kernel space. Finally, note that getpid() is defined in the kernel as sys_getpid(); this is the naming convention all Linux system calls follow

Why does getpid() return the TGID (thread-group ID)? Because for an ordinary process the TGID equals the PID, while all threads in the same thread group share one TGID, which lets those threads call getpid() and all obtain the same PID

System Call Numbers

Each system call is assigned a syscall number that identifies which call to execute; a process never refers to a system call by name

Syscall numbers matter a great deal: once assigned they must never change, or compiled programs would break. If a system call is removed, its number is not recycled. Linux keeps a "not implemented" system call, sys_ni_syscall(), which does nothing except return -ENOSYS. When a system call is removed or becomes unavailable, this function "fills the hole"

The kernel keeps a list of all registered system calls in the system call table, stored in sys_call_table. The table is defined explicitly for each architecture; for x86-64 it is defined in arch/i386/kernel/syscall_64.c

System Call Performance

Linux executes system calls faster than many other operating systems. One important reason is Linux's very short context-switch time: entering and exiting the kernel is streamlined. The other is that the system call handler, and each system call itself, are very simple

The System Call Handler

User programs cannot execute kernel code directly, so a program must somehow signal the system that it wants to perform a system call, causing a switch to kernel mode in which the kernel executes the call on the program's behalf

The signaling mechanism is a software interrupt: raise an exception, and the system switches to kernel mode and runs the exception handler, which here is in fact the system call handler. On x86 the predefined software interrupt is interrupt number 128, triggered by the int $0x80 instruction. That instruction raises an exception that sends the system into kernel mode to run exception handler number 128, which is precisely the system call handler, aptly named system_call()

Denoting the Correct System Call

On x86, the syscall number is passed to the kernel in the eax register. system_call() checks its validity by comparing it with NR_syscalls: if it is greater than or equal to NR_syscalls, the function returns -ENOSYS; otherwise the corresponding system call is executed:

call *sys_call_table(,%rax,8)

The kernel multiplies the given syscall number by the size of a table entry and uses the result to index the call's position in the table

Parameter Passing

Besides the syscall number, most system calls also need parameters. On x86-32, ebx, ecx, edx, esi, and edi hold the first five parameters in order. Calls needing six or more parameters are rare; in that case, a single register holds a pointer to the user-space memory where all the parameters are stored

Parameter Validation

A system call must carefully check all its parameters for validity, enforcing permissions and controlling resource access

The most important check is on user-supplied pointers. The kernel must ensure:

  • The pointer points into a user-space memory region
  • The pointer points into the process's own address space
  • The memory region's permissions match the operation being performed

The kernel provides copy_from_user() and copy_to_user() for reading data from and writing data to user space. On failure, both return the number of bytes that could not be copied; on success, they return 0

Note that copy_to_user() and copy_from_user() can both block. This happens when the page containing the user data has been swapped out to disk rather than residing in physical memory; the process then sleeps until the page-fault handler brings the page back from disk into physical memory

As for the last item, checking whether the caller holds a legitimate permission: older kernels used the suser() function for this check; newer kernels use capable() to check whether the caller has the capability to operate on the specified resource. It returns nonzero if permitted and 0 otherwise. For example, capable(CAP_SYS_NICE) checks whether the caller may alter another process's nice value

System Call Context

While executing a system call, the kernel is in system call context, and the current pointer refers to the current task, i.e. the process that issued the call

In process context the kernel can sleep (for example, when a system call blocks or explicitly calls schedule()) and can be preempted. Being able to sleep means system calls can use most of the kernel's functionality; being preemptible means the current task, just like a user-space process, can be preempted by another process

When the system call returns, control remains in system_call(), which ultimately handles the switch back to user space and lets the user process continue running

Final Steps in Binding a System Call

After a system call is written, registering it as an official system call is mostly routine:

  1. First, append an entry to the end of the system call table. Counting from 0, the call's position in the table is its syscall number
  2. For every supported architecture, the syscall number must be defined in <asm/unistd.h>
  3. The system call must be compiled into the kernel image (it cannot be a module). Just put it in a relevant file under kernel/, such as sys.c, which is home to miscellaneous system calls

Let's walk through these steps with a fictional system call, foo(). First we add sys_foo to the system call table. On most architectures the table lives in an entry.S file, like so:

ENTRY(sys_call_table)
.long sys_restart_syscall /* 0 */
.long sys_exit
.long sys_fork
.long sys_read
.long sys_write
.long sys_open /* 5 */

......
.long sys_eventfd2
.long sys_epoll_create1
.long sys_dup3 /* 330 */
.long sys_pipe2
.long sys_inotify_init1
.long sys_preadv
.long sys_pwritev
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
.long sys_recvmmsg

We append the new system call to the end of the table:

.long sys_foo

Although we never assigned a number explicitly, the new call is given the next number in sequence, 338

Note the convention of annotating every fifth entry with its call number; it makes looking up a call's number convenient

Next, add the syscall number to <asm/unistd.h>

Add this line:

#define __NR_foo	338

Finally, we implement the foo() system call. In this example we put it in kernel/sys.c; you could also place it in whichever file is most closely tied to its functionality. If it were scheduler-related, for example, kernel/sched.c would do as well

Like so:

#include <asm/page.h>
/*
* sys_foo
* return the size of the per-process kernel stack
*/
asmlinkage long sys_foo(void)
{
return THREAD_SIZE;
}

With that, we can boot the kernel and invoke foo() from user space

Accessing System Calls from User Space

Normally, system calls are supported through the C library

Linux itself, however, provides a set of macros for invoking system calls directly; they set up the registers and issue the trap instruction. These macros are _syscalln, where n ranges from 0 to 6 and stands for the number of parameters passed to the call

For example, the open system call is defined as:

long open(const char *filename, int flags, int mode)

Without library support, this system call can be invoked directly via the macro form:

#define __NR_open 5
_syscall3(long, open, const char*, filename, int, flags, int, mode)

The application can then use open directly

Each macro takes 2 + 2×n parameters: the first is the call's return type, the second its name, followed by the type and name of each parameter in call order

__NR_open is defined in <asm/unistd.h>. The macro expands into a C function containing inline assembly, which loads the syscall number and parameters into registers and triggers the software interrupt to trap into the kernel

For example, using the earlier foo system call:

#define __NR_foo 338
_syscall0(long, foo)

int main()
{
long stack_size;
stack_size = foo();
printf("The kernel stack size is %ld\n", stack_size);

return 0;
}

The syscall Process, from the Source Code

This section mainly follows the article: Linux syscall过程分析

The Linux source used is 4.9.76

int/iret

We trigger a system call via int 0x80:

movl $0x05, %eax    /* set the syscall number */
int $0x80

arch/x86/kernel/traps.ctrap_init 中,定义了各种set_intr_gate/set_intr_gate_ist/set_system_intr_gate。其中set_system_intr_gate用于在**中断描述符表(IDT)**上设置系统调用门:

#ifdef CONFIG_X86_32
set_system_intr_gate(IA32_SYSCALL_VECTOR, entry_INT80_32);
set_bit(IA32_SYSCALL_VECTOR, used_vectors);
#endif

The set_system_intr_gate() function installs an interrupt gate, as follows:

static inline void set_system_intr_gate(unsigned int n, void *addr) // insert an interrupt gate in the IDT's nth entry; the gate's segment selector is set to the kernel code segment, the offset to the handler address addr, and the DPL field to 3

Following the definition of IA32_SYSCALL_VECTOR, we find:

#define IA32_SYSCALL_VECTOR		0x80

In other words, interrupt 0x80 is wired to entry_INT80_32

So after int 0x80, the hardware uses the vector number to locate the matching IDT entry, i.e. the interrupt descriptor, and performs the privilege check; finding DPL = CPL = 3, it allows the call. The hardware then switches to the kernel stack (tss.ss0 : tss.esp0). Next, using the descriptor's segment selector, it finds the corresponding segment descriptor in the GDT/LDT, loads the segment base into cs, and loads the offset into eip. Finally the hardware pushes ss/sp/eflags/cs/ip/error code onto the kernel stack in turn

Execution thus begins at entry_INT80_32, defined in arch/x86/entry/entry_32.S

The code:

ENTRY(entry_INT80_32)
ASM_CLAC
pushl %eax /* pt_regs->orig_ax */
SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest */

/*
* User mode is traced as though IRQs are on, and the interrupt gate
* turned them off.
*/
TRACE_IRQS_OFF

movl %esp, %eax
call do_int80_syscall_32

So what does it mainly do?

It pushes the syscall number held in eax onto the stack, then calls SAVE_ALL to push and save the values of the remaining registers

SAVE_ALL is likewise defined in entry_32.S:

.macro SAVE_ALL pt_regs_ax=%eax
cld
PUSH_GS
pushl %fs
pushl %es
pushl %ds
pushl \pt_regs_ax
pushl %ebp
pushl %edi
pushl %esi
pushl %edx
pushl %ecx
pushl %ebx
movl $(__USER_DS), %edx
movl %edx, %ds
movl %edx, %es
movl $(__KERNEL_PERCPU), %edx
movl %edx, %fs
SET_KERNEL_GS %edx
.endm

As shown, the individual registers are first saved on the stack; then ds and es are set to __USER_DS, fs is set to __KERNEL_PERCPU, and gs is handled by SET_KERNEL_GS

With everything saved, the IRQ-off state is recorded (TRACE_IRQS_OFF; the interrupt gate already disabled interrupts), the current stack pointer is copied into eax, and do_int80_syscall_32 => do_syscall_32_irqs_on is called, defined in arch/x86/entry/common.c:

/* Handles int $0x80 */
__visible void do_int80_syscall_32(struct pt_regs *regs)
{
enter_from_user_mode();
local_irq_enable();
do_syscall_32_irqs_on(regs);
}
static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
{
struct thread_info *ti = current_thread_info();
unsigned int nr = (unsigned int)regs->orig_ax;

#ifdef CONFIG_IA32_EMULATION
current->thread.status |= TS_COMPAT;
#endif

if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
/*
* Subtlety here: if ptrace pokes something larger than
* 2^32-1 into orig_ax, this truncates it. This may or
* may not be necessary, but it matches the old asm
* behavior.
*/
nr = syscall_trace_enter(regs);
}

if (likely(nr < IA32_NR_syscalls)) {
/*
* It's possible that a 32-bit syscall implementation
* takes a 64-bit parameter but nonetheless assumes that
* the high bits are zero. Make sure we zero-extend all
* of the args.
*/
regs->ax = ia32_sys_call_table[nr](
(unsigned int)regs->bx, (unsigned int)regs->cx,
(unsigned int)regs->dx, (unsigned int)regs->si,
(unsigned int)regs->di, (unsigned int)regs->bp);
}

syscall_return_slowpath(regs);
}

The function's parameter regs (struct pt_regs, defined in arch/x86/include/asm/ptrace.h) is exactly the set of registers pushed onto the stack one by one earlier in entry_INT80_32

The structure is defined as follows (in 32-bit mode):

#ifdef __i386__
/* this struct defines the way the registers are stored on the
stack during a system call. */

#ifndef __KERNEL__

struct pt_regs {
long ebx;
long ecx;
long edx;
long esi;
long edi;
long ebp;
long eax;
int xds;
int xes;
int xfs;
int xgs;
long orig_eax;
long eip;
int xcs;
long eflags;
long esp;
int xss;
};

#endif /* __KERNEL__ */

#else /* __i386__ */

Looking back, this confirms that the layout matches the earlier register-save order. Note that all of this data lives on the kernel stack: interrupt handling first switched to the kernel stack, and only then were the values pushed

At this point eax holds $-ENOSYS (see entry_INT80_32), reserved for the system call's return value; it is orig_eax that actually holds the syscall number

OK, back to the do_syscall_32_irqs_on function

static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
{
struct thread_info *ti = current_thread_info();
unsigned int nr = (unsigned int)regs->orig_ax;

#ifdef CONFIG_IA32_EMULATION
current->thread.status |= TS_COMPAT;
#endif

if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
/*
* Subtlety here: if ptrace pokes something larger than
* 2^32-1 into orig_ax, this truncates it. This may or
* may not be necessary, but it matches the old asm
* behavior.
*/
nr = syscall_trace_enter(regs);
}

if (likely(nr < IA32_NR_syscalls)) {
/*
* It's possible that a 32-bit syscall implementation
* takes a 64-bit parameter but nonetheless assumes that
* the high bits are zero. Make sure we zero-extend all
* of the args.
*/
regs->ax = ia32_sys_call_table[nr](
(unsigned int)regs->bx, (unsigned int)regs->cx,
(unsigned int)regs->dx, (unsigned int)regs->si,
(unsigned int)regs->di, (unsigned int)regs->bp);
}

syscall_return_slowpath(regs);
}

First it fetches the current process's thread_info

It then takes the syscall number, looks up the corresponding handler in the system call table (ia32_sys_call_table), and invokes it with the arguments previously saved from the registers

Jumping to the definition of the system call table ia32_sys_call_table, we find:

__visible const sys_call_ptr_t ia32_sys_call_table[__NR_syscall_compat_max+1] = {
/*
* Smells like a compiler bug -- it doesn't work
* when the & below is removed.
*/
[0 ... __NR_syscall_compat_max] = &sys_ni_syscall,
#include <asm/syscalls_32.h>
};

This table is not written out statically; it is generated by arch/x86/entry/syscalls/syscalltbl.sh. You can trace the details yourself

The end result is a table like this:

__visible const sys_call_ptr_t ia32_sys_call_table[__NR_syscall_compat_max+1] = {
[0 ... __NR_syscall_compat_max] = &sys_ni_syscall,

[0] = sys_restart_syscall,
[1] = sys_exit,
[2] = sys_fork,
[3] = sys_read,
[4] = sys_write,
[5] = sys_open,
...
};

Since our call number is 0x05, sys_open is invoked here; it is defined in fs/open.c:

SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
{
if (force_o_largefile())
flags |= O_LARGEFILE;

return do_sys_open(AT_FDCWD, filename, flags, mode);
}

Having found open's definition, let's look at how SYSCALL_DEFINE3 is defined

#define SYSCALL_DEFINEx(x, sname, ...)				\
SYSCALL_METADATA(sname, x, __VA_ARGS__) \
__SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

#define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)
#define __SYSCALL_DEFINEx(x, name, ...) \
asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
__attribute__((alias(__stringify(SyS##name)))); \
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__)); \
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
{ \
long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__)); \
__MAP(x,__SC_TEST,__VA_ARGS__); \
__PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
return ret; \
} \
static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))

It consists of two parts: SYSCALL_METADATA and __SYSCALL_DEFINEx

SYSCALL_METADATA records basic information about the call for debuggers/tracers (the kernel must be built with CONFIG_FTRACE_SYSCALLS)

__SYSCALL_DEFINEx pastes the function together: the name becomes sys_open, and the parameters are likewise pasted via __SC_DECL, ultimately yielding this expanded definition:

asmlinkage long sys_open(const char __user * filename, int flags, umode_t mode)
{
if (force_o_largefile())
flags |= O_LARGEFILE;

return do_sys_open(AT_FDCWD, filename, flags, mode);
}

Reading on, sys_open calls do_sys_open; that is, sys_open is a wrapper around do_sys_open

On to do_sys_open:

long do_sys_open(int dfd, const char __user *filename, int flags, umode_t mode)
{
struct open_flags op;
int fd = build_open_flags(flags, mode, &op);
struct filename *tmp;

if (fd)
return fd;

tmp = getname(filename);
if (IS_ERR(tmp))
return PTR_ERR(tmp);

fd = get_unused_fd_flags(flags);
if (fd >= 0) {
struct file *f = do_filp_open(dfd, tmp, &op);
if (IS_ERR(f)) {
put_unused_fd(fd);
fd = PTR_ERR(f);
} else {
fsnotify_open(f);
fd_install(fd, f);
}
}
putname(tmp);
return fd;
}

At last the clouds part: we have traced our way to the real implementation of the open system call

First, getname copies the filename from user space into kernel space

Then get_unused_fd_flags obtains an unused file descriptor fd, do_filp_open creates the struct file, and fd_install binds fd to the struct file (task_struct->files->fdt[fd] = file); finally fd is returned

fd propagates all the way back to do_syscall_32_irqs_on, where it is stored into regs->ax (eax):

regs->ax = ia32_sys_call_table[nr](
(unsigned int)regs->bx, (unsigned int)regs->cx,
(unsigned int)regs->dx, (unsigned int)regs->si,
(unsigned int)regs->di, (unsigned int)regs->bp);

Execution then returns to entry_INT80_32 and finally runs INTERRUPT_RETURN, which arch/x86/include/asm/irqflags.h defines as iret:

#define INTERRUPT_RETURN     iret

iret restores the previously pushed registers and returns to user mode. The system call is complete

sysenter/sysexit

Next up is Intel's fast system call mechanism for 32-bit, sysenter/sysexit, which is analogous to AMD's contemporaneous syscall/sysret

New instructions were introduced because making system calls through a software interrupt was simply too slow. Starting with the Pentium II (Family 6, Model 3, Stepping 3), Intel x86 CPUs support the sysenter/sysexit instructions: the former switches from a lower privilege level to ring 0, the latter from ring 0 back to a lower privilege level. There is no privilege-level check (CPL, DPL) and no pushing onto the stack

The Intel SDM describes the sysenter instruction. The CPU has a set of special registers called Model-Specific Registers (MSRs), which play an important role while the operating system runs; they must be read and written with the dedicated RDMSR and WRMSR instructions

sysenter uses the following MSRs (defined in arch/x86/include/asm/msr-index.h):

  • IA32_SYSENTER_CS (174H): the segment selector of the kernel-mode handler code
  • IA32_SYSENTER_ESP (175H): the kernel-mode stack pointer
  • IA32_SYSENTER_EIP (176H): the entry offset of the kernel-mode handler code

Executing sysenter performs the following:

  1. Clear the VM flag in FLAGS, ensuring protected mode
  2. Clear the IF flag in FLAGS, masking interrupts
  3. Load the value of IA32_SYSENTER_ESP into esp
  4. Load the value of IA32_SYSENTER_EIP into eip
  5. Load the value of IA32_SYSENTER_CS into cs
  6. Load IA32_SYSENTER_CS + 8 into ss, since in the GDT ss immediately follows cs
  7. Begin executing the code at cs:eip

These MSRs are initialized in enable_sep_cpu in arch/x86/kernel/cpu/common.c:

/*
* Set up the CPU state needed to execute SYSENTER/SYSEXIT instructions
* on 32-bit kernels:
*/
#ifdef CONFIG_X86_32
void enable_sep_cpu(void)
{
struct tss_struct *tss;
int cpu;

if (!boot_cpu_has(X86_FEATURE_SEP))
return;

cpu = get_cpu();
tss = &per_cpu(cpu_tss, cpu);

/*
* We cache MSR_IA32_SYSENTER_CS's value in the TSS's ss1 field --
* see the big comment in struct x86_hw_tss's definition.
*/

tss->x86_tss.ss1 = __KERNEL_CS;
wrmsr(MSR_IA32_SYSENTER_CS, tss->x86_tss.ss1, 0);

wrmsr(MSR_IA32_SYSENTER_ESP,
(unsigned long)tss + offsetofend(struct tss_struct, SYSENTER_stack),
0);

wrmsr(MSR_IA32_SYSENTER_EIP, (unsigned long)entry_SYSENTER_32, 0);

put_cpu();
}
#endif

Here __KERNEL_CS is cached in tss->x86_tss.ss1 and written to MSR_IA32_SYSENTER_CS, the end of tss.SYSENTER_stack is written to MSR_IA32_SYSENTER_ESP, and finally the address of the kernel entry point entry_SYSENTER_32 is written to MSR_IA32_SYSENTER_EIP

When a user program makes a system call, in user space it ultimately ends up calling __kernel_vsyscall, mapped in via the vDSO and defined in arch/x86/entry/vdso/vdso32/system_call.S:

__kernel_vsyscall:
CFI_STARTPROC
/*
* Reshuffle regs so that all of any of the entry instructions
* will preserve enough state.
*
* A really nice entry sequence would be:
* pushl %edx
* pushl %ecx
* movl %esp, %ecx
*
* Unfortunately, naughty Android versions between July and December
* 2015 actually hardcode the traditional Linux SYSENTER entry
* sequence. That is severely broken for a number of reasons (ask
* anyone with an AMD CPU, for example). Nonetheless, we try to keep
* it working approximately as well as it ever worked.
*
* This link may eludicate some of the history:
* https://android-review.googlesource.com/#/q/Iac3295376d61ef83e713ac9b528f3b50aa780cd7
* personally, I find it hard to understand what's going on there.
*
* Note to future user developers: DO NOT USE SYSENTER IN YOUR CODE.
* Execute an indirect call to the address in the AT_SYSINFO auxv
* entry. That is the ONLY correct way to make a fast 32-bit system
* call on Linux. (Open-coding int $0x80 is also fine, but it's
* slow.)
*/
pushl %ecx
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET ecx, 0
pushl %edx
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET edx, 0
pushl %ebp
CFI_ADJUST_CFA_OFFSET 4
CFI_REL_OFFSET ebp, 0

#define SYSENTER_SEQUENCE "movl %esp, %ebp; sysenter"
#define SYSCALL_SEQUENCE "movl %ecx, %ebp; syscall"

#ifdef CONFIG_X86_64
/* If SYSENTER (Intel) or SYSCALL32 (AMD) is available, use it. */
ALTERNATIVE_2 "", SYSENTER_SEQUENCE, X86_FEATURE_SYSENTER32, \
SYSCALL_SEQUENCE, X86_FEATURE_SYSCALL32
#else
ALTERNATIVE "", SYSENTER_SEQUENCE, X86_FEATURE_SEP
#endif

/* Enter using int $0x80 */
int $0x80
GLOBAL(int80_landing_pad)

/*
* Restore EDX and ECX in case they were clobbered. EBP is not
* clobbered (the kernel restores it), but it's cleaner and
* probably faster to pop it than to adjust ESP using addl.
*/
popl %ebp
CFI_RESTORE ebp
CFI_ADJUST_CFA_OFFSET -4
popl %edx
CFI_RESTORE edx
CFI_ADJUST_CFA_OFFSET -4
popl %ecx
CFI_RESTORE ecx
CFI_ADJUST_CFA_OFFSET -4
ret
CFI_ENDPROC

.size __kernel_vsyscall,.-__kernel_vsyscall
.previous

__kernel_vsyscall first pushes the current register values to save them, since those registers will shortly be used to pass syscall arguments. It then fills in the arguments and executes sysenter

ALTERNATIVE_2 is effectively making a choice: if X86_FEATURE_SYSENTER32 is supported (Intel CPUs), run SYSENTER_SEQUENCE; if X86_FEATURE_SYSCALL32 is supported (AMD CPUs), run SYSCALL_SEQUENCE. If neither is supported, do nothing at all and simply fall through to the int $0x80 below, degrading to the traditional (legacy) way of making the system call

Note that the sysenter instruction clobbers esp, which is why SYSENTER_SEQUENCE first saves the current esp into ebp. sysenter clobbers eip too, but since the return address is fixed (the tail of __kernel_vsyscall), it need not be saved

As described earlier, after the sysenter instruction we are directly in kernel mode with the registers already set up: eip holds IA32_SYSENTER_EIP, i.e. the address of entry_SYSENTER_32, defined in arch/x86/entry/entry_32.S

ENTRY(entry_SYSENTER_32)
movl TSS_sysenter_sp0(%esp), %esp
sysenter_past_esp:
pushl $__USER_DS /* pt_regs->ss */
pushl %ebp /* pt_regs->sp (stashed in bp) */
pushfl /* pt_regs->flags (except IF = 0) */
orl $X86_EFLAGS_IF, (%esp) /* Fix IF */
pushl $__USER_CS /* pt_regs->cs */
pushl $0 /* pt_regs->ip = 0 (placeholder) */
pushl %eax /* pt_regs->orig_ax */
SAVE_ALL pt_regs_ax=$-ENOSYS /* save rest */

/*
* SYSENTER doesn't filter flags, so we need to clear NT, AC
* and TF ourselves. To save a few cycles, we can check whether
* either was set instead of doing an unconditional popfq.
* This needs to happen before enabling interrupts so that
* we don't get preempted with NT set.
*
* If TF is set, we will single-step all the way to here -- do_debug
* will ignore all the traps. (Yes, this is slow, but so is
* single-stepping in general. This allows us to avoid having
* a more complicated code to handle the case where a user program
* forces us to single-step through the SYSENTER entry code.)
*
* NB.: .Lsysenter_fix_flags is a label with the code under it moved
* out-of-line as an optimization: NT is unlikely to be set in the
* majority of the cases and instead of polluting the I$ unnecessarily,
* we're keeping that code behind a branch which will predict as
* not-taken and therefore its instructions won't be fetched.
*/
testl $X86_EFLAGS_NT|X86_EFLAGS_AC|X86_EFLAGS_TF, PT_EFLAGS(%esp)
jnz .Lsysenter_fix_flags
.Lsysenter_flags_fixed:

/*
* User mode is traced as though IRQs are on, and SYSENTER
* turned them off.
*/
TRACE_IRQS_OFF

movl %esp, %eax
call do_fast_syscall_32

As mentioned, sysenter loads IA32_SYSENTER_ESP into esp, but that MSR holds the address of SYSENTER_stack, so it must be corrected via TSS_sysenter_sp0 to point at the process's kernel stack

The relevant registers are then pushed onto the stack following the pt_regs layout, including the user-mode stack pointer that was stashed in ebp before sysenter. Since eip did not need saving, a 0 is pushed as a placeholder.

Finally do_fast_syscall_32 is called, defined in arch/x86/entry/common.c:

/* Returns 0 to return using IRET or 1 to return using SYSEXIT/SYSRETL. */
__visible long do_fast_syscall_32(struct pt_regs *regs)
{
/*
* Called using the internal vDSO SYSENTER/SYSCALL32 calling
* convention. Adjust regs so it looks like we entered using int80.
*/

unsigned long landing_pad = (unsigned long)current->mm->context.vdso +
vdso_image_32.sym_int80_landing_pad;

/*
* SYSENTER loses EIP, and even SYSCALL32 needs us to skip forward
* so that 'regs->ip -= 2' lands back on an int $0x80 instruction.
* Fix it up.
*/
regs->ip = landing_pad;

enter_from_user_mode();

local_irq_enable();

/* Fetch EBP from where the vDSO stashed it. */
if (
#ifdef CONFIG_X86_64
/*
* Micro-optimization: the pointer we're following is explicitly
* 32 bits, so it can't be out of range.
*/
__get_user(*(u32 *)&regs->bp,
(u32 __user __force *)(unsigned long)(u32)regs->sp)
#else
get_user(*(u32 *)&regs->bp,
(u32 __user __force *)(unsigned long)(u32)regs->sp)
#endif
) {

/* User code screwed up. */
local_irq_disable();
regs->ax = -EFAULT;
prepare_exit_to_usermode(regs);
return 0; /* Keep it simple: use IRET. */
}

/* Now this is just like a normal syscall. */
do_syscall_32_irqs_on(regs);

#ifdef CONFIG_X86_64
/*
* Opportunistic SYSRETL: if possible, try to return using SYSRETL.
* SYSRETL is available on all 64-bit CPUs, so we don't need to
* bother with SYSEXIT.
*
* Unlike 64-bit opportunistic SYSRET, we can't check that CX == IP,
* because the ECX fixup above will ensure that this is essentially
* never the case.
*/
return regs->cs == __USER32_CS && regs->ss == __USER_DS &&
regs->ip == landing_pad &&
(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF)) == 0;
#else
/*
* Opportunistic SYSEXIT: if possible, try to return using SYSEXIT.
*
* Unlike 64-bit opportunistic SYSRET, we can't check that CX == IP,
* because the ECX fixup above will ensure that this is essentially
* never the case.
*
* We don't allow syscalls at all from VM86 mode, but we still
* need to check VM, because we might be returning from sys_vm86.
*/
return static_cpu_has(X86_FEATURE_SEP) &&
regs->cs == __USER_CS && regs->ss == __USER_DS &&
regs->ip == landing_pad &&
(regs->flags & (X86_EFLAGS_RF | X86_EFLAGS_TF | X86_EFLAGS_VM)) == 0;
#endif
}
#endif

Since eip was not saved, we must compute the user-space address to return to once the system call completes:

current->mm->context.vdso + vdso_image_32.sym_int80_landing_pad (that is, the int80_landing_pad label near the end of __kernel_vsyscall, just past the int $0x80) overwrites the 0 pushed earlier

From here the flow matches int 0x80: do_syscall_32_irqs_on finds the matching handler in the system call table and calls it. Afterwards, if everything satisfies sysexit's requirements the function returns 1, otherwise 0

Back in the entry code:

call	do_fast_syscall_32
/* XEN PV guests always use IRET path */
ALTERNATIVE "testl %eax, %eax; jz .Lsyscall_32_done", \
"jmp .Lsyscall_32_done", X86_FEATURE_XENPV

/* Opportunistic SYSEXIT */
TRACE_IRQS_ON /* User mode traces as IRQs on. */
movl PT_EIP(%esp), %edx /* pt_regs->ip */
movl PT_OLDESP(%esp), %ecx /* pt_regs->sp */
1: mov PT_FS(%esp), %fs
PTGS_TO_GS
popl %ebx /* pt_regs->bx */
addl $2*4, %esp /* skip pt_regs->cx and pt_regs->dx */
popl %esi /* pt_regs->si */
popl %edi /* pt_regs->di */
popl %ebp /* pt_regs->bp */
popl %eax /* pt_regs->ax */

/*
* Restore all flags except IF. (We restore IF separately because
* STI gives a one-instruction window in which we won't be interrupted,
* whereas POPF does not.)
*/
addl $PT_EFLAGS-PT_DS, %esp /* point esp at pt_regs->flags */
btr $X86_EFLAGS_IF_BIT, (%esp)
popfl

/*
* Return back to the vDSO, which will pop ecx and edx.
* Don't bother with DS and ES (they already contain __USER_DS).
*/
sti
sysexit

Per testl %eax, %eax; jz .Lsyscall_32_done: if do_fast_syscall_32's return value (in eax) is 0, a fast return is not possible, so we jump to .Lsyscall_32_done and return via iret. Otherwise execution continues below, restoring the values saved on the kernel stack into their registers and returning via the sysexit at the end

Note how the saved eip is placed in edx and the saved esp in ecx: per the Intel SDM, sysexit sets eip from edx and esp from ecx, pointing back at the earlier user-space code offset and stack offset. It also loads IA32_SYSENTER_CS + 16 into cs and IA32_SYSENTER_CS + 24 into ss. With that, we are back at the tail of __kernel_vsyscall in user mode

Looking back, you will notice that __kernel_vsyscall saved the original ecx and edx on the stack, and pops them back after returning from entry_SYSENTER_32; the reason is that during the return to user mode, ecx and edx are commandeered to carry esp and eip

Comparing sysenter with int 0x80 once more, the chief difference is that int 0x80 traps into the entry point via a software interrupt, while sysenter loads the entry address into eip directly

syscall/sysret

In 32-bit mode, Intel and AMD diverged on the fast-syscall instruction, one using sysenter and the other syscall; in 64-bit mode everything was unified on syscall

Per the Intel SDM, syscall saves the current rip into rcx, then loads IA32_LSTAR into rip. It also loads IA32_STAR[47:32] into cs and IA32_STAR[47:32] + 8 into ss (in the GDT, ss immediately follows cs)

The registers involved in syscall are initialized in arch/x86/kernel/cpu/common.c:

#ifdef CONFIG_X86_64
/*......*/
void syscall_init(void)
{
wrmsr(MSR_STAR, 0, (__USER32_CS << 16) | __KERNEL_CS);
wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

#ifdef CONFIG_IA32_EMULATION
wrmsrl(MSR_CSTAR, (unsigned long)entry_SYSCALL_compat);
/*
* This only works on Intel CPUs.
* On AMD CPUs these MSRs are 32-bit, CPU truncates MSR_IA32_SYSENTER_EIP.
* This does not cause SYSENTER to jump to the wrong location, because
* AMD doesn't allow SYSENTER in long mode (either 32- or 64-bit).
*/
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)__KERNEL_CS);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, (u64)entry_SYSENTER_compat);
#else
wrmsrl(MSR_CSTAR, (unsigned long)ignore_sysret);
wrmsrl_safe(MSR_IA32_SYSENTER_CS, (u64)GDT_ENTRY_INVALID_SEG);
wrmsrl_safe(MSR_IA32_SYSENTER_ESP, 0ULL);
wrmsrl_safe(MSR_IA32_SYSENTER_EIP, 0ULL);
#endif

/* Flags to clear on syscall */
wrmsrl(MSR_SYSCALL_MASK,
X86_EFLAGS_TF|X86_EFLAGS_DF|X86_EFLAGS_IF|
X86_EFLAGS_IOPL|X86_EFLAGS_AC|X86_EFLAGS_NT);
}

We can see that bits 32-47 of MSR_STAR are set to the kernel-mode cs and bits 48-63 to the user-mode cs, while IA32_LSTAR is set to the start address of entry_SYSCALL_64

So on syscall, execution jumps to entry_SYSCALL_64, defined in arch/x86/entry/entry_64.S:

ENTRY(entry_SYSCALL_64)
/*
* Interrupts are off on entry.
* We do not frame this tiny irq-off block with TRACE_IRQS_OFF/ON,
* it is too small to ever cause noticeable irq latency.
*/
SWAPGS_UNSAFE_STACK
SWITCH_KERNEL_CR3_NO_STACK
/*
* A hypervisor implementation might want to use a label
* after the swapgs, so that it can do the swapgs
* for the guest and jump here on syscall.
*/
GLOBAL(entry_SYSCALL_64_after_swapgs)

movq %rsp, PER_CPU_VAR(rsp_scratch)
movq PER_CPU_VAR(cpu_current_top_of_stack), %rsp

TRACE_IRQS_OFF

/* Construct struct pt_regs on stack */
pushq $__USER_DS /* pt_regs->ss */
pushq PER_CPU_VAR(rsp_scratch) /* pt_regs->sp */
pushq %r11 /* pt_regs->flags */
pushq $__USER_CS /* pt_regs->cs */
pushq %rcx /* pt_regs->ip */
pushq %rax /* pt_regs->orig_ax */
pushq %rdi /* pt_regs->di */
pushq %rsi /* pt_regs->si */
pushq %rdx /* pt_regs->dx */
pushq %rcx /* pt_regs->cx */
pushq $-ENOSYS /* pt_regs->ax */
pushq %r8 /* pt_regs->r8 */
pushq %r9 /* pt_regs->r9 */
pushq %r10 /* pt_regs->r10 */
pushq %r11 /* pt_regs->r11 */
sub $(6*8), %rsp /* pt_regs->bp, bx, r12-15 not saved */

/*
* If we need to do entry work or if we guess we'll need to do
* exit work, go straight to the slow path.
*/
movq PER_CPU_VAR(current_task), %r11
testl $_TIF_WORK_SYSCALL_ENTRY|_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
jnz entry_SYSCALL64_slow_path

entry_SYSCALL_64_fastpath:
/*
* Easy case: enable interrupts and issue the syscall. If the syscall
* needs pt_regs, we'll call a stub that disables interrupts again
* and jumps to the slow path.
*/
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
#if __SYSCALL_MASK == ~0
cmpq $__NR_syscall_max, %rax
#else
andl $__SYSCALL_MASK, %eax
cmpl $__NR_syscall_max, %eax
#endif
ja 1f /* return -ENOSYS (already in pt_regs->ax) */
movq %r10, %rcx

/*
* This call instruction is handled specially in stub_ptregs_64.
* It might end up jumping to the slow path. If it jumps, RAX
* and all argument registers are clobbered.
*/
call *sys_call_table(, %rax, 8)
.Lentry_SYSCALL_64_after_fastpath_call:

movq %rax, RAX(%rsp)
1:

/*
* If we get here, then we know that pt_regs is clean for SYSRET64.
* If we see that no exit work is required (which we are required
* to check with IRQs off), then we can go straight to SYSRET64.
*/
DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
movq PER_CPU_VAR(current_task), %r11
testl $_TIF_ALLWORK_MASK, TASK_TI_flags(%r11)
jnz 1f

LOCKDEP_SYS_EXIT
TRACE_IRQS_ON /* user mode is traced as IRQs on */
movq RIP(%rsp), %rcx
movq EFLAGS(%rsp), %r11
RESTORE_C_REGS_EXCEPT_RCX_R11
/*
* This opens a window where we have a user CR3, but are
* running in the kernel. This makes using the CS
* register useless for telling whether or not we need to
* switch CR3 in NMIs. Normal interrupts are OK because
* they are off here.
*/
SWITCH_USER_CR3
movq RSP(%rsp), %rsp
USERGS_SYSRET64

1:
/*
* The fast path looked good when we started, but something changed
* along the way and we need to switch to the slow path. Calling
* raise(3) will trigger this, for example. IRQs are off.
*/
TRACE_IRQS_ON
ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_EXTRA_REGS
movq %rsp, %rdi
call syscall_return_slowpath /* returns with IRQs disabled */
jmp return_from_SYSCALL_64

entry_SYSCALL64_slow_path:
/* IRQs are off. */
SAVE_EXTRA_REGS
movq %rsp, %rdi
call do_syscall_64 /* returns with IRQs disabled */

return_from_SYSCALL_64:
RESTORE_EXTRA_REGS
TRACE_IRQS_IRETQ /* we're about to change IF */

/*
* Try to use SYSRET instead of IRET if we're returning to
* a completely clean 64-bit userspace context.
*/
movq RCX(%rsp), %rcx
movq RIP(%rsp), %r11
cmpq %rcx, %r11 /* RCX == RIP */
jne opportunistic_sysret_failed

/*
* On Intel CPUs, SYSRET with non-canonical RCX/RIP will #GP
* in kernel space. This essentially lets the user take over
* the kernel, since userspace controls RSP.
*
* If width of "canonical tail" ever becomes variable, this will need
* to be updated to remain correct on both old and new CPUs.
*/
.ifne __VIRTUAL_MASK_SHIFT - 47
.error "virtual address width changed -- SYSRET checks need update"
.endif

/* Change top 16 bits to be the sign-extension of 47th bit */
shl $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx
sar $(64 - (__VIRTUAL_MASK_SHIFT+1)), %rcx

/* If this changed %rcx, it was not canonical */
cmpq %rcx, %r11
jne opportunistic_sysret_failed

cmpq $__USER_CS, CS(%rsp) /* CS must match SYSRET */
jne opportunistic_sysret_failed

movq R11(%rsp), %r11
cmpq %r11, EFLAGS(%rsp) /* R11 == RFLAGS */
jne opportunistic_sysret_failed

/*
* SYSCALL clears RF when it saves RFLAGS in R11 and SYSRET cannot
* restore RF properly. If the slowpath sets it for whatever reason, we
* need to restore it correctly.
*
* SYSRET can restore TF, but unlike IRET, restoring TF results in a
* trap from userspace immediately after SYSRET. This would cause an
* infinite loop whenever #DB happens with register state that satisfies
* the opportunistic SYSRET conditions. For example, single-stepping
* this user code:
*
* movq $stuck_here, %rcx
* pushfq
* popq %r11
* stuck_here:
*
* would never get past 'stuck_here'.
*/
testq $(X86_EFLAGS_RF|X86_EFLAGS_TF), %r11
jnz opportunistic_sysret_failed

/* nothing to check for RSP */

cmpq $__USER_DS, SS(%rsp) /* SS must match SYSRET */
jne opportunistic_sysret_failed

/*
* We win! This label is here just for ease of understanding
* perf profiles. Nothing jumps here.
*/
syscall_return_via_sysret:
/* rcx and r11 are already restored (see code above) */
RESTORE_C_REGS_EXCEPT_RCX_R11
/*
* This opens a window where we have a user CR3, but are
* running in the kernel. This makes using the CS
* register useless for telling whether or not we need to
* switch CR3 in NMIs. Normal interrupts are OK because
* they are off here.
*/
SWITCH_USER_CR3
movq RSP(%rsp), %rsp
USERGS_SYSRET64

opportunistic_sysret_failed:
/*
* This opens a window where we have a user CR3, but are
* running in the kernel. This makes using the CS
* register useless for telling whether or not we need to
* switch CR3 in NMIs. Normal interrupts are OK because
* they are off here.
*/
SWITCH_USER_CR3
SWAPGS
jmp restore_c_regs_and_iret
END(entry_SYSCALL_64)

Note that syscall does not save the stack pointer, so the handler first stores the current user-mode rsp in the per-cpu variable rsp_scratch, then loads PER_CPU_VAR(cpu_current_top_of_stack), the kernel stack top, into rsp

It then pushes the register values onto the kernel stack, including:

  • rax system call number
  • rcx return address
  • r11 saved rflags (note: r11 is callee-clobbered register in C ABI)
  • rdi arg0
  • rsi arg1
  • rdx arg2
  • r10 arg3 (needs to be moved to rcx to conform to C ABI)
  • r8 arg4
  • r9 arg5

Together this forms the 64-bit pt_regs structure:

struct pt_regs {
/*
* C ABI says these regs are callee-preserved. They aren't saved on kernel entry
* unless syscall needs a complete, fully filled "struct pt_regs".
*/
unsigned long r15;
unsigned long r14;
unsigned long r13;
unsigned long r12;
unsigned long rbp;
unsigned long rbx;
/* These regs are callee-clobbered. Always saved on kernel entry. */
unsigned long r11;
unsigned long r10;
unsigned long r9;
unsigned long r8;
unsigned long rax;
unsigned long rcx;
unsigned long rdx;
unsigned long rsi;
unsigned long rdi;
/*
* On syscall entry, this is syscall#. On CPU exception, this is error code.
* On hw interrupt, it's IRQ number:
*/
unsigned long orig_rax;
/* Return frame for iretq */
unsigned long rip;
unsigned long cs;
unsigned long eflags;
unsigned long rsp;
unsigned long ss;
/* top of stack page */
};

A couple of points stand out. For the registers to look untouched across the system call, r11 and rcx clearly must be saved up front, because during syscall r11 is used to record the flags register and rcx to record rip

Turning to the do_syscall_64 function reveals another interesting point:

#ifdef CONFIG_X86_64
__visible void do_syscall_64(struct pt_regs *regs)
{
struct thread_info *ti = current_thread_info();
unsigned long nr = regs->orig_ax;

enter_from_user_mode();
local_irq_enable();

if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY)
nr = syscall_trace_enter(regs);

/*
* NB: Native and x32 syscalls are dispatched from the same
* table. The only functional difference is the x32 bit in
* regs->orig_ax, which changes the behavior of some syscalls.
*/
if (likely((nr & __SYSCALL_MASK) < NR_syscalls)) {
regs->ax = sys_call_table[nr & __SYSCALL_MASK](
regs->di, regs->si, regs->dx,
regs->r10, regs->r8, regs->r9);
}

syscall_return_slowpath(regs);
}
#endif

As is well known, 64-bit arguments are passed in registers, in order: rdi, rsi, rdx, rcx, r8, r9

So when calling a C function, the fourth argument is placed in the rcx register

For a system call, however, the fourth argument travels in r10, again because rcx is used to hold rip

If you look closely at the implementations of libc's syscall wrappers, you will see that when a system call is made, the fourth argument, i.e. rcx, is first moved into r10, and only then is syscall executed