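Notes on 趣谈Linux操作系统 (Fun Talks on the Linux Operating System), 4.0: System Calls: Once the Company Is Set Up, It Starts Taking On Projects
Author: Internet
System Calls: Once the Company Is Set Up, It Starts Taking On Projects
- Software platform: Ubuntu 16.04 LTS x64 running under VMware Workstation 12 Player
- Development environment: Linux 4.19-rc3 kernel, glibc-2.9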
1. Wrapping system calls: glibc
We use open() as the example for the source walk-through. Because the analysis targets Linux, the relevant glibc directory is sysdeps/unix.
- This directory contains **syscalls.list, which lists the system call behind each glibc function**:

    # File name   Caller   Syscall name   Args     Strong name   Weak names
    # ......
    open          -        open           Ci:siv   __libc_open   __open open
    # ......
- The same directory holds a script, make-syscalls.sh (the wrapper generator). Working from the syscalls.list in the directory it is given, it emits a build rule for every wrapped system call; the wrapping template itself is syscall-template.S. The original notes say that a file named open.c is generated for open(). As the script shows, what it actually emits is a make rule that pipes a few #define lines plus #include <syscall-template.S> into the compiler.

    #!/bin/sh

    # Usage: make-syscalls.sh ../sysdeps/unix/common
    # Expects $sysdirs in environment.

    ##############################################################################
    #
    # This script processes the system call data encoded in the various
    # syscalls.list files and produces thin assembly syscall wrappers around
    # the appropriate OS syscall. See syscall-template.S for more details on
    # the actual wrappers.
    #
    # Syscall Signature Prefixes:
    #
    # E: errno and return value are not set by the call
    # V: errno is not set, but errno or zero (success) is returned from the call
    #
    # Syscall Signature Key Letters:
    #
    # a: unchecked address (e.g., 1st arg to mmap)
    # b: non-NULL buffer (e.g., 2nd arg to read; return value from mmap)
    # B: optionally-NULL buffer (e.g., 4th arg to getsockopt)
    # f: buffer of 2 ints (e.g., 4th arg to socketpair)
    # F: 3rd arg to fcntl
    # i: scalar (any signedness & size: int, long, long long, enum, whatever)
    # I: 3rd arg to ioctl
    # n: scalar buffer length (e.g., 3rd arg to read)
    # N: pointer to value/return scalar buffer length (e.g., 6th arg to recvfrom)
    # p: non-NULL pointer to typed object (e.g., any non-void* arg)
    # P: optionally-NULL pointer to typed object (e.g., 2nd argument to gettimeofday)
    # s: non-NULL string (e.g., 1st arg to open)
    # S: optionally-NULL string (e.g., 1st arg to acct)
    # v: vararg scalar (e.g., optional 3rd arg to open)
    # V: byte-per-page vector (3rd arg to mincore)
    # W: wait status, optionally-NULL pointer to int (e.g., 2nd arg of wait4)
    #
    ##############################################################################

    thisdir=$1; shift

    echo ''
    echo \#### DIRECTORY = $thisdir
    # Check each sysdep dir with higher priority than this one, and remove
    # from $calls all the functions found in other dirs. Stop when we reach
    # the directory defining these syscalls.
    sysdirs=`for dir in $sysdirs; do
             test $dir = $thisdir && break; echo $dir; done`
    echo \#### SYSDIRS = $sysdirs

    # Get the list of system calls for the current directory.
    calls=`sed 's/#.*$//
    /^[ ]*$/d' $thisdir/syscalls.list`

    calls=`echo "$calls" |
    while read file caller rest; do
      # Remove each syscall that is implemented by a file in $dir.
      # If a syscall specifies a "caller", compile that syscall only if the
      # caller function is also implemented in this directory.
      srcfile=-;
      for dir in $sysdirs; do
        { test -f $dir/$file.c && srcfile=$dir/$file.c; } ||
        { test -f $dir/$file.S && srcfile=$dir/$file.S; } ||
        { test x$caller != x- &&
          { { test -f $dir/$caller.c && srcfile=$dir/$caller.c; } ||
            { test -f $dir/$caller.S && srcfile=$dir/$caller.S; }; }; } && break;
      done;
      echo $file $srcfile $caller $rest;
    done`

    # Any calls left?
    test -n "$calls" || exit 0

    # This uses variables $weak, $strong, and $any_versioned.
    emit_weak_aliases()
    {
      # A shortcoming in the current gas is that it will only allow one
      # version-alias per symbol. So we create new strong aliases as needed.
      vcount=""

      # We use the <shlib-compat.h> macros to generate the versioned aliases
      # so that the version sets can be mapped to the configuration's
      # minimum version set as per shlib-versions DEFAULT lines. If an
      # entry point is specified in the form NAME@VERSION:OBSOLETED, a
      # SHLIB_COMPAT conditional is generated.
      if [ $any_versioned = t ]; then
        echo " echo '#include <shlib-compat.h>'; \\"
      fi

      for name in $weak; do
        case $name in
          *@@*)
            base=`echo $name | sed 's/@@.*//'`
            ver=`echo $name | sed 's/.*@@//;s/\./_/g'`
            echo " echo '#if IS_IN (libc)'; \\"
            if test -z "$vcount" ; then
              source=$strong
              vcount=1
            else
              source="${strong}_${vcount}"
              vcount=`expr $vcount + 1`
              echo " echo 'strong_alias ($strong, $source)'; \\"
            fi
            echo " echo 'versioned_symbol (libc, $source, $base, $ver)'; \\"
            echo " echo '#else'; \\"
            echo " echo 'weak_alias ($strong, $base)'; \\"
            echo " echo '#endif'; \\"
            ;;
          *@*)
            base=`echo $name | sed 's/@.*//'`
            ver=`echo $name | sed 's/.*@//;s/\./_/g'`
            case $ver in
              *:*)
                compat_ver=${ver#*:}
                ver=${ver%%:*}
                compat_cond=" && SHLIB_COMPAT (libc, $ver, $compat_ver)"
                ;;
              *)
                compat_cond=
                ;;
            esac
            echo " echo '#if defined SHARED && IS_IN (libc)$compat_cond'; \\"
            if test -z "$vcount" ; then
              source=$strong
              vcount=1
            else
              source="${strong}_${vcount}"
              vcount=`expr $vcount + 1`
              echo " echo 'strong_alias ($strong, $source)'; \\"
            fi
            echo " echo 'compat_symbol (libc, $source, $base, $ver)'; \\"
            echo " echo '#endif'; \\"
            ;;
          !*)
            name=`echo $name | sed 's/.//'`
            echo " echo 'strong_alias ($strong, $name)'; \\"
            echo " echo 'hidden_def ($name)'; \\"
            ;;
          *)
            echo " echo 'weak_alias ($strong, $name)'; \\"
            echo " echo 'hidden_weak ($name)'; \\"
            ;;
        esac
      done
    }

    # Emit rules to compile the syscalls remaining in $calls.
    echo "$calls" |
    while read file srcfile caller syscall args strong weak; do

      vdso_syscall=
      case x"$syscall" in
      *:*@*)
        vdso_syscall="${syscall#*:}"
        syscall="${syscall%:*}"
        ;;
      esac

      case x"$syscall" in
      x-) callnum=_ ;;
      *)
        # Figure out whether $syscall is defined with a number in syscall.h.
        callnum=-
        eval `{ echo "#include <sysdep.h>";
                echo "callnum=SYS_ify ($syscall)"; } |
              $asm_CPP -D__OPTIMIZE__ - |
              sed -n -e "/^callnum=.*$syscall/d" \
                     -e "/^\(callnum=\)[ ]*\(.*\)/s//\1'\2'/p"`
        ;;
      esac

      noerrno=0
      errval=0
      case $args in
      E*) noerrno=1; args=`echo $args | sed 's/E:\?//'`;;
      V*) errval=1; args=`echo $args | sed 's/V:\?//'`;;
      esac

      # Derive the number of arguments from the argument signature.
      case $args in
      [0-9]) nargs=$args;;
      ?:) nargs=0;;
      ?:?) nargs=1;;
      ?:??) nargs=2;;
      ?:???) nargs=3;;
      ?:????) nargs=4;;
      ?:?????) nargs=5;;
      ?:??????) nargs=6;;
      ?:???????) nargs=7;;
      ?:????????) nargs=8;;
      ?:?????????) nargs=9;;
      esac

      # Make sure only the first syscall rule is used, if multiple dirs
      # define the same syscall.
      echo ''
      echo "#### CALL=$file NUMBER=$callnum ARGS=$args SOURCE=$srcfile"

      # If several dirs define the same syscall, only the first rule is used.
      any_versioned=f
      shared_only=f
      case $weak in
        *@@*) any_versioned=t ;;
        *@*) any_versioned=t shared_only=t ;;
      esac

      case x$srcfile"$callnum" in
      x--)
        # Undefined callnum for an extra syscall.
        if [ x$caller != x- ]; then
          if [ $noerrno != 0 ]; then
            echo >&2 "$0: no number for $fileno, no-error syscall ($strong $weak)"
            exit 2
          fi
          echo "unix-stub-syscalls += $strong $weak"
        fi
        ;;
      x*-) ;; ### Do nothing for undefined callnum
      x-*)
        echo "ifeq (,\$(filter $file,\$(unix-syscalls)))"

        if test $shared_only = t; then
          # The versioned symbols are only in the shared library.
          echo "ifneq (,\$(filter .os,\$(object-suffixes)))"
        fi
        # Accumulate the list of syscall files for this directory.
        echo "unix-syscalls += $file"
        test x$caller = x- || echo "unix-extra-syscalls += $file"

        # Emit a compilation rule for this syscall.
        if test $shared_only = t; then
          # The versioned symbols are only in the shared library.
          echo "shared-only-routines += $file"
          test -n "$vdso_syscall" || echo "\$(objpfx)${file}.os: \\"
        else
          object_suffixes='$(object-suffixes)'
          test -z "$vdso_syscall" || object_suffixes='$(object-suffixes-noshared)'
          echo "\
    \$(foreach p,\$(sysd-rules-targets),\
    \$(foreach o,${object_suffixes},\$(objpfx)\$(patsubst %,\$p,$file)\$o)): \\"
        fi

        echo " \$(..)sysdeps/unix/make-syscalls.sh"
        case x"$callnum" in
        x_)
          echo "\
     \$(make-target-directory)
     (echo '/* Dummy module requested by syscalls.list */'; \\"
          ;;
        x*)
          echo "\
     \$(make-target-directory)
     (echo '#define SYSCALL_NAME $syscall'; \\
      echo '#define SYSCALL_NARGS $nargs'; \\
      echo '#define SYSCALL_SYMBOL $strong'; \\
      echo '#define SYSCALL_NOERRNO $noerrno'; \\
      echo '#define SYSCALL_ERRVAL $errval'; \\
      echo '#include <syscall-template.S>'; \\"
          ;;
        esac

        # Append any weak aliases or versions defined for this syscall function.
        emit_weak_aliases

        # And finally, pipe this all into the compiler.
        echo ' ) | $(compile-syscall) '"\
    \$(foreach p,\$(patsubst %$file,%,\$(basename \$(@F))),\$(\$(p)CPPFLAGS))"

        if test -n "$vdso_syscall"; then
          # In the shared library, we're going to emit an IFUNC using a vDSO function.
          # $vdso_syscall looks like "name@KERNEL_X.Y" where "name" is the symbol
          # name in the vDSO and KERNEL_X.Y is its symbol version.
          vdso_symbol="${vdso_syscall%@*}"
          vdso_symver="${vdso_syscall#*@}"
          vdso_symver=`echo "$vdso_symver" | sed 's/\./_/g'`
          cat <<EOF

    \$(foreach p,\$(sysd-rules-targets),\$(objpfx)\$(patsubst %,\$p,$file).os): \\
     \$(..)sysdeps/unix/make-syscalls.sh
     \$(make-target-directory)
     (echo '#define ${strong} __redirect_${strong}'; \\
      echo '#include <dl-vdso.h>'; \\
      echo '#undef ${strong}'; \\
      echo '#define vdso_ifunc_init() \\'; \\
      echo ' PREPARE_VERSION_KNOWN (symver, ${vdso_symver})'; \\
      echo '__ifunc (__redirect_${strong}, ${strong},'; \\
      echo ' _dl_vdso_vsym ("${vdso_symbol}", &symver), void,'; \\
      echo ' vdso_ifunc_init)'; \\
    EOF
          # This is doing "hidden_def (${strong})", but the compiler
          # doesn't know that we've defined ${strong} in the same file, so
          # we can't do it the normal way.
          cat <<EOF
      echo 'asm (".globl __GI_${strong}");'; \\
      echo 'asm ("__GI_${strong} = ${strong}");'; \\
    EOF
          emit_weak_aliases
          cat <<EOF
     ) | \$(compile-stdin.c) \
     \$(foreach p,\$(patsubst %$file,%,\$(basename \$(@F))),\$(\$(p)CPPFLAGS))
    EOF
        fi

        if test $shared_only = t; then
          # The versioned symbols are only in the shared library.
          echo endif
        fi

        echo endif
        ;;
      esac

    done
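The way the script derives the argument count from a signature such as Ci:siv can be sketched in C. This is only an illustration: syscall_nargs is a made-up helper, not part of glibc, and treating a leading C as a prefix is an assumption taken from the syscalls.list line quoted earlier (the script shown here strips only E and V).

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/*
 * Hypothetical helper (not part of glibc): returns the argument count
 * encoded in a syscalls.list signature, or -1 if it is malformed.
 */
int syscall_nargs(const char *sig)
{
    assert(sig != NULL);

    /* Strip one optional prefix letter and an optional ':' right after it,
       like the sed calls in the script. 'C' is an assumption (see above). */
    if (*sig == 'E' || *sig == 'V' || *sig == 'C') {
        sig++;
        if (*sig == ':')
            sig++;
    }

    /* A bare digit is taken as the argument count itself. */
    if (isdigit((unsigned char)sig[0]) && sig[1] == '\0')
        return sig[0] - '0';

    /* Otherwise expect "<ret>:<args>", one key letter per argument. */
    if (sig[0] == '\0' || sig[1] != ':')
        return -1;

    size_t n = strlen(sig + 2);
    return n <= 9 ? (int)n : -1;   /* the script handles at most 9 */
}
```

For the open entry, syscall_nargs("Ci:siv") yields 3: a string, a scalar, and a vararg scalar, matching the three parameters of open().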
- The actual wrapping rules are in syscall-template.S:

    /* Assembly code template for system call stubs.
       Copyright (C) 2009-2019 Free Software Foundation, Inc.
       This file is part of the GNU C Library.

       The GNU C Library is free software; you can redistribute it and/or
       modify it under the terms of the GNU Lesser General Public
       License as published by the Free Software Foundation; either
       version 2.1 of the License, or (at your option) any later version.

       The GNU C Library is distributed in the hope that it will be useful,
       but WITHOUT ANY WARRANTY; without even the implied warranty of
       MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
       Lesser General Public License for more details.

       You should have received a copy of the GNU Lesser General Public
       License along with the GNU C Library; if not, see
       <http://www.gnu.org/licenses/>. */

    /* The real guts of this work are in the macros defined in the
       machine- and kernel-specific sysdep.h header file. Cancellable syscalls
       should be implemented using C implementation with SYSCALL_CANCEL macro.

       Each system call's object is built by a rule in sysd-syscalls
       generated by make-syscalls.sh that #include's this file after
       defining a few macros:
         SYSCALL_NAME      syscall name
         SYSCALL_NARGS     number of arguments this call takes
         SYSCALL_SYMBOL    primary symbol name
         SYSCALL_NOERRNO   1 to define a no-errno version (see below)
         SYSCALL_ERRVAL    1 to define an error-value version (see below)

       We used to simply pipe the correct three lines below through cpp into
       the assembler. The main reason to have this file instead is so that
       stub objects can be assembled with -g and get source line information
       that leads a user back to a source file and these fine comments. The
       average user otherwise has a hard time knowing which "syscall-like"
       functions in libc are plain stubs and which have nontrivial C wrappers.
       Some versions of the "plain" stub generation macros are more than a few
       instructions long and the untrained eye might not distinguish them from
       some compiled code that inexplicably lacks source line information. */

    #include <sysdep.h>

    /* This indirection is needed so that SYMBOL gets macro-expanded. */
    #define syscall_hidden_def(SYMBOL) hidden_def (SYMBOL)

    #define T_PSEUDO(SYMBOL, NAME, N) PSEUDO (SYMBOL, NAME, N)
    #define T_PSEUDO_NOERRNO(SYMBOL, NAME, N) PSEUDO_NOERRNO (SYMBOL, NAME, N)
    #define T_PSEUDO_ERRVAL(SYMBOL, NAME, N) PSEUDO_ERRVAL (SYMBOL, NAME, N)
    #define T_PSEUDO_END(SYMBOL) PSEUDO_END (SYMBOL)
    #define T_PSEUDO_END_NOERRNO(SYMBOL) PSEUDO_END_NOERRNO (SYMBOL)
    #define T_PSEUDO_END_ERRVAL(SYMBOL) PSEUDO_END_ERRVAL (SYMBOL)

    #if SYSCALL_NOERRNO

    /* This kind of system call stub never returns an error.
       We return the return value register to the caller unexamined. */

    T_PSEUDO_NOERRNO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
        ret_NOERRNO
    T_PSEUDO_END_NOERRNO (SYSCALL_SYMBOL)

    #elif SYSCALL_ERRVAL

    /* This kind of system call stub returns the errno code as its return
       value, or zero for success. We may massage the kernel's return value
       to meet that ABI, but we never set errno here. */

    T_PSEUDO_ERRVAL (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
        ret_ERRVAL
    T_PSEUDO_END_ERRVAL (SYSCALL_SYMBOL)

    #else

    /* This is a "normal" system call stub: if there is an error,
       it returns -1 and sets errno. */

    T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
        ret
    T_PSEUDO_END (SYSCALL_SYMBOL)

    #endif

    syscall_hidden_def (SYSCALL_SYMBOL)
Reading the template shows that an ordinary system call takes the final #else branch. That branch is built on T_PSEUDO, which simply forwards to PSEUDO:

    #define T_PSEUDO(SYMBOL, NAME, N) PSEUDO (SYMBOL, NAME, N)

    /* This is a "normal" system call stub: if there is an error,
       it returns -1 and sets errno. */

    T_PSEUDO (SYSCALL_SYMBOL, SYSCALL_NAME, SYSCALL_NARGS)
        ret
    T_PSEUDO_END (SYSCALL_SYMBOL)

    #endif

    syscall_hidden_def (SYSCALL_SYMBOL)
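- The implementation of the PSEUDO macro can be found in sysdeps/unix/sysv/linux/sh/sysdep.h. Note that sh is the SuperH port, so the instructions below are SuperH assembly. Each architecture, including i386 and x86_64, has its own sysdep.h with a PSEUDO of the same shape: enter the function, run DO_CALL, then jump to the error handler if the kernel returned an error value.

    #define PSEUDO(name, syscall_name, args) \
        .text; \
        ENTRY (name); \
        DO_CALL (syscall_name, args); \
        mov r0,r1; \
        mov _IMM12,r2; \
        shad r2,r1; \
        not r1,r1; \
        tst r1,r1; \
        bf .Lpseudo_end; \
        SYSCALL_ERROR_HANDLER; \
     .Lpseudo_end: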
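The error check that follows DO_CALL in PSEUDO can be pictured in C. The following is only a sketch, assuming the usual Linux convention that a raw return value in the range -4095..-1 encodes an errno code; demo_syscall_ret and DEMO_MAX_ERRNO are names made up for this illustration and are not glibc symbols.

```c
#include <assert.h>
#include <errno.h>

/* Assumption for this sketch: Linux reserves -1..-4095 for error codes. */
#define DEMO_MAX_ERRNO 4095UL

/*
 * Hypothetical helper: what a "normal" stub does with the raw value the
 * kernel left in the return register.
 */
long demo_syscall_ret(long raw)
{
    /* One unsigned comparison catches exactly the values -4095..-1. */
    if ((unsigned long)raw >= -DEMO_MAX_ERRNO) {
        assert(raw < 0);       /* only negative values can get here */
        errno = (int)-raw;     /* hand the error code to the caller via errno */
        return -1;
    }
    return raw;                /* success: pass the value through unchanged */
}
```

A return value such as -4096 lies outside the reserved range, so it is passed through as an ordinary result.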
As the macro shows, almost every system call ends up going through the DO_CALL macro, and this macro is defined differently for 32-bit and 64-bit.
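From user space, the effect of such a stub can be observed without writing any assembly, through glibc's generic syscall() function. The sketch below assumes a Linux system with glibc; demo_raw_getpid and demo_raw_close are hypothetical names chosen for this example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Same kernel service as getpid(), but requested by system call number. */
long demo_raw_getpid(void)
{
    long pid = syscall(SYS_getpid);
    assert(pid > 0);            /* getpid cannot fail */
    return pid;
}

/* Closing an invalid descriptor: the wrapper returns -1 and sets errno. */
long demo_raw_close(int fd)
{
    return syscall(SYS_close, fd);
}
```

Calling demo_raw_close(-1) returns -1 with errno set to EBADF, which is the "returns -1 and sets errno" behaviour described in the template comment.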
2. The 32-bit system call path
DO_CALL is defined in sysdep.h under sysdeps/unix/sysv/linux/i386:
2.1 Running the 32-bit DO_CALL
- The comment lists the register that holds each argument (the system call number goes in eax). At the end, ENTER_KERNEL traps into the kernel:

    # define ENTER_KERNEL int $0x80

    /* Linux takes system call arguments in registers:

        syscall number  %eax    call-clobbered
        arg 1           %ebx    call-saved
        arg 2           %ecx    call-clobbered
        arg 3           %edx    call-clobbered
        arg 4           %esi    call-saved
        arg 5           %edi    call-saved
        arg 6           %ebp    call-saved
        ......
    */
    #undef DO_CALL
    #define DO_CALL(syscall_name, args) \
        PUSHARGS_##args \
        DOARGS_##args \
        movl $SYS_ify (syscall_name), %eax; \
        ENTER_KERNEL \
        POPARGS_##args
2.2 Trapping into the kernel from DO_CALL: ENTER_KERNEL
- The macro definition shows that ENTER_KERNEL is really int $0x80, which raises a software interrupt. That interrupt traps into the kernel and lands in the assembly routine ENTRY(entry_INT80_32). This has to be read together with the linux-4.19 source: in linux-4.19-rc3/arch/x86/entry/entry_32.S, the comment shows the same register assignment as the comment on DO_CALL in glibc.

    /*
     * Arguments:
     * eax  system call number
     * ebx  arg1
     * ecx  arg2
     * edx  arg3
     * esi  arg4
     * edi  arg5
     * ebp  arg6
     */
    ENTRY(entry_INT80_32)
        ASM_CLAC
        pushl   %eax            /* pt_regs->orig_ax */
        SAVE_ALL pt_regs_ax=$-ENOSYS switch_stacks=1    /* save rest */

        /*
         * User mode is traced as though IRQs are on, and the interrupt gate
         * turned them off.
         */
        TRACE_IRQS_OFF

        movl    %esp, %eax
        call    do_int80_syscall_32
    .Lsyscall_32_done:
        /*......*/
        INTERRUPT_RETURN
The call flow is:
- push and SAVE_ALL save the current user-mode registers into a pt_regs structure;
- do_int80_syscall_32 is called to perform the system call;
- INTERRUPT_RETURN returns to user mode.
- Definition of struct pt_regs:

    struct pt_regs {
        /*
         * NB: 32-bit x86 CPUs are inconsistent as what happens in the
         * following cases (where %seg represents a segment register):
         *
         * - pushl %seg: some do a 16-bit write and leave the high
         *   bits alone
         * - movl %seg, [mem]: some do a 16-bit write despite the movl
         * - IDT entry: some (e.g. 486) will leave the high bits of CS
         *   and (if applicable) SS undefined.
         *
         * Fortunately, x86-32 doesn't read the high bits on POP or IRET,
         * so we can just treat all of the segment registers as 16-bit
         * values.
         */
        unsigned long bx;
        unsigned long cx;
        unsigned long dx;
        unsigned long si;
        unsigned long di;
        unsigned long bp;
        unsigned long ax;
        unsigned short ds;
        unsigned short __dsh;
        unsigned short es;
        unsigned short __esh;
        unsigned short fs;
        unsigned short __fsh;
        unsigned short gs;
        unsigned short __gsh;
        unsigned long orig_ax;
        unsigned long ip;
        unsigned short cs;
        unsigned short __csh;
        unsigned long flags;
        unsigned long sp;
        unsigned short ss;
        unsigned short __ssh;
    };
- Implementation of do_int80_syscall_32():
  - it reads the system call number, which was pushed from eax and is now in pt_regs->orig_ax;
  - it uses that number to index the system call table ia32_sys_call_table (#define ia32_sys_call_table sys_call_table) and calls the function found there;
  - it takes the arguments saved from the registers and passes them as the function's arguments.

    /* Handles int $0x80 */
    __visible void do_int80_syscall_32(struct pt_regs *regs)
    {
        enter_from_user_mode();
        local_irq_enable();
        do_syscall_32_irqs_on(regs);
    }

    /* ...... */

    static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
    {
        struct thread_info *ti = current_thread_info();
        unsigned int nr = (unsigned int)regs->orig_ax;

        /* ...... */
        /*
         * It's possible that a 32-bit syscall implementation
         * takes a 64-bit parameter but nonetheless assumes that
         * the high bits are zero. Make sure we zero-extend all
         * of the args.
         */
        regs->ax = ia32_sys_call_table[nr](
            (unsigned int)regs->bx, (unsigned int)regs->cx,
            (unsigned int)regs->dx, (unsigned int)regs->si,
            (unsigned int)regs->di, (unsigned int)regs->bp);

        syscall_return_slowpath(regs);
    }
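The dispatch step can be modelled in ordinary user-space C. This is a toy sketch, not kernel code: struct demo_regs is a much-reduced stand-in for pt_regs, the two demo_sys_* functions and the table contents are invented, and the bounds check is an assumption about what a safe dispatcher needs.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, much-reduced stand-in for the kernel's struct pt_regs. */
struct demo_regs {
    long bx, cx, dx;   /* first three arguments */
    long ax;           /* the return value is stored here */
    long orig_ax;      /* system call number as passed in eax */
};

typedef long (*demo_call_ptr_t)(long, long, long);

static long demo_sys_add(long a, long b, long c) { return a + b + c; }
static long demo_sys_first(long a, long b, long c) { (void)b; (void)c; return a; }

#define DEMO_NR_SYSCALLS 4

/* Slots left NULL play the role of "not implemented" entries. */
static const demo_call_ptr_t demo_call_table[DEMO_NR_SYSCALLS] = {
    [1] = demo_sys_add,
    [3] = demo_sys_first,
};

void demo_do_syscall(struct demo_regs *regs)
{
    assert(regs != NULL);
    unsigned long nr = (unsigned long)regs->orig_ax;

    regs->ax = -ENOSYS;   /* default result, like pt_regs_ax=$-ENOSYS */
    if (nr < DEMO_NR_SYSCALLS && demo_call_table[nr] != NULL)
        regs->ax = demo_call_table[nr](regs->bx, regs->cx, regs->dx);
}
```

Filling a struct demo_regs with orig_ax = 1 and arguments 1, 2, 3 and passing it to demo_do_syscall leaves 6 in ax; an unknown number leaves -ENOSYS there.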
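- Definition of INTERRUPT_RETURN: the iret instruction restores the user-mode context that was saved on entry, including the code segment and the instruction pointer, and the user-mode process then resumes execution.

    #define INTERRUPT_RETURN iret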
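2.3 Summary
With open as the example, a 32-bit system call proceeds as follows:
1. User code calls open(const char *pathname, int flags, mode_t mode).
2. Execution enters glibc:
2.1 The call ends up in DO_CALL(syscall_name, args), which:
2.1.1 stores the system call number and the arguments in registers;
2.1.2 invokes ENTER_KERNEL to trap into the kernel:
2.1.2.1 push and SAVE_ALL save the current user-mode registers into a pt_regs structure;
2.1.2.2 do_int80_syscall_32 is called to perform the system call (now inside the kernel):
2.1.2.2.1 the system call number that was passed in eax is read back (from pt_regs->orig_ax);
2.1.2.2.2 the number is used to look up the function in the system call table ia32_sys_call_table (#define ia32_sys_call_table sys_call_table), and that function is called;
2.1.2.2.3 the arguments saved from the registers are passed as the function's arguments;
2.1.3 INTERRUPT_RETURN returns and restores user mode.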
3. The 64-bit system call path
3.1 Running the 64-bit DO_CALL
In sysdeps/unix/sysv/linux/x86_64/sysdep.h:
- The comment lists the register that holds each argument (the system call number goes in rax). At the end, the syscall instruction traps into the kernel:

    /* The Linux/x86-64 kernel expects the system call parameters in
       registers according to the following table:

        syscall number  rax
        arg 1           rdi
        arg 2           rsi
        arg 3           rdx
        arg 4           r10
        arg 5           r8
        arg 6           r9
    */
    # undef DO_CALL
    # define DO_CALL(syscall_name, args) \
        DOARGS_##args \
        movl $SYS_ify (syscall_name), %eax; \
        syscall;
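The register table above can be exercised directly with GCC-style inline assembly. The sketch below is an illustration, not glibc's code: demo_syscall3 is a hypothetical helper, it covers only three arguments, and on anything other than x86-64 Linux it falls back to glibc's syscall(), whose error reporting differs as noted in the comments.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical helper: issue a system call with up to three arguments. */
long demo_syscall3(long nr, long a1, long a2, long a3)
{
    assert(nr >= 0);
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* rax = number, rdi/rsi/rdx = arguments 1..3. The syscall instruction
       itself overwrites rcx and r11, so both are listed as clobbers. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr), "D"(a1), "S"(a2), "d"(a3)
                      : "rcx", "r11", "memory");
    return ret;   /* raw kernel value: a negative errno code on failure */
#else
    /* Fallback: glibc's generic syscall(), which reports failure as -1
       plus errno rather than as a negative errno code. */
    return syscall(nr, a1, a2, a3);
#endif
}
```

For example, demo_syscall3(SYS_getpid, 0, 0, 0) returns the same value as getpid().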
3.2 Trapping into the kernel from DO_CALL: syscall
- The syscall instruction relies on a special kind of register, the model-specific register (MSR). MSRs are registers the CPU provides for particular control functions, and system call entry is one of them.
- rdmsr and wrmsr read and write MSRs. The kernel registers its entry point with:

    wrmsrl(MSR_LSTAR, (unsigned long)entry_SYSCALL_64);

- MSR_LSTAR is one such register: when the syscall instruction executes, the CPU takes the function address from this register and jumps to it, which means it enters entry_SYSCALL_64. entry_SYSCALL_64 is defined in arch/x86/entry/entry_64.S:

    ENTRY(entry_SYSCALL_64)
        /* ....... */
        /* Construct struct pt_regs on stack */
        pushq   $__USER_DS                  /* pt_regs->ss */
        pushq   PER_CPU_VAR(rsp_scratch)    /* pt_regs->sp */
        pushq   %r11                        /* pt_regs->flags */
        pushq   $__USER_CS                  /* pt_regs->cs */
        pushq   %rcx                        /* pt_regs->ip */
    GLOBAL(entry_SYSCALL_64_after_hwframe)
        pushq   %rax                        /* pt_regs->orig_ax */

        PUSH_AND_CLEAR_REGS rax=$-ENOSYS

        TRACE_IRQS_OFF

        /* IRQs are off. */
        movq    %rax, %rdi
        movq    %rsp, %rsi
        call    do_syscall_64               /* returns with IRQs disabled */
        /* ...... */
    syscall_return_via_sysret:
        /* rcx and r11 are already restored (see code above) */
        UNWIND_HINT_EMPTY
        POP_REGS pop_rdi=0 skip_r11rcx=1

        /*
         * Now all regs are restored except RSP and RDI.
         * Save old stack pointer and switch to trampoline stack.
         */
        movq    %rsp, %rdi
        movq    PER_CPU_VAR(cpu_tss_rw + TSS_sp0), %rsp

        pushq   RSP-RDI(%rdi)   /* RSP */
        pushq   (%rdi)          /* RDI */

        /*
         * We are on the trampoline stack. All regs except RDI are live.
         * We can do future final exit work right here.
         */
        SWITCH_TO_USER_CR3_STACK scratch_reg=%rdi

        popq    %rdi
        popq    %rsp
        USERGS_SYSRET64
    END(entry_SYSCALL_64)
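The call flow is:
- pushq saves the current user-mode registers into a pt_regs structure;
- do_syscall_64 is called to perform the system call;
- USERGS_SYSRET64 returns to user mode.
- Implementation of do_syscall_64. It proceeds as follows:
  - it reads the system call number that was passed in rax (from regs->orig_ax);
  - it uses that number to look up the function in the system call table sys_call_table and calls it;
  - it takes the arguments saved from the registers and passes them as the function's arguments.

    __visible void do_syscall_64(struct pt_regs *regs)
    {
        struct thread_info *ti = current_thread_info();
        unsigned long nr = regs->orig_ax;

        /* ...... */
        if (likely((nr & __SYSCALL_MASK) < NR_syscalls)) {
            regs->ax = sys_call_table[nr & __SYSCALL_MASK](
                regs->di, regs->si, regs->dx,
                regs->r10, regs->r8, regs->r9);
        }

        syscall_return_slowpath(regs);
    }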
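- So whether the system is 32-bit or 64-bit, both paths arrive at the system call table sys_call_table.
- USERGS_SYSRET64: the sysretq instruction restores the user-mode context that was saved on entry, including the code segment and the instruction pointer, and the user-mode process then resumes execution.

    #define USERGS_SYSRET64 \
        swapgs; \
        sysretq;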
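3.3 Summary
With open as the example, a 64-bit system call proceeds as follows:
1. User code calls open(const char *pathname, int flags, mode_t mode).
2. Execution enters glibc:
2.1 The call ends up in DO_CALL(syscall_name, args), which:
2.1.1 stores the system call number and the arguments in registers;
2.1.2 executes syscall to trap into the kernel:
2.1.2.1 pushq saves the current user-mode registers into a pt_regs structure;
2.1.2.2 do_syscall_64 is called to perform the system call (now inside the kernel):
2.1.2.2.1 the system call number that was passed in rax is read back (from regs->orig_ax);
2.1.2.2.2 the number is used to look up the function in the system call table sys_call_table, and that function is called;
2.1.2.2.3 the arguments saved from the registers are passed as the function's arguments;
2.1.3 USERGS_SYSRET64 returns and restores user mode.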
4. Generating the system call table (in the Linux kernel)
4.1 Generating the 32-bit system call table
4.1.1 Definition of the system call table
- Location: arch/x86/entry/syscalls/syscall_32.tbl
- Source (the second comment line under "The format is" is the blog author's gloss of the columns):

    #
    # 32-bit system call numbers and entry vectors
    #
    # The format is:
    # <number> <abi> <name> <entry point> <compat entry point>
    # <system call number> <ABI name> <system call name> <kernel function implementing the call (entry point)> <compat entry point>
    #
    # The __ia32_sys and __ia32_compat_sys stubs are created on-the-fly for
    # sys_*() system calls and compat_sys_*() compat system calls if
    # IA32_EMULATION is defined, and expect struct pt_regs *regs as their only
    # parameter.
    #
    # The abi is always "i386" for this file.
    #
    0   i386   restart_syscall   sys_restart_syscall   __ia32_sys_restart_syscall
    1   i386   exit              sys_exit              __ia32_sys_exit
    2   i386   fork              sys_fork              __ia32_sys_fork
    3   i386   read              sys_read              __ia32_sys_read
    4   i386   write             sys_write             __ia32_sys_write
    5   i386   open              sys_open              __ia32_compat_sys_open
    6   i386   close             sys_close             __ia32_sys_close
    7   i386   waitpid           sys_waitpid           __ia32_sys_waitpid
    8   i386   creat             sys_creat             __ia32_sys_creat
4.1.2 Declaration of the system call functions
- Location: include/linux/syscalls.h
- Source:

    /* __ARCH_WANT_SYSCALL_NO_AT */
    asmlinkage long sys_open(const char __user *filename,
                             int flags, umode_t mode);
    asmlinkage long sys_link(const char __user *oldname,
                             const char __user *newname);
    asmlinkage long sys_unlink(const char __user *pathname);
4.1.3 Implementation of the system call functions
Taking open as the example:
- Location: fs/open.c
- Source:

    /* ...... */
    SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
    {
        if (force_o_largefile())
            flags |= O_LARGEFILE;

        return do_sys_open(AT_FDCWD, filename, flags, mode);
    }

    SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
                    umode_t, mode)
    {
        if (force_o_largefile())
            flags |= O_LARGEFILE;

        return do_sys_open(dfd, filename, flags, mode);
    }
    /* ...... */
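The listing shows that, inside the kernel, open is openat with the directory argument fixed to AT_FDCWD. The same relationship can be tried from user space. This is a sketch assuming Linux with glibc; demo_open is a hypothetical name, and the path used in the usage note is only an example.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical user-space counterpart of do_sys_open(AT_FDCWD, ...):
 * open a path by issuing the openat system call relative to the
 * current working directory.
 */
int demo_open(const char *path, int flags, mode_t mode)
{
    assert(path != NULL);
    return (int)syscall(SYS_openat, AT_FDCWD, path, flags, mode);
}
```

For instance, demo_open("/dev/null", O_RDONLY, 0) returns a usable file descriptor, while a path that does not exist yields -1 with errno set to ENOENT.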
- The form looks odd at first. The definitions in syscalls.h show that the macro is selected by the number of arguments:

    #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)

    #define SYSCALL_DEFINE_MAXARGS 6

    #define SYSCALL_DEFINEx(x, sname, ...) \
        SYSCALL_METADATA(sname, x, __VA_ARGS__) \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

    #define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)

    /*
     * The asmlinkage stub is aliased to a function named __se_sys_*() which
     * sign-extends 32-bit ints to longs whenever needed. The actual work is
     * done within __do_sys_*().
     */
    #ifndef __SYSCALL_DEFINEx
    #define __SYSCALL_DEFINEx(x, name, ...) \
        __diag_push(); \
        __diag_ignore(GCC, 8, "-Wattribute-alias", \
                      "Type aliasing is used to sanitize syscall arguments");\
        asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__)) \
            __attribute__((alias(__stringify(__se_sys##name)))); \
        ALLOW_ERROR_INJECTION(sys##name, ERRNO); \
        static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));\
        asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)); \
        asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__)) \
        { \
            long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__));\
            __MAP(x,__SC_TEST,__VA_ARGS__); \
            __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__)); \
            return ret; \
        } \
        __diag_pop(); \
        static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
    #endif /* __SYSCALL_DEFINEx */

- After macro expansion, SYSCALL_DEFINE3 yields, in simplified form (the __se_sys_ and __do_sys_ layers and the metadata are folded away), the following implementation:

    asmlinkage long sys_open(const char __user * filename, int flags, umode_t mode)
    {
        long ret;

        if (force_o_largefile())
            flags |= O_LARGEFILE;

        ret = do_sys_open(AT_FDCWD, filename, flags, mode);
        asmlinkage_protect(3, ret, filename, flags, mode);
        return ret;
    }
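The idea behind these macros, an outer function that takes every argument as long and an inner function with the declared types, can be imitated with a far smaller macro. The sketch below is a simplification under stated assumptions: DEMO_SYSCALL_DEFINE3 and the clamp "system call" are invented for this example, and the kernel's __MAP machinery, metadata, and aliasing are left out.

```c
#include <assert.h>

/*
 * Hypothetical three-argument variant: generates
 *   demo_sys_<name>(long, long, long)   - uniform entry point
 *   demo_do_sys_<name>(t1, t2, t3)      - typed body written by the user
 * The entry point casts each long back to its declared type.
 */
#define DEMO_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)            \
    static long demo_do_sys_##name(t1 a1, t2 a2, t3 a3);              \
    long demo_sys_##name(long a1, long a2, long a3)                   \
    {                                                                 \
        return demo_do_sys_##name((t1)a1, (t2)a2, (t3)a3);            \
    }                                                                 \
    static long demo_do_sys_##name(t1 a1, t2 a2, t3 a3)

/* The braces below become the body of demo_do_sys_clamp(). */
DEMO_SYSCALL_DEFINE3(clamp, int, value, int, lo, int, hi)
{
    assert(lo <= hi);
    if (value < lo)
        return lo;
    if (value > hi)
        return hi;
    return value;
}
```

Because every generated entry point has the same shape, all of them can be stored in one table of function pointers; demo_sys_clamp(5, 0, 3) returns 3.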
4.1.4 Tying them together
- This is done by arch/x86/entry/syscalls/Makefile (the lower-case comments are the blog author's annotations):

    # SPDX-License-Identifier: GPL-2.0
    # output directories
    out := arch/$(SRCARCH)/include/generated/asm
    uapi := arch/$(SRCARCH)/include/generated/uapi/asm

    # create the output directories if they do not exist yet
    _dummy := $(shell [ -d '$(out)' ] || mkdir -p '$(out)') \
              $(shell [ -d '$(uapi)' ] || mkdir -p '$(uapi)')

    # input tables: syscall_32.tbl for 32-bit, syscall_64.tbl for 64-bit
    syscall32 := $(srctree)/$(src)/syscall_32.tbl
    syscall64 := $(srctree)/$(src)/syscall_64.tbl

    # locations of the helper scripts
    syshdr := $(srctree)/$(src)/syscallhdr.sh
    systbl := $(srctree)/$(src)/syscalltbl.sh

    quiet_cmd_syshdr = SYSHDR $@
          cmd_syshdr = $(CONFIG_SHELL) '$(syshdr)' '$<' '$@' \
                       '$(syshdr_abi_$(basetarget))' \
                       '$(syshdr_pfx_$(basetarget))' \
                       '$(syshdr_offset_$(basetarget))'
    quiet_cmd_systbl = SYSTBL $@
          cmd_systbl = $(CONFIG_SHELL) '$(systbl)' $< $@

    quiet_cmd_hypercalls = HYPERCALLS $@
          cmd_hypercalls = $(CONFIG_SHELL) '$<' $@ $(filter-out $<,$^)

    # dependencies and the ABI selected for each output
    syshdr_abi_unistd_32 := i386
    $(uapi)/unistd_32.h: $(syscall32) $(syshdr)
        $(call if_changed,syshdr)

    syshdr_abi_unistd_32_ia32 := i386
    syshdr_pfx_unistd_32_ia32 := ia32_
    $(out)/unistd_32_ia32.h: $(syscall32) $(syshdr)
        $(call if_changed,syshdr)

    syshdr_abi_unistd_x32 := common,x32
    syshdr_offset_unistd_x32 := __X32_SYSCALL_BIT
    $(uapi)/unistd_x32.h: $(syscall64) $(syshdr)
        $(call if_changed,syshdr)

    syshdr_abi_unistd_64 := common,64
    $(uapi)/unistd_64.h: $(syscall64) $(syshdr)
        $(call if_changed,syshdr)

    syshdr_abi_unistd_64_x32 := x32
    syshdr_pfx_unistd_64_x32 := x32_
    # output file names and locations
    $(out)/unistd_64_x32.h: $(syscall64) $(syshdr)
        $(call if_changed,syshdr)

    $(out)/syscalls_32.h: $(syscall32) $(systbl)
        $(call if_changed,systbl)
    $(out)/syscalls_64.h: $(syscall64) $(systbl)
        $(call if_changed,systbl)

    $(out)/xen-hypercalls.h: $(srctree)/scripts/xen-hypercalls.sh
        $(call if_changed,hypercalls)

    $(out)/xen-hypercalls.h: $(srctree)/include/xen/interface/xen*.h

    # tie everything together and generate the output files
    uapisyshdr-y += unistd_32.h unistd_64.h unistd_x32.h
    syshdr-y += syscalls_32.h
    syshdr-$(CONFIG_X86_64) += unistd_32_ia32.h unistd_64_x32.h
    syshdr-$(CONFIG_X86_64) += syscalls_64.h
    syshdr-$(CONFIG_XEN) += xen-hypercalls.h

    targets += $(uapisyshdr-y) $(syshdr-y)

    PHONY += all
    all: $(addprefix $(uapi)/,$(uapisyshdr-y))
    all: $(addprefix $(out)/,$(syshdr-y))
        @:
- The Makefile relies on two scripts.
The first script, arch/x86/entry/syscalls/syscallhdr.sh, writes lines of the form #define __NR_open 5 into the output file:

    #!/bin/sh
    # SPDX-License-Identifier: GPL-2.0

    in="$1"
    out="$2"
    my_abis=`echo "($3)" | tr ',' '|'`
    prefix="$4"
    offset="$5"

    fileguard=_ASM_X86_`basename "$out" | sed \
        -e 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/' \
        -e 's/[^A-Z0-9_]/_/g' -e 's/__/_/g'`
    grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
        echo "#ifndef ${fileguard}"
        echo "#define ${fileguard} 1"
        echo ""

        # emits e.g. #define __NR_open 5
        while read nr abi name entry ; do
            if [ -z "$offset" ]; then
                echo "#define __NR_${prefix}${name} $nr"
            else
                echo "#define __NR_${prefix}${name} ($offset + $nr)"
            fi
        done

        echo ""
        echo "#endif /* ${fileguard} */"
    ) > "$out"
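The core of syscallhdr.sh, turning one table line into one #define line, can be restated in C. This is a sketch only: demo_emit_nr is a hypothetical function, it accepts a single ABI name instead of the script's alternation such as common|64, and it ignores the offset parameter.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: turn one table line into a "#define __NR_..." line.
 * Returns 0 on success, -1 if the line is a comment, is blank, belongs to
 * another ABI, or does not fit into the output buffer.
 */
int demo_emit_nr(const char *line, const char *abi, const char *prefix,
                 char *out, size_t outsz)
{
    unsigned nr;
    char line_abi[32], name[64];

    assert(line && abi && prefix && out);

    /* Columns: <number> <abi> <name> ... ; anything else is skipped. */
    if (sscanf(line, "%u %31s %63s", &nr, line_abi, name) != 3)
        return -1;
    if (strcmp(line_abi, abi) != 0)
        return -1;

    int n = snprintf(out, outsz, "#define __NR_%s%s %u", prefix, name, nr);
    return (n >= 0 && (size_t)n < outsz) ? 0 : -1;
}
```

Fed the open line of syscall_32.tbl with ABI i386 and an empty prefix, the function produces the text #define __NR_open 5.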
The second script, arch/x86/entry/syscalls/syscalltbl.sh, writes lines of the form __SYSCALL_I386(5, sys_open, ) into the output file:

    #!/bin/sh
    # SPDX-License-Identifier: GPL-2.0

    in="$1"
    out="$2"

    syscall_macro() {
        abi="$1"
        nr="$2"
        entry="$3"

        # Entry can be either just a function name or "function/qualifier"
        real_entry="${entry%%/*}"
        if [ "$entry" = "$real_entry" ]; then
            qualifier=
        else
            qualifier=${entry#*/}
        fi

        # emits e.g. __SYSCALL_I386(5, sys_open, )
        echo "__SYSCALL_${abi}($nr, $real_entry, $qualifier)"
    }

    emit() {
        abi="$1"
        nr="$2"
        entry="$3"
        compat="$4"
        umlentry=""

        if [ "$abi" = "64" -a -n "$compat" ]; then
            echo "a compat entry for a 64-bit syscall makes no sense" >&2
            exit 1
        fi

        # For CONFIG_UML, we need to strip the __x64_sys prefix
        if [ "$abi" = "64" -a "${entry}" != "${entry#__x64_sys}" ]; then
            umlentry="sys${entry#__x64_sys}"
        fi

        if [ -z "$compat" ]; then
            if [ -n "$entry" -a -z "$umlentry" ]; then
                syscall_macro "$abi" "$nr" "$entry"
            elif [ -n "$umlentry" ]; then # implies -n "$entry"
                echo "#ifdef CONFIG_X86"
                syscall_macro "$abi" "$nr" "$entry"
                echo "#else /* CONFIG_UML */"
                syscall_macro "$abi" "$nr" "$umlentry"
                echo "#endif"
            fi
        else
            echo "#ifdef CONFIG_X86_32"
            if [ -n "$entry" ]; then
                syscall_macro "$abi" "$nr" "$entry"
            fi
            echo "#else"
            syscall_macro "$abi" "$nr" "$compat"
            echo "#endif"
        fi
    }

    grep '^[0-9]' "$in" | sort -n | (
        while read nr abi name entry compat; do
            abi=`echo "$abi" | tr '[a-z]' '[A-Z]'`
            if [ "$abi" = "COMMON" -o "$abi" = "64" ]; then
                # COMMON is the same as 64, except that we don't expect X32
                # programs to use it. Our expectation has nothing to do with
                # any generated code, so treat them the same.
                emit 64 "$nr" "$entry" "$compat"
            elif [ "$abi" = "X32" ]; then
                # X32 is equivalent to 64 on an X32-compatible kernel.
                echo "#ifdef CONFIG_X86_X32_ABI"
                emit 64 "$nr" "$entry" "$compat"
                echo "#endif"
            elif [ "$abi" = "I386" ]; then
                emit "$abi" "$nr" "$entry" "$compat"
            else
                echo "Unknown abi $abi" >&2
                exit 1
            fi
        done
    ) > "$out"
- The output files are generated, which establishes the mapping between system call numbers and the functions that implement them.
From syscall_32.tbl the build generates unistd_32.h. According to the Makefile in this section, on x86 the file is written to $(uapi), that is arch/x86/include/generated/uapi/asm/unistd_32.h. The listing below is taken from arch/sh/include/uapi/asm/unistd_32.h, the hand-maintained SuperH header, and only illustrates the format; for entries 0 to 8 it agrees with the syscall_32.tbl excerpt in 4.1.1.

    /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
    #ifndef __ASM_SH_UNISTD_32_H
    #define __ASM_SH_UNISTD_32_H

    /*
     * Copyright (C) 1999 Niibe Yutaka
     */

    /*
     * This file contains the system call numbers.
     */

    #define __NR_restart_syscall 0
    #define __NR_exit 1
    #define __NR_fork 2
    #define __NR_read 3
    #define __NR_write 4
    #define __NR_open 5
    #define __NR_close 6
    #define __NR_waitpid 7
    #define __NR_creat 8
    #define __NR_link 9
    #define __NR_unlink 10
    #define __NR_execve 11
    #define __NR_chdir 12
    #define __NR_time 13
    #define __NR_mknod 14
    #define __NR_chmod 15
    #define __NR_lchown 16
    /* ...... */

From syscall_64.tbl the build generates unistd_64.h, which on x86 is written to arch/x86/include/generated/uapi/asm/unistd_64.h. The listing below is again the SuperH header, arch/sh/include/uapi/asm/unistd_64.h, so its numbers do not match x86-64: syscall_64.tbl assigns read 0, write 1, and open 2, and syscallhdr.sh therefore emits #define __NR_open 2 for x86-64.

    /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
    #ifndef __ASM_SH_UNISTD_64_H
    #define __ASM_SH_UNISTD_64_H

    /*
     * include/asm-sh/unistd_64.h
     *
     * This file contains the system call numbers.
     *
     * Copyright (C) 2000, 2001 Paolo Alberelli
     * Copyright (C) 2003 - 2007 Paul Mundt
     * Copyright (C) 2004 Sean McGoogan
     *
     * This file is subject to the terms and conditions of the GNU General Public
     * License. See the file "COPYING" in the main directory of this archive
     * for more details.
     */
    #define __NR_restart_syscall 0
    #define __NR_exit 1
    #define __NR_fork 2
    #define __NR_read 3
    #define __NR_write 4
    #define __NR_open 5
    #define __NR_close 6
    #define __NR_waitpid 7
    #define __NR_creat 8
    #define __NR_link 9
    #define __NR_unlink 10
    #define __NR_execve 11
    #define __NR_chdir 12
    #define __NR_time 13
    #define __NR_mknod 14
    #define __NR_chmod 15
    #define __NR_lchown 16
    /* ...... */
4.1.5 The final system call table
- Location: arch/x86/entry/syscall_32.c
- Source:

    // SPDX-License-Identifier: GPL-2.0
    /* System call table for i386. */
    /* ...... */
    __visible const sys_call_ptr_t ia32_sys_call_table[__NR_syscall_compat_max+1] = {
        /*
         * Smells like a compiler bug -- it doesn't work
         * when the & below is removed.
         */
        [0 ... __NR_syscall_compat_max] = &sys_ni_syscall,
    #include <asm/syscalls_32.h>
    };
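Including a generated list inside an array initializer is a use of the so-called X-macro pattern. The following user-space sketch shows the pattern under explicit assumptions: DEMO_SYSCALL_LIST stands in for the generated syscalls_32.h, every demo_* name is invented, and empty slots are left NULL instead of being pre-filled with a range initializer.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

typedef long (*demo_fn_t)(void);

static long demo_sys_zero(void)   { return 0; }
static long demo_sys_answer(void) { return 42; }

/* Stand-in for the generated header: one DEMO_SYSCALL line per entry. */
#define DEMO_SYSCALL_LIST            \
    DEMO_SYSCALL(0, demo_sys_zero)   \
    DEMO_SYSCALL(5, demo_sys_answer)

#define DEMO_NR_MAX 7
static_assert(DEMO_NR_MAX >= 5, "table must cover the highest number used");

/* Expanding each entry to "[nr] = fn," fills the table slot by slot. */
#define DEMO_SYSCALL(nr, fn) [nr] = fn,
static const demo_fn_t demo_table[DEMO_NR_MAX + 1] = { DEMO_SYSCALL_LIST };
#undef DEMO_SYSCALL

/* Look up and call an entry; unknown numbers report "not implemented". */
long demo_call(unsigned nr)
{
    if (nr > DEMO_NR_MAX || demo_table[nr] == NULL)
        return -ENOSYS;
    return demo_table[nr]();
}
```

The list is written once, and redefining DEMO_SYSCALL before each expansion lets the same list serve several purposes, for example prototypes first and table entries afterwards.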
4.2 Generating the 64-bit system call table
4.2.1 Definition of the system call table
- Location: arch/x86/entry/syscalls/syscall_64.tbl
- Source: compared with the 32-bit table, the system call numbers and the ABI names differ, and there is no compat entry point column.

    #
    # 64-bit system call numbers and entry vectors
    #
    # The format is:
    # <number> <abi> <name> <entry point>
    #
    # The __x64_sys_*() stubs are created on-the-fly for sys_*() system calls
    #
    # The abi is "common", "64" or "x32" for this file.
    #
    0    common   read       __x64_sys_read
    1    common   write      __x64_sys_write
    2    common   open       __x64_sys_open
    3    common   close      __x64_sys_close
    4    common   stat       __x64_sys_newstat
    5    common   fstat      __x64_sys_newfstat
    6    common   lstat      __x64_sys_newlstat
    7    common   poll       __x64_sys_poll
    8    common   lseek      __x64_sys_lseek
    9    common   mmap       __x64_sys_mmap
    10   common   mprotect   __x64_sys_mprotect
    11   common   munmap     __x64_sys_munmap
    ## ......
4.2.2 Declaration of the system call functions
- Location: include/linux/syscalls.h
- Source: the same for 32-bit and 64-bit.

    /* __ARCH_WANT_SYSCALL_NO_AT */
    asmlinkage long sys_open(const char __user *filename,
                             int flags, umode_t mode);
    asmlinkage long sys_link(const char __user *oldname,
                             const char __user *newname);
    asmlinkage long sys_unlink(const char __user *pathname);
4.2.3 Implementation of the system call functions
Taking open as the example:
- Location: fs/open.c
- Source: the same for 32-bit and 64-bit.

    /* ...... */
    SYSCALL_DEFINE3(open, const char __user *, filename, int, flags, umode_t, mode)
    {
        if (force_o_largefile())
            flags |= O_LARGEFILE;

        return do_sys_open(AT_FDCWD, filename, flags, mode);
    }

    SYSCALL_DEFINE4(openat, int, dfd, const char __user *, filename, int, flags,
                    umode_t, mode)
    {
        if (force_o_largefile())
            flags |= O_LARGEFILE;

        return do_sys_open(dfd, filename, flags, mode);
    }
    /* ...... */
- The definition looks unusual. The macros in syscalls.h show that the variant matching the number of arguments is selected, and that every variant funnels into SYSCALL_DEFINEx:

    #define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
    #define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)

    #define SYSCALL_DEFINE_MAXARGS 6

    #define SYSCALL_DEFINEx(x, sname, ...)                          \
        SYSCALL_METADATA(sname, x, __VA_ARGS__)                     \
        __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)

    #define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)

    /*
     * The asmlinkage stub is aliased to a function named __se_sys_*() which
     * sign-extends 32-bit ints to longs whenever needed. The actual work is
     * done within __do_sys_*().
     */
    #ifndef __SYSCALL_DEFINEx
    #define __SYSCALL_DEFINEx(x, name, ...)                                 \
        __diag_push();                                                      \
        __diag_ignore(GCC, 8, "-Wattribute-alias",                          \
                      "Type aliasing is used to sanitize syscall arguments");\
        asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))           \
            __attribute__((alias(__stringify(__se_sys##name))));            \
        ALLOW_ERROR_INJECTION(sys##name, ERRNO);                            \
        static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__));  \
        asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__));     \
        asmlinkage long __se_sys##name(__MAP(x,__SC_LONG,__VA_ARGS__))      \
        {                                                                   \
            long ret = __do_sys##name(__MAP(x,__SC_CAST,__VA_ARGS__));      \
            __MAP(x,__SC_TEST,__VA_ARGS__);                                 \
            __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));               \
            return ret;                                                     \
        }                                                                   \
        __diag_pop();                                                       \
        static inline long __do_sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))
    #endif /* __SYSCALL_DEFINEx */
- After macro expansion, SYSCALL_DEFINE3 yields, in simplified form (the __se_sys_open and __do_sys_open layers are folded together), the following implementation:

    asmlinkage long sys_open(const char __user *filename, int flags, umode_t mode)
    {
        long ret;

        if (force_o_largefile())
            flags |= O_LARGEFILE;

        ret = do_sys_open(AT_FDCWD, filename, flags, mode);
        asmlinkage_protect(3, ret, filename, flags, mode);
        return ret;
    }
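The layering can be imitated in user space. The sketch below is a deliberately reduced version, not the kernel's macro: `MY_SYSCALL_DEFINE3` is a hypothetical macro that hard-codes three arguments instead of using `__MAP`, and `my_do_open` is a stand-in for `do_sys_open` that performs no I/O. It keeps two ideas from the kernel macro: the entry point takes `long` arguments, and the body written by the developer becomes an inner `__do_sys_*` function with the typed parameters.

```c
#include <assert.h>

/* The long-based entry point must be able to carry a pointer argument. */
static_assert(sizeof(long) >= sizeof(void *), "long must hold a pointer");

/* Hypothetical three-argument analogue of SYSCALL_DEFINE3. */
#define MY_SYSCALL_DEFINE3(name, t1, a1, t2, a2, t3, a3)          \
    static inline long __do_sys_##name(t1 a1, t2 a2, t3 a3);     \
    long sys_##name(long a1, long a2, long a3)                    \
    {                                                             \
        /* cast each long back to its declared type */            \
        return __do_sys_##name((t1)a1, (t2)a2, (t3)a3);           \
    }                                                             \
    static inline long __do_sys_##name(t1 a1, t2 a2, t3 a3)

/* Stand-in for do_sys_open(): returns a fake descriptor, opens nothing. */
static long my_do_open(const char *filename, int flags, unsigned short mode)
{
    (void)flags;
    (void)mode;
    return (filename && filename[0]) ? 3 : -1;
}

/* Reads like a function definition, but the macro supplies the header. */
MY_SYSCALL_DEFINE3(demo_open, const char *, filename, int, flags,
                   unsigned short, mode)
{
    return my_do_open(filename, flags, mode);
}
```

After preprocessing there are two functions: `sys_demo_open(long, long, long)`, which a table could point at, and the typed `__do_sys_demo_open` holding the body. The kernel version additionally generates metadata, aliases and sign-extension checks that this sketch omits.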
4.2.4 Linking numbers to implementations
- This is done by arch/x86/entry/syscalls/Makefile:

    # SPDX-License-Identifier: GPL-2.0
    # Output directories
    out := arch/$(SRCARCH)/include/generated/asm
    uapi := arch/$(SRCARCH)/include/generated/uapi/asm

    # Create the output directories if they do not exist yet
    _dummy := $(shell [ -d '$(out)' ] || mkdir -p '$(out)') \
              $(shell [ -d '$(uapi)' ] || mkdir -p '$(uapi)')

    # Input tables: syscall_32.tbl for 32-bit, syscall_64.tbl for 64-bit
    syscall32 := $(srctree)/$(src)/syscall_32.tbl
    syscall64 := $(srctree)/$(src)/syscall_64.tbl

    # Locations of the helper scripts
    syshdr := $(srctree)/$(src)/syscallhdr.sh
    systbl := $(srctree)/$(src)/syscalltbl.sh

    quiet_cmd_syshdr = SYSHDR  $@
          cmd_syshdr = $(CONFIG_SHELL) '$(syshdr)' '$<' '$@' \
                       '$(syshdr_abi_$(basetarget))' \
                       '$(syshdr_pfx_$(basetarget))' \
                       '$(syshdr_offset_$(basetarget))'
    quiet_cmd_systbl = SYSTBL  $@
          cmd_systbl = $(CONFIG_SHELL) '$(systbl)' $< $@

    quiet_cmd_hypercalls = HYPERCALLS $@
          cmd_hypercalls = $(CONFIG_SHELL) '$<' $@ $(filter-out $<,$^)

    # Dependencies and the ABI selected for each target
    syshdr_abi_unistd_32 := i386
    $(uapi)/unistd_32.h: $(syscall32) $(syshdr)
            $(call if_changed,syshdr)

    syshdr_abi_unistd_32_ia32 := i386
    syshdr_pfx_unistd_32_ia32 := ia32_
    $(out)/unistd_32_ia32.h: $(syscall32) $(syshdr)
            $(call if_changed,syshdr)

    syshdr_abi_unistd_x32 := common,x32
    syshdr_offset_unistd_x32 := __X32_SYSCALL_BIT
    $(uapi)/unistd_x32.h: $(syscall64) $(syshdr)
            $(call if_changed,syshdr)

    syshdr_abi_unistd_64 := common,64
    $(uapi)/unistd_64.h: $(syscall64) $(syshdr)
            $(call if_changed,syshdr)

    syshdr_abi_unistd_64_x32 := x32
    syshdr_pfx_unistd_64_x32 := x32_
    # Output file names and locations
    $(out)/unistd_64_x32.h: $(syscall64) $(syshdr)
            $(call if_changed,syshdr)

    $(out)/syscalls_32.h: $(syscall32) $(systbl)
            $(call if_changed,systbl)
    $(out)/syscalls_64.h: $(syscall64) $(systbl)
            $(call if_changed,systbl)

    $(out)/xen-hypercalls.h: $(srctree)/scripts/xen-hypercalls.sh
            $(call if_changed,hypercalls)

    $(out)/xen-hypercalls.h: $(srctree)/include/xen/interface/xen*.h

    # Tie everything together and generate the output files
    uapisyshdr-y                += unistd_32.h unistd_64.h unistd_x32.h
    syshdr-y                    += syscalls_32.h
    syshdr-$(CONFIG_X86_64)     += unistd_32_ia32.h unistd_64_x32.h
    syshdr-$(CONFIG_X86_64)     += syscalls_64.h
    syshdr-$(CONFIG_XEN)        += xen-hypercalls.h

    targets += $(uapisyshdr-y) $(syshdr-y)

    PHONY += all
    all: $(addprefix $(uapi)/,$(uapisyshdr-y))
    all: $(addprefix $(out)/,$(syshdr-y))
            @:
- The Makefile relies on two scripts.
The first script, arch/x86/entry/syscalls/syscallhdr.sh, writes one #define __NR_<name> line per table row into the header, for example #define __NR_open:

    #!/bin/sh
    # SPDX-License-Identifier: GPL-2.0

    in="$1"
    out="$2"
    my_abis=`echo "($3)" | tr ',' '|'`
    prefix="$4"
    offset="$5"

    fileguard=_ASM_X86_`basename "$out" | sed \
        -e 'y/abcdefghijklmnopqrstuvwxyz/ABCDEFGHIJKLMNOPQRSTUVWXYZ/' \
        -e 's/[^A-Z0-9_]/_/g' -e 's/__/_/g'`
    grep -E "^[0-9A-Fa-fXx]+[[:space:]]+${my_abis}" "$in" | sort -n | (
        echo "#ifndef ${fileguard}"
        echo "#define ${fileguard} 1"
        echo ""

        # Emits e.g. #define __NR_open 2
        while read nr abi name entry ; do
            if [ -z "$offset" ]; then
                echo "#define __NR_${prefix}${name} $nr"
            else
                echo "#define __NR_${prefix}${name} ($offset + $nr)"
            fi
        done

        echo ""
        echo "#endif /* ${fileguard} */"
    ) > "$out"
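The core loop of syscallhdr.sh can be mirrored in C to see exactly which text it emits. `format_nr_define` is a hypothetical helper written for this note; it reproduces only the two `echo` branches (without and with an offset) and leaves out the include guard and the ABI filtering.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Mirror of the two echo branches in syscallhdr.sh.
 * prefix and offset may be "" (the script's empty $4 and $5).
 * Returns what snprintf returns: the length of the formatted line. */
int format_nr_define(char *buf, size_t len, const char *prefix,
                     const char *name, const char *nr, const char *offset)
{
    assert(buf && prefix && name && nr && offset);

    if (strlen(offset) == 0)      /* corresponds to [ -z "$offset" ] */
        return snprintf(buf, len, "#define __NR_%s%s %s",
                        prefix, name, nr);

    return snprintf(buf, len, "#define __NR_%s%s (%s + %s)",
                    prefix, name, offset, nr);
}
```

With an empty prefix and offset, the `open` row of syscall_64.tbl gives `#define __NR_open 2`. With the offset `__X32_SYSCALL_BIT`, which the Makefile passes for unistd_x32.h, the same row gives `#define __NR_open (__X32_SYSCALL_BIT + 2)`.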
The second script, arch/x86/entry/syscalls/syscalltbl.sh, writes one __SYSCALL_<abi>(number, entry, qualifier) line per table row, for example __SYSCALL_64(2, __x64_sys_open, ):

    #!/bin/sh
    # SPDX-License-Identifier: GPL-2.0

    in="$1"
    out="$2"

    syscall_macro() {
        abi="$1"
        nr="$2"
        entry="$3"

        # Entry can be either just a function name or "function/qualifier"
        real_entry="${entry%%/*}"
        if [ "$entry" = "$real_entry" ]; then
            qualifier=
        else
            qualifier=${entry#*/}
        fi

        # Emits e.g. __SYSCALL_64(2, __x64_sys_open, )
        echo "__SYSCALL_${abi}($nr, $real_entry, $qualifier)"
    }

    emit() {
        abi="$1"
        nr="$2"
        entry="$3"
        compat="$4"
        umlentry=""

        if [ "$abi" = "64" -a -n "$compat" ]; then
            echo "a compat entry for a 64-bit syscall makes no sense" >&2
            exit 1
        fi

        # For CONFIG_UML, we need to strip the __x64_sys prefix
        if [ "$abi" = "64" -a "${entry}" != "${entry#__x64_sys}" ]; then
            umlentry="sys${entry#__x64_sys}"
        fi

        if [ -z "$compat" ]; then
            if [ -n "$entry" -a -z "$umlentry" ]; then
                syscall_macro "$abi" "$nr" "$entry"
            elif [ -n "$umlentry" ]; then # implies -n "$entry"
                echo "#ifdef CONFIG_X86"
                syscall_macro "$abi" "$nr" "$entry"
                echo "#else /* CONFIG_UML */"
                syscall_macro "$abi" "$nr" "$umlentry"
                echo "#endif"
            fi
        else
            echo "#ifdef CONFIG_X86_32"
            if [ -n "$entry" ]; then
                syscall_macro "$abi" "$nr" "$entry"
            fi
            echo "#else"
            syscall_macro "$abi" "$nr" "$compat"
            echo "#endif"
        fi
    }

    grep '^[0-9]' "$in" | sort -n | (
        while read nr abi name entry compat; do
            abi=`echo "$abi" | tr '[a-z]' '[A-Z]'`
            if [ "$abi" = "COMMON" -o "$abi" = "64" ]; then
                # COMMON is the same as 64, except that we don't expect X32
                # programs to use it. Our expectation has nothing to do with
                # any generated code, so treat them the same.
                emit 64 "$nr" "$entry" "$compat"
            elif [ "$abi" = "X32" ]; then
                # X32 is equivalent to 64 on an X32-compatible kernel.
                echo "#ifdef CONFIG_X86_X32_ABI"
                emit 64 "$nr" "$entry" "$compat"
                echo "#endif"
            elif [ "$abi" = "I386" ]; then
                emit "$abi" "$nr" "$entry" "$compat"
            else
                echo "Unknown abi $abi" >&2
                exit 1
            fi
        done
    ) > "$out"
- The output files are then generated, which establishes the mapping between system call numbers and the functions that implement them.
unistd_32.h is generated from syscall_32.tbl. On x86 the Makefile writes it to arch/x86/include/generated/uapi/asm/unistd_32.h; the listing below is taken from the SuperH tree (arch/sh/include/uapi/asm/unistd_32.h), which has the same layout:

    /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
    #ifndef __ASM_SH_UNISTD_32_H
    #define __ASM_SH_UNISTD_32_H

    /*
     * Copyright (C) 1999 Niibe Yutaka
     */

    /*
     * This file contains the system call numbers.
     */
    #define __NR_restart_syscall   0
    #define __NR_exit              1
    #define __NR_fork              2
    #define __NR_read              3
    #define __NR_write             4
    #define __NR_open              5
    #define __NR_close             6
    #define __NR_waitpid           7
    #define __NR_creat             8
    #define __NR_link              9
    #define __NR_unlink           10
    #define __NR_execve           11
    #define __NR_chdir            12
    #define __NR_time             13
    #define __NR_mknod            14
    #define __NR_chmod            15
    #define __NR_lchown           16
    /* ...... */
unistd_64.h is generated from syscall_64.tbl. On x86 the Makefile writes it to arch/x86/include/generated/uapi/asm/unistd_64.h, and its numbers follow syscall_64.tbl, so __NR_open is 2 there. The listing below is again taken from the SuperH tree (arch/sh/include/uapi/asm/unistd_64.h) and only illustrates the layout:

    /* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
    #ifndef __ASM_SH_UNISTD_64_H
    #define __ASM_SH_UNISTD_64_H

    /*
     * include/asm-sh/unistd_64.h
     *
     * This file contains the system call numbers.
     *
     * Copyright (C) 2000, 2001 Paolo Alberelli
     * Copyright (C) 2003 - 2007 Paul Mundt
     * Copyright (C) 2004 Sean McGoogan
     *
     * This file is subject to the terms and conditions of the GNU General Public
     * License. See the file "COPYING" in the main directory of this archive
     * for more details.
     */
    #define __NR_restart_syscall   0
    #define __NR_exit              1
    #define __NR_fork              2
    #define __NR_read              3
    #define __NR_write             4
    #define __NR_open              5
    #define __NR_close             6
    #define __NR_waitpid           7
    #define __NR_creat             8
    #define __NR_link              9
    #define __NR_unlink           10
    #define __NR_execve           11
    #define __NR_chdir            12
    #define __NR_time             13
    #define __NR_mknod            14
    #define __NR_chmod            15
    #define __NR_lchown           16
    /* ...... */
4.2.5 The final system call table
- Location: arch/x86/entry/syscall_64.c
- Source:

    // SPDX-License-Identifier: GPL-2.0
    /* System call table for x86-64. */
    /* ...... */
    #define __SYSCALL_64(nr, sym, qual) [nr] = sym,

    asmlinkage const sys_call_ptr_t sys_call_table[__NR_syscall_max+1] = {
        /*
         * Smells like a compiler bug -- it doesn't work
         * when the & below is removed.
         */
        [0 ... __NR_syscall_max] = &sys_ni_syscall,
    #include <asm/syscalls_64.h>
    };
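The same table-building pattern works in an ordinary C program. In this sketch the generated header is replaced by a hypothetical list macro `DEMO_SYSCALLS`, and `demo_ni`, `demo_read` and `demo_write` are stand-ins that only return marker values. The range designator `[0 ... N]` is a GCC/Clang extension, as it is in the kernel source, and some warning levels flag the intentional overriding of the default entries.

```c
#include <assert.h>
#include <errno.h>

typedef long (*demo_call_ptr_t)(void);

#define DEMO_NR_MAX 7

static long demo_ni(void)    { return -ENOSYS; }  /* "not implemented" */
static long demo_read(void)  { return 100; }      /* marker value */
static long demo_write(void) { return 101; }      /* marker value */

/* Stand-in for the generated <asm/syscalls_64.h>: one entry per row. */
#define DEMO_SYSCALLS              \
    DEMO_SYSCALL(0, demo_read)     \
    DEMO_SYSCALL(1, demo_write)

/* Each entry becomes a designated initializer, like __SYSCALL_64. */
#define DEMO_SYSCALL(nr, sym) [nr] = sym,

static const demo_call_ptr_t demo_call_table[DEMO_NR_MAX + 1] = {
    [0 ... DEMO_NR_MAX] = &demo_ni,   /* default for every slot */
    DEMO_SYSCALLS                     /* later entries override it */
};

static_assert(sizeof(demo_call_table) / sizeof(demo_call_table[0])
              == DEMO_NR_MAX + 1, "table covers 0..DEMO_NR_MAX");

/* Look up and invoke an entry; out-of-range numbers act like demo_ni. */
long demo_dispatch(unsigned long nr)
{
    if (nr > DEMO_NR_MAX)
        return demo_ni();
    return demo_call_table[nr]();
}
```

Slots 0 and 1 end up pointing at the real handlers, while every other slot keeps the default, which is why an unlisted number yields `-ENOSYS` rather than a crash.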
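5. Summary
5.1 Wrapping system calls
- The user process calls the open function.
- glibc's syscalls.list lists the system call that each glibc function maps to.
- glibc's script make-syscalls.sh uses syscalls.list to generate the corresponding macro definitions (mapping functions to system calls).
- glibc's syscall-template.S uses these macros to define how the system call is invoked (again through macros).
- These macros call DO_CALL (also a macro), which is implemented differently for 32-bit and 64-bit.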
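From user space, the wrapper relationship can be observed with glibc's generic `syscall()` function, which takes the system call number directly. This is a minimal sketch assuming Linux with glibc; `raw_getpid` and `open_via_syscall` are hypothetical helper names. `SYS_openat` is used instead of `SYS_open` because not every architecture provides a plain `open` system call.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Same request as getpid(), but naming the system call number ourselves. */
pid_t raw_getpid(void)
{
    return (pid_t)syscall(SYS_getpid);
}

/* Same request as open(path, flags), issued through openat's number.
 * Like the glibc wrappers, syscall() returns -1 and sets errno on error. */
int open_via_syscall(const char *path, int flags)
{
    assert(path != NULL);
    return (int)syscall(SYS_openat, AT_FDCWD, path, flags, 0);
}
```

Both helpers should behave like their glibc counterparts, which suggests that the named wrappers add little beyond selecting the number and arranging the arguments.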
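5.2 The 32-bit system call path
- The 32-bit DO_CALL (in sysdep.h under the i386 directory):
- It places the call arguments in registers, derives the system call number from the system call name, and stores the number in eax.
- It executes ENTER_KERNEL (a macro), which corresponds to int $0x80, raising a software interrupt and entering the kernel.
- The software interrupt handler entry_INT80_32 is invoked (registered by trap_init() during kernel boot).
- entry_INT80_32 saves the user-mode registers into pt_regs (preserving both the context and the system call arguments) and calls do_syscall_32_irqs_on.
- do_syscall_32_irqs_on takes the system call number (eax) from pt_regs, looks up the implementing function in the system call table, fetches the arguments stored in pt_regs, and calls the function.
- entry_INT80_32 then invokes INTERRUPT_RETURN (a macro), which corresponds to the iret instruction. The result of the system call is stored in the eax slot of pt_regs, and the user-mode process is restored from pt_regs.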
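The dispatch step described for do_syscall_32_irqs_on can be modelled in plain C. This is a user-space toy, not kernel code: `toy_pt_regs`, `toy_table` and the two handlers are hypothetical. Only the register roles follow the list above and the usual i386 convention (number in eax, first arguments in ebx, ecx and edx, result written back to the eax slot).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy register frame: only the fields this sketch needs. */
struct toy_pt_regs {
    long ax;           /* in: system call number, out: return value */
    long bx, cx, dx;   /* first three arguments */
};

typedef long (*toy_call_t)(long, long, long);

static long toy_add(long a, long b, long c) { return a + b + c; }

static long toy_max(long a, long b, long c)
{
    long m = a > b ? a : b;
    return m > c ? m : c;
}

/* Toy system call table: the index is the system call number. */
static const toy_call_t toy_table[] = { toy_add, toy_max };
#define TOY_NR_CALLS (sizeof(toy_table) / sizeof(toy_table[0]))

/* Read the number, look up the handler, pass the saved arguments, and
 * store the result where the return path expects to find it. */
void toy_do_syscall(struct toy_pt_regs *regs)
{
    assert(regs != NULL);
    unsigned long nr = (unsigned long)regs->ax;

    if (nr < TOY_NR_CALLS)
        regs->ax = toy_table[nr](regs->bx, regs->cx, regs->dx);
    else
        regs->ax = -ENOSYS;   /* unknown system call number */
}
```

The caller never receives a C return value; it reads the result from the saved frame, which mirrors how the user process finds the result in eax after iret.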
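5.3 The 64-bit system call path
- The 64-bit DO_CALL (in sysdep.h under the x86_64 directory):
- It derives the system call number from the system call name and stores it in rax; instead of raising an interrupt, it executes the syscall instruction.
- MSRs (model-specific registers) assist with certain CPU features, including system calls.
- trap_init() calls cpu_init -> syscall_init to set up this register.
- The syscall instruction takes the function address from the MSR and jumps to it, that is, it calls entry_SYSCALL_64.
- entry_SYSCALL_64 first saves the user-mode registers into pt_regs.
- It then calls entry_SYSCALL64_slow_path -> do_syscall_64.
- do_syscall_64 takes the system call number from rax, looks up the implementing function in the system call table, fetches the arguments stored in pt_regs, and calls the function.
- On return, USERGS_SYSRET64 (a macro) runs, which corresponds to the swapgs and sysretq instructions. The result of the system call is stored in the ax slot of pt_regs, and the user-mode process is restored from pt_regs.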
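On an x86-64 Linux machine the syscall instruction can also be issued directly with GCC/Clang inline assembly. This is a minimal sketch: `raw_syscall0` is a hypothetical helper for system calls without arguments, and on other architectures it falls back to glibc's `syscall()` so that the file still builds.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Issue a system call that takes no arguments. */
long raw_syscall0(long nr)
{
    assert(nr >= 0);
#if defined(__x86_64__) && defined(__linux__)
    long ret;
    /* rax carries the number in and the result out. The instruction
     * itself overwrites rcx and r11, so both are listed as clobbers. */
    __asm__ volatile ("syscall"
                      : "=a"(ret)
                      : "a"(nr)
                      : "rcx", "r11", "memory");
    return ret;
#else
    return syscall(nr);   /* fallback: let glibc pick the instruction */
#endif
}
```

On the inline-assembly path a failing call returns a negative error number in rax rather than setting errno, since that translation is part of what the glibc wrappers add on top of the raw instruction.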
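5.4 Generating the system call table
- The system call table is sys_call_table (ia32_sys_call_table for 32-bit).
- The 32-bit table is defined in arch/x86/entry/syscalls/syscall_32.tbl.
- The 64-bit table is defined in arch/x86/entry/syscalls/syscall_64.tbl.
- Each row of syscall_*.tbl contains the system call number, the ABI, the system call name, and the name of the kernel function that implements it (sys_* or __x64_sys_*).
- The kernel functions are declared in include/linux/syscalls.h.
- The kernel functions are implemented in various .c files; sys_open, for example, is implemented in fs/open.c.
- In the .c file a macro takes the place of the function name, and several layers of macros build the function header.
- During the build, the unistd_*.h and syscalls_*.h files are generated from syscall_*.tbl.
- unistd_*.h holds the system call numbers (__NR_*), and syscalls_*.h holds the entries that map each number to its implementing function.
- syscall_*.c includes the generated syscalls_*.h and defines the system call table (an array).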
Source: https://blog.csdn.net/weixin_42813232/article/details/110062765