Introduction
This project serves as a memory dump of my head. It contains more or less structured notes about the technology I’m learning, and its primary purpose is to remind me of details that would otherwise fade away. Although it tries to be correct, it is by no means perfect. It is subject to continuous change as I get deeper and deeper into the topics.
If you find it interesting, spot some obvious mistakes, or want to contribute, feel free to drop me a line at dbognar@protonmail.com
Misconceptions about system programming
- It’s hard to learn
- It’s too low level
- It’s not flexible enough
- It’s dangerous and unsafe
- It takes long to write a program
- It’s targeting only one platform
Questions to answer:
- Why does the kernel use a user-space loader (ld.so) instead of loading the shared libraries itself (like it does with the ELF binary)?
- Why do we need argc if the end of argv is marked by a null pointer?
- Should I write this book as a set of coding challenges?
- Every section could start with a challenge description, e.g. print out the CLI arguments
- It could provide some background knowledge
- And an implementation of mine
Building standalone binary
In this chapter we’re going to create a standalone ELF binary which depends only on the Rust core library. To get there we go through the following steps:
- Create initial project
- Disable the Rust standard library
- Disable standard startup logic
- Implement startup logic
- Implement teardown logic
- Implement the standard library
Initialize a project
Since we only support the Linux platform, let’s call our new library linux, as it will be an interface to the Linux kernel.
To get a deeper understanding of how the Rust ecosystem works, we won’t use Cargo at this point; instead we write out all the commands that
Cargo would run to build the libraries and binaries. Let’s create a simple Rust binary with:
> echo 'fn main() {}' > bin.rs
> rustc bin.rs
> ./bin
> echo $?
0
The only thing our program currently does is return a number as its exit code, but that is more than enough for a start.
Disable the Rust standard library
#![no_std]
To disable the Rust standard library we have to add the #![no_std] attribute at the top of the source file:
#![no_std]
fn main() {}
If we try to rebuild the code we get the following errors:
> rustc bin.rs
error: `#[panic_handler]` function required, but not found
error: unwinding panics are not supported without std
Panic handler
It seems that the standard library provides a panic handler, and one is required to compile the code. So let’s implement it by adding the following lines at the end of the bin.rs file:
#[panic_handler]
fn panic_handler(_: &core::panic::PanicInfo) -> ! {
    loop {}
}
Unwinding
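But what should we do with the second error message: unwinding panics are not supported without std?
Unwinding means that a panic walks back up the call stack and runs the destructors of every live value on the way, which needs runtime support that std normally provides. We can disable unwinding
by aborting the execution in case of a panic. As a result we get a different error message: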
> rustc -C panic=abort bin.rs
error: using `fn main` requires the standard library
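To make the term concrete, here is a small sketch of what unwinding does. It assumes a regular build with the standard library and the default panic=unwind strategy, so it is not part of our no_std binary. The names Guard, panics_with_guard and destructor_ran_during_unwind are made up for this illustration.

```rust
use std::panic;
use std::sync::atomic::{AtomicBool, Ordering};

// Records whether the destructor of `Guard` has run.
static DROPPED: AtomicBool = AtomicBool::new(false);

// Hypothetical helper type: its only job is to notice when it is dropped.
struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        DROPPED.store(true, Ordering::SeqCst);
    }
}

// Creates a guard and then panics while the guard is still alive.
fn panics_with_guard() {
    let _guard = Guard;
    panic!("boom");
}

// Returns true if the panic was caught and the guard's destructor ran
// while the stack was being unwound.
fn destructor_ran_during_unwind() -> bool {
    DROPPED.store(false, Ordering::SeqCst);
    let result = panic::catch_unwind(panics_with_guard);
    result.is_err() && DROPPED.load(Ordering::SeqCst)
}
```

With panic=abort the process would be killed at the panic! call instead, and neither the destructor nor the code after catch_unwind would run.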
#![no_main]
So the main function depends on std too, but how can we start a program if there is no main function?
Luckily rustc gives us nice hints on how to solve this problem. We have to disable the compiler-generated
main function and implement a Linux-specific version of it. One can do this by adding the
#![no_main] attribute:
#![no_std]
#![no_main]
fn main() {}
#[panic_handler]
fn panic_handler(_: &core::panic::PanicInfo) -> ! {
loop {}
}
> rustc -C panic=abort bin.rs
error: linking with `cc` failed: exit status: 1
|
= note: /usr/bin/ld: /usr/lib/gcc/x86_64-linux-gnu/11/../../../x86_64-linux-gnu/Scrt1.o: in function `_start':
(.text+0x21): undefined reference to `__libc_start_main'
collect2: error: ld returned 1 exit status
As you probably expected, it doesn’t link. (From now on I’m going to trim the long error messages
to show only the information relevant to us.) More interestingly, it doesn’t complain about the missing main function
but about the missing __libc_start_main function, which is a bit weird because we’re compiling Rust and not C code.
Disable standard startup logic
To investigate the problem let’s go back to the std world and create a new binary which we can debug in gdb.
> echo 'fn main() {}' > std.rs
> rustc std.rs
> gdb ./std
(gdb) set backtrace past-main on
(gdb) set backtrace past-entry on
(gdb) break main
(gdb) run
(gdb) backtrace
#0 0x000055555555c320 in main ()
#1 0x00007ffff7d8fd90 in __libc_start_call_main (main=main@entry=0x55555555c320 <main>, argc=argc@entry=1, argv=argv@entry=0x7fffffffe948) at ../sysdeps/nptl/libc_start_call_main.h:58
#2 0x00007ffff7d8fe40 in __libc_start_main_impl (main=0x55555555c320 <main>, argc=1, argv=0x7fffffffe948, init=<optimized out>, fini=<optimized out>, rtld_fini=<optimized out>, stack_end=0x7fffffffe938) at ../csu/libc-start.c:392
#3 0x000055555555c155 in _start ()
The standard Rust binary seems to use some libc symbols to start the main function.
There is the _start function, which calls __libc_start_main_impl, which calls __libc_start_call_main,
which finally calls the main function. But do we really need these symbols? Do we need a main function at all? Or can
we simply use the _start function as an entry point? Let’s rewrite the code like this:
#![no_std]
#![no_main]

fn _start() {}

#[panic_handler]
fn panic_handler(_: &core::panic::PanicInfo) -> ! {
    loop {}
}
and try to compile the binary without the standard startup files provided by gcc:
> rustc -C panic=abort bin.rs -C link-args='-nostartfiles -static'
> ./bin
Segmentation fault (core dumped)
Implement startup logic
It looks like we got a step further. We can compile our code now, we just can’t run it. To find the reason for a segfault it’s typically a good idea to run the binary in gdb.
> gdb ./bin
(gdb) set backtrace past-main on
(gdb) set backtrace past-entry on
(gdb) run
Starting program: /home/taabodal/work/blog/blog/src/chapter-01/bin
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000000 in ?? ()
(gdb) backtrace
#0 0x0000000000000000 in ?? ()
#1 0x0000000000000001 in ?? ()
#2 0x00007fffffffebda in ?? ()
#3 0x0000000000000000 in ?? ()
That’s not much information: a bunch of zeros in the backtrace and some question marks… But where is the _start function
which we have defined? Let’s try another tool to print the symbols of an executable:
> nm ./bin
0000000000401000 R __bss_start
0000000000401000 R _edata
0000000000401000 R _end
U _start
Okay, so it has at least some data which we can read. The nm command shows the address (column 1), the type (column 2)
and the name of the symbol (column 3). The R type means that the symbol is in the read-only data section of the binary
and the U type means that the symbol is undefined. So the conclusion is that the _start function which we just added to
the source is undefined, which also explains why nm doesn’t show any memory address for this function.
Rust has a different philosophy about public and private functions compared with a language like C.
In C every function is visible to other compilation units until you mark it specifically as private, which
is done with the static keyword. As opposed to this, in Rust everything is private until you make it specifically
public. So how can we make our _start function public? Let’s decorate it with the
#[no_mangle] attribute. This attribute has two effects
on the decorated function:
- Disables name mangling (more about that later)
- Exports the symbol from the compilation unit, so the linker can find it
#![no_std]
#![no_main]

#[no_mangle]
fn _start() {}

#[panic_handler]
fn panic_handler(_: &core::panic::PanicInfo) -> ! {
    loop {}
}
After the function has been exported, the output of nm already looks much better (T means that the symbol is in the .text section of the binary):
> nm bin
0000000000402000 T __bss_start
0000000000402000 T _edata
0000000000402000 T _end
0000000000401000 T _start
Implement teardown logic
We have proved that our _start function made it into the binary, so why does the segfault happen? Our function is empty,
so it definitely doesn’t do any invalid memory access, or does it? Is our function really empty? Let’s check out the
code generated by the compiler:
> objdump --disassemble=_start -M intel ./bin
0000000000401000 <_start>:
401000: c3 ret
Even though the _start function is completely empty in the Rust source, the compiler still generates a
return instruction for us. The first lines of the documentation of the ret instruction already tell us
what we have missed:
Transfers program control to a return address located on the top of the stack. The address is usually placed on the stack by a CALL instruction, and the return is made to the instruction that follows the CALL instruction
If the _start function is the first code that gets executed, then there is no return address on the stack that could be
used to jump to after finishing the _start function. But what should we do if we cannot return from a function?
The answer is: tell the kernel that we’re done and the process should be destroyed without executing further instructions.
We can do that by applying some assembly code in place of the ret instruction. Let’s rewrite the _start function like this:
#[no_mangle]
fn _start() -> ! {
    unsafe {
        core::arch::asm!(
            "mov rax,0x3c",
            "mov rdi,0x0",
            "syscall",
            options(nostack, noreturn),
        )
    }
}
The compiler will generate the following assembly code for us:
> rustc -C panic=abort bin.rs -C link-args='-nostartfiles -static'
> objdump --disassemble=_start -M intel ./bin
0000000000401000 <_start>:
401001: 48 c7 c0 3c 00 00 00 mov rax,0x3c
401008: 48 c7 c7 00 00 00 00 mov rdi,0x0
40100f: 0f 05 syscall
401011: 0f 0b ud2
It has replaced the return instruction with the small piece of code we provided, plus something else. So what do these lines do?
The mov rax,0x3c moves the integer value 60 into the rax register of the CPU. This value is used by the kernel to identify the
request as exit. The second instruction moves the integer value 0 into the rdi register. This will be the return code
of our program. The syscall instruction transfers the execution of the process to the kernel, but since the process will be destroyed,
the last instruction ud2 will never be executed by the CPU. And that’s exactly the point, because ud2 is an instruction whose only purpose
on x86_64 is to raise an invalid-opcode exception. This way the compiler makes sure that if the syscall ever returned, the process would fail immediately
with an Illegal Instruction error. This is the result of options(noreturn).
I encourage you to verify this yourself by putting a ud2 instruction before the syscall
instruction and letting the process crash. It looks like this:
> ./bin
Illegal instruction (core dumped)
But if you remove the ud2 instruction again, the execution of the binary gives you back 0 as return code:
> ./bin
> echo $?
0
And if you modify the value of the rdi register by replacing the 0x0 with 13 for example it gives back 13 as return code:
> ./bin
> echo $?
13
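If you would rather observe return codes from Rust than from the shell, a rough sketch with the standard library could look like the following. It assumes that sh is available on the machine, and the helper name exit_code_of is made up for this example. Passing "./bin" as the script would inspect our own binary instead.

```rust
use std::process::Command;

// Hypothetical helper: runs `script` through `sh -c` and returns the exit
// code the kernel reported for it, or None if the process could not be
// started or was terminated by a signal.
fn exit_code_of(script: &str) -> Option<i32> {
    Command::new("sh")
        .arg("-c")
        .arg(script)
        .status()
        .ok()?
        .code()
}
```

This is the same value the shell exposes as $? after running a command.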
Feel free to remove the options(nostack) attribute too and compare the generated assembly code with the original version.
Try to figure out why the code is generated like that. (We’ll get back to that later on.)
Implement standard library
Until now we’ve implemented everything in a single binary, but what we’re aiming for with this project is a Linux-specific
standard library. So let’s move most of the code into a file called linux.rs and add the call to the main function
to the _start function. The library file now looks like this:
#![no_std]
#![no_main]

#[no_mangle]
fn _start() -> ! {
    unsafe {
        core::arch::asm!(
            "call main",
            "mov rdi,rax",
            "mov rax,0x3c",
            "syscall",
            options(nostack, noreturn),
        )
    }
}

#[panic_handler]
fn panic_handler(_: &core::panic::PanicInfo) -> ! {
    loop {}
}
We call main first, and as the System V ABI describes, the return value of the function is placed into
the rax register. We can simply move this value from rax to rdi so the kernel can use it as the return
code of the process. After that we write something like this into the source of the executable:
#![no_std]
#![no_main]
extern crate linux;
#[no_mangle]
fn main() -> u8 { 0 }
Since it’s getting difficult to write out all the rustc commands let’s create a build script to build
our library and our binary. The cargo.sh looks like this:
#!/bin/bash
clean() {
rm -rf target
}
build() {
mkdir -p target
rustc -C panic=abort --crate-type=lib linux.rs -o target/liblinux.rlib
rustc -C panic=abort -C link-args='-nostartfiles -static' -L target ./bin.rs -o target/bin
}
run() {
build
./target/bin
}
case "$1" in
clean) clean;;
build) build;;
run) run;;
*) echo "Invalid argument '$1'";;
esac
After adding execute permissions to the mini cargo script it can be used like this:
> chmod +x ./cargo.sh
> ./cargo.sh run
> echo $?
0
Implementing standard streams
In this chapter we’re going to continue building the Linux standard library by going through the following steps:
- Syscalls in general
- Implement read and write syscalls
- Make syscalls safe
- Make syscalls idiomatic
- Abstract standard streams
- Implement string formatting
Syscalls in general
In chapter one we already implemented a system call named exit, but we didn’t talk much about how it works. Since system calls
are the foundation of the communication between user space and kernel space, we will implement a couple of them throughout the
following chapters. As a result, it’s important to get a basic understanding of how they work.
System calls work much like function calls in the sense that a couple of registers are filled with data and the execution of the current code is interrupted to run another piece of code. That other code uses the values of the registers, does some operation with them, and when it finishes, execution returns to the original point so the caller can continue with the result of the call. An important difference, though, is that a system call switches the CPU from user mode to kernel mode (this is often loosely called a context switch). Instead of simply jumping to another code segment of the same executable, the process is interrupted, the CPU switches to kernel mode, and kernel code continues to execute. The same happens in reverse at the end of the system call: the CPU switches back to user mode and continues to execute the user-space code. On x86-64 there are two instructions for these switches, called syscall and sysret. The first is used by user-space code to enter the kernel and the second is used by the kernel to switch back to user mode.
There are many system calls defined by the Linux kernel. The ids of these system calls can be found in the kernel source tree. The table for the 64-bit version of the x86 architecture, for example, lives in arch/x86/entry/syscalls/syscall_64.tbl.
If you have already done some lower-level programming (for example in C/C++), you most likely already know some of these calls. The standard C library wraps these system calls into simple functions, so you can call them in your code without even realizing that a switch into the kernel is needed. Some famous examples are the following:
- read
- write
- open
- close
- socket
- connect
- accept
- exit
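To see that such a wrapper really is just a thin layer over the syscall instruction, here is a minimal sketch that asks the kernel for the process id directly and can be compared with what the standard library reports. It assumes x86_64 Linux, where getpid has the syscall number 39, and falls back to the standard library on every other target. The function name raw_getpid is made up for this example, and unlike the rest of this chapter it is meant to be built as a normal std program.

```rust
// Assumption: x86_64 Linux, where syscall number 39 is getpid.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
fn raw_getpid() -> u32 {
    use std::arch::asm;

    let ret: isize;
    unsafe {
        asm!(
            "syscall",
            // rax carries the syscall number in and the result out
            inlateout("rax") 39isize => ret,
            // the syscall instruction overwrites rcx and r11
            lateout("rcx") _,
            lateout("r11") _,
            options(nostack),
        );
    }
    ret as u32
}

// Fallback for other targets so the sketch still builds there.
#[cfg(not(all(target_os = "linux", target_arch = "x86_64")))]
fn raw_getpid() -> u32 {
    std::process::id()
}
```

Calling raw_getpid() and std::process::id() in the same process should give the same number.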
Since we don’t use the standard C library, we need to implement these wrappers in Rust to be able to use them in our binaries.
To pass arguments to the kernel we need to put them into specific registers. The question is: which registers should we use?
The reference that describes how binary code has to be laid out and interpreted is called the Application Binary Interface (ABI).
Linux uses the System V ABI specification. There is a lot of interesting
material in this PDF, but the most important part for us right now is the calling conventions. It turns out that the function
calling convention of the C language and the syscall interface are not the same. While function arguments are passed in the
rdi, rsi, rdx, rcx, r8 and r9 registers, the syscall interface uses the rdi, rsi, rdx, r10, r8 and r9
registers. (The syscall instruction itself overwrites rcx and r11, which is why rcx cannot carry an argument.) Apart from that, it’s important that the rax register is used to pass the syscall id and to retrieve the result
of the syscall. To conform to these requirements we can implement a macro
that provides a simple way of making a syscall. Let’s create a file called syscall.rs and add a pub mod syscall; to the linux.rs file.
macro_rules! syscall {
    ($rax:expr) => {{
        let rax: isize;
        core::arch::asm!(
            "syscall",
            inlateout("rax") $rax => rax,
            // The syscall instruction itself overwrites rcx and r11,
            // so every arm has to tell the compiler about it.
            lateout("rcx") _,
            lateout("r11") _,
        );
        rax
    }};
    ($rax:expr, $rdi:expr) => {{
        let rax: isize;
        core::arch::asm!(
            "syscall",
            inlateout("rax") $rax => rax,
            in("rdi") $rdi,
            lateout("rcx") _,
            lateout("r11") _,
        );
        rax
    }};
    ($rax:expr, $rdi:expr, $rsi:expr) => {{
        let rax: isize;
        core::arch::asm!(
            "syscall",
            inlateout("rax") $rax => rax,
            in("rdi") $rdi,
            in("rsi") $rsi,
            lateout("rcx") _,
            lateout("r11") _,
        );
        rax
    }};
    ($rax:expr, $rdi:expr, $rsi:expr, $rdx:expr) => {{
        let rax: isize;
        core::arch::asm!(
            "syscall",
            inlateout("rax") $rax => rax,
            in("rdi") $rdi,
            in("rsi") $rsi,
            in("rdx") $rdx,
            lateout("rcx") _,
            lateout("r11") _,
        );
        rax
    }};
    ($rax:expr, $rdi:expr, $rsi:expr, $rdx:expr, $r10:expr) => {{
        let rax: isize;
        core::arch::asm!(
            "syscall",
            inlateout("rax") $rax => rax,
            in("rdi") $rdi,
            in("rsi") $rsi,
            in("rdx") $rdx,
            in("r10") $r10,
            lateout("rcx") _,
            lateout("r11") _,
        );
        rax
    }};
    ($rax:expr, $rdi:expr, $rsi:expr, $rdx:expr, $r10:expr, $r8:expr) => {{
        let rax: isize;
        core::arch::asm!(
            "syscall",
            inlateout("rax") $rax => rax,
            in("rdi") $rdi,
            in("rsi") $rsi,
            in("rdx") $rdx,
            in("r10") $r10,
            in("r8") $r8,
            lateout("rcx") _,
            lateout("r11") _,
        );
        rax
    }};
    ($rax:expr, $rdi:expr, $rsi:expr, $rdx:expr, $r10:expr, $r8:expr, $r9:expr) => {{
        let rax: isize;
        core::arch::asm!(
            "syscall",
            inlateout("rax") $rax => rax,
            in("rdi") $rdi,
            in("rsi") $rsi,
            in("rdx") $rdx,
            in("r10") $r10,
            in("r8") $r8,
            in("r9") $r9,
            lateout("rcx") _,
            lateout("r11") _,
        );
        rax
    }};
}
This macro can be called with a variable number of arguments (1 to 7), which will be passed into the specified registers.
After the registers have been filled with the data, the syscall instruction is executed to hand over the execution
to the kernel. Note that the asm macro of the Rust core library requires the operands to be placed after the
assembly code itself, even though they are set before the execution.
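To try the register assignment outside of our no_std setup, the three-argument case can be sketched as a plain function in a regular std program. It assumes x86_64 Linux, where write has the syscall number 1 and standard output is file descriptor 1. On other targets it falls back to the standard library so the sketch still builds. The name write_stdout is made up for this example.

```rust
// Assumption: x86_64 Linux, syscall number 1 is write, fd 1 is stdout.
#[cfg(all(target_os = "linux", target_arch = "x86_64"))]
fn write_stdout(buf: &[u8]) -> isize {
    use std::arch::asm;

    let ret: isize;
    unsafe {
        asm!(
            "syscall",
            inlateout("rax") 1isize => ret, // syscall id in, result out
            in("rdi") 1usize,               // 1st argument: file descriptor
            in("rsi") buf.as_ptr(),         // 2nd argument: start of the buffer
            in("rdx") buf.len(),            // 3rd argument: number of bytes
            lateout("rcx") _,               // overwritten by the syscall instruction
            lateout("r11") _,               // overwritten by the syscall instruction
            options(nostack),
        );
    }
    ret
}

// Fallback for other targets: let the standard library do the write.
#[cfg(not(all(target_os = "linux", target_arch = "x86_64")))]
fn write_stdout(buf: &[u8]) -> isize {
    use std::io::Write;
    std::io::stdout()
        .write(buf)
        .map(|n| n as isize)
        .unwrap_or(-1)
}
```

The return value is the number of bytes the kernel accepted, or a negative error code on the raw path.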
Read, write, exit
The simplest way to look up how the standard C library has implemented a system call wrapper is to check out its manual page. For example: man read.2, man write.2, man exit.2
The function signatures written in C look like this:
ssize_t read(int fd, void *buf, size_t count);
ssize_t write(int fd, const void *buf, size_t count);
void exit(int rc);
Let’s update our syscall.rs file with the following functions:
const SYS_READ: isize = 0;
const SYS_WRITE: isize = 1;
const SYS_EXIT: isize = 60;

pub fn read(fd: i32, buf: *mut u8, count: usize) -> isize {
    unsafe { syscall!(SYS_READ, fd, buf, count) }
}

pub fn write(fd: i32, buf: *const u8, count: usize) -> isize {
    unsafe { syscall!(SYS_WRITE, fd, buf, count) }
}

pub fn exit(rc: u8) -> ! {
    unsafe { syscall!(SYS_EXIT, rc as u32); }
    unreachable!();
}
This allows us to read some user input and write it to the stdout as follows:
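#[no_mangle]
fn main() -> u8 {
    let mut buf = [0u8; 1024];
    let ptr = buf.as_mut_ptr();
    // read returns the number of bytes received (or a negative error code)
    let count = linux::syscall::read(0, ptr, buf.len());
    if count > 0 {
        // echo back only the bytes that were actually read
        linux::syscall::write(1, ptr, count as usize);
    }
    0
}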
But we still have a problem: our program doesn’t compile anymore. We have just introduced an undefined reference:
> ./cargo.sh build
error: linking with `cc` failed: exit status: 1
/blog/chapter-02/src/bin.rs:7: undefined reference to `memset'
It turns out that to be able to use the Rust syntax let buf = [0u8;1024] the compiled code needs the memset symbol. This
makes sense, since this expression fills a memory region with 1024 zeros.
There are a couple of symbols the core library needs in order to work. These are typically provided by the standard C library, but
since we have disabled every library apart from core, we have to implement them manually.
The documentation says the expected symbols are:
- memcpy
- memmove
- memset
- memcmp
- bcmp
- strlen
There are some other expected symbols, like rust_begin_panic and rust_eh_personality, but we will only implement these
step by step, so we can explore which functionality of the core library needs them. Let’s implement memset for now
in the ffi module. We need to add a pub mod ffi; to the linux.rs file and create ffi.rs with the following content:
// The C signature is: void *memset(void *s, int c, size_t n);
#[no_mangle]
pub unsafe extern "C" fn memset(buffer: *mut u8, byte: i32, len: usize) -> *mut u8 {
    for idx in 0..len {
        // A volatile write keeps the optimizer from recognizing this loop
        // as a memset and replacing it with a call to memset itself.
        buffer.add(idx).write_volatile(byte as u8);
    }
    buffer
}
And recompile the code
> echo "hello world" | ./cargo.sh run
hello world
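The other symbols from the list above can be approached in the same byte-by-byte style. The following is only a sketch of two of them, written so that it can be tried in a normal std program: the names sketch_memcpy and sketch_memcmp are made up to avoid clashing with the real libc symbols, and copy_and_compare is a hypothetical helper to exercise them. In the ffi module they would instead be exported as #[no_mangle] extern "C" functions named memcpy and memcmp.

```rust
// Copies `len` bytes from `src` to `dest`. The regions must not overlap
// and both pointers must be valid for `len` bytes.
unsafe fn sketch_memcpy(dest: *mut u8, src: *const u8, len: usize) -> *mut u8 {
    for idx in 0..len {
        dest.add(idx).write(src.add(idx).read());
    }
    dest
}

// Compares `len` bytes. Returns 0 if they are equal, otherwise the
// difference of the first pair of bytes that differ.
unsafe fn sketch_memcmp(a: *const u8, b: *const u8, len: usize) -> i32 {
    for idx in 0..len {
        let (x, y) = (a.add(idx).read(), b.add(idx).read());
        if x != y {
            return x as i32 - y as i32;
        }
    }
    0
}

// Hypothetical helper: copies `src` into a fresh buffer and compares
// the copy with the original.
fn copy_and_compare(src: &[u8]) -> (Vec<u8>, i32) {
    let mut dst = vec![0u8; src.len()];
    let order = unsafe {
        sketch_memcpy(dst.as_mut_ptr(), src.as_ptr(), src.len());
        sketch_memcmp(dst.as_ptr(), src.as_ptr(), src.len())
    };
    (dst, order)
}
```

Note that memmove differs from memcpy in that it has to handle overlapping regions, so it cannot simply reuse the forward loop shown here.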
Safe syscalls
When we write unsafe code, we sign a contract with the compiler that our code is never going to be unsound. In the Rust world
a code block is said to be sound if it can never cause undefined behaviour. Luckily it’s quite well defined
what “undefined” means. There is a list of actions which
cause undefined behaviour, and if we can be sure that we never hit any item on that list, our code is said to be sound.
Even though this list is quite straightforward, it’s easy to miss some small detail, just like we did in the previous paragraphs.
Our code looks good, right? It has basically the same signatures as the C functions and it passes all the arguments to the kernel.
It doesn’t dereference raw pointers, it doesn’t do array indexing, it doesn’t free memory, so what could
go wrong then? Well, let’s rewrite the main function and see what happens.
#[no_mangle]
fn main() -> u8 {
linux::syscall::write(1, b"X" as *const u8, 1024);
0
}
If we run this code we experience undefined behaviour: we pass the kernel a one-byte array and a length parameter of 1024. As a result it tries to write 1024 bytes starting at the position of our byte array, and it is absolutely not defined what will happen in such a scenario. In our case, since the byte array was in the read-only data section of the binary, it picks up the following bytes from there.
./target/bin
xinternal error: entered unreachable codesyscall.rsHhzRx
A C
UAC
The conclusion is that Rust is only safe if every part of the code is known to be sound. Our code is not sound, because
safe Rust code can pass parameters to it which cause undefined behaviour. Let’s fix that by utilizing a primitive
type of the Rust core library called slice. Since a slice
bundles the buffer and its length, a user of our code cannot pass a length parameter which is bigger than the size of the slice.
To be more precise, a user can pass our function a slice which has an invalid length, but to create such a slice
one needs to use another unsafe block, and the author of that unsafe block has signed the same contract with the compiler: that
it can never produce undefined behaviour. So you see the point. If all the unsafe blocks are sound, then the whole language is
safe. But if any of these blocks is unsound, the whole ecosystem is corrupted. So let’s be cautious with unsafe blocks.
Here is a fix for our syscalls:
pub fn read(fd: u32, buf: &mut [u8]) -> isize {
    // the kernel writes into the buffer, so we need a mutable pointer
    unsafe { syscall!(SYS_READ, fd, buf.as_mut_ptr(), buf.len()) }
}

pub fn write(fd: u32, buf: &[u8]) -> isize {
    unsafe { syscall!(SYS_WRITE, fd, buf.as_ptr(), buf.len()) }
}
The main function works like this:
#[no_mangle]
fn main() -> u8 {
linux::syscall::write(1, b"x");
0
}
Since there is no way to misuse this syscall anymore, running it writes exactly one character to the screen:
> ./cargo.sh run
x
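The same pattern, a safe slice-based function wrapped around a pointer-and-length primitive, can be tried in isolation without any syscalls. The sketch below uses a made-up checksum example (raw_checksum and checksum are hypothetical names) purely to show where the contract is stated and where it is upheld.

```rust
// Raw primitive in the style of our first write wrapper. Contract: the
// caller must guarantee that `ptr` points to at least `len` readable bytes.
unsafe fn raw_checksum(ptr: *const u8, len: usize) -> u32 {
    let mut sum = 0u32;
    for idx in 0..len {
        sum = sum.wrapping_add(ptr.add(idx).read() as u32);
    }
    sum
}

// Safe wrapper: the slice carries pointer and length together, so safe
// callers have no way to pass a length that exceeds the buffer.
fn checksum(buf: &[u8]) -> u32 {
    unsafe { raw_checksum(buf.as_ptr(), buf.len()) }
}
```

Only raw_checksum can be misused, and only from inside another unsafe block. Everything that goes through checksum is sound by construction.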
Idiomatic syscall
Although our code is now safe, it is still not really idiomatic. In C programming it’s normal to
return a number which represents the result of the function. For example, all of our syscalls return a negative
integer in case of an error. But in Rust we have a nicer way to handle errors, which is based on the
Result enum. Let’s create an Error enum and a Result type alias
to represent the result of our syscalls. The list of the error codes that a syscall may return can be found
in errno-base.h and
errno.h. After combining the content of these
files we can build a huge enum which represents these error codes:
#![allow(unused)]
fn main() {
use core::fmt;
pub type Result<T> = core::result::Result<T, Error>;
#[derive(Debug)]
pub enum Error {
EPERM = 1,
ENOENT = 2,
ESRCH = 3,
EINTR = 4,
EIO = 5,
ENXIO = 6,
E2BIG = 7,
ENOEXEC = 8,
EBADF = 9,
ECHILD = 10,
EAGAIN = 11,
ENOMEM = 12,
EACCES = 13,
EFAULT = 14,
ENOTBLK = 15,
EBUSY = 16,
EEXIST = 17,
EXDEV = 18,
ENODEV = 19,
ENOTDIR = 20,
EISDIR = 21,
EINVAL = 22,
ENFILE = 23,
EMFILE = 24,
ENOTTY = 25,
ETXTBSY = 26,
EFBIG = 27,
ENOSPC = 28,
ESPIPE = 29,
EROFS = 30,
EMLINK = 31,
EPIPE = 32,
EDOM = 33,
ERANGE = 34,
EDEADLK = 35,
ENAMETOOLONG = 36,
ENOLCK = 37,
ENOSYS = 38,
ENOTEMPTY = 39,
ELOOP = 40,
EWOULDBLOCK = 41,
ENOMSG = 42,
EIDRM = 43,
ECHRNG = 44,
EL2NSYNC = 45,
EL3HLT = 46,
EL3RST = 47,
ELNRNG = 48,
EUNATCH = 49,
ENOCSI = 50,
EL2HLT = 51,
EBADE = 52,
EBADR = 53,
EXFULL = 54,
ENOANO = 55,
EBADRQC = 56,
EBADSLT = 57,
EDEADLOCK = 58,
EBFONT = 59,
ENOSTR = 60,
ENODATA = 61,
ETIME = 62,
ENOSR = 63,
ENONET = 64,
ENOPKG = 65,
EREMOTE = 66,
ENOLINK = 67,
EADV = 68,
ESRMNT = 69,
ECOMM = 70,
EPROTO = 71,
EMULTIHOP = 72,
EDOTDOT = 73,
EBADMSG = 74,
EOVERFLOW = 75,
ENOTUNIQ = 76,
EBADFD = 77,
EREMCHG = 78,
ELIBACC = 79,
ELIBBAD = 80,
ELIBSCN = 81,
ELIBMAX = 82,
ELIBEXEC = 83,
EILSEQ = 84,
ERESTART = 85,
ESTRPIPE = 86,
EUSERS = 87,
ENOTSOCK = 88,
EDESTADDRREQ = 89,
EMSGSIZE = 90,
EPROTOTYPE = 91,
ENOPROTOOPT = 92,
EPROTONOSUPPORT = 93,
ESOCKTNOSUPPORT = 94,
EOPNOTSUPP = 95,
EPFNOSUPPORT = 96,
EAFNOSUPPORT = 97,
EADDRINUSE = 98,
EADDRNOTAVAIL = 99,
ENETDOWN = 100,
ENETUNREACH = 101,
ENETRESET = 102,
ECONNABORTED = 103,
ECONNRESET = 104,
ENOBUFS = 105,
EISCONN = 106,
ENOTCONN = 107,
ESHUTDOWN = 108,
ETOOMANYREFS = 109,
ETIMEDOUT = 110,
ECONNREFUSED = 111,
EHOSTDOWN = 112,
EHOSTUNREACH = 113,
EALREADY = 114,
EINPROGRESS = 115,
ESTALE = 116,
EUCLEAN = 117,
ENOTNAM = 118,
ENAVAIL = 119,
EISNAM = 120,
EREMOTEIO = 121,
EDQUOT = 122,
ENOMEDIUM = 123,
EMEDIUMTYPE = 124,
ECANCELED = 125,
ENOKEY = 126,
EKEYEXPIRED = 127,
EKEYREVOKED = 128,
EKEYREJECTED = 129,
EOWNERDEAD = 130,
ENOTRECOVERABLE = 131,
ERFKILL = 132,
EHWPOISON = 133,
}
impl fmt::Display for Error {
fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
write!(f, "{}", self.as_str())
}
}
impl From<Error> for isize {
    fn from(error: Error) -> Self {
        // Every variant carries its errno value as an explicit
        // discriminant, so a plain cast yields the same number.
        error as isize
    }
}
impl From<isize> for Error {
fn from(number: isize) -> Self {
match number {
1 => Self::EPERM,
2 => Self::ENOENT,
3 => Self::ESRCH,
4 => Self::EINTR,
5 => Self::EIO,
6 => Self::ENXIO,
7 => Self::E2BIG,
8 => Self::ENOEXEC,
9 => Self::EBADF,
10 => Self::ECHILD,
11 => Self::EAGAIN,
12 => Self::ENOMEM,
13 => Self::EACCES,
14 => Self::EFAULT,
15 => Self::ENOTBLK,
16 => Self::EBUSY,
17 => Self::EEXIST,
18 => Self::EXDEV,
19 => Self::ENODEV,
20 => Self::ENOTDIR,
21 => Self::EISDIR,
22 => Self::EINVAL,
23 => Self::ENFILE,
24 => Self::EMFILE,
25 => Self::ENOTTY,
26 => Self::ETXTBSY,
27 => Self::EFBIG,
28 => Self::ENOSPC,
29 => Self::ESPIPE,
30 => Self::EROFS,
31 => Self::EMLINK,
32 => Self::EPIPE,
33 => Self::EDOM,
34 => Self::ERANGE,
35 => Self::EDEADLK,
36 => Self::ENAMETOOLONG,
37 => Self::ENOLCK,
38 => Self::ENOSYS,
39 => Self::ENOTEMPTY,
40 => Self::ELOOP,
41 => Self::EWOULDBLOCK,
42 => Self::ENOMSG,
43 => Self::EIDRM,
44 => Self::ECHRNG,
45 => Self::EL2NSYNC,
46 => Self::EL3HLT,
47 => Self::EL3RST,
48 => Self::ELNRNG,
49 => Self::EUNATCH,
50 => Self::ENOCSI,
51 => Self::EL2HLT,
52 => Self::EBADE,
53 => Self::EBADR,
54 => Self::EXFULL,
55 => Self::ENOANO,
56 => Self::EBADRQC,
57 => Self::EBADSLT,
58 => Self::EDEADLOCK,
59 => Self::EBFONT,
60 => Self::ENOSTR,
61 => Self::ENODATA,
62 => Self::ETIME,
63 => Self::ENOSR,
64 => Self::ENONET,
65 => Self::ENOPKG,
66 => Self::EREMOTE,
67 => Self::ENOLINK,
68 => Self::EADV,
69 => Self::ESRMNT,
70 => Self::ECOMM,
71 => Self::EPROTO,
72 => Self::EMULTIHOP,
73 => Self::EDOTDOT,
74 => Self::EBADMSG,
75 => Self::EOVERFLOW,
76 => Self::ENOTUNIQ,
77 => Self::EBADFD,
78 => Self::EREMCHG,
79 => Self::ELIBACC,
80 => Self::ELIBBAD,
81 => Self::ELIBSCN,
82 => Self::ELIBMAX,
83 => Self::ELIBEXEC,
84 => Self::EILSEQ,
85 => Self::ERESTART,
86 => Self::ESTRPIPE,
87 => Self::EUSERS,
88 => Self::ENOTSOCK,
89 => Self::EDESTADDRREQ,
90 => Self::EMSGSIZE,
91 => Self::EPROTOTYPE,
92 => Self::ENOPROTOOPT,
93 => Self::EPROTONOSUPPORT,
94 => Self::ESOCKTNOSUPPORT,
95 => Self::EOPNOTSUPP,
96 => Self::EPFNOSUPPORT,
97 => Self::EAFNOSUPPORT,
98 => Self::EADDRINUSE,
99 => Self::EADDRNOTAVAIL,
100 => Self::ENETDOWN,
101 => Self::ENETUNREACH,
102 => Self::ENETRESET,
103 => Self::ECONNABORTED,
104 => Self::ECONNRESET,
105 => Self::ENOBUFS,
106 => Self::EISCONN,
107 => Self::ENOTCONN,
108 => Self::ESHUTDOWN,
109 => Self::ETOOMANYREFS,
110 => Self::ETIMEDOUT,
111 => Self::ECONNREFUSED,
112 => Self::EHOSTDOWN,
113 => Self::EHOSTUNREACH,
114 => Self::EALREADY,
115 => Self::EINPROGRESS,
116 => Self::ESTALE,
117 => Self::EUCLEAN,
118 => Self::ENOTNAM,
119 => Self::ENAVAIL,
120 => Self::EISNAM,
121 => Self::EREMOTEIO,
122 => Self::EDQUOT,
123 => Self::ENOMEDIUM,
124 => Self::EMEDIUMTYPE,
125 => Self::ECANCELED,
126 => Self::ENOKEY,
127 => Self::EKEYEXPIRED,
128 => Self::EKEYREVOKED,
129 => Self::EKEYREJECTED,
130 => Self::EOWNERDEAD,
131 => Self::ENOTRECOVERABLE,
132 => Self::ERFKILL,
133 => Self::EHWPOISON,
other => panic!("Invalid error code: {}", other),
}
}
}
impl Error {
pub fn as_str(&self) -> &'static str {
match self {
Self::EPERM => "Operation not permitted",
Self::ENOENT => "No such file or directory",
Self::ESRCH => "No such process",
Self::EINTR => "Interrupted system call",
Self::EIO => "I/O error",
Self::ENXIO => "No such device or address",
Self::E2BIG => "Arg list too long",
Self::ENOEXEC => "Exec format error",
Self::EBADF => "Bad file number",
Self::ECHILD => "No child processes",
Self::EAGAIN => "Try again",
Self::ENOMEM => "Out of memory",
Self::EACCES => "Permission denied",
Self::EFAULT => "Bad address",
Self::ENOTBLK => "Block device required",
Self::EBUSY => "Device or resource busy",
Self::EEXIST => "File exists",
Self::EXDEV => "Cross-device link",
Self::ENODEV => "No such device",
Self::ENOTDIR => "Not a directory",
Self::EISDIR => "Is a directory",
Self::EINVAL => "Invalid argument",
Self::ENFILE => "File table overflow",
Self::EMFILE => "Too many open files",
Self::ENOTTY => "Not a typewriter",
Self::ETXTBSY => "Text file busy",
Self::EFBIG => "File too large",
Self::ENOSPC => "No space left on device",
Self::ESPIPE => "Illegal seek",
Self::EROFS => "Read-only file system",
Self::EMLINK => "Too many links",
Self::EPIPE => "Broken pipe",
Self::EDOM => "Math argument out of domain of func",
Self::ERANGE => "Math result not representable",
Self::EDEADLK => "Resource deadlock would occur",
Self::ENAMETOOLONG => "File name too long",
Self::ENOLCK => "No record locks available",
Self::ENOSYS => "Function not implemented",
Self::ENOTEMPTY => "Directory not empty",
Self::ELOOP => "Too many symbolic links encountered",
Self::EWOULDBLOCK => "Operation would block",
Self::ENOMSG => "No message of desired type",
Self::EIDRM => "Identifier removed",
Self::ECHRNG => "Channel number out of range",
Self::EL2NSYNC => "Level 2 not synchronized",
Self::EL3HLT => "Level 3 halted",
Self::EL3RST => "Level 3 reset",
Self::ELNRNG => "Link number out of range",
Self::EUNATCH => "Protocol driver not attached",
Self::ENOCSI => "No CSI structure available",
Self::EL2HLT => "Level 2 halted",
Self::EBADE => "Invalid exchange",
Self::EBADR => "Invalid request descriptor",
Self::EXFULL => "Exchange full",
Self::ENOANO => "No anode",
Self::EBADRQC => "Invalid request code",
Self::EBADSLT => "Invalid slot",
Self::EDEADLOCK => "File locking deadlock error",
Self::EBFONT => "Bad font file format",
Self::ENOSTR => "Device not a stream",
Self::ENODATA => "No data available",
Self::ETIME => "Timer expired",
Self::ENOSR => "Out of streams resources",
Self::ENONET => "Machine is not on the network",
Self::ENOPKG => "Package not installed",
Self::EREMOTE => "Object is remote",
Self::ENOLINK => "Link has been severed",
Self::EADV => "Advertise error",
Self::ESRMNT => "Srmount error",
Self::ECOMM => "Communication error on send",
Self::EPROTO => "Protocol error",
Self::EMULTIHOP => "Multihop attempted",
Self::EDOTDOT => "RFS specific error",
Self::EBADMSG => "Not a data message",
Self::EOVERFLOW => "Value too large for defined data type",
Self::ENOTUNIQ => "Name not unique on network",
Self::EBADFD => "File descriptor in bad state",
Self::EREMCHG => "Remote address changed",
Self::ELIBACC => "Can not access a needed shared library",
Self::ELIBBAD => "Accessing a corrupted shared library",
Self::ELIBSCN => ".lib section in a.out corrupted",
Self::ELIBMAX => "Attempting to link in too many shared libraries",
Self::ELIBEXEC => "Cannot exec a shared library directly",
Self::EILSEQ => "Illegal byte sequence",
Self::ERESTART => "Interrupted system call should be restarted",
Self::ESTRPIPE => "Streams pipe error",
Self::EUSERS => "Too many users",
Self::ENOTSOCK => "Socket operation on non-socket",
Self::EDESTADDRREQ => "Destination address required",
Self::EMSGSIZE => "Message too long",
Self::EPROTOTYPE => "Protocol wrong type for socket",
Self::ENOPROTOOPT => "Protocol not available",
Self::EPROTONOSUPPORT => "Protocol not supported",
Self::ESOCKTNOSUPPORT => "Socket type not supported",
Self::EOPNOTSUPP => "Operation not supported on transport endpoint",
Self::EPFNOSUPPORT => "Protocol family not supported",
Self::EAFNOSUPPORT => "Address family not supported by protocol",
Self::EADDRINUSE => "Address already in use",
Self::EADDRNOTAVAIL => "Cannot assign requested address",
Self::ENETDOWN => "Network is down",
Self::ENETUNREACH => "Network is unreachable",
Self::ENETRESET => "Network dropped connection because of reset",
Self::ECONNABORTED => "Software caused connection abort",
Self::ECONNRESET => "Connection reset by peer",
Self::ENOBUFS => "No buffer space available",
Self::EISCONN => "Transport endpoint is already connected",
Self::ENOTCONN => "Transport endpoint is not connected",
Self::ESHUTDOWN => "Cannot send after transport endpoint shutdown",
Self::ETOOMANYREFS => "Too many references: cannot splice",
Self::ETIMEDOUT => "Connection timed out",
Self::ECONNREFUSED => "Connection refused",
Self::EHOSTDOWN => "Host is down",
Self::EHOSTUNREACH => "No route to host",
Self::EALREADY => "Operation already in progress",
Self::EINPROGRESS => "Operation now in progress",
Self::ESTALE => "Stale NFS file handle",
Self::EUCLEAN => "Structure needs cleaning",
Self::ENOTNAM => "Not a XENIX named type file",
Self::ENAVAIL => "No XENIX semaphores available",
Self::EISNAM => "Is a named type file",
Self::EREMOTEIO => "Remote I/O error",
Self::EDQUOT => "Quota exceeded",
Self::ENOMEDIUM => "No medium found",
Self::EMEDIUMTYPE => "Wrong medium type",
Self::ECANCELED => "Operation Canceled",
Self::ENOKEY => "Required key not available",
Self::EKEYEXPIRED => "Key has expired",
Self::EKEYREVOKED => "Key has been revoked",
Self::EKEYREJECTED => "Key was rejected by service",
Self::EOWNERDEAD => "Owner died",
Self::ENOTRECOVERABLE => "State not recoverable",
Self::ERFKILL => "Operation not possible due to RF-kill",
Self::EHWPOISON => "Memory page has hardware error",
}
}
}
}
Once we have Result and Error we can reimplement our syscalls as follows:
#![allow(unused)]
fn main() {
use core::convert::TryInto;
use crate::error::{Error, Result};
#[no_mangle]
pub fn read(fd: u32, buf: &mut [u8]) -> Result<usize> {
// The kernel writes into `buf`, so hand it a mutable pointer.
let rc = unsafe { syscall!(SYS_READ, fd, buf.as_mut_ptr(), buf.len()) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(rc.try_into().unwrap())
}
#[no_mangle]
pub fn write(fd: u32, buf: &[u8]) -> Result<usize> {
let rc = unsafe { syscall!(SYS_WRITE, fd, buf.as_ptr(), buf.len()) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(rc.try_into().unwrap())
}
}
and the main function should look like:
#[no_mangle]
fn main() -> u8 {
linux::syscall::write(1, b"Hello world\n").unwrap();
0
}
Once we recompile and run the code, we can see the text on stdout:
> ./cargo.sh run
Hello world
But what happens if we specify a wrong file descriptor? Let’s use 3 as the file descriptor instead of 1. Since we never opened a file with descriptor 3 we should see an error now. Let’s recompile and run:
> ./cargo.sh run
Our program starts hammering on the CPU and never exits. Sounds familiar? The write syscall returns an error, we unwrap it
and as a result our code panics. But we implemented our panic handler in the first chapter like this:
#![allow(unused)]
fn main() {
#[panic_handler]
fn panic_handler(_: &core::panic::PanicInfo) -> ! {
loop {}
}
}
Let’s fix that by calling the exit syscall instead of looping forever:
#![allow(unused)]
fn main() {
#[panic_handler]
fn panic_handler(_: &core::panic::PanicInfo) -> ! {
syscall::exit(255);
}
}
Now if we run the same code with file descriptor 3 the process should simply exit with error code 255.
> ./cargo.sh run; echo $?
255
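Every wrapper in this section repeats the same check on rc. A minimal sketch of how that check could be factored out, assuming the syscall! macro yields an isize: the names check and Errno are made up for this illustration (Errno only stands in for the Error type of this chapter), and the window from -4095 to -1 is the range Linux reserves for error returns.

```rust
/// Hypothetical stand-in for the chapter's `Error`; it only stores the raw errno.
#[derive(Debug, PartialEq, Eq)]
pub struct Errno(pub u16);

/// Largest errno value the kernel encodes in the return register.
const MAX_ERRNO: isize = 4095;

/// Convert a raw syscall return value into a `Result`.
/// Values in -4095..=-1 are errors; everything else is a valid result
/// (this also covers large addresses that look negative as an isize).
pub fn check(rc: isize) -> Result<usize, Errno> {
    if rc < 0 && rc >= -MAX_ERRNO {
        Err(Errno((-rc) as u16))
    } else {
        Ok(rc as usize)
    }
}
```

With such a helper the body of a wrapper would shrink to the syscall followed by one call to check. Whether that is worth the extra indirection is a matter of taste.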
Standard IO
We already implemented Display and Debug for our error type so why don’t we simply print them on stderr?
The PanicInfo also implements these traits, so we should be
able to write them out, but how should we create a string, or more precisely a byte array, from these types?
There is a nice macro in the core library called write!
which can be used to format the output. Let’s try that in the panic_handler function.
#![allow(unused)]
fn main() {
#[panic_handler]
fn panic_handler(info: &core::panic::PanicInfo) -> ! {
write!(1u8, "{:?}", info);
syscall::exit(255);
}
}
As you probably expected, we get a compilation error:
./cargo.sh build
error[E0599]: cannot write into `u8`
--> linux.rs:24:12
|
24 | write!(1u8, "{:?}", info);
| -------^^^--------------- method not found in `u8`
|
note: must implement `io::Write`, `fmt::Write`, or have a `write_fmt` method
--> linux.rs:24:12
|
24 | write!(1u8, "{:?}", info);
| ^^^
help: a writer is needed before this format string
--> linux.rs:24:12
|
24 | write!(1u8, "{:?}", info);
| ^
We cannot write into u8… Which kind of makes sense. The write! macro is part of the core library which has no
idea about the write syscall we just implemented. We should somehow invert the dependency, and the compiler message
helps us to do that. The first argument of the write! macro needs to implement the io::Write or fmt::Write trait,
or needs to have a write_fmt method. Let’s wrap the file descriptor into a struct and implement the
fmt::Write trait for it.
(The io::Write trait is part of the std library
which we don’t have access to.)
Let’s create a new module called io. We need to include it in linux.rs with pub mod io; and create a new
file called io.rs with the following content:
#![allow(unused)]
fn main() {
use core::fmt;
use crate::error::Result;
pub struct Stdio {
fd: u32,
}
impl Stdio {
pub fn read(&self, buf: &mut [u8]) -> Result<usize> {
crate::syscall::read(self.fd, buf)
}
pub fn write(&self, buf: &[u8]) -> Result<usize> {
crate::syscall::write(self.fd, buf)
}
}
impl fmt::Write for Stdio {
fn write_str(&mut self, s: &str) -> fmt::Result {
match self.write(s.as_bytes()) {
Ok(_) => Ok(()),
Err(_) => Err(fmt::Error),
}
}
}
pub fn stdin() -> Stdio {
Stdio { fd: 0 }
}
pub fn stdout() -> Stdio {
Stdio { fd: 1 }
}
pub fn stderr() -> Stdio {
Stdio { fd: 2 }
}
}
After that we can rewrite the panic-handler like this:
#![allow(unused)]
fn main() {
use core::fmt::Write;
#[panic_handler]
fn panic_handler(info: &core::panic::PanicInfo) -> ! {
let _ = write!(io::stderr(), "{}\n", info);
syscall::exit(255);
}
}
But if we try to build the code we get yet another linker error, this time about the missing memcpy function. No problem. We
expected that, we just didn’t know when it was going to come. So let’s put our memcpy implementation next to memset in the
ffi.rs file:
#![allow(unused)]
fn main() {
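#[no_mangle]
// extern "C" because the compiler-generated calls expect the C calling convention.
unsafe extern "C" fn memcpy(dst: *mut u8, src: *const u8, len: usize) -> *mut u8 {
// Copy byte by byte. `add` takes a usize, so no fallible conversion
// (and therefore no panic path) is needed inside memcpy.
for idx in 0 .. len {
unsafe {
let byte = src.add(idx).read();
dst.add(idx).write(byte);
}
}
dst
}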
}
Exceptions in Rust: https://github.com/rust-lang/rfcs/blob/master/text/1236-stabilize-catch-panic.md
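Last but not least we get an undefined reference error to rust_eh_personality. This is the personality routine the unwinder consults when a panic unwinds the stack. We never unwind, so an empty lang item is enough to satisfy the linker: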
#![allow(unused)]
#![feature(lang_items)]
#![allow(internal_features)]
fn main() {
#[lang = "eh_personality"]
fn rust_eh_personality() {}
}
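To see that write_str really is the only method write! needs from us, here is a small sketch of another fmt::Write sink that uses nothing but core. FixedBuf and its 64-byte capacity are invented for this illustration and are not part of the linux crate. It collects the formatted text in a stack array instead of sending it to a file descriptor.

```rust
use core::fmt::{self, Write};

/// Hypothetical fixed-size sink: formatted text lands in a stack array.
pub struct FixedBuf {
    buf: [u8; 64],
    len: usize,
}

impl FixedBuf {
    pub const fn new() -> Self {
        FixedBuf { buf: [0; 64], len: 0 }
    }

    /// The text written so far. Only whole `&str` chunks are ever copied in,
    /// so the content is always valid UTF-8.
    pub fn as_str(&self) -> &str {
        core::str::from_utf8(&self.buf[..self.len]).unwrap_or("")
    }
}

impl Write for FixedBuf {
    // The formatting machinery calls this once per chunk of output.
    fn write_str(&mut self, s: &str) -> fmt::Result {
        let bytes = s.as_bytes();
        let end = self.len.checked_add(bytes.len()).ok_or(fmt::Error)?;
        if end > self.buf.len() {
            return Err(fmt::Error); // out of space
        }
        self.buf[self.len..end].copy_from_slice(bytes);
        self.len = end;
        Ok(())
    }
}
```

Like Stdio, the sink reports failure only as fmt::Error, because fmt::Result has no room for more detail. That is why the panic handler above can do little more than ignore the result.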
Print macros
The write! macro is already a big improvement but we can go further. Let’s define macros to print text to stdout and stderr. They can be defined in the io.rs file.
#![allow(unused)]
fn main() {
#[macro_export]
macro_rules! print {
// `$(, $args:expr)*` accepts any number of comma-separated arguments.
($fmt:literal $(, $args:expr)*) => {{
use core::fmt::Write;
write!($crate::io::stdout(), $fmt $(, $args)*).unwrap();
}}
}
#[macro_export]
macro_rules! println {
($fmt:literal $(, $args:expr)*) => {{
$crate::print!("{}\n", format_args!($fmt $(, $args)*))
}}
}
#[macro_export]
macro_rules! eprint {
($fmt:literal $(, $args:expr)*) => {{
use core::fmt::Write;
write!($crate::io::stderr(), $fmt $(, $args)*).unwrap();
}}
}
#[macro_export]
macro_rules! eprintln {
($fmt:literal $(, $args:expr)*) => {{
$crate::eprint!("{}\n", format_args!($fmt $(, $args)*))
}}
}
}
The bin.rs then looks like this (note the new
#[macro_use] attribute on the
extern crate linux declaration):
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
#[no_mangle]
fn main() -> u8 {
print!("Hello");
eprintln!(" {}", "world");
0
}
File operations
- open, close
- stat, fstat, lstat, fstatat
- fcntl, fsync, fdatasync
- truncate, ftruncate, fallocate
- lseek
- seek, drop (close)
- BufRead, BufWrite – prove with perf the many syscalls
Open and close
First of all we need to implement two syscalls, open and close, to be able to work with files. If you look up the
manual pages of open and close,
they say that the function signatures look like:
int open(const char *path, int flags);
int close(int fd);
This should be quite simple to implement in Rust. Let’s add the following functions to our syscall.rs:
#![allow(unused)]
fn main() {
const SYS_OPEN: isize = 2;
const SYS_CLOSE: isize = 3;
#[no_mangle]
pub fn open(path: &str, flags: u64, mode: u64) -> Result<u32> {
let rc = unsafe { syscall!(SYS_OPEN, path.as_ptr(), flags, mode) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(u32::try_from(rc).unwrap())
}
#[no_mangle]
pub fn close(fd: u32) -> Result<()> {
let rc = unsafe { syscall!(SYS_CLOSE, fd) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
}
And call them from the main function like this:
#[no_mangle]
fn main() -> u8 {
let fd = linux::syscall::open("./bin.rs", 0, 0).unwrap();
linux::syscall::close(fd).unwrap();
0
}
If we try to run this code the following happens:
> ./cargo.sh run
panicked at ./bin.rs:10:50:
called `Result::unwrap()` on an `Err` value: ENAMETOOLONG
The error message is quite straightforward: the name of the file is too long. Heh? 8 characters is too long? We have most
likely messed something up. So how does the kernel determine the length of our string? Like the C strlen function, it
expects the string to be null terminated and keeps reading until it finds a null byte. A Rust str, by contrast,
is not null terminated; it is a byte slice that carries its length separately. As a result
the kernel reads past the end of our str, so we just violated the rules of Rust, caused undefined behaviour and made
the whole library unsound. Nice…
We can prove it by adding a null byte into our str and letting the code run:
#[no_mangle]
fn main() -> u8 {
let fd = linux::syscall::open("./bin.rs\0", 0, 0).unwrap();
linux::syscall::close(fd).unwrap();
0
}
> ./cargo.sh run
Now all seems to be fine. But as the rules of unsafe say: an unsafe block is only sound if safe code cannot call it
in a way that causes undefined behaviour. This means that we cannot expect the user to put a null at the end of a str
every time a file needs to be opened. We have to convert the Rust str into a null terminated string. And there is a nice struct
for it: CString. The only problem is that it is defined in the
alloc crate which we don’t want to depend on. Let’s avoid implementing our own allocation primitives for now and simply
use a stack array to build our null terminated string. So let’s rewrite our open function like this:
#![allow(unused)]
fn main() {
#[no_mangle]
pub fn open(path: &str, flags: u64, mode: u64) -> Result<u32> {
let mut dst = [0u8;crate::limits::PATH_MAX];
let src = path.as_bytes();
if src.len() >= crate::limits::PATH_MAX {
return Err(Error::ENAMETOOLONG);
}
for idx in 0 .. src.len() {
dst[idx] = src[idx];
}
let rc = unsafe { syscall!(SYS_OPEN, dst.as_ptr(), flags, mode) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(u32::try_from(rc).unwrap())
}
}
There are a couple of limits defined in the linux kernel. For example here
To conform to these limits, we add a module to linux.rs with pub mod limits; and also create a file
called limits.rs with the following content:
#![allow(unused)]
fn main() {
pub const PATH_MAX: usize = 4096;
}
After that we can remove the \0 termination from our str and it should just work now:
> ./cargo.sh run
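The copy loop in open is something every path-taking syscall will need. Later wrappers in this chapter call a helper named cpath whose definition is not shown, so the following is only a guess at what it could look like, based on how it is called: cpath(path.as_bytes(), &mut dst)?. The Errno enum is a stand-in for the Error type of this chapter, and rejecting interior null bytes with EINVAL is an assumption of this sketch, not something the text prescribes.

```rust
/// Stand-in for the chapter's `Error`, reduced to the two variants used here.
#[derive(Debug, PartialEq, Eq)]
pub enum Errno {
    EINVAL,
    ENAMETOOLONG,
}

/// Copy `src` into `dst` and append the terminating null byte the kernel expects.
pub fn cpath(src: &[u8], dst: &mut [u8]) -> Result<(), Errno> {
    // One extra byte is needed for the terminator.
    if src.len() >= dst.len() {
        return Err(Errno::ENAMETOOLONG);
    }
    // An interior null would silently cut the path short on the kernel side.
    if src.contains(&0) {
        return Err(Errno::EINVAL);
    }
    dst[..src.len()].copy_from_slice(src);
    dst[src.len()] = 0;
    Ok(())
}
```

Writing the terminator explicitly means the helper does not rely on dst being zeroed beforehand, so a caller could reuse one buffer for several paths.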
Let’s define the options for the open syscall. You can find the options in fcntl.h
and the opening mode flags in stat.h.
We can simply add these values to the syscall.rs file:
#![allow(unused)]
fn main() {
pub const O_ACCMODE: u64 = 0o0000003;
pub const O_RDONLY: u64 = 0o0000000;
pub const O_WRONLY: u64 = 0o0000001;
pub const O_RDWR: u64 = 0o0000002;
pub const O_CREAT: u64 = 0o0000100;
pub const O_EXCL: u64 = 0o0000200;
pub const O_NOCTTY: u64 = 0o0000400;
pub const O_TRUNC: u64 = 0o0001000;
pub const O_APPEND: u64 = 0o0002000;
pub const O_NONBLOCK: u64 = 0o0004000;
pub const O_DSYNC: u64 = 0o0010000;
pub const O_DIRECT: u64 = 0o0040000;
pub const O_LARGEFILE: u64 = 0o0100000;
pub const O_DIRECTORY: u64 = 0o0200000;
pub const O_NOFOLLOW: u64 = 0o0400000;
pub const O_NOATIME: u64 = 0o1000000;
pub const O_CLOEXEC: u64 = 0o2000000;
pub const O_SYNC: u64 = 0o4000000 | O_DSYNC; // fcntl.h: __O_SYNC|O_DSYNC
pub const O_PATH: u64 = 0o10000000;
pub const O_TMPFILE: u64 = 0o20000000 | O_DIRECTORY; // fcntl.h: __O_TMPFILE|O_DIRECTORY
pub const O_NDELAY: u64 = O_NONBLOCK;
pub const S_IRWXU: u64 = 0o700; // RWX mask for owner
pub const S_IRUSR: u64 = 0o400; // R for owner
pub const S_IWUSR: u64 = 0o200; // W for owner
pub const S_IXUSR: u64 = 0o100; // X for owner
pub const S_IRWXG: u64 = 0o070; // RWX for group
pub const S_IRGRP: u64 = 0o040; // R for group
pub const S_IWGRP: u64 = 0o020; // W for group
pub const S_IXGRP: u64 = 0o010; // X for group
pub const S_IRWXO: u64 = 0o007; // RWX for other
pub const S_IROTH: u64 = 0o004; // R for other
pub const S_IWOTH: u64 = 0o002; // W for other
pub const S_IXOTH: u64 = 0o001; // X for other
}
So we can have a basic file handling functionality:
#[no_mangle]
fn main() -> u8 {
use linux::syscall::*;
let fd = open("hello.txt", O_CREAT|O_RDWR|O_DSYNC, S_IRUSR|S_IWUSR).unwrap();
write(fd, b"hello world\n").unwrap();
close(fd).unwrap();
0
}
And we can run it like this:
> ./cargo.sh run
> cat hello.txt
hello world
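One detail of the O_* constants is easy to miss: O_RDONLY is 0, so the access mode cannot be tested like an ordinary bit flag. It has to be extracted with the O_ACCMODE mask first. A small sketch of that, repeating only the constants it needs; the access_mode function is made up for this illustration.

```rust
pub const O_ACCMODE: u64 = 0o3;
pub const O_RDONLY: u64 = 0o0;
pub const O_WRONLY: u64 = 0o1;
pub const O_RDWR: u64 = 0o2;
pub const O_CREAT: u64 = 0o100;
pub const O_DSYNC: u64 = 0o10000;

/// Name the access mode encoded in the two lowest bits of `flags`.
pub fn access_mode(flags: u64) -> &'static str {
    match flags & O_ACCMODE {
        O_RDONLY => "read-only",
        O_WRONLY => "write-only",
        O_RDWR => "read-write",
        _ => "invalid", // 0o3 is not a valid access mode
    }
}
```

For the flags used above, O_CREAT|O_RDWR|O_DSYNC, this yields "read-write", while a bare O_CREAT yields "read-only" because the two lowest bits are zero.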
stat, fstat, lstat
The C wrapper of the stat and fstat syscalls look like this:
int stat(const char *pathname, struct stat *statbuf);
int fstat(int fd, struct stat *statbuf);
In C it’s quite common to create a struct on the stack and pass it into a function as a pointer. The function initializes
the struct and after that we can use it. This makes a lot of sense because it leaves the return value free to carry an error code:
zero typically means that the function succeeded while anything else typically means an error. In Rust, by contrast, we
have Result types. So would it be better to create the stat struct on the stack of the syscall wrapper and
give it back as Ok(stat) in case of success? To find out let’s implement two versions of this function:
#![allow(unused)]
fn main() {
const SYS_FSTAT: isize = 5;
#[repr(C)]
#[derive(Debug, Default)]
pub struct stat64 {
pub st_dev: u64,
pub st_ino: u64,
pub st_nlink: u64,
pub st_mode: u32,
pub st_uid: u32,
pub st_gid: u32,
__pad0: i32,
pub st_rdev: u64,
pub st_size: i64,
pub st_blksize: i64,
pub st_blocks: i64,
pub st_atime: i64,
pub st_atime_nsec: i64,
pub st_mtime: i64,
pub st_mtime_nsec: i64,
pub st_ctime: i64,
pub st_ctime_nsec: i64,
__reserved: [i64; 3],
}
#[no_mangle]
pub fn fstat1(fd: u32, stat: &mut stat64) -> Result<()> {
let rc = unsafe { syscall!(SYS_FSTAT, fd, stat) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
#[no_mangle]
pub fn fstat2(fd: u32) -> Result<stat64> {
let mut stat = stat64::default();
let rc = unsafe { syscall!(SYS_FSTAT, fd, &mut stat) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(stat)
}
}
If we build the code and dump the assembly it’s easy to see the difference between the two functions:
> ./cargo.sh build
> ./cargo.sh dump fstat1
0000000000401f70 <fstat1>:
401f70: 55 push rbp
401f71: 48 89 e5 mov rbp,rsp
401f74: b8 05 00 00 00 mov eax,0x5
401f79: 0f 05 syscall
401f7b: 48 85 c0 test rax,rax
401f7e: /-- 78 04 js 401f84 <fstat1+0x14>
401f80: | 31 c0 xor eax,eax
401f82: | 5d pop rbp
401f83: | c3 ret
401f84: \-> 48 f7 d8 neg rax
401f87: 48 89 c7 mov rdi,rax
401f8a: 5d pop rbp
401f8b: ff 25 d7 4f 00 00 jmp QWORD PTR [rip+0x4fd7] # 406f68 <_GLOBAL_OFFSET_TABLE_+0x20>
> ./cargo.sh dump fstat2
0000000000401fa0 <fstat2>:
401fa0: 55 push rbp
401fa1: 48 89 e5 mov rbp,rsp
401fa4: 53 push rbx
401fa5: 48 81 ec 98 00 00 00 sub rsp,0x98
401fac: 89 f1 mov ecx,esi
401fae: 48 89 fb mov rbx,rdi
401fb1: 0f 57 c0 xorps xmm0,xmm0
401fb4: 0f 29 45 e0 movaps XMMWORD PTR [rbp-0x20],xmm0
401fb8: 0f 29 45 d0 movaps XMMWORD PTR [rbp-0x30],xmm0
401fbc: 0f 29 45 c0 movaps XMMWORD PTR [rbp-0x40],xmm0
401fc0: 0f 29 45 b0 movaps XMMWORD PTR [rbp-0x50],xmm0
401fc4: 0f 29 45 a0 movaps XMMWORD PTR [rbp-0x60],xmm0
401fc8: 0f 29 45 90 movaps XMMWORD PTR [rbp-0x70],xmm0
401fcc: 0f 29 45 80 movaps XMMWORD PTR [rbp-0x80],xmm0
401fd0: 0f 29 85 70 ff ff ff movaps XMMWORD PTR [rbp-0x90],xmm0
401fd7: 0f 29 85 60 ff ff ff movaps XMMWORD PTR [rbp-0xa0],xmm0
401fde: 48 8d b5 60 ff ff ff lea rsi,[rbp-0xa0]
401fe5: b8 05 00 00 00 mov eax,0x5
401fea: 89 cf mov edi,ecx
401fec: 0f 05 syscall
401fee: 48 85 c0 test rax,rax
401ff1: /----- 78 13 js 402006 <fstat2+0x66>
401ff3: | 48 8d 7b 08 lea rdi,[rbx+0x8]
401ff7: | ba 90 00 00 00 mov edx,0x90
401ffc: | ff 15 6e 4f 00 00 call QWORD PTR [rip+0x4f6e] # 406f70 <_GLOBAL_OFFSET_TABLE_+0x28>
402002: | 31 c0 xor eax,eax
402004: | /-- eb 11 jmp 402017 <fstat2+0x77>
402006: \--|-> 48 f7 d8 neg rax
402009: | 48 89 c7 mov rdi,rax
40200c: | ff 15 56 4f 00 00 call QWORD PTR [rip+0x4f56] # 406f68 <_GLOBAL_OFFSET_TABLE_+0x20>
402012: | 88 43 01 mov BYTE PTR [rbx+0x1],al
402015: | b0 01 mov al,0x1
402017: \-> 88 03 mov BYTE PTR [rbx],al
402019: 48 89 d8 mov rax,rbx
40201c: 48 81 c4 98 00 00 00 add rsp,0x98
402023: 5b pop rbx
402024: 5d pop rbp
402025: c3 ret
The second version of fstat is more than twice as long as the first. But is that enough to throw it away? To be able to answer
the question we have to go a bit deeper into the code of fstat2 and analyse what’s actually happening here.
After saving rbx (push rbx) we reserve 0x98 bytes on the stack: 0x90 bytes for the stat64 struct plus 8 bytes of padding to
keep the stack aligned. This space has to be zeroed out, and to make that fast the compiler zeros the xmm0 SIMD register and uses it to store zeros on the stack 16 bytes at a time.
401fa0: 55 push rbp
401fa1: 48 89 e5 mov rbp,rsp
401fa4: 53 push rbx
401fa5: 48 81 ec 98 00 00 00 sub rsp,0x98
401fac: 89 f1 mov ecx,esi
401fae: 48 89 fb mov rbx,rdi
401fb1: 0f 57 c0 xorps xmm0,xmm0
401fb4: 0f 29 45 e0 movaps XMMWORD PTR [rbp-0x20],xmm0
401fb8: 0f 29 45 d0 movaps XMMWORD PTR [rbp-0x30],xmm0
401fbc: 0f 29 45 c0 movaps XMMWORD PTR [rbp-0x40],xmm0
401fc0: 0f 29 45 b0 movaps XMMWORD PTR [rbp-0x50],xmm0
401fc4: 0f 29 45 a0 movaps XMMWORD PTR [rbp-0x60],xmm0
401fc8: 0f 29 45 90 movaps XMMWORD PTR [rbp-0x70],xmm0
401fcc: 0f 29 45 80 movaps XMMWORD PTR [rbp-0x80],xmm0
401fd0: 0f 29 85 70 ff ff ff movaps XMMWORD PTR [rbp-0x90],xmm0
401fd7: 0f 29 85 60 ff ff ff movaps XMMWORD PTR [rbp-0xa0],xmm0
Once we have initialized the struct we have to pass it together with the fd to the syscall
401fde: 48 8d b5 60 ff ff ff lea rsi,[rbp-0xa0]
401fe5: b8 05 00 00 00 mov eax,0x5
401fea: 89 cf mov edi,ecx
401fec: 0f 05 syscall
We check the return code of the syscall and if it’s negative we jump forward to the error handling (402006)
401fee: 48 85 c0 test rax,rax
401ff1: /----- 78 13 js 402006 <fstat2+0x66>
If the return code was not negative we call memcpy. The parameters are rdi (dst), which is calculated from rbx; rsi (src), which
still points at the stat64 struct on the stack of the current function; and edx (len), which is the size of the stat64 struct. So the question is: where
do we copy the initialized struct? If you look at the first section of this code it says mov rbx,rdi, which is kind of interesting
because rdi is used for the first parameter of a function call, which should be the file descriptor in this case.
Let’s investigate that in gdb (see below).
401ff3: | 48 8d 7b 08 lea rdi,[rbx+0x8]
401ff7: | ba 90 00 00 00 mov edx,0x90
401ffc: | ff 15 6e 4f 00 00 call QWORD PTR [rip+0x4f6e] # 406f70 <_GLOBAL_OFFSET_TABLE_+0x28>
402002: | 31 c0 xor eax,eax
402004: | /-- eb 11 jmp 402017 <fstat2+0x77>
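Do the error handling here: negate rax to get the positive error number, convert it (presumably via Error::from), store the result in the second byte of the return value and load 1 into al, which marks the Err variant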
402006: \--|-> 48 f7 d8 neg rax
402009: | 48 89 c7 mov rdi,rax
40200c: | ff 15 56 4f 00 00 call QWORD PTR [rip+0x4f56] # 406f68 <_GLOBAL_OFFSET_TABLE_+0x20>
402012: | 88 43 01 mov BYTE PTR [rbx+0x1],al
402015: | b0 01 mov al,0x1
Tear down the function and return with Result<stat64>: release the 0x90 bytes and the 8 extra alignment bytes from the stack
and return to the caller.
402017: \-> 88 03 mov BYTE PTR [rbx],al
402019: 48 89 d8 mov rax,rbx
40201c: 48 81 c4 98 00 00 00 add rsp,0x98
402023: 5b pop rbx
402024: 5d pop rbp
402025: c3 ret
> gdb ./target/bin
(gdb) set disassembly-flavor intel
(gdb) break fstat2
(gdb) run
Breakpoint 1, linux::syscall::{impl#1}::default () at syscall.rs:52
52 #[derive(Debug, Default)]
(gdb) disassemble
Dump of assembler code for function linux::syscall::fstat2:
0x0000000000401fa0 <+0>: push rbp
0x0000000000401fa1 <+1>: mov rbp,rsp
0x0000000000401fa4 <+4>: push rbx
0x0000000000401fa5 <+5>: sub rsp,0x98
0x0000000000401fac <+12>: mov ecx,esi
0x0000000000401fae <+14>: mov rbx,rdi
=> 0x0000000000401fb1 <+17>: xorps xmm0,xmm0
...
(gdb) info registers esi rdi
esi 0x3 3
rdi 0x7fffffffe7f8 140737488349176
Something is definitely weird. esi (the lower 32 bits of rsi), which should contain the second parameter of the function, is set to
the file descriptor (3), and rdi holds some address on the stack. But fstat2 doesn’t even have two parameters…
So what’s happening here? If we look up the 3.2.3 Parameter Passing chapter of the
System V ABI and scroll down to the “Returning of Values” section
it has an interesting point:
If the type has class MEMORY, then the caller provides space for the return value and passes the address of this storage in rdi as if it were the first argument to the function. In effect, this address becomes a hidden first argument. This storage must not overlap any data visible to the callee through other names than this argument. On return %rax will contain the address that has been passed in by the caller in %rdi
So we could summarize the call to the two fstat functions as follows:
#![allow(unused)]
fn main() {
pub fn fstat1(fd: u32, stat: &mut stat64) -> Result<()>;
}
- The caller reserves space for stat64
- The caller zeros out stat64
- fstat updates stat64
- fstat returns the result
#![allow(unused)]
fn main() {
pub fn fstat2(fd: u32) -> Result<stat64>;
}
- The caller reserves space for the first stat64
- fstat reserves space for the second stat64
- fstat zeros out the second stat64
- fstat updates the second stat64
- fstat overwrites the first stat64 with the second stat64
- fstat returns the result
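The numbers in the disassembly can be cross-checked from Rust itself. The sketch below repeats the struct from above under the name Stat64 (renamed only to follow Rust naming conventions) and checks that its size is the 0x90 passed to memcpy. The helper exceeds_two_eightbytes is a made-up name and a deliberate simplification of the ABI classification rules: it only captures that an aggregate of plain integers larger than 16 bytes ends up in class MEMORY. Note also that the functions here use the Rust ABI, which is not formally specified, so this explains the observed code rather than guaranteeing it.

```rust
use core::mem::size_of;

/// Same layout as the stat64 struct used in this chapter.
#[repr(C)]
#[derive(Debug, Default)]
pub struct Stat64 {
    pub st_dev: u64,
    pub st_ino: u64,
    pub st_nlink: u64,
    pub st_mode: u32,
    pub st_uid: u32,
    pub st_gid: u32,
    __pad0: i32,
    pub st_rdev: u64,
    pub st_size: i64,
    pub st_blksize: i64,
    pub st_blocks: i64,
    pub st_atime: i64,
    pub st_atime_nsec: i64,
    pub st_mtime: i64,
    pub st_mtime_nsec: i64,
    pub st_ctime: i64,
    pub st_ctime_nsec: i64,
    __reserved: [i64; 3],
}

/// Simplified rule of thumb: integer aggregates larger than two eightbytes
/// (16 bytes) do not fit in rax:rdx and are returned through a hidden pointer.
pub fn exceeds_two_eightbytes<T>() -> bool {
    size_of::<T>() > 16
}
```

Result<Stat64, _> is at least as large as Stat64 itself, so it is far beyond the 16 byte limit, while something like Result<(), u8> comfortably fits into a register.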
Besides the fact that the fstat1 function is much more lightweight (no extra stack allocation and memcpy), we can also reuse the
stat64 struct when checking multiple files. So we don’t have to reinitialize it over and over again, which took at
least 10 instructions. As a conclusion let’s drop the fstat2 function and rename fstat1 to fstat.
Similarly we can also implement stat and lstat as follows:
#![allow(unused)]
fn main() {
const SYS_STAT: isize = 4;
const SYS_FSTAT: isize = 5;
const SYS_LSTAT: isize = 6;
#[no_mangle]
pub fn stat(path: &str, stat: &mut stat64) -> Result<()> {
let mut dst = [0u8;crate::limits::PATH_MAX];
cpath(path.as_bytes(), &mut dst)?;
let rc = unsafe { syscall!(SYS_STAT, dst.as_ptr(), stat) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
#[no_mangle]
pub fn fstat(fd: u32, stat: &mut stat64) -> Result<()> {
let rc = unsafe { syscall!(SYS_FSTAT, fd, stat) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
#[no_mangle]
pub fn lstat(path: &str, stat: &mut stat64) -> Result<()> {
let mut dst = [0u8;crate::limits::PATH_MAX];
cpath(path.as_bytes(), &mut dst)?;
let rc = unsafe { syscall!(SYS_LSTAT, dst.as_ptr(), stat) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
}
And now we can use them in the main function like this:
#[no_mangle]
fn main() -> u8 {
let fd = linux::syscall::open("hello", 0, 0).unwrap();
let mut stat = linux::syscall::stat64::default();
linux::syscall::fstat(fd, &mut stat).unwrap();
println!("{:#?}", stat);
0
}
The result should look something like this:
> ./cargo.sh run
stat64 {
st_dev: 64768,
st_ino: 940171,
st_nlink: 1,
st_mode: 33188,
st_uid: 1066219479,
st_gid: 1068570817,
__pad0: 0,
st_rdev: 0,
st_size: 9,
st_blksize: 4096,
st_blocks: 8,
st_atime: 1719926584,
st_atime_nsec: 93534043,
st_mtime: 1719926583,
st_mtime_nsec: 457537512,
st_ctime: 1719926583,
st_ctime_nsec: 457537512,
__reserved: [
0,
0,
0,
],
}
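The st_mode value in this output is easier to read in octal: 33188 is 0o100644. The upper bits hold the file type and the lowest nine bits the permissions. A short sketch of pulling the two apart; the S_IF* values are the ones from stat.h, while file_type and permissions are names invented for this illustration.

```rust
pub const S_IFMT: u32 = 0o170000; // mask for the file type bits
pub const S_IFREG: u32 = 0o100000; // regular file
pub const S_IFDIR: u32 = 0o040000; // directory
pub const S_IFLNK: u32 = 0o120000; // symbolic link

/// Name the file type stored in the upper bits of `st_mode`.
pub fn file_type(mode: u32) -> &'static str {
    match mode & S_IFMT {
        S_IFREG => "regular file",
        S_IFDIR => "directory",
        S_IFLNK => "symbolic link",
        _ => "other",
    }
}

/// The rwx bits for owner, group and other.
pub fn permissions(mode: u32) -> u32 {
    mode & 0o777
}
```

For the file above this gives a regular file with permissions 0o644, that is rw-r--r--.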
truncate, ftruncate, fallocate
To set the size of a file we can use the truncate and fallocate syscall family. Let’s implement these syscalls in syscall.rs:
#![allow(unused)]
fn main() {
const SYS_TRUNCATE: isize = 76;
const SYS_FTRUNCATE: isize = 77;
const SYS_FALLOCATE: isize = 285;
#[no_mangle]
pub fn truncate(path: &str, len: u64) -> Result<()> {
let mut dst = [0u8;crate::limits::PATH_MAX];
cpath(path.as_bytes(), &mut dst)?;
let rc = unsafe { syscall!(SYS_TRUNCATE, dst.as_ptr(), len) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
#[no_mangle]
pub fn ftruncate(fd: u32, len: u64) -> Result<()> {
let rc = unsafe { syscall!(SYS_FTRUNCATE, fd, len) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
#[no_mangle]
pub fn fallocate(fd: u32, mode: u32, offset: u64, len: u64) -> Result<()> {
let rc = unsafe { syscall!(SYS_FALLOCATE, fd, mode, offset, len) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
}
We can use them like this:
#[no_mangle]
fn main() -> u8 {
use linux::syscall::*;
let mut stat = stat64::default();
let fd = open("buffer", O_CREAT|O_APPEND|O_RDWR, S_IRWXU).unwrap();
fallocate(fd, 0, 0, 1024).unwrap();
fstat(fd, &mut stat).unwrap();
println!("size: {}", stat.st_size);
ftruncate(fd, 512).unwrap();
fstat(fd, &mut stat).unwrap();
println!("size: {}", stat.st_size);
close(fd).unwrap();
0
}
So the result is:
> ./cargo.sh run
size: 1024
size: 512
fsync, fdatasync
#![allow(unused)]
fn main() {
const SYS_FSYNC: isize = 74;
const SYS_FDATASYNC: isize = 75;
#[no_mangle]
pub fn fsync(fd: u32) -> Result<()> {
let rc = unsafe { syscall!(SYS_FSYNC, fd) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
#[no_mangle]
pub fn fdatasync(fd: u32) -> Result<()> {
let rc = unsafe { syscall!(SYS_FDATASYNC, fd) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(())
}
}
lseek
The lseek syscall can be used to move the cursor (file offset) of an open file. The options can be found here
#![allow(unused)]
fn main() {
const SYS_LSEEK: isize = 8;
#[no_mangle]
// The offset is signed: SEEK_CUR and SEEK_END accept negative values.
pub fn lseek(fd: u32, offset: i64, whence: i32) -> Result<u64> {
let rc = unsafe { syscall!(SYS_LSEEK, fd, offset, whence) };
if rc < 0 {
return Err(Error::from(rc * -1))
}
Ok(u64::try_from(rc).unwrap())
}
}
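The whence argument selects what the offset is relative to. The three classic values are fixed by the kernel headers (SEEK_SET = 0, SEEK_CUR = 1, SEEK_END = 2). The Whence enum below is not part of the chapter's code; it is a hypothetical sketch of how the raw pair could be wrapped so that a seek from the start cannot be given a negative offset.

```rust
use core::convert::TryFrom;

pub const SEEK_SET: i32 = 0; // offset is relative to the start of the file
pub const SEEK_CUR: i32 = 1; // offset is relative to the current position
pub const SEEK_END: i32 = 2; // offset is relative to the end of the file

/// Hypothetical typed wrapper around the (offset, whence) pair.
pub enum Whence {
    Start(u64),
    Current(i64),
    End(i64),
}

impl Whence {
    /// Raw arguments for the syscall, or None if the offset does not fit
    /// into the signed 64 bit offset the kernel expects.
    pub fn to_raw(&self) -> Option<(i64, i32)> {
        match *self {
            Whence::Start(off) => i64::try_from(off).ok().map(|o| (o, SEEK_SET)),
            Whence::Current(off) => Some((off, SEEK_CUR)),
            Whence::End(off) => Some((off, SEEK_END)),
        }
    }
}
```

A call like lseek(fd, 512, SEEK_SET) corresponds to Whence::Start(512) in this scheme.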
The main function should look like this:
#[no_mangle]
fn main() -> u8 {
use linux::syscall::*;
let fd = open("buffer", O_CREAT|O_APPEND|O_RDWR, S_IRWXU).unwrap();
fallocate(fd, 0, 0, 1024).unwrap();
let pos = lseek(fd, 512, SEEK_SET).unwrap();
println!("Cursor position: {}", pos);
read(0, &mut [0u8]).unwrap();
close(fd).unwrap();
0
}
Let’s start our program and let it block on the read syscall:
> ./cargo.sh run
Cursor position: 512
We can check the status of the file in the proc filesystem like this:
> cat /proc/$(pidof bin)/fdinfo/3
pos: 512
flags: 0102002
mnt_id: 30
ino: 940180
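The flags line is the octal flag word of the open file, so it can be decoded with the constants defined earlier. The sketch below is an illustration only: the table repeats a handful of the O_* values from syscall.rs and status_flag_names is a made-up helper. The lowest two bits are the access mode and have to be masked out separately.

```rust
pub const O_ACCMODE: u64 = 0o3;

/// A few status flags with the values used in this chapter.
static STATUS_FLAGS: [(u64, &str); 6] = [
    (0o2000, "O_APPEND"),
    (0o4000, "O_NONBLOCK"),
    (0o10000, "O_DSYNC"),
    (0o40000, "O_DIRECT"),
    (0o100000, "O_LARGEFILE"),
    (0o2000000, "O_CLOEXEC"),
];

/// Iterate over the names of all table entries whose bit is set in `flags`.
pub fn status_flag_names(flags: u64) -> impl Iterator<Item = &'static str> {
    STATUS_FLAGS
        .iter()
        .filter(move |(bit, _)| (flags & *bit) != 0)
        .map(|(_, name)| *name)
}
```

For 0o102002 the access mode is 2 (O_RDWR) and the remaining bits are O_APPEND and O_LARGEFILE. O_CREAT does not show up because the kernel does not keep creation flags on the open file, and O_LARGEFILE appears although we never asked for it because 64-bit kernels add it on their own.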
Memory
The .text section
Let’s create a small binary without much bloat and check out its memory footprint:
section .text
global _start
_start:
mov rax,0x22
syscall
All it does is provide an entry point for the process and pause the execution by calling the pause system call (number 0x22, i.e. 34, on x86_64).
We can compile, link, run and display the memory of it as follows:
> nasm -f elf64 main.s && ld ./main.o && strip -s ./a.out
> ./a.out & cat /proc/$!/maps
00400000-00401000 r--p 00000000 fd:00 940117 /a.out
00401000-00402000 r-xp 00001000 fd:00 940117 /a.out
7ffc92d30000-7ffc92d51000 rw-p 00000000 00:00 0 [stack]
7ffc92d51000-7ffc92d55000 r--p 00000000 00:00 0 [vvar]
7ffc92d55000-7ffc92d57000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
If you’re unfamiliar with this syntax: $! is a bash variable which contains the process id of the most recently started
background process. With & our process goes into the background so we can use the same terminal to print its memory
mappings, which are exposed by the kernel as a simple file at /proc/<pid>/maps.
The columns above have the following values:
- memory address range
- permissions (r=read, w=write, x=exec, p=private, s=shared)
- file offset (only if the mapping is file-backed)
- device id (major:minor)
- inode id
- either the file name or some human-readable identifier of the memory range
The first two lines of the mapping show us how our binary was mapped:
- 00400000-00401000: the elf header can be found in this read-only region
- 00401000-00402000: this is the .text section of our binary which contains the code to be executed
We can see something similar if we look at the section headers in the file too:
> readelf -W -S ./a.out
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000401000 001000 000007 00 AX 0 0 16
[ 2] .shstrtab STRTAB 0000000000000000 001007 000011 00 0 0 1
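The column layout of /proc/<pid>/maps described above is simple enough to parse by hand. The following is a minimal sketch using the standard library (so it is not meant for our no_std crate as it is); the Mapping struct and the parse_maps_line function are hypothetical names chosen for this example.

```rust
/// One line of /proc/<pid>/maps, split into its columns.
#[derive(Debug, PartialEq)]
pub struct Mapping {
    pub start: u64,
    pub end: u64,
    pub perms: String,
    pub offset: u64,
    pub device: String,
    pub inode: u64,
    /// File name or identifier like "[stack]"; None for anonymous mappings.
    pub name: Option<String>,
}

impl Mapping {
    /// Size of the mapped address range in bytes.
    pub fn size(&self) -> u64 {
        self.end.saturating_sub(self.start)
    }
}

/// Parses a single line, returns None if a mandatory column is missing or malformed.
pub fn parse_maps_line(line: &str) -> Option<Mapping> {
    let mut fields = line.split_whitespace();
    let (start, end) = fields.next()?.split_once('-')?;
    let perms = fields.next()?.to_string();
    let offset = u64::from_str_radix(fields.next()?, 16).ok()?;
    let device = fields.next()?.to_string();
    let inode = fields.next()?.parse::<u64>().ok()?;
    // Everything after the inode is the optional name column.
    let rest: Vec<&str> = fields.collect();
    let name = if rest.is_empty() { None } else { Some(rest.join(" ")) };
    Some(Mapping {
        start: u64::from_str_radix(start, 16).ok()?,
        end: u64::from_str_radix(end, 16).ok()?,
        perms,
        offset,
        device,
        inode,
        name,
    })
}
```

Feeding the .text line of our binary into parse_maps_line gives a mapping of size 0x1000 (one page) with the permissions r-xp and the file offset 0x1000, which matches the Off column of the readelf output.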
The .data section
Let’s create another section in our binary, the .data section, by adding some initialized data to it:
section .data
db "Hello world"
section .text
global _start
_start:
mov rax,34
syscall
If we now run our program we see an extra line about the data section
> nasm -f elf64 main.s && ld ./main.o && strip -s ./a.out
> ./a.out & cat /proc/$!/maps
00400000-00401000 r--p 00000000 fd:00 940117 /a.out
00401000-00402000 r-xp 00001000 fd:00 940117 /a.out
00402000-00403000 rw-p 00002000 fd:00 940117 /a.out
7ffd711d1000-7ffd711f2000 rw-p 00000000 00:00 0 [stack]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
As we can see the 00402000-00403000 section is read-write enabled but it can not be executed.
> readelf -W -S ./a.out
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000401000 001000 000007 00 AX 0 0 16
[ 2] .data PROGBITS 0000000000402000 002000 00000b 00 WA 0 0 4
[ 3] .shstrtab STRTAB 0000000000000000 00200b 000017 00 0 0 1
The .rodata section
Let’s create another section in our binary, the .rodata section, by adding some initialized read-only data to it:
section .rodata
db "Hello world"
section .text
global _start
_start:
mov rax,34
syscall
If we now run our program we see an extra line about the rodata section which is mapped as r--p now.
> nasm -f elf64 main.s && ld ./main.o && strip -s ./a.out
> ./a.out & cat /proc/$!/maps
00400000-00401000 r--p 00000000 fd:00 940149 /a.out
00401000-00402000 r-xp 00001000 fd:00 940149 /a.out
00402000-00403000 r--p 00002000 fd:00 940149 /a.out
7ffc213e4000-7ffc21405000 rw-p 00000000 00:00 0 [stack]
7ffc21558000-7ffc2155c000 r--p 00000000 00:00 0 [vvar]
7ffc2155c000-7ffc2155e000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
The elf file looks like this:
> readelf -W -S ./a.out
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000401000 001000 000007 00 AX 0 0 16
[ 2] .rodata PROGBITS 0000000000402000 002000 00000b 00 A 0 0 4
[ 3] .shstrtab STRTAB 0000000000000000 00200b 000019 00 0 0 1
The .bss section
To reserve some extra space which we can use during the execution of the process we can use the .bss section:
section .bss
resq 1024
section .text
global _start
_start:
mov rax,34
syscall
This creates a buffer which will be initialized with zeros at the startup of the process but it doesn’t take up space
in the binary itself. We can see this section mapped as read-write too, right below where the .data section would be.
Since it’s only logically defined by the executable the new line doesn’t show any relation to the elf file.
> nasm -f elf64 main.s && ld ./main.o && strip -s ./a.out
> ./a.out & cat /proc/$!/maps
00400000-00401000 r--p 00000000 fd:00 940150 /a.out
00401000-00402000 r-xp 00001000 fd:00 940150 /a.out
00403000-00405000 rw-p 00000000 00:00 0
7ffc17944000-7ffc17965000 rw-p 00000000 00:00 0 [stack]
7ffc179c2000-7ffc179c6000 r--p 00000000 00:00 0 [vvar]
7ffc179c6000-7ffc179c8000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
And the elf file:
> readelf -W -S ./a.out
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000401000 001000 000007 00 AX 0 0 16
[ 2] .bss NOBITS 0000000000402000 002000 002000 00 WA 0 0 4
[ 3] .shstrtab STRTAB 0000000000000000 001007 000016 00 0 0 1
The heap
Let’s reserve another type of memory. For heap allocation we need to ask the kernel to move the program break of the
process a bit higher. There is a system call for that called brk(). If it is called with 0 as argument it returns the
current program break of the process, and if it’s called with a valid address that address will be set as the new program break.
The assembly code looks like this:
section .text
global _start
_start:
; old = brk(0);
mov rdi,0x0
mov rax,0xc
syscall
; new = brk(old + 0x1000)
add rax,0x1000
mov rdi,rax
mov rax,0xc
syscall
; pause()
mov rax,34
syscall
If we execute it we see a new line again, called [heap]. Similarly to the .data and .bss sections it is also mapped into
the low address region of the virtual address space, but unlike them its size can be changed. It grows towards
the high memory addresses.
> nasm -f elf64 main.s && ld ./main.o && strip -s ./a.out
> ./a.out & cat /proc/$!/maps
00400000-00401000 r--p 00000000 fd:00 940155 /a.out
00401000-00402000 r-xp 00001000 fd:00 940155 /a.out
009ca000-009cb000 rw-p 00000000 00:00 0 [heap]
7ffd483c0000-7ffd483e1000 rw-p 00000000 00:00 0 [stack]
7ffd483e4000-7ffd483e8000 r--p 00000000 00:00 0 [vvar]
7ffd483e8000-7ffd483ea000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
And the elf file:
> readelf -W -S ./a.out
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000401000 001000 000023 00 AX 0 0 16
[ 2] .shstrtab STRTAB 0000000000000000 001023 000011 00 0 0 1
The stack
We can also change the size of the stack but I don’t know how….
The vdso, vvar and vsyscall
The v in the name of these regions means “virtual”. The vdso region is a dynamic library mapped by the kernel
into the address space of the process and it allows us to call some system calls with faster execution time. Since
calling these functions doesn’t require a context switch into the kernel like a normal system call does, it can provide
a significant performance improvement to our program. Let’s dump the content of it to check the available symbols. We need our pause program again:
section .text
global _start
_start:
mov rax,34
syscall
Let’s check the location of the vdso in the usual way:
> nasm -f elf64 main.s && ld ./main.o && strip -s ./a.out
> ./a.out & pid=$!; cat /proc/$pid/maps
00400000-00401000 r--p 00000000 fd:00 940149 /a.out
00401000-00402000 r-xp 00001000 fd:00 940149 /a.out
00402000-00403000 r--p 00002000 fd:00 940149 /a.out
7ffd2023d000-7ffd2025e000 rw-p 00000000 00:00 0 [stack]
7ffd203ed000-7ffd203f1000 r--p 00000000 00:00 0 [vvar]
7ffd203f1000-7ffd203f3000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
Once we know the start address (0x7ffd203f1000) and the length (0x7ffd203f3000 - 0x7ffd203f1000) of the vdso region we
can use the dd command to dump its content. Note that we need root access to do this.
sudo dd if=/proc/$pid/mem of=vdso bs=1 skip=$((0x7ffd203f1000)) count=$((0x7ffd203f3000 - 0x7ffd203f1000))
After that we can analyse it just like any other shared object file:
> readelf -W -s ./vdso
Symbol table '.dynsym' contains 13 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000c10 5 FUNC WEAK DEFAULT 11 clock_gettime@@LINUX_2.6
2: 0000000000000bd0 5 FUNC GLOBAL DEFAULT 11 __vdso_gettimeofday@@LINUX_2.6
3: 0000000000000c20 99 FUNC WEAK DEFAULT 11 clock_getres@@LINUX_2.6
4: 0000000000000c20 99 FUNC GLOBAL DEFAULT 11 __vdso_clock_getres@@LINUX_2.6
5: 0000000000000bd0 5 FUNC WEAK DEFAULT 11 gettimeofday@@LINUX_2.6
6: 0000000000000be0 42 FUNC GLOBAL DEFAULT 11 __vdso_time@@LINUX_2.6
7: 0000000000000cc0 157 FUNC GLOBAL DEFAULT 11 __vdso_sgx_enter_enclave@@LINUX_2.6
8: 0000000000000be0 42 FUNC WEAK DEFAULT 11 time@@LINUX_2.6
9: 0000000000000c10 5 FUNC GLOBAL DEFAULT 11 __vdso_clock_gettime@@LINUX_2.6
10: 0000000000000000 0 OBJECT GLOBAL DEFAULT ABS LINUX_2.6
11: 0000000000000c90 38 FUNC GLOBAL DEFAULT 11 __vdso_getcpu@@LINUX_2.6
12: 0000000000000c90 38 FUNC WEAK DEFAULT 11 getcpu@@LINUX_2.6
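The start address and the length which we passed to dd by hand can also be looked up programmatically. The sketch below assumes the maps format shown earlier and uses the standard library; find_region is a hypothetical helper name, not something provided by the kernel or by our crate.

```rust
/// Returns (start, length) of the first mapping whose last column equals `name`,
/// e.g. "[vdso]". Lines which cannot be parsed are skipped.
pub fn find_region(maps: &str, name: &str) -> Option<(u64, u64)> {
    for line in maps.lines() {
        let mut fields = line.split_whitespace();
        let range = match fields.next() {
            Some(range) => range,
            None => continue, // empty line
        };
        // For named mappings the name is the last column of the line.
        if fields.last() != Some(name) {
            continue;
        }
        let (start, end) = match range.split_once('-') {
            Some(pair) => pair,
            None => continue,
        };
        let start = match u64::from_str_radix(start, 16) {
            Ok(value) => value,
            Err(_) => continue,
        };
        let end = match u64::from_str_radix(end, 16) {
            Ok(value) => value,
            Err(_) => continue,
        };
        return Some((start, end.saturating_sub(start)));
    }
    None
}
```

For the mappings above find_region(maps, "[vdso]") returns the start address 0x7ffd203f1000 and the length 0x2000, which are the values of the skip and count arguments of the dd command.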
Stack
Stack overflow
A common source of stack overflows is a recursive function which never returns.
Alignment
Now that we have implemented a couple of handy helper methods we can go back to the question from chapter one: what
is the options(nostack) in the assembly block of _start used for? Let’s put a panic into the main function:
#[no_mangle]
fn main() -> u8 {
panic!();
}
and execute the program
> ./cargo.sh build
> ./target/bin
panicked at ./bin.rs:9:5:
explicit panic
It all looks fine, right? But what happens if you remove the nostack option from the assembly block of the _start function?
> ./cargo.sh build
> ./target/bin
Segmentation fault (core dumped)
The process crashes with segfault. Let’s analyse that in gdb:
gdb ./target/bin
(gdb) set disassembly-flavor intel
(gdb) run
Starting program: /home/taabodal/work/blog/src/chapter-02/target/bin
Program received signal SIGSEGV, Segmentation fault.
0x0000000000402557 in rust_begin_unwind ()
(gdb) disassemble
Dump of assembler code for function rust_begin_unwind:
...
0x0000000000402552 <+50>: movups xmm0,XMMWORD PTR [rsp+0x78]
=> 0x0000000000402557 <+55>: movaps XMMWORD PTR [rsp+0x60],xmm0
0x000000000040255c <+60>: movaps xmm0,XMMWORD PTR [rsp+0x60]
0x0000000000402561 <+65>: movaps XMMWORD PTR [rsp+0x50],xmm0
...
End of assembler dump.
From the output above I removed some lines to make it easier to digest. The process crashes at the instruction
movaps XMMWORD PTR [rsp+0x60],xmm0. The movups instruction on the line above seems to be quite similar but it doesn’t crash.
Let’s look up what these instructions are doing:
- movaps: Move Aligned Packed Single Precision Floating-Point Values
- movups: Move Unaligned Packed Single Precision Floating-Point Values
The key difference between these two is the alignment of the memory address. While movups doesn’t expect any alignment
of the memory address, movaps expects it to be 16, 32 or 64 byte aligned, depending on the encoding:
When the source or destination operand is a memory operand, the operand must be aligned on a 16-byte (128-bit version), 32-byte (VEX.256 encoded version) or 64-byte (EVEX.512 encoded version) boundary or a general-protection exception (#GP) will be generated.
The instruction which crashes the process uses [rsp+0x60] as memory address. 0x60 is 16 byte aligned but what is the value
of the rsp register? Let’s go back to gdb and print the current value of the register with:
(gdb) info registers rsp
rsp 0x7fffffffe828 0x7fffffffe828
It seems like we have found the reason: the value of rsp is not 16 byte aligned, so [rsp+0x60] won’t be 16 byte aligned
either, which causes the processor to raise a general-protection exception.
That’s all nice but if the alignment of the memory address is so important then why does the compiler not check if rsp
is in a good state before using movaps? As always the System V ABI
has the answer to this question. In section 3.2.2 The Stack Frame it says:
The end of the input argument area shall be aligned on a 16 byte boundary. In other words, the value (%rsp + 8) is always a multiple of 16 when control is transferred to the function entry point.
Since it’s documented in the calling convention of the ABI the compiler can assume that rsp is 16 byte aligned right
before a function is called. So as long as the function itself doesn’t do a stack operation which misaligns the stack, it remains 16 byte aligned.
Let’s go back to gdb and check the stack alignment throughout our process:
> gdb ./target/bin
(gdb) set disassembly-flavor intel
(gdb) break _start
(gdb) break rust_begin_unwind
(gdb) run
Breakpoint 1, 0x0000000000402500 in _start ()
(gdb) info registers rsp
rsp 0x7fffffffe960 0x7fffffffe960
(gdb) continue
Breakpoint 2, 0x0000000000402520 in rust_begin_unwind ()
(gdb) info registers rsp
rsp 0x7fffffffe8b0 0x7fffffffe8b0
(gdb) disassemble
Dump of assembler code for function rust_begin_unwind:
=> 0x0000000000402520 <+0>: sub rsp,0x88
0x0000000000402527 <+7>: mov QWORD PTR [rsp+0x10],rdi
...
As we can see the rsp register is 16 byte aligned at the beginning of both the _start and the rust_begin_unwind functions.
The problem seems to come after that: the first instruction of the rust_begin_unwind function subtracts 0x88 from
the stack pointer, which becomes unaligned this way. But why does it do that if it knows that movaps needs 16 byte alignment?
The reason is that rsp has to be 16 byte aligned before the call
instruction is executed. Since the call instruction pushes the return address (the current value of the instruction pointer, rip),
which is 8 bytes long, onto the stack, the compiler needs to compensate for this at the beginning of every function. So the sub rsp,0x88 should
actually make rsp 16 byte aligned again, which means that the stack wasn’t in the state the ABI prescribes when the rust_begin_unwind function
was entered. To find out where it got misaligned we need to go up the stack frames and check the rsp registers. Let’s
see what the stack frames look like:
(gdb) backtrace
#0 0x0000000000402520 in rust_begin_unwind ()
#1 0x0000000000401033 in core::panicking::panic_fmt () at library/core/src/panicking.rs:72
#2 0x00000000004010dc in core::panicking::panic () at library/core/src/panicking.rs:146
#3 0x000000000040128d in main ()
(gdb) up
#1 0x0000000000401033 in core::panicking::panic_fmt () at library/core/src/panicking.rs:72
72 in library/core/src/panicking.rs
(gdb) info registers rsp
rsp 0x7fffffffe8b8 0x7fffffffe8b8
(gdb) up
#2 0x00000000004010dc in core::panicking::panic () at library/core/src/panicking.rs:146
146 in library/core/src/panicking.rs
(gdb) info registers rsp
rsp 0x7fffffffe8f8 0x7fffffffe8f8
(gdb) up
#3 0x000000000040128d in main ()
(gdb) info registers rsp
rsp 0x7fffffffe948 0x7fffffffe948
If we go up the stack frames we can check the value of the registers right before each call instruction. The bad news is that
rsp seems to be misaligned already in the main function. This means that the whole call chain runs on a misaligned stack. Since the main
function is called from our _start function let’s investigate that one:
> gdb ./target/bin
(gdb) set disassembly-flavor intel
(gdb) break _start
(gdb) run
(gdb) disassemble
Dump of assembler code for function _start:
=> 0x0000000000402500 <+0>: push rax
0x0000000000402501 <+1>: call 0x401270 <main>
0x0000000000402506 <+6>: mov rdi,rax
0x0000000000402509 <+9>: mov rax,0x3c
0x0000000000402510 <+16>: syscall
0x0000000000402512 <+18>: ud2
We seem to have the same construct here. The first instruction of the function, push rax, realigns the stack and after that
main is called. The only difference is that the _start function is the entry point of our code, so it has never
been called, and as such this is the only function which starts with a 16 byte aligned stack. As a result the first
instruction, which was meant to compensate for the misalignment of the stack, becomes the cause of the misalignment.
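The arithmetic behind this can be replayed without a debugger. The following sketch only models how push, call and sub change the stack pointer; the function names are made up for this example and the addresses are the ones gdb printed above.

```rust
/// True if `addr` is a multiple of 16.
pub fn is_aligned16(addr: u64) -> bool {
    addr % 16 == 0
}

/// rsp after `push <reg>`: 8 bytes are stored below the old stack pointer.
pub fn push(rsp: u64) -> u64 {
    rsp - 8
}

/// rsp at the entry of the callee: `call` pushes the 8 byte return address.
pub fn call(rsp: u64) -> u64 {
    rsp - 8
}

/// rsp after `sub rsp,<bytes>`.
pub fn sub(rsp: u64, bytes: u64) -> u64 {
    rsp - bytes
}

/// The ABI rule quoted above: (rsp + 8) is a multiple of 16 at function entry.
pub fn entry_is_abi_conform(rsp_at_entry: u64) -> bool {
    is_aligned16(rsp_at_entry + 8)
}
```

With these helpers call(0x7fffffffe960) describes the nostack variant of _start: main is entered with rsp = 0x7fffffffe958, which satisfies the ABI rule. call(push(0x7fffffffe960)) describes the variant with the extra push rax: main is entered with rsp = 0x7fffffffe950, which violates it. The same mistake is still visible further down: sub(0x7fffffffe8b0, 0x88) is 0x7fffffffe828, exactly the misaligned rsp we saw at the crashing movaps.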
So let’s get back to the options(nostack). The documentation says:
The asm! block does not push data to the stack, or write to the stack red-zone (if supported by the target). If this option is not used then the stack pointer is guaranteed to be suitably aligned (according to the target ABI) for a function call.
If we compile and dump the _start function with nostack option enabled then we get the working assembly code:
> ./cargo.sh build
> ./cargo.sh dump _start
0000000000402500 <_start>:
402500: e8 6b ed ff ff call 401270 <main>
402505: 48 89 c7 mov rdi,rax
402508: 48 c7 c0 3c 00 00 00 mov rax,0x3c
40250f: 0f 05 syscall
402511: 0f 0b ud2
and without this option we get the crashing assembly code:
> ./cargo.sh build
> ./cargo.sh dump _start
0000000000402500 <_start>:
402500: 50 push rax
402501: e8 6a ed ff ff call 401270 <main>
402506: 48 89 c7 mov rdi,rax
402509: 48 c7 c0 3c 00 00 00 mov rax,0x3c
402510: 0f 05 syscall
402512: 0f 0b ud2
So what’s happening here, isn’t it exactly the opposite of what the documentation says? The answer is no.
The compiler guarantees the stack alignment the right way for a function which has been called. But it has no knowledge of
the fact that the _start code never gets called. It thinks that it’s a function just like any other. We can prove this
by moving the call of main outside of the assembly block. The compiler will generate the push rax even if the assembly block
has the nostack option enabled.
#[no_mangle]
fn _start() -> ! {
extern "C" { fn main() -> u8; }
unsafe { main(); }
unsafe {
core::arch::asm!(
"mov rdi,rax",
"mov rax,0x3c",
"syscall",
options(nostack, noreturn),
)
}
}
> ./cargo.sh build
> ./cargo.sh dump _start
0000000000402500 <_start>:
402500: 50 push rax
402501: 48 8d 05 68 ed ff ff lea rax,[rip+0xffffffffffffed68] # 401270 <main>
402508: ff d0 call rax
40250a: 48 89 c7 mov rdi,rax
40250d: 48 c7 c0 3c 00 00 00 mov rax,0x3c
402514: 0f 05 syscall
402516: 0f 0b ud2
Now that we agreed that stack alignment is important let’s make it permanent to avoid this bug in the future.
The simplest way to clear the lowest 4 bits of the stack pointer, and thereby round it down to a multiple of 16,
is and rsp,-0x10. Let’s add this to the beginning of the asm block:
#[no_mangle]
fn _start() -> ! {
    unsafe {
        core::arch::asm!(
            // Round the stack pointer down to a multiple of 16.
            "and rsp,-0x10",
            "call main",
            "mov rdi,rax",
            "mov rax,0x3c",
            "syscall",
            options(nostack, noreturn),
        )
    }
}
Now it should work even without the nostack option because the generated first instruction push rax will simply have
no effect on our code. It’s better to use the and instruction instead of sub or pop here because
sub and pop would move the stack pointer by 8 bytes in every case, while the and instruction only modifies rsp if it wasn’t
aligned.
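To see why and is the more robust choice, here is the same operation on plain integers. It is a small sketch with made-up function names; the two addresses are the values rsp has in _start with and without the extra push rax.

```rust
/// Rounds `rsp` down to a multiple of 16, like `and rsp,-0x10` does.
pub fn align_down16(rsp: u64) -> u64 {
    // -0x10 as a 64 bit two's complement number is 0xffff_ffff_ffff_fff0,
    // so the and clears the lowest 4 bits and keeps all the others.
    rsp & (-0x10i64 as u64)
}

/// What `sub rsp,8` would do instead: it always moves the stack pointer.
pub fn sub8(rsp: u64) -> u64 {
    rsp - 8
}
```

align_down16 turns the misaligned 0x7fffffffe958 into 0x7fffffffe950 and leaves the already aligned 0x7fffffffe960 untouched. sub8 fixes the first value too, but it breaks the second one by turning it into 0x7fffffffe958.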
Last but not least the System V ABI also says that the user space code is responsible for cleaning up the rbp register:
The content of this register is unspecified at process initialization time, but the user code should mark the deepest stack frame by setting the frame pointer to zero.
So let’s do that by adding an extra assembly line xor rbp,rbp.
#[no_mangle]
fn _start() -> ! {
    unsafe {
        core::arch::asm!(
            // Mark the deepest stack frame by clearing the frame pointer.
            "xor rbp,rbp",
            // Round the stack pointer down to a multiple of 16.
            "and rsp,-0x10",
            "call main",
            "mov rdi,rax",
            "mov rax,0x3c",
            "syscall",
            options(nostack, noreturn),
        )
    }
}
The main function
Let’s print our args like this:
cat -E -T /proc/self/cmdline | tr '\000' '\n'
cat
-E
-T
/proc/self/cmdline
In a C program we get the command line arguments directly from the main function like this
int main(int argc, char **argv);
If we also need to access the environment variables we can extend the function signature like this
int main(int argc, char **argv, char **envp);
Or further extend it to get the auxiliary information passed to the process like this:
int main(int argc, char **argv, char **envp, auxv_t *auxv);
But where does this information come from? To be able to answer this question we need to go back to the System V ABI and read section 3.4.1 Initial Stack and Register State. It says that the stack of the process will be initialized as follows:
- Unspecified block
- Info block: the command line arguments and environment variables are copied here
- Unspecified block
- End of auxiliary vector (null entry)
- Auxiliary vector entries (auxv_t *auxv)
- End of environment pointer vector (null pointer)
- Environment pointer vector entries (char **envp)
- End of argument pointer vector (null pointer)
- Argument pointer vector (char **argv)
- Argument pointer vector length (int argc)
The argument and environment pointer vectors are just arrays of pointers pointing into the Info block of the stack. To check their values we can use gdb like this:
> gdb --args ./target/bin --arg1 --arg2
(gdb) break _start
(gdb) run
(gdb) x/8s *(char**)($rsp + 8)
0x7fffffffebd6: "/blog/src/chapter-03/target/bin"
0x7fffffffec09: "--arg1"
0x7fffffffec10: "--arg2"
0x7fffffffec17: "SHELL=/bin/bash"
0x7fffffffec27: "LESS=-RSF"
0x7fffffffec31: "TERM_PROGRAM_VERSION=3.2a"
0x7fffffffec4b: "TMUX=/tmp/tmux-1066129479/default,2230,8"
0x7fffffffec74: "EDITOR=vim"
The x command lets you examine a memory location of the program and /8s specifies that 8 strings should be displayed.
$rsp + 8 is the location of char **argv, so if you cast it and dereference it you get the wanted memory location.
After the list of command line arguments we can see a list of environment variables. Feel free to play around with the x
command of gdb if you’re unfamiliar with it. You can get help for it with help x.
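To make the layout of the initial stack a bit more tangible, the sketch below walks over a fake stack which is stored as a slice of 8 byte words, with rsp pointing at index 0. StackLayout and layout are hypothetical names for this example, and the pointer values used when trying it out are invented; only the structure (argc, argv, null, envp, null, auxv) follows the list above.

```rust
/// Word indices (relative to rsp) of the vectors on the initial stack.
#[derive(Debug, PartialEq)]
pub struct StackLayout {
    pub argc: usize,
    /// Index of argv[0].
    pub argv: usize,
    /// Index of envp[0].
    pub envp: usize,
    /// Index of the first auxiliary vector entry.
    pub auxv: usize,
}

/// Computes where the vectors start, or None if the slice is too short.
pub fn layout(stack: &[usize]) -> Option<StackLayout> {
    // The word at rsp is the number of arguments.
    let argc = *stack.first()?;
    // The argument pointers follow directly after argc.
    let argv: usize = 1;
    // Skip the argc argument pointers and their terminating null pointer.
    let envp = argv.checked_add(argc)?.checked_add(1)?;
    // The environment vector has no length field, it ends at the first null pointer.
    let env_len = stack.get(envp..)?.iter().position(|&ptr| ptr == 0)?;
    let auxv = envp + env_len + 1;
    Some(StackLayout { argc, argv, envp, auxv })
}
```

For a stack holding 3 arguments and 2 environment variables this yields argv at word 1, envp at word 5 and auxv at word 8. Multiplied by the word size of 8 bytes, envp is located at rsp + 8 * (argc + 2).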
Command line arguments
Let’s try to implement a C like command line argument handling. As we have learned in chapter 2 the C ABI uses the rdi,
rsi, rdx, rcx, r8 and r9 registers to pass the arguments to a function, so to pass argc and argv to main
we just need to fill these registers with the values we can find on the stack. Let’s rewrite our _start function like this:
#[no_mangle]
fn _start() -> ! {
    unsafe {
        core::arch::asm!(
            "and rsp,-16",
            "mov rdi,[rsp]",
            "lea rsi,[rsp+8]",
            "call main",
            "mov rdi,rax",
            "mov rax,0x3c",
            "syscall",
            options(nostack, noreturn),
        )
    }
}
If you look at the assembly code there is an important difference between argc (rdi) and argv (rsi): argc is
passed by value while argv is passed by reference. As such in case of argc we load the value pointed to by rsp
into rdi with the instruction mov rdi,[rsp]. As opposed to this, the lea
instruction doesn’t load the value, it just calculates the memory address rsp+8 and puts this address into rsi.
As a result argc can be interpreted as an integer value while argv can be interpreted as a pointer to an array of strings.
We can now rewrite the main function like this:
#[no_mangle]
fn main(argc: usize, argv: *const *const i8) -> u8 {
use core::convert::TryInto;
for offset in 0 .. argc {
unsafe {
let ptr = *argv.offset(offset as isize);
println!("{}", core::ffi::CStr::from_ptr(ptr).to_str().unwrap());
}
}
0
}
And if we try to compile we can see an almost expected error message about the missing strlen symbol:
> ./cargo.sh run
error: linking with `cc` failed: exit status: 1
= note: /usr/bin/ld: target/bin.bin.97e806d2324bed6f-cgu.0.rcgu.o: in function `core::ffi::c_str::CStr::from_ptr':
bin.97e806d2324bed6f-cgu.0:(.text._ZN4core3ffi5c_str4CStr8from_ptr17hac38e50840c901dfE+0xc): undefined reference to `strlen'
So let’s add strlen to the ffi module:
#[no_mangle]
fn strlen(buf: *const u8) -> usize {
    let mut len = 0;
    // Walk the buffer until the terminating null byte is found.
    while unsafe { *buf.add(len) != 0 } {
        len += 1;
    }
    len
}
And now we have access to the command line arguments:
> ./cargo.sh build
> ./target/bin arg1 arg2
./target/bin
arg1
arg2
It works but this way the main function needs an unsafe block to access the arguments. I think we can do better.
The Rust standard library provides an args() function which returns an
Args struct which implements the
Iterator trait, so one can iterate over the arguments without
the need for unsafe blocks. Let’s take that as an example and implement our env module. Let’s create a new file called env.rs
and include it into linux.rs with pub mod env;.
To be able to do some initialization we won’t call the main function directly from _start but we will implement a __rust_main
function (just like we have seen __libc_main in the first chapter) and do the process initialization there. Let’s do that
by modifying the linux.rs file like this:
extern "C" { fn main() -> u8; }
#[no_mangle]
fn _start() -> ! {
unsafe {
core::arch::asm!(
"xor rbp,rbp",
"and rsp,-16",
"mov rdi,rsp",
"call __rust_main",
"mov rdi,rax",
"mov rax,0x3c",
"syscall",
options(nostack, noreturn),
)
}
}
#[no_mangle]
fn __rust_main(rsp: isize) -> u8 {
unsafe { main() }
}
Once we start writing Rust code it’s really hard to get a pointer to the beginning of the stack where argc, argv, etc.
are located so we pass this pointer directly from assembly to our __rust_main function as an argument. The rest of the pointer
operations can be done via the Rust interface. The main function can be rewritten like this:
#[no_mangle]
fn main() -> u8 { 0 }
Let’s add the logic to store the pointer to argv, which can later be used to implement the env::args() function.
use core::sync::atomic::{AtomicPtr, Ordering};
pub(crate) static ARGV: AtomicPtr<*const i8> = AtomicPtr::new(core::ptr::null_mut());
#[no_mangle]
fn __rust_main(rsp: *const u8) -> u8 {
    let argv = unsafe { rsp.offset(8) as *mut *const i8 };
    ARGV.store(argv, Ordering::Relaxed);
    unsafe { main() }
}
The env.rs looks like this:
use core::ffi::CStr;
use core::sync::atomic::Ordering;
pub struct Pointers {
    next: isize,
    ptrs: *const *const i8,
}
impl core::iter::Iterator for Pointers {
    type Item = &'static str;
    fn next(&mut self) -> Option<Self::Item> {
        unsafe {
            let ptr = *self.ptrs.offset(self.next);
            self.next += 1;
            match ptr.is_null() {
                true => None,
                false => CStr::from_ptr(ptr).to_str().ok()
            }
        }
    }
}
pub fn args() -> Pointers {
    Pointers {
        next: 0,
        ptrs: crate::ARGV.load(Ordering::Relaxed),
    }
}
And we can reimplement the main function as follows:
#[no_mangle]
fn main() -> u8 {
for arg in linux::env::args() {
println!("{}", arg);
}
0
}
Run the program like this:
> ./cargo.sh build
> ./target/bin a1 a2
./target/bin
a1
a2
Environment variables
Let’s print our env like this:
> cat /proc/self/environ | tr '\000' '\n'
SHELL=/bin/bash
LESS=-RSF
TERM_PROGRAM_VERSION=3.2a
EDITOR=vim
....
_=/usr/bin/cat
We already have almost everything to get access to the environment variables of our process. Let’s update the startup logic like this:
pub(crate) static ENVP: AtomicPtr<*const i8> = AtomicPtr::new(core::ptr::null_mut());
#[no_mangle]
fn __rust_main(rsp: *const u8) -> u8 {
    let argc = unsafe { *(rsp as *const isize) };
    let argv = unsafe { rsp.offset(8) as *mut *const i8 };
    let envp = unsafe { rsp.offset(8 + 8 + argc * 8) as *mut *const i8 };
    ARGV.store(argv, Ordering::Relaxed);
    ENVP.store(envp, Ordering::Relaxed);
    unsafe { main() }
}
The environment logic looks like this:
pub fn envp() -> Pointers {
    Pointers {
        next: 0,
        ptrs: crate::ENVP.load(Ordering::Relaxed),
    }
}
The main function looks like this:
#[no_mangle]
fn main() -> u8 {
for arg in linux::env::envp() {
println!("{}", arg);
}
0
}
So we can print the environment variables like this:
> ./cargo.sh build
> ./target/bin
SHELL=/bin/bash
LESS=-RSF
TERM_PROGRAM_VERSION=3.2a
TMUX=/tmp/tmux-1066129479/default,2230,8
EDITOR=vim
...
Apart from that the standard library provides two neat functions called vars
and var. Let’s implement those too by adding the following to env.rs:
pub struct Variables {
    ptrs: Pointers,
}
impl core::iter::Iterator for Variables {
    type Item = (&'static str, &'static str);
    fn next(&mut self) -> Option<Self::Item> {
        self.ptrs.next().and_then(|s| s.split_once('='))
    }
}
pub fn vars() -> Variables {
    Variables { ptrs: envp() }
}
pub fn var(key: &str) -> Option<&'static str> {
    vars().find(|(k, _)| *k == key).map(|(_, v)| v)
}
After that we can update the main function like this:
#[no_mangle]
fn main() -> u8 {
println!("MYVAR={:?}", linux::env::var("MYVAR"));
0
}
But this time we get a missing symbol error on compilation:
> ./cargo.sh build
= note: /usr/bin/ld: /home/taabodal/work/blog/src/chapter-03/target/liblinux.rlib(liblinux.linux.77104c24dad4cdd3-cgu.0.rcgu.o): in function `<[A] as core::slice::cmp::SlicePartialEq<B>>::equal':
linux.77104c24dad4cdd3-cgu.0:(.text._ZN73_$LT$$u5b$A$u5d$$u20$as$u20$core..slice..cmp..SlicePartialEq$LT$B$GT$$GT$5equal17h27d80543cacf2715E+0x38): undefined reference to `memcmp'
So let’s implement memcmp by putting the following code into the ffi.rs module:
#[no_mangle]
unsafe fn memcmp(s1: *const u8, s2: *const u8, len: usize) -> i32 {
    for idx in 0 .. len {
        unsafe {
            let b1 = s1.add(idx).read();
            let b2 = s2.add(idx).read();
            if b1 != b2 {
                // Widen before subtracting: b1 - b2 on u8 would underflow if b1 < b2.
                return i32::from(b1) - i32::from(b2);
            }
        }
    }
    0
}
So we can run our program like this:
> ./cargo.sh build
> ./target/bin
MYVAR=None
> MYVAR="hello world" ./target/bin
MYVAR=Some("hello world")
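The lookup which var performs can also be tried in isolation on a plain list of strings. This is a sketch using the standard library, and lookup is a hypothetical name for this example. One deliberate difference to the Variables iterator above: an entry without a = sign is skipped here, while Variables stops the iteration at such an entry.

```rust
/// Finds the value of `key` in a list of "KEY=value" entries.
pub fn lookup<'a>(env: &[&'a str], key: &str) -> Option<&'a str> {
    env.iter()
        // Split at the first '=' only, the value itself may contain more of them.
        .filter_map(|&entry| entry.split_once('='))
        .find(|(k, _)| *k == key)
        .map(|(_, v)| v)
}
```

Since split_once splits at the first = sign only, an entry like A=b=c is returned as the key A with the value b=c.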
Auxiliary vector
LD Magic:
> LD_DEBUG=bindings python
> LD_SHOW_AUXV=1 cat /dev/null
https://cseweb.ucsd.edu/~gbournou/CSE131/the_inside_story_on_shared_libraries_and_dynamic_loading.pdf
Let’s check out the auxv passed by the kernel to the cat command:
> LD_SHOW_AUXV=1 cat /dev/null
AT_SYSINFO_EHDR: 0x7ffeeb1dd000
AT_MINSIGSTKSZ: 3632
AT_HWCAP: f8bfbff
AT_PAGESZ: 4096
AT_CLKTCK: 100
AT_PHDR: 0x555a0920b040
AT_PHENT: 56
AT_PHNUM: 13
AT_BASE: 0x7f73bbebc000
AT_FLAGS: 0x0
AT_ENTRY: 0x555a0920e760
AT_UID: 1066129479
AT_EUID: 1066129479
AT_GID: 1065878017
AT_EGID: 1065878017
AT_SECURE: 0
AT_RANDOM: 0x7ffeeb0d43d9
AT_HWCAP2: 0x2
AT_EXECFN: /usr/bin/cat
AT_PLATFORM: x86_64
A slightly lower level way to do the same is:
> od -t x8 /proc/self/auxv
0000000 0000000000000021 00007fff77dbd000
0000020 0000000000000033 0000000000000e30
0000040 0000000000000010 000000000f8bfbff
0000060 0000000000000006 0000000000001000
0000100 0000000000000011 0000000000000064
0000120 0000000000000003 00005633999f3040
0000140 0000000000000004 0000000000000038
0000160 0000000000000005 000000000000000d
0000200 0000000000000007 00007f43dc1f2000
0000220 0000000000000008 0000000000000000
0000240 0000000000000009 00005633999f6be0
0000260 000000000000000b 000000003f8bd847
0000300 000000000000000c 000000003f8bd847
0000320 000000000000000d 000000003f880201
0000340 000000000000000e 000000003f880201
0000360 0000000000000017 0000000000000000
0000400 0000000000000019 00007fff77d0daf9
0000420 000000000000001a 0000000000000002
0000440 000000000000001f 00007fff77d0dfec
0000460 000000000000000f 00007fff77d0db09
0000500 0000000000000000 0000000000000000
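Each line of this dump is one entry of the auxiliary vector: an 8 byte type followed by an 8 byte value, and the list is closed by an AT_NULL entry. The sketch below decodes such a buffer using the standard library; parse_auxv and auxv_name are hypothetical names for this example and only a few of the AT_* constants are listed.

```rust
use std::convert::TryInto;

/// Human readable name of some auxiliary vector types.
pub fn auxv_name(a_type: u64) -> &'static str {
    match a_type {
        0 => "AT_NULL",
        3 => "AT_PHDR",
        4 => "AT_PHENT",
        5 => "AT_PHNUM",
        6 => "AT_PAGESZ",
        7 => "AT_BASE",
        9 => "AT_ENTRY",
        16 => "AT_HWCAP",
        17 => "AT_CLKTCK",
        25 => "AT_RANDOM",
        31 => "AT_EXECFN",
        33 => "AT_SYSINFO_EHDR",
        51 => "AT_MINSIGSTKSZ",
        _ => "unknown",
    }
}

/// Decodes (type, value) pairs in native byte order until AT_NULL is reached.
pub fn parse_auxv(bytes: &[u8]) -> Vec<(u64, u64)> {
    bytes
        .chunks_exact(16)
        .map(|pair| {
            // chunks_exact guarantees 16 bytes, so both conversions succeed.
            let a_type = u64::from_ne_bytes(pair[..8].try_into().unwrap());
            let a_val = u64::from_ne_bytes(pair[8..].try_into().unwrap());
            (a_type, a_val)
        })
        .take_while(|&(a_type, _)| a_type != 0)
        .collect()
}
```

Applied to the dump above, the first entry (0x21) is AT_SYSINFO_EHDR, which holds the address of the vdso, and the entry 0x06 with the value 0x1000 is AT_PAGESZ with a page size of 4096 bytes, just like in the LD_SHOW_AUXV output.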
pub struct AuxVector {
    next: isize,
    buf: *const auxv_t,
}
impl core::iter::Iterator for AuxVector {
    type Item = AT;
    fn next(&mut self) -> Option<Self::Item> {
        let aux = unsafe { *self.buf.offset(self.next) };
        self.next += 1;
        match AT::from(aux){
            AT::AT_NULL => None,
            other => Some(other),
        }
    }
}
pub fn auxv() -> AuxVector {
    AuxVector {
        next: 0,
        buf: crate::AUXV.load(Ordering::Relaxed),
    }
}
#[no_mangle]
unsafe fn __rust_main(rsp: *const u8) -> u8 {
    parse_stack(rsp);
    //let ldso = ldso::Ldso::new();
    //ldso.relocate_ldso();
    //ldso.relocate_exe();
    main()
}
Memory management
- mmap, mremap, munmap
- brk
- msync
- mprotect, mincore
brk
Let’s write some code like this and investigate the memory footprint of our program:
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
use linux::syscall::*;
#[no_mangle]
fn main() -> u8 {
println!("pid: {}", getpid().unwrap());
let _ = pause();
0
}
This small program prints the process id of our program and pauses the execution so we can check out the memory mappings:
> ./cargo.sh run
pid: 1320734
Let’s check out the mappings in the proc file system by using the pid like this:
> cat /proc/1320734/maps
00400000-00401000 r--p 00000000 fd:00 950935 /target/bin
00401000-00404000 r-xp 00001000 fd:00 950935 /target/bin
00404000-00405000 r--p 00004000 fd:00 950935 /target/bin
00406000-00407000 rw-p 00005000 fd:00 950935 /target/bin
00407000-00408000 rw-p 00000000 00:00 0
7ffe1a300000-7ffe1a321000 rw-p 00000000 00:00 0 [stack]
7ffe1a3f2000-7ffe1a3f6000 r--p 00000000 00:00 0 [vvar]
7ffe1a3f6000-7ffe1a3f8000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
Let’s break down what this file tells us:
- Our binary is mapped into the low address range as four segments with different permissions:
- read-only
- read-exec
- read-only
- read-write
- Right after it there is an anonymous read-write region without a backing file (most likely the zero-initialized data, .bss)
- In the high address range we have
- stack
- vvar
- vdso
- vsyscall
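To make the columns of the maps file concrete, here is a minimal sketch of a parser for a single line. It uses the standard library instead of our no_std crate, and the Mapping struct and parse_maps_line are names invented for this illustration.

```rust
/// One line of /proc/<pid>/maps, split into the columns we care about.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
struct Mapping {
    start: u64,
    end: u64,
    perms: String,
    offset: u64,
    path: Option<String>,
}

/// Parses "start-end perms offset dev inode [path]".
/// Simplification: a path containing spaces is cut at the first space.
fn parse_maps_line(line: &str) -> Option<Mapping> {
    let mut cols = line.split_whitespace();
    let (start, end) = cols.next()?.split_once('-')?;
    let perms = cols.next()?.to_string();
    let offset = u64::from_str_radix(cols.next()?, 16).ok()?;
    let _dev = cols.next()?;
    let _inode = cols.next()?;
    let path = cols.next().map(str::to_string);
    Some(Mapping {
        start: u64::from_str_radix(start, 16).ok()?,
        end: u64::from_str_radix(end, 16).ok()?,
        perms,
        offset,
        path,
    })
}

fn main() {
    let line = "00401000-00404000 r-xp 00001000 fd:00 950935 /target/bin";
    let m = parse_maps_line(line).expect("valid maps line");
    println!("{:?} spans {} bytes", m.path, m.end - m.start);
}
```

Anonymous regions such as the one after the binary simply have no path column, which the sketch models as None.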
Allocating memory
Let’s modify our main function like this and run our program:
#[no_mangle]
fn main() -> u8 {
println!("pid: {}", getpid().unwrap());
brk(brk(0) + 4096);
let _ = pause();
0
}
The mappings have changed like this:
> cat /proc/1321009/maps
00400000-00401000 r--p 00000000 fd:00 950935 /target/bin
00401000-00404000 r-xp 00001000 fd:00 950935 /target/bin
00404000-00405000 r--p 00004000 fd:00 950935 /target/bin
00406000-00407000 rw-p 00005000 fd:00 950935 /target/bin
00407000-00408000 rw-p 00000000 00:00 0
004cd000-004ce000 rw-p 00000000 00:00 0 [heap]
7ffc131b5000-7ffc131d6000 rw-p 00000000 00:00 0 [stack]
7ffc131ed000-7ffc131f1000 r--p 00000000 00:00 0 [vvar]
7ffc131f1000-7ffc131f3000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
There is a new region called [heap], mapped as a private region with read and write permissions (rw-p).
So let’s use that space as a buffer to read the mappings from the proc filesystem:
#![no_std]
#![no_main]
extern crate linux;
use linux::syscall::*;
use linux::constants::*;
#[no_mangle]
fn main() -> u8 {
let len = 4096;
let old = brk(0);
let _ = brk(old + len) as *mut u8;
let mut buf = unsafe {
core::slice::from_raw_parts_mut(old as *mut u8, len as usize)
};
let fd = open("/proc/self/maps", O_RDONLY, 0).unwrap();
loop {
let len = read(fd, &mut buf).unwrap();
// A short read doesn't mean end of file; only a read of 0 bytes does.
if len == 0 {
break;
}
let _ = write(1, &buf[..len]).unwrap();
}
0
}
It works like this:
> ./cargo.sh run
00400000-00401000 r--p 00000000 fd:00 950935 /target/bin
00401000-00404000 r-xp 00001000 fd:00 950935 /target/bin
00404000-00406000 r--p 00004000 fd:00 950935 /target/bin
00406000-00407000 rw-p 00005000 fd:00 950935 /target/bin
00407000-00408000 rw-p 00000000 00:00 0
017c8000-017c9000 rw-p 00000000 00:00 0 [heap]
7ffc3c050000-7ffc3c071000 rw-p 00000000 00:00 0 [stack]
7ffc3c079000-7ffc3c07d000 r--p 00000000 00:00 0 [vvar]
7ffc3c07d000-7ffc3c07f000 r-xp 00000000 00:00 0 [vdso]
ffffffffff600000-ffffffffff601000 --xp 00000000 00:00 0 [vsyscall]
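A read can return fewer bytes than the buffer holds without having reached the end of the file, so a robust copy loop stops only when read returns 0. The sketch below shows that loop with the standard library instead of our own syscall wrappers; copy_in_chunks is a hypothetical helper name.

```rust
use std::fs::File;
use std::io::{self, Read, Write};

/// Copies `src` to `dst` through a fixed-size buffer and returns the byte count.
fn copy_in_chunks<R: Read, W: Write>(mut src: R, mut dst: W, chunk: usize) -> io::Result<usize> {
    let mut buf = vec![0u8; chunk];
    let mut total = 0;
    loop {
        let len = src.read(&mut buf)?;
        if len == 0 {
            break; // 0 bytes read means end of file
        }
        dst.write_all(&buf[..len])?;
        total += len;
    }
    Ok(total)
}

fn main() {
    // /proc/self/maps only exists on Linux, so a failure to open it is just reported.
    match File::open("/proc/self/maps") {
        Ok(file) => {
            copy_in_chunks(file, io::stdout(), 4096).expect("copy failed");
        }
        Err(err) => eprintln!("cannot open /proc/self/maps: {}", err),
    }
}
```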
Maps
mmap, munmap
Memory protection
Although brk is a nice little tool to allocate memory, there are quite a lot of other things we can do with memory.
To write self-modifying code we can use mmap to allocate a region of memory which has the read, write and exec permissions all enabled.
Let’s create an executable which reads a byte stream from standard input and tries to execute it.
#![no_std]
#![no_main]
extern crate linux;
use linux::syscall::*;
use linux::constants::*;
#[no_mangle]
fn main() -> u8 {
let ptr = unsafe {
mmap(
core::ptr::null_mut(), 1024,
PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS,
0, 0
).unwrap()
};
let mut buf = unsafe { core::slice::from_raw_parts_mut(ptr, 1024) };
if read(0, &mut buf).unwrap() > 0 {
unsafe { core::arch::asm!("jmp {0}", in(reg) ptr) }
}
0
}
Let’s break our program down. First we need to allocate a buffer which we can fill with data:
let ptr = unsafe {
mmap(
core::ptr::null_mut(), 1024,
PROT_READ|PROT_WRITE|PROT_EXEC,
MAP_PRIVATE|MAP_ANONYMOUS,
0, 0
).unwrap()
};
After that we wrap the raw pointer in a slice, so that every access to the buffer is bounds-checked:
let mut buf = unsafe { core::slice::from_raw_parts_mut(ptr, 1024) };
Once we’re done with that, we can read data from stdin into this buffer and, if we received any data, try to execute it by jumping to the start of the buffer:
if read(0, &mut buf).unwrap() > 0 {
unsafe { core::arch::asm!("jmp {0}", in(reg) ptr) }
}
If there is no data present, we simply exit the process with return code 0. Let’s test our program like this:
> ./cargo.sh build
> cat /dev/null | ./target/bin; echo $?
0
> echo "hello world" | ./target/bin; echo $?
Segmentation fault (core dumped)
139
It seems to be working, so let’s write some code which is able to rewrite itself:
global exploit
section .text
exploit:
mov rdi,0x1
inc byte [rel exploit + 0x1]
cmp rdi,0xa
jb exploit
mov rax,0x3c
syscall
This code loads the constant 0x1 into rdi and then increments that constant in place: the byte at exploit + 0x1 is the low byte of the immediate operand encoded in the mov instruction itself.
After that it checks whether rdi is still below 0xa and, if so, jumps back to exploit, but this
time the mov puts 0x2 into rdi. Once rdi reaches 0xa, the code invokes the exit system call (0x3c) with rdi as its argument, so the return
code of our process will be 10.
Let’s build that code and see how it looks after the compilation:
> nasm -f elf64 -o obj asm.s
> objdump --disassemble=exploit -M intel ./obj
0000000000000000 <exploit>:
0: bf 01 00 00 00 mov edi,0x1
5: fe 05 f6 ff ff ff inc BYTE PTR [rip+0xfffffffffffffff6] # 1 <exploit+0x1>
b: 48 83 ff 0a cmp rdi,0xa
f: 72 ef jb 0 <exploit>
11: b8 3c 00 00 00 mov eax,0x3c
16: 0f 05 syscall
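The disassembly shows why exploit + 0x1 is the right byte to patch: the mov is encoded as bf 01 00 00 00, so the byte right after the opcode bf is the low byte of the immediate. The loop can be modelled in a few lines of plain Rust. This is only a model of the control flow, not real self-modifying code, and run_exploit_model is a made-up name.

```rust
/// Simulates the exploit loop; `code[1]` plays the role of the patched immediate byte.
fn run_exploit_model() -> u8 {
    // Opcode 0xbf (mov edi, imm32) followed by the low byte of the immediate.
    let mut code = [0xbfu8, 0x01];
    loop {
        let rdi = code[1]; // mov edi, <immediate>
        code[1] += 1;      // inc byte [rel exploit + 0x1]
        if rdi >= 0x0a {   // cmp rdi, 0xa; jb exploit (loop while rdi < 0xa)
            return rdi;    // exit(rdi)
        }
    }
}

fn main() {
    println!("exit code: {}", run_exploit_model());
}
```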
We can dump our exploit function as a binary blob, so we can use it against our Rust program like this:
> objcopy -O binary --only-section=.text obj exploit
> cat ./exploit | ./target/bin; echo $?
10
File mappings
As we’ve seen in the brk section, there are always some files mapped into the virtual address space of a process.
At the very least there is the binary which is being executed. Often shared libraries are mapped here too. (Check out the mappings
of the cat command with cat /proc/self/maps or of your shell with cat /proc/$$/maps.)
We can also map a regular file into the address space and use it like a persistent buffer for our program.
#![no_std]
#![no_main]
extern crate linux;
use linux::syscall::*;
use linux::constants::*;
#[no_mangle]
fn main() -> u8 {
let fd = open("/tmp/data", O_CREAT|O_APPEND|O_RDWR, S_IRUSR|S_IWUSR).unwrap();
fallocate(fd, 0, 0, 1024).unwrap();
let ptr = unsafe {
mmap(
core::ptr::null_mut(), 1024,
PROT_READ|PROT_WRITE,
MAP_SHARED_VALIDATE,
fd, 0
).unwrap()
};
let mut buf = unsafe { core::slice::from_raw_parts_mut(ptr, 1024) };
let _ = write(1, buf).unwrap();
let _ = read(0, &mut buf).unwrap();
0
}
We can use it like this:
> echo "Hello old world" | ./target/bin
> cat /tmp/data
Hello old world
> echo "Hello new world" | ./target/bin
Hello old world
> cat /tmp/data
Hello new world
Feel free to reimplement the exploit above by mapping it into the virtual address space instead of reading from stdin.
Shared memory
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
use linux::syscall::*;
use linux::constants::*;
#[no_mangle]
fn main() -> u8 {
let fd = open("/tmp/data", O_CREAT|O_TRUNC|O_RDWR, S_IRUSR|S_IWUSR).unwrap();
fallocate(fd, 0, 0, 1024).unwrap();
let p1 = unsafe {
mmap(
core::ptr::null_mut(), 1024,
PROT_READ|PROT_WRITE,
MAP_SHARED_VALIDATE,
fd, 0
).unwrap()
};
let p2 = unsafe {
mmap(
core::ptr::null_mut(), 1024,
PROT_READ|PROT_WRITE,
MAP_SHARED_VALIDATE,
fd, 0
).unwrap()
};
let mut b1 = unsafe { core::slice::from_raw_parts_mut(p1, 1024) };
let mut b2 = unsafe { core::slice::from_raw_parts_mut(p2, 1024) };
b1[0] = 13;
println!("b1[0] = {}", b1[0]);
println!("b2[0] = {}", b2[0]);
0
}
> ./cargo.sh run
b1[0] = 13
b2[0] = 13
Overmap section with different protection
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
use linux::syscall::*;
use linux::constants::*;
#[no_mangle]
fn main() -> u8 {
let p1 = unsafe {
mmap(
core::ptr::null_mut(), 4096 * 3,
PROT_READ,
MAP_ANONYMOUS|MAP_PRIVATE,
0, 0
).unwrap()
};
read(0, &mut [0u8]);
let p2 = unsafe {
mmap(
p1.offset(4096), 4096,
PROT_READ|PROT_WRITE,
MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED,
0, 0
).unwrap()
};
read(0, &mut [0u8]);
unsafe { munmap(p1, 4096 * 3).unwrap() };
pause();
0
}
mprotect
msync
Vdso
Functions to load symbols: dlopen, dlclose, dlsym
LD_PRELOAD=./libmfilter.so python to override functions
To print the aux vector
https://lwn.net/Articles/519085/ https://lwn.net/Articles/615809/
#![allow(unused)]
fn main() {
use core::ffi::CStr;
use core::mem::transmute;
use crate::types::*;
use crate::error::{Result, result};
#[repr(C)]
#[derive(Debug, Clone)]
pub struct Ehdr {
pub e_ident: [u8;16],
pub e_type: u16,
pub e_machine: u16,
pub e_version: u32,
pub e_entry: u64,
pub e_phoff: u64,
pub e_shoff: u64,
pub e_flags: u32,
pub e_ehsize: u16,
pub e_phentsize: u16,
pub e_phnum: u16,
pub e_shentsize: u16,
pub e_shnum: u16,
pub e_shstrndx: u16,
}
#[repr(C)]
#[derive(Debug, Clone)]
pub struct Phdr {
pub p_type: u32,
pub p_flags: u32,
pub p_offset: u64,
pub p_vaddr: u64,
pub p_paddr: u64,
pub p_filesz: u64,
pub p_memsz: u64,
pub p_align: u64,
}
#[repr(C)]
#[derive(Debug, Clone)]
pub struct Shdr {
pub sh_name: u32,
pub sh_type: u32,
pub sh_flags: u64,
pub sh_addr: u64,
pub sh_offset: u64,
pub sh_size: u64,
pub sh_link: u32,
pub sh_info: u32,
pub sh_addralign: u64,
pub sh_entsize: u64,
}
#[repr(C)]
#[derive(Debug, Clone)]
pub struct Sym {
pub st_name: u32,
pub st_info: u8,
pub st_other: u8,
pub st_shndx: u16,
pub st_value: u64,
pub st_size: u64,
}
pub struct Vdso {
time: extern "C" fn(*mut time_t) -> time_t,
getcpu: extern "C" fn(*mut u32, *mut u32) -> isize,
gettimeofday: extern "C" fn(*mut timeval, *mut timezone) -> isize,
clock_getres: extern "C" fn(clockid_t, *mut timespec) -> isize,
clock_gettime: extern "C" fn(clockid_t, *mut timespec) -> isize,
}
impl Vdso {
pub(crate) unsafe fn from_ptr(p: *const u8) -> Self {
let header = &*(p as *const Ehdr);
let section_headers = core::slice::from_raw_parts(
p.offset(header.e_shoff as isize) as *const Shdr,
header.e_shnum as usize
);
// sh_type 3 is SHT_STRTAB; we assume the first string table in the vDSO is .dynstr.
let dynstr = section_headers.iter().find(|e| e.sh_type == 3).map(|h| {
p.offset(h.sh_offset as isize) as *const u8
}).unwrap();
// sh_type 11 is SHT_DYNSYM, the dynamic symbol table.
let dynsym = section_headers.iter().find(|e| e.sh_type == 11).map(|h| {
core::slice::from_raw_parts(
p.offset(h.sh_offset as isize) as *const Sym,
h.sh_size as usize / core::mem::size_of::<Sym>(),
)
}).unwrap();
let mut time = None;
let mut getcpu = None;
let mut gettimeofday = None;
let mut clock_getres = None;
let mut clock_gettime = None;
for symbole in dynsym {
let s = dynstr.add(symbole.st_name as usize) as *const i8;
match CStr::from_ptr(s).to_str() {
Ok("time") => { time = transmute(p.add(symbole.st_value as usize)); }
Ok("getcpu") => { getcpu = transmute(p.add(symbole.st_value as usize)); }
Ok("gettimeofday") => { gettimeofday = transmute(p.add(symbole.st_value as usize)); }
Ok("clock_getres") => { clock_getres = transmute(p.add(symbole.st_value as usize)); }
Ok("clock_gettime") => { clock_gettime = transmute(p.add(symbole.st_value as usize)); }
_ => { /* ignore */ }
}
}
Self {
time: time.unwrap(),
getcpu: getcpu.unwrap(),
gettimeofday: gettimeofday.unwrap(),
clock_getres: clock_getres.unwrap(),
clock_gettime: clock_gettime.unwrap(),
}
}
#[inline(always)]
pub fn time(&self, time: &mut time_t) -> time_t {
(self.time)(time as *mut _)
}
/// The signature of this function is different from the one documented in the man pages.
/// This is because there is only one way to make this system call fail, which is providing invalid pointers.
/// Since returning Result<()> has the same effect as returning a tuple with two numbers we simply
/// make sure that it never fails by putting these variables on the stack.
///
/// TODO: Is this really true about the Result<()>????
#[inline(always)]
pub fn getcpu(&self) -> (u32, u32) {
let mut cpu = 0;
let mut node = 0;
(self.getcpu)(&mut cpu as *mut _, &mut node as *mut _);
(cpu, node)
}
#[inline(always)]
pub fn gettimeofday(&self, tv: &mut timeval, tz: &mut timezone) -> Result<()> {
result((self.gettimeofday)(tv as *mut _, tz as *mut _)).map(|_| ())
}
#[inline(always)]
pub fn clock_getres(&self, clock: clockid_t, spec: &mut timespec) -> Result<()> {
result((self.clock_getres)(clock, spec as *mut _)).map(|_| ())
}
#[inline(always)]
pub fn clock_gettime(&self, clock: clockid_t, spec: &mut timespec) -> Result<()> {
result((self.clock_gettime)(clock, spec as *mut _)).map(|_| ())
}
}
}
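The Ehdr struct above mirrors the 64-byte header at the start of every 64-bit ELF image, and from_ptr relies on its e_shoff and e_shnum fields to find the section headers. As a sketch of that layout, the following program reads the same fields from a byte slice with the standard library. The offsets follow the field order of Ehdr; ElfSummary, parse_ehdr and the two small helpers are names invented for this example.

```rust
use std::fs;

/// The few header fields this sketch extracts.
#[derive(Debug, PartialEq)]
struct ElfSummary {
    machine: u16,
    shoff: u64,
    shnum: u16,
}

fn u16_at(b: &[u8], off: usize) -> u16 {
    u16::from_le_bytes([b[off], b[off + 1]])
}

fn u64_at(b: &[u8], off: usize) -> u64 {
    let mut w = [0u8; 8];
    w.copy_from_slice(&b[off..off + 8]);
    u64::from_le_bytes(w)
}

/// Accepts only 64-bit (class 2), little-endian (data 1) ELF headers.
fn parse_ehdr(b: &[u8]) -> Option<ElfSummary> {
    if b.len() < 64 || !b.starts_with(b"\x7fELF") || b[4] != 2 || b[5] != 1 {
        return None;
    }
    Some(ElfSummary {
        machine: u16_at(b, 18), // e_machine
        shoff: u64_at(b, 40),   // e_shoff
        shnum: u16_at(b, 60),   // e_shnum
    })
}

fn main() {
    // /proc/self/exe is Linux-specific; elsewhere we just parse an empty buffer.
    let image = fs::read("/proc/self/exe").unwrap_or_default();
    println!("{:?}", parse_ehdr(&image));
}
```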
Performance
Casting
There are multiple ways to convert types in Rust. We’re going to investigate the benefits and drawbacks of using one over another.
u32 => u64
Let’s create a small program to check out the differences at the assembly level:
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
use core::convert::TryInto;
use core::convert::TryFrom;
#[no_mangle] #[inline(never)] fn cast_as(n: u32) -> u64 { n as u64 }
#[no_mangle] #[inline(never)] fn cast_into(n: u32) -> u64 { n.into() }
#[no_mangle] #[inline(never)] fn cast_from(n: u32) -> u64 { u64::from(n) }
#[no_mangle] #[inline(never)] fn cast_try_into(n: u32) -> u64 { n.try_into().unwrap() }
#[no_mangle] #[inline(never)] fn cast_try_from(n: u32) -> u64 { u64::try_from(n).unwrap() }
#[no_mangle]
fn main() -> u8 {
println!("{}", cast_as(1));
println!("{}", cast_into(1));
println!("{}", cast_from(1));
println!("{}", cast_try_into(1));
println!("{}", cast_try_from(1));
0
}
Once we compile it, we get the following code:
> ./cargo.sh build
> ./cargo.sh dump cast_as
0000000000401270 <cast_as>:
401270: 89 f8 mov eax,edi
401272: c3 ret
> ./cargo.sh dump cast_into
0000000000401280 <cast_into>:
401280: 89 f8 mov eax,edi
401282: c3 ret
> ./cargo.sh dump cast_from
0000000000401290 <cast_from>:
401290: 89 f8 mov eax,edi
401292: c3 ret
> ./cargo.sh dump cast_try_into
00000000004012a0 <cast_try_into>:
4012a0: 89 f8 mov eax,edi
4012a2: c3 ret
> ./cargo.sh dump cast_try_from
00000000004012b0 <cast_try_from>:
4012b0: 89 f8 mov eax,edi
4012b2: c3 ret
As you can see, Rust really delivers a zero-cost abstraction here and generates the same code for all of our functions. This is possible
since a u32 can always be converted into a u64.
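The same equivalence can be checked at the value level on any platform. The following sketch uses std instead of our no_std setup, and widen_all is just an illustrative helper name.

```rust
use std::convert::{TryFrom, TryInto};

/// Widens `n` in the five ways compared above; every one of them is lossless.
fn widen_all(n: u32) -> [u64; 5] {
    let via_into: u64 = n.into();
    let via_try_into: u64 = n.try_into().unwrap();
    [
        n as u64,
        via_into,
        u64::from(n),
        via_try_into,
        u64::try_from(n).unwrap(),
    ]
}

fn main() {
    for n in [0u32, 1, u32::MAX].iter() {
        println!("{} -> {:?}", n, widen_all(*n));
    }
}
```

The two unwrap calls can never panic here, because the error type of a u32 to u64 conversion has no values at all.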
u64 => u32
But what happens if we switch the types and try to convert a u64 into a u32?
In this case the Into and From traits are not implemented for the conversion, so we can only compare the following
functions:
#![no_std]
#![no_main]
use core::convert::TryInto;
use core::convert::TryFrom;
#[macro_use]
extern crate linux;
#[no_mangle] #[inline(never)] fn cast_as(n: u64) -> u32 { n as u32 }
#[no_mangle] #[inline(never)] fn cast_try_into(n: u64) -> u32 { n.try_into().unwrap() }
#[no_mangle] #[inline(never)] fn cast_try_from(n: u64) -> u32 { u32::try_from(n).unwrap() }
#[no_mangle]
fn main() -> u8 {
println!("{}", cast_as(1));
println!("{}", cast_try_into(1));
println!("{}", cast_try_from(1));
0
}
Interestingly, the code still looks the same. The compiler sees that the only value we ever pass to these functions is the constant 1,
so it can be sure that it fits into a u32, and it optimizes out the size checks.
> ./cargo.sh dump cast_as
0000000000401270 <cast_as>:
401270: 89 f8 mov eax,edi
401272: c3 ret
> ./cargo.sh dump cast_try_into
00000000004012a0 <cast_try_into>:
4012a0: 89 f8 mov eax,edi
4012a2: c3 ret
> ./cargo.sh dump cast_try_from
00000000004012b0 <cast_try_from>:
4012b0: 89 f8 mov eax,edi
4012b2: c3 ret
Let’s make it a bit more complex by reading a byte from stdin, so the compiler doesn’t have a chance to optimize the checks away:
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
use core::convert::TryInto;
use core::convert::TryFrom;
#[no_mangle] #[inline(never)] fn cast_as(n: u64) -> u32 { n as u32 }
#[no_mangle] #[inline(never)] fn cast_try_into(n: u64) -> u32 { n.try_into().unwrap() }
#[no_mangle] #[inline(never)] fn cast_try_from(n: u64) -> u32 { u32::try_from(n).unwrap() }
#[no_mangle]
fn main() -> u8 {
let mut buf = [0u8;1];
linux::syscall::read(0, &mut buf).unwrap();
let n = buf[0].try_into().unwrap();
println!("{}", cast_as(n));
println!("{}", cast_try_into(n));
println!("{}", cast_try_from(n));
0
}
> ./cargo.sh dump cast_as
00000000004019c0 <cast_as>:
4019c0: 48 89 f8 mov rax,rdi
4019c3: c3 ret
> ./cargo.sh dump cast_try_from
0000000000401a10 <cast_try_from>:
401a10: 48 89 f8 mov rax,rdi
401a13: 48 c1 e8 20 shr rax,0x20
401a17: /-- 75 03 jne 401a1c <cast_try_from+0xc>
401a19: | 89 f8 mov eax,edi
401a1b: | c3 ret
401a1c: \-> 50 push rax
401a1d: 48 8d 3d f2 37 00 00 lea rdi,[rip+0x37f2]
401a24: 48 8d 0d 7d 60 00 00 lea rcx,[rip+0x607d]
401a2b: 4c 8d 05 fe 60 00 00 lea r8,[rip+0x60fe]
401a32: 48 8d 54 24 07 lea rdx,[rsp+0x7]
401a37: be 2b 00 00 00 mov esi,0x2b
401a3c: ff 15 96 65 00 00 call QWORD PTR [rip+0x6596]
> ./cargo.sh dump cast_try_into
00000000004019d0 <cast_try_into>:
4019d0: 48 89 f8 mov rax,rdi
4019d3: 48 c1 e8 20 shr rax,0x20
4019d7: /-- 75 03 jne 4019dc <cast_try_into+0xc>
4019d9: | 89 f8 mov eax,edi
4019db: | c3 ret
4019dc: \-> 50 push rax
4019dd: 48 8d 3d 32 38 00 00 lea rdi,[rip+0x3832]
4019e4: 48 8d 0d bd 60 00 00 lea rcx,[rip+0x60bd]
4019eb: 4c 8d 05 26 61 00 00 lea r8,[rip+0x6126]
4019f2: 48 8d 54 24 07 lea rdx,[rsp+0x7]
4019f7: be 2b 00 00 00 mov esi,0x2b
4019fc: ff 15 d6 65 00 00 call QWORD PTR [rip+0x65d6]
Alright, that looks a bit different now. As we can see, the documentation of TryFrom and TryInto was right: the two functions
generate the same code, the only difference is how the Rust code looks. So from now on we ignore the try_from function
and only compare the as keyword with the TryInto trait.
As you can see, the as keyword generates a single instruction which moves the content of rdi into rax, so it is a pure
register operation. In contrast, the TryInto trait generates 12 instructions. 5 of these instructions have a memory
operand (the lines with […]). Strictly speaking, the four lea instructions only compute addresses; only the push and the indirect call really touch memory, and what they touch is probably already in a cache. Still, it’s obviously
more work than a simple register move.
If we naively count one CPU cycle for moving the value of a register into another one and about 10 cycles (roughly an L2 cache hit) for each of the 5 bracketed operands,
we could say that the TryInto conversion takes about 60-70x longer. And if you do it a lot, it quickly adds up and makes a huge
difference in the performance of your code. But does this really mirror reality? Well, not quite…
If we look at the code a bit closer, at the beginning it does the same as the as keyword. After that it shifts the value
of our number right by 32 bits, and if the result is not zero (so the number doesn’t fit into 32 bits) it jumps to the failure handling logic at 0x4019dc.
4019d0: 48 89 f8 mov rax,rdi
4019d3: 48 c1 e8 20 shr rax,0x20
4019d7: /-- 75 03 jne 4019dc <cast_try_into+0xc>
4019d9: | 89 f8 mov eax,edi
4019db: | c3 ret
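Written back in Rust, the check the compiler emitted boils down to a single shift and a comparison. The sketch below is a hand-written equivalent (fits_in_u32 and checked_cast are made-up names), compared against the real u32::try_from.

```rust
use std::convert::TryFrom;

/// Mirrors `shr rax, 0x20` followed by `jne`: the value fits iff the upper 32 bits are zero.
fn fits_in_u32(n: u64) -> bool {
    n >> 32 == 0
}

/// Hand-rolled equivalent of `u32::try_from(n).ok()`.
fn checked_cast(n: u64) -> Option<u32> {
    if fits_in_u32(n) {
        Some(n as u32) // mov eax, edi
    } else {
        None // the panic path in the disassembly
    }
}

fn main() {
    for n in [1u64, u32::MAX as u64, 1 << 32].iter() {
        // Both approaches have to agree for every input.
        assert_eq!(checked_cast(*n), u32::try_from(*n).ok());
        println!("{:#x} -> {:?}", n, checked_cast(*n));
    }
}
```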
This means that for a valid cast we only execute 4 instructions (not counting ret), which is roughly 4 times the work of the as keyword,
but in exchange we get the benefit of error handling. This obviously makes the .text segment of our code bigger, so
it fits less well into the instruction cache, which can make our code slower overall, but the result will always be correct.
In contrast, we can see the as keyword a bit like an unsafe keyword: it doesn’t always produce the expected
value, and it just carries on as if nothing had happened. Still, it has the benefit of smaller and faster code
if we use it with care.
To see the difference between the code sizes we can use readelf like this:
> readelf -s ./target/bin | grep -E 'cast_|Name'
Num: Value Size Type Bind Vis Ndx Name
45: 00000000004019d0 50 FUNC GLOBAL DEFAULT 2 cast_try_into
95: 00000000004019c0 4 FUNC GLOBAL DEFAULT 2 cast_as
This says that the cast_try_into function occupies 12.5x more space in our caches. And if we remove the function wrapper
from the casts (remove the ret instruction), casting with as takes 3 bytes while casting with try_into takes 49 bytes.
As a result, assuming a 64-byte cache line, about 21 as casts fit into a single line of the L1 instruction cache, but only one try_into cast. That is quite a difference.
Let’s see what the as keyword does in different scenarios:
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
#[no_mangle] #[inline(never)] fn u32_as_u16(n: u32) -> u16 { n as u16 }
#[no_mangle] #[inline(never)] fn u32_as_i16(n: u32) -> i16 { n as i16 }
#[no_mangle] #[inline(never)] fn u32_as_i32(n: u32) -> i32 { n as i32 }
#[no_mangle] #[inline(never)] fn u32_as_u64(n: u32) -> u64 { n as u64 }
#[no_mangle] #[inline(never)] fn u32_as_i64(n: u32) -> i64 { n as i64 }
#[no_mangle] #[inline(never)] fn i32_as_u16(n: i32) -> u16 { n as u16 }
#[no_mangle] #[inline(never)] fn i32_as_i16(n: i32) -> i16 { n as i16 }
#[no_mangle] #[inline(never)] fn i32_as_u32(n: i32) -> u32 { n as u32 }
#[no_mangle] #[inline(never)] fn i32_as_u64(n: i32) -> u64 { n as u64 }
#[no_mangle] #[inline(never)] fn i32_as_i64(n: i32) -> i64 { n as i64 }
#[no_mangle]
fn main() -> u8 {
println!("u32_as_u16(1) {}", u32_as_u16(1));
println!("u32_as_i16(1) {}", u32_as_i16(1));
println!("u32_as_i32(1) {}", u32_as_i32(1));
println!("u32_as_u64(1) {}", u32_as_u64(1));
println!("u32_as_i64(1) {}", u32_as_i64(1));
println!("-----------------------------------------------");
println!("u32_as_u16(u32::MAX) {}", u32_as_u16(u32::MAX));
println!("u32_as_i16(u32::MAX) {}", u32_as_i16(u32::MAX));
println!("u32_as_i32(u32::MAX) {}", u32_as_i32(u32::MAX));
println!("u32_as_u64(u32::MAX) {}", u32_as_u64(u32::MAX));
println!("u32_as_i64(u32::MAX) {}", u32_as_i64(u32::MAX));
println!("-----------------------------------------------");
println!("i32_as_u16(i32::MIN) {}", i32_as_u16(i32::MIN));
println!("i32_as_i16(i32::MIN) {}", i32_as_i16(i32::MIN));
println!("i32_as_u32(i32::MIN) {}", i32_as_u32(i32::MIN));
println!("i32_as_u64(i32::MIN) {}", i32_as_u64(i32::MIN));
println!("i32_as_i64(i32::MIN) {}", i32_as_i64(i32::MIN));
println!("-----------------------------------------------");
println!("i32_as_u16(-1) {}", i32_as_u16(-1));
println!("i32_as_i16(-1) {}", i32_as_i16(-1));
println!("i32_as_u32(-1) {}", i32_as_u32(-1));
println!("i32_as_u64(-1) {}", i32_as_u64(-1));
println!("i32_as_i64(-1) {}", i32_as_i64(-1));
println!("-----------------------------------------------");
println!("i32_as_u16(1) {}", i32_as_u16(1));
println!("i32_as_i16(1) {}", i32_as_i16(1));
println!("i32_as_u32(1) {}", i32_as_u32(1));
println!("i32_as_u64(1) {}", i32_as_u64(1));
println!("i32_as_i64(1) {}", i32_as_i64(1));
println!("-----------------------------------------------");
println!("i32_as_u16(i32::MAX) {}", i32_as_u16(i32::MAX));
println!("i32_as_i16(i32::MAX) {}", i32_as_i16(i32::MAX));
println!("i32_as_u32(i32::MAX) {}", i32_as_u32(i32::MAX));
println!("i32_as_u64(i32::MAX) {}", i32_as_u64(i32::MAX));
println!("i32_as_i64(i32::MAX) {}", i32_as_i64(i32::MAX));
0
}
Interestingly, if we try to disassemble these functions, only three of them can be found:
i32_as_i16, u32_as_i64, i32_as_i64. We have to dig a bit deeper to find out why. Let’s check out the symbol table
of the binary with
> objdump -t ./target/bin | grep _as_ | sort
0000000000401270 g F .text 0000000000000003 u32_as_i64
0000000000401270 g F .text 0000000000000003 u32_as_u64
0000000000401280 g F .text 0000000000000003 i32_as_i16
0000000000401280 g F .text 0000000000000003 i32_as_u16
0000000000401280 g F .text 0000000000000003 u32_as_i16
0000000000401280 g F .text 0000000000000003 u32_as_u16
0000000000401290 g F .text 0000000000000004 i32_as_i64
0000000000401290 g F .text 0000000000000004 i32_as_u64
In this output we can see the following columns:
- memory address
- flags to describe the type of the symbol (g=global, F=function)
- section in which the symbol is located (.text = program code)
- size of the symbol
- name of the symbol
If you have a closer look at the memory addresses of the symbols, you can see that multiple symbols share the same address. This means that there are only 3 different pieces of code behind these 8 symbols. When disassembling, objdump treats only the first symbol at each address as the real one, so it doesn’t find the others. The symbols are merged because, from the compiler’s perspective, the functions do exactly the same thing. Let’s see what that is:
> objdump --disassemble=u32_as_i64 -M intel ./target/bin
0000000000401270 <u32_as_i64>:
401270: 89 f8 mov eax,edi
401272: c3 ret
> objdump --disassemble=i32_as_i16 -M intel ./target/bin
0000000000401280 <i32_as_i16>:
401280: 89 f8 mov eax,edi
401282: c3 ret
> objdump --disassemble=i32_as_i64 -M intel ./target/bin
0000000000401290 <i32_as_i64>:
401290: 48 63 c7 movsxd rax,edi
401293: c3 ret
I’m not sure why the first two functions weren’t merged, even though their machine code is identical. The input type (u32/i32) can’t be the reason, since u32_as_u16 and i32_as_u16 share an address; more likely it’s the different size of the return type (64 vs. 16 bits). The
third function is obviously different. It creates a signed integer of a bigger size, which means that the value has to be
sign-extended (movsxd). For example, in the case of i8 => i16:
-1: 0xff => 0xffff
+1: 0x01 => 0x0001
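The two bit patterns above can be reproduced directly with as casts. A minimal sketch (the function names are only illustrative):

```rust
/// Widening a signed value copies the sign bit into the new upper bits.
fn sign_extend(n: i8) -> u16 {
    (n as i16) as u16 // the final cast only reinterprets the bits for printing
}

/// Widening an unsigned value fills the new upper bits with zeros.
fn zero_extend(n: u8) -> u16 {
    n as u16
}

fn main() {
    println!("{:#06x}", sign_extend(-1));   // 0xffff
    println!("{:#06x}", sign_extend(1));    // 0x0001
    println!("{:#06x}", zero_extend(0xff)); // 0x00ff
}
```

The same byte 0xff therefore widens to 0xffff when it is read as the i8 value -1, but to 0x00ff when it is read as the u8 value 255.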
Last but not least, let’s see the output of our program. We have the following blocks:
Unsigned normal:
u32_as_u16(1) 1 # same
u32_as_i16(1) 1 # same
u32_as_i32(1) 1 # same
u32_as_u64(1) 1 # same
u32_as_i64(1) 1 # same
Unsigned overflow:
u32_as_u16(u32::MAX) 65535 # diff (truncated)
u32_as_i16(u32::MAX) -1 # diff (2's complement)
u32_as_i32(u32::MAX) -1 # diff (2's complement)
u32_as_u64(u32::MAX) 4294967295 # same
u32_as_i64(u32::MAX) 4294967295 # same
Signed underflow:
i32_as_u16(i32::MIN) 0 # diff
i32_as_i16(i32::MIN) 0 # diff
i32_as_u32(i32::MIN) 2147483648 # diff (not 2's complement)
i32_as_u64(i32::MIN) 18446744071562067968 # diff (not 2's complement)
i32_as_i64(i32::MIN) -2147483648 # same
i32_as_u16(-1) 65535 # diff (not 2's complement)
i32_as_i16(-1) -1 # same
i32_as_u32(-1) 4294967295 # diff (not 2's complement)
i32_as_u64(-1) 18446744073709551615 # diff (not 2's complement)
i32_as_i64(-1) -1 # same
Signed normal:
i32_as_u16(1) 1 # same
i32_as_i16(1) 1 # same
i32_as_u32(1) 1 # same
i32_as_u64(1) 1 # same
i32_as_i64(1) 1 # same
Signed overflow:
i32_as_u16(i32::MAX) 65535 # diff (truncated)
i32_as_i16(i32::MAX) -1 # diff (2's complement)
i32_as_u32(i32::MAX) 2147483647 # same
i32_as_u64(i32::MAX) 2147483647 # same
i32_as_i64(i32::MAX) 2147483647 # same
As a result we can set up the following rules:
| Cast | Safe if |
|---|---|
| uS as uB | always |
| uS as iB | always |
| uN as iN | uN <= iN::MAX |
| iN as uN | iN >= uN::MIN |
| uB as uS | uB <= uS::MAX |
| iB as iS | iB >= iS::MIN && iB <= iS::MAX |
| iB as uS | iB >= uS::MIN && iB <= uS::MAX |
| uB as iS | uB <= iS::MAX |
Where the letters have the following meanings:
- S: small
- B: big
- N: number (same size)
- u: unsigned
- i: signed
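The rules in the table can be spot-checked mechanically: an as cast keeps the value exactly when the checked TryFrom conversion succeeds. The sketch below tests two rows of the table (iB as uS and uN as iN) by comparing the input and the cast result in a type that is wide enough for both. The helper names are invented for this example.

```rust
use std::convert::TryFrom;

/// `iB as uS`: does casting an i32 to u16 keep the value?
fn lossless_i32_as_u16(n: i32) -> bool {
    i64::from(n as u16) == i64::from(n)
}

/// `uN as iN`: does casting a u32 to i32 keep the value?
fn lossless_u32_as_i32(n: u32) -> bool {
    i64::from(n as i32) == i64::from(n)
}

fn main() {
    for n in [i32::MIN, -1, 0, 1, 65_535, 65_536, i32::MAX].iter() {
        // The checked conversion succeeds exactly when `as` keeps the value.
        assert_eq!(lossless_i32_as_u16(*n), u16::try_from(*n).is_ok());
    }
    for n in [0u32, 1, i32::MAX as u32, i32::MAX as u32 + 1, u32::MAX].iter() {
        assert_eq!(lossless_u32_as_i32(*n), i32::try_from(*n).is_ok());
    }
    println!("`as` agrees with TryFrom on all sampled values");
}
```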
As a conclusion we can say the following about casting integers:
- Use the From and Into traits whenever the type system allows it. It’s the safest way and it doesn’t have any overhead.
- Use the TryFrom and TryInto traits whenever the type system requires it and you cannot be sure about the input value. Even though they have a bit of overhead and decrease the cache locality of your code, they are always safe to use, and they let the compiler warn you if you unintentionally modify the code in an incorrect way later on.
- Use the as keyword instead of TryFrom and TryInto only if you can always be sure about the input number. Even though it doesn’t require an unsafe block, it’s easy to shoot yourself in the foot by refactoring the code without realizing that the input value is no longer guaranteed to be in range. In this case you will get bugs that are hard to track down.
Result
Let’s start with a very simple code:
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
use core::arch::asm;
#[no_mangle]
#[inline(never)]
fn ok() -> Result<(), ()> {
unsafe { asm!( "nop", options(nostack)) };
Ok(())
}
#[no_mangle]
#[inline(never)]
fn err() -> Result<(), ()> {
unsafe { asm!( "nop", options(nostack)) };
Err(())
}
#[no_mangle]
fn main() -> u8 {
println!("{:?}", ok());
println!("{:?}", err());
0
}
Rust does a good job of optimizing out code which is not necessary, and this feature makes it difficult to investigate the code of a function. A simple function with a constant return value won’t be put into the binary, so we cannot dump its assembly. To trick the compiler into leaving our code in the binary we can use a single line of inline assembly which doesn’t do anything; since the compiler doesn’t look inside it, it has to assume that it’s important and leaves it there. After the compilation our code looks like this:
> ./cargo.sh dump ok
00000000004012f0 <ok>:
4012f0: 90 nop
4012f1: 31 c0 xor eax,eax
4012f3: c3 ret
> ./cargo.sh dump err
0000000000401300 <err>:
401300: 90 nop
401301: b0 01 mov al,0x1
401303: c3 ret
As you can see, the code of the two functions is quite similar. The rax register (or a part of it) is set and the function
returns. In the case of Ok(()) rax is set to zero and for Err(()) it is set to one. Since the content of the
result is always the zero-sized () it doesn’t even need a register to be passed back to the caller.
Let’s modify our code to return some real value, for example a u32 number:
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
use core::arch::asm;
#[no_mangle]
#[inline(never)]
fn ok() -> Result<u32, ()> {
unsafe { asm!( "nop", options(nostack)) };
Ok(3)
}
#[no_mangle]
#[inline(never)]
fn err() -> Result<(), u32> {
unsafe { asm!( "nop", options(nostack)) };
Err(3)
}
#[no_mangle]
fn main() -> u8 {
println!("{:?}", ok());
println!("{:?}", err());
0
}
This already changes the output of the dump:
> ./cargo.sh dump ok
0000000000401360 <ok>:
401360: 90 nop
401361: 31 c0 xor eax,eax
401363: ba 03 00 00 00 mov edx,0x3
401368: c3 ret
> ./cargo.sh dump err
0000000000401370 <err>:
401370: 90 nop
401371: b8 01 00 00 00 mov eax,0x1
401376: ba 03 00 00 00 mov edx,0x3
40137b: c3 ret
As you can see, the value inside the Result is passed back to the caller in another register, rdx (here its lower half, edx). This aligns with the
System V ABI.
But what happens if we use a more realistic error type, like an Error enum?
#![no_std]
#![no_main]
#[macro_use]
extern crate linux;
use core::arch::asm;
#[derive(Debug, Clone)]
pub enum Error { A }
#[no_mangle]
#[inline(never)]
fn ok() -> Result<(), Error> {
unsafe { asm!( "nop", options(nostack)) };
Ok(())
}
#[no_mangle]
#[inline(never)]
fn err() -> Result<(), Error> {
unsafe { asm!( "nop", options(nostack)) };
Err(Error::A)
}
#[no_mangle]
fn main() -> u8 {
println!("{:?}", ok());
println!("{:?}", err());
0
}
It looks quite similar, right?
> ./cargo.sh dump ok
0000000000401310 <ok>:
401310: 90 nop
401311: 31 c0 xor eax,eax
401313: c3 ret
> ./cargo.sh dump err
0000000000401320 <err>:
401320: 90 nop
401321: b0 01 mov al,0x1
401323: c3 ret
Then add another variant to the Error enum and try it again:
#[derive(Debug, Clone)]
pub enum Error { A, B }
> ./cargo.sh dump ok
0000000000401320 <ok>:
401320: 90 nop
401321: b0 02 mov al,0x2
401323: c3 ret
> ./cargo.sh dump err
0000000000401330 <err>:
401330: 90 nop
401331: 31 c0 xor eax,eax
401333: c3 ret
The return values have changed. Now zero means Err(Error::A) and two means Ok(()). It seems like the compiler
realizes that Ok(()) can only have one state, so it can be represented just like another variant of the Error
enum. It kind of creates another enum under the hood, like:
enum SpecialError {
Err_Error_A = 0,
Err_Error_B = 1,
Ok = 2,
}
This way it’s enough to use only one register instead of two. Pretty nice, right?
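The effect of this optimisation is also visible without a disassembler, because the whole Result fits into a single byte. The sketch below uses std::mem::size_of; the sizes are what current rustc produces and are not something the language guarantees for this type. error_size and result_size are illustrative names.

```rust
use std::mem::size_of;

#[allow(dead_code)]
#[derive(Debug, Clone)]
enum Error {
    A,
    B,
}

/// Size of the plain error enum: one byte holding 0 or 1.
fn error_size() -> usize {
    size_of::<Error>()
}

/// Size of the Result: Ok(()) is stored in an unused value of that same byte.
fn result_size() -> usize {
    size_of::<Result<(), Error>>()
}

fn main() {
    println!("Error: {} byte", error_size());
    println!("Result<(), Error>: {} byte", result_size());
}
```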
Let’s return a real value inside Ok() to avoid this optimisation, for example a number like this:
#[no_mangle]
#[inline(never)]
fn ok() -> Result<u8, Error> {
unsafe { asm!( "nop", options(nostack)) };
Ok(3)
}
#[no_mangle]
#[inline(never)]
fn err() -> Result<u8, Error> {
unsafe { asm!( "nop", options(nostack)) };
Err(Error::B)
}
With u8 it looks quite good: rax=0 means Ok, rax=1 means Err, and rdx holds the value (only the low bytes al and dl are actually written).
> ./cargo.sh dump ok
0000000000401320 <ok>:
401320: 90 nop
401321: 31 c0 xor eax,eax
401323: b2 03 mov dl,0x3
401325: c3 ret
> ./cargo.sh dump err
0000000000401330 <err>:
401330: 90 nop
401331: b0 01 mov al,0x1
401333: b2 01 mov dl,0x1
401335: c3 ret
But what happens if we try to return with a bigger number like i32?
#[no_mangle]
#[inline(never)]
fn ok() -> Result<i32, Error> {
unsafe { asm!( "nop", options(nostack)) };
Ok(3)
}
#[no_mangle]
#[inline(never)]
fn err() -> Result<i32, Error> {
unsafe { asm!( "nop", options(nostack)) };
Err(Error::B)
}
It already gets a bit scary:
> ./cargo.sh dump ok
0000000000401330 <ok>:
401330: 90 nop
401331: 48 b8 00 00 00 00 03 movabs rax,0x300000000
401338: 00 00 00
40133b: c3 ret
> ./cargo.sh dump err
0000000000401340 <err>:
401340: 90 nop
401341: b8 01 01 00 00 mov eax,0x101
401346: c3 ret
I assume that since Result is an enum too, whose size must be at least the tag size + the size of the biggest inner value,
the compiler tries to encode the tag and the i32 inner value into a single 64 bit register. The lower half of the register
holds the tag (zero = Ok) and the upper half the i32 value (3). The Err case is still encoded as a single integer as well
(0x101: the tag 1 in the lowest byte and Error::B = 1 in the next one), but since the value of the Error enum fits into
one byte the compiler doesn’t have to use the full 64 bit register.
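To see that the two constants in the dump really are just "tag plus payload", we can rebuild them by hand. This is only arithmetic that mirrors the bytes shown above; the function names pack_ok_i32 and pack_err are hypothetical and the bit positions are read off this particular dump, not taken from any documented Rust layout.

```rust
/// Packs the tag into the lowest byte and the i32 payload into the
/// upper 32 bits, mirroring `movabs rax,0x300000000` for Ok(3).
pub fn pack_ok_i32(tag: u8, value: i32) -> u64 {
    ((value as u32 as u64) << 32) | tag as u64
}

/// Packs the tag into byte 0 and the Error discriminant into byte 1,
/// mirroring `mov eax,0x101` for Err(Error::B).
pub fn pack_err(tag: u8, code: u8) -> u32 {
    ((code as u32) << 8) | tag as u32
}
```

Calling pack_ok_i32(0, 3) and pack_err(1, 1) reproduces the two immediates from the disassembly.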
And it’s just the beginning. Replace i32 with i64 and you’ll get this:
> ./cargo.sh dump ok
0000000000401330 <ok>:
401330: 48 89 f8 mov rax,rdi
401333: 90 nop
401334: 48 c7 47 08 03 00 00 mov QWORD PTR [rdi+0x8],0x3
40133b: 00
40133c: c6 07 00 mov BYTE PTR [rdi],0x0
40133f: c3 ret
> ./cargo.sh dump err
0000000000401340 <err>:
401340: 48 89 f8 mov rax,rdi
401343: 90 nop
401344: 66 c7 07 01 01 mov WORD PTR [rdi],0x101
401349: c3 ret
This is even more hairy… The System V ABI says that if you need to return two integer values you can use the rax and rdx
registers, just like we did above. If, on the other hand, the return value is classified as MEMORY (e.g. a big struct), then
the caller has to reserve space for the return value and the called function writes the value there. Apparently the default
Rust ABI treats our 16 byte Result this way.
In this case the pointer to this space is passed as a hidden first argument to the function in the rdi register, and the
rax register must hold the same pointer at the return point. Hence the mov rax,rdi at the beginning of both functions.
After that the memory location pointed to by rdi is filled with either 0 and 0x3 for Ok(3) or
with 0x101 for Err(Error::B).
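Written out in Rust, the code the compiler generated behaves roughly like the sketch below. ResultRepr, ok_sret, err_sret and caller are hypothetical names, and the field offsets (tag at byte 0, error code at byte 1, payload at byte 8) are simply copied from the stores in the dump. It is a model of the calling convention, not what rustc literally emits.

```rust
use std::mem::MaybeUninit;

/// Stand-in for the in-memory shape of Result<i64, Error> seen in the dump.
#[repr(C)]
#[derive(Debug, PartialEq)]
pub struct ResultRepr {
    pub tag: u8,    // offset 0: 0 = Ok, 1 = Err
    pub err: u8,    // offset 1: Error discriminant (only used for Err)
    pub value: i64, // offset 8: Ok payload (only used for Ok)
}

/// The callee receives the destination (the hidden argument in rdi),
/// fills it and hands the same location back (rax).
pub fn ok_sret(out: &mut MaybeUninit<ResultRepr>) -> &mut ResultRepr {
    out.write(ResultRepr { tag: 0, err: 0, value: 3 })
}

pub fn err_sret(out: &mut MaybeUninit<ResultRepr>) -> &mut ResultRepr {
    out.write(ResultRepr { tag: 1, err: 1, value: 0 })
}

/// The caller reserves the space on its own stack before the call.
pub fn caller(want_ok: bool) -> (u8, u8, i64) {
    let mut slot = MaybeUninit::<ResultRepr>::uninit();
    let r = if want_ok { ok_sret(&mut slot) } else { err_sret(&mut slot) };
    (r.tag, r.err, r.value)
}
```

The important part is that every return value makes a round trip through the caller’s stack slot instead of staying in registers.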
And this is kind of sad, because we have to go through memory (the L1 cache at best, possibly L2) just to return a simple number, instead of passing it back in two registers basically for free.
It can be corrected by forcing the compiler to use the C ABI with the extern "C" declaration:
#[no_mangle]
#[inline(never)]
extern "C" fn ok() -> Result<usize, Error> {
unsafe { asm!( "nop", options(nostack)) };
Ok(3)
}
#[no_mangle]
#[inline(never)]
extern "C" fn err() -> Result<usize, Error> {
unsafe { asm!( "nop", options(nostack)) };
Err(Error::B)
}
We’ll see a warning that the Result enum is not FFI-safe, but we don’t want to use it as an FFI function at the moment anyway,
just as a function which doesn’t do unnecessary work:
> ./cargo.sh dump ok
0000000000401330 <ok>:
401330: 90 nop
401331: ba 03 00 00 00 mov edx,0x3
401336: 31 c0 xor eax,eax
401338: c3 ret
> ./cargo.sh dump err
0000000000401340 <err>:
401340: 90 nop
401341: b8 01 01 00 00 mov eax,0x101
401346: c3 ret
So as long as we don’t use an Ok or Err type bigger than 64 bit we should be good now.
Even though it’s a bit complicated to use, there are some benefits of using Result:
- The value must be checked instead of simply being used (as with the return value of malloc). The compiler doesn’t let you access its content until you have proved that it holds an Ok or an Err value. This is a huge benefit.
- The question mark. You can forward the error by simply using Err()?. But how does it work under the hood?
The ? operator
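As a first approximation, expr? behaves like a match which unwraps the Ok value and returns early with the (converted) error otherwise. The sketch below shows both spellings side by side; the function names are made up, and the real expansion in the compiler goes through the Try trait machinery, so this is a simplified picture.

```rust
#[derive(Debug, Clone, PartialEq)]
pub enum Error { A, B }

fn err() -> Result<u8, Error> {
    Err(Error::B)
}

/// Forwards the error with the question mark.
pub fn with_question_mark() -> Result<u8, Error> {
    let v = err()?;
    Ok(v + 1)
}

/// Roughly what the function above does, written out by hand.
pub fn with_match() -> Result<u8, Error> {
    let v = match err() {
        Ok(v) => v,
        // The error is passed through From, which is what allows `?`
        // to convert between error types.
        Err(e) => return Err(From::from(e)),
    };
    Ok(v + 1)
}
```

Both functions return Err(Error::B) here, because err() never produces an Ok value.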
Jump
- Why is it so much faster to use unconditional jumps instead of if-else? – branch predictor
- make the common case faster
- got-plt example
ABI
Tools
readelf
#![allow(unused)]
fn main() {
// ==============================================================================
// Elf file
// ==============================================================================
#[derive(Debug, Clone)]
pub struct File<'a>{
pub ehdr: &'a Ehdr,
pub phdrs: &'a [Phdr],
pub shdrs: &'a [Shdr],
}
impl<'a> File<'a> {
pub unsafe fn from_slice(buf: &'a [u8]) -> Self {
Self::from_ptr(buf.as_ptr())
}
pub unsafe fn from_ptr(p: *const u8) -> Self {
let ehdr = &*(p as *const Ehdr);
let phdrs = core::slice::from_raw_parts(
p.offset(ehdr.e_phoff as isize) as *const Phdr,
ehdr.e_phnum as usize
);
let shdrs = core::slice::from_raw_parts(
p.offset(ehdr.e_shoff as isize) as *const Shdr,
ehdr.e_shnum as usize
);
Self { ehdr, phdrs, shdrs }
}
pub unsafe fn shstr(&self, sh_name: u32) -> &str {
let h = &self.shdrs[self.ehdr.e_shstrndx as usize];
let p = (self.ehdr as *const _ as *const i8)
.add(h.sh_offset as usize)
.add(sh_name as usize);
CStr::from_ptr(p).to_str().unwrap()
}
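pub unsafe fn strtab(&self, st_name: u32) -> &str {
// NOTE: this uses the first SHT_STRTAB section it finds. A file usually has
// several of them (.shstrtab, .strtab, .dynstr), so this is not necessarily
// the string table the symbol table refers to via sh_link.
for h in self.shdrs.into_iter() {
if h.sh_type == SHT::SHT_STRTAB as u32 {
let p = (self.ehdr as *const _ as *const i8)
.add(h.sh_offset as usize)
.add(st_name as usize);
return CStr::from_ptr(p).to_str().unwrap();
}
}
panic!("Missing strtab");
}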
pub fn symtab(&self) -> &[Sym] {
for h in self.shdrs.into_iter() {
if h.sh_type == SHT::SHT_SYMTAB as u32 {
return unsafe {
let p = (self.ehdr as *const _ as *const i8)
.add(h.sh_offset as usize) as *const Sym;
core::slice::from_raw_parts(p, (h.sh_size / h.sh_entsize) as usize)
};
}
}
&[]
}
pub fn dynsym(&self) -> &[Sym] {
for h in self.shdrs.into_iter() {
if h.sh_type == SHT::SHT_DYNSYM as u32 {
return unsafe {
let p = (self.ehdr as *const _ as *const i8)
.add(h.sh_offset as usize) as *const Sym;
core::slice::from_raw_parts(p, (h.sh_size / h.sh_entsize) as usize)
};
}
}
&[]
}
pub fn dynamic(&self) -> &[Dyn] {
for h in self.shdrs.into_iter() {
if h.sh_type == SHT::SHT_DYNAMIC as u32 {
return unsafe {
let p = (self.ehdr as *const _ as *const i8)
.add(h.sh_offset as usize) as *const Dyn;
core::slice::from_raw_parts(p, (h.sh_size / h.sh_entsize) as usize)
};
}
}
&[]
}
pub fn dump_phdrs(&self) {
crate::println!("Program headers:");
//crate::println!("{:?}", ""); // NOTE: Without this it will segfault in the for loop...
crate::println!(" {:<3} {:<12} {:<10} {:<18} {:<18} {:<10} {:<10} {:<3} {:<10}",
"Idx",
"Type",
"Offset",
"VirtAddr",
"PhysAddr",
"FileSize",
"MemSize",
"Flg",
"Align"
);
//crate::println!("{:?}", self.phdrs);
for (idx, h) in self.phdrs.into_iter().enumerate() {
crate::println!(" {:<3?} {:<12} 0x{:0>8x?} 0x{:0>16x?} 0x{:0>16x?} 0x{:0>8x?} 0x{:0>8x?} {:<3} 0x{:0>8x?}",
idx,
h.p_type().as_str(),
h.p_offset,
h.p_vaddr,
h.p_paddr,
h.p_filesz,
h.p_memsz,
h.flags(),
h.p_align
);
}
crate::println!("");
}
pub fn dump_shdrs(&self) {
//crate::println!("{:?}", ""); // NOTE: Without this it will segfault in the for loop...
crate::println!("Section headers:");
crate::println!(" {:<3} {:<13} {:<18} {:<10} {:<10} {:<3} {:<3} {:<3} {:<3} {:<3} {}",
"Idx",
"Type",
"Address",
"Offset",
"Size",
"ENS",
"FLG",
"LNK",
"INF",
"ALI",
"Name"
);
//crate::println!("{:?}", self.phdrs);
for (idx, h) in self.shdrs.into_iter().enumerate() {
crate::println!(" {:<3} {:<13} 0x{:0>16x} 0x{:0>8x} 0x{:0>8x} {:<3} {:<3} {:<3} {:<3} {:<3} {}",
idx,
h.sh_type().as_str(),
h.sh_addr,
h.sh_offset,
h.sh_size,
h.sh_entsize,
h.sh_flags,
h.sh_link,
h.sh_info,
h.sh_addralign,
unsafe { self.shstr(h.sh_name) }
);
}
crate::println!("");
}
fn dump_symbols(&self, symbols: &[Sym]) {
crate::println!(" {:<3} {:<8} {:<18} {:<10} {:<6} {:<10} {:<5} {}",
"Idx",
"Type",
"Address",
"Size",
"Bind",
"Visibility",
"Shndx",
"Name"
);
//crate::println!("{:?}", self.phdrs);
for (idx, s) in symbols.into_iter().enumerate() {
crate::println!(" {:<3} {:<8} 0x{:0>16x} 0x{:0>8x} {:<6} {:<10} {:<5} {}",
idx,
s.st_type().as_str(),
s.st_value,
s.st_size,
s.st_bind().as_str(),
s.st_visibility().as_str(),
s.st_shndx,
unsafe { self.strtab(s.st_name) }
);
}
}
pub fn dump_symtab(&self) {
//crate::println!("{:?}", ""); // NOTE: Without this it will segfault in the for loop...
let symtab = self.symtab();
crate::println!("Static symbols: {:?}", symtab.len());
if symtab.len() > 0 {
self.dump_symbols(symtab);
}
crate::println!("");
}
pub fn dump_dynsym(&self) {
//crate::println!("{:?}", ""); // NOTE: Without this it will segfault in the for loop...
let dynsym = self.dynsym();
crate::println!("Dynamic symbols: {:?}", dynsym.len());
if dynsym.len() > 0 {
self.dump_symbols(dynsym);
}
crate::println!("");
}
pub fn dump_dynamic(&self) {
let dynamic = self.dynamic();
crate::println!("Dynamic relocation: {:?}", dynamic.len());
if dynamic.len() == 0 {
return;
}
crate::println!(" {:<3} {:<15} {}", "Idx", "Type", "Value");
//crate::println!("{:?}", self.phdrs);
for (idx, s) in dynamic.into_iter().enumerate() {
match s.d_tag() {
DT::DT_SONAME | DT::DT_NEEDED => {
crate::println!(" {:<3} {:<15} {}",
idx,
s.d_tag().as_str(),
unsafe { self.strtab(s.d_val as u32) }
);
}
_ => {
crate::println!(" {:<3} {:<15} 0x{:0>16x}",
idx,
s.d_tag().as_str(),
s.d_val
);
}
}
if let DT::DT_NULL = s.d_tag() {
break
}
}
crate::println!("");
}
}
}
The dynamic linker
In this series we’re going to implement a basic version of a dynamic linker to load and relocate symbols at runtime. Let’s take a step back and see what dynamic linking is all about. As always, we’re going to use the simplest example programs, and for that we have to write some assembly. Let’s get some definitions done to avoid confusion:
Object file:
An object file is a compilation unit in which all the necessary information is collected. Once the program files are compiled into object files, these can be linked together to form a library or an executable file. This job is done by the program linker.
Let’s create a simplistic object file which only holds a global variable. We could use this variable to specify the exit
code of the program, so let’s call the file rc.s and the variable RC
global RC:data
section .data
RC: db 1
You can compile and print the most important information about the object file like this: (I removed some unimportant lines)
> nasm -f elf64 rc.s
> readelf -Wa rc.o
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: REL (Relocatable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x0
Start of program headers: 0 (bytes into file)
Start of section headers: 64 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 0 (bytes)
Number of program headers: 0
Size of section headers: 64 (bytes)
Number of section headers: 5
Section header string table index: 2
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .data PROGBITS 0000000000000000 000180 000001 00 WA 0 0 4
[ 2] .shstrtab STRTAB 0000000000000000 000190 000021 00 0 0 1
[ 3] .symtab SYMTAB 0000000000000000 0001c0 000060 18 4 3 8
[ 4] .strtab STRTAB 0000000000000000 000220 00000b 00 0 0 1
Symbol table '.symtab' contains 4 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS ./rc.s
2: 0000000000000000 0 SECTION LOCAL DEFAULT 1 .data
3: 0000000000000000 0 OBJECT GLOBAL DEFAULT 1 RC
This tells us the following:
- The type of the file is REL (Relocatable file)
- It has a .data section (section .data in asm)
- It has a global symbol called RC (last line) and it’s located in the first section (Ndx=1) which is the .data section
As you can see, the value of every symbol is zero. The value should be a memory address which the symbol points to, so how can it be zero? In a relocatable file the value is only an offset into the symbol’s section (RC is the first byte of .data). It will be updated by the linker once this object file is merged into an executable or a shared library.
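The symbol table itself is nothing more than an array of fixed-size records: the ES column says 0x18 (24) bytes per entry, and 0x60 bytes of .symtab therefore hold the 4 entries readelf printed. A minimal sketch of decoding one such record from raw little-endian bytes follows. The field order is the one from the ELF64 specification; the helper names (parse_sym, entry_count, bind, kind) are made up for this example and are independent of the Sym type used in the readelf code above.

```rust
/// One ELF64 symbol table entry (24 bytes on disk).
#[derive(Debug, PartialEq)]
pub struct Sym {
    pub st_name: u32,  // offset into the string table
    pub st_info: u8,   // binding (high 4 bits) and type (low 4 bits)
    pub st_other: u8,  // visibility
    pub st_shndx: u16, // index of the section the symbol lives in
    pub st_value: u64, // section offset (REL) or address (EXEC/DYN)
    pub st_size: u64,
}

fn le_u64(b: &[u8]) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&b[..8]);
    u64::from_le_bytes(a)
}

pub fn parse_sym(b: &[u8; 24]) -> Sym {
    Sym {
        st_name: u32::from_le_bytes([b[0], b[1], b[2], b[3]]),
        st_info: b[4],
        st_other: b[5],
        st_shndx: u16::from_le_bytes([b[6], b[7]]),
        st_value: le_u64(&b[8..16]),
        st_size: le_u64(&b[16..24]),
    }
}

impl Sym {
    pub fn bind(&self) -> u8 { self.st_info >> 4 }  // 1 = GLOBAL
    pub fn kind(&self) -> u8 { self.st_info & 0xf } // 1 = OBJECT
}

/// Number of entries in a table section: Size / ES.
pub fn entry_count(sh_size: u64, sh_entsize: u64) -> u64 {
    sh_size / sh_entsize
}
```

For the RC entry we would expect bind() == 1 (GLOBAL), kind() == 1 (OBJECT), st_shndx == 1 and st_value == 0.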
Static linking:
The simplest way to create an elf binary is merging all of its parts into a single file. This allows it to be fully independent from any other userspace code. As a result you can put it into a docker container / chroot environment and it will just run.
Let’s reimplement the /bin/false command in such a way. The only thing the false command does is exit with 1 as return code.
To make it a bit more interesting, let’s use the rc.o file as a static library which we can include into our binary and use the
value of RC defined there as the exit code of our binary. The source of our false.s looks like this:
global _start
extern RC:data
section .text
_start:
mov rdi,[RC]
mov rax,0x3c
syscall
With extern RC:data we tell the assembler that a symbol RC of type data exists somewhere in another object file which we will
link against. With mov rdi,[RC] we tell the assembler to go to the address marked by RC, read the value of the memory there
and move it into the rdi register. This register holds the first argument of the exit system call (0x3c), which is the exit code.
We can compile, link and run like this:
> nasm -f elf64 ./false.s
> ld -static ./false.o ./rc.o -o ./false
> ./false; echo $?
1
Let’s have a closer look with readelf
> readelf -Wa ./false
ELF Header:
Magic: 7f 45 4c 46 02 01 01 00 00 00 00 00 00 00 00 00
Class: ELF64
Data: 2's complement, little endian
Version: 1 (current)
OS/ABI: UNIX - System V
ABI Version: 0
Type: EXEC (Executable file)
Machine: Advanced Micro Devices X86-64
Version: 0x1
Entry point address: 0x401000
Start of program headers: 64 (bytes into file)
Start of section headers: 8480 (bytes into file)
Flags: 0x0
Size of this header: 64 (bytes)
Size of program headers: 56 (bytes)
Number of program headers: 3
Size of section headers: 64 (bytes)
Number of section headers: 6
Section header string table index: 5
Section Headers:
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[ 0] NULL 0000000000000000 000000 000000 00 0 0 0
[ 1] .text PROGBITS 0000000000401000 001000 00000f 00 AX 0 0 16
[ 2] .data PROGBITS 0000000000402000 002000 000001 00 WA 0 0 4
[ 3] .symtab SYMTAB 0000000000000000 002008 0000c0 18 4 3 8
[ 4] .strtab STRTAB 0000000000000000 0020c8 00002d 00 0 0 1
[ 5] .shstrtab STRTAB 0000000000000000 0020f5 000027 00 0 0 1
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x0000000000400000 0x0000000000400000 0x0000e8 0x0000e8 R 0x1000
LOAD 0x001000 0x0000000000401000 0x0000000000401000 0x00000f 0x00000f R E 0x1000
LOAD 0x002000 0x0000000000402000 0x0000000000402000 0x000001 0x000001 RW 0x1000
Section to Segment mapping:
Segment Sections...
00
01 .text
02 .data
Symbol table '.symtab' contains 8 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS ./false.s
2: 0000000000000000 0 FILE LOCAL DEFAULT ABS ./rc.s
3: 0000000000401000 0 NOTYPE GLOBAL DEFAULT 1 _start
4: 0000000000402001 0 NOTYPE GLOBAL DEFAULT 2 __bss_start
5: 0000000000402001 0 NOTYPE GLOBAL DEFAULT 2 _edata
6: 0000000000402008 0 NOTYPE GLOBAL DEFAULT 2 _end
7: 0000000000402000 0 OBJECT GLOBAL DEFAULT 2 RC
As we can see, the Type of the file is now EXEC (Executable file), and as a result there is a new part in this dump compared
to the output of rc.o: Program Headers. The three lines under the Program Headers describe how this executable needs to be
loaded into memory when it is run. The first line has R in the Flg column, meaning that it can only be read. The
second line has R E, meaning it can be read and executed. It holds the .text section, as we defined in the assembly code
with section .text. The third line shows the .data section, which can be read and written (Flg = RW).
We can also see that the RC symbol was merged into the .symtab of this file and that it points to the location 0x0000000000402000
(last line). As in the rc.o file, the RC symbol is located in the .data section (Ndx = 2, meaning index two in the
Section Headers above). One can also take the Value of the symbol (0x0000000000402000) and find the same address in the
Address column of the Section Headers. This means that the RC symbol points exactly to the first byte of the .data section.
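The lookup we just did by eye can be written down in a few lines: find the section whose address range contains the symbol value, then translate the address into a file offset. This is a sketch with a hypothetical Section struct and locate function, fed with numbers copied from the readelf output above.

```rust
/// The few section header fields we need, named after the readelf columns.
pub struct Section {
    pub name: &'static str,
    pub addr: u64,   // Address
    pub offset: u64, // Off
    pub size: u64,   // Size
}

/// Returns the name of the section containing `value` and the offset
/// of that address inside the file.
pub fn locate(sections: &[Section], value: u64) -> Option<(&'static str, u64)> {
    sections
        .iter()
        // Sections with address 0 (.symtab, .strtab, ...) are not loaded.
        .find(|s| s.addr != 0 && value >= s.addr && value < s.addr + s.size)
        .map(|s| (s.name, s.offset + (value - s.addr)))
}
```

With .text at 0x401000 (Off 0x1000, Size 0xf) and .data at 0x402000 (Off 0x2000, Size 0x1), locate(&sections, 0x402000) yields (".data", 0x2000): the byte holding our exit code sits at offset 0x2000 of the file.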
Since we have an executable part in our file we can dump its content with objdump like this:
> objdump -M intel -d ./false
0000000000401000 <_start>:
401000: 48 8b 3c 25 00 20 40 mov rdi,QWORD PTR ds:0x402000
401007: 00
401008: b8 3c 00 00 00 mov eax,0x3c
40100d: 0f 05 syscall
As you can see, mov rdi,[RC] was replaced with mov rdi,QWORD PTR ds:0x402000. The address in this
instruction is the same as the Value of the RC symbol. So it points to the same byte of the .data section and the instruction will
use the value located there, which in our case is 1.
Let’s check out the memory mappings of our executable in gdb
> gdb ./false
(gdb) break _start
(gdb) run
(gdb) info proc mappings
Start Addr End Addr Size Offset Perms objfile
0x400000 0x401000 0x1000 0x0 r--p /false
0x401000 0x402000 0x1000 0x1000 r-xp /false
0x402000 0x403000 0x1000 0x2000 rw-p /false
0x7ffff7ff9000 0x7ffff7ffd000 0x4000 0x0 r--p [vvar]
0x7ffff7ffd000 0x7ffff7fff000 0x2000 0x0 r-xp [vdso]
0x7ffffffde000 0x7ffffffff000 0x21000 0x0 rw-p [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
As you can see, there are three mappings pointing to our executable, with the same permissions as we discussed in the Program Headers part. These mappings are created by the kernel as it initializes our process. The start addresses are the same as the VirtAddr values in the Program Headers, and each end address is rounded up to the next page boundary (0x1000).
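The relation between a LOAD program header and the line gdb prints can be sketched as a small function: round the start down and the end up to the page size, and translate the flag bits into the r/w/x letters. The PF_* values are the ones from the ELF specification; the function name mapping and the hard-coded trailing p (private mapping) are assumptions of this sketch, and the real kernel logic handles more cases (file offsets, bss, alignment).

```rust
pub const PF_X: u32 = 1;
pub const PF_W: u32 = 2;
pub const PF_R: u32 = 4;

/// Computes (start, end, perms) of the mapping for one LOAD segment.
/// `page` must be a power of two, e.g. 0x1000.
pub fn mapping(vaddr: u64, memsz: u64, flags: u32, page: u64) -> (u64, u64, String) {
    let start = vaddr & !(page - 1);                    // round down
    let end = (vaddr + memsz + page - 1) & !(page - 1); // round up
    let bit = |mask: u32, c: char| if flags & mask != 0 { c } else { '-' };
    let perms = format!("{}{}{}p", bit(PF_R, 'r'), bit(PF_W, 'w'), bit(PF_X, 'x'));
    (start, end, perms)
}
```

Feeding it the second program header, mapping(0x401000, 0xf, PF_R | PF_X, 0x1000), gives 0x401000 to 0x402000 with r-xp, which is the second line of the gdb output.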
Dynamic linking
As you could see, with static linking all the code is merged into a single executable. This makes everything really simple, but it also means that if two executables use the same library, the code of the library will be in memory twice. It also takes twice as much space on disk. Since the code is not shared, it cannot share cache entries either. So even though the library code is exactly the same, each process brings its own copy into the CPU cache and may evict the copy of the other, resulting in more cache misses.
Luckily there is a solution for that called shared libraries. But as always flexibility brings complexity.
Let’s create a shared library from the rc.o and link our false.o dynamically against it.
Since the code of rc.s is dead simple, it doesn’t need to be recompiled with nasm. For bigger code bases, however, the code
needs to be written differently if it’s meant to be a shared library. More about that later on.
To create the lib we have to link it as shared.
> ld -shared rc.o -o rc.so
Since this will generate much more information we don’t print everything with readelf -Wa but only the important parts with
some command line flags. All of them can be found with readelf --help.
> readelf -Wh rc.so | grep Type
Type: DYN (Shared object file)
As we can see in the elf header the type of this file is DYN (Shared object file).
In the Program Headers we can see that there is no executable part anymore (Flg=R E), but there are some other types like
DYNAMIC and GNU_RELRO. TODO: Describe what these are for.
> readelf -Wl ./rc.so
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x001000 0x001000 R 0x1000
LOAD 0x001f40 0x0000000000001f40 0x0000000000001f40 0x0000c1 0x0000c1 RW 0x1000
DYNAMIC 0x001f40 0x0000000000001f40 0x0000000000001f40 0x0000c0 0x0000c0 RW 0x8
GNU_RELRO 0x001f40 0x0000000000001f40 0x0000000000001f40 0x0000c0 0x0000c0 R 0x1
Let’s check out the symbols in this file
> readelf -Ws rc.so
Symbol table '.dynsym' contains 2 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000002000 0 OBJECT GLOBAL DEFAULT 7 RC
Symbol table '.symtab' contains 5 entries:
Num: Value Size Type Bind Vis Ndx Name
0: 0000000000000000 0 NOTYPE LOCAL DEFAULT UND
1: 0000000000000000 0 FILE LOCAL DEFAULT ABS ./rc.s
2: 0000000000000000 0 FILE LOCAL DEFAULT ABS
3: 0000000000001f40 0 OBJECT LOCAL DEFAULT 6 _DYNAMIC
4: 0000000000002000 0 OBJECT GLOBAL DEFAULT 7 RC
It looks a bit different from the one we saw in our statically linked binary or in our object file. RC now has the address
0x0000000000002000, which is much smaller than the one we saw in the statically linked binary (0x0000000000402000). This
happens because it’s still an intermediate address. It only shows where the symbol is located relative to the start of the shared object. In
the statically linked binary, by contrast, it showed us a real memory address where the symbol will be located once the code is loaded
into memory and the process runs.
With static linking we have the luxury of expecting that the code will always be mapped to the same location in
memory (0x0000000000400000), so we can already calculate the absolute addresses of the symbols at link time.
Dynamically loaded libraries, on the other hand, must expect to be loaded at a random location of the address space.
Otherwise we would need a global registry of the memory addresses where the different libraries are going to be loaded.
(A bit like the way public IP addresses get assigned to companies.)
As a result, all the symbol addresses of a shared library need to be updated once it has been loaded into memory. That’s the job of the dynamic loader which we are going to implement in this series.
But first let’s create our executable by dynamically linking against our rc.so library. This time we need to modify our
source code. Since the executable can only know the exact location of the library once it has been loaded, we have to write
our code in a way which respects this:
global _start
extern RC:data
section .text
_start:
mov rax,[rel RC wrt ..got]
mov rdi,[rax]
mov rax,0x3c
syscall
Let’s recompile and run our command. To do that we need to find the dynamic loader of the system which can be done like this
> ls /lib64/ld*
/lib64/ld-linux-x86-64.so.2
Now we can pass it to our linker
> nasm -f elf64 false.s
> ld --dynamic-linker /lib64/ld-linux-x86-64.so.2 -o false false.o -L. -l:./rc.so
> ./false; echo $?
1
Let’s check out the memory in gdb
> gdb ./false
(gdb) break _start
(gdb) run
Starting program: /home/taabodal/work/blog/code/target/false
Breakpoint 1, 0x00007ffff7fe3290 in _start () from /lib64/ld-linux-x86-64.so.2
(gdb) info proc mappings
Start Addr End Addr Size Offset Perms objfile
0x400000 0x401000 0x1000 0x0 r--p /false
0x401000 0x402000 0x1000 0x1000 r-xp /false
0x402000 0x404000 0x2000 0x2000 rw-p /false
0x7ffff7fbd000 0x7ffff7fc1000 0x4000 0x0 r--p [vvar]
0x7ffff7fc1000 0x7ffff7fc3000 0x2000 0x0 r-xp [vdso]
0x7ffff7fc3000 0x7ffff7fc5000 0x2000 0x0 r--p /lib/ld.so
0x7ffff7fc5000 0x7ffff7fef000 0x2a000 0x2000 r-xp /lib/ld.so
0x7ffff7fef000 0x7ffff7ffa000 0xb000 0x2c000 r--p /lib/ld.so
0x7ffff7ffb000 0x7ffff7fff000 0x4000 0x37000 rw-p /lib/ld.so
0x7ffffffde000 0x7ffffffff000 0x21000 0x0 rw-p [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
There are multiple things to see. Even though we set the breakpoint at our own _start function, it stops at another _start function.
This is the one of the dynamic linker (ld.so). (Note that I shortened the name of ld.so in the output because it doesn’t matter but
makes the article look ugly.)
The other thing to see is that, compared to our static binary, the dynamic loader is also mapped into our
address space. And if you hit continue in the debugger, let it stop at our _start function and check the mappings again,
you’ll see that rc.so is mapped too. Loading such shared libraries at the startup of the program is one of the
jobs of the dynamic loader.
(gdb) continue
(gdb) info proc mappings
Start Addr End Addr Size Offset Perms objfile
0x400000 0x401000 0x1000 0x0 r--p /false
0x401000 0x402000 0x1000 0x1000 r-xp /false
0x402000 0x403000 0x1000 0x2000 r--p /false
0x403000 0x404000 0x1000 0x3000 rw-p /false
0x7ffff7fb6000 0x7ffff7fb8000 0x2000 0x0 rw-p
0x7ffff7fb8000 0x7ffff7fb9000 0x1000 0x0 r--p /rc.so
0x7ffff7fb9000 0x7ffff7fba000 0x1000 0x1000 r--p /rc.so
0x7ffff7fba000 0x7ffff7fbb000 0x1000 0x2000 rw-p /rc.so
0x7ffff7fbb000 0x7ffff7fbd000 0x2000 0x0 rw-p
0x7ffff7fbd000 0x7ffff7fc1000 0x4000 0x0 r--p [vvar]
0x7ffff7fc1000 0x7ffff7fc3000 0x2000 0x0 r-xp [vdso]
0x7ffff7fc3000 0x7ffff7fc5000 0x2000 0x0 r--p /lib/ld.so
0x7ffff7fc5000 0x7ffff7fef000 0x2a000 0x2000 r-xp /lib/ld.so
0x7ffff7fef000 0x7ffff7ffa000 0xb000 0x2c000 r--p /lib/ld.so
0x7ffff7ffb000 0x7ffff7fff000 0x4000 0x37000 rw-p /lib/ld.so
0x7ffffffde000 0x7ffffffff000 0x21000 0x0 rw-p [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
Position independent executable (PIE)
As we discussed above, all shared libraries need to be position independent, since they can be loaded anywhere in memory.
To achieve that we have to write position independent code (PIC) or instruct the compiler to generate PIC assembly for us (gcc -fpic).
But can we do the same for executables? Yes we can. In principle it is the same process. We need to write code that expects
to be loaded anywhere in memory and link it with the -pie flag. Since the source code of our executable is basically empty,
we can already link it as PIE. A position independent executable can be statically as well as dynamically linked. There is a
restriction though: all the components which we are linking against need to be written in a PIC way. For dynamic libraries this is
the default, but in the case of static libraries we need to rewrite or regenerate our code in a PIC way.
Static PIE
Our false.s should look like this now:
global _start
extern RC:data
section .text
_start:
mov rax,[rel RC wrt ..got]
mov rdi,[rax]
mov rax,0x3c
syscall
And we can compile it like
> nasm -f elf64 false.s
> ld -static -pie --no-dynamic-linker -o false false.o rc.o
It changes the type in the header of the ELF file:
> readelf -Wh ./false | grep Type
Type: DYN (Position-Independent Executable file)
It also creates the DYNAMIC and GNU_RELRO program headers:
> readelf -Wl ./false
Program Headers:
Type Offset VirtAddr PhysAddr FileSiz MemSiz Flg Align
LOAD 0x000000 0x0000000000000000 0x0000000000000000 0x0001d9 0x0001d9 R 0x1000
LOAD 0x001000 0x0000000000001000 0x0000000000001000 0x000015 0x000015 R E 0x1000
LOAD 0x002000 0x0000000000002000 0x0000000000002000 0x000000 0x000000 R 0x1000
LOAD 0x002f20 0x0000000000002f20 0x0000000000002f20 0x0000e1 0x0000e1 RW 0x1000
DYNAMIC 0x002f20 0x0000000000002f20 0x0000000000002f20 0x0000e0 0x0000e0 RW 0x8
GNU_RELRO 0x002f20 0x0000000000002f20 0x0000000000002f20 0x0000e0 0x0000e0 R 0x1
And if we have a look at the mapping of the running process we can see that our executable wasn’t mapped at 0x400000 anymore
but at 0x7ffff7ffb000.
> gdb ./false
(gdb) break _start
(gdb) run
(gdb) info proc mappings
Start Addr End Addr Size Offset Perms objfile
0x7ffff7ff5000 0x7ffff7ff9000 0x4000 0x0 r--p [vvar]
0x7ffff7ff9000 0x7ffff7ffb000 0x2000 0x0 r-xp [vdso]
0x7ffff7ffb000 0x7ffff7ffc000 0x1000 0x0 r--p /false
0x7ffff7ffc000 0x7ffff7ffd000 0x1000 0x1000 r-xp /false
0x7ffff7ffd000 0x7ffff7fff000 0x2000 0x2000 rw-p /false
0x7ffffffde000 0x7ffffffff000 0x21000 0x0 rw-p [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
Dynamic PIE
We can use the same false.s like we did in the dynamic library section and link it with
> ld -pie --dynamic-linker /lib64/ld-linux-x86-64.so.2 -o false false.o -L. -l:./rc.so
In gdb we can also see that it was mapped into the high address range:
> gdb ./false
(gdb) break _start
(gdb) run
(gdb) continue
(gdb) info proc mappings
Start Addr End Addr Size Offset Perms objfile
0x555555554000 0x555555555000 0x1000 0x0 r--p /false
0x555555555000 0x555555556000 0x1000 0x1000 r-xp /false
0x555555556000 0x555555557000 0x1000 0x2000 r--p /false
0x555555557000 0x555555558000 0x1000 0x3000 rw-p /false
0x7ffff7fb6000 0x7ffff7fb8000 0x2000 0x0 rw-p
0x7ffff7fb8000 0x7ffff7fb9000 0x1000 0x0 r--p /rc.so
0x7ffff7fb9000 0x7ffff7fba000 0x1000 0x1000 r--p /rc.so
0x7ffff7fba000 0x7ffff7fbb000 0x1000 0x2000 rw-p /rc.so
0x7ffff7fbb000 0x7ffff7fbd000 0x2000 0x0 rw-p
0x7ffff7fbd000 0x7ffff7fc1000 0x4000 0x0 r--p [vvar]
0x7ffff7fc1000 0x7ffff7fc3000 0x2000 0x0 r-xp [vdso]
0x7ffff7fc3000 0x7ffff7fc5000 0x2000 0x0 r--p /lib/ld.so
0x7ffff7fc5000 0x7ffff7fef000 0x2a000 0x2000 r-xp /lib/ld.so
0x7ffff7fef000 0x7ffff7ffa000 0xb000 0x2c000 r--p /lib/ld.so
0x7ffff7ffb000 0x7ffff7fff000 0x4000 0x37000 rw-p /lib/ld.so
0x7ffffffde000 0x7ffffffff000 0x21000 0x0 rw-p [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
There is a new section in the output of readelf which we haven’t seen before: the Relocations
> readelf -Wr false
Relocation section '.rela.dyn' at offset 0x298 contains 1 entry:
Offset Info Type Symbol's Value Symbol's Name + Addend
0000000000002ff8 0000000100000006 R_X86_64_GLOB_DAT 0000000000000000 RC + 0
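The Info column above packs two numbers into one 64 bit value: the upper 32 bits are the index of the symbol in the dynamic symbol table and the lower 32 bits are the relocation type. A minimal sketch of the decoding follows; the function names r_sym and r_type are mine, modelled on the ELF64_R_SYM and ELF64_R_TYPE macros of the ELF specification.

```rust
/// Relocation type number of R_X86_64_GLOB_DAT in the x86-64 psABI.
pub const R_X86_64_GLOB_DAT: u32 = 6;

/// Index into the dynamic symbol table (upper 32 bits of r_info).
pub fn r_sym(info: u64) -> u32 {
    (info >> 32) as u32
}

/// Relocation type (lower 32 bits of r_info).
pub fn r_type(info: u64) -> u32 {
    (info & 0xffff_ffff) as u32
}
```

For 0x0000000100000006 this gives symbol index 1 and type 6, which readelf already translated for us into RC and R_X86_64_GLOB_DAT.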
As we’re referencing a variable which is located in a shared library, we cannot know its address before the library
is mapped. So the linker adds an indirection for us. Instead of referencing the variable directly, we reference
a memory address which belongs to our binary and which serves as a pointer to the real address of the variable. Hence the
assembly code mov rax,[rel RC wrt ..got], which can be read like this:
- Calculate the relative location of RC With Reference To the GOT
- Put the value which can be found at this location into rax
So what is the GOT? It stands for Global Offset Table. The GOT is a location in our program which we can use to delay the resolution of an address. It works a bit like a phone book. You know the name of the person you want to call, so you look up their number in the book and then you use that number to reach the person. It’s also a bit different from a phone book, because it’s empty at the beginning of our program.
> xxd -c8 false | grep 002ff8
00002ff8: 0000 0000 0000 0000 ........
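To make the phone book idea a bit more tangible, here is a toy model in plain Rust. The Got struct and its methods are invented for this illustration and have nothing to do with how a real loader stores the table; the static RC merely stands in for the variable that lives in rc.so.

```rust
/// Stands in for the RC variable defined in the shared library.
pub static RC: u8 = 1;

/// A GOT with a single slot. None models the all-zero entry in the file.
pub struct Got {
    slot: Option<&'static u8>,
}

impl Got {
    pub fn empty() -> Self {
        Got { slot: None }
    }

    /// What the dynamic loader does at startup: write the real address.
    pub fn relocate(&mut self, target: &'static u8) {
        self.slot = Some(target);
    }

    /// What our program does: first load the pointer from the GOT
    /// (mov rax,[rel RC wrt ..got]), then load through it (mov rdi,[rax]).
    pub fn read(&self) -> Option<u8> {
        self.slot.map(|p| *p)
    }
}
```

Before relocate is called read() has nothing to return; afterwards it yields the value of RC.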
If you look at the offset of the relocation (0x0000000000002ff8) and look it up in the Section Headers, you’ll see that
it points to the first element of the GOT
> readelf -WS false | grep -E '0000000000002ff8|Name'
[Nr] Name Type Address Off Size ES Flg Lk Inf Al
[10] .got PROGBITS 0000000000002ff8 002ff8 000008 08 WA 0 0 8
Every time a PIE program is started, the dynamic linker checks whether there are any relocations in the program which need to be applied, and if there are, it fixes up the addresses in the executable. The same is true for every dynamic library.
Let’s prove this with gdb
> gdb ./false
(gdb) break _start
(gdb) run
(gdb) info proc mappings
Start Addr End Addr Size Offset Perms objfile
0x555555554000 0x555555555000 0x1000 0x0 r--p /false
0x555555555000 0x555555556000 0x1000 0x1000 r-xp /false
0x555555556000 0x555555558000 0x2000 0x2000 rw-p /false
0x7ffff7fbd000 0x7ffff7fc1000 0x4000 0x0 r--p [vvar]
0x7ffff7fc1000 0x7ffff7fc3000 0x2000 0x0 r-xp [vdso]
0x7ffff7fc3000 0x7ffff7fc5000 0x2000 0x0 r--p /lib/ld.so
0x7ffff7fc5000 0x7ffff7fef000 0x2a000 0x2000 r-xp /lib/ld.so
0x7ffff7fef000 0x7ffff7ffa000 0xb000 0x2c000 r--p /lib/ld.so
0x7ffff7ffb000 0x7ffff7fff000 0x4000 0x37000 rw-p /lib/ld.so
0x7ffffffde000 0x7ffffffff000 0x21000 0x0 rw-p [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
As you can see, our program was mapped at the address 0x555555554000. If we add the offset of the relocation to this address,
we can read the value at this memory location. At this point it is zero, because the dynamic linker has just started and hasn’t
done any fixups yet. Once we let the program continue and stop at our _start function, the dynamic linker has already finished
its first job and the value at the relocation’s address has changed.
(gdb) x/1gx 0x555555554000 + 0x002ff8
0x555555556ff8: 0x0000000000000000
(gdb) continue
(gdb) x/1gx 0x555555554000 + 0x002ff8
0x555555556ff8: 0x00007ffff7fba000
At this point our program is ready to use this indirection to access the memory location of RC.
(gdb) x/1bx 0x00007ffff7fba000
0x7ffff7fba000: 0x01
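The numbers from this gdb session fit together with simple additions: the GOT slot lives at the load address of the executable plus the relocation offset, and the value written there is the load address of rc.so (0x7ffff7fb8000 in the mappings shown earlier) plus the Value of RC (0x2000). Below is a sketch of that fix-up on a byte buffer that plays the role of the mapped executable; all function names are hypothetical and only the single relocation type we met is covered.

```rust
/// Runtime address of the slot a relocation refers to.
pub fn slot_address(exe_base: u64, r_offset: u64) -> u64 {
    exe_base + r_offset
}

/// Runtime address of a symbol defined in a loaded shared object.
pub fn symbol_address(lib_base: u64, st_value: u64) -> u64 {
    lib_base + st_value
}

/// Applies a GLOB_DAT style relocation: store the symbol address,
/// little-endian, at `r_offset` inside the image.
pub fn apply_glob_dat(image: &mut [u8], r_offset: usize, sym_addr: u64) {
    image[r_offset..r_offset + 8].copy_from_slice(&sym_addr.to_le_bytes());
}

/// Reads a 64 bit little-endian value back, like x/1gx in gdb.
pub fn read_u64(image: &[u8], offset: usize) -> u64 {
    let mut a = [0u8; 8];
    a.copy_from_slice(&image[offset..offset + 8]);
    u64::from_le_bytes(a)
}
```

slot_address(0x555555554000, 0x2ff8) is the 0x555555556ff8 we inspected, and symbol_address(0x7ffff7fb8000, 0x2000) is the 0x7ffff7fba000 which appeared there after the dynamic linker finished.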
Conclusion
To summarize the above:
- PIC: Position independent code is a type of assembly code which only uses relative addressing. This is a must for dynamically linked libraries and an option for executables.
- PIE: A position independent executable is an executable which is written with PIC only, so it can be loaded anywhere in the memory address space.
- Object file: a compilation unit which will be relocated during linking. Multiple object files can be merged into an archive (static library), a shared object (dynamic library) or an executable.
- Shared object file: a dynamically linked library which can be loaded anywhere in memory because it’s written in PIC.
- Static linking: a way to combine multiple object files into a single executable.
- Dynamic linking: a way to tell the linker that some of the dependencies will only be available at runtime.
Interpreter
Let’s create a basic dynamic loader (aka interpreter) which can hand over control to our executable. To be able to do that we need to know where the entry point of the main executable is.
When the kernel loads the program it has to find out a couple of things about it. Although the dynamic loader could do the same, this information has already been parsed, so the kernel can simply put it onto the stack of the process and let the dynamic loader find it there. This is the information in the auxiliary vector, which we already implemented earlier.
For a quick recap let’s print out the values. Let’s use the ls command to check what kind of data is passed when it gets started.
> gdb /bin/ls
(gdb) break _start
(gdb) run
(gdb) info auxv
33 AT_SYSINFO_EHDR System-supplied DSO's ELF header 0x7ffff7fc1000
51 AT_MINSIGSTKSZ Minimum stack size for signal delivery 0xe30
16 AT_HWCAP Machine-dependent CPU capability hints 0xf8bfbff
6 AT_PAGESZ System page size 4096
17 AT_CLKTCK Frequency of times() 100
3 AT_PHDR Program headers for program 0x555555554040
4 AT_PHENT Size of program header entry 56
5 AT_PHNUM Number of program headers 13
7 AT_BASE Base address of interpreter 0x7ffff7fc3000
8 AT_FLAGS Flags 0x0
9 AT_ENTRY Entry point of program 0x55555555aaa0
11 AT_UID Real user ID 1066129479
12 AT_EUID Effective user ID 1066129479
13 AT_GID Real group ID 1065878017
14 AT_EGID Effective group ID 1065878017
23 AT_SECURE Boolean, was exec setuid-like? 0
25 AT_RANDOM Address of 16 random bytes 0x7fffffffec19
26 AT_HWCAP2 Extension of AT_HWCAP 0x2
31 AT_EXECFN File name of executable 0x7fffffffefec "/usr/bin/ls"
15 AT_PLATFORM String identifying platform 0x7fffffffec29 "x86_64"
0 AT_NULL End of vector 0x0
As we can see the entry point is marked by AT_ENTRY so let’s find that value.
Our main function could simply look like this:
#[no_mangle]
fn main() -> u8 {
for aux in linux::env::auxv() {
if let AT::AT_ENTRY(entry) = aux {
unsafe {
core::arch::asm!(
"jmp {}",
in(reg) entry,
options(nostack, noreturn),
);
}
}
}
unreachable!()
}
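The linux::env::auxv() iterator above comes from the library we built earlier. For readers who don’t have it at hand, here is a minimal standalone sketch of the same lookup. It assumes the usual initial stack layout (argc, the argv pointers, a null pointer, the envp pointers, a null pointer, then the auxv pairs) and simulates the stack with a plain slice instead of reading from rsp. The pointer values in the sample are made up. Only AT_PAGESZ and AT_ENTRY are taken from the gdb output.

```rust
const AT_NULL: u64 = 0;
const AT_PAGESZ: u64 = 6;
const AT_ENTRY: u64 = 9;

/// Index of the first auxv word in `stack`, where `stack[0]` is argc.
fn auxv_start(stack: &[u64]) -> usize {
    let argc = stack[0] as usize;
    // Skip argc, the argv pointers and the null pointer that ends argv.
    let mut i = 1 + argc + 1;
    // Skip the envp pointers up to their terminating null pointer.
    while stack[i] != 0 {
        i += 1;
    }
    i + 1
}

/// Looks up one auxv value by its type. The vector ends at AT_NULL.
fn find_aux(stack: &[u64], key: u64) -> Option<u64> {
    stack[auxv_start(stack)..]
        .chunks_exact(2)
        .take_while(|pair| pair[0] != AT_NULL)
        .find(|pair| pair[0] == key)
        .map(|pair| pair[1])
}

fn main() {
    // A simulated initial stack. The argv/envp pointers are placeholders.
    let stack = [
        2u64,                      // argc
        0x7ffe_0000_1000,          // argv[0] (made up)
        0x7ffe_0000_1010,          // argv[1] (made up)
        0,                         // end of argv
        0x7ffe_0000_1020,          // envp[0] (made up)
        0,                         // end of envp
        AT_PAGESZ, 4096,           // auxv pair
        AT_ENTRY, 0x5555_5555_aaa0, // auxv pair
        AT_NULL, 0,                // end of auxv
    ];
    // Prints: Some(0x55555555aaa0)
    println!("{:#x?}", find_aux(&stack, AT_ENTRY));
}
```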
We also need to rebuild our false binary. And while we’re at it we should also get rid of the complexity of using a library.
Let’s create a simple executable which exits with one.
global _start
section .text
_start:
mov rdi,0x1
mov rax,0x3c
syscall
When we recompile it with our Rust binary as the dynamic linker we can verify the result with:
> nasm -f elf64 false.s
> ld -pie --dynamic-linker ld.so -o false false.o
> readelf -Wl false | grep interpreter
[Requesting program interpreter: ld.so]
> ./false; echo $?
1
As you can see, we started the false executable and still our Rust binary got to run first.
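The "Requesting program interpreter" line printed by readelf comes from the PT_INTERP program header of the executable. As a rough illustration of how such a lookup could work, here is a minimal sketch which walks the ELF64 program headers of a byte buffer. The function names are made up for this example, it does no validation of the ELF magic, and it runs against a hand-built image instead of a real file.

```rust
const PT_INTERP: u32 = 3;

/// Reads a little-endian unsigned integer of `len` bytes (at most 8) at `off`.
fn le(b: &[u8], off: usize, len: usize) -> Option<u64> {
    let bytes = b.get(off..off.checked_add(len)?)?;
    Some(bytes.iter().rev().fold(0u64, |acc, &x| (acc << 8) | x as u64))
}

/// Returns the interpreter path stored in the PT_INTERP segment, if present.
fn find_interp(elf: &[u8]) -> Option<&str> {
    let phoff = le(elf, 0x20, 8)? as usize; // e_phoff
    let phentsize = le(elf, 0x36, 2)? as usize; // e_phentsize
    let phnum = le(elf, 0x38, 2)? as usize; // e_phnum
    for i in 0..phnum {
        let ph = phoff + i * phentsize;
        if le(elf, ph, 4)? != PT_INTERP as u64 {
            continue; // p_type is something else
        }
        let off = le(elf, ph + 0x08, 8)? as usize; // p_offset
        let size = le(elf, ph + 0x20, 8)? as usize; // p_filesz
        let raw = elf.get(off..off.checked_add(size)?)?;
        // The path is stored as a NUL-terminated string.
        let end = raw.iter().position(|&c| c == 0).unwrap_or(raw.len());
        return std::str::from_utf8(&raw[..end]).ok();
    }
    None
}

/// Builds a tiny fake image: a 64-byte ELF header, one 56-byte program
/// header and the interpreter string. Only the fields read above are set.
fn sample_image() -> Vec<u8> {
    let mut elf = vec![0u8; 64 + 56];
    elf[0x20..0x28].copy_from_slice(&64u64.to_le_bytes()); // e_phoff
    elf[0x36..0x38].copy_from_slice(&56u16.to_le_bytes()); // e_phentsize
    elf[0x38..0x3a].copy_from_slice(&1u16.to_le_bytes()); // e_phnum
    elf[64..68].copy_from_slice(&PT_INTERP.to_le_bytes()); // p_type
    elf[72..80].copy_from_slice(&120u64.to_le_bytes()); // p_offset
    elf[96..104].copy_from_slice(&6u64.to_le_bytes()); // p_filesz
    elf.extend_from_slice(b"ld.so\0");
    elf
}

fn main() {
    // Prints: Some("ld.so")
    println!("{:?}", find_interp(&sample_image()));
}
```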
So far so good but what happens if we need to run a non-pie executable? Let’s rebuild false with -no-pie option.
> ld -no-pie --dynamic-linker ld.so -o false false.o
> file false
false: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), statically linked, not stripped
It will be statically linked. It looks like we do need to link against a shared library to convince the linker to create a
dynamically linked executable. So let’s link against our rc.so even if we don’t use the variable defined there anymore.
> ld -no-pie --dynamic-linker ld.so -o false false.o -L. -l:rc.so
> file false
false: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter ld.so, not stripped
It looks better now but if we try to run it then we see the problem
> ./false
panicked at Segmentation fault
Let’s find out with gdb what the cause is:
> gdb ./false
(gdb) b _start
(gdb) r
Starting program: /home/taabodal/work/blog/code/target/false
Cannot access memory at address 0x66c1d40f66eec178
Cannot access memory at address 0x66c1d40f66eec170
Cannot access memory at address 0x66c1d40f66eec178
Cannot access memory at address 0x66c1d40f66eec178
Cannot access memory at address 0x66c1d40f66eec170
(gdb) info proc mappings
Start Addr End Addr Size Offset Perms objfile
0x400000 0x401000 0x1000 0x0 r--p /ld.so
0x401000 0x404000 0x3000 0x1000 r-xp /ld.so
0x404000 0x405000 0x1000 0x4000 r--p /ld.so
0x406000 0x407000 0x1000 0x5000 rw-p /ld.so
0x407000 0x408000 0x1000 0x0 rw-p
0x7ffff7ff9000 0x7ffff7ffd000 0x4000 0x0 r--p [vvar]
0x7ffff7ffd000 0x7ffff7fff000 0x2000 0x0 r-xp [vdso]
0x7ffffffde000 0x7ffffffff000 0x21000 0x0 rw-p [stack]
0xffffffffff600000 0xffffffffff601000 0x1000 0x0 --xp [vsyscall]
As you can see there is a bunch of memory access errors at the startup of the process, and if we list the mappings there is no
executable at all. We only have ld.so mapped into the address space. So what’s the problem? We built a binary which
depends on where it gets loaded, but at its standard position (0x400000) we have already mapped our ld.so, which
collides with the binary it should load. We should really build our ld.so as PIE so it can live together with both PIE
and non-PIE executables in the same address space. We can do that by specifying the link arguments of our Rust binary
in cargo.sh like this: -nostartfiles -pie -Wl,--no-dynamic-linker. Once we’ve done that, it should be mapped at a
random location and let the main executable do its job.
> ./false
Segmentation fault
Or not… But what’s the problem?
> gdb ./false
(gdb) r
Program received signal SIGSEGV, Segmentation fault.
0x0000000000001140 in ?? ()
(gdb) backtrace
#0 0x0000000000001140 in ?? ()
#1 0x00007ffff7ff8629 in linux::__rust_main (rsp=<optimized out>) at lib.rs:63
#2 0x00007ffff7ff85c4 in linux::_start () at lib.rs:42
(gdb) up
(gdb) disassemble
Dump of assembler code for function linux::__rust_main:
0x00007ffff7ff85e0 <+0>: push rbp
0x00007ffff7ff85e1 <+1>: mov rbp,rsp
0x00007ffff7ff85e4 <+4>: mov rax,QWORD PTR [rdi]
0x00007ffff7ff85e7 <+7>: lea rax,[rdi+rax*8]
0x00007ffff7ff85eb <+11>: add rax,0x10
0x00007ffff7ff85ef <+15>: mov rcx,rax
0x00007ffff7ff85f2 <+18>: data16 data16 data16 data16 cs nop WORD PTR [rax+rax*1+0x0]
0x00007ffff7ff8600 <+32>: cmp QWORD PTR [rcx],0x0
0x00007ffff7ff8604 <+36>: lea rcx,[rcx+0x8]
0x00007ffff7ff8608 <+40>: jne 0x7ffff7ff8600 <linux::__rust_main+32>
0x00007ffff7ff860a <+42>: add rdi,0x8
0x00007ffff7ff860e <+46>: mov QWORD PTR [rip+0x59eb],rdi # 0x7ffff7ffe000
0x00007ffff7ff8615 <+53>: mov QWORD PTR [rip+0x59ec],rax # 0x7ffff7ffe008
0x00007ffff7ff861c <+60>: mov QWORD PTR [rip+0x59ed],rcx # 0x7ffff7ffe010
0x00007ffff7ff8623 <+67>: call QWORD PTR [rip+0x598f] # 0x7ffff7ffdfb8
=> 0x00007ffff7ff8629 <+73>: pop rbp
0x00007ffff7ff862a <+74>: ret
End of assembler dump.
(gdb) x/1gx 0x7ffff7ffdfb8
0x7ffff7ffdfb8: 0x0000000000001140
It seems like we’re doing some RIP-relative addressing there and end up trying to jump to the location 0x0000000000001140.
There is definitely nothing to look for at this address. So what’s this address? Where does it come from? It seems to be
a relative relocation to somewhere. But where?
> readelf -Wr bin | grep 1140
0000000000006fb8 0000000000000008 R_X86_64_RELATIVE 1140
> nm --demangle=rust bin | grep 1140
0000000000001140 T main
That’s our main function. Our ld.so tries to call its main function, but before it can do that it needs to be relocated. And who will do this relocation if there is no ld.so running? Well, there is one. We’re building it right now…
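To make the missing step more concrete, here is a minimal sketch of what applying an R_X86_64_RELATIVE relocation means: the slot at base + offset receives the value base + addend. It is only an illustration of the arithmetic, not the implementation we will end up with. It patches a plain byte buffer instead of the process’s own memory, the `Rela` struct and function names are made up for this example, and it assumes that the bin file inspected with readelf is the same binary as our ld.so, so that the load base can be derived from the gdb output as slot address minus relocation offset.

```rust
const R_X86_64_RELATIVE: u64 = 8;

/// One entry of an ELF64 relocation table with addends.
struct Rela {
    offset: u64, // r_offset: where to patch, relative to the load base
    info: u64,   // r_info: relocation type in the low 32 bits
    addend: i64, // r_addend: value to add to the load base
}

/// Applies every R_X86_64_RELATIVE entry to `image`, a buffer standing in
/// for the memory of an object loaded at `base`. Other types are skipped.
fn apply_relative(image: &mut [u8], base: u64, relocs: &[Rela]) {
    for r in relocs {
        if (r.info & 0xffff_ffff) != R_X86_64_RELATIVE {
            continue;
        }
        let value = base.wrapping_add(r.addend as u64);
        let at = r.offset as usize;
        image[at..at + 8].copy_from_slice(&value.to_le_bytes());
    }
}

/// Reads the 8-byte little-endian value stored at `offset`.
fn read_slot(image: &[u8], offset: u64) -> u64 {
    let at = offset as usize;
    let mut bytes = [0u8; 8];
    bytes.copy_from_slice(&image[at..at + 8]);
    u64::from_le_bytes(bytes)
}

fn main() {
    // Load base = runtime slot address (gdb) - r_offset (readelf).
    let base = 0x7fff_f7ff_dfb8u64 - 0x6fb8; // 0x7ffff7ff7000
    let mut image = vec![0u8; 0x7000];
    let relocs = [Rela { offset: 0x6fb8, info: R_X86_64_RELATIVE, addend: 0x1140 }];
    apply_relative(&mut image, base, &relocs);
    // Prints: 0x7ffff7ff8140
    println!("{:#x}", read_slot(&image, 0x6fb8));
}
```

With the slot patched like this, the indirect call seen in the disassembly would land at base + 0x1140, inside the mapped ld.so, instead of at the unmapped address 0x1140.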
To be continued…