Basic X86-64 Assembly Language • Jonathan Cook

Intel x86-64 is a 64-bit Instruction Set Architecture that is used by all 64-bit Intel CPUs and all clones such as AMD CPUs. Intel created the beginnings of its “x86” assembly language in the late 1970s with the 8086, a 16-bit CPU from which we get the “86” in “x86”. This was followed quickly by the 8088, a lower-cost variant of the 8086 with an 8-bit external data bus. The mid-1980s saw the first 32-bit CPU, the 80386, and many 32-bit redesigns were done; eventually a need to move to 64 bits arose, and so the “x86-64” extension was created (actually, AMD created it before Intel!). This page also has direct content to help NMSU CS 370 students in using x86-64 as a compiler target language.

Check the Resources part of this page for other resources.

Page Sections:

The x86-64 ISA (Instruction Set Architecture)
From 8 to 64 bits!
C Programs and 64 bit CPUS
Addressing Modes
Register Names and Calling Convention
Basic Assembly Program Format
Stack Operations
Defining a Function
Function Calls
Expressions
Simple Variable Assignments and Uses
Conditionals and Loops
Complex Conditions (Logical AND/OR)
Local Variables and Arguments
Arrays
Resources

Generating Code Examples For Yourself

You can use the -S (capital S) option on gcc to generate example code from some C code, and look at what the code does. The code in this page was created based on doing just this, and also using the -fpie option, which creates “position independent” code. This code should be more portable and work on more systems than without it.

The x86-64 ISA (Instruction Set Architecture)

To program and use a CPU, it must have a well-defined interface. This is known as its Instruction Set Architecture, because it is centered around the machine instructions that the CPU actually executes in its circuits. But there is lots of other detail around the instructions that needs to be specified to actually program and use the CPU.

What is the native operand size? x86-64 is a 64-bit ISA, but lots of 32-bit operations are still used.

Does a CPU expect multi-byte values to be ordered most significant byte first (big endian), or least significant byte first (little endian)? The x86-64 is little endian.

CPUs use registers to efficiently store values and have instructions operate on them. The x86-64 ISA has a weird combination of named and numbered registers, seen in the table further below.
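To see the little-endian byte order mentioned above for yourself, a small C check can copy the bytes of a 32-bit value out of memory in address order. This is only a sketch; the helper names bytesInMemory and isLittleEndian are made up for illustration. On an x86-64 machine, isLittleEndian() should return 1.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Copy the four bytes of value into out[0..3], lowest address first. */
void bytesInMemory(uint32_t value, unsigned char out[4])
{
    assert(out != NULL);
    memcpy(out, &value, sizeof value);
}

/* Returns 1 when the lowest-addressed byte is the least significant one. */
int isLittleEndian(void)
{
    unsigned char b[4];
    bytesInMemory(0x12345678u, b);
    return b[0] == 0x78;
}
```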

8 bit, 16 bit, 32 bit, and 64 bits!

As Intel grew its CPUs, it insisted that all of the CPUs should be backwards compatible, so they kept all the old lesser-bit instructions and just added the more-bit instructions. Since instructions operate on registers, this was also true about registers. This was done by using unique suffixes on instructions and prefixes on register names!

Original 16 bit register: %ax; then the 32-bit register: %eax; then 64-bit: %rax

So “e” is the 32-bit register prefix, and “r” is the 64-bit register prefix. These are not separate registers! %ax is the lowest 16 bits of %eax, which is in turn the lowest 32 bits of %rax.

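Original 16 bit move instruction: movw; then the 32-bit move: movl; 64-bit: movq.

So “w” is the 16-bit instruction suffix, “l” (ell) is the 32-bit instruction suffix, and “q” is the 64-bit instruction suffix. “w” stands for “word”, “l” stands for “long”, and “q” stands for “quad” (i.e., 4 16-bit words). A plain “mov” with no suffix is also accepted; the assembler then infers the size from the register operands.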
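In a compiler it can be handy to turn an operand size into the matching suffix and register name. Below is a minimal sketch of that idea; sizeSuffix and aRegister are hypothetical helper names, and the 8-bit entries (suffix “b”, register %al) are not discussed above but follow the same pattern.

```c
#include <assert.h>

/* Instruction suffix for an operand size in bytes: b, w, l, or q.
   Returns '\0' for a size that has no suffix. */
char sizeSuffix(int bytes)
{
    switch (bytes) {
    case 1:  return 'b';
    case 2:  return 'w';
    case 4:  return 'l';
    case 8:  return 'q';
    default: return '\0';
    }
}

/* Name of the "a" register at the given size in bytes. */
const char *aRegister(int bytes)
{
    assert(sizeSuffix(bytes) != '\0');   /* caller must pass 1, 2, 4, or 8 */
    switch (bytes) {
    case 1:  return "%al";
    case 2:  return "%ax";
    case 4:  return "%eax";
    default: return "%rax";
    }
}
```

For example, sizeSuffix(4) gives 'l' and aRegister(4) gives "%eax", which together build “movl ..., %eax”.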

C Programs and 64 bit CPUS

Interestingly, the official definition of the C programming language does not specify exactly how many bits each datatype should have! It specifies only a minimum size for each type and the ordering rule:

char <= short <= int <= long <= long long

In other words, a valid C compiler could make them all the same size (as long as each one meets its minimum)!

When compiling for a 64-bit CPU on Linux and other Unix-like systems, C compilers generally use the following sizes (in bits); 64-bit Windows compilers differ in keeping long at 32 bits:

char=8, short=16, int=32, long=64, long long = 64

This is why when we generate the assembly code from a C program using “gcc -S”, for all of the integer operations we see the use of the “%e..” registers and the “…l” instructions (that end with ell). These are the 32-bit versions of everything.

However, pointers are generally 64 bits, so any time we are passing by reference (arrays, string constants, etc.), or generating code that uses pointers, we will see the use of the “%r..” registers and the “…q” instructions.

In CS 370, we will follow the convention above: all integer variables and values will be 32 bits, and pointers will be 64 bits.
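If you want to check the sizes on your own system, a few lines of C will do it. This is just a sketch; the function names sizesAreOrdered and printTypeSizes are made up here. On a typical 64-bit Linux system the printed values should match the sizes listed above, with pointers at 64 bits.

```c
#include <assert.h>
#include <limits.h>
#include <stdio.h>

/* Returns 1 when the sizes follow the ordering rule given above. */
int sizesAreOrdered(void)
{
    return sizeof(char) <= sizeof(short)
        && sizeof(short) <= sizeof(int)
        && sizeof(int) <= sizeof(long)
        && sizeof(long) <= sizeof(long long);
}

/* Print the size in bits of each integer type and of a pointer. */
void printTypeSizes(void)
{
    assert(sizesAreOrdered());
    printf("char=%zu short=%zu int=%zu long=%zu long long=%zu pointer=%zu\n",
           sizeof(char) * CHAR_BIT, sizeof(short) * CHAR_BIT,
           sizeof(int) * CHAR_BIT, sizeof(long) * CHAR_BIT,
           sizeof(long long) * CHAR_BIT, sizeof(void *) * CHAR_BIT);
}
```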

Addressing Modes

Every ISA needs to support three basic memory addressing modes to load values from memory:

Immediate: this is where a constant is embedded in the instruction itself; x86-64 uses a $ symbol to indicate a constant. For example, “movl $42, %eax” puts a 32-bit value of decimal 42 into the %eax register.
Direct: this is where an address that is embedded in the instruction is used to fetch a value from memory, or store a value to memory; this is used to access a global variable by name, such as “movl %eax, myVar”, which stores the value in %eax to the myVar variable.
Indirect: this is where a register contains the address to use to access memory to fetch/store a value; this is used for array indexing, local variable access (on the stack), and argument access (after arguments are copied onto the stack). For example, the %rbp (base/frame pointer) register contains the address of the stack frame, and so the instruction “movl %eax, -32(%rbp)” uses indirect addressing to store the value of %eax into memory at address (%rbp-32). x86-64 and many other ISAs allow a small constant value in front of the register name to act as an offset; 0 is the most common value, but for local variables and arguments on the stack, where we know their offset from the frame pointer (%rbp register), we will often use something like “-40(%rbp)” to indicate a particular variable.
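When generating code, each of the three modes is just a different way of printing an operand. The sketch below shows one possible set of helpers that write the operand text into a caller-supplied buffer; the function names are hypothetical and not part of any library.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Immediate operand: a constant written with a leading '$', e.g. $42. */
int immediateOperand(char *buf, size_t size, int value)
{
    return snprintf(buf, size, "$%d", value);
}

/* Direct operand: the variable's label stands for its address, e.g. myVar. */
int directOperand(char *buf, size_t size, const char *label)
{
    assert(label != NULL && strlen(label) > 0);
    return snprintf(buf, size, "%s", label);
}

/* Indirect operand: offset(register), e.g. -32(%rbp). */
int indirectOperand(char *buf, size_t size, int offset, const char *reg)
{
    assert(reg != NULL && reg[0] == '%');
    return snprintf(buf, size, "%d(%s)", offset, reg);
}
```

A caller would then print a whole instruction with something like fprintf(out, "\tmovl\t%%eax, %s\n", buf).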

Register Names and Calling Convention

In addition to the legacy alpha-named registers, x86-64 added some numbered registers, and so now there are 16 64-bit registers: %rax, %rbx, %rcx, %rdx, %rdi, %rsi, %rbp, %rsp, and %r8–%r15. While they can all be used as general-purpose registers, many have very specific purposes. For example, %rsp is the stack pointer and should never be used for anything else; %rbp is generally used as a call frame pointer and not for anything else; etc. The alpha-named registers all have a %eXX 32-bit equivalent; the numbered registers instead use a “d” suffix for their 32-bit form (%r8d through %r15d).

Category	Registers	Description
Arguments	%rdi, %rsi, %rdx, %rcx, %r8, %r9	First 6 args passed in these registers (in order), the rest are pushed on the stack
Caller-saved (scratch)	%rax, %rdi, %rsi, %rdx, %rcx, %r8–%r11	Free to use in functions, caller must save if needed
Callee-saved (preserved)	%rbx, %rsp, %rbp, %r12–%r15	Function must save before using
Return Value	%rax, %rdx	Function return value (up to 128 bits)
Stack Pointer	%rsp	Contains address of top of stack (top item on stack)
Base Pointer	%rbp	Used by function as stack frame access pointer
Instruction Pointer	%rip	Contains address of next instruction to execute

Basic Assembly Program Format

We can take the simple C program:

#include <stdio.h>

int main(int argc, char** argv)
{
  printf("hello world\n");
  return 0;
}

and use “gcc -S” on it to generate the x86-64 assembly code:

	.file	"hello.c"
	.text
	.section	.rodata
.LC0:
	.string	"hello world"
	.text
	.globl	main
	.type	main, @function
main:
.LFB0:
	.cfi_startproc
	endbr64
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset 6, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register 6
	subq	$16, %rsp
	movl	%edi, -4(%rbp)
	movq	%rsi, -16(%rbp)
	leaq	.LC0(%rip), %rdi
	call	puts@PLT
	movl	$0, %eax
	leave
	.cfi_def_cfa 7, 8
	ret
	.cfi_endproc
.LFE0:
	.size	main, .-main
	.ident	"GCC: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0"
	.section	.note.GNU-stack,"",@progbits

but this has a lot of compiler cruft that is not essential. We can simplify it to be (with descriptive comments added):

	.section	.rodata            # Read-only data section (string constants)
.LC0:                          # Label for our string constant on next line
	.string	"hello world"      # Directive to put string (w/ '\0') into memory

	.text                       # Machine instructions (code) is "text" section
	.globl	main               # Declare label (symbol) "main" to be global
main:                          # All functions begin with their name's label
	pushq	%rbp                  # Save current base pointer on stack
	movq	%rsp, %rbp            # Copy stack pointer to base pointer
	subq	$16, %rsp             # Create 16 bytes of space on stack
	movl	%edi, -4(%rbp)        # Save %edi (argc param) on stack (4 bytes)
	movq	%rsi, -16(%rbp)       # Save %rsi (argv param) on stack (8 bytes)
	leaq	.LC0(%rip), %rdi      # Put string address into first argument reg
	call	puts@PLT              # Call "puts" function (aka printf w/ 1 arg)
	movl	$0, %eax              # Put main's return value into %eax
	leave                       # Restore stack to be ready to leave function
	ret                         # Return to caller (since main, it ends program)

The Stack and Stack Operations

“The Stack” refers to the hardware-supported function call stack, where each function call creates an activation record (also known as a stack frame or call frame) on the stack, which contains information necessary for the function call to operate. This information generally includes: argument values (if not passed in registers), the return address, callee-saved register values, and local variables. Not every function call will have all of these, but every function call will at least have a return address, which is the place in the calling function to return to when the function call ends.

The CPU supports the stack with the %rsp register, known as the stack pointer. This register contains the memory address of the top item on the stack, and the stack grows “downward” towards lower memory addresses. So subtracting 64 from the stack pointer is equivalent to opening up 64 bytes of usable memory space on the stack! Indeed, this is exactly how space for local variables is created!

In x86-64, usually the first two instructions in a function save the current %rbp on the stack (“pushq %rbp”) and then copy the stack pointer to %rbp (“movq %rsp, %rbp”). This makes %rbp a frame pointer for the current function call. All references to local variables and to arguments in memory are made using indirect addressing with %rbp, not %rsp. Since %rbp is initialized before space is created on the stack (by subtraction), all local variable space is at a negative offset from %rbp. So the memory address of a local variable looks something like “-16(%rbp)”, which is assembly syntax for subtracting 16 from the address value in %rbp, and using the result in indirect addressing.

Defining a Function

A function must be defined in the .text section, but you do not need to repeat this directive each time a function is defined. A function has a prologue, which is some constant instructions that set up the function; a body, which is the instructions that actually compute the statements that were written; and an epilogue, which is some constant instructions that conclude the function and return to the caller. A compiler can just printf() the constant parts as they are, although the function name must be printed in the correct parts of the prologue. A template is below.

	.globl	funcname           # prologue: declare funcname as global
funcname:                      # prologue: label at start of function code
	pushq	%rbp                  # prologue: save %rbp on stack
	movq	%rsp, %rbp            # prologue: copy stack pointer to frame pointer
	subq	$128, %rsp            # prologue: open up 128 bytes of space on stack
	movq	%rbx, -8(%rbp)        # prologue: save current %rbx
	movq	%rdi, -16(%rbp)       # prologue: save 1st arg reg onto stack
	movq	%rsi, -24(%rbp)       # prologue: save 2nd arg reg onto stack
	movq	%rdx, -32(%rbp)       # prologue: save 3rd arg reg onto stack
	movq	%rcx, -40(%rbp)       # prologue: save 4th arg reg onto stack
	# Body instructions are here (possibly very many!)
	movq	-8(%rbp), %rbx        # epilogue: restore %rbx from stack
	leave                       # epilogue: restore stack and base pointers
	ret                         # epilogue: return to caller

The code above assumes that there are at most four arguments to save on the stack, and that the registers %r12–%r15 are not used (and thus do not need to be saved and restored). In CS 370, once we copy the argument registers onto the stack, we will access the argument values from the stack, and ignore the argument registers. So if we need to use, say, an integer argument 3 in an expression, we can do “movl -32(%rbp), %eax” to put the argument value in our expression value register.
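Since the prologue and epilogue are constant apart from the function name, generating them can be a pair of small formatting functions. The sketch below writes the template text into a caller-supplied buffer; formatPrologue and formatEpilogue are hypothetical names, and a real compiler could just as well fprintf() the same lines straight to its output file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Write the template prologue for funcName into buf; returns its length. */
int formatPrologue(char *buf, size_t size, const char *funcName)
{
    assert(buf != NULL && funcName != NULL && strlen(funcName) > 0);
    return snprintf(buf, size,
        "\t.globl\t%s\n"
        "%s:\n"
        "\tpushq\t%%rbp\n"
        "\tmovq\t%%rsp, %%rbp\n"
        "\tsubq\t$128, %%rsp\n"
        "\tmovq\t%%rbx, -8(%%rbp)\n"
        "\tmovq\t%%rdi, -16(%%rbp)\n"
        "\tmovq\t%%rsi, -24(%%rbp)\n"
        "\tmovq\t%%rdx, -32(%%rbp)\n"
        "\tmovq\t%%rcx, -40(%%rbp)\n",
        funcName, funcName);
}

/* Write the template epilogue into buf; returns its length. */
int formatEpilogue(char *buf, size_t size)
{
    assert(buf != NULL);
    return snprintf(buf, size,
        "\tmovq\t-8(%%rbp), %%rbx\n"
        "\tleave\n"
        "\tret\n");
}
```

Note that every % that should appear in the assembly has to be written as %% inside a printf-style format string.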

Function Calls

In old 32-bit x86, all function arguments were passed on the stack; this is drastically changed in x86-64. In x86-64, the first six function arguments are passed in the registers %rdi, %rsi, %rdx, %rcx, %r8, and %r9, in order. If a function has more than six arguments, the rest are passed on the stack.

The function return value is passed in %rax (and %rdx if more bits needed).

In assembly, a function call is: a series of mov instructions to place the argument values in the correct argument registers; then a call instruction to call the function. In the example above our program had this:

   printf("hello world\n");

and in assembly it was this:

	leaq	.LC0(%rip), %rdi
	call	puts@PLT

The compiler gcc is smart enough to know that when printf() is used only with a string constant, it is more efficient to just use the library function puts() instead. There is only one argument, a char* value (character pointer), which is a 64-bit address formed from the string label acting as an offset to %rip.

“leaq” stands for “load effective address quad” (quad==64bit); it loads the address formed by the indirect addressing into %rdi. It does not use the address to access memory but rather just computes the actual (effective) address. This is used in position-independent code. If we compiled without the “-fpie” option, this instruction would be replaced with “movl $.LC0, %edi”, which uses the label as a constant 32-bit address. In our CS370 compiler projects, we use the “leaq” form.

The “@PLT” extension to the function name is used in position-independent code; PLT stands for Procedure Linkage Table. We can also add this to our own function names and everything will work.

As the register table above shows, there is really no pattern to the register names that are used as argument registers. But in a compiler you must sequentially use them correctly and place argument values in them. Here’s a way to do this: 1) create a global integer variable argNum or something like that; this will keep the argument position number; 2) create a global array of strings, perhaps argRegStr with the register names as strings, in their proper position; 3) each time you generate argument code, refer to “argRegStr[argNum]” to get the correct register name, and then increment argNum; 4) at the end of the function call generation code, reset argNum back to 0; see code below.

int argNum=0;
char *argRegStr[] = {"%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"};
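Building on those two globals, the bookkeeping steps might be wrapped up as below. This is a sketch under a few assumptions: nextArgReg, resetArgRegs, and formatArgMove are hypothetical names, only the six register arguments are handled, and formatArgMove assumes the argument's value has already been computed into %rax.

```c
#include <assert.h>
#include <stdio.h>

int argNum = 0;
const char *argRegStr[] = {"%rdi", "%rsi", "%rdx", "%rcx", "%r8", "%r9"};

/* Return the register for the next argument and advance argNum. */
const char *nextArgReg(void)
{
    assert(argNum >= 0 && argNum < 6);   /* register arguments only */
    return argRegStr[argNum++];
}

/* Call this once the call instruction itself has been generated. */
void resetArgRegs(void)
{
    argNum = 0;
}

/* Format a move of the value in %rax into the next argument register
   (assumption: the argument expression left its value in %rax). */
int formatArgMove(char *buf, size_t size)
{
    return snprintf(buf, size, "\tmovq\t%%rax, %s\n", nextArgReg());
}
```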

Expressions

Expressions can result in very complicated code from real, optimizing compilers. In our compiler project, we are going to keep it very simple. Our main overall rule will be: every integer-valued expression will leave its result in the %eax register. This includes subexpressions of larger expressions. There are essentially three types of expressions: a numeric constant, a variable use, and a binary operator.

Numeric Constant: simply load the constant into %eax, e.g., “movl $42, %eax”
Variable Use: load the variable’s value into %eax; for a global variable, this looks like “movl myVar, %eax”; for a local variable or parameter, it looks like “movl -24(%rbp), %eax”, where “-24” is just an example offset.
Binary Operator: for an operator, we have to 1) evaluate the left-hand subexpression; 2) evaluate the right-hand subexpression; and 3) apply the operator to the two resulting values. To do this, we push the resulting left-hand value onto the stack, then evaluate the right-hand side, then pop the left-hand value into another register (%ecx), and then apply the operator. For example, the expression “42+x” (where x is a global variable) would result in the code:

      movl    $42, %eax      # code for left-hand subexpression
      pushq   %rax           # save its value on the stack
      movl    x(%rip), %eax  # code for right-hand subexpression
      popq    %rcx           # pop left-hand value into %ecx
      addl    %ecx, %eax     # perform the add operation, result in %eax

The use of the stack allows recursive sub-expressions to be handled, so no matter how complicated the entire expression is, it will all work! It is terribly inefficient, and real compilers do much better, but it works! Also note that in 64-bit mode push and pop do not accept the 32-bit “%e” registers, so we push and pop the 64-bit “%r” registers; the “%e” registers are just the lower 32 bits, so it all works just fine.
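The push/evaluate/pop scheme maps directly onto a recursive code generator. Here is a minimal sketch of one, assuming a made-up expression tree type (Expr) with only constants, global variables, and addition; it appends the generated assembly to a caller-supplied buffer. None of these names come from the course materials.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical expression tree: a constant, a global variable, or an add. */
typedef enum { EXPR_CONST, EXPR_VAR, EXPR_ADD } ExprKind;

typedef struct Expr {
    ExprKind kind;
    int value;                 /* EXPR_CONST: the constant           */
    const char *name;          /* EXPR_VAR: label of a global int    */
    const struct Expr *left;   /* EXPR_ADD: left-hand subexpression  */
    const struct Expr *right;  /* EXPR_ADD: right-hand subexpression */
} Expr;

/* Append formatted text to buf at position pos; returns the new position. */
static size_t appendf(char *buf, size_t size, size_t pos, const char *fmt, ...)
{
    va_list ap;
    int n;

    assert(pos < size);
    va_start(ap, fmt);
    n = vsnprintf(buf + pos, size - pos, fmt, ap);
    va_end(ap);
    assert(n >= 0 && (size_t)n < size - pos);   /* generated text must fit */
    return pos + (size_t)n;
}

/* Generate code that leaves the value of e in %eax. */
size_t genExpr(char *buf, size_t size, size_t pos, const Expr *e)
{
    assert(e != NULL);
    switch (e->kind) {
    case EXPR_CONST:
        pos = appendf(buf, size, pos, "\tmovl\t$%d, %%eax\n", e->value);
        break;
    case EXPR_VAR:
        assert(e->name != NULL && strlen(e->name) > 0);
        pos = appendf(buf, size, pos, "\tmovl\t%s(%%rip), %%eax\n", e->name);
        break;
    case EXPR_ADD:
        pos = genExpr(buf, size, pos, e->left);              /* left -> %eax  */
        pos = appendf(buf, size, pos, "\tpushq\t%%rax\n");   /* save left     */
        pos = genExpr(buf, size, pos, e->right);             /* right -> %eax */
        pos = appendf(buf, size, pos, "\tpopq\t%%rcx\n");    /* left -> %ecx  */
        pos = appendf(buf, size, pos, "\taddl\t%%ecx, %%eax\n");
        break;
    }
    return pos;
}
```

Because each subexpression's code is generated by the same function, nesting to any depth needs no extra logic; the hardware stack holds the pending left-hand values.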

Simple Variable Assignments and Uses

Simple integer global variables will be declared with a .comm directive, and should be in the .data section, like this:

      .data
      .comm   myVar, 4, 4
      .comm   myArr, 52, 32

The directive declares the symbol as “common”, or globally available, and associates it with the address of the space being reserved. The first number is the number of bytes of memory to reserve, and the second number is the alignment in memory of those bytes. Alignments must be powers of 2, and higher values allow more efficient memory access if we are going to access multiple things. So the single 32-bit (4 byte) variable above has an alignment of just 4, while the array block (52 bytes, or 13 4-byte integer values) has an alignment of 32.

Using a global simple variable is as simple as using mov instructions to either read the variable (move the value from memory to a register) or write a new value (move a value in a register to the variable’s memory). These look like:

      movl   myVar(%rip), %eax     # read current value of variable
      movl   %eax, myVar(%rip)     # store a new value into variable

Indirect addressing using the %rip register is used for position-independent code.

Conditionals and Loops

The typical way to generate an if-then-else construct is as such:

        /* ASM code for relational expressions */
        cmpl    /* two registers */
        jle     LL101       /* pretending we have > comparison */
        /* code for IF block */
        jmp     LL102       /* must jump around else block */
LL101:
        /* code for ELSE block */
LL102:
        /* program continues here */

However, this requires you to invert the conditional jump instruction (i.e., if the relational operator used in the if condition was “>”, then you would have to generate a “<=” conditional jump, because you have to jump to the else case). There’s no law, of course, requiring this order; we can put the else block before the if block, and then do our conditional jump to the if block. This way the conditional jump is the same as the relational operator used in the if condition. So our code can look like this:

        /* ASM code for relational expressions */
        cmpl    /* two registers */
        jg      LL101        /* pretending we have > comparison */
        /* code for ELSE block */
        jmp     LL102        /* must jump around else block */
LL101:
        /* code for IF block */
LL102:
        /* program continues here */

You are free to do it either way. Real C compilers do the first (they put the if block on top of the else block). I would suggest using the second.
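Whichever layout you choose, the compiler needs a mapping from the relational operator to a jump mnemonic, either the direct one (second layout) or the inverted one (first layout). A small lookup table is one way to do it. The sketch below uses the signed-comparison jumps and assumes the comparison was generated as “cmpl right, left”; the names JumpEntry, jumpTable, and jumpFor are made up for illustration.

```c
#include <assert.h>
#include <string.h>

/* Relational operator, the jump taken when the condition is true,
   and the inverted jump taken when it is false. */
typedef struct {
    const char *op;
    const char *jump;
    const char *inverted;
} JumpEntry;

static const JumpEntry jumpTable[] = {
    {"<",  "jl",  "jge"},
    {"<=", "jle", "jg"},
    {">",  "jg",  "jle"},
    {">=", "jge", "jl"},
    {"==", "je",  "jne"},
    {"!=", "jne", "je"},
};

/* Look up the jump for op; pass invert != 0 to get the inverted jump. */
const char *jumpFor(const char *op, int invert)
{
    size_t i;

    assert(op != NULL);
    for (i = 0; i < sizeof jumpTable / sizeof jumpTable[0]; i++) {
        if (strcmp(jumpTable[i].op, op) == 0)
            return invert ? jumpTable[i].inverted : jumpTable[i].jump;
    }
    return NULL;   /* unknown operator */
}
```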

NOTE: The labels need auto-generated label IDs. For this, and for your while loop construct, you will need a function getUniqueLabelID() that returns a unique number each time it is called (just create a static local variable as a counter, and increment it each time). Before you start fprintf-ing the code lines, call this twice to get the two unique label ID numbers.
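A minimal sketch of getUniqueLabelID(), together with one way it could be used for the second if-then-else layout, is below. The starting value of 100 is arbitrary, and formatIfElse is a hypothetical helper that assumes the comparison instruction has already been generated and that the if and else blocks arrive as ready-made assembly text.

```c
#include <assert.h>
#include <stdio.h>

/* Returns a different number on every call. */
int getUniqueLabelID(void)
{
    static int nextID = 100;   /* arbitrary starting value */
    return nextID++;
}

/* Format the jump/label skeleton with the else block first, so that
   condJump is the same as the relational operator in the condition. */
int formatIfElse(char *buf, size_t size, const char *condJump,
                 const char *ifCode, const char *elseCode)
{
    int ifLabel = getUniqueLabelID();
    int endLabel = getUniqueLabelID();

    assert(ifLabel != endLabel);
    return snprintf(buf, size,
        "\t%s\tLL%d\n"     /* conditional jump to the IF block */
        "%s"               /* code for ELSE block              */
        "\tjmp\tLL%d\n"    /* jump around the IF block         */
        "LL%d:\n"
        "%s"               /* code for IF block                */
        "LL%d:\n",
        condJump, ifLabel, elseCode, endLabel, ifLabel, ifCode, endLabel);
}
```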

For loops, most compilers put the condition check at the bottom of the loop body, not on top! To do this for a while loop, we then need an initial jump-to-condition-check before the loop begins (and outside of the loop body). This structure gives the loop one fewer instruction inside it, because the unconditional jump back to the top is no longer needed. So the while loop code looks like this:

        jmp   LL102      /* jump to condition check */
LL101:  /* label at top of loop body */
        /* loop body code */
LL102:
        /* ASM code for relational expressions */
        cmpl    /* two registers */
        jg      LL101      /* pretending we have > comparison */
        /* program continues here */

Note that the loop ends when the condition jump does not jump, and just falls through to the rest of the program.
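The same skeleton can be produced by a small formatting function. This is a sketch only: formatWhile is a hypothetical name, the two label IDs are assumed to come from your unique-label function, and the body and condition code are assumed to be already-generated assembly text.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Format a while loop with the condition check at the bottom. */
int formatWhile(char *buf, size_t size, int topLabel, int condLabel,
                const char *bodyCode, const char *condCode,
                const char *condJump)
{
    assert(topLabel != condLabel);
    assert(condJump != NULL && strlen(condJump) > 0);
    return snprintf(buf, size,
        "\tjmp\tLL%d\n"    /* initial jump to the condition check */
        "LL%d:\n"          /* top of loop body                    */
        "%s"               /* loop body code                      */
        "LL%d:\n"          /* condition check                     */
        "%s"               /* relational expressions and cmpl     */
        "\t%s\tLL%d\n",    /* jump back to the top while true     */
        condLabel, topLabel, bodyCode, condLabel, condCode,
        condJump, topLabel);
}
```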

Complex Conditions (Logical AND/OR)

Recall that the logical operators AND and OR, often && and || in programming languages, are, in most languages, short-circuited operators. This means that if the left-hand side of the operator already determines the operator’s outcome, the right-hand side is not evaluated (and indeed is not allowed to be evaluated).

For an AND operator, if the left-hand side is false, the whole expression must be false; for an OR operator, if the left-hand side is true, the whole expression must be true; in these cases, the right-hand side is skipped.

This is implemented in assembly language with control flow: there is no instruction that encodes the logical operator, so it is all done using conditional branches. How is this done? The left-hand side subexpression is evaluated, and then a conditional branch either branches to the short-circuit case or falls through (does not branch) to the right-hand side evaluation.

Below is an example:

	# assumes we have variables x, y, z
	# if (x < 42 && y == 7) then {
	# z = 10; } else { z = 20; }
	movl	x, %eax
	movl	$42, %ecx
	cmpl	%ecx, %eax	# compare x to 42 (flags from x - 42)
	jge	else		# short-circuit branch: x >= 42
	movl	y, %eax
	movl	$7, %ecx
	cmpl	%ecx, %eax	# compare y to 7
	jne	else		# y != 7
ifpart:
	movl	$10, %eax
	movl	%eax, z
	jmp	endif		# must skip the else part
else:
	movl	$20, %eax	# $ makes 20 a constant, not a memory address
	movl	%eax, z
endif:
	# done with if-else
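The same branching structure can be written in C with goto, which makes it easy to confirm that the right-hand side really is skipped. This is only an illustration; the counter rightSideEvaluations and the function names are made up, and the right-hand test is wrapped in a function just so its evaluations can be counted.

```c
#include <assert.h>

int rightSideEvaluations = 0;   /* how many times the right side has run */

/* Right-hand subexpression (y == 7), counting each evaluation. */
static int rightSide(int y)
{
    rightSideEvaluations++;
    return y == 7;
}

/* if (x < 42 && y == 7) z = 10; else z = 20;
   written the way the assembly does it: one branch per subexpression. */
int shortCircuitAnd(int x, int y)
{
    int z;

    if (!(x < 42)) goto elsePart;       /* short-circuit branch */
    if (!rightSide(y)) goto elsePart;
    z = 10;
    goto endIf;                         /* must skip the else part */
elsePart:
    z = 20;
endIf:
    assert(z == 10 || z == 20);
    return z;
}
```

For an OR operator the first branch would instead jump straight to the if part when the left-hand side is true.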

Local Variables and Arguments

In x86-64 arguments (actual parameter values when a function is called) are placed in the argument registers as described above (if more than 6 are needed, the stack is used). But as soon as our function has to make a function call inside itself, we need these same argument registers!

So, for non-leaf functions, we must save our own arguments somewhere else so that we can use the argument registers. Where to save them? On the stack, of course! So we must make room for our arguments and save them on the stack. We use a mov instruction (movq in the template, which saves the full 64-bit register) to store the register value into memory on the stack.

Local variables must be on the stack, too. This is for two reasons. One, local variables should not take up space when a function is not being used. Two, if a function is recursive, each invocation must have its own copies of the local variables. The stack is the natural place to do this.

As explained near the top of this page in the Addressing Modes section, accessing values on the stack is done using indirect addressing. Because expressions might use the stack inside the function body, we need another register to hold an address that is in a fixed place on the stack, so that we have a consistent and fixed reference point for our local variables and arguments. We use the %rbp (frame/base pointer) register for this; at the start of the function, we just copy (move) the stack pointer into the frame pointer, and then leave it this way until we leave the function. But the %rbp register needs to be saved first because it is a callee-saved (preserved) register. In the example below we also save %rbx just in case we need an extra register to use (e.g., assigning to an array element involves at least three values: the assigned value, the index value, and the array starting address).

Below is an example:

# Function myFunc (int arg1, int arg2)
# - and with two local vars: int x, int y
	.globl	myFunc             # prologue: declare funcname as global
myFunc:                        # prologue: label at start of function code
	pushq	%rbp                  # prologue: save %rbp on stack
	movq	%rsp, %rbp            # prologue: copy stack pointer to frame pointer
	subq	$128, %rsp            # prologue: open up 128 bytes of space on stack
	movq	%rbx, -8(%rbp)        # prologue: save current %rbx
	movq	%rdi, -16(%rbp)       # prologue: save 1st arg reg onto stack
	movq	%rsi, -24(%rbp)       # prologue: save 2nd arg reg onto stack
	movq	%rdx, -32(%rbp)       # prologue: save 3rd arg reg onto stack
	movq	%rcx, -40(%rbp)       # prologue: save 4th arg reg onto stack
	# function body instructions are here (possibly very many!)
	# below are example reads and writes 
	movl  -16(%rbp), %eax       # read first int arg value into %eax
	movl  -44(%rbp), %eax       # read first local variable spot into %eax
	movl  %eax, -44(%rbp)       # store new value into first local variable
	# function body end
	movq	-8(%rbp), %rbx        # epilogue: restore %rbx from stack
	leave                       # epilogue: restore stack and base pointers
	ret                         # epilogue: return to caller

The example above creates extra space on the stack to hold 128 bytes. This is what I would recommend doing in your compiler, instead of calculating the exact space, and then using a constant starting position for local variables. In our assignments, you can just subtract 128 from the stack pointer; this gives you enough room for 32 4-byte values. The space nearest %rbp holds the saved %rbx and (up to) four argument values in 8-byte slots, and the rest is for local variables. So your first argument will be at “-16(%rbp)”, your second argument will be at “-24(%rbp)”, your fourth argument will be at “-40(%rbp)”, and your first local variable will always be at “-44(%rbp)”. Your second local variable will be at “-48(%rbp)”, and so on. We will never test with so many local variables that you need more than 128 bytes of stack space.
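With a fixed frame like this, the offset of any argument or local variable is simple arithmetic. The sketch below follows the layout used in the example function (8-byte argument slots starting at -16, 4-byte local slots starting at -44); the constant and function names are made up for illustration.

```c
#include <assert.h>

enum {
    FRAME_SIZE = 128,          /* bytes subtracted from %rsp    */
    FIRST_ARG_OFFSET = -16,    /* first saved argument register */
    ARG_SLOT = 8,              /* each argument saved with movq */
    FIRST_LOCAL_OFFSET = -44,  /* first local variable          */
    LOCAL_SLOT = 4             /* each local is a 32-bit int    */
};

/* Offset from %rbp of saved argument n (0-based, at most four). */
int argOffset(int n)
{
    assert(n >= 0 && n < 4);
    return FIRST_ARG_OFFSET - ARG_SLOT * n;
}

/* Offset from %rbp of local variable n (0-based). */
int localOffset(int n)
{
    int offset = FIRST_LOCAL_OFFSET - LOCAL_SLOT * n;

    assert(n >= 0 && offset >= -FRAME_SIZE);   /* must stay inside the frame */
    return offset;
}
```

The resulting number is what gets printed in front of “(%rbp)” in each load or store.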

Arrays

The basic operation for accessing arrays is that we must use the array index to calculate the address offset from the beginning of the array, and then add this to the beginning address of the array; this will give us the address of the array element we want to access, and then we use indirect addressing to access it (load or store).

Suppose that we have a global integer array named vals, and a local integer variable i that is stored on the stack at -12(%rbp). Recall that integers are 4 bytes. Now suppose we do vals[i] = 42; The gcc compiler with “-fpie” will produce:

	movl	-12(%rbp), %eax       # load value of i into %eax
	cltq                        # sign extend to 64 bits
	leaq	0(,%rax,4), %rdx      # multiply by 4 and leave in %rdx
	leaq	vals(%rip), %rax      # load starting address of vals into %rax
	movl	$42, (%rdx,%rax)      # store 42 into address (%rdx + %rax)

The code above works but does not take full advantage of the x86-64 indirect addressing mode, which allows three values inside the parentheses: two registers and a constant scale factor (1, 2, 4, or 8). Indirect addressing multiplies the second register by this constant and adds the result to the first register. We can use this for the size of each array element (4 bytes for an integer array). So:

	movl	-12(%rbp), %eax       # load value of i into %eax
	cltq                        # sign extend to 64 bits
	leaq	vals(%rip), %rdx      # load starting address of vals into %rdx
	movl	$42, (%rdx,%rax,4)    # store 42 into address (%rdx + (%rax*4))

is equivalent code. The compiler might use the first form because maybe it knows that it is actually faster (CPUs can be weird like that!). In CS 370 we can use this second form since it is easy to generate.

Also note that when assigning to an array element in an assignment statement, you also have the value of the right-hand-side to manage. After its expression is evaluated, the value is in %eax; you will need to move and manage it so that you don’t overwrite it while preparing the array element addressing.
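To double-check the address arithmetic behind an operand like “(%rdx,%rax,4)”, the computation can be mirrored in C and compared against ordinary array indexing. This is just a sketch; scaledAddress is a hypothetical helper, not something the assembler or any library provides.

```c
#include <assert.h>
#include <stdint.h>

/* Address computed by the operand (base,index,scale): base + index*scale. */
uintptr_t scaledAddress(uintptr_t base, int64_t index, int scale)
{
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + (uintptr_t)(index * scale);
}
```

For an int array vals, scaledAddress((uintptr_t)vals, i, 4) gives the same address as &vals[i], which is why a scale of 4 is used for 4-byte integers.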