Last updated September 1, 2017
Originally published December 16, 2005
Table of Contents
This document describes the syntax for the SLEIGH processor specification language, which was developed for the GHIDRA project. The language that is now called SLEIGH has undergone several redesign iterations, but it can still trace its heritage from the language SLED, from whom its name is derived. SLED, the “Specification Language for Encoding and Decoding”, was defined by Norman Ramsey and Mary F. Fernandez as a concise way to define the translation, in both directions, between machine instructions and their corresponding assembly statements. This facilitated the development of architecture independent disassemblers and assemblers, such as the New Jersey Machine-code Toolkit.
The direct predecessor of SLEIGH was an implementation of SLED for GHIDRA, which concentrated on its reverse-engineering capabilities. The main addition of SLEIGH is the ability to provide semantic descriptions of instructions for data-flow and decompilation analysis. This piece of SLEIGH was originally a separate language, the Semantic Syntax Language (SSL), very loosely based on concepts and a language of the same name developed by Cristina Cifuentes, Mike Van Emmerik and Norman Ramsey, for the University of Queensland Binary Translator (UQBT) project.
SLEIGH is a language for describing the instruction sets of general purpose microprocessors, in order to facilitate the reverse engineering of software written for them. SLEIGH was designed for the GHIDRA reverse engineering platform and is used to describe microprocessors with enough detail to facilitate two major components of GHIDRA, the disassembly and decompilation engines. For disassembly, SLEIGH allows a concise description of the translation from the bit encoding of machine instructions to human-readable assembly language statements. Moreover, it does this with enough detail to allow the disassembly engine to break apart the statement into the mnemonic, operands, sub-operands, and associated syntax. For decompilation, SLEIGH describes the translation from machine instructions into p-code. P-code is a Register Transfer Language (RTL), distinct from SLEIGH, designed to specify the semantics of machine instructions. By semantics, we mean the detailed description of how an instruction actually manipulates data, in registers and in RAM. This provides the foundation for the data-flow analysis performed by the decompiler.
A SLEIGH specification typically describes a single microprocessor and is contained in a single file. The term processor will always refer to this target of the specification.
Italics are used when defining terms and for named entities. Bold is used for SLEIGH keywords.
Although p-code is a distinct language from SLEIGH, because a major purpose of SLEIGH is to specify the translation from machine code to p-code, this document serves as a primer for p-code. The key concepts and terminology are presented in this section, and more detail is given in Section 7.7, “The Semantic Section”. There is also a complete set of tables which list syntax and descriptions for p-code operations in the Appendix.
The design criteria for p-code was to have a language that looks much like modern assembly instruction sets but capable of modeling any general purpose processor. Code for different processors can be translated in a straightforward manner into p-code, and then a single suite of analysis software can be used to do data-flow analysis and decompilation. In this way, the analysis software becomes retargetable, and it isn’t necessary to redesign it for each new processor being analyzed. It is only necessary to specify the translation of the processor’s instruction set into p-code.
So the key properties of p-code are
SLEIGH is the language which specifies the translation from a machine instruction to p-code. It specifies both this translation and how to display the instruction as an assembly statement.
A model for a particular processor is built out of three concepts: the address space, the varnode, and the operation. These are generalizations of the computing concepts of RAM, registers, and machine instructions respectively.
An address space for p-code is a generalization of the indexed memory (RAM) that a typical processor has access to, and it is defined simply as an indexed sequence of memory words that can be read and written by p-code. In almost all cases, a word of the space is a byte (8 bits), and we will usually use the term byte instead of word. However, see the discussion of the wordsize attribute of address spaces below.
The defining characteristics of a space are its name and its size. The size of a space indicates the number of distinct indices into the space and is usually given as the number of bytes required to encode an arbitrary index into the space. A space of size 4 requires a 32 bit integer to specify all indices and contains 232 bytes. The index of a byte is usually referred to as the offset, and the offset together with the name of the space is called the address of the byte.
Any manipulation of data that p-code operations perform happens in some address space. This includes the modeling of data stored in RAM but also includes the modeling of processor registers. Registers must be modeled as contiguous sequences of bytes at a specific offset (see the definition of varnodes below), typically in their own distinct address space. In order to facilitate the modeling of many different processors, a SLEIGH specification provides complete control over what address spaces are defined and where registers are located within them.
Typically, a processor can be modeled with only two spaces, a ram address space that represents the main memory accessible to the processor via its data-bus, and a register address space that is used to implement the processor’s registers. However, the specification designer can define as many address spaces as needed.
There is one address space that is automatically defined for a SLEIGH specification. This space is used to allocate temporary storage when the SLEIGH compiler breaks down the expressions describing processor semantics into individual p-code operations. It is called the unique space. There is also a special address space, called the const space, used as a placeholder for constant operands of p-code instructions. For the most part, a SLEIGH specification doesn’t need to be aware of this space, but it can be used in certain situations to force values to be interpreted as constants.
A varnode is the unit of data manipulated by p-code. It is simply a contiguous sequence of bytes in some address space. The two defining characteristics of a varnode are
With the possible exception of constants treated as varnodes, there is never any distinction made between one varnode and another. They can have any size, they can overlap, and any number of them can be defined.
Varnodes by themselves are typeless. An individual p-code operation forces an interpretation on each varnode that it uses, as either an integer, a floating-point number, or a boolean value. In the case of an integer, the varnode is interpreted as having a big endian or little endian encoding, depending on the specification (see Section 4.1, “Endianess Definition”). Certain instructions also distinguish between signed and unsigned interpretations. For a signed integer, the varnode is considered to have a standard twos complement encoding. For a boolean interpretation, the varnode must be a single byte in size. In this special case, the zero encoding of the byte is considered a false value and an encoding of 1 is a true value.
These interpretations only apply to the varnode for a particular operation. A different operation can interpret the same varnode in a different way. Any consistent meaning assigned to a particular varnode must be provided and enforced by the specification designer.
P-code is intended to emulate a target processor by substituting a sequence of p-code operations for each machine instruction. Thus every p-code operation is naturally associated with the address of a specific machine instruction, but there is usually more than one p-code operation associated with a single machine instruction. Except in the case of branching, p-code operations have fall-through control flow, both within and across machine instructions. For a single machine instruction, the associated p-code operations execute from first to last. And if there is no branching, execution picks up with the first operation corresponding to the next machine instruction.
Every p-code operation can take one or more varnodes as input and can optionally have one varnode as output. The operation can only make a change to this output varnode, which is always indicated explicitly. Because of this rule, all manipulation of data is explicit. The operations have no indirect effects. In general, there is absolutely no restriction on what varnodes can be used as inputs and outputs to p-code operations. The only exceptions to this are that constants cannot be used as output varnodes and certain operations impose restrictions on the size of their varnode operands.
The actual operations should be familiar to anyone who has studied general purpose processor instruction sets. They break up into groups.
Table 1. P-code Operations
Operation Category | List of Operations |
Data Moving | COPY, LOAD, STORE |
Arithmetic | INT_ADD, INT_SUB, INT_CARRY, INT_SCARRY, INT_SBORROW,
INT_2COMP, INT_MULT, INT_DIV, INT_SDIV, INT_REM, INT_SREM |
Logical | INT_NEGATE, INT_XOR, INT_AND, INT_OR, INT_LEFT, INT_RIGHT, INT_SRIGHT |
Integer Comparison | INT_EQUAL, INT_NOTEQUAL, INT_SLESS, INT_SLESSEQUAL, INT_LESS, INT_LESSEQUAL |
Boolean | BOOL_NEGATE, BOOL_XOR, BOOL_AND, BOOL_OR |
Floating Point | FLOAT_ADD, FLOAT_SUB, FLOAT_MULT, FLOAT_DIV, FLOAT_NEG,
FLOAT_ABS, FLOAT_SQRT, FLOAT_NAN |
Floating Point Compare | FLOAT_EQUAL, FLOAT_NOTEQUAL, FLOAT_LESS, FLOAT_LESSEQUAL |
Floating Point Conversion | INT2FLOAT, FLOAT2FLOAT, TRUNC, CEIL, FLOOR, ROUND |
Branching | BRANCH, CBRANCH, BRANCHIND, CALL, CALLIND, RETURN |
Extension/Truncation | INT_ZEXT, INT_SEXT, PIECE, SUBPIECE |
Managed Code | CPOOLREF, NEW |
We postpone a full discussion of the individual operations until Section 7.7, “The Semantic Section”.