The (Long) Journey To A Multi-Architecture Disassembler • NorthSec

We will describe the internals of the disassembler engine we built fully in-house to analyze x86/x64, ARM/ARM64 and MIPS executables (among others).

Disassembly is a well-known problem in the reverse-engineering community, but designing and building a disassembler engine able to deal with architectures like MIPS, ARM/ARM64 and x86/x64 at the same time, compiled by classic compilers or custom obfuscators, is a long and difficult road.

While translating individual instructions to their corresponding assembly representations is doable, producing a correct and complete representation of a whole executable is indeed another story. This adventure includes dealing with numerous compilers’ peculiarities, such as switch-case constructions, position-independent code and control-flow optimizations, while struggling with theoretically intractable questions, such as code and data distinction.

In this talk, we would like to dig into the internals of our own disassembler engine, which is part of the JEB reverse-engineering platform. This component produces an assembly-like representation of a whole binary object, in particular for MIPS, ARM/ARM64 and x86/x64 executables, and has been developed fully in-house over the last three years.

During this presentation, we will describe in particular:

the design choices behind our disassembler engine. We will explain how we developed most of the logic in a generic way, while trying to keep architecture-specific parts contained, and how the disassembler employs different strategies depending on the architecture and the identified compiler.
the use of a so-called “advanced” analysis pass, based on a custom intermediate representation (IR), which allows us to compute possible runtime values in the same way on all architectures. We will explain in particular the design of our IR, and the way we translated native instructions to the IR.
the implementation of signatures on machine code, such that classic statically linked libraries are automatically identified. We will dig into the problems that the generation, storage and matching of such signatures brought.
the various techniques and tests we developed to assess the disassembler's correctness.
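
To make the second point more concrete, here is a minimal sketch of what "computing possible runtime values in the same way on all architectures" can look like. It is not JEB's IR or its API: the `IRAssign` statement, the two toy lifters and the `evaluate` function are all hypothetical names invented for this illustration, and the instruction syntax they accept is a tiny subset chosen for readability. The idea shown is only that once x86-style and MIPS-style instructions are translated to one common form, a single value-propagation routine can serve both.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Union

Operand = Union[str, int]  # a register name or an immediate constant


@dataclass(frozen=True)
class IRAssign:
    """Hypothetical IR statement: dst = lhs <op> rhs."""
    dst: str
    op: str            # "mov", "add" or "shl"
    lhs: Operand
    rhs: Operand = 0


def _operand(text: str) -> Operand:
    """Parse an immediate (decimal or 0x-prefixed) or keep a register name."""
    if text == "$zero":          # MIPS register hard-wired to zero
        return 0
    return int(text, 0) if text[0].isdigit() else text


def lift_x86(text: str) -> IRAssign:
    """Lift a toy two-operand x86-style instruction, e.g. 'add eax, 8'."""
    mnemonic, rest = text.split(None, 1)
    dst, src = [part.strip() for part in rest.split(",")]
    if mnemonic == "mov":
        return IRAssign(dst, "mov", _operand(src))
    return IRAssign(dst, mnemonic, dst, _operand(src))


def lift_mips(text: str) -> IRAssign:
    """Lift a toy three-operand MIPS-style instruction, e.g. 'sll $v0, $v0, 2'."""
    mnemonic, rest = text.split(None, 1)
    dst, a, b = [part.strip() for part in rest.split(",")]
    op = {"addiu": "add", "sll": "shl"}[mnemonic]
    return IRAssign(dst, op, _operand(a), _operand(b))


def evaluate(stmts: List[IRAssign]) -> Dict[str, Optional[int]]:
    """Propagate constants through IR statements; None means 'unknown'."""
    env: Dict[str, Optional[int]] = {}

    def value(operand: Operand) -> Optional[int]:
        return operand if isinstance(operand, int) else env.get(operand)

    for stmt in stmts:
        lhs, rhs = value(stmt.lhs), value(stmt.rhs)
        if stmt.op == "mov":
            env[stmt.dst] = lhs
        elif lhs is None or rhs is None:
            env[stmt.dst] = None          # an unknown input gives an unknown result
        elif stmt.op == "add":
            env[stmt.dst] = (lhs + rhs) & 0xFFFFFFFF
        elif stmt.op == "shl":
            env[stmt.dst] = (lhs << rhs) & 0xFFFFFFFF
        else:
            raise ValueError(f"unsupported IR operation: {stmt.op}")
    return env


x86_code = ["mov eax, 0x1000", "shl eax, 2", "add eax, 8"]
mips_code = ["addiu $v0, $zero, 0x1000", "sll $v0, $v0, 2", "addiu $v0, $v0, 8"]

x86_env = evaluate([lift_x86(line) for line in x86_code])
mips_env = evaluate([lift_mips(line) for line in mips_code])
print(hex(x86_env["eax"]), hex(mips_env["$v0"]))
```

Only the lifters know anything about the source architecture; `evaluate` sees the same three operations in both cases. A real analysis would also have to handle memory, flags, branches and sets of possible values rather than single constants, which this sketch deliberately leaves out.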

Finally, dealing with several (quite different) architectures often forced us to reassess our assumptions about what machine code is supposed to look like. Throughout this presentation, we will describe the mistakes and wrong assumptions we made, in the hope that they will be useful to fellow security researchers dealing with machine code.