Accurate Disassembly of Complex Binaries Without Use of Compiler Metadata
Abstract
Accurate disassembly of ==stripped binaries== is the first step in binary analysis, instrumentation and reverse engineering. Complex instruction sets such as the x86 pose major challenges in this context because it is very difficult to distinguish between code and embedded data. To make progress, many recent approaches have either made ==optimistic assumptions== (e.g., absence of embedded data) or relied on ==additional compiler-generated metadata== (e.g., relocation info and/or exception handling metadata). Unfortunately, many complex binaries do contain embedded data, while ==lacking the additional metadata== needed by these techniques. We therefore present a novel approach for accurate disassembly that uses statistical properties of data to detect code, and behavioral properties of code to flag data. We present new static analysis and data-driven probabilistic techniques that are then combined using a prioritized error correction algorithm to achieve results that are 3× to 4× more accurate than the best previous results.
Q: What is metadata?
A: Metadata is data about data. Here it refers to additional information that the compiler or linker embeds in a compiled binary to help tools understand its structure. This can include:
- ==Relocation Information== – Helps adjust memory addresses when the binary is loaded into memory.
- Exception Handling Metadata – Contains details about how exceptions (errors) should be handled at runtime (e.g., DWARF-format unwind information).
- Debug Symbols – Provides function names, variable names, and other useful debugging information.
- Symbol Tables – Contains mappings of function names to memory addresses.
Analysis
Uses of binary analysis and instrumentation:
- binary debloating
- binary optimization
- code similarity detection
- reverse engineering
- security hardening
Two basic techniques:
- linear sweep
- recursive disassembly
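To make the difference between the two techniques concrete, here is a minimal sketch on a made-up toy instruction set. The opcodes, the `decode` helper, and the sample `PROGRAM` are all assumptions for illustration; real x86 decoding is far more involved. The program hides four bytes of data behind an unconditional jump.

```python
# Toy ISA (hypothetical, for illustration only): opcode -> (mnemonic, length)
OPCODES = {
    0x01: ("nop", 1),
    0x02: ("load", 2),  # load imm8
    0x03: ("jmp", 2),   # unconditional jump to absolute addr8
    0x04: ("jz", 2),    # conditional jump to absolute addr8
    0x05: ("ret", 1),
}


def decode(code, addr):
    """Return (mnemonic, length, operand), or None if addr does not decode."""
    if addr >= len(code) or code[addr] not in OPCODES:
        return None
    name, length = OPCODES[code[addr]]
    if addr + length > len(code):
        return None
    operand = code[addr + 1] if length == 2 else None
    return name, length, operand


def linear_sweep(code):
    """Decode byte after byte from offset 0, ignoring control flow."""
    starts, addr = [], 0
    while addr < len(code):
        insn = decode(code, addr)
        if insn is None:
            addr += 1  # skip an undecodable byte and resynchronize
            continue
        starts.append(addr)
        addr += insn[1]
    return starts


def recursive_disassembly(code, entry=0):
    """Follow control flow from the entry point; decode only reachable bytes."""
    seen, worklist = set(), [entry]
    while worklist:
        addr = worklist.pop()
        if addr in seen:
            continue
        insn = decode(code, addr)
        if insn is None:
            continue
        seen.add(addr)
        name, length, operand = insn
        if name in ("jmp", "jz"):
            worklist.append(operand)        # jump target
        if name not in ("jmp", "ret"):
            worklist.append(addr + length)  # fall-through successor
    return sorted(seen)


PROGRAM = bytes([
    0x02, 0x07,              # 0: load 7
    0x03, 0x08,              # 2: jmp 8 (skips the embedded data)
    0x01, 0x01, 0x01, 0x02,  # 4..7: embedded data, not code
    0x01,                    # 8: nop
    0x05,                    # 9: ret
])

print(linear_sweep(PROGRAM))           # [0, 2, 4, 5, 6, 7, 9]
print(recursive_disassembly(PROGRAM))  # [0, 2, 8, 9]
```

On this input the linear sweep reports the data bytes at offsets 4 to 7 as instructions and, because the last data byte swallows the real `nop` at offset 8 as its operand, it also misses a true instruction. The recursive pass avoids both errors here, but it can only find code that is reachable through control flow it can resolve statically; this toy ISA has no indirect jumps, so the sketch does not show that coverage limitation.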
Conventional techniques rely on debugging information or metadata, which applications such as Chrome omit because it is large. They also suffer from a lack of coverage.
Author:
**use properties of data to flag code and properties of code to flag data**
That is, characterize the statistical properties of data rather than code. Flag a byte sequence as code whenever its statistical properties deviate drastically from those of data.
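The idea of flagging code by its deviation from a data model can be sketched as follows. This is only an illustration under loud assumptions: it uses a simple smoothed byte-frequency model, a fixed window, and a hand-picked threshold, none of which come from the paper, whose actual statistical models are not described in these notes. The training samples and function names are hypothetical.

```python
import math
from collections import Counter


def train_byte_model(samples, smoothing=1.0):
    """Estimate P(byte) from known-data samples, with additive smoothing."""
    counts = Counter()
    for sample in samples:
        counts.update(sample)  # iterating bytes yields integers 0..255
    total = sum(counts.values()) + smoothing * 256
    return [(counts.get(b, 0) + smoothing) / total for b in range(256)]


def avg_log_likelihood(model, window):
    """Average log-probability per byte of a window under the data model."""
    return sum(math.log(model[b]) for b in window) / len(window)


def flag_code_candidates(model, blob, window=8, threshold=-5.0):
    """Return start offsets of windows that look unlike data (assumed cutoff)."""
    flagged = []
    for start in range(0, len(blob) - window + 1, window):
        chunk = blob[start:start + window]
        if avg_log_likelihood(model, chunk) < threshold:
            flagged.append(start)
    return flagged


# Hypothetical "data" corpus: a repeated string and some zero padding.
MODEL = train_byte_model([b"hello world\x00" * 4, b"\x00" * 16])

# String bytes, then bytes that happen to be a typical x86-64 function
# prologue (push rbp; mov rbp, rsp; sub rsp, 0x10), then zero padding.
BLOB = b"hello wo" + bytes.fromhex("554889e54883ec10") + b"\x00" * 8

print(flag_code_candidates(MODEL, BLOB))  # [8]
```

Only the middle window scores far below what the data model expects, so it alone is flagged as a code candidate. A real system would need a much richer model of data and a principled threshold; the point here is only the direction of the test, which models data and treats strong deviation as evidence of code.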
Contributions
- Static analysis to flag data
- Statistical analysis to flag code
- Using valid code properties to address non-returning calls and missing jump table bounds
- Conflict resolution using prioritized error correction
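The last contribution combines conflicting code and data hints by priority. These notes do not describe the paper's actual algorithm, so the following is only a generic greedy sketch of priority-based conflict resolution. The `Hint` type, the priority scale, and the rule of rejecting a whole hint on any conflict are all assumptions.

```python
from typing import NamedTuple


class Hint(NamedTuple):
    priority: int  # higher means more trusted (hypothetical scale)
    start: int     # first byte offset covered
    end: int       # one past the last byte offset covered
    label: str     # "code" or "data"


def resolve(hints, size):
    """Apply hints from most to least trusted; drop any hint that
    contradicts a label already fixed by a higher-priority hint."""
    labels = [None] * size
    for hint in sorted(hints, key=lambda h: -h.priority):
        span = range(hint.start, hint.end)
        if any(labels[i] not in (None, hint.label) for i in span):
            continue  # conflicts with a more trusted decision
        for i in span:
            labels[i] = hint.label
    return labels


hints = [
    Hint(priority=1, start=4, end=8, label="data"),
    Hint(priority=3, start=0, end=4, label="code"),
    Hint(priority=2, start=2, end=6, label="data"),  # overlaps the code hint
]
print(resolve(hints, 8))
# ['code', 'code', 'code', 'code', 'data', 'data', 'data', 'data']
```

The priority-2 hint is discarded because it disagrees with the more trusted code hint on bytes 2 and 3, while the lower-priority data hint survives because it conflicts with nothing.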
Challenges
- linear disassembly (e.g., objdump) suffers from a high false positive rate.