We can divide "program analysis" into a few buckets. Call one "static analysis" and another "dynamic analysis". Within "static analysis", we can further distinguish "binary analysis" from "source analysis".
Usually when we differentiate between "binary" and "source" analysis, we say that a source-level analysis benefits from having the original source code, while a binary analysis has only the program image, containing native code that will execute on the target system. Essentially, a source analysis is an analysis of the program's high-level language source code, while a binary analysis is an analysis of a module of compiled, executable, native instructions.
The question I will leave you with to begin is: can a binary analysis system consider program images which were not created by a compiler?
What Do We Mean By Binary?
Usually when people say "binary" they mean "executable container format", such as ELF or PE/COFF. The container could have some or no information about the program contained within, such as imported functions, relocation information, internal symbols, etc.
An important point to consider is that generally a compiler is the producer of ELF or PE/COFF program images. However, the compiler is not the only entity that is capable of creating program images. Since the header formats are well documented (they must be), it is very possible for an individual to create a "binary" program which can be loaded and executed on a general purpose operating system. The tiny ELF and tiny PE projects are clear demonstrations of this.
Binary Analysis Can Be Semantically Unambiguous
One interesting property of binary program analysis stems from the fact that high-level languages leave the behavior of certain cases (integer wraparound, invalid arguments to memory-modifying library functions, etc) "undefined". When a static analysis considers a high-level source program, it has to "fill in the gaps" around the undefined behavior. Some projects aim to use static code analysis to identify programs whose computation relies on undefined behavior, so that their authors can rewrite them.
When considering a compiled application however, the analysis system does not have to decide one way or the other. All of the undefined and implementation specific decisions present in the original high-level program have been made and are represented in the compiled program.
Binary Analysis Can Be More Challenging
One challenge that binary analysis has which source analysis does not is control flow graph discovery/recovery. When given the source code to a well-formed high level application, the control flow graph is a property of the source code. It is very difficult (I wager impossible) to express a self-modifying program in C without using machine-level tricks.
Binary programs are not so constrained. A binary program can be any program that is a sequence of instructions which is valid for the processor it should be evaluated on.
"Binary Analysis" Isn't
Imagine the world of programs. This world represents all programs that an Intel x86 microprocessor could express, given the correct sequence of instructions. Now, imposed on that world, is the set of programs that modern compilers understand how to create. Modern compilers are constrained by instruction sequences that are efficient and by the expressibility of high-level languages.
We can almost think of this as compilers casting a shadow into the world of binary programs. Viewed through this lens, we have divided our universe into programs which come from compilers, and ones which don't.
What use is this, and why would we do it? Well, compilers are machines. They obey certain rules. Frequently, program analysis researchers will take certain rules for granted when analyzing a program:
- Parameters passed on the stack
- 'this' parameter being passed in ecx
- Functions defined by call/ret boundaries
These might seem like wholly sensible truths to take from a binary but consider: are they rules of a "binary" program? Or are they rules of compiler output?
A fun game to play is to look at "binary" analysis systems from industry and "binary" analysis research from universities and spot their assumptions about compiler output vs binary programs. Over the last few months, I've been building a collection of 'rules' that high-level code must obey, but binary programs need not obey. A glimpse into this rulebook:
- High level languages have stacks. Binary programs need not have stacks.
- High level languages have difficulty expressing self-modification. Binary programs can express this very easily.
- High level languages, and compilers, lead to natural call/ret semantics. Binary programs need not use call/ret.
- High level programs use indirect branches in specific circumstances. Binary programs may always use indirect branches.
- High level language runtimes generally have contiguous stacks. A binary program may have split stacks (although some weird high level language runtimes may also implement split stacks).
Consider this when reading about a technique that makes assumptions about a program and its compiler. Could this analysis work if parameters are passed via globals? Can we take this shortcut if we don't know that register states are clobbered at the end of a function? Can we assume that a call indicates the beginning of a new function?
Connection To Weird Machines And Memory Unsafety
Take a well-formed C program that uses no machine-level constructs (inline assembly, writing 0xFEEB into the address of the return address, etc). Ask: can this program be used to express a program that is not compiler output?
Memory corruption errors that allow for code injection can take programs from the space of compiler output and move them into the space of all possible programs. This is probably obvious. It does pose a problem for systems that want to look at this division of the space and limit themselves to only considering "compiler output". To ensure that your program under analysis can only be compiler output, you also have to ensure, or assume, that there are no memory errors in the program that can be used to express non-compiler-generated programs. This might be difficult.
Warnings and Conclusions
In many cases this is fine. When we perform binary analysis for verification or for bug identification, we are generally concerned only with the space of programs that are compiler output. Those are programs that we want verified and bug-free. Weird programs that can be constructed are not of too much concern - if the system does not understand it, then we do not certify it.
However, "compiler output analysis" is, I think, a more honest name for the program analysis task that many perform. An easy way to spot this from a vendor is if they ever say "our binary analysis system doesn't support that compiler". If your system or technique depends on a specific compiler, it is not a binary analysis; it is a compiler output analysis.
Analyzing arbitrary binary programs could be useful as malicious software becomes even more devious. Possibly, with these compiler-specific rules mapped out, we can analyze programs which have no origin in known compilers and which break the rules that known compilers obey.