Code obfuscation and malware detection by abstract interpretation

Dalla Preda, Mila

Non disponibile

An obfuscating transformation aims at confusing a program in order to make it more difficult to understand while preserving its functionality. Software protection and malware detection are two major applications of code obfuscation. Software developers use code obfuscation in order to defend their programs against attacks to the intellectual property, usually called malicious host attacks. In fact, by making the programs more difficult to understand it is possible to obstruct malicious reverse engineering – a typical attack to the intellectual property of programs. On the other side, malware writers usually obfuscate their malicious code in order to avoid detection. In this setting, the ability of code obfuscation to foil most of the existing detection techniques, such as misuse detection algorithms, relies in their purely syntactic nature that makes malware detection sensitive to slight modifications of programs syntax. In the software protection scenario, researchers try to develop sophisticated obfuscating techniques that are able to resist as many attacks as possible. In the malware detection scenario, on the other hand, it is important to design advanced detection algorithms in order to detect as many variants of obfuscated malware as possible. It is clear how both malicious host and malicious code attacks represent harmful threats to the security of computer networks. In this dissertation, we are interested in both security issues described above. In particular, we describe a formal approach to code obfuscation and malware detection based on program semantics and abstract interpretation. This theoretical framework is useful in contrasting some well known drawbacks of software protection through code obfuscation, as well as for improving existing malware detection schemes. In fact, the lack of rigorous theoretical bases for code obfuscation prevents any possibility to formally study and certify their effectiveness in protecting proprietary programs. Moreover, in order to design malware detection schemes that are resilient to obfuscation we have to focus on program semantics rather than on program syntax. Our formal framework for code obfuscation relies on a semantics-based definition of code obfuscation that characterizes each program transformation T as a potential obfuscation in terms of the most concrete property preserved by T on program semantics. Deobfuscating techniques, and reverse engineering in general, usually begin with some sort of static program analysis, which can be specified as an abstraction of program semantics. In the software protection scenario, this observation naturally leads to model attackers as abstractions of program semantics. In fact, the abstraction modeling the attacker expresses the amount of information, namely the semantic properties, that the attacker is able to observe. It follows that, comparing the degree of abstraction of an attacker A with the one of the most concrete property preserved by an obfuscating transformation T, it is possible to understand whether obfuscation T defeats attack A. Following the same reasoning it is possible to compare the efficiency of different obfuscating transformations, as well as the ability of different attackers in contrasting a given obfuscation. We apply our semantics-based framework to a known control code obfuscation technique that aims at confusing the control flow of the original program by inserting opaque predicates. As argued above, an obfuscating transformation modifies a program while preserving an abstraction of its semantics. This means that different obfuscated versions of the same malware have to share (at least) the malicious intent, namely the maliciousness of their semantics, even if they may express it through different syntactic forms. The basic idea of our formal approach to malware detection is to use program semantics to model both malware and program behaviour, and semantic abstractions to hide the details changed by the obfuscation. Thus, given an obfuscation T, we are interested in defining an abstraction of program semantics that does not distinguish between the semantics of malware M and the semantics of its obfuscated version T(M). In particular, we provide this suitable abstraction for an interesting class of commonly used obfuscating transformations. It is clear that, given a malware detector D, it is always possible to define its semantic counterpart by analyzing how D works on program semantics. At this point, by translating both malware detectors and obfuscating transformations in the semantic world, we are able to certify which obfuscations a detector is able to handle. This means that our semanticsbased framework provides a formal setting where malware detectors designers can prove the efficiency of their algorithms.