Obfuscation classification via Machine Learning

In this work we build a machine learning classifier that distinguishes between cleartext and obfuscated code. Starting with JavaScript, we extend our techniques to Python and PHP.

Client-side protection is one of the key pillars of Imperva’s quest to protect its customers from attackers. Obfuscation is one of the most common methods of hiding malicious code. Being able to distinguish cleartext JavaScript documents from obfuscated ones is a first but crucial step in this endeavor.

In this work we first survey the variety of methods and techniques used to obfuscate JavaScript code. We analyze 10+ open-source JavaScript obfuscators and show their similarities and differences. For example, all obfuscators employ variable renaming, but the output distributions differ across obfuscators (e.g., in terms of the lengths of the renamed variables).
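To make the renamed-variable example concrete, here is a minimal sketch of how identifier-length statistics could be computed from a JavaScript source string. It is an illustration only, not the feature extractor used in this work: the function name `identifier_length_stats`, the regular expression, and the abbreviated keyword list are all assumptions, and a real pipeline would more likely use a proper JavaScript tokenizer so that string literals and comments are excluded.

```python
import re
import statistics

# Crude identifier pattern (assumption): a name that does not start in the
# middle of another token, so "0x1f" does not yield the identifier "x1f".
IDENT_RE = re.compile(r"(?<![A-Za-z0-9_$])[A-Za-z_$][A-Za-z0-9_$]*")

# Abbreviated keyword list (assumption); a real extractor would use the full set.
JS_KEYWORDS = {
    "var", "let", "const", "function", "return", "if", "else", "for",
    "while", "new", "this", "typeof", "true", "false", "null",
}


def identifier_length_stats(source):
    """Return count, mean and population std. dev. of identifier lengths."""
    names = [t for t in IDENT_RE.findall(source) if t not in JS_KEYWORDS]
    if not names:
        return {"count": 0, "mean_len": 0.0, "std_len": 0.0}
    lengths = [len(name) for name in names]
    return {
        "count": len(lengths),
        "mean_len": statistics.mean(lengths),
        "std_len": statistics.pstdev(lengths),
    }


clear = "function addNumbers(first, second) { return first + second; }"
obfuscated = "function _0x3a1f(_0x1b2c, _0x4d5e) { return _0x1b2c + _0x4d5e; }"

print(identifier_length_stats(clear))
print(identifier_length_stats(obfuscated))
```

In this toy pair, the hex-style names all have the same length, so the spread of lengths collapses to zero, while the human-written names vary. Different obfuscators would be expected to leave different signatures in such statistics.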

This allows us to extract several families of features. Some of them require careful feature engineering, while others are more general and follow well-known NLP techniques. Next, we survey prior art from the literature and discuss several natural approaches to this problem.
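As a rough illustration of the more general, language-independent kind of feature, the sketch below computes character-level Shannon entropy and character n-gram counts for a source string. The abstract does not say which NLP techniques were used, so both the choice of features and the helper names `char_entropy` and `char_ngram_counts` are assumptions made for this example.

```python
import math
from collections import Counter


def char_entropy(text):
    """Shannon entropy (bits per character) of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def char_ngram_counts(text, n=2):
    """Count overlapping character n-grams; empty if the text is shorter than n."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


snippet = "var total = price * quantity;"
print(round(char_entropy(snippet), 3))
print(char_ngram_counts(snippet, n=2).most_common(3))
```

Features like these need no knowledge of JavaScript syntax, which is one reason they may carry over to other languages; whether they separate cleartext from obfuscated code in practice depends on the data and on the obfuscator.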

Finally, we suggest obfuscator-agnostic methods for building a state-of-the-art machine learning classifier for this problem.

Although we used JavaScript as the starting point of our research, our techniques generalize well to additional programming languages. In those languages, unlike in JavaScript, obfuscation is much stronger evidence of malicious intent, so our techniques are of special interest there.