Goals
As Symantec's report shows, we are drowning in malware. Because numerous randomized obfuscation tools are freely available, signature-based methods find it increasingly hard to keep up with the flood. In this project, I'm therefore studying behavioral detection methods, especially those based on automata inference. For a fuller picture of the project, see the paper Malware Analysis with Tree Automata Inference.
Key Ideas
The key idea of the project is that system call data-flow dependency graphs obtained through taint analysis (see the papers by Newsome and Song and by Clause et al.) can be expanded into trees, which in turn can be recognized by tree automata. Tree automata generalize ordinary finite-state automata: they accept trees rather than strings. In this project, I'm working on automatic inference of tree automata and on their application to malware recognition and classification.
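To make the idea concrete, here is a minimal Python sketch of the two steps: unfolding a small dependency graph into a tree, and running a hand-written bottom-up tree automaton over it. It is an illustration only, not the inference algorithm from the paper. The graph, the state names ("handle", "data", "copy"), the rule table, and the choice to make a node's children the calls that produced its inputs are all assumptions made for this example.

```python
# Illustrative sketch only: the graph, states, and rules below are made up.

def unfold(deps, labels, node):
    """Expand the DAG rooted at `node` into a tree.

    `deps` maps a node id to the ids of the calls whose outputs it consumes
    (an assumed orientation). Shared nodes are duplicated in the tree.
    A tree is represented as (label, tuple_of_subtrees).
    """
    children = tuple(unfold(deps, labels, child) for child in deps.get(node, ()))
    return (labels[node], children)


class TreeAutomaton:
    """Deterministic bottom-up tree automaton.

    `rules` maps (label, tuple_of_child_states) to a state;
    a tree is accepted if its root reaches a state in `accepting`.
    """

    def __init__(self, rules, accepting):
        self.rules = rules
        self.accepting = set(accepting)

    def state_of(self, tree):
        label, children = tree
        child_states = tuple(self.state_of(child) for child in children)
        if None in child_states:
            return None  # some subtree matched no rule
        return self.rules.get((label, child_states))

    def accepts(self, tree):
        return self.state_of(tree) in self.accepting


# Toy dependency graph: the write uses the file handle and the data read.
labels = {1: "NtCreateFile", 2: "NtReadFile", 3: "NtWriteFile"}
deps = {2: [1], 3: [1, 2]}

tree = unfold(deps, labels, 3)

automaton = TreeAutomaton(
    rules={
        ("NtCreateFile", ()): "handle",
        ("NtReadFile", ("handle",)): "data",
        ("NtWriteFile", ("handle", "data")): "copy",
    },
    accepting={"copy"},
)

print(tree)
print(automaton.accepts(tree))
```

Note that node 1 is shared in the graph but appears twice in the unfolded tree. A word automaton reads one symbol at a time, whereas here each transition consumes a label together with the states already assigned to all of its subtrees.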
Dependency Graphs
Here you can download system call dependency graphs generated for 2631 malware samples and 35 commonly used benign applications. The graphs represent data-flow dependencies among executed system calls. The traces were produced by a tool developed by Daniel Reynaud and the libwst library developed by Lorenzo Martignoni and Roberto Paleari.
Benchmark | Release | Date | Archive | Size |
Benign apps, 120 s timeout | 1.0 | Jan 1, 2010 | [tar.bz2] | 255 KB |
Benign apps, 800 s timeout | 1.0 | Jan 1, 2010 | [tar.bz2] | 758 KB |
Malware, 120 s timeout | 1.0 | Jan 1, 2010 | [tar.bz2] | 20 MB |
Here's a brief description of the file format:
- Lines beginning with '#' are comments
- The N line specifies the total number of distinct nodes in the graph
- A V line specifies a node's unique identifier, its name, and its numbers of input and output parameters
- The E line specifies edges in the form SourceNodeId:OutputParameterNumber,DestinationNodeId:InputParameterNumber
Here you can find an example of a very small dependency graph obtained from executing a sample from the Hupigon malware family.
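The format described above can be read with a few lines of Python. The sketch below is not part of the released tools, and it makes assumptions the description does not settle: that each line starts with its tag (N, V, or E) followed by whitespace-separated fields, that a V line lists the input count before the output count, and that each E line carries one edge. The names parse_graph, Node, and Graph, and the sample text, are made up for illustration; check them against the real files before relying on this.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: str
    name: str
    n_inputs: int
    n_outputs: int


@dataclass
class Graph:
    declared_nodes: int = 0
    nodes: dict = field(default_factory=dict)
    # Each edge is (source_id, output_param, destination_id, input_param).
    edges: list = field(default_factory=list)


def parse_graph(text):
    """Parse a dependency graph, assuming 'TAG field field ...' lines."""
    graph = Graph()
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or comment
        parts = line.split(None, 1)
        tag = parts[0]
        rest = parts[1] if len(parts) > 1 else ""
        if tag == "N":
            graph.declared_nodes = int(rest.split()[0])
        elif tag == "V":
            # Assumed field order: id, name, inputs, outputs.
            node_id, name, n_in, n_out = rest.split()[:4]
            graph.nodes[node_id] = Node(node_id, name, int(n_in), int(n_out))
        elif tag == "E":
            # SourceNodeId:OutputParameterNumber,DestinationNodeId:InputParameterNumber
            src, dst = rest.replace(" ", "").split(",")
            src_id, out_param = src.split(":")
            dst_id, in_param = dst.split(":")
            graph.edges.append((src_id, int(out_param), dst_id, int(in_param)))
    return graph


# Made-up sample, not taken from the released archives.
SAMPLE = """\
# toy graph
N 3
V 1 NtCreateFile 2 1
V 2 NtReadFile 1 1
V 3 NtWriteFile 2 0
E 1:0,2:0
E 1:0,3:0
E 2:0,3:1
"""

graph = parse_graph(SAMPLE)
print(len(graph.nodes), "nodes,", len(graph.edges), "edges")
```

Comparing graph.declared_nodes with len(graph.nodes) gives a cheap sanity check that a file was read completely.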
Implementation of the Tree-Automaton Inference
The C++ source code of the engine that analyzes dependency graphs and infers tree automata is available here: [tar.bz2]