"Len or Index or Count, Anything but V1": Predicting Variable Names in Decompilation Output with Transfer Learning

45TH IEEE SYMPOSIUM ON SECURITY AND PRIVACY, SP 2024 (2024)

Abstract

Binary reverse engineering is an arduous and tedious task performed by skilled and expensive human analysts. Information about the source code is irrevocably lost in the compilation process. While modern decompilers attempt to generate C-style source code from a binary, they cannot recover lost variable names. Prior works have explored machine learning techniques for predicting variable names in decompiled code. However, the state-of-the-art systems, DIRE and DIRTY, generalize poorly to functions in the testing set that are not included in the training set: 31.8% for DIRE on DIRTY’s data set and 36.9% for DIRTY on DIRTY’s data set.

In this paper, we present VarBERT, a Bidirectional Encoder Representations from Transformers (BERT) model that predicts meaningful variable names in decompilation output. An advantage of VarBERT is that we can pre-train on human-written source code and then fine-tune the model to the task of predicting variable names. We also create a new data set, VarCorpus, which significantly expands the size and variety of available data. Our evaluation of VarBERT on VarCorpus demonstrates a significant improvement in predicting the developer’s original variable names for O2-optimized binaries, achieving accuracies of 54.43% for IDA and 54.49% for Ghidra. VarBERT is strictly better than state-of-the-art techniques: on a subset of VarCorpus, VarBERT could predict the developer’s original variable names 50.70% of the time, while DIRE and DIRTY predicted original variable names 35.94% and 38.00% of the time, respectively.
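The abstract does not spell out how inputs are prepared or how accuracy is scored, so the following is only an illustrative sketch under stated assumptions, not the VarBERT implementation. It assumes that decompiler placeholder names are replaced by a BERT-style `[MASK]` token before prediction, and that accuracy means the fraction of variables whose predicted name exactly matches the developer's original name. The helper names `mask_variables` and `exact_match_accuracy`, the sample snippet, and the sample names are all hypothetical.

```python
import re

MASK = "[MASK]"  # assumed mask token, in the style of BERT vocabularies


def mask_variables(code: str, placeholders: list[str]) -> str:
    """Replace each whole-word occurrence of a placeholder name with MASK."""
    if not placeholders:
        return code
    # \b keeps "v1" from matching inside a longer identifier such as "v10".
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in placeholders) + r")\b"
    return re.sub(pattern, MASK, code)


def exact_match_accuracy(predicted: dict[str, str], original: dict[str, str]) -> float:
    """Fraction of variables whose predicted name equals the original name."""
    if not original:
        return 0.0
    hits = sum(1 for var, name in original.items() if predicted.get(var) == name)
    return hits / len(original)


# Hypothetical decompiler output with placeholder names.
decompiled = "for (v1 = 0; v1 < a2; ++v1) v3 += a1[v1];"
masked = mask_variables(decompiled, ["v1", "v3", "a1", "a2"])

# Hypothetical ground truth and model output, keyed by placeholder name.
original = {"v1": "index", "v3": "sum", "a1": "buf", "a2": "len"}
predicted = {"v1": "index", "v3": "total", "a1": "buf", "a2": "len"}
accuracy = exact_match_accuracy(predicted, original)
```

In this toy case three of the four predicted names match, so `accuracy` is 0.75. A real pipeline would tokenize the masked code and let a fine-tuned masked language model fill each masked position; that step is omitted here because the abstract gives no details about it.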

Keywords

Program and binary analysis, Machine learning and computer security, Decompilation
