Danfeng (Daphne) Yao

Abstract

In this paper, we conduct a measurement study that comprehensively compares the accuracy of cryptographic API completion models trained with multiple API embedding options. Embedding is the process of automatically learning to represent program elements as low-dimensional vectors. Our measurement aims to uncover the impact of program analysis, token-level embedding, and sequence-level embedding on cryptographic API completion accuracy. Our findings show that program analysis is necessary even under advanced embedding. Accuracy improves by 36.10% on average when program analysis preprocessing is applied to transform bytecode sequences into API dependence paths. The best accuracy (93.52%) is achieved on API dependence paths with embedding techniques. In contrast, the purely data-driven approach without program analysis achieves a low accuracy (around 57.60%), even after the powerful sequence-level embedding is applied. Although sequence-level embedding shows a slight accuracy advantage (0.55% on average) over token-level embedding in our basic data split setting, it is not recommended there given its expensive training cost. A more pronounced accuracy improvement (5.10%) from sequence-level embedding is observed in the cross-project learning scenario, when task data is insufficient. Hence, we recommend applying sequence-level embedding for cross-project learning with limited task-specific data.

Ya Xiao, Salman Ahmed, Xinyang Ge, Bimal Viswanath, Na Meng, Danfeng Daphne Yao: Poster: Comprehensive Comparisons of Embedding Approaches for Cryptographic API Completion. ICSE-Companion 2022: 360-361

Publication Details

Date of publication: October 19, 2022
Conference: ICSE (International Conference on Software Engineering)
Page number(s): 360-361