Better Code Representation for Machine Learning

Summary

Suggests an improved code embedding and a way to reduce the amount of data when working with code changes.

Abstract

Using machine learning on source code is becoming increasingly common, with approaches based on code paths or on BERT-style models already available. This paper focuses on improving parts of the input vector by creating a more compact embedding. Furthermore, it explores and discusses ways to reduce the amount of data fed into a model when working with code changes. The results presented in this paper show that the input data can be compressed into a latent space of half the original size, representing differences and similarities between code paths very compactly while still maintaining an accuracy of 99%. Moreover, it is shown that with proper preprocessing, the amount of data inserted into a code-changes model can be reduced by around 84%.
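
The compression described above can be sketched as a linear autoencoder that maps an embedding vector into a latent space of half its width. This is a minimal illustration, not the paper's actual architecture; the dimensions (128 in, 64 latent), the synthetic data, and the plain gradient-descent loop are all assumptions for demonstration.

```python
import numpy as np

# Hypothetical setup: 128-dim path embeddings compressed to a 64-dim
# latent space (a 50% reduction, as in the abstract). All numbers here
# are illustrative, not taken from the paper.
rng = np.random.default_rng(0)
input_dim, latent_dim = 128, 64

X = rng.normal(size=(256, input_dim))           # stand-in path embeddings
W_enc = rng.normal(scale=0.1, size=(input_dim, latent_dim))
W_dec = rng.normal(scale=0.1, size=(latent_dim, input_dim))

lr = 1e-3
for _ in range(200):                            # a few gradient steps
    Z = X @ W_enc                               # encode into latent space
    X_hat = Z @ W_dec                           # decode back to input space
    err = X_hat - X                             # reconstruction error
    # Gradients of mean squared reconstruction loss w.r.t. both matrices
    W_dec -= lr * (Z.T @ err) / len(X)
    W_enc -= lr * (X.T @ (err @ W_dec.T)) / len(X)

Z = X @ W_enc                                   # final latent representation
print(Z.shape)                                  # half the original width
```

After training, `Z` carries a compact representation in which similar inputs land near each other in the latent space, which is the property the abstract relies on for comparing code paths.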
