
Training AI To Code Using the Largest Code Dataset

2022-10-28

Authors:   Animesh Singh, Tommy Li


Summary

The presentation discusses the open-sourcing of Project CodeNet, which aims to help migrate legacy code and shorten developer cycles. The project includes high-quality datasets and tools for code generation, natural language processing, and code analysis. The goal is to create better solutions, improve the performance of existing code, and reduce errors and debugging effort. The project is hosted on the Machine Learning Exchange, which provides data and AI asset catalogs and integrates execution engines for easy experimentation. The ultimate aim is to develop production AI systems that can automatically translate and modernize legacy code with minimal effort.
The presentation highlights a collaboration with Red Hat on Project Wisdom, which generates Ansible playbooks from plain English text. This tool lets developers automate tasks without relying on dedicated automation teams, making the process more efficient and accessible. Another use case involves modernizing old Java code into new Java code. These examples demonstrate the potential of Project CodeNet to enhance developer productivity and streamline code migration.

Abstract

Project CodeNet is a large dataset of 14 million code samples totaling 500 million lines of code in 55 programming languages. It enables machine learning for code, such as finding code similarity, extracting semantic context, and even translating between different programming languages. Using the Machine Learning Exchange (MLX), an LF AI & Data Sandbox project, we demonstrate how Project CodeNet can be leveraged to classify code and analyze code complexity in three steps. Using DataShim, we turn domain-specific subsets of the data into Kubernetes custom resources. Running Jupyter notebooks on Kubernetes, we use these datasets to train deep learning models. The models are then served for inference as Kubernetes custom resources using KServe. For each of these steps, MLX generates Kubeflow Pipelines on Tekton, so data scientists are not required to write Kubernetes-specific code.
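To make the code-classification task concrete, here is a minimal sketch of labeling code samples by programming language. This is not the deep learning model the abstract describes; it is a toy stand-in using token-frequency centroids and cosine similarity over a handful of hypothetical labeled snippets, just to illustrate what "classify code" means as a learning problem.

```python
from collections import Counter
import math

def tokenize(code):
    """Crude lexical tokenizer: split on non-identifier characters,
    keeping punctuation tokens, which are strong language signals."""
    tokens, word = [], []
    for ch in code:
        if ch.isalnum() or ch == "_":
            word.append(ch)
        else:
            if word:
                tokens.append("".join(word))
                word = []
            if not ch.isspace():
                tokens.append(ch)
    if word:
        tokens.append("".join(word))
    return tokens

def cosine(a, b):
    """Cosine similarity between two sparse token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def train(samples):
    """samples: list of (language, source_code) pairs.
    Returns one summed token-count centroid per language."""
    centroids = {}
    for lang, code in samples:
        centroids.setdefault(lang, Counter()).update(Counter(tokenize(code)))
    return centroids

def classify(centroids, code):
    """Predict the language whose centroid is closest to the sample."""
    vec = Counter(tokenize(code))
    return max(centroids, key=lambda lang: cosine(centroids[lang], vec))

# Tiny illustrative training set (not drawn from CodeNet itself).
samples = [
    ("python", "def add(a, b):\n    return a + b"),
    ("python", "for i in range(10):\n    print(i)"),
    ("c", "int add(int a, int b) { return a + b; }"),
    ("c", "for (int i = 0; i < 10; i++) { printf(\"%d\", i); }"),
]
model = train(samples)
print(classify(model, "def greet(name):\n    print(name)"))  # python
```

At CodeNet scale, the lexical features and nearest-centroid rule would be replaced by the deep learning models trained in the Jupyter-on-Kubernetes step, but the input/output contract (source text in, language label out) is the same.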

Materials: