About a year ago, generating code with a Large Language Model (LLM) seemed like an unachievable task. With recent advances in Artificial Intelligence, LLMs are now being used successfully to generate software, and automatic code generation has streamlined many real-world programming tasks. However, as code LLMs have become widely used, concerns have grown about the source code used as training data. A model learns from its training examples, which might include open-source code released under restrictive licenses. This has raised questions among developers who never intended their code to be used for training language models.
The BigCode project, a collaboration between ServiceNow and Hugging Face, has released The Stack, a 3.1 TB dataset of permissively licensed source code in 30 programming languages. At a time when the use of open-source repositories for training is under debate, BigCode has released the dataset to promote transparency around pre-training data.
The main idea is to let people choose whether their code is used for developing and evaluating Machine Learning models. The Hugging Face website at https://huggingface.co/spaces/bigcode/in-the-stack lets people conveniently opt out of having their repositories included in The Stack for training LLMs. Entering a GitHub username on the website shows whether any of that user's repositories are in The Stack; if so, the owner can have the data removed from future versions.
In their recently published paper, "The Stack: 3 TB of Permissively Licensed Source Code", the ServiceNow and Hugging Face team list the following contributions:
- The team has released 3.1 TB of permissively licensed source code in 30 programming languages, along with a near-deduplicated version of the same data, both of which anyone can access at https://hf.co/BigCode.
- When training 350M-parameter decoder-only transformers on Python data, removing near-duplicates from the training data significantly improves model performance.
- The team claims to show that the strong results of Codex and CodeGen can be reproduced using only permissively licensed data.
- The team shares a dedicated data governance plan, including instructions and a process for opting out of having open-source repositories included in the training data.
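To make the near-deduplication idea in the list above more concrete, here is a minimal sketch in Python. It is not the project's actual pipeline: it assumes a simple greedy filter based on exact Jaccard similarity over token shingles, and the function names, the shingle size `k`, and the `threshold` value are all hypothetical choices made for illustration. A pass over a multi-terabyte dataset would need an approximate method rather than this quadratic comparison.

```python
import re


def shingles(source: str, k: int = 3) -> set:
    """Return the set of k-token shingles for one source file (k is an assumed value)."""
    tokens = re.findall(r"\w+", source)
    if len(tokens) < k:
        # Very short files get a single shingle holding all of their tokens.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def drop_near_duplicates(files: list, threshold: float = 0.7) -> list:
    """Greedily keep a file only if it is not too similar to any file already kept.

    The threshold is a hypothetical value chosen for this sketch.
    """
    kept = []
    kept_shingles = []
    for source in files:
        current = shingles(source)
        if all(jaccard(current, other) < threshold for other in kept_shingles):
            kept.append(source)
            kept_shingles.append(current)
    return kept


if __name__ == "__main__":
    files = [
        "def add(a, b):\n    return a + b\n",
        "def add(a, b):\n    return a + b  # sum\n",  # near-duplicate of the first
        "def mul(x, y):\n    return x * y\n",
    ]
    for source in drop_near_duplicates(files):
        print(source)
```

In this toy input, the second file differs from the first only by a trailing comment, so it is filtered out while the unrelated third file is kept.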
To obtain the license details of the 137.36M GitHub repositories that make up the dataset, the team used GHArchive and the go-license-detector. The most commonly used licenses were MIT and Apache 2.0. The group also presents a full comparison between the size of The Stack and one of the most popular datasets, CodeParrot: The Stack is more than three times larger. In addition, The Stack is compared with other code datasets, such as those used for AlphaCode, CodeGen, and PolyCoder.
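Once each repository has a detected license, keeping only permissively licensed code amounts to an allowlist filter. The sketch below illustrates that step under stated assumptions: the allowlist contents, the `name` and `license` record fields, and the decision to drop repositories with no detected license are all hypothetical and are not taken from the paper.

```python
# Illustrative allowlist of SPDX identifiers; the project's real list is not reproduced here.
PERMISSIVE_LICENSES = {"MIT", "Apache-2.0", "BSD-3-Clause"}


def filter_permissive(repos: list) -> list:
    """Keep repositories whose detected license is in the allowlist.

    Each repo is a dict with hypothetical keys "name" and "license".
    Repos with no detected license (None) are dropped in this sketch.
    """
    return [repo for repo in repos if repo.get("license") in PERMISSIVE_LICENSES]


if __name__ == "__main__":
    repos = [
        {"name": "alice/utils", "license": "MIT"},
        {"name": "bob/kernel-module", "license": "GPL-3.0-only"},
        {"name": "carol/webapp", "license": "Apache-2.0"},
        {"name": "dave/scratch", "license": None},
    ]
    for repo in filter_permissive(repos):
        print(repo["name"], repo["license"])
```

Using SPDX-style identifiers as the comparison key keeps the filter independent of how a particular detector formats license names.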
A lack of transparency about training data has long been an obstacle in model development. By releasing this enormous dataset and documenting the entire data curation process, ServiceNow Research and Hugging Face have promoted transparency in code LLMs.
Check out the Paper. All credit for this research goes to the researchers on this project.
Tanya Malhotra is a final year undergrad from the University of Petroleum & Energy Studies, Dehradun, pursuing BTech in Computer Science Engineering with a specialization in Artificial Intelligence and Machine Learning.
She is a Data Science enthusiast with good analytical and critical thinking, along with an ardent interest in acquiring new skills, leading groups, and managing work in an organized manner.