CodeGemma
Open code model variants of Gemma, suited for text-to-text and text-to-code tasks.
CodeGemma is a collection of lightweight open code models built on top of Gemma. CodeGemma models are text-to-text and text-to-code decoder-only models, available in three variants: a 7 billion parameter pretrained variant that specializes in code completion and code generation, a 7 billion parameter instruction-tuned variant for code chat and instruction following, and a 2 billion parameter pretrained variant for fast code completion.
This model card includes the 2B and 7B model variants.
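As a rough illustration of how the pretrained variants are typically prompted for code completion, the sketch below assembles a fill-in-the-middle prompt. The control-token strings are an assumption taken from the public CodeGemma documentation rather than from this page, and the helper name `build_fim_prompt` is hypothetical; verify the tokens against the tokenizer you actually load.

```python
# Assumed fill-in-the-middle control tokens (not stated on this page;
# check them against the tokenizer shipped with the model).
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Hypothetical helper: wrap the code before and after the cursor
    so the model is asked to generate the missing middle part."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


if __name__ == "__main__":
    before_cursor = "def add(a, b):\n    return "
    after_cursor = "\n\nprint(add(1, 2))\n"
    print(build_fim_prompt(before_cursor, after_cursor))
```

The resulting string would be sent to a pretrained (not instruction-tuned) variant, and the model's output is the text intended to sit between the prefix and the suffix.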
CodeGemma models have a wide range of applications, which vary between the instruction-tuned (IT) and pretrained (PT) models. The following list of potential uses is not comprehensive. Its purpose is to provide contextual information about the possible use cases that the model creators considered as part of model training and development.
You can deploy CodeGemma to Vertex AI.
Using Gemma as the base model, CodeGemma 2B and 7B pretrained variants are further trained on an additional 500 billion tokens of primarily English language data from publicly available code repositories, open source mathematics datasets and synthetically generated code.
The following data pre-processing techniques were applied:
CodeGemma was trained using the latest generation of Tensor Processing Unit (TPU) hardware (TPUv5e).
Training large language models requires significant computational power. TPUs, designed specifically for matrix operations common in machine learning, offer several advantages in this domain:
These advantages are aligned with Google's commitments to operate sustainably.
Training was done using JAX and ML Pathways.
JAX allows researchers to leverage the latest generation of hardware, including TPUs, for faster and more efficient training of large models.
ML Pathways is Google's latest effort to build artificially intelligent systems capable of generalizing across multiple tasks. This is especially suitable for foundation models, including large language models like these.
Together, JAX and ML Pathways are used as described in the paper about the Gemini family of models: "the 'single controller' programming model of Jax and Pathways allows a single Python process to orchestrate the entire training run, dramatically simplifying the development workflow."
These models were evaluated against a large collection of different datasets and metrics to cover different aspects of text generation:
Benchmark | 2B | 7B | 7B-IT |
---|---|---|---|
HumanEval | 31.1 | 44.5 | 56.1 |
MBPP | 43.6 | 56.2 | 54.2 |
HumanEval Single Line | 78.41 | 76.09 | 68.25 |
HumanEval Multi Line | 51.44 | 58.44 | 20.05 |
BC HE C++ | 24.2 | 32.9 | 42.2 |
BC HE C# | 10.6 | 22.4 | 26.7 |
BC HE Go | 20.5 | 21.7 | 28.6 |
BC HE Java | 29.2 | 41.0 | 48.4 |
BC HE JavaScript | 21.7 | 39.8 | 46.0 |
BC HE Kotlin | 28.0 | 39.8 | 51.6 |
BC HE Python | 21.7 | 42.2 | 48.4 |
BC HE Rust | 26.7 | 34.1 | 36.0 |
BC MBPP C++ | 47.1 | 53.8 | 56.7 |
BC MBPP C# | 28.7 | 32.5 | 41.2 |
BC MBPP Go | 45.6 | 43.3 | 46.2 |
BC MBPP Java | 41.8 | 50.3 | 57.3 |
BC MBPP JavaScript | 45.3 | 58.2 | 61.4 |
BC MBPP Kotlin | 46.8 | 54.7 | 59.9 |
BC MBPP Python | 38.6 | 59.1 | 62.0 |
BC MBPP Rust | 45.3 | 52.9 | 53.5 |
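To make the table easier to read, the following minimal sketch copies four rows from it and reports, for each benchmark, which variant scores highest and how the 7B-IT score differs from the 7B score. The helper names `best_variant` and `it_delta` are illustrative only; the numbers come directly from the table.

```python
# Scores copied from four rows of the table above, ordered (2B, 7B, 7B-IT).
VARIANTS = ("2B", "7B", "7B-IT")
scores = {
    "HumanEval": (31.1, 44.5, 56.1),
    "MBPP": (43.6, 56.2, 54.2),
    "HumanEval Single Line": (78.41, 76.09, 68.25),
    "HumanEval Multi Line": (51.44, 58.44, 20.05),
}


def best_variant(row):
    """Return the name of the variant with the highest score in a row."""
    return VARIANTS[max(range(len(VARIANTS)), key=lambda i: row[i])]


def it_delta(row):
    """Return the 7B-IT score minus the 7B score, rounded to 2 decimals."""
    return round(row[2] - row[1], 2)


if __name__ == "__main__":
    for name, row in scores.items():
        print(f"{name}: best={best_variant(row)}, "
              f"7B-IT minus 7B={it_delta(row):+.2f}")
```

On these rows, the instruction-tuned variant leads on HumanEval, while the pretrained variants lead on MBPP and on the single-line and multi-line infilling benchmarks.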
Our evaluation methods include structured evaluations and internal red-teaming testing of relevant content policies. Red-teaming was conducted by a number of different teams, each with different goals and human evaluation metrics. These models were evaluated against a number of different categories relevant to ethics and safety, including:
The results of ethics and safety evaluations are within acceptable thresholds for meeting internal policies for categories such as child safety, content safety, representational harms, memorization, and large-scale harms. See the Gemma model card for more details.
Large Language Models (LLMs) have limitations based on their training data and the inherent limitations of the technology. See the Gemma model card for more details on the limitations of LLMs.
The development of large language models (LLMs) raises several ethical concerns. We have carefully considered multiple aspects in the development and the assessment of these models, especially regarding coding-specific risks, as detailed in the evaluation section above. For additional details, please refer to the same discussion in the Gemma model card.
Resource ID | Release date | Release stage | Description |
---|---|---|---|
codegemma | 2024-04-09 | GA | |
By using, reproducing, modifying, distributing, performing or displaying any portion or element of Gemma, Model Derivatives including via any Hosted Service, (each as defined below) (collectively, the "Gemma Services") or otherwise accepting the terms of this Agreement, you agree to be bound by this Agreement.
Google reserves the right to update this Gemma Prohibited Use Policy from time to time.
You may not use nor allow others to use Gemma or Model Derivatives to: