Llama 2 (Quantized)
Fine-tune and deploy a quantized version of Meta's Llama 2 models on Vertex AI.

Model quantization is a technique for reducing the size and computational requirements of deep learning models. Quantization converts a model's weights from high-precision floating-point numbers to lower-precision data types, such as 4-bit integers, while aiming to preserve model accuracy. AWQ and GPTQ are two post-training quantization algorithms; for details, refer to the original GPTQ and AWQ papers. The performance of some quantized Llama 2 models can be found on the Hugging Face Open LLM Leaderboard, and evaluation can also be performed using Vertex AI Custom Jobs by following the LLaMA2 (Evaluation) Colab notebook.
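To make the idea concrete, the sketch below shows a simplified per-group absolute-max quantization of weights to signed 4-bit integers and back. This is an illustration of the general technique only; AWQ and GPTQ use more sophisticated, calibration-based schemes than this.

```python
import numpy as np

def quantize_4bit(w: np.ndarray, group_size: int = 8):
    """Quantize weights to the signed 4-bit range [-8, 7], one scale per group.

    Simplified absmax scheme for illustration; not the AWQ/GPTQ algorithm.
    """
    groups = w.reshape(-1, group_size)
    # One floating-point scale per group, chosen so the largest magnitude maps to 7.
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover approximate floating-point weights from 4-bit codes and scales."""
    return q * scale

np.random.seed(0)
w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_4bit(w)
w_hat = dequantize(q, s).reshape(w.shape)
err = float(np.abs(w - w_hat).max())  # worst-case round-trip error
```

The 4-bit codes take a quarter of the memory of float16 weights (plus a small per-group scale), which is where the size and bandwidth savings come from.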
For details on the original model, refer to the Llama 2 model card.
This model can be used directly through the DEPLOY button or in a notebook. Click OPEN NOTEBOOK to deploy and run inference on the model using the LLaMA2 Quantization notebook in Colab.
Deploying a model consists of three steps: creating an endpoint resource, uploading the model, and deploying the model to the endpoint. A service account with the Vertex AI User role is required for deploying models to Vertex AI endpoints.
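The three steps above can be sketched with the Vertex AI Python SDK. The serving container image URI, its `--model`/`--quantization` arguments, and the machine/accelerator shapes below are assumptions for illustration; substitute the values from the notebook for your chosen model.

```python
def deploy_quantized_llama2(
    project: str,
    location: str,
    serving_image_uri: str,  # assumed: a vLLM serving container image
    model_id: str = "TheBloke/Llama-2-7B-chat-GPTQ",
    machine_type: str = "g2-standard-12",      # assumed shape, adjust per model size
    accelerator_type: str = "NVIDIA_L4",       # assumed accelerator
    accelerator_count: int = 1,
):
    """Create an endpoint, upload the model, and deploy it to the endpoint."""
    from google.cloud import aiplatform  # requires google-cloud-aiplatform

    aiplatform.init(project=project, location=location)
    name = model_id.split("/")[-1]

    # Step 1: create the endpoint resource.
    endpoint = aiplatform.Endpoint.create(display_name=f"{name}-endpoint")

    # Step 2: upload the model with its serving container.
    model = aiplatform.Model.upload(
        display_name=name,
        serving_container_image_uri=serving_image_uri,
        # Assumed container arguments; check the notebook for the exact flags.
        serving_container_args=[f"--model={model_id}", "--quantization=gptq"],
    )

    # Step 3: deploy the model to the endpoint.
    model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        accelerator_type=accelerator_type,
        accelerator_count=accelerator_count,
    )
    return endpoint
```

Larger models need more accelerators; the 70B variants will not fit on a single L4 even when quantized.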
With the model deployed, you can send prediction requests to the endpoint; the output is text generated by the model.
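A prediction request can be sketched as follows. The instance field names (`prompt`, `max_tokens`, `temperature`) are assumptions based on a vLLM-style serving container; adjust them to match your deployment's request schema.

```python
def build_instance(prompt: str, max_tokens: int = 128, temperature: float = 0.7) -> dict:
    """Build one prediction instance (assumed vLLM-style request fields)."""
    return {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}

def generate(endpoint, prompt: str) -> str:
    """Send a prompt to a deployed aiplatform.Endpoint and return the generated text."""
    response = endpoint.predict(instances=[build_instance(prompt)])
    return response.predictions[0]

# Example request body (no endpoint call is made here):
instance = build_instance("What is model quantization?")
```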
Quantized Llama 2 models can be deployed using vLLM on GCP to achieve high throughput. Quantized models are recommended for minimizing GPU requirements and lowering latency; full-precision models are recommended for the highest throughput. Llama 2 is released under the Llama 2 Community License Agreement.
| Resource ID | Release date | Release stage | Description |
| --- | --- | --- | --- |
| TheBloke/Llama-2-7B-chat-GPTQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-13B-chat-GPTQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-70B-chat-GPTQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-7B-chat-AWQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-13B-chat-AWQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-70B-chat-AWQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-7B-GPTQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-13B-GPTQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-70B-GPTQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-7B-AWQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-13B-AWQ | 2024-01-01 | GA | Serving for text generation. |
| TheBloke/Llama-2-70B-AWQ | 2024-01-01 | GA | Serving for text generation. |