Introducing Llama2-70B-Chat with MosaicML Inference



Llama2-70B-Chat is a leading AI model for text completion, comparable with ChatGPT in terms of quality. Today, organizations can leverage this state-of-the-art model through a simple API with enterprise-grade reliability, security, and performance by using MosaicML Inference and MLflow AI Gateway.

Llama2-70B-Chat is available via MosaicML Inference. To get started, sign up here and check out our inference product page. Curious about our pricing? Look here.

Figure 1: Human raters prefer Llama2-70B-Chat to ChatGPT and PaLM-Bison. Adapted from the Llama2 technical paper; see the paper for additional data on model-based evaluation using GPT-4. Llama2-70B-Chat was fine-tuned for dialog use cases and carefully optimized for safety and helpfulness, leveraging over 1 million human annotations.

On July 18th, Meta published Llama2-70B-Chat: a 70B parameter language model pre-trained on 2 trillion tokens of text with a context length of 4096 that outperforms all open source models on many benchmarks, and is comparable in quality to closed proprietary models such as OpenAI's ChatGPT and Google PaLM-Bison. Meta made the model publicly available with a commercially permissive license that enables the broader ML community to learn from this work, build on top of it, and leverage it for commercial use cases.

Figure 2: Llama2-70B-Chat ranks #1 among all open LLMs on the MosaicML LLM evaluation leaderboard. The Mosaic Model Gauntlet is our open-source evaluation harness for measuring LLM quality in a holistic and standardized manner; it encompasses 34 in-context learning benchmarks collected from a variety of sources and organized into 6 broad categories of competency.

However, enterprise deployment of Llama2-70B-Chat is challenging—and costly. Achieving adequate reliability, latency and throughput for commercial-grade applications requires state-of-the-art GPUs and sophisticated system and ML optimization. That's why we're making Llama2-70B-Chat available on the MosaicML Inference service. Customers who use the service can start experimenting with Llama2-70B-Chat within minutes, benefiting from enterprise-grade reliability, security, and performance, while paying for usage only on a per-token basis.

Read on to learn more about how to leverage Llama2-70B-Chat's enterprise-grade capabilities, integrate with MLflow AI Gateway, and design effective prompts.

Querying Llama2-70B-Chat with MosaicML Inference API

Invoking Llama2-70B-Chat text completion with MosaicML Inference is as easy as importing a Python module and calling an API:

from mcli.sdk import predict
url = 'https://models.hosted-on.mosaicml.hosting/llama2-70b-chat/v1'
# We'll explain this prompt format later in the blog post!
prompt = """[INST] <<sys>>
Always answer in a professional and engaging manner.
<</sys>>
Write a LinkedIn Post about Llama2-70B-Chat being available on MosaicML Inference. [/INST]"""


response = predict(url, {
    "inputs": [prompt], 
    "parameters": {'temperature':0.2}})

Here's an example response:

"Exciting news, everyone! 🚀 We're thrilled to announce that Llama2-70B-Chat is now available on 
MosaicML Inference! 🎉

This powerful language model is trained on a massive dataset of text from the internet and is 
capable of generating human-like responses to a wide range of questions and prompts. With its 
impressive knowledge base and conversational abilities, Llama2-70B-Chat is perfect for a variety 
of applications, including chatbots, virtual assistants, and customer service."

You can further customize your outputs with the following parameters, shown together in the example after this list:

  • max_new_tokens: the maximum number of new tokens to generate.
  • temperature: a decimal greater than or equal to 0. Higher values will make the output more random, while lower values will make it more deterministic. A temperature of 0 means greedy sampling.
  • top_p: a decimal between 0 and 1 that controls nucleus sampling; higher values widen the nucleus, allowing less likely tokens to be sampled.
  • top_k: an integer greater than or equal to 1 that determines how many of the most likely tokens are considered at each generation step.
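
As a rough sketch of how these might be combined in a single request (the values below are arbitrary and should be tuned for your application; the endpoint and predict call mirror the earlier snippet):

from mcli.sdk import predict

url = 'https://models.hosted-on.mosaicml.hosting/llama2-70b-chat/v1'
prompt = "[INST] Summarize what MosaicML Inference does in two sentences. [/INST]"

response = predict(url, {
    "inputs": [prompt],
    "parameters": {
        "max_new_tokens": 128,  # cap the length of the completion
        "temperature": 0.7,     # some randomness, but not fully exploratory
        "top_p": 0.9,           # sample from the top 90% of probability mass
        "top_k": 50,            # consider at most 50 candidate tokens per step
    },
})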

For more information about using MosaicML Inference, take a look at the API documentation.

Querying Llama2-70B-Chat through MLflow AI Gateway

MLflow AI Gateway from Databricks is a highly scalable, enterprise-grade API gateway that enables organizations to manage their LLMs for experimentation and production use cases.

Using MLflow AI Gateway to query generative AI models enables centralized management of LLM credentials and deployments, exposes standardized interfaces for common tasks such as text completions and embeddings, and provides cost management guardrails.

You can now leverage MosaicML Inference API through MLflow AI Gateway on Databricks, and query Llama2-70B-Chat as well as other models including MPT text completion and Instructor text embedding models.

The following code snippet demonstrates how easy it is to query Llama2-70B-Chat through AI Gateway using the MLflow Python client:

from mlflow.gateway import set_gateway_uri, create_route, query

# Set the MLflow AI Gateway server
set_gateway_uri("databricks")

# Create a Route for text completions with MosaicML Inference API 
create_route(
    name="llama2-70b-chat-route",
    route_type="llm/v1/completions",
    model={
        "name": "llama2-70b-chat",
        "provider": "mosaicml",
        "mosaicml_config": {
        "mosaicml_api_key": "<your MosaicML API key>"
        }
    }
)

# We'll explain this prompt format later in the blog post!
prompt = """[INST] <<SYS>>
You are an AI assistant, helping Databricks customers learn about Databricks products and services. Keep your answers short and concise.
<</SYS>>
Explain what is MLflow [/INST]"""

# Query the Route with a prompt
response = query(
    route="llama2-70b-chat-route",
    data={"prompt": prompt}
)

Here's an example response:

"MLflow is an open-source platform for managing the end-to-end machine learning (ML) lifecycle. 
It helps data scientists and engineers to manage and track experiments, reproduce and share 
results, and deploy ML models. MLflow is built on top of Apache Spark and is integrated with 
Databricks, allowing users to easily create, share, and deploy ML models on a production-ready 
Spark cluster. With MLflow, users can focus on building better ML models rather than managing 
infrastructure."

Using MosaicML Inference and MLflow AI Gateway, ML engineers can now leverage pretrained language models to build generative AI applications such as Retrieval Augmented Generation (RAG). To learn more, check out this blog post from Databricks.

Enterprise-grade serving of Llama2-70B-Chat

There are three models in the Llama2 family: Llama2-7B, Llama2-13B, and Llama2-70B, whose Float16 weights occupy roughly 14 GB to 140 GB. Each has different hardware requirements for inference serving. For example, the 7B model easily fits on a single NVIDIA A10 GPU (24 GB memory) or NVIDIA A100 GPU (40 GB memory), while the 70B model does not fit on either in Float16 or even Int8 precision.

For the 70B parameter model Llama2-70B-Chat, it's necessary to distribute the workload across multiple GPUs. However, multi-GPU serving is complicated and naively serving without optimizations is very expensive and slow.

Reliability

Figure 3: The main components of the deployment service, highlighting reliability and security features such as auto-restarts, rate limiting, and authentication.

Our deployment of the 70B model is served with multiple replicas for redundancy, with built-in auto-restarts to handle failure recovery. If, for example, a CUDA out-of-memory error brings down a single replica, the load balancer routes requests to the remaining available replicas. The deployment itself is monitored to detect failures. Each replica implements rate limiting to ensure fairness among different clients and to avoid overloading the service. Moreover, we are working to deploy replicas across multiple regions to ensure high availability.

Security

The MosaicML Inference platform is designed to maintain data privacy. User queries and responses generated by the model are not logged. The platform does not record or log any data provided by the user or generated by the model as a response to user prompts; it only records non-identifiable operational system metrics.

Performance

As briefly mentioned above, optimizations are what make it possible to serve these large models efficiently. Some of the optimizations we employ include Tensor Parallelism, KV cache memory management, continuous batching, and output token streaming:

Tensor Parallel: A 70B parameter model such as Llama2-70B-Chat requires either 140GB (Float16) or 70GB (Int8) just to store the weights in GPU memory. To address this constraint, we use a tensor parallel mechanism to fit the model into the memories of multiple GPUs.
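
As a back-of-the-envelope sketch of that weight-memory arithmetic (illustrative only; real deployments also need memory for activations, the KV cache, and framework overhead, so actual GPU counts depend on the serving stack):

# Approximate weight memory for a 70B-parameter model, split across GPUs with tensor parallelism.
n_params = 70e9
bytes_per_param = {"float16": 2, "int8": 1}

for dtype, nbytes in bytes_per_param.items():
    weight_gb = n_params * nbytes / 1e9
    for n_gpus in (2, 4, 8):
        # Each GPU holds roughly 1/N of the weight shards.
        print(f"{dtype}: {weight_gb:.0f} GB of weights, ~{weight_gb / n_gpus:.0f} GB per GPU across {n_gpus} GPUs")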

KV Cache memory management and Quantization: In autoregressive token generation, past Key/Values (KV) from the attention layers are cached so they don't have to be recomputed at every step. The size of the KV cache can be significant, depending on how many sequences are processed at a time and how long those sequences are. Moreover, at each generation step, new KV entries are appended to the cache, so it grows as tokens are generated. Therefore, effective memory management that keeps the KV cache small is critical for good inference performance.
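
To get a feel for the numbers, here is a rough KV-cache sizing sketch. The architecture constants below (80 layers, 8 KV heads under grouped-query attention, head dimension 128) are our assumptions for Llama2-70B; adjust them to the exact configuration you serve:

# Approximate KV-cache footprint in Float16 (assumed architecture constants).
n_layers = 80
n_kv_heads = 8
head_dim = 128
bytes_per_value = 2  # Float16

# Keys and values are both cached, hence the factor of 2.
kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
print(f"KV cache per token: ~{kv_bytes_per_token / 1024:.0f} KiB")

seq_len = 4096
batch_size = 32
total_gb = kv_bytes_per_token * seq_len * batch_size / 1e9
print(f"KV cache for a batch of {batch_size} full-length sequences: ~{total_gb:.0f} GB")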

Continuous Batching: Batching, i.e., processing multiple requests simultaneously, is essential for efficient utilization of GPU compute. However, requests from different clients do not arrive at the same time to form a batch. The MosaicML Inference server adds requests to the running batch as they arrive and returns results as they complete. This technique, known as continuous batching, optimizes GPU usage with minimal impact on request latency.
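
The scheduling idea behind continuous batching can be illustrated with a toy sketch (this is not the MosaicML server code; a sleep stands in for the real batched forward pass): requests join the active batch as soon as they arrive and leave it as soon as they finish.

import asyncio


class Request:
    def __init__(self, prompt: str, max_new_tokens: int):
        self.prompt = prompt
        self.max_new_tokens = max_new_tokens
        self.generated = []
        self.done = asyncio.Event()


async def generation_loop(queue: asyncio.Queue, step_time: float = 0.01):
    active = []
    while True:
        # Admit newly arrived requests into the running batch.
        while not queue.empty():
            active.append(queue.get_nowait())
        if not active:
            await asyncio.sleep(step_time)
            continue
        # One decoding step for the whole batch (stand-in for a real forward pass).
        await asyncio.sleep(step_time)
        for req in active:
            req.generated.append(f"tok{len(req.generated)}")
        # Retire finished requests immediately so new arrivals can take their slots.
        finished = [r for r in active if len(r.generated) >= r.max_new_tokens]
        for req in finished:
            req.done.set()
        active = [r for r in active if r not in finished]


async def submit(queue: asyncio.Queue, prompt: str, max_new_tokens: int) -> str:
    req = Request(prompt, max_new_tokens)
    await queue.put(req)
    await req.done.wait()
    return " ".join(req.generated)


async def main():
    queue = asyncio.Queue()
    batcher = asyncio.create_task(generation_loop(queue))
    # Two clients with different output lengths share the same running batch.
    print(await asyncio.gather(
        submit(queue, "short request", max_new_tokens=3),
        submit(queue, "long request", max_new_tokens=8),
    ))
    batcher.cancel()


asyncio.run(main())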

Streaming of Output Tokens: Output token streaming enables incremental rendering of output tokens to the end user as they are generated. While this is not a model performance feature per se, it greatly enhances the user experience for chat and similar use cases and delivers better interactivity when generating long output sequences. This feature is currently available with the MosaicML Llama2-70B-Chat API; see the documentation for details.

Stay tuned for future blog posts as we explore other optimization techniques like quantization that improve performance (but may impact model quality).

Designing effective prompts for Llama2-70B-Chat

Llama2-70B-Chat, as the name implies, has undergone more training to make it excel at following instructions and having conversations. Like other versions of Llama2, this model has a 4,096 token maximum length, which corresponds to around 3,000 words. (Tokens are sometimes whole words, and sometimes fractions of a word. On average, a token is about ¾ of a word.)

If you'd like to check how many tokens a message has, you can use Llama2's tokenizer, which is available to experiment with through llama-tokenizer-js. The 4,096-token budget is shared between the prompt and the response, so if your prompt uses 4,095 tokens, the model can only respond with a single token. For optimal results, Llama2-70B-Chat also expects some structure in its prompts, and that format will use some of your tokens. For example, if you want to know how to make a customer support bot that answers questions based on product docs, you can use a prompt like this:

"""[INST] <<sys>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, 
while being safe. 
Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or 
illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of 
answering something not correct. If you don't know the answer to a question, please don't share 
false information.
<</SYS>>
How do I make a customer support bot using my product docs? [/INST]"""

That's probably a lot more words than you expected.
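
If you'd like to see exactly how many tokens a prompt like this uses, one option (besides llama-tokenizer-js) is to count them in Python with the Hugging Face transformers tokenizer. This is a minimal sketch that assumes you have the transformers package installed and access to the gated meta-llama/Llama-2-70b-chat-hf repository; any Llama2 checkpoint's tokenizer gives the same counts:

from transformers import AutoTokenizer

# Assumes access to the gated meta-llama repository on the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-70b-chat-hf")

prompt = """[INST] <<SYS>>
You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.
<</SYS>>
How do I make a customer support bot using my product docs? [/INST]"""

n_tokens = len(tokenizer.encode(prompt))
print(f"This prompt uses {n_tokens} of the 4,096 available tokens.")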

In this prompt, there are two parts.

  1. User message: Your question or request. In this case: "How do I make a customer support bot using my product docs?"
  2. System prompt: A prompt shared across messages that tells the model what persona to take on in the conversation. It's everything between the <<SYS>> and <</SYS>> tokens. In this example, we used the system prompt that Meta suggests using with Llama2. However, it is far from the only system prompt you can use.

In chat conversations, there is usually a third part, the LLM response, which gets added in multi-turn conversations.
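
As a rough sketch of how that third part is folded back in (based on Meta's published chat template; exact handling of the <s> and </s> tokens can differ between serving stacks), a multi-turn prompt can be assembled like this:

# Sketch of building a multi-turn Llama2 chat prompt from a conversation history.
system_prompt = "Always answer in a professional and engaging manner."
history = [
    ("What is MosaicML Inference?",
     "MosaicML Inference is a service for deploying and querying large language models."),
]
new_user_message = "How do I query Llama2-70B-Chat with it?"

# The system prompt appears once, inside the first [INST] block; each earlier
# assistant reply follows its [/INST] marker and is closed with </s><s>.
prompt = f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n\n"
for i, (user_msg, assistant_msg) in enumerate(history):
    if i > 0:
        prompt += "[INST] "
    prompt += f"{user_msg} [/INST] {assistant_msg} </s><s>"
prompt += f"[INST] {new_user_message} [/INST]"
print(prompt)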

Llama2-70B-Chat's customization comes from the ability to specify system prompts for specific use cases. For example, if you are using Llama2-70B-Chat to write code, you might want to use a system prompt like:

"You are a code generator. Always output your answer in JSON or concise JavaScript. No preamble 
or comments."

A company using Llama2-70B-Chat as a customer service chatbot would pass that information to the model in the system prompt, like:

"You are a customer service chatbot for [company details]. You are conversing with [customer 
details]. Always be polite and concise."
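
One way to template this is a small helper (hypothetical, not part of any SDK) that fills in the details and wraps everything in the prompt format described above:

def build_customer_service_prompt(company_details: str, customer_details: str, user_message: str) -> str:
    # The bracketed placeholders from the example above become function arguments.
    system_prompt = (
        f"You are a customer service chatbot for {company_details}. "
        f"You are conversing with {customer_details}. Always be polite and concise."
    )
    return f"[INST] <<SYS>>\n{system_prompt}\n<</SYS>>\n{user_message} [/INST]"

prompt = build_customer_service_prompt(
    company_details="Acme Cloud, a managed database provider",
    customer_details="a trial user on the free tier",
    user_message="How do I upgrade my plan?",
)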

Retrieval Augmented Generation Example

Let's look at a longer code sample with a realistic example.

Say you want to answer a domain-specific question based on a set of documents. A popular method to use is Retrieval Augmented Generation (RAG). With RAG, you first retrieve a set of relevant documents from a larger document store based on the user query, and then pass those along to the LLM as context to help it answer the query. This method allows the LLM to draw on knowledge it may not have seen before, and helps to reduce hallucination.

For this simple example, we are not dynamically retrieving relevant documents; we simply provide a hardcoded set of relevant documents to the model. We make a request to the Llama2-70B-Chat endpoint with the following in the input:

  • A set of documents about MosaicML
  • A user message: a domain specific question about our MPT models
  • A system prompt asking to choose the most relevant document to answer the user query

from mcli.sdk import predict

endpoint_url = "https://models.hosted-on.mosaicml.hosting/llama2-70b-chat/v1"

system_prompt = "Choose the doc that is most relevant to the question, and answer the question 
concisely."

user_message = "What is the context length of MPT-30B?"

# Optional parameters. Here, temperature controls how random or deterministic the model outputs are.
params = {"temperature": 0.2}

### HELPER FUNCTIONS
def construct_user_query(user_message: str):
    return f"Query: {user_message}"

def construct_complete_system_prompt(system_prompt: str):
    return f"<<sys>>\n{system_prompt}\n<</sys>>\n\n"

def retrieve_documents(user_query: str):
    # To use RAG in a real application, you would use an embedding model and vector database
    # to retrieve the most relevant documents for the input query.
    # To keep this example simple, we provide document samples for you and ignore the user query.

    docs = [
        {
            "title": "MPT-7B Announcement Blog",
            "content": (
                "MPT-7B is a transformer trained from scratch on 1T tokens of text and code. "
                "It is open source, available for commercial use, and optimized for faster "
                "training and inference."
            ),
        },
        {
            "title": "MPT-30B Announcement Blog",
            "content": (
                "Introducing MPT-30B, a new, more powerful member of our Foundation Series "
                "of open-source models, trained with an 8k context length on NVIDIA H100 "
                "Tensor Core GPUs."
            ),
        },
        {
            "title": "MPT-7B-8k Announcement Blog",
            "content": (
                "Today, we are releasing MPT-7B-8K, a 7B parameter open-source LLM with 8k "
                "context length trained with the MosaicML platform. MPT-7B-8K was pretrained "
                "starting from the MPT-7B checkpoint in 3 days on 256 NVIDIA H100s with an "
                "additional 500B tokens of data."
            ),
        },
        {
            "title": "StreamingDataset Docs",
            "content": (
                "StreamingDataset helps make training on large datasets from cloud storage "
                "as fast, cheap, and scalable as possible. It's specially designed for "
                "multi-node, distributed training for large models, maximizing correctness "
                "guarantees, performance, and ease of use. StreamingDataset is compatible "
                "with any data type, including images, text, video, and multimodal data."
            ),
        },
        {
            "title": "Composer Docs",
            "content": (
                "Composer is an open-source deep learning training library optimized for "
                "distributed training on large-scale clusters."
            ),
        },
        {
            "title": "Mosaic Model Gauntlet",
            "content": (
                "The Mosaic Model Gauntlet is our open-source, standardized, and holistic "
                "technique for evaluating the quality of pretrained LLMs. The Model Gauntlet "
                "encompasses 34 different benchmarks collected from a variety of sources, and "
                "organized into 6 broad categories of competency that we expect good LLMs to have."
            ),
        },
    ]

    return docs

def construct_rag_input(system_prompt: str, user_message: str):
    start_token = "[INST] "
    end_token = " [/INST]"

    query = construct_user_query(user_message)

    complete_system_prompt = construct_complete_system_prompt(system_prompt)

    docs = retrieve_documents(query)

    docs_list = []
    for document in docs:
        docs_list.append(f"{document['title'].strip()}: ")
        docs_list.append(f"{document['content'].strip()}\n")

    document_prompt = "".join(docs_list)

    prompt = (
        start_token
        + complete_system_prompt
        + "Context:\n"
        + document_prompt
        + "\n\n"
        + query
        + end_token
    )

    return prompt


if __name__ == "__main__":
    full_input = construct_rag_input(system_prompt, user_message)
    # Call and print the results of the MosaicML Inference Llama2-70B-Chat API
    print(predict(endpoint_url, {"inputs": [full_input], "parameters": params})["outputs"][0])

Here's the answer Llama2 finds:

"According to the MPT-30B Announcement Blog, MPT-30B was trained with an 8k context length.
Therefore, the answer to the question is 8k."

From here, you can try experimenting with additional system prompts, parameters, documents, and user messages for your custom applications.

Getting started with Llama2-70B-Chat on MosaicML Inference

To leverage Llama2-70B-Chat with enterprise-grade reliability, security, and performance, sign up here and check out our inference product page. To stay up to date on new releases, follow us on Twitter or join our Slack.

License

Llama 2 is licensed under the LLAMA 2 Community License, Copyright © Meta Platforms, Inc. All Rights Reserved. Customers are responsible for ensuring compliance with applicable model licenses.