
How to pass prior conversation to the LLaMa 2 7B chat API? How to increase output length?

Hello. I have deployed and been successfully hitting an endpoint for the LLaMa 2 7B chat model on Vertex AI. However, I am having a couple of issues. I sent this body in a request:
 
{
  "instances": [
    { "prompt": "this is the prompt" }
  ],
  "parameters": {
    "temperature": 0.2,
    "maxOutputTokens": 256,
    "topK": 40,
    "topP": 0.95
  }
}
 
And received this response: 
 
{
    "predictions": [
        "Prompt:\nthis is the prompt\nOutput:\n for class today:\n\nPlease write a 1-2 page reflection on"
    ],
    "deployedModelId": "8051409189878104064",
    "model": "projects/563127813488/locations/us-central1/models/llama2-7b-chat-base",
    "modelDisplayName": "llama2-7b-chat-base",
    "modelVersionId": "1"
}
 
 Why is this response cutting off mid-sentence? I have adjusted the maxOutputTokens parameter, but no matter what I set it to, the response cuts off in roughly the same place. How can I fix this?
 
I would also like to pass prior conversation to the LLaMa model. I can do this with chat-bison using a body like this:
 
{
    "instances": [
        {
            "context": "",
            "examples": [],
            "messages": [
                {
                    "author": "user",
                    "content": "hello my name is tim"
                },
                {
                    "author": "bot",
                    "content": " Hello Tim, how can I help you today?
",
                    "citationMetadata": {
                        "citations": []
                    }
                },
                {
                    "author": "user",
                    "content": "what is my name"
                }
            ]
        }
    ],
    "parameters": {
        "candidateCount": 1,
        "maxOutputTokens": 1024,
        "temperature": 0.2,
        "topP": 0.8,
        "topK": 40
    }
}
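For reference, this body can be POSTed to the chat-bison predict endpoint with Python, along the lines of the sketch below (PROJECT_ID and the us-central1 region are placeholders to replace with your own values):

import google.auth
import google.auth.transport.requests
import requests

# Obtain an access token with the standard Cloud Platform scope.
credentials, _ = google.auth.default(
    scopes=["https://www.googleapis.com/auth/cloud-platform"]
)
credentials.refresh(google.auth.transport.requests.Request())

# Publisher-model predict endpoint for chat-bison; PROJECT_ID is a placeholder.
url = (
    "https://us-central1-aiplatform.googleapis.com/v1/projects/PROJECT_ID"
    "/locations/us-central1/publishers/google/models/chat-bison:predict"
)
body = {
    "instances": [{
        "context": "",
        "examples": [],
        "messages": [
            {"author": "user", "content": "hello my name is tim"},
            {"author": "bot", "content": " Hello Tim, how can I help you today?\n"},
            {"author": "user", "content": "what is my name"},
        ],
    }],
    "parameters": {
        "candidateCount": 1,
        "maxOutputTokens": 1024,
        "temperature": 0.2,
        "topP": 0.8,
        "topK": 40,
    },
}
response = requests.post(
    url, json=body, headers={"Authorization": f"Bearer {credentials.token}"}
)
print(response.json()["predictions"][0])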
 
The model will "remember" that my name is Tim. What is the syntax for doing the equivalent with LLaMa? Right now I am constrained to a single "prompt" field like this:
 
{
  "instances": [
    { "prompt": "this is the prompt" }
  ],
  "parameters": {
    "temperature": 0.2,
    "maxOutputTokens": 256,
    "topK": 40,
    "topP": 0.95
  }
}
 
How can I additionally pass prior queries and responses, or even a system prompt? Thank you in advance for your help!
Solved
2 ACCEPTED SOLUTIONS

Hi - I did figure out how to pass the conversation, but I haven't solved the issue of the responses getting cut off yet. This is the format I used for passing conversations:

{
  "instances": [
    { "prompt": "[SYS]This is the system prompt[/SYS][INST]Here is the user's first prompt[/INST]This is the model's first response[INST]This is the next prompt[/INST]" }
  ]
}
 
By using the [SYS] and [INST] tags I was able to pass the conversation and a system prompt. I hope this helps!
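To generalize this, a small helper can assemble the prompt string from a message history. This is a sketch using the same tag style as the post above (note that Meta's official Llama 2 chat template uses slightly different markers, <<SYS>> ... <</SYS>> inside [INST] blocks, but the helper mirrors the format that worked here):

def build_llama_prompt(system_prompt, turns):
    """Assemble a single prompt string from a system prompt and a list of
    (user_message, model_reply) pairs, using the tag style shown above.
    The final pair's reply should be None for the turn awaiting an answer."""
    prompt = f"[SYS]{system_prompt}[/SYS]"
    for user_message, model_reply in turns:
        prompt += f"[INST]{user_message}[/INST]"
        if model_reply is not None:
            prompt += model_reply
    return prompt

# Reproduces the instance body shown above.
body = {
    "instances": [
        {
            "prompt": build_llama_prompt(
                "This is the system prompt",
                [
                    ("Here is the user's first prompt", "This is the model's first response"),
                    ("This is the next prompt", None),
                ],
            )
        }
    ]
}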


Hey, I think I solved the cut-off responses. This is my input:

# `endpoint` is the aiplatform.Endpoint already deployed for the LLaMa 2 model.
text = endpoint.predict(
    instances=[
        {
            "prompt": "[SYS]Be respectful and answer, use emojis[/SYS][INST]Hey[/INST]Hey[INST]How is your day going so far?[/INST]",
            "max_tokens": 1000,
        }
    ]
)

The generation parameters (such as max_tokens) go inside the instance dictionary itself, rather than in a separate "parameters" object.
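Putting the two solutions together, a minimal end-to-end call might look like this (a sketch; the project, region, and endpoint ID are placeholders to replace with your own):

from google.cloud import aiplatform

# Placeholders: replace with your own project, region, and endpoint ID.
aiplatform.init(project="your-project-id", location="us-central1")
endpoint = aiplatform.Endpoint("your-endpoint-id")

response = endpoint.predict(
    instances=[
        {
            # System prompt and prior turns are packed into one prompt string.
            "prompt": (
                "[SYS]Be respectful and answer, use emojis[/SYS]"
                "[INST]Hey[/INST]Hey"
                "[INST]How is your day going so far?[/INST]"
            ),
            # Generation parameters live inside the instance dict.
            "max_tokens": 1000,
        }
    ]
)
print(response.predictions[0])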

