
How to pass prior conversation over LLaMa 2 7B chat API? How to increase output length?

Hello. I have deployed and been successfully hitting an endpoint for the LLaMa 2 7B chat model on Vertex AI. However, I am having a couple of issues. I sent this body in a request:
 
{
  "instances": [
    { "prompt": "this is the prompt" }
  ],
  "parameters": {
    "temperature": 0.2,
    "maxOutputTokens": 256,
    "topK": 40,
    "topP": 0.95
  }
}
 
And received this response: 
 
{
    "predictions": [
        "Prompt:\nthis is the prompt\nOutput:\n for class today:\n\nPlease write a 1-2 page reflection on"
    ],
    "deployedModelId": "8051409189878104064",
    "model": "projects/563127813488/locations/us-central1/models/llama2-7b-chat-base",
    "modelDisplayName": "llama2-7b-chat-base",
    "modelVersionId": "1"
}
 
 Why is this response cutting off mid-sentence? I have adjusted the maxOutputTokens parameter, but no matter what I set it to, the response cuts off in roughly the same place. How can I fix this?
 
I would also like to pass prior conversation to the LLaMa model. I can do this to chat-bison with a body like this:
 
{
    "instances": [
        {
            "context": "",
            "examples": [],
            "messages": [
                {
                    "author": "user",
                    "content": "hello my name is tim"
                },
                {
                    "author": "bot",
                    "content": " Hello Tim, how can I help you today?
",
                    "citationMetadata": {
                        "citations": []
                    }
                },
                {
                    "author": "user",
                    "content": "what is my name"
                }
            ]
        }
    ],
    "parameters": {
        "candidateCount": 1,
        "maxOutputTokens": 1024,
        "temperature": 0.2,
        "topP": 0.8,
        "topK": 40
    }
}
 
The model will "remember" that my name is Tim. What is the syntax for doing the equivalent with LLaMa? Right now I am constrained to a singular "prompt" field like this:
 
{
  "instances": [
    { "prompt": "this is the prompt" }
  ],
  "parameters": {
    "temperature": 0.2,
    "maxOutputTokens": 256,
    "topK": 40,
    "topP": 0.95
  }
}
 
How can I additionally pass prior queries and responses, or even a system prompt? Thank you in advance for your help!
Solved
2 ACCEPTED SOLUTIONS

Hi - I did figure out how to pass the conversation, but I haven't solved the issue of the responses getting cut off yet. This is the format I used for passing conversations:

{
  "instances": [
    { "prompt": "[SYS]This is the system prompt[/SYS][INST]Here is the user's first prompt[/INST]This is the model's first response[INST]This is the next prompt[/INST]" }
  ]
}
 
By using the [SYS] and [INST] tags I was able to pass the conversation and a system prompt. I hope this helps!
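
In Python, putting that format together might look like the sketch below. The [SYS]/[INST] layout is the one from this post, and the helper and variable names are just illustrative, not part of the Vertex AI API.

# Sketch: build a multi-turn prompt string using the [SYS]/[INST] tags above.
def build_prompt(system, turns, next_user_msg):
    """turns is a list of (user_message, model_reply) pairs from earlier in the chat."""
    prompt = f"[SYS]{system}[/SYS]"
    for user_msg, model_reply in turns:
        prompt += f"[INST]{user_msg}[/INST]{model_reply}"
    return prompt + f"[INST]{next_user_msg}[/INST]"

body = {
    "instances": [
        {"prompt": build_prompt(
            "This is the system prompt",
            [("hello my name is tim", "Hello Tim, how can I help you today?")],
            "what is my name")}
    ]
}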


Hey, I think I solved the cut-off responses. This is my input:

text = endpoint.predict(
    instances=[
        {
            "prompt": "[SYS]Be respectful and answer, use emojis[/SYS][INST]Hey[/INST]Hey[INST]How is your day going so far?[/INST]",
            "max_tokens": 1000
        }
    ]
)

The parameters go inside the instance dictionary itself, not in a separate "parameters" object.
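
A fuller version of that call with the Vertex AI Python SDK might look like the sketch below; the project, region, and endpoint ID are placeholders you would substitute with your own.

from google.cloud import aiplatform

# Placeholders: substitute your own project, region, and endpoint ID.
aiplatform.init(project="your-project-id", location="us-central1")
endpoint = aiplatform.Endpoint("your-endpoint-id")

response = endpoint.predict(instances=[{
    "prompt": "[SYS]Be respectful and answer, use emojis[/SYS]"
              "[INST]Hey[/INST]Hey[INST]How is your day going so far?[/INST]",
    # Per this reply, generation parameters go inside the instance dict,
    # not in a separate "parameters" object.
    "max_tokens": 1000
}])
print(response.predictions[0])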


8 REPLIES

Hi, were you able to figure out a way to do this? I also need to pass prior conversation.

Hi - I did figure out how to pass the conversation, but I haven't solved the issue of the responses getting cut off yet. This is the format I used for passing conversations:

{
  "instances": [
    { "prompt": "[SYS]This is the system prompt[/SYS][INST]Here is the user's first prompt[/INST]This is the model's first response[INST]This is the next prompt[/INST]" }
  ]
}
 
By using the [SYS] and [INST] tags I was able to pass the conversation and a system prompt. I hope this helps!

Thanks! I haven't been able to solve the cut-off in the responses either.

Hi @wbalkan,
Thanks for your suggestions.
Could you please help me with the query below? Here is an example.

My first prompt is:

{
  "instances": [
    {
      "prompt": "[SYS]You are math tutor[/SYS] [INST]What is the sum of 999 and 1?[/INST]",
      "max_tokens": 1000,
      "temperature": 0
    }
  ]
}

When I get the response back from LLaMa 2, the prediction echoes the input prompt (including the [SYS] tags) along with the output.

{
  "predictions": [
    "Prompt:\n[SYS]You are math tutor[/SYS] [INST]What is the sum of 999 and 1?[/INST]\nOutput:\n  The sum of 999 and 1 is 1000."
  ],
  "deployedModelId": "4230432560519315456",
  "model": "projects/115031558026/locations/us-east4/models/llama2-7b-chat-001-1710741954685",
  "modelDisplayName": "llama2-7b-chat-001-1710741954685",
  "modelVersionId": "1"
}

My first question: is there a way to make the model return only the output, with the input prompt trimmed out?

My second question: for the above, could you please help me prepare the prompt for the next turn of input?

Thank you,  

KK 

Hey, I think I solved the cut-off responses. This is my input:

text = endpoint.predict(
    instances=[
        {
            "prompt": "[SYS]Be respectful and answer, use emojis[/SYS][INST]Hey[/INST]Hey[INST]How is your day going so far?[/INST]",
            "max_tokens": 1000
        }
    ]
)

The parameters go inside the instance dictionary itself, not in a separate "parameters" object.

Thank you! Problem solved!

Thank you, I had the same problem, and this solution helped me fix it.
One more point to note: the `temperature` should also go inside the instances, not in the parameters section.

The right way to provide temperature is:

{
  "instances": [
    {
      "prompt": "What is the sum of 999 and 1?",
      "max_tokens": 1000,
      "temperature": 0
    }
  ]
}
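
For reference, the equivalent call through the Python SDK (reusing the `endpoint` object from the reply above) keeps both parameters inside the instance dict; this is only a sketch of what worked for me, not something taken from the official docs.

response = endpoint.predict(instances=[{
    "prompt": "What is the sum of 999 and 1?",
    "max_tokens": 1000,
    "temperature": 0   # inside the instance, not in a "parameters" block
}])
print(response.predictions[0])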

Kindly correct me if I am wrong here. 

Thank you,  
KK

I have custom data and need to write the system prompt so that the model returns output in the desired format, but it is not giving the output I expect.