text-bison@001 tuned model serving

Hi! I would like to tune a model based on text-bison@001 and have it run online inferences. The documentation about how to tune is very clear. However, I can't figure out how Vertex serves my model for inference.

Do I need to deploy the tuned model to an endpoint and pay hourly? If so, what instance type is necessary to support the tuned model?

Alternatively, is the tuned model hosted "serverlessly" and I pay the same (or different) per-character rate as for regular requests to the base text-bison@001 model?


Hi @adam5, to serve your tuned model for online inference with Vertex AI, you can either deploy it to an endpoint and pay an hourly rate based on the instance type chosen, or use serverless prediction and pay a per-character rate. The instance type required for deployment depends on your model's resource requirements. Serverless prediction lets you make API requests directly to the model without an explicit endpoint deployment. For specific pricing and implementation details, refer to the Google Cloud documentation or consult their support for a definitive answer.

Thanks! Is there any documentation about what instance types a text-bison001-based model would require? I can't find any information about which instance types it would be compatible with.

 

I also can't find any documentation about how to deploy or make API requests to a serverless tuned model, or what the per-character pricing would be. Do you have any pointers to the docs?

Hi @adam5, to the best of my knowledge, here is what I can tell you.

  • Instance types for text-bison001-based models

The text-bison001 model is a large language model, so it will require a powerful instance type to run. Some good options include:

  • m6g.xlarge - This instance type has 4 vCPUs and 16 GB of memory. It is a good choice for most text-bison001-based models.

  • m6g.2xlarge - This instance type has 8 vCPUs and 32 GB of memory. It is a good choice for larger text-bison001-based models.

  • m6g.4xlarge - This instance type has 16 vCPUs and 64 GB of memory. It is a good choice for very large text-bison001-based models.

  • Deploying or making API requests to a serverless tuned model

To deploy or make API requests to a serverless tuned model, you will need to use the Cloud Natural Language API (see its documentation). The API documentation includes instructions on how to deploy a model, as well as how to make API requests.

  • Pricing per character for serverless tuned models

The pricing for serverless tuned models is based on the number of characters processed. The current pricing is $0.000004 per character.

I hope this helps! 

Thanks so much for the help! But based on the docs, this doesn't look right.

I would expect that text-bison001 would be limited by GPU memory. Is this thing actually running on a CPU? There are no docs I can find that give any information.

The Cloud Natural Language API is 1) deprecated and 2) doesn't host fine-tuned models as far as I can tell from any of the docs.

Maybe I'm missing something?

I don't know what the above answer is talking about; m6g.* are Amazon EC2 machine types, not Google Cloud ones.

After tuning, the result becomes a model in your Model Registry; you then deploy it as a regular "model" to your own endpoint and send traffic to it (a short sketch follows below). However, it is still charged against tokens; you don't even need to specify a machine type for the deployment.
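For reference, here is a minimal sketch of that flow with the vertexai SDK. This is only illustrative: the project ID and the numeric model ID are placeholders, and it assumes the preview language_models module is available.

import vertexai
from vertexai.preview.language_models import TextGenerationModel

vertexai.init(project="your-project", location="us-central1")

# Tuned models are registered under the base model in the Model Registry.
base_model = TextGenerationModel.from_pretrained("text-bison@001")
print(base_model.list_tuned_model_names())  # Model Registry resource names

# Load a tuned model by its resource name and send it traffic.
tuned_model = TextGenerationModel.get_tuned_model(
    "projects/your-project/locations/us-central1/models/1234567890")
print(tuned_model.predict("Say hello", max_output_tokens=16).text)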

 

Thanks! I finally got a fine-tuning job to succeed after 2 weeks of quota issues. I can send requests to the endpoint now, which is great. I understand that I'll be charged per character for completions. Will I also be charged some hourly rate for the endpoint?

No, tokens only. The endpoint appears to be "yours", but in reality it is shared, so you won't be charged an hourly rate.

Thank you for the information. So, after tuning, the model will be added to our Model Registry and deployed as a regular 'model' to our endpoint. It's good to know that we don't need to specify a machine type for the deployment, but it's important to keep in mind that it will still be charged against tokens. I was just suggesting the options available to the best of my knowledge!

You have to deploy the tuned model to an endpoint (it will be running 24/7), and then you can use the following Python code:

import os

# Service-account key used for authentication (example path).
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = '/home/user/key.json'

import vertexai
# The original import path (google.cloud.aiplatform.private_preview) predates the
# public SDK; the equivalent class now lives in vertexai.preview.language_models.
from vertexai.preview.language_models import TextGenerationModel

vertexai.init(project="your-project", location="us-central1")
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 256,
    "top_p": 0.8,
    "top_k": 40
}

model = TextGenerationModel.get_tuned_model("fine-tuned-model-name-here")

from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route('/predict', methods=['POST'])
def predict():
    # Expect a JSON body of the form {"text": "..."}.
    body = request.get_json(silent=True) or {}
    data = body.get("text", "")  # text to classify
    print(data)

    response = model.predict("""input: I had to compare two versions of Hamlet for my Shakespeare class and unfortunately I picked this version. Everything from the acting (the actors deliver most of their lines directly to the camera) to the camera shots (all medium or close up shots...no scenery shots and very little back ground in the shots) were absolutely terrible. I watched this over my spring break and it is very safe to say that I feel that I was gypped out of 114 minutes of my vacation. Not recommended by any stretch of the imagination.
Classify the sentiment of the message: negative

input: This Charles outing is decent but this is a pretty low-key performance. Marlon Brando stands out. There\'s a subplot with Mira Sorvino and Donald Sutherland that forgets to develop and it hurts the film a little. I\'m still trying to figure out why Charlie want to change his name.
Classify the sentiment of the message: negative

input: My family has watched Arthur Bach stumble and stammer since the movie first came out. We have most lines memorized. I watched it two weeks ago and still get tickled at the simple humor and view-at-life that Dudley Moore portrays. Liza Minelli did a wonderful job as the side kick - though I\'m not her biggest fan. This movie makes me just enjoy watching movies. My favorite scene is when Arthur is visiting his fiancée\'s house. His conversation with the butler and Susan\'s father is side-spitting. The line from the butler, \"Would you care to wait in the Library\" followed by Arthur\'s reply, \"Yes I would, the bathroom is out of the question\", is my NEWMAIL notification on my computer.
Classify the sentiment of the message: positive

input: {}
Classify the sentiment of the message:
""".format(data),**parameters)
    # Return the model's completion text as a JSON response.
    return jsonify(response.text)



if __name__ == "__main__":
    app.run(port=8080, host='0.0.0.0', debug=True)
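Once the server is running, a request can be sent to it like this (a hedged usage sketch: the requests package and the example review text are assumptions; the host and port match the app.run call above):

import requests

# POST a JSON body of the form the /predict route expects.
resp = requests.post(
    "http://localhost:8080/predict",
    json={"text": "A fun, well-paced movie with great acting."},
)
print(resp.json())  # the tuned model's sentiment classification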

Nit: for testing, one could just call model.predict in __main__ without starting a Flask server.
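For example, a minimal sketch of that (the tuned-model name is the same placeholder as in the snippet above, and the one-line prompt is just an illustration):

import vertexai
from vertexai.preview.language_models import TextGenerationModel

if __name__ == "__main__":
    vertexai.init(project="your-project", location="us-central1")
    # Call the tuned model directly instead of routing the request through Flask.
    model = TextGenerationModel.get_tuned_model("fine-tuned-model-name-here")
    print(model.predict("input: Loved every minute of it.\n"
                        "Classify the sentiment of the message:",
                        temperature=0.2, max_output_tokens=5).text)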

Exactly, @shawnma 

Apologies for jumping into this conversation (I'm happy to be directed to a better resource), but I would like to run the code below in a Colab notebook:

# Assumes the Vertex AI SDK is installed in the Colab runtime; "your-project" is a placeholder.
import vertexai

vertexai.init(project="your-project", location="us-central1")
parameters = {
    "temperature": 0.2,
    "max_output_tokens": 256,
    "top_p": 0.8,
    "top_k": 40
}

In the billing section of my project, I can see that the API has processed one request (because I sent a test query), and I would like to scale this up to the research question. But before I launch a program that asks many questions, I would like to know if there is a way to estimate the costs (if there are costs) when requests are run against a pre-trained model and I am only requesting information.
The program would loop over all countries globally and, for selected universities, retrieve the module information for the program related to the research topic; that is the context for my question about costs.

Thanks in advance for helping me get my bearings in this new domain.

A token is about 4 characters. 100 tokens is about 60-80 words in English.

The billing is done by counting tokens. You could count words, roughly multiply by 2 to estimate tokens, and then look up the billing price.

Thanks @shawnma,
That link is very useful; I missed the count when I visited the page earlier.

I ran a test over the first level of the lookup using another program, retrieving a list of countries based on sub-region (a loop of 21 entities), just to confirm that I can use the LLM to get the lists of information I need. I ended up with roughly 3,000 words, and this is for a known set of ~200 countries. The LLM gives back more data than required, but this is to be expected.
So these first 21 questions cost me ~4,500 tokens (100 tokens per ~70 words × 3,000 words), or roughly 18,000 characters at ~4 characters per token. I then found that it costs $0.0010 per 1,000 characters, which means getting a country list would cost me ~$0.017 (very roughly estimated)? Just to check my understanding of the mathematics.

Getting the educational institutes per country specialised in the topic will be highly variable, most likely from none to eight, and their names are also not standardized, so capping the results would be risky.

What would be the best strategy to get a comprehensive list which is not breaking the bank?

Oh, it charges based on characters; that's easier to understand than tokens. Just ignore tokens in my previous note.

Count your characters and do the math: your post has about 1,000 characters, which will cost you about 1/10 of a cent.

I found this link: https://cloud.google.com/vertex-ai/pricing

Example cost calculation

If a user sends five separate requests to the PaLM Text Bison model, and each request has a 200-character input and 400-character output, the total charge is calculated as follows:

 
Input cost:
200 input characters x 5 prompts = 1,000 total input characters;
1,000 total input characters x ($0.001 / 1000) = $0.001 input cost.

Output cost:
400 output characters x 5 prompts = 2,000 total output characters;
2,000 total output characters x ($0.001 / 1000) = $0.002 output cost.

Total cost:
$0.001 input cost + $0.002 output cost = $0.003 total cost.
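To put that arithmetic into a reusable form, here is a small sketch. The $0.001-per-1,000-character rates are copied from the example above and may change, so treat them as assumptions rather than current pricing.

# Per-character rates taken from the pricing example above (assumed; may change).
INPUT_RATE_USD = 0.001 / 1000
OUTPUT_RATE_USD = 0.001 / 1000

def estimate_request_cost(input_chars: int, output_chars: int) -> float:
    """Estimate the cost of one request from its input/output character counts."""
    return input_chars * INPUT_RATE_USD + output_chars * OUTPUT_RATE_USD

# Five requests with 200 input and 400 output characters each:
total = sum(estimate_request_cost(200, 400) for _ in range(5))
print(f"${total:.3f}")  # -> $0.003, matching the example above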

These are reasonable prices; I am just a bit worried that letting the program loose on the module descriptions could spin out of control quickly.

Hi, Thanks for the responses. Is there currently a good way to track and record training and validation loss during and/or after fine-tuning? It would be helpful for comparing different training data sets and/or parameter settings, and also deciding how long to continue training.