I'd like to add domain-specific knowledge to a tuned text-bison model. This is knowledge I already have in how-to articles, FAQs, canned messages, etc.
Ideally, I think this should be context data, and GenAI should answer all prompts using this tuning data. I'm having trouble figuring out how to create the tuning-data.jsonl file. This isn't really an input/output or question/answer model. It's more a case of "here's a bunch of data to base answers to future questions on", like "context".
I've created entries where "input_text" is basically the body of the how-to article, FAQ, etc. Like the following:
{"input_text": "How to report a problem In general, when reporting a problem, the more information, the better. General statements like App X isn't working, but it used to work before do not provide enough information to debug an issue. You should provide as much information as possible: Problem Information Describe the problem or bug A clear and concise description of what the bug is [shortened]"}
Vertex AI doesn't like that there's no "output_text" field. Also I wonder if I should prepend "context:" to the "input_text" data.
Any guidance or help is greatly appreciated.
Thanks!
-Ernie
Hi, you are right: Vertex AI requires an output_text field in the JSONL dataset. You can view this sample in the documentation:
{"input_text": "question: How many people live in Beijing? context: With over 21 million residents, Beijing is the world's most populous national capital city and is China's second largest city after Shanghai. It is located in Northern China, and is governed as a municipality under the direct administration of the State Council with 16 urban, suburban, and rural districts.[14] Beijing is mostly surrounded by Hebei Province with the exception of neighboring Tianjin to the southeast; together, the three divisions form the Jingjinji megalopolis and the national capital region of China.", "output_text": "over 21 million people"}
{"input_text": "question: How many parishes are there in Louisiana? context: The U.S. state of Louisiana is divided into 64 parishes (French: paroisses) in the same manner that 48 other states of the United States are divided into counties, and Alaska is divided into boroughs.", "output_text": "64"}
{"input_text": "question: How many churches in Texas? context: In 2010, there were a number of religious congregations in the state of Texas.", "output_text": "27,848"}
{"input_text": "question: How many lakes in North Dakota? context: North Dakota has many lakes and rivers offering exciting action for walleye, northern pike, perch, bass, salmon, catfish and other game fish with seasons for most species open year-round.", "output_text": "400"}
{"input_text": "question: How many rivers in the United States? context: The United States of America has over 250,000 rivers, with a total of about 3,500,000 miles of rivers. The longest river in the USA is the Missouri River (it is a tributary of the Mississippi River and is 2,540 miles long), but the biggest in terms of water volume is the deeper Mississippi River", "output_text": "over 250,000"}
{"input_text": "question: How many mountains in Oregon? context: There are 4760 named mountain ranges in Oregon and approximately 3,764 mountains altogether.", "output_text": "3,764"}
{"input_text": "question: How many small businesses in the state of Vermont? context: Vermont has 78,883 small businesses, most of which are sole proprietors. Vermont small businesses employ 157,131 workers, which is 60.2% of the state's workforce. The top three industries for small business in Vermont are professional, scientific, and technical services; construction; and retail.", "output_text": "78,883"}
{"input_text": "question: How many states grow apples? context: 2,500 varieties of apples are grown in the United States. 7,500 varieties of apples are grown throughout the world. 100 varieties of apples are grown commercially in the United States. Apples are grown commercially in 36 states.", "output_text": "36"}
{"input_text": "question: How many states grow mangos? context: Because mangos need a tropical climate to flourish, only Florida, California, Hawaii, and Puerto Rico grow mangos.", "output_text": "4"}
{"input_text": "question: How often does it storm in Missouri? context: Thunderstorms normally occur between 40 and 50 days per year. During any year, there are usually a few of these thunderstorms that are severe, and produce large hail and damaging winds. Tornadoes have produced extensive damage and loss of life in the St. Louis area.", "output_text": "Thunderstorms normally occur between 40 and 50 days per year."}
The question and context fields are not required for every task, but all examples must follow the same format. As mentioned in the documentation above, production traffic should have the same fields in the same format so that the model recognizes the pattern.
See another example here:
{"input_text": "Given the following Food Product information classify it into one of the following classes: [Contains, Does not contain] allergens Food Product:Almond Cookies, Main Ingredient:Almonds, Sweetener:Sugar, Fat[oil]:Butter, Seasoning:Flour", "output_text": "Contains"}
{"input_text": "Given the following Food Product information classify it into one of the following classes: [Contains, Does not contain] allergens Food Product:Chicken Noodle Soup, Main Ingredient:Chicken broth, Sweetener:None, Fat[oil]:None, Seasoning:Salt", "output_text": "Contains"}
{"input_text": "Given the following Food Product information classify it into one of the following classes: [Contains, Does not contain] allergens Food Product:Chicken Noodle Soup, Main Ingredient:Chicken broth, Sweetener:None, Fat[oil]:None, Seasoning:Salt", "output_text": "Contains"}
{"input_text": "Given the following Food Product information classify it into one of the following classes: [Contains, Does not contain] allergens Food Product:Cheddar Cheese, Main Ingredient:Cheese, Sweetener:None, Fat[oil]:None, Seasoning:Salt", "output_text": "Contains"}
{"input_text": "Given the following Food Product information classify it into one of the following classes: [Contains, Does not contain] allergens Food Product:Ranch Dressing, Main Ingredient:Buttermilk, Sweetener:Sugar, Fat[oil]:Vegetable oil, Seasoning:Garlic, herbs", "output_text": "Contains"}
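If it helps, records in the question/context layout shown above can be generated from a list of (question, context, answer) tuples with a few lines of Python. This is only a sketch: `make_record` and `to_jsonl` are names made up for illustration and are not part of any Vertex AI SDK, and the example row is adapted from the Louisiana sample.

```python
import json


def make_record(question: str, context: str, answer: str) -> dict:
    """Build one tuning example in the 'question: ... context: ...' layout."""
    return {
        "input_text": f"question: {question} context: {context}",
        "output_text": answer,
    }


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


# Hypothetical input: one tuple per hand-written example.
examples = [
    (
        "How many parishes are there in Louisiana?",
        "The U.S. state of Louisiana is divided into 64 parishes.",
        "64",
    ),
]

jsonl_text = to_jsonl(make_record(*e) for e in examples)

# Write the result to the file that will be uploaded for tuning.
with open("tuning-data.jsonl", "w", encoding="utf-8") as f:
    f.write(jsonl_text)
```

Keeping the prompt layout in one function makes it easier to send production prompts in exactly the same format later.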
So there's no way around the input_text/output_text pairing? This doesn't work well for us, since we have hundreds of documents we want to tune the LLM on. Creating a question/answer-style pairing for each document is a large lift.
Yes, you are correct. The dataset for tuning the model requires those two fields to identify the prompt and the example response.
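Before uploading, it may be worth checking locally that every line of the file has both fields. The following is a minimal sketch, assuming only the two-field requirement discussed in this thread; `find_invalid_lines` is a hypothetical helper and does not reproduce whatever validation Vertex AI itself performs.

```python
import json

REQUIRED_FIELDS = ("input_text", "output_text")


def find_invalid_lines(jsonl_text: str) -> list:
    """Return (line_number, reason) pairs for lines that fail the check."""
    problems = []
    for number, line in enumerate(jsonl_text.splitlines(), start=1):
        if not line.strip():
            continue  # ignore blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError as err:
            problems.append((number, f"invalid JSON: {err.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        for field in REQUIRED_FIELDS:
            value = record.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append((number, f"missing or empty {field}"))
    return problems


# Made-up sample: the second line has no output_text, like the entry
# in the original question.
sample = (
    '{"input_text": "question: How many states grow apples?", "output_text": "36"}\n'
    '{"input_text": "How to report a problem"}\n'
)

print(find_invalid_lines(sample))  # [(2, 'missing or empty output_text')]
```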
Reference for dataset formatting: