
How to correctly reference a dataframe in a Gemini Pro data summary prompt in a Vertex AI notebook

I'm running a prompt with Gemini Pro to analyze a dataset that includes entity sentiment analysis datapoints on user review content. The dataset is a mix of mostly string and numerical data types, plus a date timestamp. I want the model to analyze the data and write a summary, similar to an example I include in the prompt.

The dataset is a pandas dataframe in the same Python notebook as the Gemini Pro prompt. Its shape is (166, 4), so I presume it's not too large, although I have no idea what 'too large' would be.

I'm using a structured prompt with Context, Examples, and Input sections, where the Input section contains the instructions and the reference to the dataframe in this line of code:

`Here is the data to be analyzed and summarized: {df_string}`

`df_string` is the dataframe converted to a string, which I'd read needed to be done. I've also tried this without the conversion, replacing `df_string` with a direct reference to the dataframe. Both versions produce many similar errors in their summaries.
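For reference, here is a simplified version of what I'm doing. The dataframe contents below are just placeholders standing in for my real (166, 4) data, and I've left out the actual Gemini call since the question is about how the data gets into the prompt:

```python
import pandas as pd

# Placeholder dataframe; the real one has 166 rows of review text,
# entity sentiment values, and a timestamp column.
df = pd.DataFrame({
    "review": ["Great app, love it", "Too slow to load"],
    "entity": ["app", "app"],
    "sentiment_score": [0.9, -0.6],
    "timestamp": pd.to_datetime(["2024-01-02", "2024-01-03"]),
})

# Convert the whole dataframe to plain text before interpolating it.
df_string = df.to_string(index=False)

prompt = f"""Context: You are analyzing entity sentiment data from user reviews.
Examples: <example summaries go here>
Input:
Here is the data to be analyzed and summarized: {df_string}"""
```

The alternative I mentioned just interpolates `df` directly in the f-string, which implicitly calls `str(df)` and truncates long frames with `...`, so `to_string()` at least keeps all the rows visible to the model.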

The summaries contain so many factual errors and misstatements about the dataframe being analyzed that I'm wondering whether either way of passing the data into the prompt is technically correct.

Any suggestions on how to approach this would be greatly appreciated.


Hi Memerunner, 

I would suggest passing both `df_string` and the column names (e.g., `df.columns`) to the model, along with the required business context as guidance. Additionally, few-shot examples of a large dataframe, a moderate one, and an empty dataframe, each paired with a manually written summary, would really help, together with stricter instructions. From what I have observed so far: if you aggregate the dataframe first, or pass a pandas Series produced by `value_counts()` on the specific categorical columns you care about alongside the dataframe variable in the prompt, the quality of the summary really improves.
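A rough sketch of the aggregation idea (the column names here are just placeholders for whichever categorical columns your dataset actually has):

```python
import pandas as pd

# Placeholder data; substitute your own categorical columns.
df = pd.DataFrame({
    "entity": ["app", "support", "app", "pricing"],
    "sentiment": ["positive", "negative", "negative", "positive"],
})

# Pre-aggregate instead of (or in addition to) sending raw rows:
# value_counts() gives the model compact, exact frequencies it
# cannot easily miscount.
entity_counts = df["entity"].value_counts().to_string()
sentiment_counts = df["sentiment"].value_counts().to_string()

prompt = f"""Here are aggregated counts from the review dataset.

Entity counts:
{entity_counts}

Sentiment counts:
{sentiment_counts}

Summarize the main patterns in one paragraph."""
```

The point is that the model no longer has to tally 166 rows itself; the arithmetic is done in pandas and the model only has to describe the result.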