Data Cloud Generative AI in BigQuery

ikramberrada · 03-24-2024 01:44 PM

Organizations worldwide are turning to Artificial Intelligence (AI) to unlock new insights and efficiencies. To get the most from it, they need a data and AI platform that works with their enterprise data, both structured and unstructured, while keeping that data secure and governed. In response, we are announcing innovations that tighten the connection between data and AI and offer increased scale and efficiency through the integration of BigQuery and Vertex AI. These advancements help organizations simplify multimodal generative AI, unlock value from unstructured data, and build AI-powered search capabilities directly into their data analytics workflows.

Simplifying Multimodal Generative AI

Multimodal generative AI allows models to process and generate content across multiple data modalities, such as text, images, and videos. Google Cloud is making strides in this area by integrating Gemini models, including Gemini 1.0 Pro, into BigQuery ML. This integration lets users call generative AI through familiar SQL statements, providing access to capabilities like text summarization and sentiment analysis directly within the BigQuery console. By blending structured and unstructured data with generative AI models, organizations can create analytical applications such as real-time customer sentiment analysis and personalized content generation.
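As a rough illustration of the "familiar SQL statements" mentioned above, the Python sketch below assembles a BigQuery ML `ML.GENERATE_TEXT` query for sentiment analysis. The model, table, and column names (`my_dataset.gemini_model`, `my_dataset.reviews`, `review_text`) are hypothetical, and the sketch assumes a remote model over a Gemini endpoint has already been created in the dataset. The option values are illustrative, not recommendations.

```python
def build_sentiment_query(model: str, table: str, text_column: str,
                          max_output_tokens: int = 64) -> str:
    """Return a SQL string that asks a remote LLM for the sentiment of each row.

    All identifiers passed in are placeholders chosen by the caller; nothing
    here talks to BigQuery, it only builds the statement text.
    """
    instruction = ("Classify the sentiment of this review as "
                   "positive, negative, or neutral: ")
    return (
        "SELECT ml_generate_text_llm_result AS sentiment, prompt\n"
        "FROM ML.GENERATE_TEXT(\n"
        f"  MODEL `{model}`,\n"
        # The input query must expose a column named `prompt`.
        f"  (SELECT CONCAT('{instruction}', {text_column}) AS prompt FROM `{table}`),\n"
        f"  STRUCT(0.0 AS temperature, {max_output_tokens} AS max_output_tokens,\n"
        "         TRUE AS flatten_json_output)\n"
        ")"
    )


# Hypothetical names, for illustration only.
query = build_sentiment_query("my_dataset.gemini_model",
                              "my_dataset.reviews",
                              "review_text")
print(query)
```

The resulting string can be pasted into the BigQuery console or submitted through any BigQuery client.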

Advantages of BigQuery ML

BigQuery ML offers several advantages over other approaches to using ML or AI with a cloud-based data warehouse:

- BigQuery ML democratizes the use of ML and AI by empowering data analysts, the primary data warehouse users, to build and run models using existing business intelligence tools and spreadsheets. Predictive analytics can guide business decision-making across the organization.
- You don't need to program an ML or AI solution using Python or Java. You train models and access AI resources by using SQL, a language that's familiar to data analysts.
- BigQuery ML increases the speed of model development and innovation by removing the need to move data from the data warehouse. Instead, BigQuery ML brings ML to the data, which offers the following advantages:
  - Reduced complexity because fewer tools are required.
  - Increased speed to production because moving and formatting large amounts of data for Python-based ML frameworks isn't required to train a model in BigQuery.
For more information, watch the video How to accelerate machine learning development with BigQuery ML.

Supported models

A model in BigQuery ML represents what an ML system has learned from training data. The following sections describe the types of models that BigQuery ML supports.

Internally trained models

The following models are built in to BigQuery ML:

Linear regression is for forecasting. For example, this model forecasts the sales of an item on a given day. Labels are real-valued, meaning they cannot be positive infinity or negative infinity or a NaN (Not a Number).
Logistic regression is for the classification of two or more possible values such as whether an input is low-value, medium-value, or high-value. Labels can have up to 50 unique values.
K-means clustering is for data segmentation. For example, this model identifies customer segments. K-means is an unsupervised learning technique, so model training doesn't require labels or split data for training or evaluation.
Matrix factorization is for creating product recommendation systems. You can create product recommendations using historical customer behavior, transactions, and product ratings, and then use those recommendations for personalized customer experiences.
Principal component analysis (PCA) is the process of computing the principal components and using them to perform a change of basis on the data. It's commonly used for dimensionality reduction by projecting each data point onto only the first few principal components to obtain lower-dimensional data while preserving as much of the data's variation as possible.
Time series is for performing time series forecasts. You can use this feature to create millions of time series models and use them for forecasting. The model automatically handles anomalies, seasonality, and holidays.

You can perform a dry run on the CREATE MODEL statements for internally trained models to get an estimate of how much data they will process if you run them.
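One way to run such a dry run from code is sketched below, assuming the `google-cloud-bigquery` Python client library is installed and application credentials are configured. The helper names `estimate_bytes` and `format_bytes` are hypothetical. The library is imported inside the function so that the formatting helper can be used on its own.

```python
def format_bytes(num_bytes):
    """Render a byte count with binary units, e.g. 1536 -> '1.50 KiB'."""
    units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    size = float(num_bytes)
    for unit in units:
        if size < 1024 or unit == units[-1]:
            return f"{size:.2f} {unit}"
        size /= 1024


def estimate_bytes(client, sql):
    """Dry-run `sql` and return the number of bytes BigQuery estimates it would process.

    `client` is expected to be a google.cloud.bigquery.Client.
    """
    # Lazy import: only needed when a dry run is actually requested.
    from google.cloud import bigquery

    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)  # validated and estimated, not executed
    return job.total_bytes_processed


# Usage (requires credentials and a real statement, so it is left commented out):
# from google.cloud import bigquery
# client = bigquery.Client()
# print(format_bytes(estimate_bytes(client, create_model_sql)))
```

Here `create_model_sql` stands for whatever `CREATE MODEL` statement you want to estimate.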

Externally trained models

The following models are external to BigQuery ML and trained in Vertex AI:

Deep neural network (DNN) is for creating TensorFlow-based deep neural networks for classification and regression models.
Wide & Deep is useful for generic large-scale regression and classification problems with sparse inputs (categorical features with a large number of possible feature values), such as recommender systems, search, and ranking problems.
Autoencoder is for creating TensorFlow-based models with the support of sparse data representations. You can use the models in BigQuery ML for tasks such as unsupervised anomaly detection and non-linear dimensionality reduction.
Boosted Tree is for creating classification and regression models that are based on XGBoost.
Random forest is for building an ensemble of decision trees at training time for classification, regression, and other tasks.
AutoML is a supervised ML service that builds and deploys classification and regression models on tabular data at high speed and scale.

You can't perform a dry run on the CREATE MODEL statements for externally trained models to get an estimate of how much data they will process if you run them.

Unlocking Value from Unstructured Data

Unstructured data, including images, documents, and audio files, represents a goldmine of untapped information for organizations. However, extracting meaningful insights from unstructured data can be challenging. To address this challenge, Google Cloud is expanding the capabilities of BigLake, a unified data management framework, to enable analysis, search, and processing of unstructured data. Leveraging Vertex AI's document processing and speech-to-text APIs, organizations can extract valuable insights from documents and audio files, facilitating tasks such as content generation, sentiment analysis, and entity extraction. This opens up new possibilities for industries ranging from finance to healthcare, allowing them to derive actionable insights from previously inaccessible data sources.

Improving Vector Search with Unstructured Data

Vector search, often implemented as approximate nearest-neighbor search, is a powerful technique for enabling semantic search, similarity detection, and retrieval-augmented generation (RAG) with large language models (LLMs). Google Cloud recently announced the preview of BigQuery vector search integrated with Vertex AI, providing users with the ability to perform vector similarity search on their BigQuery data. Supplying retrieved context to AI models in this way can reduce ambiguity and improve factual accuracy, which in turn improves the quality of search results and AI-driven applications. By leveraging vector search, organizations can enhance product recommendations, automate content retrieval, and streamline information retrieval processes.
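The core idea behind vector similarity search can be sketched in a few lines of plain Python. This is an exact, brute-force version using cosine distance, not the approximate, index-backed search that BigQuery provides, and the tiny three-dimensional "embeddings" below are made up for illustration; real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 0 means the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine distance is undefined for zero vectors")
    return 1.0 - dot / (norm_a * norm_b)


def top_k(query, corpus, k=2):
    """Return the ids of the k corpus vectors closest to `query`.

    Every vector is compared with the query, which is fine for a handful of
    rows but is exactly the cost that approximate indexes are built to avoid.
    """
    scored = sorted((cosine_distance(query, vec), doc_id)
                    for doc_id, vec in corpus.items())
    return [doc_id for _, doc_id in scored[:k]]


# Made-up embeddings for three help-center articles.
corpus = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.1],
    "return an item": [0.8, 0.2, 0.1],
}
print(top_k([1.0, 0.0, 0.0], corpus, k=2))
```

In a RAG setup, the text behind the returned ids would be added to the LLM prompt as context.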

The integration of generative AI into BigQuery marks a significant milestone in the evolution of data analytics. By simplifying access to multimodal generative AI, unlocking insights from unstructured data, and improving search capabilities, Google Cloud is empowering organizations to derive greater value from their data. As businesses embark on their journey towards digital transformation, the possibilities afforded by generative AI are endless. With Google Cloud as a partner, organizations can navigate this journey with confidence, leveraging the latest advancements in AI and data analytics to drive innovation, unlock new insights, and stay ahead in an increasingly competitive landscape.

Supported AI resources

You can use remote models to access AI resources like LLMs from BigQuery ML. BigQuery ML supports the following AI resources:

Generative AI by using one of the Vertex AI text-bison* natural language foundation models.
Text embedding by using one of the Vertex AI textembedding-gecko* text embedding foundation models.
Natural language processing by using the Cloud Natural Language API.
Machine translation by using the Cloud Translation API.
Document processing by using the Document AI API.
Audio transcription by using the Speech-to-Text API.
Computer vision by using the Cloud Vision API.
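As a hedged sketch of how the resources above are reached, the Python below builds `CREATE MODEL ... REMOTE WITH CONNECTION` statements. Foundation models are selected with an `ENDPOINT` option and Cloud AI services with a `REMOTE_SERVICE_TYPE` option. The option values shown reflect my understanding of the BigQuery ML documentation and should be checked against the current reference. The model and connection names are hypothetical. Document AI and Speech-to-Text are left out because they need extra options, such as a processor or recognizer, that this sketch does not cover.

```python
# (option name, option value) per AI resource; verify values against current docs.
REMOTE_OPTIONS = {
    "generative ai": ("ENDPOINT", "text-bison"),
    "text embedding": ("ENDPOINT", "textembedding-gecko"),
    "natural language": ("REMOTE_SERVICE_TYPE", "CLOUD_AI_NATURAL_LANGUAGE_V1"),
    "translation": ("REMOTE_SERVICE_TYPE", "CLOUD_AI_TRANSLATE_V3"),
    "vision": ("REMOTE_SERVICE_TYPE", "CLOUD_AI_VISION_V1"),
}


def build_remote_model(model_name, connection, resource):
    """Assemble a CREATE MODEL statement for a remote model over `resource`."""
    option, value = REMOTE_OPTIONS[resource.lower()]
    return (
        f"CREATE OR REPLACE MODEL `{model_name}`\n"
        f"REMOTE WITH CONNECTION `{connection}`\n"
        f"OPTIONS ({option} = '{value}')"
    )


# Hypothetical model and connection names.
print(build_remote_model("my_dataset.vision_model",
                         "my-project.us.my_connection",
                         "vision"))
```

Once a remote model exists, it is queried with the matching BigQuery ML function for that resource rather than trained.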

BigQuery ML and Vertex AI

BigQuery ML integrates with Vertex AI, which is the end-to-end platform for AI and ML in Google Cloud. When you register your BigQuery ML models to Model Registry, you can deploy these models to endpoints for online prediction. For more information, see the following:

To learn more about using your BigQuery ML models with Vertex AI, see Manage BigQuery ML models with Vertex AI.
If you aren't familiar with Vertex AI and want to learn more about how it integrates with BigQuery ML, see Vertex AI for BigQuery users.
Watch the video How to simplify AI models with Vertex AI and BigQuery ML.

Join Us for the Future of Data and Generative AI

As organizations continue to explore the possibilities of generative AI, we remain committed to driving innovation in data analytics. To learn more about the latest advancements in generative AI and data analytics, sign up for the Data Cloud Innovation Live webcast on March 7, 2024.

And be sure to join us at Next ’24 to get the inside track on all the latest product news and innovations to accelerate your transformation journey this year.
