BigQuery is Google Cloud’s serverless enterprise data warehouse that enables users to centrally manage and analyze large amounts of data.
BigQuery isn’t just for “Big” amounts of data or large enterprises though - it can support a wide variety of workloads and any organization, scaling based on their needs..
In this article, we explore five reasons to use BigQuery at the heart of your data analytics platform, including rich capabilities such as built-in machine learning, Spark integration, geospatial analysis, and near-real time data ingestion.
This article is based on a recent session from the 2023 Cloud Technical Series. Register here to watch on demand!
If you have any questions, please leave a comment below and someone from the Community or Google Cloud team will be happy to help.
If you think about a typical machine learning pipeline - from gathering data to feature engineering to training the model, evaluating the model and making sure it’s accurate, and then finally deploying it and making predictions based off of it, then you have lots of different processing places to manage.
With BigQuery, you can do all of this within the database because that’s where your data lives. You don't need to export data to other applications or wait for limited resources from a data science team.
BigQuery ML enables you to build and operationalize machine learning models on structured, semi-structured, and now unstructured data directly inside BigQuery, using simple SQL - in a fraction of the time.
So what kind of machine learning models can you get started with in BigQuery?
Supervised machine learning is where you’re asking the model for a specific outcome you want it to predict for you. Supervised machine learning is analogous to learning a subject by studying a set of questions and their corresponding answers. After mastering the mapping between questions and answers, a student can then provide answers to new (never-before-seen) questions on the same topic.
For example, you can use a supervised machine learning model to answer the question, which of these products do we think these customers will be interested in?
Unsupervised machine learning is where you don’t really know what outcome the model is going to tell you. The most common use of unsupervised machine learning is to cluster data into groups to perform analysis and identify patterns. For example, an unsupervised machine learning algorithm can cluster songs based on various properties of the music. The resulting clusters can become an input to other machine learning algorithms (for example, to a music recommendation service). Clustering can help when useful labels are scarce or absent.
As discussed during the 2023 Cloud Technical Series, K-means clustering can help identify customer segments. You can run the model across your customer base to see customers that are similar and engage with them in a way that’s helpful for that specific group of customers.
Time series is for performing time-series forecasts. For example, you could use time series analysis to forecast the future sales of winter coats by month based on historical sales data. You can use this feature to create millions of time series models and use them for forecasting. The model automatically handles anomalies, seasonality, and holidays.
Below is a handy machine learning model selection guide to help you identify which BigQuery ML model will be best for your use case.
Lastly, it’s important to mention the Model Ops capabilities of BigQuery ML. Training the model is just one part - you also have to make sure you can manage your models well, including exporting models, updating metadata, deleting models, etc. Learn more about managing your BigQuery ML models here.
Over one third of organizations believe that data analytics and machine learning have the most potential to significantly alter the way they run their business over the next three to five years. However, only 26% of organizations are data driven. One of the biggest reasons for this gap is that a major portion of the data generated today is unstructured, which includes images, documents, and videos. It’s estimated to cover roughly up to 80% of all data, which has remained untapped by organizations.
BigQuery object tables are a new type of table in BigQuery (currently in Preview) that provides a structured record interface for unstructured data in Google Cloud Storage. With object tables, you can extend best practices of securing, sharing, and governing structured data to unstructured, without needing to learn or deploy new tools. You can perform analysis with remote functions or perform inference by using BigQuery ML, and then join the results of these operations with the rest of your structured data in BigQuery.
By analyzing unstructured data natively in BigQuery, you can:
Learn more about BigQuery object tables in the short video below.
Serverless Spark is Google Cloud’s autoscaling, serverless Spark solution, integrated with Google-native and open source tools so you can develop and run Spark where you need it across all use cases, including ETL, data science, and exploration. Using this capability, developers can write applications and pipelines that autoscale without any manual infrastructure provisioning or tuning.
With Serverless Spark, you can use all Spark types (PySpark, Spark SQL, Spark R, Spark Java/Scala), but if you’re using PySpark, you can actually do it through the BigQuery interface. You can write PySpark code in the BigQuery editor, and the code is executed without the need for infrastructure provisioning.
Learn more about Spark on Google Cloud solutions here.
Many critical business decisions revolve around location data. For example, you may record the latitude and longitude of your delivery vehicles or packages over time. You may also record customer transactions and join the data to another table with store location data.
With BigQuery, you can use this type of location data to determine when a package is likely to arrive or to determine which customers should receive a mailer for a particular store location. Geospatial analytics let you analyze and visualize geospatial data in BigQuery by using geography data types and GoogleSQL geography functions.
Here are some additional resources on getting started with geospatial data analysis in BigQuery:
Near real-time data streaming analytics is a common data engineering pattern for businesses who want to derive insights and react to new information in real time. Google Cloud provides a three-product pairing that works well in this particular use case: Pub/Sub for ingesting the data into a messaging queue, Dataflow to transform that data, and then BigQuery to analyze and extract insights from the data.
However, you can also write directly from Pub/Sub to BigQuery using a BigQuery subscription, and then use SQL to transform the data. BigQuery is designed for real-time streaming so you can stream events as they come in and they’ll be available for querying almost immediately.
In addition, you can use a service called BI Engine, which is an in-memory analysis engine designed to cache data from BigQuery so that when you’re querying, you have very low latency response for your queries.
When BI Engine is looking at a streaming table (a table with streaming data entered into it), BI Engine will determine how often you’re querying that table and will preemptively pull that data into its own memory cache so it’s ready for you when you’re querying. This accelerates your query time and reduces the time it takes to get insights from your real-time data.
The BI Engine SQL interface expands BI Engine to integrate with other business intelligence (BI) tools such as Looker, Tableau, Power BI, and custom applications to accelerate data exploration and analysis. This page provides an overview of the BI Engine SQL interface, and the expanded capabilities that it brings to BI Engine.
Are you using BigQuery today? Have you explored any of the five BigQuery capabilities we’ve highlighted in this article?
We’d love to hear from you in the comments how you’re using BigQuery today (or how you’d like to start using BigQuery) and if you have any questions or suggestions!
Lastly, here’s a roundup of some additional resources to help you make the most of BigQuery in your organization:
Special thanks to Russell Nash, APAC Solutions Lead Data Analytics (@russnash) for delivering the original content of this blog during the 2023 Cloud Technical Series.