This content, written by Lloyd Tabb, was initially posted in Looker Blog on Nov 5, 2015. The content is subject to limited support.
Today Looker is announcing our integration with BigQuery, we think its a big deal.
Analytic databases are getting faster and a lot easier to deploy. Technologies are leap-frogging each other quickly. MPP databases have opened up new ways to look at really large datasets. Products like Amazon’s AWS Redshift changed the game by deploying quickly in the cloud. Now those cloud products are becoming even more powerful through truly elastic multi-tenant architectures. This progress is a perfect match for Looker’s in-database architecture… and Looker customers are benefiting from each step forward.
We think BigQuery is the next step in data warehouse evolution.
A true multi-tenant clustered database, BigQuery gives you Redshift’s ease-of-setup, but adds essentially infinite scale-out. One of our customers is pumping 50TB of data per day into BigQuery, and they use Looker to execute queries that might scan 500GB. They have lots of people looking at this data. Because of the way BigQuery works, it never gets slow when multiple people use it at the same time. All queries return in seconds.
Growth in BigQuery is smooth. If you’re running your own cluster, you have to decide at some point whether to make the cluster bigger or to remove data. Because of BigQuery’s multi-tenant architecture, this doesn’t happen. Like a Tesla going from 0 to 120 mph, there are no shift points, just smooth acceleration.
Big big data. High volume. Super easy to manage. No limits. I think we’re in love.
BigQuery is a multi-tenant database. That means part of your data is sitting on a machine next to somebody else's data, maybe on a single machine or on a number of them. In fact, you have no control over where your data sits; you just know it’s in “the cloud.” When you push in more data, your bill goes up, but you never run out of space.
Because of this architecture, BigQuery — unlike all other databases — never runs out of storage and never gets slow. (While other big data engines don’t operate this way, almost all cloud apps do. Salesforce and Gmail, for example, are both multi-tenant.)
What’s even more surprising is that, whether I’m querying small amounts of data or large amounts of data, the results seem to take about the same amount of time, with an average wait of 20 seconds or less.
Someone who knows the internals of Google once told me that when you type a query into the Google search engine, your query hits about 3,000 machines before you see your search results. Each one of these machines knows a small portion of the web — and looks to see if it has a piece that might be interesting and returns it to a server, which then puts all the pieces together on a single search results page.
BigQuery operates in a similar fashion: by widely distributing the data and widely querying it when asked. Every day I run maybe 100 Google search queries and close to 150 database queries. It would be ludicrous to even consider purchasing 3,000 machines to perform these queries in-house. On the other hand, renting a database from BigQuery lets me get the benefits of a huge database cluster at a very low cost.
Loading data into BigQuery is a really simple process. You only need push log files into Google storage and then tell BigQuery where the data is. Because the Google cluster is so big, data ingestion happens very fast.
With other databases, when things start to get slow, you start adding machines to the cluster and rebalancing. When things are slow, you’re never quite sure why. Are we having more load? Or rogue queries? Or is your data just getting bigger? With BigQuery, it doesn’t matter. There’s no cluster management. Google manages all the machines, and your queries are always fast.
BigQuery has a number of unique (or mostly unique) features. In our BigQuery release, we’ve built out Looker to take advantage of them.
When data gets big, you often split it up by time so you can manage it better. We do this manually all the time. For example, in the old world, we might have had filing cabinets for each year of the business. Want to look up a transaction that happened in 2001? Just start at the right cabinet.
With data, we do the same thing. For example, Most system log files are broken up by day. In any given file, there is a day’s worth of data. BigQuery can partition data the same way, one table for each day, treating all the separate days/tables as a single table. When you decide you no longer need some data, you simply delete a day/table and everything continues to work as usual. Sunsetting data in BigQuery is much easier than in other databases.
LookML can take complete advantage of partitioned tables.
When running a query in a regular database, the query time goes up as the amount of data you query goes up. Naturally, you want to make queries faster, so you optimize them and they become fast again. Since BigQuery runs in such a large cluster, the query never gets slow, it just gets more expensive.
When using BigQuery in Looker, we show you how much data is going to be scanned before you run your query. As you see your queries scanning more and more data, you can optimize them to be more efficient.
One of the really nice features of BigQuery is how aggressively it caches. If you run a query and the answer comes from the cache, there is essentially no cost. Most databases, if they cache at all, just cache the results. BigQuery caches all the intermediate results — so if you build a set of related queries, they’re all fast and inexpensive. BigQuery does this better than any other data engine we’ve seen.
Unfortunately, you don’t get much control over how long things are cached for. Looker’s derived tables give you complete control over caching and automated transformation.
Nested objects (think customers having multiple addresses) are a non-standard SQL extension. Several data engines support this, but all in very different ways. BigQuery’s implementation of nested objects works reasonably well.
We added syntax to LookML to build these nested objects into your models.
BigQuery isn’t all roses. The SQL is a little weird and non-standard. Looker’s symmetric aggregates and time zones currently can’t be supported. With BigQuery nested objects, there are some arbitrary limitations about how they can be used.
What’s exciting is that BigQuery continues to get better and better.
BigQuery’s architecture is right. A giant shared cluster has tremendous advantage. BigQuery still has its quirks, but it sure looks like the future.
Want to learn more about Looker & BigQuery? Read our full