System Design: Analytics Best Practices

In this article, you'll find recommendations and best practices focused on the topic of Analytics, as part of the System Design Pillar of the Google Cloud Architecture Framework.

Throughout this article, we often refer to the "Analyze your data" documentation. We suggest you review that documentation to learn basic concepts before evaluating the assessment questions and recommendations that follow.

Data ingestion

Do you need a hands-off approach to data ingestion with automated tooling?

  • Bringing data from other systems into the cloud can be challenging. Depending on your team's skills and capabilities, you may want tools that let you automate data processes through a user interface (UI). One such tool is Cloud Data Fusion, a fully managed, cloud-native data integration service with connectors and plugins that let you integrate data from external sources without writing code.

  • If you need to be more involved and would like to write some code, you could opt for Dataflow, Pub/Sub, or BigQuery. You can use the gsutil command to bring data into Cloud Storage buckets, but keep in mind that gsutil is recommended only for transfers of up to 1 TB.
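
If you'd rather script an upload than call gsutil, a minimal sketch using the Python client library for Cloud Storage might look like the following; the bucket and file names are placeholders:

    from google.cloud import storage

    # Create a client using Application Default Credentials.
    client = storage.Client()

    # Placeholder bucket and object names.
    bucket = client.bucket("my-landing-bucket")
    blob = bucket.blob("raw/sales/2021-12-01.csv")

    # Upload a local file into the bucket.
    blob.upload_from_filename("/data/exports/sales-2021-12-01.csv")
    print(f"Uploaded to gs://{bucket.name}/{blob.name}")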

What is your data source? Streaming or batch?

  • For streaming workloads where you're handling global events with low latency requirements, use Pub/Sub (see the publishing sketch after this list). If you want to store streaming data for immediate analytics, or to make it available to an analytics-focused workforce, consider streaming the data into BigQuery.

  • If you're streaming data from another system, such as Apache Kafka running on-premises or in another cloud, use the existing Google-provided Dataflow template that reads data from Kafka and writes it into BigQuery.

  • For batch workloads, getting data into Cloud Storage is the first step. Use gsutil or a Dataflow template to fetch data and store it in one of the many storage options available in Google Cloud.
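
To make the streaming path above concrete, here is a minimal, hedged sketch of publishing an event to Pub/Sub with the Python client library; the project and topic names are placeholders:

    from google.cloud import pubsub_v1

    publisher = pubsub_v1.PublisherClient()
    # Placeholder project and topic.
    topic_path = publisher.topic_path("my-project", "clickstream-events")

    # Messages are raw bytes; keyword arguments become string attributes.
    future = publisher.publish(
        topic_path,
        data=b'{"event": "page_view", "page": "/pricing"}',
        origin="web-frontend",
    )
    print(f"Published message ID: {future.result()}")

From there, a Dataflow pipeline can read the topic and stream the events into BigQuery.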

Is your data source on-premises or on another cloud?

  • You can ingest data from other cloud providers using Data Fusion, Storage Transfer Service, or BigQuery Data Transfer Service.

  • For on-premises ingestion, depending on the data volume and your team's skill set, you could choose Transfer Appliance, Storage Transfer Service, or Data Fusion with a suitable connector (for example, the JDBC connector).

  • Storage Transfer Service writes data to a Cloud Storage bucket, while BigQuery Data Transfer Service writes data only into a BigQuery dataset. Data Fusion, on the other hand, can write to any of many supported destinations.

Are you ingesting data or migrating from a data warehouse?

  • If you're migrating an existing data warehouse rather than ingesting net-new data, see the data warehousing guidance later in this article.

How much data do you want to ingest?

  • For streaming ingestion, Pub/Sub scales to tens of gigabytes of data per second. For lower volumes, from 1 MB to 1 GB per second, Pub/Sub Lite is recommended. If you expect your volumes to grow toward tens of gigabytes per second, use Pub/Sub to guard against future tech debt.

  • For batch ingestion, the specific Google Cloud service you use depends on how much data you want to transfer in total and how quickly you want to do it. We recommend that you explore the data transfer options outlined in Google's documentation.

Do you need to ingest data at a regular frequency or on a schedule?

  • There are several options available, depending on your current technical stack, the amount of data you need to ingest, and your team's expertise. For transfers of up to 1 TB, we recommend using the gsutil command with cron; for production transfers, follow Google's published best practices for gsutil.

  • Storage Transfer Service and BigQuery Data Transfer Service both let you schedule ingestion jobs. If you want fine-grained control over the timing of ingestion, or over the data sources and destinations, consider a workflow management system such as Cloud Composer. You can also create a custom serverless application with Cloud Scheduler, Cloud Functions, and Pub/Sub, as in the sketch below.
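
As one hedged sketch of that serverless pattern: Cloud Scheduler publishes to a Pub/Sub topic on a cron schedule, and a Pub/Sub-triggered Cloud Function starts a BigQuery load job. All project, bucket, and table names below are placeholders:

    from google.cloud import bigquery

    def load_daily_export(event, context):
        """Pub/Sub-triggered Cloud Function that loads new files into BigQuery."""
        client = bigquery.Client()

        job_config = bigquery.LoadJobConfig(
            source_format=bigquery.SourceFormat.CSV,
            skip_leading_rows=1,
            write_disposition="WRITE_APPEND",
        )

        # Placeholder source bucket and destination table.
        load_job = client.load_table_from_uri(
            "gs://my-landing-bucket/daily/*.csv",
            "my-project.analytics.daily_sales",
            job_config=job_config,
        )
        load_job.result()  # Wait for completion; raises on failure.

The Cloud Scheduler job's cron expression (for example, 0 6 * * *) then controls exactly when ingestion runs.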

Do you want to ingest data from IoT devices?

  • Use IoT Core to connect and manage devices, and to collect and store your device data in the cloud.

Do you need to ingest data from a File Transfer Protocol (FTP) or SSH File Transfer Protocol (SFTP) server?

  • You can create and manage simple bash scripts that read data from an FTP/SFTP server and write it to Cloud Storage, but such solutions are high-maintenance, error-prone, and don't provide end-to-end visibility into the process.

  • If you need a code-free environment, use Data Fusion, which has an FTP source plugin.

  • If you want to modernize and are looking for a long-term workflow solution, you can create a simple ingestion process in Cloud Composer, a fully managed workflow orchestration service that can read from and write to various sources and sinks; see the DAG sketch after this list.

  • Third-party tools can also read data from FTP locations into Google Cloud, which can be a good option if you already use such a solution.
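
For the Cloud Composer route mentioned above, a minimal, hedged DAG sketch using the Airflow Google provider's SFTP-to-Cloud Storage transfer operator might look like this; the connection ID, paths, and bucket name are placeholders:

    from datetime import datetime

    from airflow import DAG
    from airflow.providers.google.cloud.transfers.sftp_to_gcs import (
        SFTPToGCSOperator,
    )

    with DAG(
        dag_id="sftp_to_gcs_daily",
        start_date=datetime(2021, 12, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        # Copy new exports from the SFTP server into a landing bucket.
        fetch_exports = SFTPToGCSOperator(
            task_id="fetch_exports",
            sftp_conn_id="partner_sftp",  # Airflow connection to the server
            source_path="/exports/*.csv",
            destination_bucket="my-landing-bucket",
            destination_path="sftp/",
        )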

Do you use Apache Kafka? Do you want to integrate it with Google Cloud to ingest data?

  • As noted in the streaming guidance above, you can use the Google-provided Dataflow template that reads data from Kafka and writes it into BigQuery, whether Kafka runs on-premises or in another cloud.

Data storage

How will you be using your data that’s stored in Google Cloud? What is your downstream use case?

  • Understanding the downstream usage of your data can be a crucial deciding factor in what type of storage solution you choose. We recommend the following Google Cloud services depending on your downstream data access semantics:

    • File-based? Use Filestore.

    • Object-based? Use Cloud Storage.

    • Low latency? Use Bigtable.

    • Time series? Use Bigtable.

    • Online cache? Use Memorystore.

    • Transaction processing? Use Cloud SQL.

    • Business intelligence (BI) and analytics? Use BigQuery.

    • Batch processing? Use Cloud Storage; use Bigtable if the incoming data is time series and you need low-latency access to it; or use BigQuery if you prefer SQL.

Is your data structured, semi-structured, or unstructured?

  • For most unstructured data, such as documents, text files, audio and video files, and logs, an object-based store is the most suitable landing area. Once the data is there, it can be loaded into and processed by a processing environment.

  • For semi-structured data, such as XML or JSON, let your use cases and data access patterns guide your decision. You can load semi-structured datasets into BigQuery for automatic schema detection (see the load-job sketch after this list), or, if you have low latency requirements, you can load your JSON data into Bigtable. If you have legacy requirements or your applications work with relational databases, you can load these datasets into a relational store.

  • For structured data, such as CSV, Parquet, Avro, or ORC, where you have BI and analytics requirements and are SQL-focused, you can use BigQuery. If you’re looking to create a data lake on open standards and technologies, use Cloud Storage.
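
As a hedged sketch of the automatic schema detection mentioned above, loading newline-delimited JSON from Cloud Storage into BigQuery with the Python client might look like this; the bucket, dataset, and table names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
        autodetect=True,  # Let BigQuery infer the schema.
    )

    load_job = client.load_table_from_uri(
        "gs://my-landing-bucket/events/*.json",
        "my-project.analytics.raw_events",
        job_config=job_config,
    )
    load_job.result()  # Wait for the job to finish.

    table = client.get_table("my-project.analytics.raw_events")
    print(f"Loaded {table.num_rows} rows with an inferred schema.")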

Do you want to move on-premises Hadoop Distributed File System (HDFS) data to a less-expensive object storage system?

  • Depending on your requirements, there are two ways to move HDFS data into Google Cloud: the push model, in which the on-premises cluster runs the copy jobs and writes directly to Cloud Storage, or the pull model, in which a Dataproc cluster in Google Cloud pulls the data from the on-premises cluster. Both models use the hadoop distcp command (as sketched below), and both call for fine-grained planning before you start.
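
As a minimal, hedged sketch of the push model, you might invoke hadoop distcp from Python on the on-premises cluster (which needs the Cloud Storage connector installed so it can write to gs:// paths); the source and destination paths are placeholders:

    import subprocess

    # Push model: the on-premises cluster copies HDFS data directly
    # into a Cloud Storage bucket via the Cloud Storage connector.
    subprocess.run(
        [
            "hadoop", "distcp",
            "hdfs://namenode:8020/data/warehouse",  # placeholder source
            "gs://my-landing-bucket/warehouse",     # placeholder destination
        ],
        check=True,  # Raise an error if the copy job fails.
    )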

Do you intend to move away from HDFS?

  • Moving from HDFS to an object store has its pros and cons. Storing data in Cloud Storage is the most common choice enterprises make. The Cloud Storage connector is an open source connector that allows Apache Hadoop and Spark jobs to access data in Cloud Storage; it's preinstalled on Dataproc clusters and can easily be installed on other clusters as well.

Do you want to build a data lake in the cloud?

  • Data lake storage is usually backed by an object store. Cloud Storage lets you build a data lake and serves as a central data storage system. If you want to build a modern data platform, you can instead use BigQuery as your central data store. BigQuery is a modern data warehouse with separation of storage and compute. A data lake built on BigQuery lets you perform traditional analytics from within the BigQuery UI, and it also lets you access the stored data from other frameworks, such as Apache Spark, as sketched below.
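
As a hedged illustration of that Spark access path, reading a BigQuery table from PySpark with the open source spark-bigquery connector might look like the following; the table name is a placeholder, and the connector must be available on the cluster:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("bq-data-lake-demo").getOrCreate()

    # Read a BigQuery table directly into a Spark DataFrame.
    events = (
        spark.read.format("bigquery")
        .option("table", "my-project.analytics.raw_events")  # placeholder
        .load()
    )

    # Standard Spark transformations work on the resulting DataFrame.
    events.groupBy("event").count().show()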

Data processing and transformation

Are you aligned with open source software? If yes, do you want to continue using those in the cloud?

  • Many Google Cloud services align with open source software to make your transition to the cloud seamless. Google Cloud offers managed and serverless solutions with open APIs that are compatible with open source frameworks, so you can avoid vendor lock-in.

  • Dataproc is a Hadoop-compatible managed service that lets you host open source software in the cloud with little operational burden, providing a faster, easier-to-manage, and more cost-efficient way to run Hadoop or Spark clusters. Dataproc currently supports Spark, Hive, Pig, Presto, ZooKeeper, and others. It also offers the Apache Hive metastore as a managed service (Dataproc Metastore), so the metastore is no longer a single point of failure in the Hadoop ecosystem.

  • If you use Apache Beam as a combined batch and streaming processing engine, you can use Dataflow, which runs Apache Beam pipelines as a fully managed, serverless service. You write jobs in Beam, and Dataflow manages the execution environment for you, reducing operational complexity (see the pipeline sketch after this list).

  • If you use CDAP as your data integration platform and want to continue using that in the cloud, then use Cloud Data Fusion, a fully managed, cloud-native option.
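
A minimal, hedged sketch of a Beam pipeline in Python follows; it runs locally on the DirectRunner as written, and the bucket paths are placeholders:

    import apache_beam as beam

    # A small batch pipeline: read text files, count words, write results.
    with beam.Pipeline() as pipeline:  # DirectRunner by default
        (
            pipeline
            | "Read" >> beam.io.ReadFromText("gs://my-landing-bucket/logs/*.txt")
            | "Split" >> beam.FlatMap(lambda line: line.split())
            | "Count" >> beam.combiners.Count.PerElement()
            | "Format" >> beam.MapTuple(lambda word, n: f"{word}: {n}")
            | "Write" >> beam.io.WriteToText("gs://my-landing-bucket/counts/part")
        )

The same code runs on Dataflow when you pass --runner=DataflowRunner along with project, region, and temp_location pipeline options.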

Is your data processing extract, transform, and load (ETL) or extract, load, and transform (ELT)?

  • Google Cloud lets you choose between traditional ETL and more modern ELT data processing systems. For ETL pipelines, choose Data Fusion, Dataproc, or Dataflow, based on your current data processing capabilities.

  • If you're starting fresh, Dataflow is the recommended service because it provides a single, unified way to create batch and streaming applications. If you want a hands-off approach, Data Fusion is recommended because you can create pipelines with a drag-and-drop UI.

  • For ELT pipelines, use BigQuery. It supports both batch and streaming data loads with equal ease. Once the data is inside BigQuery, use SQL to perform the transformations that derive new datasets for your business use cases (see the sketch after this list).

  • If you want to modernize and move from ETL to ELT, you can use Dataform, a recent addition to the Google Cloud suite of products. Or, if you use a third-party service such as Fivetran for ELT, Google provides detailed pipeline guidance for that use case.
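
A hedged sketch of the ELT pattern described above: once the raw data is loaded, run a SQL transformation from the Python client to derive a new table. The project, dataset, table, and column names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Transform already-loaded raw data into a curated table using SQL.
    query = """
        CREATE OR REPLACE TABLE `my-project.analytics.daily_revenue` AS
        SELECT DATE(event_time) AS day, SUM(amount) AS revenue
        FROM `my-project.analytics.raw_events`
        WHERE event = 'purchase'
        GROUP BY day
    """
    client.query(query).result()  # Wait for the transformation to finish.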

Do you want a hands-off approach to data transformation (with minimal coding)?

  • Dataprep by Trifacta provides a visual interface to explore, clean, and prepare data (i.e., transform it for analysis and machine learning). It enables you to enforce data quality rules and standardization with features such as pattern matching and sampling. You can create data transformation steps and choose where to run those transformations at scale, either with a Dataflow job or with BigQuery SQL. Dataprep is also integrated with Data Fusion, which is an end-to-end data pipeline and integration service.

Do you have a batch, streaming, or real-time data transformation use case?

  • Different use cases require different tools and frameworks. While some systems are built from the ground up to handle all three use cases, other systems are suited to only one particular use case.

  • For a batch data processing system, you can perform data processing and transformation in BigQuery with a familiar SQL interface. If you have an existing pipeline running on Hadoop or Spark, on-premises or in another public cloud, you can use Dataproc. You can also consider Dataflow for a unified programming interface covering both batch and streaming use cases. Google's recommendation is to modernize: use Dataflow for ETL and BigQuery for ELT.

  • For streaming data pipelines, use a managed serverless service like Dataflow, which provides windowing, autoscaling, templates, and more. BigQuery handles streaming inserts easily if you have analytics- and SQL-focused teams, capabilities, or strategies (see the sketch after this list).

  • For real-time use cases, such as time series analysis or streaming video analytics, use Dataflow.
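
A hedged sketch of BigQuery streaming inserts from the Python client follows; the table and field names are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()
    table_id = "my-project.analytics.raw_events"  # placeholder

    rows = [
        {"event": "page_view", "page": "/pricing"},
        {"event": "purchase", "amount": 42.0},
    ]

    # Stream rows into the table; returns a list of per-row errors, if any.
    errors = client.insert_rows_json(table_id, rows)
    if errors:
        raise RuntimeError(f"Streaming insert failed: {errors}")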

Do you want an open programming interface with the ability to change the execution engine if need be?

  • Consider Apache Beam, an open source, unified programming model whose pipelines are portable across execution engines (runners), such as Dataflow, Apache Spark, and Apache Flink, so you can change the execution engine without rewriting your pipeline code.

Do you want to ingest data from multiple sources with a managed service that enables autoscaling, debugging, and no operations (no-ops)?

  • Use Dataflow. Dataflow templates provide an easy way to get started with most data ingestion sources, such as Pub/Sub, Cloud Storage, HDFS, S3, and Kafka (see the launch sketch after this list). And because Dataflow is a managed, serverless service, you can focus on the business use case at hand without worrying about operations.

  • Dataflow Prime is a newer service that takes Dataflow to the next level by offering horizontal and vertical autoscaling of the machines used to execute a pipeline, taking manual pipeline tuning out of the picture. It also adds smart diagnostics and recommendations that identify problems and make suggestions on the fly.
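
As a hedged sketch of launching a Google-provided template programmatically (the console and gcloud can do the same), here is the Dataflow REST API called through the Python API client; the project, topic, and table names are placeholders:

    from googleapiclient.discovery import build

    dataflow = build("dataflow", "v1b3")

    # Launch the Google-provided streaming template that reads
    # from Pub/Sub and writes to BigQuery.
    request = dataflow.projects().locations().templates().launch(
        projectId="my-project",
        location="us-central1",
        gcsPath="gs://dataflow-templates/latest/PubSub_to_BigQuery",
        body={
            "jobName": "clickstream-to-bq",
            "parameters": {
                "inputTopic": "projects/my-project/topics/clickstream-events",
                "outputTableSpec": "my-project:analytics.raw_events",
            },
        },
    )
    response = request.execute()
    print(response["job"]["id"])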

Do you want a managed solution for discovering, identifying, and protecting sensitive data?

  • Use Cloud Data Loss Prevention (Cloud DLP), a fully managed service for discovering, classifying, and protecting sensitive data, such as personally identifiable information (PII), in Cloud Storage, BigQuery, and other datastores.

Do you want to modernize your data transformation processes with SQL expertise?

  • With Dataform, you can write data transformations as code, with version control enabled by default. It brings software development best practices such as continuous integration and continuous delivery (CI/CD), unit testing, and version control to SQL code, and it supports all major cloud data warehouse products as well as databases like PostgreSQL.

Data analytics and warehousing

Do you have a data lake, but you’re not sure if you need a data warehouse?

  • Data lakes and data warehouses complement each other; they're not an either-or choice. Data lakes are very useful for storing and processing unstructured and semi-structured data, while data warehouses are great for analytics and business intelligence. BigQuery is a great option because it can process petabytes of data with ease.

Do you want to migrate from a traditional data warehouse to BigQuery?

  • For guidance on migrating from a traditional data warehouse to BigQuery as an architectural pattern, see Google's data warehouse migration documentation.

Do you want to avoid vendor lock-in when choosing a data warehousing solution?

  • BigQuery is designed so that its storage and compute layers scale independently of each other. You can read data stored in BigQuery from external applications such as Apache Spark, and the BigQuery Storage API enables both read and write access to BigQuery-managed storage independent of the BigQuery UI. While giving you complete control over the data stored, BigQuery is a fully managed, serverless service, which removes any hardware purchase and maintenance requirements. You are not tied to the service at any point.

Do you need federated access to data?

  • BigQuery lets you define external tables that read data from Bigtable, Cloud SQL, Cloud Storage, or Google Drive. You can join these external sources with tables stored natively in BigQuery, as sketched below.
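
A hedged sketch of defining a Cloud Storage-backed external table and joining it with a native table, using the Python client; all names and columns are placeholders:

    from google.cloud import bigquery

    client = bigquery.Client()

    # Define an external table over CSV files sitting in Cloud Storage.
    external_config = bigquery.ExternalConfig("CSV")
    external_config.source_uris = ["gs://my-landing-bucket/ref/products-*.csv"]
    external_config.autodetect = True

    table = bigquery.Table("my-project.analytics.products_external")
    table.external_data_configuration = external_config
    client.create_table(table, exists_ok=True)

    # Join the external table with a native BigQuery table.
    query = """
        SELECT e.product_name, SUM(n.amount) AS revenue
        FROM `my-project.analytics.products_external` AS e
        JOIN `my-project.analytics.raw_events` AS n USING (product_id)
        GROUP BY e.product_name
    """
    for row in client.query(query):
        print(row.product_name, row.revenue)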

Do you have a Star or Snowflake schema and want to move to BigQuery?

  • BigQuery supports both star and snowflake schemas, but its native schema representation is neither of the two. Instead, it uses nested and repeated fields for a more natural representation of the data, as sketched below. For more information, see the example schema in the BigQuery documentation.
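
To make the nested-and-repeated idea concrete, here is a hedged sketch of such a schema defined with the Python client; the table and field names are illustrative:

    from google.cloud import bigquery

    # A denormalized orders table: each order embeds a repeated record of
    # line items instead of joining to a separate line-item table.
    schema = [
        bigquery.SchemaField("order_id", "STRING", mode="REQUIRED"),
        bigquery.SchemaField("order_time", "TIMESTAMP"),
        bigquery.SchemaField(
            "line_items",
            "RECORD",
            mode="REPEATED",  # One order, many items: nested and repeated.
            fields=[
                bigquery.SchemaField("product_id", "STRING"),
                bigquery.SchemaField("quantity", "INTEGER"),
                bigquery.SchemaField("unit_price", "NUMERIC"),
            ],
        ),
    ]

    client = bigquery.Client()
    client.create_table(
        bigquery.Table("my-project.analytics.orders", schema=schema)
    )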

Reporting and visualization

Are you currently using visualization and BI tools (for example, Qlik, Looker, or Tableau) that you want to continue using with BigQuery?

  • Use BigQuery BI Engine, which works with any BI or custom dashboarding application that requires sub-second response times. BI Engine is natively integrated into the BigQuery API, so any BI tool or custom application that connects to the BigQuery API via JDBC/ODBC can take advantage of it by reserving BI Engine capacity in a Google Cloud project and specifying the amount of memory.

Do you want to modernize your BI processes with tools that can federate data access across transactional and analytical datastores?

  • Looker is a modern enterprise platform for BI, data applications, and embedded analytics. It allows you to create consistent data models on top of your data with speed and accuracy while allowing you to access data inside transactional and analytical datastores. Looker works across multicloud environments as well.

Key Google Cloud services

  • Pub/Sub: Ingest events for streaming into BigQuery, data lakes, or operational databases

  • Dataflow: Unified stream and batch data processing that's serverless, fast, and cost-effective

  • Dataprep by Trifacta: Intelligent cloud data service to visually explore, clean, and prepare data for analysis and machine learning

  • Datalab: Explore, visualize, analyze, and transform data using familiar languages, such as Python and SQL, interactively

  • Dataproc: Service for running Apache Spark and Apache Hadoop clusters

  • Cloud Data Fusion: Fully managed, cloud-native data integration at any scale

  • BigQuery: Serverless, highly scalable, and cost-effective multicloud data warehouse designed for business agility

  • Cloud Composer: Fully managed workflow orchestration service built on Apache Airflow

  • Data Catalog: Fully managed and highly scalable data discovery and metadata management service

  • Google Data Studio: Interactive data suite for dashboarding, reporting, and analytics

  • Looker: Enterprise platform for business intelligence, data applications, and embedded analytics

  • Dataplex: Intelligent data fabric for unifying data management across silos

  • Analytics Hub: Service for securely and efficiently exchanging data analytics assets

What's next?

We've just covered the topic of Analytics as part of the System Design Pillar of the Google Cloud Architecture Framework. Several other topics within the System Design Pillar may also be of interest to you.
