Since social media-related data is large, it should be stored in a horizontally scalable database. There are two horizontally scalable databases in Google Cloud: Bigtable and Spanner. From all the resources I have read so far, Spanner seems better than Bigtable in every aspect (unless I need to store unstructured data). However, is that the case, or are there cases where Bigtable is actually better?
What I need to store are social media posts, likes, views, etc. Also, I need to store data where consistency is important.
When deciding between Bigtable and Spanner for storing large-scale, evolving social media data (such as posts, likes, and views), it’s important to recognize that while both databases offer horizontal scalability, they cater to different use cases.
Bigtable is designed for extreme write throughput and schema flexibility, making it ideal for applications that require rapid, low-latency data handling, such as high-volume social media interactions. Bigtable can efficiently manage the constant stream of writes generated by user activities while allowing for flexible schema changes without complex migrations. This is particularly useful in social media platforms where the data model may evolve over time. Additionally, Bigtable is a cost-effective solution for storing large amounts of data, especially when strong consistency across rows is not critical. It provides strong consistency for single-row reads and writes but only eventual consistency for multi-row operations, making it suitable for workloads where real-time accuracy isn’t always necessary, such as displaying the number of views or likes on a post.
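To make the single-row write pattern concrete, here is a minimal sketch using the `google-cloud-bigtable` Python client. The table name `likes`, the column family `events`, and the row-key format are illustrative assumptions, not details from this thread; in Bigtable, row-key design (e.g., prefixing by post ID so a post's likes are stored contiguously) is a central part of the data model.

```python
def like_row_key(post_id: str, user_id: str) -> str:
    """Illustrative row-key design: prefix by post so a post's likes sort together."""
    return f"like#{post_id}#{user_id}"

def record_like(project_id: str, instance_id: str, post_id: str, user_id: str) -> None:
    """Write one 'like' event as a single-row mutation.

    Single-row writes in Bigtable are atomic and strongly consistent.
    Requires the google-cloud-bigtable package and valid credentials.
    """
    from google.cloud import bigtable  # imported lazily; an optional dependency here

    client = bigtable.Client(project=project_id)
    table = client.instance(instance_id).table("likes")  # hypothetical table name

    row = table.direct_row(like_row_key(post_id, user_id).encode("utf-8"))
    row.set_cell("events", "user_id", user_id.encode("utf-8"))  # 'events' family assumed
    row.commit()  # atomic for this single row
```

This sketch writes one row per like event rather than updating a shared counter row, which keeps each write independent and avoids hot-spotting a single row under high throughput.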
On the other hand, Spanner offers strong consistency for all reads and writes, making it the better choice when data integrity is essential. Spanner’s relational model supports complex SQL queries, including joins and aggregations, which are useful for analyzing user behavior, generating reports, and performing real-time data analysis. Although Spanner may not handle write throughput as efficiently as Bigtable in certain scenarios, it can still scale horizontally to manage large volumes of writes, particularly when transactional consistency is required. Spanner’s SQL interface also simplifies development and integration with other tools, providing a structured, reliable approach to managing consistent data.
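As a hedged sketch of the transactional style Spanner enables, the snippet below increments a post's view count inside a read-write transaction using the `google-cloud-spanner` Python client. The `Posts` table, its columns, and the identifiers are placeholder assumptions, not part of the original answer.

```python
def add_view(project_id: str, instance_id: str, database_id: str, post_id: str) -> None:
    """Increment a post's view count inside a read-write transaction.

    Spanner transactions are ACID, so the read-modify-write is strongly
    consistent even across rows. Requires google-cloud-spanner and credentials.
    """
    from google.cloud import spanner  # imported lazily; an optional dependency here

    client = spanner.Client(project=project_id)
    database = client.instance(instance_id).database(database_id)

    def txn(transaction):
        # Parameterized DML; 'Posts', 'view_count', and 'post_id' are assumed schema.
        transaction.execute_update(
            "UPDATE Posts SET view_count = view_count + 1 WHERE post_id = @post_id",
            params={"post_id": post_id},
            param_types={"post_id": spanner.param_types.STRING},
        )

    # run_in_transaction automatically retries the function on transient aborts.
    database.run_in_transaction(txn)
```

Note the trade-off this illustrates: a shared counter row updated transactionally is simple and always accurate, but under very high write rates it can become a contention point, which is exactly where Bigtable's append-style single-row writes tend to do better.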
For social media platforms, both databases could play complementary roles. Bigtable excels in handling high-throughput, low-latency operations like user interactions, while Spanner’s strong consistency and SQL support are ideal for storing and querying data where accuracy and complexity are crucial, such as user engagement analysis or reporting.
Ultimately, the choice between Bigtable and Spanner depends on the specific needs of your application. If cost-effectiveness and handling high volumes of writes are your priorities, Bigtable is a strong candidate. However, if you require strong consistency and need to run complex queries on your data, Spanner is the better option. For some workloads, a combination of both databases could be the most effective solution.
Thank you, that is very helpful! I have a few questions:
- You mentioned that Bigtable handles write throughput more efficiently than Spanner. In terms of numbers: Bigtable handles 10,000 writes per second per node, while Spanner handles 3,500 writes per second per node (which can be increased to 22,500). Both support adding more nodes without a hard limit, which increases throughput roughly linearly. Is that correct? I found these numbers in https://cloud.google.com/bigtable/docs/performance#typical-workloads and https://cloud.google.com/spanner/docs/performance#typical-workloads. Though in that case, won't Spanner have higher write throughput (22,500) than Bigtable (10,000)?
- When you mentioned eventual consistency across different rows, I guess you meant atomicity instead (https://cloud.google.com/bigtable/docs/writes#batch)? From my understanding so far, the only case where Bigtable is not strongly consistent (but eventually consistent) is when you replicate across clusters. Is my understanding correct?
1. Write Throughput Comparison Between Bigtable and Spanner
Both Bigtable and Spanner support horizontal scaling by adding more nodes, which increases their throughput linearly. Based on the performance numbers from the sources you shared:
Bigtable: Handles 10,000 writes per second per node.
Spanner: Handles 3,500 writes per second per node at the baseline, which can reportedly increase to around 22,500 writes per second per node with schema and workload optimizations.
From these numbers, the comparison really comes down to effective per-node throughput. Bigtable's baseline of roughly 10,000 writes per second per node is higher than Spanner's baseline of 3,500, but an optimized Spanner configuration can reportedly reach about 22,500 writes per second per node, which would exceed Bigtable's figure. Since both databases scale roughly linearly as you add nodes, whichever system achieves the higher effective per-node throughput for your workload will also deliver the higher cluster-wide throughput at the same node count.
Therefore, if write throughput is a critical factor, Spanner can achieve higher throughput with the right configuration, especially when you need both throughput and consistency. However, Spanner is typically more expensive and complex to configure for these high-throughput use cases compared to Bigtable.
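Under the linear-scaling assumption, the capacity estimate is simple arithmetic. The per-node figures below are the illustrative numbers from the performance pages linked above, not guarantees; real throughput depends heavily on schema, key design, and workload shape.

```python
def estimated_write_qps(nodes: int, per_node_qps: int) -> int:
    """Rough cluster-wide write throughput, assuming linear scaling with node count."""
    return nodes * per_node_qps

# Illustrative per-node figures from the linked performance docs.
BIGTABLE_PER_NODE = 10_000
SPANNER_PER_NODE = 3_500            # baseline
SPANNER_PER_NODE_OPTIMIZED = 22_500  # reported optimized figure

# A hypothetical 10-node deployment under each assumption:
bigtable_qps = estimated_write_qps(10, BIGTABLE_PER_NODE)             # 100,000
spanner_qps = estimated_write_qps(10, SPANNER_PER_NODE)               # 35,000
spanner_opt_qps = estimated_write_qps(10, SPANNER_PER_NODE_OPTIMIZED)  # 225,000
```

The arithmetic shows why the optimized per-node figure matters: at equal node counts, the per-node number is the only variable, so cost comparisons should be made at equal *throughput*, not equal node counts.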
2. Consistency vs. Atomicity in Bigtable
Your understanding is largely correct. Let's clarify:
Strong Consistency: Bigtable provides strong consistency for single-row reads and writes. That means when you perform operations on a single row, the data is immediately consistent.
Atomicity: In cases where you perform batch writes or updates across multiple rows, the operations are atomic within a single row but not across multiple rows. This means that each row's write is isolated and consistent, but if you're writing to several rows in a single batch, these operations are not atomic across all the rows together.
Eventual Consistency: You are correct that Bigtable is only eventually consistent when you're replicating data across clusters in different regions. In a replicated, multi-cluster setup, there can be slight delays in data synchronization between clusters, leading to eventual consistency across regions. However, for most standard operations within a single region, Bigtable maintains strong consistency at the row level.
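To make the atomicity point concrete, here is a hedged sketch of a multi-row batch write using the `google-cloud-bigtable` client's `mutate_rows`. Each row's mutation is atomic on its own, but the batch as a whole is not: some rows can succeed while others fail, so the per-row statuses returned by `mutate_rows` must be checked individually. Table, family, and key names are placeholder assumptions.

```python
def chunk(items, size):
    """Split a list of events into batches (pure helper)."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def batch_record_views(table, view_events, batch_size=100):
    """Write many single-row view events via mutate_rows.

    Atomic per row, NOT atomic across the batch: mutate_rows returns one
    status per row, and failures must be handled row by row.
    `table` is an already-constructed google.cloud.bigtable Table.
    """
    failed_keys = []
    for batch in chunk(view_events, batch_size):
        rows = []
        for post_id, user_id in batch:
            row = table.direct_row(f"view#{post_id}#{user_id}".encode("utf-8"))
            row.set_cell("events", "user_id", user_id.encode("utf-8"))
            rows.append(row)
        # One google.rpc Status per row; code 0 means OK, anything else failed.
        for row, status in zip(rows, table.mutate_rows(rows)):
            if status.code != 0:
                failed_keys.append(row.row_key)
    return failed_keys
```

Collecting the failed keys lets the caller retry only the rows that did not land, which is the usual pattern when cross-row atomicity is not available.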
That sounds good. We still have to improve things and may approach it another way as well, and that's fine.