CloudSQL for Postgres - Unable to Create HNSW Index with Latest pgvector 0.6.0

Hi, I'm trying to use the latest pgvector v0.6.0 Postgres extension on Cloud SQL for Postgres v15.5. When creating an HNSW index, I get the following error:

2024-04-25 16:08:08.830 UTC [214]: [22-1] db=database,user=user ERROR: could not resize shared memory segment "/PostgreSQL.1088724100" to 8586826336 bytes: No space left on device

I have the following specs/configurations:
1M vectors, 768 dimensions
4 vCPUs and 16 GB RAM
maintenance_work_mem = 8GB
max_parallel_maintenance_workers = 2
shared_buffers = 5336MB (set by default)

The same works fine on a similarly sized RDS Postgres instance on AWS. The one noticeable difference is the versions: AWS has Postgres v16.2 and pgvector v0.6.2, while GCP has Postgres v15.5 and pgvector v0.6.0.
Setting max_parallel_maintenance_workers = 0 works fine. 
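For reference, the build is roughly the following (table and column names are placeholders; the vector(768) column matches my 768-dimension embeddings):

```sql
-- Placeholder schema: items(embedding vector(768)), ~1M rows
SET maintenance_work_mem = '8GB';
SET max_parallel_maintenance_workers = 2;

-- HNSW index build that triggers the shared memory error
-- (m / ef_construction shown at the pgvector defaults; illustrative only)
CREATE INDEX items_embedding_hnsw
  ON items USING hnsw (embedding vector_cosine_ops)
  WITH (m = 16, ef_construction = 64);
```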


The discrepancy in PostgreSQL and pgvector versions between your AWS RDS and Google Cloud SQL environments is indeed an important clue. It's possible that enhancements in PostgreSQL v16.2 and pgvector v0.6.2 are optimizing memory usage and management better than the older versions.

Given that setting max_parallel_maintenance_workers to 0 resolves the issue on GCP, it appears that parallel processing during index creation is exceeding the shared memory available under the current Cloud SQL configuration. A parallel index build allocates a dynamic shared memory segment sized roughly by maintenance_work_mem (note the ~8.5 GB resize in your error), and that segment is typically backed by /dev/shm, which can be smaller than instance RAM; when it can't grow, you get "No space left on device". Disabling parallel maintenance workers avoids that allocation, which, while it may slow down the operation, sidesteps the shared memory error.

  1. Resource Adjustment:

    • Memory Allocation: Since setting max_parallel_maintenance_workers to 0 helps, you might also experiment with slightly increasing maintenance_work_mem while keeping parallel workers disabled (see the sketch after this list). This could speed up the build without exceeding memory limits.
    • Scaling Up: If possible, temporarily increase the machine type during index creation to handle higher loads, then scale down once complete.
  2. Performance Testing:

    • Comparative Analysis: Run the same tests on both AWS and GCP with matching PostgreSQL and pgvector versions to definitively pinpoint whether the issue is version-related or due to inherent differences in how Cloud SQL and RDS manage resources.
    • Monitoring Tools: Utilize GCP's detailed monitoring tools to track memory and process efficiency during these tests.
  3. Google Cloud Support: It might be worthwhile to discuss this scenario with Google Cloud support. They could provide insights specific to pgvector's performance on Cloud SQL and any upcoming support for newer versions.
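As a sketch of the memory-allocation suggestion above (illustrative values and placeholder object names; tune to your instance):

```sql
-- Per-session settings for a serial HNSW build: with no parallel workers,
-- no large dynamic shared memory segment is requested
SET max_parallel_maintenance_workers = 0;
SET maintenance_work_mem = '4GB';  -- illustrative; increase only as the instance allows

-- Placeholder table/column names
CREATE INDEX items_embedding_hnsw
  ON items USING hnsw (embedding vector_cosine_ops);
```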

With `maintenance_work_mem = 3GB` the index build works fine; anything higher fails.

I also tried a larger instance but got the same error:
8 vCPUs, 32 GB RAM
maintenance_work_mem = 15GB
max_parallel_maintenance_workers = 7

Hi @wahajali ,

Since you've identified that 3GB for maintenance_work_mem works, it would be advisable to use this setting for now to avoid the "no space left on device" error. This setting seems to be the sweet spot between performance and memory allocation limits on Cloud SQL.

You could try incrementally increasing maintenance_work_mem from 3GB upwards in smaller steps (e.g., 4GB, 5GB) to find the maximum stable value under your current system.

Similarly, adjusting max_parallel_maintenance_workers in smaller increments might help you optimize the use of your CPU cores without overloading memory.
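A minimal way to probe this, assuming the same placeholder table as above, is to rebuild the index while changing one setting per run:

```sql
-- Step maintenance_work_mem (and later the worker count) up one notch per run,
-- keeping everything else fixed, until the build fails again
SET max_parallel_maintenance_workers = 0;   -- start from the known-good configuration
SET maintenance_work_mem = '4GB';           -- then '5GB', '6GB', ...

DROP INDEX IF EXISTS items_embedding_hnsw;
CREATE INDEX items_embedding_hnsw
  ON items USING hnsw (embedding vector_cosine_ops);
```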

Confirm that disk space and I/O capacity are not bottlenecks on the larger instance. Even with more memory, disk I/O limits can still slow or destabilize the build.

Utilize GCP's monitoring tools to observe how memory and other resources are utilized during the index building process. Look for any spikes or unusual patterns that could indicate underlying problems.
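Alongside the Cloud Monitoring dashboards, PostgreSQL's own progress view can show how far the build gets before it fails; for example, from a second session:

```sql
-- Progress of any in-flight CREATE INDEX (available since PostgreSQL 12)
SELECT pid, phase,
       round(100.0 * blocks_done / NULLIF(blocks_total, 0), 1) AS pct_blocks,
       round(100.0 * tuples_done / NULLIF(tuples_total, 0), 1) AS pct_tuples
FROM pg_stat_progress_create_index;
```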

It may be beneficial to consult with Google Cloud Support to understand if there are undocumented limits or optimization techniques specific to pgvector and PostgreSQL on Cloud SQL.

Thanks for the reply. 

Actually, I'm trying to benchmark pgvector across GCP, Azure and AWS. While the configurations work on both Azure and AWS, they fail on GCP. I agree that lowering `maintenance_work_mem` will get the index built, but then I won't be able to make an accurate comparison.

Hi @wahajali ,

The version discrepancies between PostgreSQL and pgvector on AWS and Google Cloud would also complicate achieving an accurate comparison.

Ideally, all environments should use the same versions of PostgreSQL and pgvector to ensure the benchmarking is as fair and useful as possible. 

Try to standardize the settings (like maintenance_work_mem, max_parallel_maintenance_workers, etc.) across all platforms as much as possible. Since lowering maintenance_work_mem on GCP helps avoid errors but may affect performance metrics, finding a middle ground that works across all environments might be necessary.
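To make that comparison concrete, a quick snapshot like the following (run on each platform) captures the versions and the settings most relevant to the HNSW build; extend the list as needed:

```sql
-- Versions
SELECT version();
SELECT extversion FROM pg_extension WHERE extname = 'vector';

-- Memory / parallelism settings that affect index builds
SELECT name, setting, unit
FROM pg_settings
WHERE name IN ('maintenance_work_mem',
               'max_parallel_maintenance_workers',
               'max_worker_processes',
               'shared_buffers',
               'work_mem');
```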

Ensure that memory allocation and the number of CPUs are comparable across all test environments. This includes checking and adjusting shared buffers, work memory, and even disk I/O settings as these can influence performance outcomes.