
Analytics and Monetization issue when switching over to Apigee OPDK DR Data Center

Issue: While performing a drill to fail over to the DR data center, we are observing the following issues:

  • The analytics data is still being written to the old PG master node
  • Monetization messages are stuck in the Qpid queue

Background and overview:

We have two Apigee OPDK data centers - a primary DC and a DR DC, in active-passive mode. At any given point in time, API traffic flows through only one of the DCs. If the primary DC fails, the traffic is to be routed to the DR DC.

  • We have a 14-node setup in each DC:

      • Nodes 1-6: Apigee Message Processor and Router
      • Nodes 7-9: Cassandra and ZooKeeper
      • Nodes 10-11: Edge UI, Management Server, OpenLDAP
      • Node 12: Postgres DB
      • Nodes 13-14: Qpid Server

  • Postgres replication is enabled, with the PG node in DC-1 as master and the one in the DR DC as slave (see the role-check sketch after this list).
  • Cassandra ring has been established between the two DCs.
  • Followed this Apigee documentation to add DC-2 data center as a disaster recovery DC - https://docs.apigee.com/private-cloud/v4.50.00/adding-data-center
  • Monetization is installed on both DC-1 and DC-2.
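
For reference, below is a minimal sketch (Python with psycopg2; the hostnames, database name, and credentials are placeholders rather than our real values) of how the master/standby roles of the two PG nodes can be verified before and after a failover:

    # check_pg_roles.py - verify which Postgres node is master and which is standby.
    # Assumes psycopg2 is installed and the Postgres port (5432) is reachable.
    import psycopg2

    # Placeholder hostnames/credentials - replace with the real DC-1 / DC-2 PG nodes.
    PG_NODES = {
        "dc1-pg": "10.0.1.12",
        "dc2-pg": "10.0.2.12",
    }

    for name, host in PG_NODES.items():
        conn = psycopg2.connect(
            host=host, port=5432, dbname="apigee", user="apigee", password="secret"
        )
        try:
            with conn.cursor() as cur:
                # pg_is_in_recovery() is true on a standby (read-only) node and
                # false on the master that accepts writes.
                cur.execute("SELECT pg_is_in_recovery();")
                in_recovery = cur.fetchone()[0]
                role = "standby/read-only" if in_recovery else "master/writable"
                print(f"{name} ({host}): {role}")
        finally:
            conn.close()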

We performed a planned failover and failback activity to verify that the DC-2 Apigee instance works as expected. However, during the activity we faced the issues mentioned below.

  1. DB error - cannot write to the read-only DB:
    • "ERROR: cannot execute INSERT in a read-only transaction"
    • "STATEMENT: INSERT INTO analytics. ****............**** GROUP BY apiproxy,apigee_timestamp,api_product"
  2. Monetization messages are stuck in the Qpid queue: We see that the monetization messages are not being processed and remain stuck in the Qpid queue.
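
For the second issue, a quick way to confirm that messages are piling up is to look at the queue depths on the Qpid node itself. The sketch below simply wraps qpid-stat in Python and assumes the qpid-tools CLI (qpid-stat) is available on that node, which may need to be confirmed for your install:

    # qpid_queue_depth.py - rough check for messages piling up in the local Qpid queues.
    # Assumption: the qpid-tools package (qpid-stat) is installed on this Qpid node.
    import subprocess

    # 'qpid-stat -q' prints one row per queue, including its current message depth;
    # without a broker argument it talks to the local broker on the default port.
    result = subprocess.run(["qpid-stat", "-q"], capture_output=True, text=True, check=True)
    print(result.stdout)

    # Queues whose depth stays non-zero across repeated runs (e.g. the monetization/
    # analytics queues) indicate consumers that are not draining them.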

 

A quick summary of failover steps that we performed:

  1. Ensure DC-2 components are up and running, ensure Postgres and Cassandra data are in sync between DC-1 and DC-2, and complete the other prerequisites
  2. Stop traffic on DC-1 Apigee instance
  3. Promote DC-2 PG as master and DC-1 PG as slave (a write-probe sketch follows this list) - https://docs.apigee.com/private-cloud/v4.50.00/handling-postgressql-database-failover
  4. Change Postgres database settings for monetization - https://docs.apigee.com/private-cloud/v4.50.00/change-pg-settings-monetization
  5. Restart all Apigee components in suggested order - https://docs.apigee.com/private-cloud/v4.50.00/starting-stopping-and-restarting-apigee-edge
  6. Update LBs to point to DC-2 Apigee instance and enable traffic on DC-2 Apigee instance
  7. Monitor Apigee traffic, check for component logs
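
To validate step 3, we run a small write probe against the newly promoted DC-2 PG node (sketch below; the host and credentials are placeholders, and it needs psycopg2 >= 2.8 for the error class). If the probe hits the same read-only error, the promotion did not take effect; if it succeeds, the analytics components are most likely still targeting the DC-1 node:

    # write_probe.py - confirm the newly promoted DC-2 Postgres actually accepts writes.
    # Host and credentials are placeholders; run this right after the promote step.
    import psycopg2

    conn = psycopg2.connect(
        host="10.0.2.12", port=5432, dbname="apigee", user="apigee", password="secret"
    )
    conn.autocommit = True
    try:
        with conn.cursor() as cur:
            # A TEMP table lives only in this session, so the probe leaves no trace.
            cur.execute("CREATE TEMP TABLE failover_probe (id int);")
            cur.execute("INSERT INTO failover_probe VALUES (1);")
            print("DC-2 PG accepts writes - promotion looks effective.")
    except psycopg2.errors.ReadOnlySqlTransaction:
        print("DC-2 PG is still read-only - promotion to master did not take effect.")
    finally:
        conn.close()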

Could anyone help with why the analytics data is still being routed to the old PG master node even after performing the PG DB failover as suggested in the documents?

2 REPLIES

Dear @skhendkar,

After reviewing your situation, we believe that submitting a support request would be the most effective way to ensure you receive precise guidance. Here is more information on how to open a support case:

  • If you have a Google Cloud Support Plan, file a support ticket through the Google Cloud Console.

  • If you do not have a support plan, you should contact your existing sales point of contact or use the Contact Us form to talk to someone.

Thank you for your reply, @AlexET.

We are actively working with the Google team to resolve an issue with the support portal and we plan to get the right support from them.

Meanwhile, it would be great if you could share any inputs, documents, or guides that describe the steps for switching between the data centers. Since I could not find a specific page in the Apigee documentation covering this, we came up with the steps ourselves and tried them out, but to our surprise, the same steps yield different results.

In one iteration the failover to the DR data center was successful, but the failback using similar steps failed and we observed the issues mentioned in the ticket description. In the next iteration, the exact same steps did not work during failover.

We mainly focused on updating the configurations below:

  • PG database failover
  • Monetization changes after PG failover
  • PG UUID change for analytics configuration
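
For the last item, a rough way to check whether the analytics configuration still references the old PG node is to compare the UUIDs registered in the ax analytics group with the UUIDs each Postgres Server component reports about itself. The sketch below assumes the /v1/analytics/groups/ax management API path from the OPDK docs linked above, the /v1/servers/self/uuid call, and port 8084 for the Postgres Server component - these should be verified against your Edge version:

    # check_ax_group.py - compare the PG UUIDs registered in the analytics group
    # with the UUIDs each Postgres Server component reports about itself.
    # Management-server host, PG hosts, and credentials are placeholders.
    import requests

    MS = "http://ms.dc2.example.com:8080"          # management server (placeholder)
    AUTH = ("sysadmin@example.com", "secret")      # sysadmin credentials (placeholder)
    PG_NODES = {"dc1-pg": "10.0.1.12", "dc2-pg": "10.0.2.12"}

    # Analytics groups as seen by the management server (path per the OPDK docs).
    groups = requests.get(f"{MS}/v1/analytics/groups/ax", auth=AUTH)
    print("analytics groups:", groups.json())

    # Each Postgres Server exposes its own UUID on its management port
    # (8084 here is an assumption - confirm the port for your install).
    for name, host in PG_NODES.items():
        uuid = requests.get(f"http://{host}:8084/v1/servers/self/uuid").text.strip()
        print(f"{name} ({host}) postgres-server uuid: {uuid}")

    # If the group's postgres-server / consumer UUIDs still point at the DC-1 node
    # after failover, analytics writes will keep going to the old (read-only) master.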