We are running Microservices on GKE and we subscribe and publish messages to PubSub asynchronously. We are seeing huge delays executing the callbacks where the P99 is in the order of 6-10s irrespective of the load.
Each of our GKE nodes are provisioned with 2 CPUs and 2GB memory. The CPU and memory usage is < 25% even under high loads.
My publish code
private void publish(
final ProcessingContext context,
final PubsubMessage message,
final AckReplyConsumer consumer,
final TopicName targetTopic,
ExecutorService callbackExecutor,
PublisherFactory publisherFactory) {
// Get Publisher from Cache
// This will be a persistent publisher
Supplier<ApiFuture<String>> futureSupplier =
() -> publisherFactory.createPublisher(String.valueOf(targetTopic)).publish(message);
long ts = System.nanoTime();
ApiFuture<String> future = context.execPublishCallForMessage(futureSupplier);
long publishCallLatency = System.nanoTime() - ts;
log.info("Success publishCallLatency: {}", publishCallLatency);
final long publishBeginTs = System.nanoTime();
ApiFutures.addCallback(
future,
new ApiFutureCallback<>() {
@Override
public void onSuccess(String result) {
long publishAndCallbackLatency = System.nanoTime() - publishBeginTs;
log.info("Success publishAndCallbackLatency: {}", publishAndCallbackLatency);
callbackExecutor.execute(consumer::ack);
}
@Override
public void onFailure(Throwable t) {
long publishAndCallbackLatency = System.nanoTime() - publishBeginTs;
log.info("Failed publishAndCallbackLatency: {}", publishAndCallbackLatency);
callbackExecutor.execute(consumer::nack);
}
},
MoreExecutors.directExecutor());
}
The publishCallLatency is low in the order of 5-10 ms but the publishAndCallbackLatency is very high in the order of 6-10s.
I have flow control and batching enabled. But I'm unable to figure why it takes such a long time for the callback to execute. I tried increasing the publisher executor threads and ran profiler but most of the executor threads are idle. I'm trying to reduce this latency and can't explain this latency.
Any ideas?