
Cloud Function (Node.js + Puppeteer + Express) completes, but Airflow (Cloud Composer) never receive

Hey guys

So I’m currently experiencing an interminable integration issue between a Google Cloud Function (Node.js, running Express, orchestrating Puppeteer for web scraping) and Airflow running on Cloud Composer. Here’s my setup and the issue in brief:

Setup:
Cloud Function: Node.js 18, Express. It accepts a POST JSON body, downloads and dynamically imports a JS file from Cloud Storage, runs Puppeteer to scrape from a platform and save files, processes the data (converts CSV to JSON), and uploads the data to BigQuery.

Airflow (Cloud Composer): Makes use of Python’s google.auth.transport.requests.AuthorizedSession with ID token auth to invoke the Cloud Function synchronously (POST, with a 1-hour timeout).
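
For reference, the Airflow-side call described above might look roughly like the following. This is only a sketch, not the actual DAG code: the helper names `make_authorized_session` and `invoke_function`, the 10-second connect timeout, and the use of the function URL as the ID token audience are assumptions made for illustration.

```python
from typing import Any, Optional

CONNECT_TIMEOUT_S = 10   # assumed value; the post only mentions the overall timeout
READ_TIMEOUT_S = 3600    # the 1-hour timeout described in the post


def make_authorized_session(audience: str):
    """Build an ID-token-authenticated session (needs google-auth and requests)."""
    # Imported lazily so the rest of the sketch can run without google-auth installed.
    import google.auth.transport.requests as ga_requests
    from google.oauth2 import id_token

    credentials = id_token.fetch_id_token_credentials(
        audience, request=ga_requests.Request()
    )
    return ga_requests.AuthorizedSession(credentials)


def invoke_function(url: str, payload: dict, session: Optional[Any] = None) -> Any:
    """POST the payload and block until the function responds or the timeout hits."""
    if session is None:
        # Assumption: the function URL is used as the ID token audience.
        session = make_authorized_session(url)
    response = session.post(
        url,
        json=payload,
        # (connect timeout, read timeout): the read timeout is the time allowed
        # between bytes received, so it bounds how long the caller sits idle.
        timeout=(CONNECT_TIMEOUT_S, READ_TIMEOUT_S),
    )
    response.raise_for_status()
    return response.json()
```

Passing the session in as a parameter keeps the call easy to exercise with a stub when checking the timeout handling outside Composer.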

Issue:
For small jobs (a few iterations of Puppeteer), HTTP 200 is sent back to Airflow and it proceeds normally.

For large jobs, the logs show everything completes as expected, and res.status(200).send(...) or .json(...) is called at the end, but the HTTP response is never received by Airflow, which waits until its own timeout. The Cloud Function finishes, and the logs confirm the response was sent.

No errors are thrown in either Airflow or Cloud Function, and all Node.js file handles and promises appear resolved at the end. Printing process._getActiveHandles() and process._getActiveRequests() shows only normal items (sockets, short-lived FSReqCallback).

What I've tried:
await-ed all async operations

Checked with why-is-node-running: nothing was left open, and running under a test harness leaked no more than without the patch.

Forced Content-Length in the response. Broke processing into smaller files and shorter Puppeteer loops.

Increased memory and CPU for both the Cloud Function and Composer. Upgraded requests, google-auth, and other Python libraries.

Switched between res.json, res.send, and plain HTTP responses.

This all runs in the GCP environment; the problem occurs only with heavy jobs and only in Cloud Functions. The Cloud Function is not bailing out early (all logs appear through completion). What else can cause a CF (Node.js/Express) to appear to finish and send its HTTP response, while the caller (Airflow/Composer) never receives that response?

Is there any known bug or limitation in GCF or Cloud Composer/AuthorizedSession that might cause this behavior for heavy/long-running synchronous requests?

 

1 REPLY 1

So, I know nothing about Airflow, but is this what's happening?

- Airflow->Cloud Functions: works
- Airflow->Cloud Run jobs: Doesn't work

If that's the case, I think I know the issue. Cloud Functions sends an http
response when it's done processing the requests, which is probably what
Airflow is expecting.
Cloud Run jobs sends an http response when it STARTS the job. It doesn't
send anything else afterwards.

One way to work around this is to wrap the Cloud Run job in a Cloud
Workflow. Cloud Workflows has a Cloud Run jobs connector that monitors job
status. Though I'm not sure any system that relies on a response to an http
request will work well with a long-running job; here we get back to me not
knowing much about Airflow. I'd love to hear how you end up resolving this!
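
The "monitor job status" idea above can be sketched on the caller side as a polling loop: start the work, then check its status periodically instead of holding one HTTP request open for the whole run. This is a generic illustration only. The function name `wait_for_job`, the status strings, and the `get_status` callback are hypothetical and do not correspond to any specific Cloud Workflows or Cloud Run API.

```python
import time
from typing import Callable

# Hypothetical terminal states; a real job API would define its own.
TERMINAL_STATES = ("SUCCEEDED", "FAILED")


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval_s: float = 30.0,
    timeout_s: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Poll get_status() until it reports a terminal state or the deadline passes."""
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"job still in state {status!r} after {timeout_s}s")
        # Each poll is a short request, so no single connection has to
        # stay idle for the full duration of the job.
        sleep(poll_interval_s)
```

In this shape `get_status` would be whatever short call reports the job's state, for example a small HTTP GET or a lookup of a marker the job writes when it finishes.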



-------------------------------------------
Karolina Netolicka
Product Manager, Serverless
knet@google.com