We are developing a workflow that will store the results obtained from a web API into files in a bucket.
The workflow ends up generating JSON files with this format:
[{entry},{entry},{entry}...]
Each entry has multiple keys with corresponding values.
Here is the challenge: we want to load these JSON files into BigQuery, and it does not like the format; it wants NDJSON instead, with one entry per line, like this:
{entry}
{entry}
{entry}
The question is: How can I convert my JSON to NDJSON in a Workflow? Is it even possible?
If it's not, can we call a Function, send it the //bucket/file, and have the function convert it to NDJSON? How would that work?
In a notebook in the workflow, you can convert the JSON data (a parsed list of entries named data) to newline-delimited JSON this way:
import json

# One JSON document per line
ndjson_data = '\n'.join([json.dumps(record) for record in data])
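For context, here is a fuller sketch of that approach, assuming the google-cloud-storage client library is available in the notebook; the bucket and object names below are placeholders:

import json
from google.cloud import storage

# Placeholder bucket and object names; replace with your own
client = storage.Client()
bucket = client.bucket("my-bucket")
blob = bucket.blob("results/output.json")

# Download and parse the [{entry},{entry},...] array
data = json.loads(blob.download_as_text())

# One JSON document per line, as BigQuery expects
ndjson_data = '\n'.join([json.dumps(record) for record in data])

# Write the NDJSON version back next to the original file
bucket.blob("results/output.ndjson").upload_from_string(ndjson_data)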
Hi @peraltar,
Welcome to Google Cloud Community!
Yes, JSON to NDJSON conversion is possible. Incorporating a Python notebook within your workflow, as suggested by @mars124, provides an effective method for carrying out this conversion.
You can also use a Cloud Function, as you mentioned, if you want to automate the process further: a function that triggers whenever a JSON file is uploaded to your bucket can handle the conversion for you, making it easier to manage multiple files.
Here’s some sample code for such a function:
import json
from google.cloud import storage

def convert_to_ndjson(data, context):
    # data is the event payload describing the uploaded object
    bucket = storage.Client().bucket(data['bucket'])
    json_data = bucket.blob(data['name']).download_as_text()
    # Convert JSON to NDJSON and upload
    ndjson_lines = '\n'.join(json.dumps(record) for record in json.loads(json_data))
    bucket.blob(data['name'].replace('.json', '.ndjson')).upload_from_string(ndjson_lines)
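Once the .ndjson file is in the bucket, you can load it into BigQuery. Here is a minimal sketch, assuming the google-cloud-bigquery client library; the URI and table ID are placeholders:

from google.cloud import bigquery

client = bigquery.Client()
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # or supply an explicit schema
)
load_job = client.load_table_from_uri(
    "gs://my-bucket/results/output.ndjson",  # placeholder URI
    "my-project.my_dataset.my_table",        # placeholder table ID
    job_config=job_config,
)
load_job.result()  # wait for the load job to finish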
I hope the above information is helpful.