Data Fusion JSON Files

It looks like Data Fusion requires files read from Cloud Storage to be newline-delimited, period. 

 

But I have a customer who is sending a file that is a single JSON object with a nested array containing the actual records, so the file has many newline characters throughout. 

 

I have tried different things to get this to work. The docs for the Cloud Storage data source say that if the 'blob' format is used with a 'bytes' output format, it will treat the entire file as a single record. I wish this were true, because then I could use the JavaScript transform to manually break apart the records.
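
If that blob-as-a-single-record behaviour did work, the JavaScript transform step could be a small script along the lines of the sketch below. This is only a sketch: it assumes the source hands the file contents to the transform in a string field called 'body' and that the records sit under a top-level 'records' key, both of which are stand-ins for the real schema.

```
// Hypothetical JavaScript transform: parse the single whole-file record and
// emit one output record per element of the nested array.
// 'body' (input field) and 'records' (array key) are assumed names.
function transform(input, emitter, context) {
  var parsed = JSON.parse(input.body);
  var records = parsed.records;
  for (var i = 0; i < records.length; i++) {
    // Re-serialize each element; map individual fields here instead if the
    // output schema declares them explicitly.
    emitter.emit({ record: JSON.stringify(records[i]) });
  }
}
```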

 

I've also tried using the 'delimited' format with a junk delimiter value to try to force it to treat the whole file as one record.

 

The only workaround I have left is to create a one-off Cloud Function that extracts the records and writes a file that is properly de-nested and newline-delimited, the way Data Fusion apparently demands. I find this to be a less-than-elegant approach.
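
For concreteness, that function would look roughly like the sketch below (Node.js, triggered when the customer's file lands in GCS). The 'records' key, the staging bucket name, and the output naming are all placeholders for the real layout.

```
// Rough sketch of the one-off Cloud Function (background function triggered
// by a GCS upload). 'records', the staging bucket, and the output object
// name are placeholders.
const {Storage} = require('@google-cloud/storage');
const storage = new Storage();

exports.denest = async (event) => {
  // Download the original single-object JSON file.
  const [contents] = await storage
    .bucket(event.bucket)
    .file(event.name)
    .download();

  // Pull out the nested array and re-emit it as newline-delimited JSON.
  const parsed = JSON.parse(contents.toString('utf8'));
  const ndjson = parsed.records.map((r) => JSON.stringify(r)).join('\n');

  // Write the flattened copy where the Data Fusion pipeline reads from.
  await storage
    .bucket('my-datafusion-staging')
    .file(event.name.replace(/\.json$/i, '.ndjson'))
    .save(ndjson, {contentType: 'application/x-ndjson'});
};
```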

 

Please tell me I am missing something and there is an easy DF solution. 

 

Thanks 🙏


Hello,

Your analysis is correct. The connector is designed to read the file line by line, as if it were a JSONL file. This is mentioned in the BigQuery limitations documentation as well [1], and there is an existing feature request for it [2].

- Some additional workarounds can be found in [3].

- You can add the following to the file system properties section of the GCS source plugin:

```
{
  "textinputformat.record.delimiter": "[DELIMITER]"
}
```

You can replace [DELIMITER] with some "junk delimiter value", as you mentioned.

This will modify the delimiter from the default \n directly in the Spark job, allowing the entire file to be read as one record.
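
From there, the single whole-file record can be split back into individual rows downstream, for example with the JavaScript transform approach you described.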

 

[1] https://cloud.google.com/bigquery/docs/loading-data-cloud-storage-json#limitations

[2] https://issuetracker.google.com/134804951

[3] https://groups.google.com/g/cdap-user/c/kzFPjniBsU8