Hi there,
I am new to BigQuery. All my files are CSV files with a simple schema:
DateTime:TIMESTAMP,
Mth_yr:STRING,
kWh_gas:FLOAT,
kWh_elec:FLOAT
However, when I try to upload a file like this, I get this error:
Error while reading data, error message: Could not parse 'DateTime' as TIMESTAMP for field DateTime (position 0) starting at location 0 with message 'Could not parse 'DateTime' as a timestamp. Required format is YYYY-MM-DD HH:MM[:SS[.SSSSSS]] or YYYY/MM/DD HH:MM[:SS[.SSSSSS]]'
But this is the beginning of my file, which looks as expected:
DateTime,Mth-yr,kWh_gas,kWh_elec
2019-01-01 00:00:00,Jan-19,,
2019-01-01 00:30:00,Jan-19,,
Notice that those two kWh values are nullable.
I am not sure what I am doing wrong.
The error message indicates that BigQuery is encountering an issue when trying to parse a value in the DateTime column as a TIMESTAMP. The required format for TIMESTAMP values in BigQuery is YYYY-MM-DD HH:MM[:SS[.SSSSSS]] or YYYY/MM/DD HH:MM[:SS[.SSSSSS]].
From the sample data you provided, the format of the DateTime column appears correct. One potential issue could be that BigQuery is mistakenly trying to interpret the header "DateTime" as a TIMESTAMP value, which would lead to this error. It's also possible that other rows in your dataset have incorrect or missing DateTime values.
Before resorting to more complex solutions, ensure the following: the header row is being skipped during the load (for example, by setting skip_leading_rows to 1), and every row actually matches the required DateTime format.
If the issue persists, a workaround is to first load the data as STRING and then cast it to TIMESTAMP:
CREATE TABLE staging_table (
  DateTime STRING,
  Mth_yr STRING,
  kWh_gas FLOAT64,
  kWh_elec FLOAT64
);
bq load --source_format=CSV --skip_leading_rows=1 my_dataset.staging_table my_csv_file.csv
CREATE TABLE final_table (
  DateTime TIMESTAMP,
  Mth_yr STRING,
  kWh_gas FLOAT64,
  kWh_elec FLOAT64
);
INSERT INTO final_table (
DateTime,
Mth_yr,
kWh_gas,
kWh_elec
)
SELECT
CAST(DateTime AS TIMESTAMP),
Mth_yr,
kWh_gas,
kWh_elec
FROM staging_table;
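Before loading, you can also check locally which rows (if any) fail the timestamp format. A minimal sketch with pandas, using the two sample rows from the post inline (in practice you would read your real CSV file instead):

```python
import io

import pandas as pd

# Sample rows copied from the post; replace with your real CSV file.
csv_text = """DateTime,Mth-yr,kWh_gas,kWh_elec
2019-01-01 00:00:00,Jan-19,,
2019-01-01 00:30:00,Jan-19,,
"""

df = pd.read_csv(io.StringIO(csv_text))
# Rows whose DateTime cannot be parsed become NaT instead of raising.
parsed = pd.to_datetime(df["DateTime"], errors="coerce")
bad_rows = df[parsed.isna()]
print(len(bad_rows))  # 0 if every DateTime is well formed
```

If this prints 0 for a file that BigQuery still rejects, that strengthens the case that the loader is parsing the header row as data.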
Hello, thanks for the suggestions. This is definitely a bug because:
a) all my rows are formatted correctly (the file is produced from pandas)
b) I have other files where, by chance, the first rows are not null and the schema is correctly inferred
I can show a quick simple file to prove my point.
In terms of ingestion, I have a lot of files and I would need to do it via Python. Do you have any pointers on which API I should call?
If you believe it's a bug, especially given the consistency in your data formatting, it might be worth raising the issue with Google Cloud Support or checking BigQuery's known issues to see if others have encountered a similar problem.
In the meantime, you can use the following Python code to ingest your large number of CSV files into BigQuery:
import os

from google.cloud import bigquery

# Create a BigQuery client.
client = bigquery.Client()

# Create (or reuse) a dataset to store your data in.
dataset = bigquery.Dataset(f"{client.project}.my_dataset")
dataset.location = "US"
dataset = client.create_dataset(dataset, exists_ok=True)

# Shared load configuration: CSV input, skip the header row,
# and truncate any existing table contents.
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    skip_leading_rows=1,
    autodetect=True,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
)

# Iterate over the CSV files in the current directory and load each
# one into a table named after the file.
for csv_file in os.listdir():
    if csv_file.endswith(".csv"):
        with open(csv_file, "rb") as f:
            job = client.load_table_from_file(
                f,
                dataset.table(csv_file[:-4]),
                job_config=job_config,
            )
        job.result()  # wait for the load job to finish

# Print a success message.
print("All CSV files have been loaded into BigQuery successfully!")
This code will iterate over all the CSV files in the current directory and load them into BigQuery, truncating the existing tables if they already exist. You can modify the code to suit your specific needs, such as changing the destination dataset or the write disposition.