
Having trouble with CSV file in Data Fusion

Hello,
I'm doing a project for practice. The project creates dummy employee data in a CSV file with a Python script and stores that CSV file in a Cloud Storage bucket. I used the Data Fusion Wrangler to transform the data. When I opened the CSV file in Data Fusion, I saw that a few fields were empty and the last two columns were completely blank. In the CSV file on my desktop, all the data is present in every column. Can anyone help me overcome this problem? I have done a lot of troubleshooting with ChatGPT, but as a beginner I'm stuck here. If anyone is interested, I can share my screen and we can collaborate and troubleshoot together. I want to overcome this challenge.

Here is my Python code (from VS Code):

import csv
from faker import Faker
import random
import string
from google.cloud import storage
import os

# Set Google Cloud project environment variable
os.environ['GOOGLE_CLOUD_PROJECT'] = 'marine-champion-432318-n3'

# Initialize Faker
fake = Faker()

# Generate dummy data
def generate_employee_data():
    data = {
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": fake.email(),
        "address": fake.address(),
        "phone_number": fake.phone_number(),
        "ssn": fake.ssn(),
        "date_of_birth": fake.date_of_birth(minimum_age=18, maximum_age=65).isoformat(),
        "password": fake.password(length=12, special_chars=True, digits=True, upper_case=True, lower_case=True)
    }
    print(data)  # Print generated data for debugging
    return data

# Save data to CSV
def save_to_csv(file_path, data_list):
    with open(file_path, mode='w', newline='', encoding='utf-8') as file:
        writer = csv.DictWriter(file, fieldnames=data_list[0].keys())
        writer.writeheader()
        writer.writerows(data_list)
    print(f"Data saved to {file_path}")

# Upload file to GCS
def upload_to_gcs(bucket_name, source_file_name, destination_blob_name):
    """Uploads a file to the bucket."""
    # Initialize a client
    storage_client = storage.Client(project='marine-champion-432318-n3')

    # Get the bucket
    bucket = storage_client.bucket(bucket_name)

    # Create a blob object
    blob = bucket.blob(destination_blob_name)

    # Upload the file
    blob.upload_from_filename(source_file_name)
    print(f"File {source_file_name} uploaded to {destination_blob_name}.")

if __name__ == "__main__":
    # Generate a list of employee data
    employees = [generate_employee_data() for _ in range(10)]  # Adjust the number of records as needed

    # Define file paths
    csv_file_path = "employee_data.csv"

    # Save data to CSV
    save_to_csv(csv_file_path, employees)

    # Define GCS parameters
    bucket_name = "employee-project"  # Replace with your bucket name
    source_file_name = "employee_data.csv"
    destination_blob_name = "employee_data.csv"  # Blob name in GCS

    # Upload the file to GCS
    upload_to_gcs(bucket_name, source_file_name, destination_blob_name)
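
For reference, here is a quick check that reads the same file back locally to confirm every column is populated before the upload (a minimal sketch using only the standard library and the same employee_data.csv):

import csv

with open("employee_data.csv", newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))

# Count empty values per column to confirm nothing is missing locally
for column in rows[0].keys():
    empty = sum(1 for row in rows if not (row[column] or "").strip())
    print(f"{column}: {len(rows) - empty} filled, {empty} empty")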

I have attached a screenshot of how the file looks in Wrangler.
Asif_Shaharia_0-1724090249323.png

 


Hi @Asif_Shaharia,

Welcome to Google Cloud Community!

There's a good chance that you're using the wrong delimiter in your Data Fusion Wrangler. If Data Fusion parses the file with a different delimiter, for example a semicolon (;), you'll end up with missing or misaligned fields. Since employee_data.csv is comma-separated, you should be using a comma (,) as the delimiter.

jangemmar_0-1724277475643.png
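
If you're applying the parsing as a Wrangler directive instead of through the UI dropdown, it would look something like this (a sketch only; :body is the column Wrangler assigns to the raw line, and true tells it to treat the first row as a header):

parse-as-csv :body ',' true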

 

Also, in your Data Fusion pipeline, examine the data types assigned to each field in the Wrangler transformations and make sure they are compatible with the data you're trying to process.
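
For example, once the row is parsed you can pin down a column's type or backfill blank values with directives along these lines (a sketch; the column names come from your script, and you would confirm the exact directive names in your Wrangler version):

set-type :date_of_birth string
fill-null-or-empty :address 'N/A'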

Note: Data Fusion is a visual point-and-click interface enabling code-free deployment of ETL/ELT data pipelines. If you really want to use Python code in your pipeline, I highly suggest using Dataflow instead.
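
For instance, a minimal Apache Beam pipeline in Python that reads the same CSV from Cloud Storage might look roughly like this (a sketch only; the bucket path is taken from your script, and it runs on the local DirectRunner unless you pass Dataflow pipeline options):

import csv
import apache_beam as beam

def parse_line(line):
    # Split one CSV line into its fields using the standard csv parser
    return next(csv.reader([line]))

with beam.Pipeline() as pipeline:
    (
        pipeline
        | "Read CSV" >> beam.io.ReadFromText("gs://employee-project/employee_data.csv", skip_header_lines=1)
        | "Parse rows" >> beam.Map(parse_line)
        | "Print rows" >> beam.Map(print)
    )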

I hope the above information is helpful.