Data Fusion: add a new table to a Replication job

Hello, I've run into an issue with Data Fusion Replication: it doesn't seem possible to add a new table to an existing Replication job without stopping and recreating it. Is this the expected behavior? Do I have to recreate the job, or are there other options available?

ACCEPTED SOLUTION

Yes, you are correct. You cannot directly add new tables to an existing Replication job in Google Cloud Data Fusion without stopping and recreating the job. This is a current limitation of the Replication feature.

Here are the two options you can use if you need to add new tables to a Replication job:

1. Create a New Replication Job:

  • This is the most straightforward approach.
  • Create a new Replication job that includes the new tables you need to replicate.
  • This can be the more manageable option if you are adding a significant number of new tables.

2. Stop and Duplicate the Existing Job:

  • Stop the currently running Replication job (a hedged API sketch for doing this programmatically follows this list).
  • Create a duplicate of the job.
  • Modify the duplicated job to include the new tables.
  • Start the duplicated job.
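
If you prefer to script the stop step rather than click through the UI, the sketch below uses the CDAP REST API that every Data Fusion instance exposes. Treat it as a minimal sketch, not the documented procedure: the INSTANCE_ENDPOINT value, the job name my-replication-job, and the worker program name DeltaWorker are assumptions you should verify for your instance (for example by calling GET /v3/namespaces/default/apps/<job-name> first).

  # Minimal sketch: stop a Data Fusion Replication job through the CDAP REST API.
  # Assumptions to verify: INSTANCE_ENDPOINT is the apiEndpoint reported by
  # `gcloud beta data-fusion instances describe`, JOB_NAME is your Replication
  # job's name, and the job runs as the worker program "DeltaWorker".
  import subprocess
  import requests

  INSTANCE_ENDPOINT = "https://<instance>-dot-<region>.datafusion.googleusercontent.com/api"  # hypothetical
  JOB_NAME = "my-replication-job"  # hypothetical job name

  # Reuse the caller's gcloud credentials for the Authorization header.
  token = subprocess.check_output(
      ["gcloud", "auth", "print-access-token"], text=True
  ).strip()
  headers = {"Authorization": f"Bearer {token}"}

  # Stop the running replication worker before duplicating the job.
  stop_url = f"{INSTANCE_ENDPOINT}/v3/namespaces/default/apps/{JOB_NAME}/workers/DeltaWorker/stop"
  resp = requests.post(stop_url, headers=headers)
  resp.raise_for_status()
  print("Stop requested:", resp.status_code)

The same lifecycle endpoint with /start in place of /stop can be used to start the duplicated job once it is configured.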

Important to Consider:

  • Duplicate Jobs & Historical Data: If you duplicate a job with the snapshot option enabled, every table is loaded from scratch again (a full historical load). Accept this reload if you cannot split the tables into separate jobs or pipelines.
  • Overlapping Jobs: While it's tempting to run the old and new jobs in parallel for a while to avoid missing data, this is not recommended: it can still lead to data loss, especially for the historical data of the newly added tables.
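
To gain confidence that the cutover did not drop changes for the existing tables, you can spot-check the replication target after the new job catches up. The sketch below assumes a BigQuery target, a dataset named replica_ds, and an updated_at column in each table; all of these names are placeholders to adapt to your schema.

  # Minimal sketch: sanity-check the BigQuery target after cutting over to the new job.
  # Assumptions: BigQuery is the replication target, the dataset is "replica_ds",
  # and each table has an updated_at column that lets you spot a gap around the
  # cutover time.
  from google.cloud import bigquery

  client = bigquery.Client()
  for table in ["table_a", "table_b"]:
      query = f"""
          SELECT COUNT(*) AS row_count, MAX(updated_at) AS latest_change
          FROM `my-project.replica_ds.{table}`
      """
      result = list(client.query(query).result())[0]
      print(f"{table}: rows={result.row_count}, latest change replicated at {result.latest_change}")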

Additional Tips:

  • Planning: If you anticipate adding tables to your Replication job frequently, plan your database structure and Replication configuration to make it easier to manage in the future.
  • Separate Pipelines: Consider splitting your Replication tasks into multiple pipelines if you need to manage tables from separate database schemas independently.
  • Static Dataproc Cluster: If you run multiple Replication jobs, you can reduce compute costs by sharing a static Dataproc cluster between them instead of letting each job provision an ephemeral cluster. See more about this: https://cloud.google.com/data-fusion/docs/concepts/configure-clusters
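
As a rough illustration of the static-cluster tip, the following sketch points a single Replication job at a compute profile through the CDAP preferences API. The profile name static-dataproc and the SYSTEM: scope prefix are assumptions: create the profile yourself (using the Existing Dataproc provisioner) and check its scope before pointing jobs at it, and see the linked cluster-configuration page for the supported options.

  # Minimal sketch: set a compute profile on one Replication job as an
  # application preference via the CDAP REST API.
  # Assumptions: a compute profile named "static-dataproc" already exists and is
  # system-scoped; the scope prefix (SYSTEM: vs USER:) depends on where you created it.
  import subprocess
  import requests

  INSTANCE_ENDPOINT = "https://<instance>-dot-<region>.datafusion.googleusercontent.com/api"  # hypothetical
  JOB_NAME = "my-replication-job"  # hypothetical job name

  token = subprocess.check_output(
      ["gcloud", "auth", "print-access-token"], text=True
  ).strip()
  headers = {"Authorization": f"Bearer {token}"}

  prefs_url = f"{INSTANCE_ENDPOINT}/v3/namespaces/default/apps/{JOB_NAME}/preferences"
  resp = requests.put(
      prefs_url,
      headers=headers,
      json={"system.profile.name": "SYSTEM:static-dataproc"},
  )
  resp.raise_for_status()
  print("Preference set:", resp.status_code)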

Example Scenario:

Let's say you initially had a Replication job replicating Tables A and B. To add a new Table C, you would need to either:

  1. Create a new Replication job that includes Tables A, B, and C.
  2. Stop the existing job, duplicate it, add Table C, and start the new job.
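
Before retiring the old job (Tables A and B), it can be worth confirming that the new job (Tables A, B, and C) is actually deployed and visible. A minimal sketch, reusing the same endpoint and auth assumptions as the earlier examples:

  # Minimal sketch: list deployed apps in the default namespace to confirm the
  # new Replication job exists before deleting the old one.
  import subprocess
  import requests

  INSTANCE_ENDPOINT = "https://<instance>-dot-<region>.datafusion.googleusercontent.com/api"  # hypothetical
  token = subprocess.check_output(
      ["gcloud", "auth", "print-access-token"], text=True
  ).strip()
  headers = {"Authorization": f"Bearer {token}"}

  apps = requests.get(
      f"{INSTANCE_ENDPOINT}/v3/namespaces/default/apps", headers=headers
  ).json()
  for app in apps:
      print(app.get("name"), app.get("artifact", {}).get("name"))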
