
Training NeMo FastPitch on Colab

 

I’m training a multi-speaker FastPitch model in NeMo using my own dataset, and have been blocked by a persistent issue: pitch tensors are reported as having shape [B] instead of [B, T], even though I’ve verified that they are correctly padded and shaped prior to being passed to the model.


 
 

What I’m doing

  • Returning a 7-field tuple from collate_fn in the exact order expected by FastPitchModel.input_types:

    (
        audio,         # [B, audio_len]
        audio_lens,    # [B]
        tokens,        # [B, T]
        tokens_lens,   # [B]
        pitch_padded,  # [B, T]
        pitch_lens,    # [B]
        attn_prior     # [B, T, T]
    )
  • Padding pitch like this (a consolidated sketch of the full collate_fn follows this list):

    pitch_padded = pad_sequence(pitch, batch_first=True)
    pitch_lens = torch.tensor([len(p) for p in pitch], dtype=torch.long)
  • Injecting collate_fn via:

    train_dl = DataLoader(..., collate_fn=custom_tuple_collate)
    model._train_dl = train_dl
    model._train_dl.dataset._collate_fn = custom_tuple_collate
  • Monkeypatching:

    def setup_training_data(self, config=None):
        self._train_dl = self._setup_dataloader_from_config(
            config=config,
            dataset=self._train_dl.dataset,
            collate_fn=custom_tuple_collate,
        )

    model.setup_training_data = MethodType(setup_training_data, model)
  • Doing the same for setup_validation_data()
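
Putting the pieces above together, the full custom_tuple_collate looks roughly like this. This is a simplified sketch: the per-item dict keys ("audio", "tokens", "pitch", "attn_prior") and the attn_prior padding are written from memory and may differ slightly from my actual notebook.

    import torch
    from torch.nn.utils.rnn import pad_sequence

    def custom_tuple_collate(batch):
        # Each dataset item is assumed to be a dict of tensors (sketch only).
        audio = pad_sequence([b["audio"] for b in batch], batch_first=True)    # [B, audio_len]
        audio_lens = torch.tensor([b["audio"].shape[0] for b in batch], dtype=torch.long)

        tokens = pad_sequence([b["tokens"] for b in batch], batch_first=True)  # [B, T]
        tokens_lens = torch.tensor([b["tokens"].shape[0] for b in batch], dtype=torch.long)

        pitch = [b["pitch"] for b in batch]
        pitch_padded = pad_sequence(pitch, batch_first=True)                   # [B, T]
        pitch_lens = torch.tensor([len(p) for p in pitch], dtype=torch.long)

        # Zero-pad the per-item attn_prior matrices to a common [B, R, C] shape.
        priors = [b["attn_prior"] for b in batch]
        max_r = max(p.shape[0] for p in priors)
        max_c = max(p.shape[1] for p in priors)
        attn_prior = torch.zeros(len(priors), max_r, max_c)
        for i, p in enumerate(priors):
            attn_prior[i, : p.shape[0], : p.shape[1]] = p

        return audio, audio_lens, tokens, tokens_lens, pitch_padded, pitch_lens, attn_prior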


Diagnostics I’ve Run

  • Confirmed pitch_padded.shape == [4, T] inside collate_fn

  • Manually ran:

    batch = next(iter(train_dl))
    model.training_step({
        "audio": batch[0],
        "audio_lens": batch[1],
        ...,
        "pitch": batch[4],
        ...
    }, batch_idx=0)

    This works, which confirms the model accepts padded pitch when it is passed as a dict.

  • But running:

    trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)

    Always fails with pitch.shape == torch.Size([B])
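
One extra check I can run (my own idea, nothing NeMo-specific) is to confirm that the DataLoader objects reaching trainer.fit() still carry my collate function and have not been silently rebuilt:

    # Both should print a reference to custom_tuple_collate; if the second one
    # shows torch's default_collate instead, NeMo has rebuilt the loader.
    print(train_dl.collate_fn)
    print(model._train_dl.collate_fn)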


What I Think Is Going Wrong

Something in NeMo or PyTorch Lightning:

  • Is unpacking the tuple incorrectly

  • Or expecting a different shape

  • Or misusing the batch structure when it hits _validate_input_types() in ModelPT
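
To pin down the ordering question, a quick sanity check (my own diagnostic, using the input_types property that the shape-check error refers to) is to print the declared input order and compare it against my tuple:

    # Print the declared input names/types in order, to compare against the
    # 7-field tuple my collate_fn returns.
    for name, neural_type in model.input_types.items():
        print(name, neural_type)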


What I Need Help With

  • What exact format should collate_fn return for FastPitch + NeMo 2.x + PyTorch Lightning 2.x?

  • Do I need to return a dict instead? (Tried it — training_step(dict) works, but trainer.fit() crashes.)

  • Is there a canonical way to override NeMo’s internal dataloader construction without it resetting the collate_fn?

Even though pitch.shape == [B, T], when collate_fn returns a tuple, NeMo seems to unpack or misinterpret the pitch field incorrectly — possibly due to mismatch with input_types order or lack of process_batch()-style conversion.
Is it possible NeMo expects process_batch() to be used implicitly, and if so, how can I inject a custom version that preserves padded pitch?

 


Environment

  • NeMo 2.x (FastPitchModel from nvidia/tts_en_fastpitch)

  • PyTorch Lightning 2.0+

  • Google Colab Pro (A100)

  • Using patched YAML with multi-speaker manifests and extracted pitch via NeMo's own compute_speaker_stats.py and get_pitch.py


Any help would be greatly appreciated — I've exhausted the debugging options I can think of.

Thanks so much!

4 REPLIES

dongm
NVIDIA Dev Advocate

Hello Mad_AI_Engineer,
Could you open a GitHub issue at https://github.com/NVIDIA/NeMo? We can get an NVIDIA engineer to help answer it.
Thanks

Thanks for the suggestion! I just posted my first GitHub issue about this. I'm not sure about the etiquette here, so I hope my post is relevant and in the right style.

https://github.com/NVIDIA/NeMo/issues/13695

Can you post the exact error and the snippet of code where it fails?

My error:

TypeError: Input shape mismatch occured for pitch in module FastPitchModel :
Input shape expected = (batch, time) |
Input shape found : torch.Size([4])


My actual code that fails:

trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)

I'm using a custom collate_fn that returns a 7-field tuple in this order:

 

    (
        audio,         # torch.Size([B, ?])
        audio_lens,    # [B]
        tokens,        # [B, T]
        tokens_lens,   # [B]
        pitch_padded,  # [B, T]  ← this is the one that fails
        pitch_lens,    # [B]
        attn_prior     # [B, T, T]
    )

…but NeMo throws an error saying:

    Input shape found: torch.Size([4])

This suggests it never receives my padded pitch, or misinterprets the batch index in unpacking.

 

When I run it manually:

batch = next(iter(train_dl))
model.training_step({
    "audio": batch[0],
    "audio_lens": batch[1],
    "tokens": batch[2],
    "tokens_lens": batch[3],
    "pitch": batch[4],
    "pitch_lens": batch[5],
    "attn_prior": batch[6],
}, batch_idx=0)


It works, so I think my pitch tensor is valid.
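
One further diagnostic I can try (my own idea, not a NeMo or Lightning API) is to wrap training_step so I can see exactly what Lightning hands the model during trainer.fit(), and whether pitch is still [B, T] at that point:

    # Wrap the bound training_step with a logger; this is only a debugging
    # sketch to inspect the batch structure trainer.fit() actually delivers.
    orig_training_step = model.training_step

    def logged_training_step(batch, batch_idx):
        if isinstance(batch, (list, tuple)):
            shapes = [tuple(x.shape) for x in batch if hasattr(x, "shape")]
            print(f"tuple batch with {len(batch)} items, shapes: {shapes}")
        elif isinstance(batch, dict):
            print("dict batch:", {k: tuple(v.shape) for k, v in batch.items()})
        return orig_training_step(batch, batch_idx)

    model.training_step = logged_training_step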
