
Training NeMo FastPitch on Colab

 

I’m training a multi-speaker FastPitch model in NeMo using my own dataset, and have been blocked by a persistent issue: pitch tensors are reported as having shape [B] instead of [B, T], even though I’ve verified that they are correctly padded and shaped prior to being passed to the model.


 
 

What I’m doing

  • Returning a 7-field tuple from collate_fn in the exact order expected by FastPitchModel.input_types:

    (
        audio,         # [B, audio_len]
        audio_lens,    # [B]
        tokens,        # [B, T]
        tokens_lens,   # [B]
        pitch_padded,  # [B, T]
        pitch_lens,    # [B]
        attn_prior     # [B, T, T]
    )
  • Padding pitch like this (a consolidated sketch of the full collate_fn follows this list):

    pitch_padded = pad_sequence(pitch, batch_first=True)
    pitch_lens = torch.tensor([len(p) for p in pitch], dtype=torch.long)
  • Injecting collate_fn via:

    train_dl = DataLoader(..., collate_fn=custom_tuple_collate)
    model._train_dl = train_dl
    model._train_dl.dataset._collate_fn = custom_tuple_collate
  • Monkeypatching:

    def setup_training_data(self, config=None):
        self._train_dl = self._setup_dataloader_from_config(
            config=config,
            dataset=self._train_dl.dataset,
            collate_fn=custom_tuple_collate,
        )

    model.setup_training_data = MethodType(setup_training_data, model)
  • Doing the same for setup_validation_data()
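
Putting the pieces above together, the full custom_tuple_collate looks roughly like this. This is a simplified sketch: the per-item dict keys ("audio", "tokens", "pitch", "attn_prior") and the attn_prior padding are written from memory and may differ slightly from my actual notebook.

    import torch
    from torch.nn.utils.rnn import pad_sequence

    def custom_tuple_collate(batch):
        # Each dataset item is assumed to be a dict of tensors (sketch only).
        audio = pad_sequence([b["audio"] for b in batch], batch_first=True)    # [B, audio_len]
        audio_lens = torch.tensor([b["audio"].shape[0] for b in batch], dtype=torch.long)

        tokens = pad_sequence([b["tokens"] for b in batch], batch_first=True)  # [B, T]
        tokens_lens = torch.tensor([b["tokens"].shape[0] for b in batch], dtype=torch.long)

        pitch = [b["pitch"] for b in batch]
        pitch_padded = pad_sequence(pitch, batch_first=True)                   # [B, T]
        pitch_lens = torch.tensor([len(p) for p in pitch], dtype=torch.long)

        # Zero-pad the per-item attn_prior matrices to a common [B, R, C] shape.
        priors = [b["attn_prior"] for b in batch]
        max_r = max(p.shape[0] for p in priors)
        max_c = max(p.shape[1] for p in priors)
        attn_prior = torch.zeros(len(priors), max_r, max_c)
        for i, p in enumerate(priors):
            attn_prior[i, : p.shape[0], : p.shape[1]] = p

        return audio, audio_lens, tokens, tokens_lens, pitch_padded, pitch_lens, attn_prior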


Diagnostics I’ve Run

  • Confirmed pitch_padded.shape == [4, T] inside collate_fn

  • Manually ran:

    batch = next(iter(train_dl))
    model.training_step({
        "audio": batch[0],
        "audio_lens": batch[1],
        ...,
        "pitch": batch[4],
        ...
    }, batch_idx=0)

    This works, which confirms the model accepts padded pitch when it is passed as a dict.

  • But running:

    trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)

    Always fails with pitch.shape == torch.Size([B])
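
One extra check I can run (my own idea, nothing NeMo-specific) is to confirm that the DataLoader objects reaching trainer.fit() still carry my collate function and have not been silently rebuilt:

    # Both should print a reference to custom_tuple_collate; if the second one
    # shows torch's default_collate instead, NeMo has rebuilt the loader.
    print(train_dl.collate_fn)
    print(model._train_dl.collate_fn)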


What I Think Is Going Wrong

Something in NeMo or PyTorch Lightning:

  • Is unpacking the tuple incorrectly

  • Or expecting a different shape

  • Or misusing the batch structure when it hits _validate_input_types() in ModelPT
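
To pin down the ordering question, a quick sanity check (my own diagnostic, using the input_types property that the shape-check error refers to) is to print the declared input order and compare it against my tuple:

    # Print the declared input names/types in order, to compare against the
    # 7-field tuple my collate_fn returns.
    for name, neural_type in model.input_types.items():
        print(name, neural_type)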


What I Need Help With

  • What exact format should collate_fn return for FastPitch + NeMo 2.x + PyTorch Lightning 2.x?

  • Do I need to return a dict instead? (Tried it — training_step(dict) works, but trainer.fit() crashes.)

  • Is there a canonical way to override NeMo’s internal dataloader construction without it resetting the collate_fn?

Even though pitch.shape == [B, T], when collate_fn returns a tuple, NeMo seems to unpack or misinterpret the pitch field incorrectly — possibly due to mismatch with input_types order or lack of process_batch()-style conversion.
Is it possible NeMo expects process_batch() to be used implicitly, and if so, how can I inject a custom version that preserves padded pitch?

 


Environment

  • NeMo 2.x (FastPitchModel from nvidia/tts_en_fastpitch)

  • PyTorch Lightning 2.0+

  • Google Colab Pro (A100)

  • Using patched YAML with multi-speaker manifests and extracted pitch via NeMo's own compute_speaker_stats.py and get_pitch.py


Any help would be greatly appreciated — I've exhausted the debugging options I can think of.

Thanks so much!

4 REPLIES

dongm
NVIDIA Dev Advocate

Hello Mad_AI_Engineer,
Could you open a GitHub issue at https://github.com/NVIDIA/NeMo? We can get an NVIDIA engineer to help answer it.
Thanks

Thanks for the suggestion! I just posted my first GitHub issue about this. I'm not sure about the etiquette here, so I hope my post is relevant and in the right style.

https://github.com/NVIDIA/NeMo/issues/13695

Can you post the exact error and the snippet of code where it fails?

My error:

TypeError: Input shape mismatch occured for pitch in module FastPitchModel :
Input shape expected = (batch, time) |
Input shape found : torch.Size([4])


My actual code that fails:

trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)

I'm using a custom collate_fn that returns a 7-field tuple in this order:

 

    (
        audio,         # torch.Size([B, ?])
        audio_lens,    # [B]
        tokens,        # [B, T]
        tokens_lens,   # [B]
        pitch_padded,  # [B, T]  ← this is the one that fails
        pitch_lens,    # [B]
        attn_prior     # [B, T, T]
    )

…but NeMo throws an error saying:

    Input shape found: torch.Size([4])

This suggests it never receives my padded pitch, or misinterprets the batch index in unpacking.

 

When I run it manually:

batch = next(iter(train_dl))
model.training_step({
    "audio": batch[0],
    "audio_lens": batch[1],
    "tokens": batch[2],
    "tokens_lens": batch[3],
    "pitch": batch[4],
    "pitch_lens": batch[5],
    "attn_prior": batch[6],
}, batch_idx=0)


It works, so I think my pitch tensor is valid.
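
One further diagnostic I can try (my own idea, not a NeMo or Lightning API) is to wrap training_step so I can see exactly what Lightning hands the model during trainer.fit(), and whether pitch is still [B, T] at that point:

    # Wrap the bound training_step with a logger; this is only a debugging
    # sketch to inspect the batch structure trainer.fit() actually delivers.
    orig_training_step = model.training_step

    def logged_training_step(batch, batch_idx):
        if isinstance(batch, (list, tuple)):
            shapes = [tuple(x.shape) for x in batch if hasattr(x, "shape")]
            print(f"tuple batch with {len(batch)} items, shapes: {shapes}")
        elif isinstance(batch, dict):
            print("dict batch:", {k: tuple(v.shape) for k, v in batch.items()})
        return orig_training_step(batch, batch_idx)

    model.training_step = logged_training_step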
