I’m training a multi-speaker FastPitch model in NeMo using my own dataset, and have been blocked by a persistent issue: pitch tensors are reported as having shape [B] instead of [B, T], even though I’ve verified that they are correctly padded and shaped prior to being passed to the model.
Returning a 7-field tuple from collate_fn in the exact order expected by FastPitchModel.input_types:
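A condensed sketch (the per-item field extraction is simplified, and I'm assuming dict-style dataset items here; the names match the dict I pass to training_step further down):

import torch
from torch.nn.utils.rnn import pad_sequence

def collate_fn(batch):
    # each item is a dict of tensors produced by my dataset
    audio = pad_sequence([b["audio"] for b in batch], batch_first=True)
    audio_lens = torch.tensor([b["audio"].shape[0] for b in batch])
    tokens = pad_sequence([b["tokens"] for b in batch], batch_first=True)
    tokens_lens = torch.tensor([b["tokens"].shape[0] for b in batch])
    pitch = pad_sequence([b["pitch"] for b in batch], batch_first=True)  # -> [B, T]
    pitch_lens = torch.tensor([b["pitch"].shape[0] for b in batch])
    attn_prior = torch.stack([b["attn_prior"] for b in batch])  # simplified; my real code pads the 2-D priors too
    return audio, audio_lens, tokens, tokens_lens, pitch, pitch_lens, attn_prior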
Padding pitch like this:
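Toy example of the padding (random contours stand in for the real pitch values from get_pitch.py):

import torch
import torch.nn.functional as F

pitches = [torch.randn(113), torch.randn(87), torch.randn(140), torch.randn(95)]
max_len = max(p.shape[0] for p in pitches)
# right-pad each 1-D contour to the longest in the batch, then stack to [B, T]
pitch_padded = torch.stack([F.pad(p, (0, max_len - p.shape[0])) for p in pitches])
assert pitch_padded.shape == (4, max_len)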
Injecting collate_fn via:
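(train_ds and val_ds are my Dataset instances; the point is only that collate_fn is passed explicitly):

from torch.utils.data import DataLoader

train_dl = DataLoader(train_ds, batch_size=4, shuffle=True, collate_fn=collate_fn)
val_dl = DataLoader(val_ds, batch_size=4, shuffle=False, collate_fn=collate_fn)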
Monkeypatching:
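Roughly like this (body simplified; the real patch just hands back the DataLoader above instead of letting NeMo rebuild one):

def _setup_training_data(self, train_data_config):
    self._train_dl = train_dl  # skip NeMo's own dataset/dataloader construction

model.setup_training_data = _setup_training_data.__get__(model)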
Doing the same for setup_validation_data()
Confirmed pitch_padded.shape == [4, T] inside collate_fn
Manually ran:
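batch = next(iter(train_dl))
model.training_step({
    "audio": batch[0],
    "audio_lens": batch[1],
    "tokens": batch[2],
    "tokens_lens": batch[3],
    "pitch": batch[4],
    "pitch_lens": batch[5],
    "attn_prior": batch[6],
}, batch_idx=0)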
→ ✅ Works fine. This confirms the model accepts padded pitch when it's passed as a dict.
But running:
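trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)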
→ ❌ Always fails with pitch.shape == torch.Size([B])
Something in NeMo or PyTorch Lightning:
Is unpacking the tuple incorrectly
Or expecting a different shape
Or misusing the batch structure when it hits _validate_input_types() in ModelPT
What exact format should collate_fn return for FastPitch + NeMo 2.x + PyTorch Lightning 2.x?
Do I need to return a dict instead? (Tried it — training_step(dict) works, but trainer.fit() crashes.)
Is there a canonical way to override NeMo’s internal dataloader construction without it resetting the collate_fn?
Even though pitch.shape == [B, T] inside collate_fn, when the batch is returned as a tuple NeMo seems to unpack or misinterpret the pitch field, possibly due to a mismatch with the input_types order or the lack of a process_batch()-style conversion.
Is it possible NeMo expects process_batch() to be used implicitly, and if so, how can I inject a custom version that preserves padded pitch?
NeMo 2.x (FastPitchModel from nvidia/tts_en_fastpitch)
PyTorch Lightning 2.0+
Google Colab Pro (A100)
Using a patched YAML config with multi-speaker manifests and pitch extracted via NeMo's own compute_speaker_stats.py and get_pitch.py
Any help would be greatly appreciated — I've exhausted the debugging options I can think of.
Thanks so much!
Hello Mad_AI_Engineer,
Could you open a GitHub issue at https://github.com/NVIDIA/NeMo? We can get an NVIDIA engineer to help answer it.
Thanks
Thanks for the suggestion! I just posted my first GitHub issue about this. I'm not familiar with the etiquette yet, so I hope my post is relevant and in the right style.
https://github.com/NVIDIA/NeMo/issues/13695
Can you post the exact error and the snippet of code where it fails?
My error:
TypeError: Input shape mismatch occured for pitch in module FastPitchModel :
Input shape expected = (batch, time) |
Input shape found : torch.Size([4])
My actual code that fails:
trainer.fit(model, train_dataloaders=train_dl, val_dataloaders=val_dl)
I'm using a custom collate_fn that returns a 7-field tuple in this order:
(audio, audio_lens, tokens, tokens_lens, pitch, pitch_lens, attn_prior)
…but NeMo throws the shape mismatch error quoted above.
This suggests it either never receives my padded pitch or misreads the field order when unpacking the batch.
When I run it manually:
batch = next(iter(train_dl))
# map the tuple fields to the names from FastPitchModel.input_types
model.training_step({
"audio": batch[0],
"audio_lens": batch[1],
"tokens": batch[2],
"tokens_lens": batch[3],
"pitch": batch[4],
"pitch_lens": batch[5],
"attn_prior": batch[6],
}, batch_idx=0)
It works, so I think my pitch tensor itself is valid.