
Google Speech to Text Empty Transcription Response

Error: empty string in the transcription response.

The issue may be with how I am using the credentials.json file (downloaded from the Google Cloud console Speech-to-Text API page and loaded in my C# 6 / .NET 4.8 code), but I am not sure whether it is the cause, since it does not throw any error.

Here is the code that authenticates with credentials.json:

public LeadController()
{
    // Get the absolute path to the root directory of the project
    var rootDirectory = Directory.GetParent(AppContext.BaseDirectory).FullName;
    var credentialPath = Path.Combine(rootDirectory, "credentials.json");

    Environment.SetEnvironmentVariable("GOOGLE_APPLICATION_CREDENTIALS", credentialPath);
    _speechClient = SpeechClient.Create(); // Initialize the SpeechClient
}
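In case the environment-variable approach is the problem, I understand the client can also be built with the credentials path passed in explicitly. A sketch, assuming the same credentialPath as above and the SpeechClientBuilder type from the Google.Cloud.Speech.V1 package:

```csharp
using Google.Cloud.Speech.V1;

// Sketch: build the client with an explicit credentials path instead of
// relying on the GOOGLE_APPLICATION_CREDENTIALS environment variable.
// credentialPath is the same path computed in the constructor above.
var builder = new SpeechClientBuilder
{
    CredentialsPath = credentialPath
};
_speechClient = builder.Build();
```

This removes any doubt about whether the environment variable was set before the client was created.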

I created a method to send the mic-recorded audio and transcribe it to text, but unfortunately the result is always an empty string.

Then I verified whether the recorded audio was correct. For that, I created a method to save the mic-recorded audio as a .wav file in the project folder. For testing, I clicked the mic icon, recorded my voice, and the recording was saved successfully as a .wav file. So the audio itself is correct.

This means the recording is correct and the frontend is successfully sending the audio to the backend as a byte array. But when I pass this byte array to the Google Speech-to-Text API, it returns a Success status, yet the result is always an empty string.

This is the frontend code that passes the mic audio to the backend; it works fine:

let mediaRecorder;
let audioChunks = [];

async function startRecording() {
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    mediaRecorder = new MediaRecorder(stream);
    audioChunks = [];

    mediaRecorder.ondataavailable = event => {
        audioChunks.push(event.data);
    };

    mediaRecorder.onstop = async () => {
        // Create a Blob from recorded chunks
        const audioBlob = new Blob(audioChunks, { type: 'audio/wav' });

        // Add audio file to FormData
        const formData = new FormData();
        formData.append('audio', audioBlob, 'audio.wav');

        // Send to backend
        try {
            const response = await fetch('/api/speech-to-text/stream3', {
                method: 'POST',
                body: formData, // Content-Type is set automatically
            });

            if (response.ok) {
                const result = await response.json();
                console.log('Transcription:', result.transcription);
            } else {
                console.error('Error:', response.statusText);
            }
        } catch (error) {
            console.error('Error uploading audio:', error);
        }
    };

    mediaRecorder.start();
}

And this is the backend C# code that receives the audio from the frontend. It receives the audio correctly, but when it sends it to the API, the response comes back empty:

[HttpPost]
[Route("api/speech-to-text/stream3")]
public async Task<IHttpActionResult> StreamAudioToText3()
{
  try
  {
    var httpRequest = HttpContext.Current.Request;
    // Convert the incoming audio stream into a Google Cloud Speech recognition request
    var audioFile = httpRequest.Files["audio"];
    byte[] audioData;
    using (var memoryStream = new MemoryStream())
    {
        await audioFile.InputStream.CopyToAsync(memoryStream);
        audioData = memoryStream.ToArray();
    }
    File.WriteAllBytes("D:\\Projects\\soldster GIT Google Speech to Text\\soldster GIT\\output.wav", audioData);

    // Create the recognition config
    var config = new RecognitionConfig
    {
        Encoding = RecognitionConfig.Types.AudioEncoding.Linear16, // Assuming linear 16-bit PCM encoding
        SampleRateHertz = 16000, // Modify based on the audio recording configuration
        LanguageCode = "en-US",
        EnableAutomaticPunctuation = true, // Optional: improves readability
        Model = "default"
    };

    // Create the streaming recognize request
    var streamingCall = _speechClient.StreamingRecognize();

    // Start the stream and send the initial request for config
    await streamingCall.WriteAsync(new StreamingRecognizeRequest
    {
        StreamingConfig = new StreamingRecognitionConfig
        {
            Config = config,
            InterimResults = true // Optionally set to true for interim results
        }
    });

    // Process each audio chunk
    foreach (var chunk in audioData)
    {
        var recognitionAudio = new RecognitionAudio
        {
            Content = ByteString.CopyFrom(chunk)
        };

        // Send the audio content to the API
        await streamingCall.WriteAsync(new StreamingRecognizeRequest
        {
            AudioContent = recognitionAudio.Content
        });
    }

    // Close the request stream
    await streamingCall.WriteCompleteAsync();

    // Process the responses from the stream
    string transcription = string.Empty;

    // Use MoveNext() to handle the response asynchronously
    var responseStream = streamingCall.GetResponseStream();
    while (await responseStream.MoveNextAsync())  // Use MoveNextAsync to check for next response
    {
        var response = responseStream.Current;

        // Check if there are results in the response
        if (response.Results.Count > 0)
        {
            var result = response.Results[0];

            // Check if there are alternatives and transcriptions available
            if (result.Alternatives.Count > 0)
            {
                var alternative = result.Alternatives[0];
                transcription += alternative.Transcript + " "; // Append the transcription
            }
        }
    }

    // Return the transcription
    return Ok(new { transcription });
}
catch (Exception ex)
{
    // Log any errors here (using a logging service, etc.)
    return InternalServerError(ex);
}
}

I researched this further and created many versions of the C# code, but I still have the same issue.

If I go to the Google Cloud console, upload the recorded .wav file, and click the transcribe button, it transcribes the file successfully. But when using the API, it returns an empty response. Here is the Google Cloud console response:

(screenshot: the Cloud Console successfully transcribing the uploaded .wav file)

I would really appreciate any guidance. Thanks

1 REPLY

Hi @TheAliHaider,

Welcome to Google Cloud Community!

The problem is that you're transmitting the audio data byte by byte (foreach (var chunk in audioData)) to the streaming API. The Google Cloud Speech-to-Text API expects audio to be sent in chunks, with each chunk representing a portion of the audio stream, not individual bytes.

Here are some potential ways to address your issue:

  • Chunking: Send the audioData array in fixed-size chunks rather than byte by byte, and adjust the chunk size to balance overhead against latency; 4096 bytes is a solid starting point.
  • Response Handling: You might want to loop through all result objects in the response to capture all transcriptions, instead of just the first one.
  • Error Handling: Although the try-catch block is implemented, consider improving it with more detailed logging to better capture and analyze potential errors from the Google Cloud Speech-to-Text API.
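The chunking fix can be sketched like this, replacing the per-byte foreach in your controller (assuming the same audioData array and streamingCall variable as in your code):

```csharp
// Sketch: send audioData in fixed-size slices instead of one byte at a time.
// 4096 bytes is a reasonable starting chunk size.
const int chunkSize = 4096;
for (int offset = 0; offset < audioData.Length; offset += chunkSize)
{
    int length = Math.Min(chunkSize, audioData.Length - offset);
    await streamingCall.WriteAsync(new StreamingRecognizeRequest
    {
        // Copy a slice of the buffer into the request, not a single byte
        AudioContent = ByteString.CopyFrom(audioData, offset, length)
    });
}
await streamingCall.WriteCompleteAsync();
```

With one-byte writes, each request carries a single audio sample fragment, which the recognizer cannot do anything useful with; slicing the buffer gives it meaningful stretches of audio to decode.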

In addition, after making the changes, test with different audio files, check the error messages and network logs, and use the File.WriteAllBytes line already in your code to save and verify the WAV file for debugging.
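The response-handling suggestion above could look like this, iterating over every result instead of only Results[0] (assuming the same responseStream as in your code):

```csharp
var transcription = new StringBuilder();
while (await responseStream.MoveNextAsync())
{
    // Walk every result in each response, not just the first one
    foreach (var result in responseStream.Current.Results)
    {
        // Skip interim hypotheses so the same phrase isn't appended
        // several times while InterimResults is enabled
        if (result.IsFinal && result.Alternatives.Count > 0)
        {
            transcription.Append(result.Alternatives[0].Transcript).Append(' ');
        }
    }
}
return Ok(new { transcription = transcription.ToString().Trim() });
```

Alternatively, if you do not need partial hypotheses at all, setting InterimResults = false in the StreamingRecognitionConfig keeps only final results in the stream.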

You may refer to the Speech-to-Text documentation on streaming recognition and audio encodings, which covers the required conceptual information and API specifications.
I hope the above information is helpful.