I'm currently using RecordRTC to get an audio blob in WAV format. I post this to several transcription AIs, like OpenAI's Whisper, with no issue, but when I send this WAV to Google Cloud Speech-to-Text v1 it doesn't work; it simply gives me a 400 (Bad Request) error. It used to work when I used the MediaRecorder API, but MediaRecorder didn't work on Safari, hence the switch to RecordRTC for the audio blob. What is wrong with my request?
export const googleTranscribe = functions.https.onRequest(
  { cors: true },
  (req, res) => {
    const { base64Audio }: GoogleTranscribeRequest = req.body;
    console.log("Google base64Audio", base64Audio);
    if (!base64Audio) {
      res.status(400).send({ message: "Missing base64Audio" });
      return;
    }
    axios.post(
      `https://speech.googleapis.com/v1/speech:recognize?key=`, // removed my key for this thread
      {
        config: {
          languageCode: "en-US",
        },
        audio: {
          content: base64Audio,
        },
      }
    )
      .then(response => {
        const results: GoogleTranscriptionResult[] | undefined = response.data.results;
        if (results && results.length > 0) {
          let transcription = "";
          for (const result of results) {
            transcription += result.alternatives[0].transcript + " ";
          }
          res.status(200).send({ transcription });
        } else {
          res.status(200).send({ transcription: "" });
        }
      })
      .catch(error => {
        console.error('Error with Google Speech-to-Text API:', error); // this is what returns 400
        res.status(500).send({ error: "Error with Google Speech-to-Text API " });
      });
  }
);
Above is the Cloud Function that tries to transcribe it. Below is the code in my React app that gets the audio blob with RecordRTC:
const startRecording = async (): Promise<void> => {
  try {
    setGoogleOneHot([]);
    setOpenaiOneHot([]);
    setAssemblyaiOneHot([]);
    setAudioBlob(null);
    setStatusMsg('Click stop when you are done reading');
    const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
    let recorder = new RecordRTCPromisesHandler(stream, {
      type: "audio",
      mimeType: "audio/wav",
      recorderType: RecordRTC.StereoAudioRecorder, // force for all browsers
    });
    recorder.startRecording();
    setRecordRTC(recorder);
    setRecording(true);
    console.log('Recording started');
    // Set timeout for auto-stop after 30 seconds
    setTimeout(async () => {
      if (await recorder.getState() === 'recording') {
        console.log('Auto-stopping recording after 30 seconds');
        stopRecording();
        setRecording(false);
      }
    }, 29000);
  } catch (error) {
    console.error('Error getting user media:', error);
  }
};
const stopRecording = async (): Promise<void> => {
  if (recordRTC) {
    await recordRTC.stopRecording();
    let blob = await recordRTC.getBlob();
    handleTranscription(blob);
    console.log('Recording stopped');
    setRecording(false);
  }
};
The handleTranscription method is what calls my Cloud Function; the relevant lines are:
const base64Audio = await TranscriptionService.blobToBase64(audioBlob);
const googlePromise = TranscriptionService.googleTranscribe(base64Audio, calibrationWords);
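For context, those two lines sit inside handleTranscription, which is roughly shaped like this (a simplified sketch, not the exact method; the other transcription calls and state updates are omitted):

// Simplified sketch of handleTranscription: convert the blob to base64,
// then kick off the Google transcription via the service below.
const handleTranscription = async (audioBlob: Blob): Promise<void> => {
  const base64Audio = await TranscriptionService.blobToBase64(audioBlob);
  const googlePromise = TranscriptionService.googleTranscribe(base64Audio, calibrationWords);
  const googleTranscription = await googlePromise;
  console.log('Google transcription:', googleTranscription);
};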
This is the code in my transcription service:
static async blobToBase64(blob: Blob): Promise<string> {
  return new Promise((resolve, _) => {
    const reader = new FileReader();
    reader.onloadend = () => resolve(reader.result as string);
    reader.readAsDataURL(blob);
  });
}

static async googleTranscribe(base64Audio: string, wordsToLookFor: string[]): Promise<string> {
  try {
    // Only send first 15 words
    if (wordsToLookFor.length > 15) {
      wordsToLookFor = wordsToLookFor.slice(0, 15);
    }
    const body: GoogleTranscribeRequest = { base64Audio };
    const result = await axios.post(GOOGLE, body);
    console.log(result.data.transcription);
    return result.data.transcription;
  } catch (error) {
    console.error('Google Transcription Error:', error);
    return ''; // Return empty string on error
  }
}
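One thing I noticed while comparing the conversion helpers: reader.result from readAsDataURL is a full data URL (e.g. "data:audio/wav;base64,..."), whereas the btoa-based helper further down produces only the raw base64 payload. A prefix-stripping variant would look something like this (blobToRawBase64 is just a name for this sketch, not something I actually have in my codebase):

// Sketch: strip the "data:<mime>;base64," prefix so only the raw payload remains.
static async blobToRawBase64(blob: Blob): Promise<string> {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onloadend = () => {
      const dataUrl = reader.result as string;
      resolve(dataUrl.substring(dataUrl.indexOf(',') + 1)); // keep everything after "base64,"
    };
    reader.onerror = reject;
    reader.readAsDataURL(blob);
  });
}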
If the above stays the same but I get my audio like this instead, then it works, but I don't understand why:
const startRecording = async (): Promise<void> => {
  setRecording(true);
  resetState();
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const mediaRecorder = new MediaRecorder(stream);
  const audioChunks: Blob[] = [];

  mediaRecorder.ondataavailable = (event) => {
    audioChunks.push(event.data);
  };

  // Set timeout for auto-stop after 30 seconds
  setTimeout(() => {
    if (mediaRecorder.state === 'recording') {
      console.log('Auto-stopping recording after 30 seconds');
      mediaRecorder.stop();
      setRecording(false);
    }
  }, 29000);

  mediaRecorder.onstop = async () => {
    const blob = new Blob(audioChunks, { type: 'audio/wav' });
    setAudioBlob(blob);
    if (audioRef.current) {
      audioRef.current.src = URL.createObjectURL(blob);
    }
  };

  mediaRecorder.start(); // start capturing audio
};
I also tried this method to convert the blob to a base64 string, but it doesn't work with the RecordRTC blob either, so I don't think the issue is there:
export const audioBlobToBase64 = (blob: Blob): Promise<string> => {
  return new Promise((resolve, reject) => {
    const reader = new FileReader();
    reader.onloadend = () => {
      const arrayBuffer = reader.result as ArrayBuffer;
      const base64Audio = btoa(
        new Uint8Array(arrayBuffer).reduce(
          (data, byte) => data + String.fromCharCode(byte),
          ''
        )
      );
      resolve(base64Audio);
    };
    reader.onerror = reject;
    reader.readAsArrayBuffer(blob);
  });
};
The two approaches above work and can upload to Google Speech-to-Text, but I don't understand the difference. I had a look at the base64 strings and they seemed similar enough, and if the audio works with OpenAI then surely the WAV isn't necessarily bad. I also tried adding FLAC and LINEAR16 in the config (along the lines of the sketch below), but neither worked. What could the issue be?
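For reference, an explicit-encoding config for the v1 recognize call, along the lines of what I tried, would look like this (the sample rate and channel count here are placeholder assumptions, not values I verified against the actual recording):

// Sketch of a config with an explicit encoding for speech:recognize.
// LINEAR16 / FLAC are the encodings I tried; the numeric values below are
// placeholders and may not match what StereoAudioRecorder actually produced.
const requestBody = {
  config: {
    encoding: "LINEAR16", // also tried "FLAC"
    sampleRateHertz: 44100, // placeholder
    audioChannelCount: 2, // placeholder
    languageCode: "en-US",
  },
  audio: {
    content: base64Audio,
  },
};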