
I am trying to do some speech-to-speech translation using the Google Speech-to-Text API (together with the Translation and Text-to-Speech APIs). I want a person to speak into the microphone and have that speech transcribed to text. I used the streaming-audio tutorial from the Google documentation as the basis for this method. I also want the audio stream to stop once the person stops speaking.

Here is the modified method:

public static String streamingMicRecognize(String language) throws Exception {

        ResponseObserver<StreamingRecognizeResponse> responseObserver = null;
        try (SpeechClient client = SpeechClient.create()) {

            responseObserver =
                    new ResponseObserver<StreamingRecognizeResponse>() {
                ArrayList<StreamingRecognizeResponse> responses = new ArrayList<>();

                public void onStart(StreamController controller) {}

                public void onResponse(StreamingRecognizeResponse response) {
                    responses.add(response);
                }

                public void onComplete() {
                    SPEECH_TO_TEXT_ANSWER = "";
                    for (StreamingRecognizeResponse response : responses) {
                        StreamingRecognitionResult result = response.getResultsList().get(0);
                        SpeechRecognitionAlternative alternative = result.getAlternativesList().get(0);
                        System.out.printf("Transcript : %s\n", alternative.getTranscript());
                        SPEECH_TO_TEXT_ANSWER = SPEECH_TO_TEXT_ANSWER + alternative.getTranscript();
                    }
                }

                public void onError(Throwable t) {
                    System.out.println(t);
                }
            };

            ClientStream<StreamingRecognizeRequest> clientStream =
                    client.streamingRecognizeCallable().splitCall(responseObserver);

            RecognitionConfig recognitionConfig =
                    RecognitionConfig.newBuilder()
                    .setEncoding(RecognitionConfig.AudioEncoding.LINEAR16)
                    .setLanguageCode(language)
                    .setSampleRateHertz(16000)
                    .build();
            StreamingRecognitionConfig streamingRecognitionConfig =
                    StreamingRecognitionConfig.newBuilder().setConfig(recognitionConfig).build();

            StreamingRecognizeRequest request =
                    StreamingRecognizeRequest.newBuilder()
                    .setStreamingConfig(streamingRecognitionConfig)
                    .build(); // The first request in a streaming call has to be a config

            clientStream.send(request);
            // SampleRate:16000Hz, SampleSizeInBits: 16, Number of channels: 1, Signed: true,
            // bigEndian: false
            AudioFormat audioFormat = new AudioFormat(16000, 16, 1, true, false);
            DataLine.Info targetInfo =
                    new Info(
                            TargetDataLine.class,
                            audioFormat); // Set the system information to read from the microphone audio stream

            if (!AudioSystem.isLineSupported(targetInfo)) {
                System.out.println("Microphone not supported");
                System.exit(0);
            }
            // Target data line captures the audio stream the microphone produces.
            TargetDataLine targetDataLine = (TargetDataLine) AudioSystem.getLine(targetInfo);
            targetDataLine.open(audioFormat);
            targetDataLine.start();
            System.out.println("Start speaking");
            playMP3("beep-07.mp3");
            long startTime = System.currentTimeMillis();
            // Audio Input Stream
            AudioInputStream audio = new AudioInputStream(targetDataLine);
            long estimatedTime = 0, estimatedTimeStoppedSpeaking = 0, startStopSpeaking = 0;
            int currentSoundLevel = 0;
            Boolean hasSpoken = false;
            while (true) {
                estimatedTime = System.currentTimeMillis() - startTime;
                byte[] data = new byte[6400];
                audio.read(data);

                currentSoundLevel = calculateRMSLevel(data);
                System.out.println(currentSoundLevel);

                if (currentSoundLevel > 20) {
                    estimatedTimeStoppedSpeaking = 0;
                    startStopSpeaking = 0;
                    hasSpoken = true;
                }
                else {
                    if (startStopSpeaking == 0) {
                        startStopSpeaking = System.currentTimeMillis();
                    }
                    estimatedTimeStoppedSpeaking = System.currentTimeMillis() - startStopSpeaking;
                }

                if ((estimatedTime > 15000) || (estimatedTimeStoppedSpeaking > 1000 && hasSpoken)) { // 15 seconds or stopped speaking for 1 second
                    playMP3("beep-07.mp3");
                    System.out.println("Stop speaking.");
                    targetDataLine.stop();
                    targetDataLine.drain();
                    targetDataLine.close();
                    break;
                }
                request =
                        StreamingRecognizeRequest.newBuilder()
                        .setAudioContent(ByteString.copyFrom(data))
                        .build();
                clientStream.send(request);
            }
        } catch (Exception e) {
            System.out.println(e);
        }
        responseObserver.onComplete();
        String ans = SPEECH_TO_TEXT_ANSWER;
        return ans;
    }
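The `calculateRMSLevel` helper used for the silence check above is not shown; it converts each 6400-byte buffer into a rough loudness level. A minimal sketch of such a helper for this 16-bit little-endian mono format (illustrative, not necessarily identical to the version used here):

```java
// Rough decibel-style loudness of a buffer of 16-bit little-endian mono PCM.
public class RmsLevel {
    public static int calculateRMSLevel(byte[] audioData) {
        long sumOfSquares = 0;
        int sampleCount = audioData.length / 2; // 2 bytes per 16-bit sample
        for (int i = 0; i + 1 < audioData.length; i += 2) {
            // Reassemble each little-endian 16-bit signed sample.
            int sample = (short) ((audioData[i] & 0xFF) | (audioData[i + 1] << 8));
            sumOfSquares += (long) sample * sample;
        }
        if (sampleCount == 0) {
            return 0;
        }
        double rms = Math.sqrt((double) sumOfSquares / sampleCount);
        // Silence maps to 0; each 10x jump in amplitude adds ~20 to the level.
        return rms < 1 ? 0 : (int) (20 * Math.log10(rms));
    }
}
```

With a helper like this, the `currentSoundLevel > 20` threshold in the loop corresponds to a fairly quiet signal, so it may need tuning per microphone.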

The output should be the transcribed text as a string. However, it is very inconsistent. Most of the time it returns an empty string, but sometimes the program does work and does return the transcript.

I also tried recording the audio separately while the program was running. Even when the method returned an empty string, saving that separately recorded audio to a file and sending it directly through the API returned the correct transcript.
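For that side-by-side test, the captured PCM bytes can be written out as a WAV file with the same JDK `javax.sound.sampled` API used for capture. A minimal sketch (the class and method names here are illustrative), using the same 16 kHz / 16-bit / mono format the recognizer is configured for:

```java
import javax.sound.sampled.AudioFileFormat;
import javax.sound.sampled.AudioFormat;
import javax.sound.sampled.AudioInputStream;
import javax.sound.sampled.AudioSystem;
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;

public class WavDump {
    // Writes raw 16 kHz / 16-bit / mono little-endian PCM bytes out as a WAV file,
    // matching the AudioFormat the streaming recognizer expects.
    public static void dumpToWav(byte[] pcm, File out) throws IOException {
        AudioFormat format = new AudioFormat(16000, 16, 1, true, false);
        long frames = pcm.length / format.getFrameSize(); // 2 bytes per frame (mono, 16-bit)
        try (AudioInputStream stream =
                new AudioInputStream(new ByteArrayInputStream(pcm), format, frames)) {
            AudioSystem.write(stream, AudioFileFormat.Type.WAVE, out);
        }
    }
}
```

Accumulating each `data` buffer from the loop into one array and dumping it this way makes it easy to check whether the audio actually sent to the API was intact.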

I don't understand why/how the program only works some of the time.
