java - Real-time identification of non-speech, non-music sound from a continuous microphone stream

Question

I'm looking to log events corresponding to a specific sound, such as a car door slamming, or perhaps a toaster ejecting toast.

The system needs to be more sophisticated than a "loud noise detector"; it needs to be able to distinguish that specific sound from other loud noises.

The identification need not be zero-latency, but the processor needs to keep up with a continuous stream of incoming data from a microphone that is always on.

Is this task significantly different than speech recognition, or could I make use of speech recognition libraries/toolkits to identify these non-speech sounds?
Given the requirement that I only need to match one sound (as opposed to matching among a library of sounds), are there any special optimizations I can do?

This answer indicates that a matched filter would be appropriate, but I am hazy on the details. I don't believe a simple cross-correlation on the audio waveform data between a sample of the target sound and the microphone stream would be effective, due to variations in the target sound.

My question is also similar to this, which didn't get much attention.

score 3 · Accepted Answer

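Cowling's (2004) PhD thesis, "Non-Speech Environmental Sound Classification System for Autonomous Surveillance", reports experimental results for different techniques of audio feature extraction and classification. He uses environmental sounds such as jangling keys and footsteps, and was able to achieve an accuracy of 70%:

The best technique is found to be either Continuous Wavelet Transform feature extraction with Dynamic Time Warping or Mel-Frequency Cepstral Coefficients with Dynamic Time Warping. Both of these techniques achieve a 70% recognition rate.

If you restrict yourself to one sound, perhaps you could achieve a higher recognition rate?

The author also mentions that techniques that work fairly well for speech recognition (Learning Vector Quantization and Neural Networks) do not work so well for environmental sounds.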
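For orientation, here is a rough sketch of dynamic time warping, the matching step named in both of the best-performing techniques quoted above. It is not taken from the thesis: the class name DtwSketch, the method names, and the choice of Euclidean distance between frames are my own assumptions, and feature extraction (MFCC or wavelet) is assumed to have happened already, giving one feature vector per frame.

```java
import java.util.Arrays;

/** Minimal dynamic time warping (DTW) sketch; all names here are illustrative. */
final class DtwSketch {

    /** Euclidean distance between two feature frames of equal length. */
    static double frameDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Classic O(n*m) DTW cost between two sequences of feature frames
     * (for example, one MFCC vector per frame). Lower means more similar.
     */
    static double dtwDistance(double[][] template, double[][] candidate) {
        int n = template.length;
        int m = candidate.length;
        // cost[i][j] = cheapest alignment of the first i template frames
        // with the first j candidate frames.
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) {
            Arrays.fill(row, Double.POSITIVE_INFINITY);
        }
        cost[0][0] = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double d = frameDistance(template[i - 1], candidate[j - 1]);
                // Allowed moves: diagonal match, or stretching either sequence.
                double best = Math.min(cost[i - 1][j - 1],
                        Math.min(cost[i - 1][j], cost[i][j - 1]));
                cost[i][j] = d + best;
            }
        }
        return cost[n][m];
    }
}
```

With a single target sound, one would presumably compare each incoming window against one or a few recorded templates with dtwDistance and accept a match when the cost falls below a threshold tuned on recordings. That threshold is something you would have to determine experimentally.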

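I also found a more recent article: Detecting Audio Events for Semantic Video Search, by Bugalho et al. (2009), in which they detect sound events in movies (gunshots, explosions, etc.).

I have no experience in this area. I merely stumbled upon this material because your question piqued my interest. I am posting my findings here in the hope that they help with your research.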

score 3 · Accepted Answer

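I found an interesting paper on this topic:

Vehicle Sound Signature Recognition by Frequency Vector Principal Component Analysis, by Huadong Wu, Mel Siegel, and Pradeep Khosla (IEEE Transactions on Instrumentation and Measurement, vol. 48, no. 5, October 1999)

It should work for your application as well as, if not better than, it does for vehicle sounds.

When analyzing training data, it...

Takes samples of 200 ms.
Does a Fourier transform (FFT) on each sample.
Does a principal component analysis on the frequency vectors:
- Computes the mean of all samples of this class.
- Subtracts the mean from the samples.
- Computes the eigenvectors of the mean covariance matrix (the mean of the outer products of each mean-subtracted vector with itself).
- Stores the mean and the most significant eigenvectors.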
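The training steps above could be sketched roughly as follows. This is my own illustration, not code from the paper: the class SoundPcaTrainer and its method names are hypothetical, a naive DFT stands in for a real FFT library to keep the example self-contained, and power iteration with deflation is just one simple way to get the leading eigenvectors. Windowing and any normalization the paper may apply are left out.

```java
/** Sketch of the training steps; all names are illustrative assumptions. */
final class SoundPcaTrainer {

    /** Magnitude spectrum via a naive O(n^2) DFT (an FFT library would be faster). */
    static double[] magnitudeSpectrum(double[] frame) {
        int n = frame.length;
        double[] mag = new double[n / 2];
        for (int k = 0; k < mag.length; k++) {
            double re = 0.0, im = 0.0;
            for (int t = 0; t < n; t++) {
                double angle = 2.0 * Math.PI * k * t / n;
                re += frame[t] * Math.cos(angle);
                im -= frame[t] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    /** Element-wise mean of the frequency vectors of one class. */
    static double[] mean(double[][] vectors) {
        double[] mu = new double[vectors[0].length];
        for (double[] v : vectors) {
            for (int i = 0; i < mu.length; i++) mu[i] += v[i] / vectors.length;
        }
        return mu;
    }

    /** Average of the outer products of the mean-subtracted vectors. */
    static double[][] covariance(double[][] vectors, double[] mu) {
        int d = mu.length;
        double[][] cov = new double[d][d];
        for (double[] v : vectors) {
            for (int i = 0; i < d; i++) {
                for (int j = 0; j < d; j++) {
                    cov[i][j] += (v[i] - mu[i]) * (v[j] - mu[j]) / vectors.length;
                }
            }
        }
        return cov;
    }

    /** Leading k eigenvectors by power iteration with deflation (fixed iteration count). */
    static double[][] topEigenvectors(double[][] cov, int k) {
        int d = cov.length;
        double[][] a = new double[d][];
        for (int i = 0; i < d; i++) a[i] = cov[i].clone();
        double[][] result = new double[k][];
        for (int c = 0; c < k; c++) {
            // Arbitrary non-zero start vector, normalized to unit length.
            double[] v = new double[d];
            double startNorm = 0.0;
            for (int i = 0; i < d; i++) { v[i] = 1.0 + 0.01 * i; startNorm += v[i] * v[i]; }
            startNorm = Math.sqrt(startNorm);
            for (int i = 0; i < d; i++) v[i] /= startNorm;
            double lambda = 0.0;
            for (int iter = 0; iter < 500; iter++) {
                double[] w = new double[d];
                for (int i = 0; i < d; i++) {
                    for (int j = 0; j < d; j++) w[i] += a[i][j] * v[j];
                }
                double norm = 0.0;
                for (double x : w) norm += x * x;
                norm = Math.sqrt(norm);
                if (norm < 1e-12) break;          // nothing left to extract
                for (int i = 0; i < d; i++) v[i] = w[i] / norm;
                lambda = norm;                    // eigenvalue estimate (matrix is PSD)
            }
            result[c] = v;
            // Deflate: remove the component just found before the next pass.
            for (int i = 0; i < d; i++) {
                for (int j = 0; j < d; j++) a[i][j] -= lambda * v[i] * v[j];
            }
        }
        return result;
    }
}
```

In use, each 200 ms frame of training audio would go through magnitudeSpectrum, and the resulting vectors through mean, covariance, and topEigenvectors. How many eigenvectors to keep is a tuning decision that the summary above does not specify.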

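Then, to classify a sound, it...

Takes a sample of 200 ms (S).
Does a Fourier transform on the sample.
Subtracts the mean of the class (C) from the frequency vector (F).
Multiplies the frequency vector by each eigenvector of C, giving one number per eigenvector.
Subtracts the product of each number and the corresponding eigenvector from F.
Takes the length of the resulting vector.
If this value is below some constant, S is recognized as belonging to class C.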
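
As a sketch of the classification steps above: the code below measures how far a frequency vector lies from the subspace spanned by the class eigenvectors. The class name SoundPcaClassifier, the method names, and the threshold parameter are hypothetical. It assumes the frequency vector F has already been computed, and that the stored eigenvectors are of unit length and mutually orthogonal.

```java
/** Sketch of the classification steps; all names are illustrative assumptions. */
final class SoundPcaClassifier {

    /**
     * Length of what remains of the frequency vector f after subtracting the
     * class mean and removing its projection onto each class eigenvector.
     * Assumes the eigenvectors are orthonormal.
     */
    static double residualLength(double[] f, double[] classMean, double[][] eigenvectors) {
        int d = f.length;
        double[] centered = new double[d];
        for (int i = 0; i < d; i++) centered[i] = f[i] - classMean[i];

        double[] residual = centered.clone();
        for (double[] e : eigenvectors) {
            // Dot product: one number per eigenvector.
            double coeff = 0.0;
            for (int i = 0; i < d; i++) coeff += centered[i] * e[i];
            // Remove that component from the vector.
            for (int i = 0; i < d; i++) residual[i] -= coeff * e[i];
        }

        double sum = 0.0;
        for (double x : residual) sum += x * x;
        return Math.sqrt(sum);
    }

    /** True if the residual is below the threshold (a constant to be tuned on real data). */
    static boolean belongsToClass(double[] f, double[] classMean,
                                  double[][] eigenvectors, double threshold) {
        return residualLength(f, classMean, eigenvectors) < threshold;
    }
}
```

Since you only care about one target sound, a single class mean and eigenvector set would be enough, and the per-frame work is a handful of dot products, which should be cheap enough to keep up with a live microphone stream. The threshold would have to be chosen by testing against recordings of both the target sound and other loud noises.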
