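I am trying to detect trigger words such as "Hi Siri" or "Ok Google" in Python. My approach is to record the trigger word and some output words in wav files, then read them and extract features with pyAudioAnalysis. Finally, I compute the cosine similarity between the trigger word's features and the features extracted from a sliding window over the output. The problem is that the similarity comes out low even for exactly the same word. The code is as follows: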
import numpy as np
from scipy import spatial

# window(seq, n) is a sliding-window helper that is not shown in this post

def match_transcription(tar, out):
    "returns list of correlations where both audio transcriptions match"
    print(tar.shape)  # trigger word's features (34, num of frames)
    print(out.shape)  # output sound's features (34, num of frames)
    sims = []  # will hold similarities for all features, for all chunks
    for i in range(tar.shape[0]):  # loop over all features
        chunk_tar = tar[i]  # pick one feature from target
        chunk_out = out[i]  # pick same feature from output
        sims1 = []
        chunk_outs = window(chunk_out, tar.shape[1])  # sliding windows over the output feature
        for chunk in chunk_outs:  # loop over all windows of the output
            sim = 1 - spatial.distance.cosine(chunk, chunk_tar)  # cosine similarity between window and target
            sims1.append(sim)  # add similarity to list
        sims.append(np.array(sims1))
    sims = np.array(sims)
    means = np.mean(sims, axis=0)  # average over all features, per window position
    print(sims)
    print(means)
    return means
The output looks like this:
Mean Similarity: [0.25522565 0.25120983 0.25925772 0.27925796 0.28873657 0.289228
0.3081794 0.3477496 0.33269364 0.34055122 0.34868945 0.33925324
0.34162649 0.32976345 0.32332807 0.33668049 0.34458411 0.36058285
0.37208687 0.37574359 0.400042 0.40289759 0.3872925 0.35079805
0.36320806 0.36803756 0.35871608 0.35921478 0.36508046 0.39065785
0.40899824 0.43283008 0.43767465 0.42003872 0.41108351 0.41531505
0.39725584 0.38569253 0.35555717 0.36983754 0.37081652 0.39188315]
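The output shows that none of the sliding windows over the output has much similarity to the trigger word, even though the same word is spoken.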
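One way to check the comparison logic on its own, independent of the audio, is to run it on synthetic data where the answer is known. The sketch below is only an illustration under stated assumptions: the `window` helper is not shown in this post, so the stride-1 definition here is hypothetical, `mean_similarity` is an illustrative name, and cosine similarity is computed directly with NumPy (which should be equivalent to `1 - spatial.distance.cosine`). A random 34 x 10 "target" is copied verbatim into a longer random matrix, so the mean similarity should peak at that offset.

```python
import numpy as np


def window(seq, n):
    """Hypothetical helper: yield every contiguous slice of length n (stride 1)."""
    for start in range(len(seq) - n + 1):
        yield seq[start:start + n]


def cosine_similarity(a, b):
    """Cosine of the angle between two 1-D vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def mean_similarity(tar, out):
    """Cosine similarity per window position, averaged over feature rows."""
    sims = []
    for feat_tar, feat_out in zip(tar, out):  # one row per feature
        sims.append([cosine_similarity(chunk, feat_tar)
                     for chunk in window(feat_out, tar.shape[1])])
    return np.mean(np.array(sims), axis=0)


rng = np.random.default_rng(0)
tar = rng.random((34, 10))            # stand-in for trigger-word features
out = rng.random((34, 50))            # stand-in for output-sound features
offset = 20
out[:, offset:offset + 10] = tar      # embed the target verbatim

means = mean_similarity(tar, out)
best = int(np.argmax(means))
print(best, means[best])              # best window position and its score
```

If the logic is sound, the best position is the embedding offset and its mean similarity is 1.0 up to floating-point rounding. If real recordings of the same word still score low, the cause is more likely in the features or in the alignment of the two recordings than in the loop itself.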
Feature extraction is the same for the trigger word and the output sounds:
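from pyAudioAnalysis import audioBasicIO
from pyAudioAnalysis.audioFeatureExtraction import stFeatureExtraction

def get_features(f_name):
    "returns short term features from the audio"
    [Fs, x] = audioBasicIO.readAudioFile(f_name)
    # 50 ms analysis window with a 25 ms step, both expressed in samples
    F, f_names = stFeatureExtraction(x, Fs, 0.050*Fs, 0.025*Fs)
    return F, f_names

F1, f1_names = get_features('trigger_word.wav')  # done for all output sounds as well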
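My question is: which of the 34 features are relevant for checking the similarity between the trigger word and the output sounds? Or is there another way to do the same job in Python? Thanks!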