When measuring the effectiveness of a Voice Activity Detection (VAD) algorithm, looking at 0-1 accuracy is rarely enough. We typically also look at the Nonspeech Hit Rate (HR0) and the Speech Hit Rate (HR1).
- HR0 is computed as the ratio of the number of correctly detected nonspeech frames to the number of real nonspeech frames.
- HR1 is computed as the ratio of the number of correctly detected speech frames to the number of real speech frames.
Park et al. 2014 
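As a quick illustration of why accuracy alone can mislead, here is a minimal sketch on made-up labels. The helper name `hit_rates` and the toy data are assumptions for illustration and are not taken from Park et al.; the sketch assumes labels are coded 0 for nonspeech and 1 for speech, and that both classes occur at least once.

```python
import numpy as np

def hit_rates(y, y_hat):
    """Return (hr0, hr1) for binary frame labels: 0 = nonspeech, 1 = speech."""
    y = np.asarray(y)
    y_hat = np.asarray(y_hat)
    hr0 = np.mean(y_hat[y == 0] == 0)  # correctly detected nonspeech / real nonspeech
    hr1 = np.mean(y_hat[y == 1] == 1)  # correctly detected speech / real speech
    return hr0, hr1

# Made-up ground truth: 8 nonspeech frames followed by 2 speech frames.
y = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
# A degenerate detector that always predicts nonspeech.
y_hat = np.zeros_like(y)

accuracy = np.mean(y_hat == y)
hr0, hr1 = hit_rates(y, y_hat)
print(accuracy, hr0, hr1)  # 0.8 1.0 0.0
```

The detector never finds a single speech frame, yet its accuracy is 0.8 because most frames are nonspeech. Reporting HR0 and HR1 separately exposes that failure.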
Another way to put it: HR0 and HR1 are the percentages of nonspeech and speech frames, respectively, that are correctly predicted. In Python, this can be calculated in the following way:

    import numpy as np
    import our_vad_library as VAD  # placeholder name for your own VAD library

    X = VAD.load_data()
    y = VAD.load_targets()
    y_hat = VAD.predict(X)

    # Find nonspeech and speech hit rates:
    index0 = np.where(y == 0)  # positions of nonspeech frames
    index1 = np.where(y == 1)  # positions of speech frames
    hr0 = (y_hat[index0] == y[index0]).mean()
    hr1 = (y_hat[index1] == y[index1]).mean()
First we create two index arrays into y using NumPy's where() function. index0 holds all the positions of y that represent a silent (nonspeech) frame in our data. Say y = [0,0,0,1,1,0], then index0 = [0,1,2,5], since y[0] = y[1] = y[2] = y[5] = 0. This means that

    print(y[index0])  # -> [0 0 0 0]

On its own, that is not very interesting. However, we can use the same index to pull out the corresponding predictions in ŷ and compare them to the ground truth in y:

    y_hat[index0] == y[index0]  # -> array([True, True, False, ...], dtype=bool)

This gives us a new array of the same dimensions with boolean True and False values. Each True represents a correct prediction and each False an incorrect one. A neat Python trick is that boolean values are treated as 1 and 0, so we can take the mean of this boolean array to get the fraction of correct predictions using the mean() method.
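
To make the steps concrete, here is a small self-contained sketch that reuses y = [0,0,0,1,1,0] from the example above. The y_hat values are made up purely for illustration.

```python
import numpy as np

y = np.array([0, 0, 0, 1, 1, 0])      # ground truth from the example above
y_hat = np.array([0, 1, 0, 1, 0, 0])  # made-up predictions, for illustration only

index0 = np.where(y == 0)  # tuple holding one array: positions 0, 1, 2, 5
index1 = np.where(y == 1)  # tuple holding one array: positions 3, 4

correct0 = y_hat[index0] == y[index0]  # [True, False, True, True]
correct1 = y_hat[index1] == y[index1]  # [True, False]

hr0 = correct0.mean()  # 3 of 4 nonspeech frames predicted correctly
hr1 = correct1.mean()  # 1 of 2 speech frames predicted correctly
print(hr0, hr1)  # 0.75 0.5
```

Note that np.where() with a single argument returns a tuple of index arrays, which is why index0 can be used directly to index both y and y_hat.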
 Park, Jinsoo, Wooil Kim, David K. Han, and Hanseok Ko. “Voice Activity Detection in Noisy Environments Based on Double-Combined Fourier Transform and Line Fitting.” The Scientific World Journal 2014 (August 6, 2014): e146040. doi:10.1155/2014/146040.