Google researchers have published a deep learning-based approach to sound pitch estimation. The method is the first able to estimate pitch in a self-supervised way, i.e. without annotated data.
As the researchers note in their paper, sound pitch estimation has received a lot of attention over the past decades, mostly because of its importance in many application domains. Nevertheless, traditional signal processing methods have so far outperformed the more recent machine learning approaches on this task.
Part of the reason is the lack of large labeled datasets and the difficulty of obtaining such data. In their paper, Beat Gfeller and his fellow researchers propose an approach to sound pitch estimation that does not require labeled data.
SPICE, as the new method is called, learns to estimate sound pitch without any pitch labels. The researchers use self-supervised learning: the model is trained on a pretext (proxy) task, and in solving that task it also learns how to estimate pitch. They designed an encoder-decoder convolutional architecture that takes a reference signal together with a pitch-shifted copy of that signal. The loss function forces the difference between the embeddings of the two signals to be proportional to the known difference in pitch between them. The architecture is depicted in the image below.
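The loss described above can be illustrated with a small sketch. This is not the authors' implementation: the function names, the scale factor `sigma`, the use of a Huber penalty, and the toy numbers are all assumptions made for illustration. It only shows the idea that the difference between two scalar embeddings is pushed toward a value proportional to the known pitch shift.

```python
import numpy as np


def huber(x, delta=1.0):
    """Huber penalty: quadratic near zero, linear for large errors."""
    abs_x = np.abs(x)
    return np.where(abs_x <= delta, 0.5 * x ** 2, delta * (abs_x - 0.5 * delta))


def relative_pitch_loss(emb_ref, emb_shifted, shift, sigma=0.01):
    """Penalize embedding differences that are not proportional to the shift.

    emb_ref, emb_shifted: scalar embeddings (one per example) of the reference
        signal and of its pitch-shifted copy.
    shift: known pitch shift applied to each example (hypothetical units,
        e.g. frequency bins or semitones).
    sigma: assumed proportionality constant between shift and embedding.
    """
    emb_ref = np.asarray(emb_ref, dtype=float)
    emb_shifted = np.asarray(emb_shifted, dtype=float)
    shift = np.asarray(shift, dtype=float)
    error = (emb_shifted - emb_ref) - sigma * shift
    return float(np.mean(huber(error)))


# Toy data: three examples with known shifts.
shift = np.array([2.0, -3.0, 5.0])
emb_ref = np.array([0.40, 0.55, 0.30])

# An encoder whose output moves exactly in proportion to the shift.
loss_perfect = relative_pitch_loss(emb_ref, emb_ref + 0.01 * shift, shift)

# An encoder that ignores the shift and returns the same embedding.
loss_constant = relative_pitch_loss(emb_ref, emb_ref, shift)

print(loss_perfect, loss_constant)
```

In this toy setting, the encoder whose embedding tracks the shift gets a loss close to zero, while the one that ignores the shift is penalized. No pitch labels appear anywhere: the only supervision is the shift that was applied to create the second signal.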
The proposed method was evaluated on publicly available datasets, and the researchers showed that it outperforms traditional methods based on handcrafted features. They also report that it performs on par with CREPE, a state-of-the-art pitch estimation method that relies on supervised learning.
In collaboration with YouTube, the model was deployed in a web app called FreddieMeter, where users can score their singing against Freddie Mercury's in terms of pitch, timbre and melody.