Automatic Speech Recognition (ASR)
Speaker Diarization answers the question: who spoke when?
What is Speaker Diarization and How Does Speaker Diarization Work?
The fundamental task of Speaker Diarization is to apply speaker labels to each utterance in the transcription.
The first step is to break the audio file into a set of “utterances.” What constitutes an utterance?
Utterances generally run from about half a second to 10 seconds of speech.
Just as a single word wouldn’t be enough for a human to identify a speaker, Machine Learning models also need more data to identify speakers.
There are many ways to break up an audio/video file into a set of utterances, with one common way being to use silence and punctuation markers.
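As a rough illustration of the silence-based approach, the sketch below splits audio wherever the signal stays quiet for long enough. It is not the method of any particular diarization system. It assumes mono audio given as a list of floats in [-1, 1]; the function name `split_on_silence`, the energy threshold, and the 300 ms silence gap are hypothetical choices made for this example. Punctuation markers, which require a transcript, are left out. The 0.5 s and 10 s bounds mirror the utterance lengths mentioned above.

```python
import math

def frame_rms(samples, frame_len):
    """Root-mean-square energy of each non-overlapping frame."""
    return [
        math.sqrt(sum(x * x for x in samples[i:i + frame_len]) / frame_len)
        for i in range(0, len(samples) - frame_len + 1, frame_len)
    ]

def split_on_silence(samples, sample_rate, frame_ms=20, threshold=0.01,
                     min_silence_ms=300, min_utt_s=0.5, max_utt_s=10.0):
    """Return (start_s, end_s) utterances separated by sustained silence.

    All default values are illustrative assumptions, not tuned settings.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    frame_s = frame_len / sample_rate
    max_gap = min_silence_ms // frame_ms  # silent frames tolerated inside an utterance
    is_speech = [e >= threshold for e in frame_rms(samples, frame_len)]

    # Group speech frames, closing a segment once the silence exceeds max_gap.
    segments, start, last_speech = [], None, None
    for i, speech in enumerate(is_speech):
        if speech:
            if start is None:
                start = i
            last_speech = i
        elif start is not None and i - last_speech > max_gap:
            segments.append((start, last_speech + 1))
            start = None
    if start is not None:
        segments.append((start, last_speech + 1))

    # Cut segments longer than max_utt_s into chunks; drop chunks shorter than min_utt_s.
    max_frames = max(1, round(max_utt_s / frame_s))
    utterances = []
    for seg_start, seg_end in segments:
        for chunk_start in range(seg_start, seg_end, max_frames):
            chunk_end = min(chunk_start + max_frames, seg_end)
            if (chunk_end - chunk_start) * frame_s >= min_utt_s:
                utterances.append((chunk_start * frame_s, chunk_end * frame_s))
    return utterances
```

A fixed energy threshold is the simplest possible choice and will misfire on noisy recordings; it is used here only to make the idea concrete.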
Once an audio file is broken into utterances, those utterances get sent through a Deep Learning model that produces an embedding for each one.
An embedding is a Deep Learning model’s low-dimensional representation of an input.
For example, the embedding of a word is a fixed-length vector of numbers.
A similar process converts segments of audio, rather than words, into embeddings.
Next, we need to determine how many speakers are present in the audio file. A key feature of modern Speaker Diarization models is that they can accurately predict this number.
Our first goal here is to overestimate the number of speakers: we want to determine the greatest number of speakers that could reasonably be heard in the audio.
Why overestimate? It's much easier to combine the utterances of one speaker that has been incorrectly identified as two than it is to disentangle the utterances of two speakers which have incorrectly been combined into one.
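The "combine" step can be sketched as follows. This is one possible way to merge an over-split set of speaker clusters, assuming each cluster is a list of embedding vectors and that two clusters belong to the same speaker when their centroids are close in cosine similarity. The function names and the `merge_threshold` value of 0.9 are hypothetical and chosen only for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Element-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def merge_similar_clusters(clusters, merge_threshold=0.9):
    """Repeatedly merge the most similar pair of clusters.

    Stops when the best pair's centroid similarity falls below
    merge_threshold (an assumed value, not a recommended setting).
    """
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine_similarity(centroid(clusters[i]), centroid(clusters[j]))
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        sim, i, j = best
        if sim < merge_threshold:
            break
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters
```

Starting from too many clusters and merging only needs a similarity comparison between groups, whereas splitting a wrongly merged cluster would require deciding which utterances inside it belong to which speaker.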
Speaker Diarization models take the utterance embeddings (produced above) and cluster them into as many clusters as there are speakers.
There are many ways to determine similarity of embeddings, and this is a core component of accurately predicting speaker labels with a Speaker Diarization model.
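To make the clustering idea concrete, here is a minimal sketch that uses cosine similarity, one common similarity measure among several. It assumes each utterance embedding is a plain list of floats and assigns an embedding to the most similar existing cluster, or opens a new cluster when nothing is similar enough. The greedy strategy, the function name `label_utterances`, and the threshold of 0.8 are assumptions for illustration, not how any specific diarization model works.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def label_utterances(embeddings, threshold=0.8):
    """Return one integer speaker label per embedding, in input order."""
    centroids, counts, labels = [], [], []
    for emb in embeddings:
        sims = [cosine_similarity(emb, c) for c in centroids]
        if sims and max(sims) >= threshold:
            k = sims.index(max(sims))
            counts[k] += 1
            # Update the running mean of the matched cluster.
            centroids[k] = [c + (e - c) / counts[k] for c, e in zip(centroids[k], emb)]
        else:
            # Nothing is similar enough: treat this as a new speaker.
            k = len(centroids)
            centroids.append(list(emb))
            counts.append(1)
        labels.append(k)
    return labels

# Toy 2-dimensional embeddings; real embeddings have far more dimensions.
print(label_utterances([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]))
# prints [0, 1, 0, 1]
```

The resulting labels line up with the utterances in order, which is the "who spoke when" output the task calls for.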