Top Speaker Diarization Libraries and APIs in 2022

www.assemblyai.com/blog/top-speaker-diarization-libraries-and-apis/

Top Highlights

  • Automatic Speech Recognition (ASR)

  • Speaker Diarization answers the question: who spoke when?

  • What is Speaker Diarization and How Does Speaker Diarization Work?

  • The fundamental task of Speaker Diarization is to apply speaker labels to each utterance in the transcription

  • The first step is to break the audio file into a set of “utterances.” What constitutes an utterance?

  • Utterances are typically between half a second and 10 seconds of speech.

  • A single word wouldn’t be enough for a human to identify a speaker, and Machine Learning models likewise need more data to identify speakers

  • There are many ways to break up an audio/video file into a set of utterances, with one common way being to use silence and punctuation markers.

  • Once an audio file is broken into utterances, those utterances get sent through a Deep Learning model

  • An embedding is a Deep Learning model’s low-dimensional representation of an input

  • the embedding of a word looks like:

  • We perform a similar process to convert not words, but segments of audio, into embeddings as well

  • we need to determine how many speakers are present in the audio file; this is a key feature of a modern Speaker Diarization model.

  • A key feature of modern Speaker Diarization models is that they can accurately predict this number.

  • Our first goal here is to overestimate the number of speakers

  • We want to determine the greatest number of speakers that could reasonably be heard in the audio

  • Why overestimate? It's much easier to combine the utterances of one speaker that has been incorrectly identified as two than it is to disentangle the utterances of two speakers which have incorrectly been combined into one.

  • Speaker Diarization models take the utterance embeddings (produced above), and cluster them into as many clusters as there are speakers

  • There are many ways to determine similarity of embeddings, and this is a core component of accurately predicting speaker labels with a Speaker Diarization model
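The clustering step described in the highlights above can be sketched roughly as follows. This is a minimal illustration and not the method from the original article: it assumes cosine similarity as the similarity measure, starts from the largest possible overestimate (every utterance is its own speaker), and merges the two most similar clusters until no pair is similar enough. The function names, the toy 2-dimensional embeddings, and the `merge_threshold` value of 0.8 are all hypothetical choices made for the example.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: treat an all-zero embedding as dissimilar
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Element-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cluster_utterances(embeddings, merge_threshold=0.8):
    """Assign a speaker label to each utterance embedding.

    Starts by overestimating the speaker count (one cluster per
    utterance), then repeatedly merges the two clusters whose
    centroids are most similar, stopping once the best pair falls
    below merge_threshold (a hypothetical, untuned value).
    """
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None  # (similarity, index_i, index_j)
        for i in range(len(clusters)):
            ci = centroid([embeddings[k] for k in clusters[i]])
            for j in range(i + 1, len(clusters)):
                cj = centroid([embeddings[k] for k in clusters[j]])
                sim = cosine_similarity(ci, cj)
                if best is None or sim > best[0]:
                    best = (sim, i, j)
        if best[0] < merge_threshold:
            break  # remaining clusters are treated as distinct speakers
        _, i, j = best
        clusters[i].extend(clusters[j])
        del clusters[j]

    labels = [0] * len(embeddings)
    for speaker, members in enumerate(clusters):
        for k in members:
            labels[k] = speaker
    return labels


# Toy embeddings: utterances 0, 1 and 4 point one way, 2 and 3 another.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = cluster_utterances(embeddings)
print(labels)  # [0, 0, 1, 1, 0]
```

Starting from too many clusters and merging reflects the point made above: joining two clusters that belong to one speaker is a single merge, whereas splitting a cluster that wrongly mixes two speakers requires re-examining every utterance in it. Real embeddings have far more dimensions, and the threshold would need tuning on labeled audio.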
