⚡iApp ASR
Features
🇹🇭🇬🇧 Supports Thai and English
🎯 Very high accuracy: an average Word Error Rate (WER) of 1.5% on a test set of 10,000 audio files from Common Voice 11, spoken by over 1,000 people, including male, female, and child voices (measured on single-speaker audio files without background noise)
⚡️ High-speed transcription at 70 times real-time speed
🪶 Faster backend system that parallelizes the answer selection process across 5 simultaneous processes
🎯 Accurate word-level timestamps using audio-to-text alignment (wav2vec2 alignment)
👯‍♂️ Multi-speaker recognition using speaker diarization from pyannote-audio (with labels for each speaker)
🗣️ Voice Activity Detection (VAD) pre-processing reduces false predictions and allows batch processing without compromising accuracy
🗣️ Phoneme-based ASR uses an improved set of models to distinguish phonemes, the smallest units of sound that differentiate one word from another, such as the 'p' sound in "tap"
🗣️ Supports both gRPC streaming and file upload via REST API
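
To make the accuracy and speed figures above concrete, here is a minimal sketch of how a Word Error Rate and a "times real-time" estimate are typically computed. It is an illustration only, not iApp's evaluation pipeline: the function names `word_error_rate` and `processing_seconds` are hypothetical, and the sketch assumes words are already separated by spaces (Thai text would need word segmentation first, since it is written without spaces between words).

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (standard Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def processing_seconds(audio_seconds: float, speed_factor: float = 70.0) -> float:
    """Estimated processing time when transcribing at `speed_factor` times real time."""
    return audio_seconds / speed_factor


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Estimated time for a one-hour recording at 70 times real time.
    print(processing_seconds(3600))
```

Under these assumptions, a WER of 1.5% corresponds to roughly 15 word errors per 1,000 reference words, and a one-hour recording processed at 70 times real time would take a little under a minute.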