[iFlyTek] Real-time ASR (WebSocket)

Real-time ASR (Automatic Speech Recognition) converts streaming audio data into streaming text results in real time. The audio can be streamed from a file or captured directly from a microphone.

Endpoint

  • For a WebSocket without Encryption

ws://api-iflytek.iapp.co.th/asr
  • For a Secure WebSocket with Encryption

wss://api-iflytek.iapp.co.th/asr

The following header is used for authentication when the client sends the WebSocket handshake request to the server:

Key: apikey (required)
Value: "Your API KEY"

Audio format

  • Sample Rate: 16,000 Hz

  • Channel: Mono

  • Bit-depth: 16 bit

  • Audio-encode: PCM (Pulse Code Modulation)
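The service expects raw PCM, so audio stored as WAV must have its header stripped first. Below is a minimal sketch using Python's standard wave module; the function name wav_to_raw_pcm and the file paths are illustrative, and it assumes the WAV file already matches the required format rather than resampling it.

```python
import wave

def wav_to_raw_pcm(wav_path: str, pcm_path: str) -> None:
    # Verify the WAV file already matches the required format,
    # then write its raw PCM frames (no WAV header) to pcm_path.
    with wave.open(wav_path, "rb") as wav:
        assert wav.getframerate() == 16000, "sample rate must be 16,000 Hz"
        assert wav.getnchannels() == 1, "audio must be mono"
        assert wav.getsampwidth() == 2, "bit depth must be 16 bit"
        frames = wav.readframes(wav.getnframes())
    with open(pcm_path, "wb") as out:
        out.write(frames)
```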

Stream from file

Example client side code using Python WebSockets Package

Prerequisites

pip install websockets colorama
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import json
import sys

async def send_audio_data(ws_client: WebSocketClientProtocol):
    frame_size = 1024 
    with open("audios/male.pcm", "rb") as file:
        while True:
            buff = file.read(frame_size)
            if buff:
                await ws_client.send(buff)
            else:
                data_to_send = {
                    "data": {
                        "status": "CloseConnection"
                    }
                }
                await ws_client.send(json.dumps(data_to_send))
                break
            await asyncio.sleep(0.02)
 
async def receive_message(ws_client: WebSocketClientProtocol):
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            state = data.get('state')
            transcribe = data.get('text')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save the current cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore the saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)


async def client_test():
    url = "wss://api-iflytek.iapp.co.th/asr"
    async with websockets.connect(url, ping_timeout=None, extra_headers={"apikey": "YOUR APIKEY"}) as cli_ws:
        send_data_task = asyncio.create_task(send_audio_data(cli_ws))
        receive_data_task = asyncio.create_task(receive_message(cli_ws))
        await send_data_task
        await receive_data_task

if __name__ == "__main__":
    asyncio.run(client_test())

Connection workflow

  • Get an API KEY from the platform, pass it in the apikey header, and send the WebSocket handshake request to the server.

  • After the handshake succeeds, the client can upload audio and receive results simultaneously over the WebSocket connection. Once all audio has been uploaded, the client should send a JSON message whose status field signals the end of the stream ("CloseConnection" in the examples).

  • After the server has processed all audio streamed from the client, it sends an indicator message ("Confirm Close Connection"); the client then disconnects the WebSocket.
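The end-of-stream indicator used throughout this page is a small JSON frame, sketched below for reference (the constant name END_OF_STREAM is illustrative):

```python
import json

# JSON frame the client sends after the last audio chunk; the server
# replies with "Confirm Close Connection" before the socket is closed.
END_OF_STREAM = json.dumps({"data": {"status": "CloseConnection"}})
```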

Stream directly from microphone

Example client side code that capture from microphone

Prerequisites

Install PortAudio, which pyaudio requires:

  • Mac OS

brew install portaudio
  • Windows OS

pip install pipwin
pipwin install pyaudio

Ref: https://stackoverflow.com/questions/52283840/i-cant-install-pyaudio-on-windows-how-to-solve-error-microsoft-visual-c-14

pip install pyaudio websockets colorama
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import pyaudio
import json
import sys


async def send_audio_data(ws_client: WebSocketClientProtocol, 
                          stream_mic: pyaudio.PyAudio.Stream, 
                          mic_obj: pyaudio.PyAudio) -> None:

    CHUNK_SIZE = 1024
    print("Recording... Press Ctrl+C to stop")
    try:
        while True:
            buff = stream_mic.read(CHUNK_SIZE, exception_on_overflow=False)
            await ws_client.send(buff)
            await asyncio.sleep(0.05)
    finally:
        stream_mic.stop_stream()
        stream_mic.close()
        mic_obj.terminate()
        print("Recording stopped")

async def receive_message(ws_client: WebSocketClientProtocol) -> None:
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            transcribe = data.get('text')
            state = data.get('state')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save the current cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore the saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)

async def client_test():
    url = "wss://api-iflytek.iapp.co.th/asr"
    async with websockets.connect(url, ping_timeout=None, extra_headers={"apikey": "YOUR APIKEY"}) as client_ws:
        SAMPLE_RATE = 16000
        CHANNELS = 1
        FRAME_SIZE = 1024
        FORMAT = pyaudio.paInt16

        cap_mic = pyaudio.PyAudio()

        # open the input stream for the microphone
        stream = cap_mic.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=SAMPLE_RATE,
                              input=True,
                              frames_per_buffer=FRAME_SIZE)
        
        send_data_task = asyncio.create_task(send_audio_data(client_ws, stream, cap_mic))
        receive_data_task = asyncio.create_task(receive_message(client_ws))
        await send_data_task
        await receive_data_task

if __name__ == "__main__":
    asyncio.run(client_test())
 

Response Pattern

The server processes audio data only when it contains speech. At the start of an utterance the state is "Sentence", and partial results continue to arrive until silence is detected. When speech is detected again, the state becomes "NewSentence", indicating that a new sentence has begun.

Response state

Sentence (sent while speech is detected; carries the partial transcription of the current sentence)

Example JSON

{
    "type": "realtime",
    "state": "Sentence",
    "text": "สวัสดีครับ" 
}

NewSentence (the previous sentence has ended and a new sentence has begun)

Example JSON

{
    "type": "realtime",
    "state": "NewSentence",
    "text": "วันนี้เป็นอย่างไร" 
}
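Instead of printing in place as the examples above do, a client can fold the message stream into a list of finalized sentences. A sketch follows; the helper name assemble_transcript is illustrative, and it assumes a "NewSentence" message carries the first partial of the new sentence, as in the JSON above.

```python
import json

def assemble_transcript(messages):
    # Fold a stream of server JSON messages into a list of sentences.
    # "Sentence" messages overwrite the partial text of the current
    # sentence; "NewSentence" finalizes it and starts the next one.
    sentences, current = [], ""
    for raw in messages:
        data = json.loads(raw)
        text = data.get("text", "")
        if text == "Confirm Close Connection":
            break
        if data.get("state") == "NewSentence":
            if current:
                sentences.append(current)
            current = text
        else:
            current = text
    if current:
        sentences.append(current)
    return sentences
```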
