Real-time ASR (Automatic Speech Recognition) converts an audio stream into a text stream in real time. The audio can be streamed from a file or captured directly from a microphone.
The WebSocket Protocol Enables Real-time Communication
Full-duplex Communication: WebSocket provides a persistent, bidirectional communication channel over a single TCP connection. Unlike HTTP, which is request-response based, WebSocket allows both the client and the server to send data independently at any time.
Low Latency: The persistent connection avoids the overhead of repeatedly establishing new connections, which is critical for real-time applications.
Efficient Data Transfer: WebSocket frames have minimal overhead compared to HTTP requests, making them suitable for transmitting small chunks of data frequently.
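This full-duplex pattern is what the client examples on this page rely on: one asyncio task sends audio while a second task receives transcriptions, over the same connection. A minimal sketch of that pattern, using an `asyncio.Queue` as a stand-in for the socket (no network involved):

```python
import asyncio

async def sender(channel: asyncio.Queue) -> None:
    # Push messages without waiting for any reply, as on a WebSocket.
    for chunk in ["chunk-1", "chunk-2", "chunk-3"]:
        await channel.put(chunk)
    await channel.put(None)  # sentinel: no more data

async def receiver(channel: asyncio.Queue, received: list) -> None:
    # Consume messages independently of the sender's pace.
    while (msg := await channel.get()) is not None:
        received.append(msg)

async def main() -> list:
    channel: asyncio.Queue = asyncio.Queue()
    received: list = []
    # Both tasks run concurrently over the same channel (full duplex).
    await asyncio.gather(sender(channel), receiver(channel, received))
    return received

print(asyncio.run(main()))  # → ['chunk-1', 'chunk-2', 'chunk-3']
```

In the real examples below, the queue is replaced by the WebSocket object and the two coroutines become `send_audio_data` and `receive_message`.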
Endpoint
For a WebSocket without Encryption
ws://api.iapp.co.th/asrx
For a Secure WebSocket with Encryption
wss://api.iapp.co.th/asrx
Header
Used for authentication when the client sends the WebSocket handshake request to the server.
| Key | Value |
| --- | ----- |
| apikey | Your API key from the platform |
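A minimal sketch of attaching the apikey header during the handshake (the header name and the `extra_headers` usage follow the full examples below; `YOUR APIKEY` is a placeholder):

```python
def build_auth_headers(api_key: str) -> dict:
    # The handshake is authenticated via an "apikey" header.
    return {"apikey": api_key}

# Usage with the websockets package (see the full examples below):
#   async with websockets.connect("wss://api.iapp.co.th/asrx",
#                                 extra_headers=build_auth_headers("YOUR APIKEY")) as ws:
#       ...
print(build_auth_headers("YOUR APIKEY"))
```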
Audio
Supported Audio Sampling Rate: 16 kHz
Supported Audio Type: Mono, 16-bit integer PCM
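To make the format concrete: at 16 kHz, mono, 16 bits (2 bytes) per sample, one second of audio is 32,000 bytes of raw PCM. A sketch that synthesizes one second of silence in that layout (little-endian byte order is an assumption here; the docs do not state it):

```python
import struct

SAMPLE_RATE = 16000   # 16 kHz
CHANNELS = 1          # mono
BYTES_PER_SAMPLE = 2  # 16-bit integer PCM

def make_silence(seconds: float) -> bytes:
    # Pack each sample as a little-endian signed 16-bit integer ("<h").
    n_samples = int(SAMPLE_RATE * seconds) * CHANNELS
    return b"".join(struct.pack("<h", 0) for _ in range(n_samples))

one_second = make_silence(1.0)
print(len(one_second))  # → 32000
```

Audio in any other format (e.g. 44.1 kHz stereo WAV) must be resampled and converted to this layout before streaming.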
Stream from file
Example client-side code using the Python websockets package
Prerequisites
pip install websockets colorama
```python
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import json
import sys

async def send_audio_data(ws_client: WebSocketClientProtocol):
    frame_size = 1024
    with open("audios/male.pcm", "rb") as file:
        while True:
            buff = file.read(frame_size)
            if buff:
                await ws_client.send(buff)
            else:
                data_to_send = {"data": {"status": "CloseConnection"}}
                await ws_client.send(json.dumps(data_to_send))
                break
            await asyncio.sleep(0.02)

async def receive_message(ws_client: WebSocketClientProtocol):
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            state = data.get('state')
            transcribe = data.get('text')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save current cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)

async def client_test():
    url = "wss://api.iapp.co.th/asrx"
    async with websockets.connect(url, ping_timeout=None,
                                  extra_headers={"apikey": "YOUR APIKEY"}) as cli_ws:
        send_data_task = asyncio.create_task(send_audio_data(cli_ws))
        receive_data_task = asyncio.create_task(receive_message(cli_ws))
        await send_data_task
        await receive_data_task

if __name__ == "__main__":
    asyncio.run(client_test())
```
Connection workflow
Get an API key from the platform and pass it as the apikey header, then send the WebSocket handshake request to the server.
After the handshake succeeds, the client can upload and receive data simultaneously through the WebSocket connection. Once the audio data has been completely uploaded, the client should send a JSON message containing a status field that marks the end of the stream (the file-streaming example above sends {"data": {"status": "CloseConnection"}}).
Disconnect the WebSocket connection after the server has successfully processed all audio streamed from the client; the server sends an indicator message (the text "Confirm Close Connection") telling the client it may close the connection.
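The end-of-stream message used in step 2 can be built as in the file-streaming example above:

```python
import json

def end_of_stream_message() -> str:
    # Status payload the file-streaming example sends once all audio is uploaded.
    return json.dumps({"data": {"status": "CloseConnection"}})

print(end_of_stream_message())  # → {"data": {"status": "CloseConnection"}}
```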
Response from ASR streaming APIs
When the server-side ASR engine successfully processes an audio chunk, it continuously sends the transcribed text back to the client in JSON format.
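The exact payload is not shown on this page, but based on the client code, each message is a JSON object with at least a `text` field (the running transcription) and a `state` field. A minimal parsing sketch with a hypothetical message:

```python
import json

# Hypothetical server message; field names follow the client code on this page.
raw = '{"text": "hello world", "state": "NewSentence"}'

data = json.loads(raw)
print(data.get("text"))   # running transcription of the current sentence
print(data.get("state"))  # "NewSentence" when a new sentence has started
```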
Stream from microphone
Example client-side code using the Python websockets and PyAudio packages
Prerequisites
pip install websockets colorama pyaudio

```python
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import pyaudio
import json
import sys

async def send_audio_data(ws_client: WebSocketClientProtocol,
                          stream_mic: pyaudio.PyAudio.Stream,
                          mic_obj: pyaudio.PyAudio) -> None:
    CHUNK_SIZE = 1024
    print("Recording... Press Ctrl+C to stop")
    try:
        while True:
            buff = stream_mic.read(CHUNK_SIZE, exception_on_overflow=False)
            await ws_client.send(buff)
            await asyncio.sleep(0.05)
    finally:
        stream_mic.stop_stream()
        stream_mic.close()
        mic_obj.terminate()
        print("Record timeout")

async def receive_message(ws_client: WebSocketClientProtocol) -> None:
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            transcribe = data.get('text')
            state = data.get('state')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save current cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)

async def client_test():
    url = "wss://api.iapp.co.th/asrx"
    async with websockets.connect(url, ping_timeout=None,
                                  extra_headers={"apikey": "YOUR APIKEY"}) as client_ws:
        SAMPLE_RATE = 16000
        CHANNELS = 1
        FRAME_SIZE = 1024
        FORMAT = pyaudio.paInt16
        cap_mic = pyaudio.PyAudio()
        # Open the input stream for the microphone
        stream = cap_mic.open(format=FORMAT, channels=CHANNELS, rate=SAMPLE_RATE,
                              input=True, frames_per_buffer=FRAME_SIZE)
        send_data_task = asyncio.create_task(send_audio_data(client_ws, stream, cap_mic))
        receive_data_task = asyncio.create_task(receive_message(client_ws))
        await send_data_task
        await receive_data_task

if __name__ == "__main__":
    asyncio.run(client_test())
```
Response Pattern
The server processes audio data only when it contains speech, and continues until silence is detected. When speech is detected again, processing resumes and the state key will be "NewSentence", indicating that a new sentence has begun.
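Under this pattern, a client can fold the stream of partial updates into a list of final sentences: each message carries the full text of the current sentence so far, and a "NewSentence" state finalizes the previous one. The message sequence below is hypothetical, and the assumption that a "NewSentence" message's text belongs to the new sentence follows the display logic in the client code above:

```python
import json

def fold_transcript(messages) -> list:
    """Collapse a stream of {"text", "state"} JSON messages into sentences."""
    sentences, current = [], None
    for raw in messages:
        data = json.loads(raw)
        if data.get("state") == "NewSentence" and current is not None:
            sentences.append(current)   # previous sentence is final
        current = data.get("text")      # latest (partial or full) text
    if current is not None:
        sentences.append(current)       # flush the last sentence
    return sentences

msgs = [
    '{"text": "hel"}',
    '{"text": "hello"}',
    '{"text": "wor", "state": "NewSentence"}',
    '{"text": "world"}',
]
print(fold_transcript(msgs))  # → ['hello', 'world']
```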