Real-time ASR (Automatic Speech Recognition) converts an audio stream into a text stream in real time. The audio can be streamed from a file or captured directly from a microphone.
The WebSocket Protocol Enables Real-time Communication
Full-duplex Communication: WebSocket provides a persistent, bidirectional communication channel over a single TCP connection. Unlike HTTP, which is request-response based, WebSocket allows both the client and the server to send data independently at any time.
Low Latency: The persistent connection avoids the overhead of repeatedly establishing new connections, which is critical for real-time applications.
Efficient Data Transfer: WebSocket frames have minimal overhead compared to HTTP requests, making them suitable for transmitting small chunks of data frequently.
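This full-duplex pattern is what the client examples on this page rely on: one asyncio task sends audio while a second task receives transcriptions, over the same connection. A minimal sketch of that pattern, using an `asyncio.Queue` as a stand-in for the socket (no network involved):

```python
import asyncio

async def sender(channel: asyncio.Queue) -> None:
    # Push messages without waiting for any reply, as on a WebSocket.
    for chunk in ["chunk-1", "chunk-2", "chunk-3"]:
        await channel.put(chunk)
    await channel.put(None)  # sentinel: no more data

async def receiver(channel: asyncio.Queue, received: list) -> None:
    # Consume messages independently of the sender's pace.
    while (msg := await channel.get()) is not None:
        received.append(msg)

async def main() -> list:
    channel: asyncio.Queue = asyncio.Queue()
    received: list = []
    # Both tasks run concurrently over the same channel (full duplex).
    await asyncio.gather(sender(channel), receiver(channel, received))
    return received

print(asyncio.run(main()))  # → ['chunk-1', 'chunk-2', 'chunk-3']
```

In the real examples below, the queue is replaced by the WebSocket object and the two coroutines become `send_audio_data` and `receive_message`.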
Endpoint
For a WebSocket without Encryption
ws://api.iapp.co.th/asrx
For a Secure WebSocket with Encryption
wss://api.iapp.co.th/asrx
Header
Used for authentication when the client sends the WebSocket handshake request to the server.
| Key | Value |
| --- | ----- |
| apikey | Your API key from the platform |
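A minimal sketch of attaching the apikey header during the handshake (the header name and the `extra_headers` usage follow the full examples below; `YOUR APIKEY` is a placeholder):

```python
def build_auth_headers(api_key: str) -> dict:
    # The handshake is authenticated via an "apikey" header.
    return {"apikey": api_key}

# Usage with the websockets package (see the full examples below):
#   async with websockets.connect("wss://api.iapp.co.th/asrx",
#                                 extra_headers=build_auth_headers("YOUR APIKEY")) as ws:
#       ...
print(build_auth_headers("YOUR APIKEY"))
```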
Audio
Supported Audio Sampling Rate: 16 kHz
Supported Audio Type: Mono, 16-bit integer PCM
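To make the format concrete: at 16 kHz, mono, 16 bits (2 bytes) per sample, one second of audio is 32,000 bytes of raw PCM. A sketch that synthesizes one second of silence in that layout (little-endian byte order is an assumption here; the docs do not state it):

```python
import struct

SAMPLE_RATE = 16000   # 16 kHz
CHANNELS = 1          # mono
BYTES_PER_SAMPLE = 2  # 16-bit integer PCM

def make_silence(seconds: float) -> bytes:
    # Pack each sample as a little-endian signed 16-bit integer ("<h").
    n_samples = int(SAMPLE_RATE * seconds) * CHANNELS
    return b"".join(struct.pack("<h", 0) for _ in range(n_samples))

one_second = make_silence(1.0)
print(len(one_second))  # → 32000
```

Audio in any other format (e.g. 44.1 kHz stereo WAV) must be resampled and converted to this layout before streaming.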
Stream from file
Example client-side code using the Python websockets package
Prerequisites
pip install websockets colorama
```python
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import json
import sys

async def send_audio_data(ws_client: WebSocketClientProtocol):
    frame_size = 1024
    with open("audios/male.pcm", "rb") as file:
        while True:
            buff = file.read(frame_size)
            if buff:
                await ws_client.send(buff)
            else:
                data_to_send = {"data": {"status": "CloseConnection"}}
                await ws_client.send(json.dumps(data_to_send))
                break
            await asyncio.sleep(0.02)

async def receive_message(ws_client: WebSocketClientProtocol):
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            state = data.get('state')
            transcribe = data.get('text')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save current cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)

async def client_test():
    url = "wss://api.iapp.co.th/asrx"
    async with websockets.connect(url, ping_timeout=None,
                                  extra_headers={"apikey": "YOUR APIKEY"}) as cli_ws:
        send_data_task = asyncio.create_task(send_audio_data(cli_ws))
        receive_data_task = asyncio.create_task(receive_message(cli_ws))
        await send_data_task
        await receive_data_task

if __name__ == "__main__":
    asyncio.run(client_test())
```
Connection workflow
Get an API key from the platform and pass it as the apikey header, then send the WebSocket handshake request to the server.
After the handshake succeeds, the client can upload and receive data simultaneously through the WebSocket connection. Once the audio data has been completely uploaded, the client should send a JSON message containing a status field that marks the end of the stream (the file-streaming example above sends {"data": {"status": "CloseConnection"}}).
Disconnect the WebSocket connection after the server has successfully processed all audio streamed from the client; the server sends an indicator message (the text "Confirm Close Connection") telling the client it may close the connection.
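The end-of-stream message used in step 2 can be built as in the file-streaming example above:

```python
import json

def end_of_stream_message() -> str:
    # Status payload the file-streaming example sends once all audio is uploaded.
    return json.dumps({"data": {"status": "CloseConnection"}})

print(end_of_stream_message())  # → {"data": {"status": "CloseConnection"}}
```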
Response from ASR streaming APIs
When the server-side ASR engine successfully processes an audio chunk, it continuously sends the transcribed text back to the client in JSON format.
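The exact payload is not shown on this page, but based on the client code, each message is a JSON object with at least a `text` field (the running transcription) and a `state` field. A minimal parsing sketch with a hypothetical message:

```python
import json

# Hypothetical server message; field names follow the client code on this page.
raw = '{"text": "hello world", "state": "NewSentence"}'

data = json.loads(raw)
print(data.get("text"))   # running transcription of the current sentence
print(data.get("state"))  # "NewSentence" when a new sentence has started
```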
Stream from microphone
Example client-side code using the Python websockets and PyAudio packages
Prerequisites
pip install websockets colorama pyaudio

```python
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import pyaudio
import json
import sys

async def send_audio_data(ws_client: WebSocketClientProtocol,
                          stream_mic: pyaudio.PyAudio.Stream,
                          mic_obj: pyaudio.PyAudio) -> None:
    CHUNK_SIZE = 1024
    print("Recording... Press Ctrl+C to stop")
    try:
        while True:
            buff = stream_mic.read(CHUNK_SIZE, exception_on_overflow=False)
            await ws_client.send(buff)
            await asyncio.sleep(0.05)
    finally:
        stream_mic.stop_stream()
        stream_mic.close()
        mic_obj.terminate()
        print("Record timeout")

async def receive_message(ws_client: WebSocketClientProtocol) -> None:
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            transcribe = data.get('text')
            state = data.get('state')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save current cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)

async def client_test():
    url = "wss://api.iapp.co.th/asrx"
    async with websockets.connect(url, ping_timeout=None,
                                  extra_headers={"apikey": "YOUR APIKEY"}) as client_ws:
        SAMPLE_RATE = 16000
        CHANNELS = 1
        FRAME_SIZE = 1024
        FORMAT = pyaudio.paInt16
        cap_mic = pyaudio.PyAudio()
        # Open the input stream for the microphone
        stream = cap_mic.open(format=FORMAT, channels=CHANNELS, rate=SAMPLE_RATE,
                              input=True, frames_per_buffer=FRAME_SIZE)
        send_data_task = asyncio.create_task(send_audio_data(client_ws, stream, cap_mic))
        receive_data_task = asyncio.create_task(receive_message(client_ws))
        await send_data_task
        await receive_data_task

if __name__ == "__main__":
    asyncio.run(client_test())
```
Response Pattern
The server processes audio data only when it contains speech, and continues until silence is detected. When speech is detected again, processing resumes and the state key will be "NewSentence", indicating that a new sentence has begun.
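Under this pattern, a client can fold the stream of partial updates into a list of final sentences: each message carries the full text of the current sentence so far, and a "NewSentence" state finalizes the previous one. The message sequence below is hypothetical, and the assumption that a "NewSentence" message's text belongs to the new sentence follows the display logic in the client code above:

```python
import json

def fold_transcript(messages) -> list:
    """Collapse a stream of {"text", "state"} JSON messages into sentences."""
    sentences, current = [], None
    for raw in messages:
        data = json.loads(raw)
        if data.get("state") == "NewSentence" and current is not None:
            sentences.append(current)   # previous sentence is final
        current = data.get("text")      # latest (partial or full) text
    if current is not None:
        sentences.append(current)       # flush the last sentence
    return sentences

msgs = [
    '{"text": "hel"}',
    '{"text": "hello"}',
    '{"text": "wor", "state": "NewSentence"}',
    '{"text": "world"}',
]
print(fold_transcript(msgs))  # → ['hello', 'world']
```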