WebSocket APIs Documentation (Beta)

Real-time ASR (Automatic Speech Recognition) converts an audio stream into a text stream in real time. Audio can be streamed from a file or captured directly from a microphone.

How the WebSocket Protocol Enables Real-time Communication

  • Full-duplex Communication: WebSocket provides a persistent, bidirectional communication channel over a single TCP connection. Unlike HTTP, which is request-response based, WebSocket allows both client and server to send data independently at any time.

  • Low Latency: The persistent connection removes the overhead of establishing a new connection for each message, which is critical for real-time applications.

  • Efficient Data Transfer: WebSocket frames have minimal overhead compared to HTTP requests, making them suitable for transmitting small chunks of data frequently.

Endpoint

  • For a WebSocket without Encryption

ws://api.iapp.co.th/asrx
  • For a Secure WebSocket with Encryption

wss://api.iapp.co.th/asrx

Used for authentication when the client sends the WebSocket handshake request to the WebSocket server.

Key: apikey (required)
Value: "Your API KEY"

Audio

  • Supported Audio Sampling Rate: 16 kHz

  • Supported Audio Type: Mono, 16-bit integer PCM
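If your audio lives in a WAV container, the raw PCM bytes the API expects can be extracted with the standard-library wave module after verifying the format. The synthetic 0.1 s tone below is only a stand-in for real speech so the sketch is self-contained:

```python
import io
import math
import struct
import wave

# Build a 0.1 s, 16 kHz, mono, 16-bit WAV in memory (stand-in for a recording).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)       # mono
    w.setsampwidth(2)       # 16-bit samples
    w.setframerate(16000)   # 16 kHz
    w.writeframes(b"".join(
        struct.pack("<h", int(10000 * math.sin(2 * math.pi * 440 * n / 16000)))
        for n in range(1600)))

buf.seek(0)
with wave.open(buf, "rb") as w:
    # Verify the container matches the required format before streaming.
    assert (w.getnchannels(), w.getsampwidth(), w.getframerate()) == (1, 2, 16000)
    pcm = w.readframes(w.getnframes())   # raw PCM bytes, WAV header stripped

print(len(pcm))  # 1600 frames x 2 bytes per sample
```

The resulting bytes can be written to a .pcm file (like the audios/male.pcm used below) or chunked and sent over the WebSocket directly.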

🗂ī¸ Stream from file

Example client-side code using the Python websockets package

Prerequisites

pip install websockets colorama
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import json
import sys

async def send_audio_data(ws_client: WebSocketClientProtocol):
    frame_size = 1024 
    with open("audios/male.pcm", "rb") as file:
        while True:
            buff = file.read(frame_size)
            if buff:
                await ws_client.send(buff)
            else:
                data_to_send = {
                    "data": {
                        "status": "CloseConnection"
                    }
                }
                await ws_client.send(json.dumps(data_to_send))
                break
            await asyncio.sleep(0.02)
 
async def receive_message(ws_client: WebSocketClientProtocol):
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            state = data.get('state')
            transcribe = data.get('text')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save the cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore the saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)


async def client_test():
    url = "wss://api.iapp.co.th/asrx"
    async with websockets.connect(url, ping_timeout=None, extra_headers={"apikey": "YOUR APIKEY"}) as cli_ws:
        send_data_task = asyncio.create_task(send_audio_data(cli_ws))
        receive_data_task = asyncio.create_task(receive_message(cli_ws))
        await send_data_task
        await receive_data_task

if __name__ == "__main__":
    asyncio.run(client_test())

Connection workflow

  • Get an API KEY from the platform, pass it as the apikey parameter in the header, and send the WebSocket protocol handshake request to the server side.

  • After the handshake succeeds, the client can upload and receive data simultaneously over the WebSocket connection. When all audio data has been uploaded, the client should send a JSON message whose status field signals the end of the stream, as the "CloseConnection" message in the example above does.

  • After the server side has successfully processed all audio streamed from the client, it sends an indicator message; the client then closes the WebSocket connection.

Response from ASR streaming APIs

When the ASR engine on the server side successfully processes an audio chunk, it continuously sends the transcribed text back to the client in the following JSON format:

Result of chunk 1

{
    "type": "realtime",
    "state": "Sentence",
    "text": "ā¸Ēā¸§ā¸ąā¸Ēā¸”ā¸ĩā¸„ā¸Ŗā¸ąā¸š " 
}

Result of chunk 2

{
    "type": "realtime",
    "state": "Sentence",
    "text": "ā¸œā¸Ąāš€ā¸Ēā¸ĩā¸ĸā¸‡ā¸œā¸šāš‰āšƒā¸Ģā¸āšˆ ā¸œā¸šāš‰ā¸Šā¸˛ā¸ĸā¸„ā¸Ŗā¸ąā¸š" 
}

When the client sends the end-of-stream message, the server sends the data below back to the client, and the client is then responsible for closing the connection.

{
    "type": "realtime",
    "text": "end_of_process" 
}
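A client-side receive loop for this stream can be sketched as follows: collect partial transcripts and stop when the server signals "end_of_process". The sample messages reuse the documented examples; in a real client they would come from ws_client.recv().

```python
import json

def drain(raw_messages):
    """Collect transcript texts until the server signals end of processing."""
    texts = []
    for raw in raw_messages:
        data = json.loads(raw)
        if data.get("text") == "end_of_process":
            break  # server is done; the client should now close the connection
        texts.append(data["text"])
    return texts

# Sample messages taken from the documented responses above.
sample = [
    '{"type": "realtime", "state": "Sentence", "text": "ā¸Ēā¸§ā¸ąā¸Ēā¸”ā¸ĩā¸„ā¸Ŗā¸ąā¸š "}',
    '{"type": "realtime", "text": "end_of_process"}',
]
texts = drain(sample)
print(texts)
```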

🎤 Stream directly from microphone (Beta)

Example client side code that capture from microphone

Prerequisited

Install PortAudio in order to use PyAudio

  • macOS

brew install portaudio
  • Windows

pip install pipwin
pipwin install pyaudio

Ref: https://stackoverflow.com/questions/52283840/i-cant-install-pyaudio-on-windows-how-to-solve-error-microsoft-visual-c-14

pip install pyaudio websockets colorama
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import pyaudio
import json
import sys


async def send_audio_data(ws_client: WebSocketClientProtocol, 
                          stream_mic: pyaudio.Stream, 
                          mic_obj: pyaudio.PyAudio) -> None:

    CHUNK_SIZE = 1024
    print("Recording... Press Ctrl+C to stop")
    try:
        while True:
            buff = stream_mic.read(CHUNK_SIZE, exception_on_overflow=False)
            await ws_client.send(buff)
            await asyncio.sleep(0.05)
    finally:
        stream_mic.stop_stream()
        stream_mic.close()
        mic_obj.terminate()
        print("Recording stopped")

async def receive_message(ws_client: WebSocketClientProtocol) -> None:
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            transcribe = data.get('text')
            state = data.get('state')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save the cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore the saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)

async def client_test():
    url = "wss://api.iapp.co.th/asrx"
    async with websockets.connect(url, ping_timeout=None, extra_headers={"apikey": "YOUR APIKEY"}) as client_ws:
        SAMPLE_RATE = 16000
        CHANNELS = 1
        FRAME_SIZE = 1024
        FORMAT = pyaudio.paInt16

        cap_mic = pyaudio.PyAudio()

        # open the input stream for the microphone
        stream = cap_mic.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=SAMPLE_RATE,
                              input=True,
                              frames_per_buffer=FRAME_SIZE)
        
        send_data_task = asyncio.create_task(send_audio_data(client_ws, stream, cap_mic))
        receive_data_task = asyncio.create_task(receive_message(client_ws))
        await send_data_task
        await receive_data_task

if __name__ == "__main__":
    asyncio.run(client_test())
 

Response Pattern

The server processes incoming audio only when it detects speech, and continues processing until it detects silence. When speech is detected again, it resumes processing and the state key will be "NewSentence", indicating that a new sentence has begun.
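This behavior can be folded into final sentences on the client side: a "Sentence" message refines the current hypothesis, while "NewSentence" finalizes the previous sentence and starts a new one. The messages below are made up to match the documented format, purely for illustration:

```python
import json

def fold(raw_messages):
    """Fold the realtime message stream into a list of final sentences."""
    sentences, current = [], ""
    for raw in raw_messages:
        data = json.loads(raw)
        if data.get("text") == "end_of_process":
            break
        if data.get("state") == "NewSentence":
            if current:
                sentences.append(current)  # previous sentence is final
            current = data["text"]
        else:  # "Sentence": refined hypothesis for the ongoing sentence
            current = data["text"]
    if current:
        sentences.append(current)
    return sentences

# Hypothetical message sequence shaped like the documented responses.
stream = [
    '{"type": "realtime", "state": "Sentence", "text": "hel"}',
    '{"type": "realtime", "state": "Sentence", "text": "hello"}',
    '{"type": "realtime", "state": "NewSentence", "text": "world"}',
    '{"type": "realtime", "text": "end_of_process"}',
]
sentences = fold(stream)
print(sentences)
```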
