Real-time ASR (Automatic Speech Recognition) converts an audio stream into a text stream in real time. The audio can be streamed from a file or captured directly from a microphone.
Endpoint
For a WebSocket without Encryption
ws://api-iflytek.iapp.co.th/asr
For a Secure WebSocket with Encryption
wss://api-iflytek.iapp.co.th/asr
Header
Used to authenticate the client when it sends the WebSocket handshake request to the WebSocket server.
Key | Value
apikey* | "Your API KEY"
Audio format
Sample Rate: 16,000 Hz
Channel: Mono
Bit-depth: 16 bit
Audio encoding: PCM (Pulse Code Modulation)
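Before streaming, it can help to confirm that a source file already matches this format. The sketch below is an illustrative helper (not part of the API) that uses Python's standard `wave` module to check a WAV file against the required parameters and extract its raw PCM frames:

```python
import io
import math
import struct
import wave

def wav_to_pcm(wav_bytes: bytes) -> bytes:
    """Extract raw PCM frames from a WAV file, verifying it already
    matches the required format (16,000 Hz, mono, 16-bit)."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wf:
        if wf.getframerate() != 16000:
            raise ValueError(f"expected 16000 Hz, got {wf.getframerate()}")
        if wf.getnchannels() != 1:
            raise ValueError("expected mono audio")
        if wf.getsampwidth() != 2:
            raise ValueError("expected 16-bit samples")
        return wf.readframes(wf.getnframes())

# Build a 0.1 s, 440 Hz test tone in the required format.
buf = io.BytesIO()
with wave.open(buf, "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)
    wf.setframerate(16000)
    samples = (int(10000 * math.sin(2 * math.pi * 440 * n / 16000))
               for n in range(1600))
    wf.writeframes(b"".join(struct.pack("<h", s) for s in samples))

pcm = wav_to_pcm(buf.getvalue())
print(len(pcm))  # 1600 frames * 2 bytes = 3200
```

Files in other formats (MP3, 44.1 kHz stereo WAV, etc.) must be converted to this layout before streaming, or the transcription quality will degrade.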
Stream from file
Example client-side code using the Python websockets package.
Prerequisites
pip install websockets colorama
```python
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import json
import sys


async def send_audio_data(ws_client: WebSocketClientProtocol):
    frame_size = 1024
    with open("audios/male.pcm", "rb") as file:
        while True:
            buff = file.read(frame_size)
            if buff:
                await ws_client.send(buff)
            else:
                # Tell the server the audio stream is finished.
                data_to_send = {"data": {"status": "CloseConnection"}}
                await ws_client.send(json.dumps(data_to_send))
                break
            await asyncio.sleep(0.02)


async def receive_message(ws_client: WebSocketClientProtocol):
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            state = data.get('state')
            transcribe = data.get('text')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save the cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore the saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)


async def client_test():
    url = "wss://api-iflytek.iapp.co.th/asr"
    async with websockets.connect(
        url,
        ping_timeout=None,
        extra_headers={"apikey": "YOUR APIKEY"},
    ) as cli_ws:
        send_data_task = asyncio.create_task(send_audio_data(cli_ws))
        receive_data_task = asyncio.create_task(receive_message(cli_ws))
        await send_data_task
        await receive_data_task


if __name__ == "__main__":
    asyncio.run(client_test())
```
Connection workflow
Get an API KEY from the platform and pass it as the apikey parameter in the header, then send the WebSocket handshake request to the server.
After the handshake succeeds, the client can upload and receive data simultaneously over the WebSocket connection. Once all audio data has been uploaded, the client should send a JSON message containing a status field that signals the end of the stream.
After the server has finished processing all audio streamed from the client, it sends an indicator message telling the client to close the connection, and the WebSocket connection is then disconnected.
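The end-of-upload message in the workflow above is plain JSON. A minimal sketch of building it is shown below; the "CloseConnection" status value follows the example clients on this page, so confirm the exact value your deployment expects:

```python
import json

def end_of_stream_message() -> str:
    # JSON payload signalling that all audio has been uploaded.
    # "CloseConnection" matches the example clients on this page;
    # the exact status value may differ per deployment.
    return json.dumps({"data": {"status": "CloseConnection"}})

print(end_of_stream_message())  # {"data": {"status": "CloseConnection"}}
```

The client sends this as a text frame after the last binary audio frame, then keeps receiving until the server confirms the close.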
Stream directly from microphone
Example client-side code that captures audio from the microphone.
```python
from websockets.client import WebSocketClientProtocol
from colorama import Fore
import websockets
import asyncio
import pyaudio
import json
import sys


async def send_audio_data(ws_client: WebSocketClientProtocol,
                          stream_mic: pyaudio.PyAudio.Stream,
                          mic_obj: pyaudio.PyAudio) -> None:
    CHUNK_SIZE = 1024
    print("Recording... Press Ctrl+C to stop")
    try:
        while True:
            buff = stream_mic.read(CHUNK_SIZE, exception_on_overflow=False)
            await ws_client.send(buff)
            await asyncio.sleep(0.05)
    finally:
        stream_mic.stop_stream()
        stream_mic.close()
        mic_obj.terminate()
        print("Record timeout")


async def receive_message(ws_client: WebSocketClientProtocol) -> None:
    try:
        begin = True
        while True:
            message = await ws_client.recv()
            data = json.loads(message)
            transcribe = data.get('text')
            state = data.get('state')
            if transcribe == "Confirm Close Connection":
                break
            if state == "NewSentence":
                sys.stdout.write('\033[s')  # Save the cursor position
                sys.stdout.flush()
                begin = False
            if begin:
                print(f"\r{transcribe}", end='', flush=True)
            else:
                sys.stdout.write('\033[u')  # Restore the saved cursor position
                print(f"{transcribe}", end='', flush=True)
    except websockets.exceptions.ConnectionClosed as e:
        print(e)
    finally:
        print(Fore.CYAN + "Connection closed gracefully" + Fore.RESET)


async def client_test():
    url = "wss://api-iflytek.iapp.co.th/asr"
    async with websockets.connect(
        url,
        ping_timeout=None,
        extra_headers={"apikey": "YOUR APIKEY"},
    ) as client_ws:
        SAMPLE_RATE = 16000
        CHANNELS = 1
        FRAME_SIZE = 1024
        FORMAT = pyaudio.paInt16
        cap_mic = pyaudio.PyAudio()
        # Open the input stream for the microphone
        stream = cap_mic.open(format=FORMAT, channels=CHANNELS,
                              rate=SAMPLE_RATE, input=True,
                              frames_per_buffer=FRAME_SIZE)
        send_data_task = asyncio.create_task(
            send_audio_data(client_ws, stream, cap_mic))
        receive_data_task = asyncio.create_task(receive_message(client_ws))
        await send_data_task
        await receive_data_task


if __name__ == "__main__":
    asyncio.run(client_test())
```
Response Pattern
The server processes each audio chunk only when it contains speech. At the very beginning of an utterance, the state is "Sentence", and the server keeps refining that sentence until it detects silence. When speech is detected again, the state becomes "NewSentence", indicating that a new sentence has begun.
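One way to consume this pattern is to keep a list of finished sentences and overwrite the last entry while the current sentence is still being refined. The hypothetical helper below illustrates the state handling, assuming the message shape shown in the example clients above ('text' and 'state' fields, with a final "Confirm Close Connection" text):

```python
import json

def handle_message(raw: str, lines: list[str]) -> bool:
    """Fold one server message into a running transcript.
    Returns False once the server confirms the close."""
    data = json.loads(raw)
    text = data.get("text")
    state = data.get("state")
    if text == "Confirm Close Connection":
        return False
    if state == "NewSentence" or not lines:
        lines.append(text)      # a new sentence begins
    else:
        lines[-1] = text        # refine the current sentence in place
    return True

# Simulated stream of server messages (shapes assumed from the examples above).
messages = [
    {"state": "Sentence", "text": "hello"},
    {"state": "Sentence", "text": "hello world"},
    {"state": "NewSentence", "text": "next"},
    {"text": "Confirm Close Connection"},
]
lines: list[str] = []
for msg in messages:
    if not handle_message(json.dumps(msg), lines):
        break
print(lines)  # ['hello world', 'next']
```

Appending on "NewSentence" instead of printing in place makes the pattern usable outside a terminal, e.g. for subtitle generation or logging.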