API Reference
Real-time, production-grade speech-to-text. Stream microphone input over WebSockets with sub-second latency, transcribe audio files, and translate speech — over a small, OpenAI-compatible REST + WebSocket surface. Plug it into any stack in minutes, no SDK required.
Prefer Python? The official SDK wraps every endpoint below with a fully-typed, async-friendly client — same model, same accuracy, same guarantees. Its full guide lives further down this same page.
Get an API key
You'll need an API key to use the API. Request one from your UltraSafe AI contact. Keys look like:
usf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Keep your key secret — anyone with it can make transcription requests against your account. Don't commit it to source control or paste it into client-side code.
Base URL
All endpoints are served under a single base URL. For the current beta:
https://api-prod-usf.us.inc
For WebSocket endpoints (only /v1/audio/transcriptions/stream so far) replace
https:// with wss://:
wss://api-prod-usf.us.inc
Authentication
Every authenticated REST request needs an API key in the Authorization
header:
Authorization: Bearer usf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
The WebSocket endpoint accepts the key as a query-string parameter instead — see Real-time streaming below.
A few endpoints are public (no key needed): GET /health,
GET /ready, and GET /v1/capabilities.
Quick start
# 1. Set your key + base URL once
export USF_API_KEY="usf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
export USF_BASE_URL="https://api-prod-usf.us.inc"
# 2. Transcribe a file
curl -sS "$USF_BASE_URL/v1/audio/transcriptions" \
-H "Authorization: Bearer $USF_API_KEY" \
-F file=@audio.wav \
-F model=usf-mini-asr
Response:
{
"text": "Hello, this is a test of the UltraSafe ASR transcription service.",
"duration": 4.13,
"processing_time_s": 0.31,
"model_inference_s": 0.142,
"timing": {
"total_s": 0.31,
"model_inference_ms": 141.8,
"audio_decode_ms": 1.0
}
}
That's the whole flow for a basic call. The rest of this document covers every parameter and every endpoint in detail.
Endpoints at a glance
| Method | Path | Auth | Status | Purpose |
|---|---|---|---|---|
| GET | /health | none | Live | Liveness — is the proxy alive? |
| GET | /ready | none | Live | Readiness — is upstream reachable? |
| GET | /v1/capabilities | none | Live | Feature flags and model info for this deployment |
| GET | /v1/models | bearer | Live | List available models |
| GET | /v1/models/{id} | bearer | Live | Inspect a single model |
| POST | /v1/audio/transcriptions | bearer | Live | Transcribe an audio file |
| POST | /v1/audio/translations | bearer | Live | Translate audio to English text |
| POST | /v1/audio/enhance | bearer | Under development | Standalone audio enhancement (denoise, BWE) with optional transcript — currently returns 503 |
| WS | /v1/audio/transcriptions/stream | query-string key | Live | Real-time streaming transcription |
Status legend. Live — fully working in production today. Under development — the endpoint is reachable and the request shape is stable, but the underlying capability is not yet enabled on the production deployment. Sections below flag any individual fields that share the same status.
Capabilities
Not every deployment has every feature enabled. Call this first to see what's available. Public — no key required.
curl -sS "$USF_BASE_URL/v1/capabilities"
Response (truncated):
{
"model": { "id": "usf-mini-asr", "language": "en" },
"features": {
"vad": { "enabled": true },
"noise_reduction": { "enabled": false, "level": "medium" },
"audio_enhancement": { "enabled": false },
"diarization": { "enabled": false, "method": "clustering" },
"speaker_identification": { "enabled": false }
},
"streaming": { "websocket": true, "sse": false },
"response_formats": ["json", "verbose_json", "text"]
}
If features.diarization.enabled is false, passing diarization=clustering
to a transcription call will return a 400 error. Always check first.
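That check is cheap enough to run once at client startup. A minimal sketch in Python with requests — the helper names here are ours, not part of the API:

```python
import requests

def fetch_features(base_url):
    """GET /v1/capabilities (public, no key) and return the features map."""
    r = requests.get(f"{base_url}/v1/capabilities", timeout=10)
    r.raise_for_status()
    return r.json().get("features", {})

def transcription_fields(features):
    """Build optional multipart fields, sending only what this deployment supports."""
    fields = {"model": "usf-mini-asr"}
    if features.get("vad", {}).get("enabled"):
        fields["enable_vad"] = "true"
    if features.get("diarization", {}).get("enabled"):
        fields["diarization"] = "clustering"
    return fields
```

Gating every optional parameter on the flags means a deployment with diarization disabled never sees the field — and never returns the 400.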
Models
curl -sS "$USF_BASE_URL/v1/models" \
-H "Authorization: Bearer $USF_API_KEY"
{
"data": [
{
"id": "usf-mini-asr",
"object": "model",
"owned_by": "ultrasafe",
"created": 1714000000
}
],
"object": "list"
}
Or fetch a single model by id:
curl -sS "$USF_BASE_URL/v1/models/usf-mini-asr" \
-H "Authorization: Bearer $USF_API_KEY"
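The same calls in Python (requests assumed installed, env vars set as in the quick start; the helper names are ours):

```python
import os
import requests

def model_ids(listing):
    """Pull the ids out of a /v1/models response body."""
    return [m["id"] for m in listing.get("data", [])]

def list_models():
    """GET /v1/models and return the available ids."""
    base = os.environ["USF_BASE_URL"]
    headers = {"Authorization": f"Bearer {os.environ['USF_API_KEY']}"}
    r = requests.get(f"{base}/v1/models", headers=headers, timeout=10)
    r.raise_for_status()
    return model_ids(r.json())
```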
File transcription
POST /v1/audio/transcriptions — converts an audio file into text.
Request
| Field | Type | Where | Default | Description |
|---|---|---|---|---|
| file | binary | multipart | required | The audio file. WAV, MP3, FLAC, M4A, OGG. Buffered into memory before sending. |
| model | string | multipart | usf-mini-asr | Model id (see /v1/models). |
| response_format | string | multipart | json | One of json, verbose_json, text. |
| language | string | multipart | auto-detect | ISO 639-1 code, e.g. en, es, fr. |
| prompt | string | multipart | — | Bias the decoder with domain words ("Medical terminology."). |
| temperature | number | multipart | 0 | Sampling temperature, 0 to 1. |
| timestamp_granularities | json array | multipart | — | Subset of ["segment", "word"] (JSON-encoded). Under development — the parameter is accepted but segments / words are not yet returned in the response. |
Voice activity detection (when features.vad.enabled is true)
| Field | Type | Default | Description |
|---|---|---|---|
| enable_vad | boolean | server | Force VAD on or off. |
| vad_threshold | number | 0.5 | Probability threshold for "speech." |
| vad_min_speech_duration_ms | int | 250 | Drop speech segments shorter than this. |
| vad_min_silence_duration_ms | int | 300 | Merge across silences shorter than this. |
Audio enhancement (when features.audio_enhancement.enabled is true)
Under development
noise_reduction, enable_background_suppression, enable_voice_extraction, and enable_audio_upscale are listed as capabilities by /v1/capabilities but are not yet wired up on the production deployment. The fields are accepted by the API for forward compatibility — currently they have no effect. VAD (above) and diarization (below) are live.
| Field | Type | Description |
|---|---|---|
| noise_reduction | string | off, low, medium, high. |
| enable_background_suppression | boolean | Drop background noise before ASR. |
| enable_voice_extraction | boolean | Isolate the dominant voice. |
| enable_audio_upscale | boolean | Bandwidth extension before ASR. |
Diarization (when features.diarization.enabled is true)
| Field | Type | Description |
|---|---|---|
| diarization | string | off, pyannote, clustering, spectral. |
| num_speakers | int | Hint: exact number of speakers. |
| min_speakers | int | Hint: minimum. |
| max_speakers | int | Hint: maximum. |
Speaker separation
| Field | Type | Description |
|---|---|---|
| enable_speaker_separation | boolean | Split overlapping speakers before ASR. |
| speaker_similarity_threshold | number | 0–1; lower = more aggressive splits. |
Examples
Minimal:
curl -sS "$USF_BASE_URL/v1/audio/transcriptions" \
-H "Authorization: Bearer $USF_API_KEY" \
-F file=@audio.wav \
-F model=usf-mini-asr
With language hint and a prompt:
curl -sS "$USF_BASE_URL/v1/audio/transcriptions" \
-H "Authorization: Bearer $USF_API_KEY" \
-F file=@meeting.wav \
-F model=usf-mini-asr \
-F language=en \
-F prompt="Quarterly earnings call. Acme Corp." \
-F response_format=verbose_json
With VAD enabled:
curl -sS "$USF_BASE_URL/v1/audio/transcriptions" \
-H "Authorization: Bearer $USF_API_KEY" \
-F file=@audio.wav \
-F model=usf-mini-asr \
-F response_format=verbose_json \
-F enable_vad=true
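With diarization hints, sketched in Python. Check /v1/capabilities first — the deployment shown in the Capabilities example has diarization disabled, and the field would 400 there. The helper names are ours:

```python
import os
import requests

def diarization_fields(method="clustering", min_speakers=None, max_speakers=None):
    """Multipart fields for speaker labels; only send these when the
    deployment reports features.diarization.enabled == true."""
    fields = {"diarization": method}  # or "pyannote" / "spectral"
    if min_speakers is not None:
        fields["min_speakers"] = str(min_speakers)
    if max_speakers is not None:
        fields["max_speakers"] = str(max_speakers)
    return fields

def transcribe_meeting(path):
    """POST a file with speaker-count hints (2-4 speakers assumed here)."""
    url = f"{os.environ['USF_BASE_URL']}/v1/audio/transcriptions"
    headers = {"Authorization": f"Bearer {os.environ['USF_API_KEY']}"}
    data = {"model": "usf-mini-asr", "response_format": "verbose_json"}
    data.update(diarization_fields(min_speakers=2, max_speakers=4))
    with open(path, "rb") as fh:
        r = requests.post(url, headers=headers, data=data,
                          files={"file": (path, fh, "application/octet-stream")},
                          timeout=300)
    r.raise_for_status()
    return r.json()
```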
Word- and segment-level timestamps are under development. The timestamp_granularities parameter is accepted, but the batch endpoint does not yet emit segments / words in the response. Per-segment timestamps are available today on the real-time streaming endpoint (the transcript event includes segment.start / segment.end / confidence).
Response (response_format=json)
{
"text": "Hello, this is a test of the UltraSafe ASR transcription service.",
"duration": 4.13,
"processing_time_s": 0.31,
"model_inference_s": 0.142,
"timing": {
"total_s": 0.31,
"model_inference_ms": 141.8
}
}
Response (response_format=verbose_json)
Adds the task and the detected language to the json response:
{
"task": "transcribe",
"text": "Hello, this is a test of the UltraSafe ASR transcription service.",
"language": "en",
"duration": 4.13,
"processing_time_s": 0.31,
"model_inference_s": 0.142,
"timing": {
"total_s": 0.31,
"model_inference_ms": 141.8
}
}
segments and words are under development for the batch endpoint and will be added in a future release. If you need per-segment timestamps today, use the real-time streaming endpoint — every transcript event includes a segment block with start, end, and confidence.
Response (response_format=text)
Plain-text body, no JSON wrapping:
Hello, this is a test of the UltraSafe ASR transcription service.
Python (no SDK, just requests)
import os, requests

url = f"{os.environ['USF_BASE_URL']}/v1/audio/transcriptions"
headers = {"Authorization": f"Bearer {os.environ['USF_API_KEY']}"}

with open("audio.wav", "rb") as fh:
    r = requests.post(
        url,
        headers=headers,
        files={"file": ("audio.wav", fh, "application/octet-stream")},
        data={"model": "usf-mini-asr", "response_format": "json"},
        timeout=300,
    )
r.raise_for_status()
print(r.json()["text"])
JavaScript / Node 18+ (fetch + FormData)
import fs from "node:fs";

const form = new FormData();
form.append("file", new Blob([fs.readFileSync("audio.wav")]), "audio.wav");
form.append("model", "usf-mini-asr");
form.append("response_format", "json");

const r = await fetch(`${process.env.USF_BASE_URL}/v1/audio/transcriptions`, {
  method: "POST",
  headers: { Authorization: `Bearer ${process.env.USF_API_KEY}` },
  body: form,
});
if (!r.ok) throw new Error(`HTTP ${r.status}: ${await r.text()}`);
console.log((await r.json()).text);
Browser (fetch from a <input type="file">)
const file = document.querySelector("#audio").files[0];
const form = new FormData();
form.append("file", file);
form.append("model", "usf-mini-asr");

const r = await fetch("/api-proxy/v1/audio/transcriptions", { // call YOUR backend
  method: "POST",
  body: form,
});
const { text } = await r.json();
Don't put the API key in browser code. Front your API with your own backend that injects the Authorization header.
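A stdlib-only sketch of such a backend, assuming the /api-proxy path convention from the browser snippet above. This is an illustration, not a hardened proxy — add TLS, request-size limits, and path allow-listing before exposing it:

```python
import os
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

UPSTREAM = "https://api-prod-usf.us.inc"  # base URL from this page

def forward_headers(incoming):
    """Copy the browser's Content-Type (it carries the multipart boundary)
    and inject the server-side API key."""
    out = {"Authorization": f"Bearer {os.environ.get('USF_API_KEY', '')}"}
    if "Content-Type" in incoming:
        out["Content-Type"] = incoming["Content-Type"]
    return out

class Proxy(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read the browser's multipart body and relay it verbatim upstream.
        body = self.rfile.read(int(self.headers.get("Content-Length", 0)))
        req = urllib.request.Request(
            UPSTREAM + "/v1/audio/transcriptions",
            data=body,
            headers=forward_headers(dict(self.headers)),
            method="POST",
        )
        with urllib.request.urlopen(req, timeout=300) as resp:
            payload = resp.read()
        self.send_response(resp.status)
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(payload)

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), Proxy).serve_forever()
```

The key only ever lives in the backend's environment; the browser talks to your origin and never sees a usf_… credential.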
Translation
POST /v1/audio/translations — transcribes audio in any supported language
and returns English text.
Same request shape as transcription, but the language parameter is ignored
(the model auto-detects the source language).
curl -sS "$USF_BASE_URL/v1/audio/translations" \
-H "Authorization: Bearer $USF_API_KEY" \
-F file=@french_audio.wav \
-F model=usf-mini-asr
{
"text": "Hello everyone, welcome to today's meeting.",
"duration": 5.12
}
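The same call from Python with requests; translate() mirrors the curl above, and both helper names are ours:

```python
import os
import requests

def endpoint(base, task):
    """Build an audio endpoint URL; task is 'transcriptions' or 'translations'."""
    return f"{base.rstrip('/')}/v1/audio/{task}"

def translate(path):
    """Translate any-language speech to English text."""
    url = endpoint(os.environ["USF_BASE_URL"], "translations")
    headers = {"Authorization": f"Bearer {os.environ['USF_API_KEY']}"}
    with open(path, "rb") as fh:
        r = requests.post(url, headers=headers,
                          files={"file": (path, fh, "application/octet-stream")},
                          data={"model": "usf-mini-asr"}, timeout=300)
    r.raise_for_status()
    return r.json()["text"]
```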
Audio enhancement
Under development
POST /v1/audio/enhance is documented for reference but is not currently enabled on the production API — calls return 503 Service Unavailable until the upstream upscale service is turned on. For voice activity detection, use enable_vad on /v1/audio/transcriptions; for live VAD events, use the streaming endpoint (it emits speech_activity events).
POST /v1/audio/enhance — standalone audio enhancement pipeline. Returns
enhanced audio (base64-encoded WAV by default) with optional transcripts of
the before/after.
| Field | Type | Description |
|---|---|---|
| file | binary | Source audio (multipart). |
| model | string | Model id; usf-mini-asr works for the default pipeline. |
| enable_denoise | boolean | Run the denoiser. |
| enable_bwe | boolean | Bandwidth extension (upscale narrowband audio to 16 kHz/24 kHz). |
| output_format | string | wav (default) or pcm. |
| output_sample_rate | int | Target sample rate of the enhanced audio. |
| include_transcription | boolean | Also transcribe the enhanced audio and return both texts. |
curl -sS "$USF_BASE_URL/v1/audio/enhance" \
-H "Authorization: Bearer $USF_API_KEY" \
-F file=@noisy.wav \
-F model=usf-mini-asr \
-F enable_denoise=true \
-F enable_bwe=true \
-F include_transcription=true
{
"audio_base64": "UklGRi…",
"audio_format": "wav",
"transcription": {
"text": "Hello everyone."
}
}
To save the enhanced audio:
curl -sS "$USF_BASE_URL/v1/audio/enhance" \
-H "Authorization: Bearer $USF_API_KEY" \
-F file=@noisy.wav \
-F enable_denoise=true \
-F enable_bwe=true \
| jq -r .audio_base64 | base64 -d > clean.wav
Real-time streaming
WSS /v1/audio/transcriptions/stream — open a WebSocket, push raw PCM
audio chunks, and receive live transcripts as they're produced.
Connect URL
The streaming endpoint takes the API key in the query string (most browser/CLI WebSocket clients can't send custom headers easily).
wss://api-prod-usf.us.inc/v1/audio/transcriptions/stream
?api_key=usf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
&model=usf-mini-asr
&sample_rate=16000
&audio_format=pcm_s16le
| Query param | Default | Description |
|---|---|---|
| api_key | required | Your usf_… key. |
| model | usf-mini-asr | Model id. |
| sample_rate | 16000 | Audio sample rate in Hz (8000–48000 typical). |
| audio_format | pcm_s16le | One of pcm_s16le, pcm_f32le, wav. |
Privacy note: the API key appears in the URL, which means it'll show up in any reverse-proxy access log between you and the server. Treat your key as a secret and don't paste streaming URLs into shared logs / chat / etc.
Frame protocol
- Client → server: raw audio bytes (binary frames). Each frame is one chunk of PCM at the rate/format you negotiated in the URL. ~100 ms per chunk is a good default.
- Server → client: JSON text frames, one event per frame, each tagged with a type field.
When you're done sending audio, send the JSON control message:
{ "type": "control", "action": "stop" }
The server finalises, emits a done event, and (a few hundred ms later) a
higher-quality retranscribe_result event before closing the connection.
Event types
| type | Sent when | Useful fields |
|---|---|---|
| ready | Immediately after the handshake. | — |
| speech_activity | Voice-activity detector flips on/off. | is_speech |
| transcript | A new interim or finalised segment. | is_final, segment.text, segment.start, segment.end |
| done | All audio processed (real-time pass complete). | response.text, response.duration |
| retranscribe_result | Best-quality full-context re-pass complete. | response.full_context_text |
| error | Something went wrong. | error.message, error.code |
The lifecycle is normally:
ready
↓ (one or more)
speech_activity → transcript (is_final=false) → transcript (is_final=true)
↓
done
↓
retranscribe_result ← prefer this for the final transcript
Python (no SDK, raw websockets)
import asyncio, json, wave, websockets

URL = (
    "wss://api-prod-usf.us.inc/v1/audio/transcriptions/stream"
    "?api_key=usf_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"
    "&model=usf-mini-asr&sample_rate=16000&audio_format=pcm_s16le"
)

async def stream(path: str) -> None:
    with wave.open(path, "rb") as w:
        sr, sw = w.getframerate(), w.getsampwidth()
        pcm = w.readframes(w.getnframes())

    async with websockets.connect(URL, max_size=None) as ws:
        chunk = int(sr * 0.1) * sw  # 100 ms chunks
        for i in range(0, len(pcm), chunk):
            await ws.send(pcm[i:i + chunk])
        await ws.send(json.dumps({"type": "control", "action": "stop"}))

        async for raw in ws:
            evt = json.loads(raw)
            t = evt.get("type")
            if t == "transcript" and evt.get("is_final"):
                print("FINAL:", evt["segment"]["text"])
            elif t == "retranscribe_result":
                print("BEST:", evt["response"]["full_context_text"])
                break

asyncio.run(stream("audio.wav"))
JavaScript / Node (ws package)
import WebSocket from "ws";
import fs from "node:fs";

const url = `wss://api-prod-usf.us.inc/v1/audio/transcriptions/stream`
  + `?api_key=${process.env.USF_API_KEY}`
  + `&model=usf-mini-asr&sample_rate=16000&audio_format=pcm_s16le`;

const ws = new WebSocket(url);

ws.on("open", () => {
  const pcm = fs.readFileSync("audio.pcm"); // raw 16-bit mono PCM @ 16 kHz
  const chunkBytes = 16000 * 2 * 0.1;       // 100 ms
  for (let i = 0; i < pcm.length; i += chunkBytes) {
    ws.send(pcm.slice(i, i + chunkBytes));
  }
  ws.send(JSON.stringify({ type: "control", action: "stop" }));
});

ws.on("message", (raw) => {
  const evt = JSON.parse(raw.toString());
  if (evt.type === "retranscribe_result") {
    console.log("FINAL:", evt.response.full_context_text);
    ws.close();
  }
});
Health & readiness
curl -sS "$USF_BASE_URL/health"
# → {"status":"healthy"}
curl -sS "$USF_BASE_URL/ready"
# → {"status":"ready","upstream_status":200}
/health reflects the proxy itself; /ready additionally probes the upstream.
Both are public (no key) and safe to use in load-balancer health checks.
Errors
The API returns conventional HTTP status codes and an OpenAI-style error envelope. Inspect the body to tell apart "you sent something wrong" from "the server is having a bad day."
| HTTP | Meaning | Common causes |
|---|---|---|
| 400 | Validation error | Unsupported model, missing file, bad query params, feature disabled in this deployment. |
| 401 | Missing or invalid key | Wrong Authorization header, key disabled, key revoked. |
| 403 | Forbidden | Auth header missing on a protected endpoint. |
| 413 | Payload too large | File exceeds the deployment's per-request limit. Split it. |
| 429 | Rate-limited | Slow down and retry with backoff. |
| 5xx | Server-side failure | Upstream broken, GPU OOM, transient. Retry with exponential backoff. |
Example error body:
{
"detail": {
"error": {
"message": "The specified model is not available.",
"type": "invalid_request_error",
"param": "model",
"code": "model_not_found"
}
}
}
For some endpoints the envelope is flatter ({"error": "..."}); always
check the HTTP status code first, then look at the body for context.
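A small helper that copes with both envelope shapes (pure illustration; the field names are taken from the examples above):

```python
def error_message(status, body):
    """Extract a human-readable message from either error envelope shape."""
    if isinstance(body, dict):
        # Nested OpenAI-style: {"detail": {"error": {"message": ...}}}
        detail = body.get("detail")
        if isinstance(detail, dict):
            err = detail.get("error")
            if isinstance(err, dict) and "message" in err:
                return err["message"]
        # Flat variant: {"error": "..."}
        if isinstance(body.get("error"), str):
            return body["error"]
    # Fall back to the status code when the body is empty or unrecognised.
    return f"HTTP {status}"
```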
Suggested retry policy
- Transport-level errors (TCP reset, TLS handshake failed, DNS): retry up to 3 times with exponential backoff (250 ms, 500 ms, 1 s).
- HTTP 5xx: same retry budget. Use jitter to avoid thundering-herd retries when a brief upstream blip recovers.
- HTTP 429: read Retry-After if present; otherwise back off ≥ 1 s.
- HTTP 4xx other than 429: do not retry. Fix the request.
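The policy above, sketched as two helpers (the constants are ours, matching the suggested 250 ms / 500 ms / 1 s schedule):

```python
import random

RETRYABLE = {429, 500, 502, 503, 504}

def backoff_delay(attempt, retry_after=None):
    """Exponential backoff with jitter: 0.25 s, 0.5 s, 1 s (+ up to 25% jitter).
    A Retry-After value, when present, takes precedence (min 1 s per the policy)."""
    if retry_after is not None:
        return max(retry_after, 1.0)
    base = 0.25 * (2 ** attempt)  # 0.25, 0.5, 1.0
    return base + random.uniform(0, base * 0.25)

def should_retry(status, attempt, max_attempts=3):
    """Retry only retryable statuses, and only within the budget."""
    return status in RETRYABLE and attempt < max_attempts
```

In a request loop: on a retryable status, sleep backoff_delay(attempt, retry_after) and try again; on anything else, surface the error body.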
Rate limits
There is no hard rate limit during the beta; we rely on per-key quotas
configured server-side. If you see sustained 429s, contact us and we'll
raise your quota or investigate.
Privacy & retention
- Audio is processed in-memory and not stored on the server.
- Transcripts are not persisted.
- Logs record user_label, IP, path, status, and duration on every request — no audio, no transcript content. We use these for billing and debugging.
OpenAI compatibility
The endpoints under /v1/audio/* are intentionally close to OpenAI's
Whisper API surface (same paths, mostly the same form fields, very similar
response shapes). Most existing OpenAI Whisper code can be pointed at this
API by changing the base_url and api_key.
Differences worth knowing:
- The streaming endpoint (/v1/audio/transcriptions/stream) is WebSocket-based, not SSE.
- Some optional parameters (enable_vad, diarization, enable_denoise, …) are UltraSafe-specific and have no OpenAI equivalent.
- model ids are different (usf-mini-asr, …).
Support
For API-key requests, increased quotas, or any issue with the API, contact your UltraSafe AI account manager.
