
Transcription

Converts the given audio file to text.

Create Transcription

Creates a transcription for the audio file.

Azure OpenAI

Request

POST https://api.core42.ai/openai/deployments/whisper/audio/transcriptions

OpenAI

Request

POST https://api.core42.ai/v1/audio/transcriptions
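A minimal sketch of calling the OpenAI-style endpoint above with a multipart form request. The helper name, the `CORE42_API_KEY` environment variable, and the parameter values are illustrative assumptions, not part of the official client:

```python
import os

# Hypothetical helper: assembles the pieces of the multipart request for the
# /v1/audio/transcriptions endpoint shown above. The file goes in the "file"
# form part; all other parameters are ordinary form fields.
def build_transcription_request(audio_path, model="whisper-1", **params):
    url = "https://api.core42.ai/v1/audio/transcriptions"
    # CORE42_API_KEY is an assumed variable name for your API key.
    headers = {"Authorization": f"Bearer {os.environ.get('CORE42_API_KEY', '')}"}
    data = {"model": model, **params}
    files = {"file": open(audio_path, "rb")}
    return url, headers, data, files

# Sending it (requires a real key, a real audio file, and e.g. the
# third-party requests library):
#   import requests
#   url, headers, data, files = build_transcription_request("speech.mp3", language="en")
#   resp = requests.post(url, headers=headers, data=data, files=files)
#   print(resp.json()["text"])
```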

Request Parameters

Name

Required

Type

Description

file

true

file

Audio file object to transcribe. Supported formats: flac, mp3, mp4, mpeg, mpga, m4a, ogg, wav, or webm.

model / deployment-id

true

string

Model ID to use for the request. Available models are gpt-4o-transcribe, gpt-4o-mini-transcribe, and whisper-1.
To transcribe a file larger than 25 MB, break it into chunks. Alternatively, you can use the Azure AI Speech batch transcription API if deploying on Azure OpenAI.
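The 25 MB limit above can be illustrated with a size calculation. This is a naive fixed-size split to show the arithmetic only; in practice a compressed audio container should be cut on silence boundaries and re-encoded (for example with an audio library), since raw byte slices are generally not playable on their own:

```python
MAX_UPLOAD_BYTES = 25 * 1024 * 1024  # 25 MB upload limit stated above

def split_into_chunks(data: bytes, max_bytes: int = MAX_UPLOAD_BYTES):
    """Naive fixed-size split of raw bytes into upload-sized pieces.
    Real audio chunking should cut on silence and re-encode each piece."""
    return [data[i:i + max_bytes] for i in range(0, len(data), max_bytes)]
```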

chunking_strategy

false

auto or object

Controls how the audio is cut into chunks. When set to "auto", the server first normalizes loudness and then uses voice activity detection (VAD) to choose boundaries. A server_vad object can be provided instead to tune the VAD parameters manually. If unset, the audio is transcribed as a single block.

chunking_strategy (as string)

false

string

Automatically set chunking parameters based on the audio. Must be set to "auto".

chunking_strategy.type

true

string

Must be set to server_vad to enable manual chunking using server-side VAD.

chunking_strategy.prefix_padding_ms

false

integer

Amount of audio to include before the VAD detected speech (in milliseconds).

chunking_strategy.silence_duration_ms

false

integer

Duration of silence (in milliseconds) used to detect the end of speech. Shorter values close chunks more quickly but may split speech on brief pauses.

chunking_strategy.threshold

false

number

Sensitivity threshold (0.0 to 1.0) for voice activity detection. A higher threshold will require louder audio to activate the model, and thus might perform better in noisy environments.
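Putting the server_vad sub-parameters together, a manual chunking configuration might look like the sketch below. The field names follow the table above; the specific values are illustrative assumptions:

```python
import json

# Assumed example values; only "type" is required per the table above.
chunking_strategy = {
    "type": "server_vad",        # enable manual server-side VAD chunking
    "prefix_padding_ms": 300,    # audio kept before detected speech
    "silence_duration_ms": 500,  # silence length that ends a chunk
    "threshold": 0.5,            # VAD sensitivity, 0.0 to 1.0
}

# In a multipart form request, an object parameter is typically sent as a
# JSON-encoded string field:
form_field = json.dumps(chunking_strategy)
```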

include

false

array

Additional information to include in the transcription response. logprobs returns the log probabilities of the tokens in the response, indicating the model's confidence in the transcription. logprobs only works with response_format set to json.
Note: This parameter is only available with the gpt-4o-transcribe and gpt-4o-mini-transcribe models.

language

false

string

Language of the audio file. Supplying the input language in ISO-639-1 format improves accuracy and latency.
The supported languages are: Afrikaans, Arabic, Armenian, Azerbaijani, Belarusian, Bosnian, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, English, Estonian, Finnish, French, Galician, German, Greek, Hebrew, Hindi, Hungarian, Icelandic, Indonesian, Italian, Japanese, Kannada, Kazakh, Korean, Latvian, Lithuanian, Macedonian, Malay, Marathi, Maori, Nepali, Norwegian, Persian, Polish, Portuguese, Romanian, Russian, Serbian, Slovak, Slovenian, Spanish, Swahili, Swedish, Tagalog, Tamil, Thai, Turkish, Ukrainian, Urdu, Vietnamese, and Welsh.

prompt

false

string

Optional text to guide the model's style or continue a previous audio segment. Ensure the prompt matches the audio language.

response_format

false

string

Format of the transcript output. Supported output formats: json, text, srt, verbose_json, and vtt.
Note: For gpt-4o-transcribe and gpt-4o-mini-transcribe, the only supported format is json.

temperature

false

number

Controls randomness of the output. The range is from 0 to 2. Values near 0 make the model more deterministic and predictable, so the transcript adheres closely to the input audio; values near 2 produce more random and diverse output, which can reduce coherence and relevance.

timestamp_granularities

false

array

The timestamp granularities to populate for this transcription. Either or both of word and segment are supported.
Note: response_format must be set to verbose_json to use timestamp granularities.
Note: There is no additional latency for segment timestamps, but generating word timestamps incurs additional latency.
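As a sketch, the form fields for requesting word-level timestamps might look like the following. The `timestamp_granularities[]` field name (bracket suffix for an array form field) and the values shown are assumptions for illustration:

```python
# Form fields for word- and segment-level timestamps. verbose_json is
# required per the note above; all values here are illustrative.
data = {
    "model": "whisper-1",
    "response_format": "verbose_json",
    # Array parameters in multipart forms are often sent with a [] suffix;
    # this exact field name is an assumption.
    "timestamp_granularities[]": ["word", "segment"],
}
```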