Generating Audio with Gemini Text-to-Speech
Two approaches are supported: a single call for one piece of text, and a per-slide pipeline that generates audio in parallel.
Models
- default/high-quality: gemini-2.5-pro-preview-tts
- throughput-oriented option: gemini-2.5-flash-preview-tts
Approach A: Basic Single-call TTS
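import { GoogleGenAI } from '@google/genai';

// Assumes the API key is exported as GEMINI_API_KEY.
const genAI = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Request audio output only, spoken with a single prebuilt voice.
const response = await genAI.models.generateContent({
  model: 'gemini-2.5-pro-preview-tts',
  contents: text,
  config: {
    responseModalities: ['AUDIO'],
    speechConfig: {
      voiceConfig: { prebuiltVoiceConfig: { voiceName: 'Leda' } },
    },
  },
});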
Example voices often used: Aoede, Charon, Fenrir, Kore, Leda, Puck.
Approach B: Pipeline Per-slide TTS
- Generate one audio file per slide.
- Keep the voice consistent across the whole video.
- Modulate style via audioProfile and directorNotes.
- Run requests concurrently, with the limit chosen per model according to its rate limits.
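The concurrency limit in the last point can be sketched with a small worker-pool helper. This is a minimal illustration, not the pipeline's actual implementation: mapWithConcurrency is a hypothetical name, and the per-slide worker (shown as synthesizeSlide in the comment) is assumed to wrap the generateContent call from Approach A.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order, so slide N maps to audio result N.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so lanes never take the same slide
      results[index] = await worker(items[index], index);
    }
  }

  const lanes = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: lanes }, () => runLane()));
  return results;
}

// Hypothetical usage; synthesizeSlide is an assumed wrapper around generateContent:
// const audio = await mapWithConcurrency(slides, 3, (s) => synthesizeSlide(s, 'Leda'));
```

If any worker rejects, Promise.all rejects as well, so retry or backoff logic would belong inside the worker. A lower limit is the safer starting point for the pro model if its rate limits turn out to be tighter than the flash model's.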
Extract and Convert Audio
Gemini returns base64-encoded raw PCM (24 kHz, mono, 16-bit) with no container header. Wrap it in a WAV header for downstream compatibility.
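// With the @google/genai client used above, candidates sit directly on the response.
const parts = response.candidates?.[0]?.content?.parts;
const audioPart = parts?.find((part) => part.inlineData?.data);
if (!audioPart?.inlineData?.data) throw new Error('No audio data in response');
const pcmBytes = Buffer.from(audioPart.inlineData.data, 'base64');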
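The WAV conversion itself is not shown above. One way to do it without extra dependencies is to prepend a standard 44-byte RIFF/WAVE header to the PCM bytes. The sketch below assumes the samples are little-endian signed 16-bit, which is what a WAV file expects; pcmToWav is a hypothetical helper name, not part of any SDK.

```typescript
// Wrap raw PCM samples in a minimal 44-byte RIFF/WAVE header.
// Defaults match the format stated above: 24 kHz, mono, 16-bit.
function pcmToWav(
  pcm: Uint8Array,
  sampleRate = 24000,
  channels = 1,
  bitsPerSample = 16,
): Uint8Array {
  const blockAlign = (channels * bitsPerSample) / 8; // bytes per sample frame
  const byteRate = sampleRate * blockAlign;          // bytes per second
  const wav = new Uint8Array(44 + pcm.length);
  const view = new DataView(wav.buffer);
  const writeAscii = (offset: number, s: string): void => {
    for (let i = 0; i < s.length; i++) view.setUint8(offset + i, s.charCodeAt(i));
  };

  writeAscii(0, 'RIFF');
  view.setUint32(4, 36 + pcm.length, true); // total file size minus 8 bytes
  writeAscii(8, 'WAVE');
  writeAscii(12, 'fmt ');
  view.setUint32(16, 16, true);             // fmt chunk size for PCM
  view.setUint16(20, 1, true);              // audio format 1 = linear PCM
  view.setUint16(22, channels, true);
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, byteRate, true);
  view.setUint16(32, blockAlign, true);
  view.setUint16(34, bitsPerSample, true);
  writeAscii(36, 'data');
  view.setUint32(40, pcm.length, true);     // size of the sample data
  wav.set(pcm, 44);
  return wav;
}
```

Because a Node Buffer is a Uint8Array, pcmBytes can be passed in directly, and the result can be written with fs.writeFileSync (the output file name is up to the caller).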
Cost and Logging
- Parse token usage from usageMetadata.
- Log per-slide costs where applicable.
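The two points above might look like the following in practice. This is a sketch under stated assumptions: promptTokenCount and candidatesTokenCount are the token fields read from usageMetadata, while RateCard, estimateCostUsd, and formatSlideCost are hypothetical names. The rates are deliberately left as inputs, since actual pricing has to come from the current price list.

```typescript
// Token counts as read from usageMetadata; any field may be absent.
interface UsageMetadata {
  promptTokenCount?: number;
  candidatesTokenCount?: number;
  totalTokenCount?: number;
}

// Hypothetical rate card in USD per one million tokens (caller supplies real prices).
interface RateCard {
  inputPerMillion: number;
  outputPerMillion: number;
}

// Estimate the cost of one call; missing counts are treated as zero.
function estimateCostUsd(usage: UsageMetadata | undefined, rates: RateCard): number {
  const input = usage?.promptTokenCount ?? 0;
  const output = usage?.candidatesTokenCount ?? 0;
  return (input * rates.inputPerMillion + output * rates.outputPerMillion) / 1e6;
}

// Build one log line per slide so costs can be summed or grepped later.
function formatSlideCost(
  slide: number,
  usage: UsageMetadata | undefined,
  rates: RateCard,
): string {
  const cost = estimateCostUsd(usage, rates);
  const input = usage?.promptTokenCount ?? 0;
  const output = usage?.candidatesTokenCount ?? 0;
  return `slide=${slide} in=${input} out=${output} usd=${cost.toFixed(6)}`;
}
```

In the per-slide pipeline, each worker could call formatSlideCost with response.usageMetadata and pass the line to whatever logger the project already uses.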
Which Approach to Choose
- Use single-call for simple tools.
- Use per-slide pipeline for multi-slide video generation workflows.