
How to generate audio with the gen-ai-audio skill

Skills · 3 min · Beginner

Add voice and music generation to your agent using ElevenLabs voices, MiniMax music, and speech-to-speech capabilities.


What you'll learn

  • How to import the gen-ai-audio skill into your agent environment
  • How to generate voiceovers with natural-sounding AI voices
  • How to create background music and soundtracks for videos
  • How to use speech-to-speech for voice transformation and dubbing

What is the gen-ai-audio skill?

The gen-ai-audio skill gives your AI agent access to voice synthesis via ElevenLabs, music generation via MiniMax, and speech-to-speech voice transformation. It's like hiring a voice actor, composer, and audio engineer: ask for a voiceover in a specific tone, a 30-second background track, or a dubbed version of existing audio, and your agent produces it.

Common use cases

  • Video production: Generate voiceovers for explainer videos and tutorials
  • Podcasting: Create intro/outro music and voice effects
  • E-learning: Narrate course content in multiple languages
  • Marketing: Produce audio ads and brand jingles
  • Accessibility: Add audio descriptions to visual content
  • Social media: Create audio memes and TikTok sound effects

Generate your audio step by step

STEP 1: Download and import the skill

  • On web: Go to picsart.com/cli/#skills-starter → Download gen-ai-audio → Extract to your agent's skills folder
  • On mobile: Download the skill on a desktop instead, because audio generation requires a development environment

STEP 2: Choose your audio type and voice

Select what kind of audio you want to generate:

  • Text-to-speech: Convert written text into natural voiceover (ElevenLabs voices)
  • Music generation: Create background tracks and soundtracks (MiniMax)
  • Sound effects: Generate specific SFX for videos and games
  • Speech-to-speech: Transform existing audio to a different voice or language

For voice output, also specify voice characteristics such as tone (warm, confident, energetic) and pacing.
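
The choices above can be captured in a small structure before you hand them to your agent. The following is a minimal sketch, not the skill's actual interface: the `AudioRequest` class, the `AUDIO_TYPES` names, and the prompt wording are all hypothetical and exist only to show how audio type, tone, and pacing might be combined into one request.

```python
from dataclasses import dataclass

# Hypothetical labels mirroring the audio types listed in this step;
# they are illustrative and not identifiers defined by the skill.
AUDIO_TYPES = {"text-to-speech", "music", "sound-effects", "speech-to-speech"}


@dataclass
class AudioRequest:
    audio_type: str
    content: str      # script text, music description, or source audio path
    tone: str = ""    # e.g. "warm", "confident", "energetic"
    pacing: str = ""  # e.g. "slow", "moderate"

    def to_prompt(self) -> str:
        """Combine the fields into one plain-language request string."""
        if self.audio_type not in AUDIO_TYPES:
            raise ValueError(f"unknown audio type: {self.audio_type!r}")
        parts = [f"Generate {self.audio_type} audio: {self.content}"]
        if self.tone:
            parts.append(f"Tone: {self.tone}.")
        if self.pacing:
            parts.append(f"Pacing: {self.pacing}.")
        return " ".join(parts)


request = AudioRequest(
    "text-to-speech", "Welcome to the course.", tone="warm", pacing="moderate"
)
print(request.to_prompt())
```

Keeping tone and pacing as separate optional fields makes it easy to change one of them and regenerate without rewriting the whole request.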

STEP 3: Generate and save

Your agent processes the request and generates the audio file. Output saves to your project folder in MP3 or WAV format. Check your terminal for the exact filename and location.
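
If the terminal output has scrolled away, you can also look for the file yourself. This is a small sketch that assumes only what the step above states, namely that output lands somewhere under your project folder as MP3 or WAV. The helper name `newest_audio_file` is made up for illustration and uses only the Python standard library.

```python
from pathlib import Path
from typing import Optional

AUDIO_SUFFIXES = {".mp3", ".wav"}


def newest_audio_file(project_dir: str) -> Optional[Path]:
    """Return the most recently modified MP3 or WAV under project_dir, or None."""
    candidates = [
        p for p in Path(project_dir).rglob("*")
        if p.is_file() and p.suffix.lower() in AUDIO_SUFFIXES
    ]
    # max() with default=None avoids an error when no audio file exists yet.
    return max(candidates, key=lambda p: p.stat().st_mtime, default=None)


if __name__ == "__main__":
    latest = newest_audio_file(".")
    print(latest if latest else "No MP3 or WAV files found")
```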

STEP 4: Review and refine

Listen to your generated audio and check for quality:

  • Check that pronunciation and intonation sound natural
  • Verify pacing matches your intended use (not too fast or slow)
  • For music, confirm the mood and energy level match your content

Not quite right? Adjust your voice direction or prompt phrasing and generate again. For voiceovers, try different emotion cues or pacing instructions.
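
Listening is the real test, but pacing can also be sanity-checked with a rough words-per-minute estimate. The sketch below is an illustration under stated assumptions: it handles only uncompressed WAV files through Python's standard `wave` module (MP3 would need a third-party library), and the function names are invented for this example. A figure of around 150 words per minute is often cited for conversational narration, so treat it as a loose reference point rather than a target.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Duration of an uncompressed WAV file: frame count / frames per second."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def words_per_minute(script: str, duration_seconds: float) -> float:
    """Rough speaking rate for a script read over the given duration."""
    if duration_seconds <= 0:
        raise ValueError("duration must be positive")
    return len(script.split()) * 60.0 / duration_seconds
```

For example, a five-word script read over two seconds works out to 150 words per minute. A much higher result suggests asking for slower pacing on the next generation.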

Tips for best results

💡 Describe voice tone and emotion, not just words

Instead of just providing text, add direction like "warm and reassuring," "energetic and enthusiastic," or "calm and professional." The more context you give about how the voice should sound, the better the result.

💡 Use speech-to-speech for accent or language variants

If you already have a voiceover but need it in a different accent or language, use speech-to-speech mode. Provide the original audio file and specify the target voice characteristics or language.

💡 Generate music first, then sync to video

When creating soundtracks for video, generate the music separately with clear mood and length requirements ("upbeat 15-second track"). Then attach it to your video using the gen-ai-video skill's audio attachment feature.

Frequently asked questions

Can I clone a specific person's voice?

No. The skill uses licensed voice models from ElevenLabs that are trained ethically and legally. You can describe voice characteristics ("deep male voice," "young female British accent") but cannot clone or mimic specific individuals without proper authorization.

How do I generate music that matches my content?

Use descriptive prompts with MiniMax music generation. Specify genre ("lo-fi hip-hop," "cinematic orchestral"), mood ("uplifting," "mysterious"), and instrumentation ("piano and strings," "electronic synths"). Also include length requirements ("30 seconds," "1 minute").
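
The four ingredients named here (genre, mood, instrumentation, and length) can be assembled into a prompt programmatically if you generate many tracks. This is a minimal sketch: `build_music_prompt` and its sentence template are hypothetical, not part of the skill or of MiniMax, and the exact phrasing that works best may differ.

```python
def build_music_prompt(
    genre: str, mood: str, instrumentation: str, length_seconds: int
) -> str:
    """Assemble a descriptive music prompt from its four ingredients."""
    if length_seconds <= 0:
        raise ValueError("length_seconds must be positive")
    return (
        f"{mood} {genre} track with {instrumentation}, "
        f"{length_seconds} seconds long"
    )


print(build_music_prompt("lo-fi hip-hop", "uplifting", "piano and strings", 30))
# prints: uplifting lo-fi hip-hop track with piano and strings, 30 seconds long
```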

Can I generate voiceovers in other languages?

Yes. ElevenLabs supports multiple languages for text-to-speech. Specify the target language in your request ("Spanish voiceover," "French narration"). Speech-to-speech mode can also translate and transform audio across languages.

What audio formats does the skill output?

The skill typically outputs MP3 for voiceovers and music (smaller file size, widely compatible) and WAV for high-quality applications. You can request a specific format in your prompt if needed.

Ready to add voice and music to your content?

Import the gen-ai-audio skill and start generating professional voiceovers and soundtracks.
