Gemini Omni Flash API

Gemini Omni Flash

What is Gemini Omni Flash?

Gemini Omni Flash is Google's first model in the Gemini Omni family, released May 19, 2026. It is a native multimodal video model where text, images, video, and audio can all sit in the model context, and the output is high-resolution video with audio.

The useful shift is the edit loop. Gemini Omni Flash treats media as context, then lets the next instruction build on what came before. A product still, a phone clip, a character reference, a voice preset, and a prompt can steer the same result. Characters, scene layout, camera motion, and spoken audio are meant to stay tied together as the edit changes.

Gemini Omni Flash is built around Gemini's real-world knowledge as well as media generation. That matters for scenes with cause and effect: gravity, fluid motion, object interaction, cultural references, and short explainers where the video has to show an idea instead of only matching a style.

For API work, Gemini Omni Flash is less like a plain text-to-video generator and more like a multimodal video editor. Give it the assets that matter, name those assets in the prompt, and use follow-up edits when the first result is close.

Key features of Gemini Omni Flash

Five examples show where the model is most different from plain text-to-video.

Create videos from any combination of inputs

Omni can use text, images, video, and audio context to produce one cohesive video. On Unifically, generation exposes prompts, image references, character references, voice fields, seeds, duration, and aspect ratio; edit mode adds one source video as context.

Complex ideas made visual

Gemini Omni Flash is useful for short explainers where the clip has to show a process, not just match an image style. Use it for science scenes, object interactions, chain reactions, and other prompts where the motion needs grounded logic.

Edit your videos through conversation

The edit model takes one source video and a natural-language instruction. Each prompt can change the world around the source clip while keeping the original scene as the starting point.

Reimagine the action

Use edit mode when the request changes what happens inside the clip: motion, objects, materials, camera behavior, or the action itself. Add image or character references when the edit needs a new subject, look, or identity.

Refine across multiple turns

Follow-up edits can change the environment, angle, style, or specific details while the original scene remains the thread. That makes Gemini Omni Flash useful when a video needs guided revision instead of a full regenerate.

Best for

Mixed-context video prompts

Use it when the prompt needs to combine written direction with images, source footage, characters, and voice choices.

Reference-led short clips

Use image and character references when the result needs to follow a subject, product, outfit, or setting from uploaded assets.

Existing video edits

Edit mode is built for one source video plus a prompt: restyle the clip, change the scene, or replace a subject with a character reference.

Concept explainers

Use it when a short clip needs to show a process, scientific idea, or cause-and-effect scene with grounded motion.

Character continuity

Character objects let one subject carry a name, description, image references, and voice metadata through generation or editing.

Voice-guided character clips

Use a request-level voice for generation, or per-character voice fields in edit mode when the clip needs speech direction tied to a subject.

Controlled reruns

Seed support helps rerun a prompt/reference setup when you need a close repeat for review or testing.
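As a rough sketch of the character-object and seeded-rerun items above, the Python below models one character entry and a helper that pins a seed on a copy of a request. Only `name` and `image_urls` appear in this page's request example; the `description`, `voice`, and `seed` key names are assumptions made for illustration and should be checked against the parameter docs before use.

```python
import copy
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Character:
    """One entry for the reference_characters array."""
    name: str
    image_urls: List[str] = field(default_factory=list)
    # ASSUMPTION: "description" and "voice" are placeholder key names.
    # The page says characters carry a description and voice metadata,
    # but does not show how those fields are spelled in the request.
    description: Optional[str] = None
    voice: Optional[str] = None

    def to_payload(self) -> dict:
        """Serialize the character, leaving out fields that were not set."""
        entry = {"name": self.name, "image_urls": list(self.image_urls)}
        if self.description is not None:
            entry["description"] = self.description
        if self.voice is not None:
            entry["voice"] = self.voice
        return entry


def with_seed(payload: dict, seed: int) -> dict:
    """Return a copy of a task payload with a fixed seed; the original is untouched."""
    rerun = copy.deepcopy(payload)
    rerun["input"]["seed"] = seed  # ASSUMPTION: the seed key is named "seed"
    return rerun


base = {
    "model": "google/gemini-omni-flash-video",
    "input": {
        "prompt": "@Character1 walks through a neon studio.",
        "reference_characters": [
            Character("Dancer", ["CHARACTER_IMAGE_URL"]).to_payload()
        ],
    },
}
rerun = with_seed(base, 42)
```

Keeping the seed on a copy means the base request can be reused with a different seed when a close repeat is not what you want.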

Use cases

  • Turn a rough product clip into a cleaner ad concept by giving the source video, brand stills, and a short prompt, then asking for lighting and background changes.
  • Generate a short character scene by uploading one to three character references and calling them out with @Character1 in the prompt.
  • Rework a creator clip for a new setting: keep the subject, change the environment, and ask for a camera move that fits the format.
  • Build a compact explainer from a short prompt, such as a clay-style process video or a chain reaction with clear physical motion.
  • Test a storyboard by combining sketches, a character reference, a voice preset, and written scene direction, then rerun with the same seed when the direction is close.
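Because prompts call out uploaded assets by tokens such as @Character1, a small pre-flight check can catch a prompt that mentions a reference that was never attached. The sketch below assumes tokens take the form `@Character<N>` and `@Image<N>` with 1-based numbering in upload order, which matches the example prompt on this page but is not confirmed as the full syntax; the helper name is hypothetical.

```python
import re

# ASSUMPTION: mentions look like @Character<N> or @Image<N>, numbered from 1
# in the order the references are supplied. Other token forms are ignored.
MENTION = re.compile(r"@(Character|Image)(\d+)")


def unresolved_mentions(prompt, image_count, character_count):
    """Return the mentions in the prompt that point past the supplied references."""
    limits = {"Image": image_count, "Character": character_count}
    missing = []
    for kind, index in MENTION.findall(prompt):
        n = int(index)
        if n < 1 or n > limits[kind]:
            missing.append(f"@{kind}{index}")
    return missing


prompt = "Make @Character1 and @Image2 dance in a neon studio."
print(unresolved_mentions(prompt, image_count=1, character_count=1))
# prints ['@Image2']
```

Running a check like this before submitting avoids spending a generation on a prompt whose references cannot resolve.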

Limitations

Gemini Omni Flash is video-first. The current workflow is video generation and video editing with prompts, references, source video, and voice fields.

Audio input is also narrower than the broad multimodal framing can make it sound. Voice references are the first supported audio path, so plan prompts around visual references, source video, character references, and voice direction rather than arbitrary audio analysis.

Speech edits need extra review. Treat spoken lines, voice changes, people-centric edits, captions, and brand marks as review-heavy. Generated videos include SynthID watermarking, but the watermark does not replace human QA.

When to use Gemini Omni Flash

Use Gemini Omni Flash when a video job needs more than a prompt: source footage, still references, character references, voice direction, grounded scene logic, or a targeted edit of one uploaded clip. Use Veo 3.1 when the job depends on first-frame or first-and-last-frame workflows, Extend, or Upscale.

API examples

Call Gemini Omni Flash from any language by POSTing to /v1/tasks. Full parameter docs live at docs.unifically.com/models/video/google/gemini-omni-flash-video.

curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "google/gemini-omni-flash-video",
    "input": {
      "prompt": "Make @Character1 and @Image1 dance in a neon studio with energetic camera movement.",
      "reference_image_urls": [
        "REFERENCE_IMAGE_URL"
      ],
      "reference_characters": [
        {
          "image_urls": [
            "CHARACTER_IMAGE_URL"
          ],
          "name": "Dancer"
        }
      ],
      "duration": 10,
      "aspect_ratio": "9:16"
    }
  }'

A successful submission returns a task_id. Poll GET /v1/tasks/<task_id> or set a callback_url on the request to receive the finished result.
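The same submit-then-poll flow can be sketched in Python with only the standard library. The URL, headers, and endpoints come from the curl example and the note above; the shape of the task response is not documented on this page, so the completion check is passed in by the caller rather than hard-coded, and the function names are illustrative.

```python
import json
import time
import urllib.request

API_BASE = "https://api.unifically.com"


def build_submit_request(api_key, payload):
    """Build (but do not send) the POST /v1/tasks request from the curl example."""
    return urllib.request.Request(
        f"{API_BASE}/v1/tasks",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def fetch_task(api_key, task_id):
    """GET /v1/tasks/<task_id> and decode the JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/v1/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def wait_for_task(fetch, is_done, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch() until is_done(task) is true or the attempt budget runs out."""
    for attempt in range(max_attempts):
        task = fetch()
        if is_done(task):
            return task
        if attempt < max_attempts - 1:
            sleep(interval)
    raise TimeoutError("task did not finish within the polling budget")


# Usage sketch (not executed here, since it needs a real key and task):
# ASSUMPTION: the task body exposes a "status" field with terminal values
# such as "completed" or "failed"; check the parameter docs for real names.
#
#   done = lambda task: task.get("status") in {"completed", "failed"}
#   result = wait_for_task(lambda: fetch_task(API_KEY, task_id), done)
```

Passing `fetch` and `is_done` as callables keeps the loop independent of the response schema and makes it easy to exercise without network access. When a publicly reachable endpoint is available, the callback_url route avoids polling altogether.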

FAQs

People also ask

Gemini Omni Flash is Google's first Gemini Omni model, released May 19, 2026. It is a native multimodal video model where text, images, video, and audio can be part of the model context, and the output is high-resolution video with audio.

Google's model card lists text strings, images, audio, and video files as supported inputs. On Unifically, the callable video endpoints expose prompts, image references, character references, one source video for edit mode, seeds, aspect ratio, duration, and voice presets.

Yes. The model card describes Gemini Omni Flash output as high-resolution video with audio.

Yes. The edit model takes exactly one source video and an edit prompt. You can ask for scene changes, background swaps, camera moves, action changes, style changes, and other follow-up edits while keeping the source clip as context.

Unifically exposes two model IDs. Use google/gemini-omni-flash-video for new text-to-video or reference-to-video clips. Use google/gemini-omni-flash-video-edit when you already have one source video and want to change it with a prompt.

The generate model supports 4, 6, 8, or 10 second clips. The edit model accepts one uploaded source video up to 30 seconds and returns an edited result based on the selected source range.

Yes. Google uses SynthID on the images and videos it creates, so generated media carries a provenance watermark.

The model card calls out three hard parts: keeping perfect consistency across repeated edits, handling complex motion, and rendering perfectly accurate text. Treat text-heavy scenes, fast action, speech edits, and hard physics shots as review passes, not one-click final files.
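The model IDs and clip-length limits given in the answers above lend themselves to a client-side check before a task is submitted. The following is a minimal sketch, assuming those stated values are current; the function names are hypothetical and the API may enforce further rules not listed on this page.

```python
GENERATE_MODEL = "google/gemini-omni-flash-video"
EDIT_MODEL = "google/gemini-omni-flash-video-edit"

GENERATE_DURATIONS = (4, 6, 8, 10)  # seconds, as stated for the generate model
MAX_EDIT_SOURCE_SECONDS = 30        # longest source clip stated for the edit model


def pick_model(has_source_video):
    """Edit model when a source video is supplied, generate model otherwise."""
    return EDIT_MODEL if has_source_video else GENERATE_MODEL


def check_generate_duration(seconds):
    """Reject clip lengths the generate model is not stated to support."""
    if seconds not in GENERATE_DURATIONS:
        raise ValueError(
            f"duration must be one of {GENERATE_DURATIONS}, got {seconds}"
        )
    return seconds


def check_edit_source(seconds):
    """Reject source clips longer than the stated 30-second limit."""
    if not 0 < seconds <= MAX_EDIT_SOURCE_SECONDS:
        raise ValueError(
            f"source video must be at most {MAX_EDIT_SOURCE_SECONDS}s, got {seconds}"
        )
    return seconds


print(pick_model(has_source_video=False))
# prints google/gemini-omni-flash-video
```

Failing fast on the client keeps an invalid duration from turning into a rejected task.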