Grok Imagine Video 1.5
What is Grok Imagine Video 1.5?
Grok Imagine Video 1.5 is xAI's image-to-video model, released in preview on May 30, 2026. You give it a starting image and a text prompt, and it animates that frame into a clip with motion and native sound in one pass. The source image becomes the first frame, so it sets composition, color, and identity, and the model holds the subject's shape as it moves. There is no text-to-video mode: every generation needs an input image. It debuted at the top of the Arena image-to-video board at 1474 Elo, one point ahead of Seedance 2.0 at 1473, and it does one thing well: turning a frame you already have into motion with matching sound.
Key features of Grok Imagine Video 1.5
Glass and refraction from a single still
From one abstract render, the model animates a glossy morphic form so its surfaces flow like liquid mercury and the prismatic bands refract as the shape deforms. Translucent materials and accurate reflection are where many video models break, so this is a strong quality signal.
Water physics with designed sound
A breaking wave crests, folds, and explodes into foam, with spray drifting in the wind and water rushing back in white rivulets. The native audio carries the boom of the swell, the hiss of pullback, and wind across the coast, all timed to the motion.
Lip-synced dialogue while moving
A skateboarder accelerates as the handheld camera tracks his face, and his spoken line stays lip-matched while the background streaks past. Dialogue, ambient street sound, and motion are all generated together, with no separate audio edit.
Micro-expressions with a held frame
The subject touches her cheek and shifts from a gentle smile to a wide one at the camera, while the products and text in frame stay locked. Subtle facial motion plus a stable composition is what makes a clip usable for ads.
Best for
Product photo to motion ad
Animate a packaging or hero shot into a short clip with a matching music bed and sound effects, then output 9:16 for paid social.
Talking character clips
Turn a portrait into a lip-synced scene where the subject speaks a short line, with dialogue baked into the same pass.
Water, glass, and fire motion
Add believable physics to a still, an area where many video models break, and let the native audio carry the swell, hiss, or crackle.
Short-form social clips
Reels, TikTok, and Shorts where audio and picture must match, generated together so the cut lands without a separate mix.
Concept frames to life
Bring a finished render or concept frame into motion for trailers and scene loops, with no separate audio edit needed.
Quick motion tests
Run the same image with different prompts to pick the best take before scaling up resolution or duration.
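The motion-test workflow above can be sketched as a small payload builder. This is a minimal sketch, not an official client: the field names mirror the curl request in the API examples section, while `build_test_payloads` and the draft settings (480p, 5 seconds) are assumptions made for illustration.

```python
import json

MODEL = "xai/grok-imagine-video-1.5-preview"  # model id as used in the API example


def build_test_payloads(image_url, prompts, resolution="480p", duration=5):
    """Return one request body per prompt, all sharing the same source image.

    The draft settings (480p, 5 s) are an assumption chosen to keep tests
    cheap; raise them once a take has been picked.
    """
    return [
        {
            "model": MODEL,
            "input": {
                "image_urls": [image_url],
                "prompt": prompt,
                "duration": duration,
                "resolution": resolution,
            },
        }
        for prompt in prompts
    ]


payloads = build_test_payloads(
    "https://example.com/portrait.png",
    [
        "She turns her head slowly toward the window. The camera stays still.",
        "She laughs and looks down. Slow push-in.",
    ],
)
print(json.dumps(payloads[0], indent=2))
```

Each body can then be submitted as its own task, so the takes can be compared side by side before rerunning the winner at 720p or a longer duration.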
Use cases
Start with a clean image, then let the prompt say only what moves. A cosmetics brand can take a packaging shot and get a 720p clip of a presenter touching her cheek and speaking to camera while the products stay locked in frame. A studio can animate an abstract render so its glass surfaces ripple and refract, useful for title cards and loops. A documentary-style edit can drop in a tracked sprint or a skateboarding beat with spoken lines lip-matched to the subject. Because audio rides along, an early-morning street scene can carry footsteps, breathing, and distant traffic without any post work. The 15-second ceiling and 24 fps make it a fit for ad spots, trailers, and social cuts rather than long sequences.
Limitations
This is an image-to-video model only. There is no text-to-video; for a clip from text alone, use the base Grok Imagine Video model instead. It is a preview release, so behavior can change. Negative prompts are ignored, so you describe what you want rather than what to avoid. Output tops out at 720p, and longer 15-second clips are more prone to artifacts than 5-to-8-second clips.
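These limits can be checked before a request is sent. The sketch below is one possible pre-flight check, assuming the request shape from the curl example on this page; `preflight` is a hypothetical helper, the `negative_prompt` key name is a guess, and the choice to warn above 8 seconds is an assumption based on the 5-to-8-second stability note.

```python
def preflight(input_params):
    """Return a list of problems for a request's "input" object.

    Limits come from this page: an input image is required, output tops out
    at 720p, clips run 1 to 15 seconds, and negative prompts are ignored.
    The "negative_prompt" key name is a guess used only for illustration.
    """
    problems = []
    if not input_params.get("image_urls"):
        problems.append("error: image_urls is required (no text-to-video)")
    resolution = input_params.get("resolution")
    if resolution is not None and resolution not in ("480p", "720p"):
        problems.append(f"error: unsupported resolution {resolution!r}")
    duration = input_params.get("duration", 6)  # 6 s is the stated default
    if not 1 <= duration <= 15:
        problems.append(f"error: duration {duration} is outside 1-15 seconds")
    elif duration > 8:
        # Assumption: treat anything past the 5-to-8 s range as higher risk.
        problems.append("warning: clips longer than 8 s are more prone to artifacts")
    if "negative_prompt" in input_params:
        problems.append("warning: negative prompts are ignored; describe what you want")
    return problems


for problem in preflight({"prompt": "a wave breaks", "duration": 15, "resolution": "1080p"}):
    print(problem)
```

Returning a list rather than raising on the first issue lets a caller show every problem at once and decide whether warnings should block submission.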
Grok Imagine Video 1.5 vs Seedance 2.0
On the Arena image-to-video board, Grok Imagine Video 1.5 (720p) ranks first at 1474 Elo, one point above Seedance 2.0 (720p) at 1473. On community votes the two are effectively tied. The practical split is audio: Grok 1.5 generates synced sound and lip-matched dialogue in the same pass, which Seedance does not. Use Grok 1.5 when the clip needs sound baked in; the leaderboard gap alone is too small to decide on.
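To see how small a one-point gap is, the ratings can be converted into an expected head-to-head win rate. This is a rough sketch that assumes the board's scores behave like standard Elo ratings on the usual 400-point logistic scale; the leaderboard's exact scoring method is not stated here, and `elo_expected_score` is an illustrative helper.

```python
def elo_expected_score(rating_a, rating_b, scale=400.0):
    """Expected score of A against B under the standard Elo logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


p = elo_expected_score(1474, 1473)
print(f"{p:.4f}")  # prints 0.5014
```

Under that assumption, Grok Imagine Video 1.5 would be preferred in roughly 50.1% of pairwise votes, which is consistent with treating the two models as tied.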
API examples
Call Grok Imagine Video 1.5 from any language by POSTing to /v1/tasks. Full parameter docs live at docs.unifically.com/models/video/xai/grok-imagine-video-1.5.
curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "xai/grok-imagine-video-1.5-preview",
    "input": {
      "image_urls": ["https://example.com/portrait.png"],
      "prompt": "She touches her cheek and smiles gently, then smiles wide looking into the camera. The camera stays still. AUDIO: soft ambient room tone, a light breath, a gentle acoustic note.",
      "aspect_ratio": "16:9",
      "duration": 6,
      "resolution": "720p"
    }
  }'
A successful submission returns a task_id. Poll GET /v1/tasks/<task_id>, or set a callback_url on the request, to receive the finished video URL.
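The submit-then-poll flow might look like the following in Python, using only the standard library. Treat it as a sketch under stated assumptions: the endpoint paths and headers come from the curl example, but the response schema is not shown on this page, so the completion check is supplied by the caller. Sending the Authorization header on the GET request is also an assumption.

```python
import json
import time
import urllib.request

API_BASE = "https://api.unifically.com"


def build_submit_request(api_key, payload):
    """Build (but do not send) the POST /v1/tasks request from the curl example."""
    return urllib.request.Request(
        f"{API_BASE}/v1/tasks",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def fetch_task(api_key, task_id):
    """GET /v1/tasks/<task_id> and decode the JSON body."""
    req = urllib.request.Request(
        f"{API_BASE}/v1/tasks/{task_id}",
        headers={"Authorization": f"Bearer {api_key}"},  # assumed to be required
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def poll_until_done(fetch, is_done, interval=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch() until is_done(task) is true, then return the task.

    The caller supplies is_done because the response fields are not
    documented here; raises TimeoutError after max_attempts polls.
    """
    for _ in range(max_attempts):
        task = fetch()
        if is_done(task):
            return task
        sleep(interval)
    raise TimeoutError("task did not finish in time")
```

A caller would send the built request with `urllib.request.urlopen`, read the task_id from the response, and then run something like `poll_until_done(lambda: fetch_task(key, task_id), lambda t: t.get("status") == "completed")`. The `status` field and `completed` value are placeholders; check the parameter docs for the real names.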
FAQs
What does Grok Imagine Video 1.5 do?
It animates a still image into a short video with synchronized native audio. You provide a starting image and a prompt describing the motion, camera, and sound, and it returns a clip with picture and audio in a single pass.
Can it generate video from text alone?
No. This model is image-to-video only, and every request needs an input image that becomes the first frame. For a clip from text alone, use the base Grok Imagine Video model.
Does it generate audio?
Yes. Sound effects, ambient audio, music, and lip-synced dialogue are generated in the same pass as the video and stay timed to the action. There is no second audio step.
How long can clips be?
From 1 to 15 seconds, with a default of 6. Clips of 5 to 8 seconds are the most stable; 15-second clips work but show more artifacts.
What resolution and frame rate does it output?
480p or 720p at 24 frames per second. You set the resolution per generation, and 720p is the ceiling. The 720p path renders at 1280×704.
Which aspect ratios are supported?
Seven ratios: 16:9, 9:16, 1:1, 4:3, 3:4, 3:2, and 2:3, with 1:1 as the default. That covers landscape, square, and vertical without a re-crop.
How should I write prompts?
Lock composition and color in the input image, then have the prompt say only what moves. Name the camera move you want, use intensity words for motion, and describe the sound directly or add an AUDIO section at the end. Negative prompts are ignored.
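The prompting advice above can be sketched as a small prompt composer. This is only an illustration: `compose_prompt` is a hypothetical helper, and the layout (motion, then camera, then a trailing "AUDIO:" section) is inferred from the prompt used in the curl example rather than from a documented format.

```python
def compose_prompt(motion, camera=None, audio=None):
    """Join motion, camera, and sound into one prompt string.

    Layout follows the prompt in the curl example: motion first, then the
    camera move, then a trailing "AUDIO:" section.
    """
    parts = [motion.strip()]
    if camera:
        parts.append(camera.strip())
    if audio:
        parts.append("AUDIO: " + audio.strip())
    return " ".join(parts)


prompt = compose_prompt(
    "She touches her cheek and smiles gently, then smiles wide looking into the camera.",
    camera="The camera stays still.",
    audio="soft ambient room tone, a light breath, a gentle acoustic note.",
)
print(prompt)
```

Keeping the three parts separate makes it easy to hold the camera and audio fixed while varying only the motion text between takes.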
Related models
View all models
Grok Imagine
Text-to-video or image-to-video, up to 10 seconds at 480p or 720p, in 5 aspect ratios, with custom, spicy, fun, and normal presets.
- Text to Video
- Image to Video
- Text to Image
- Image to Image
- Video to Video
Veo 3.1
4 model variants, 720p–4K, lip-synced dialogue and SFX in one call.
- Text to Video
- Image to Video
- Reference to Video
- Video to Video