HappyHorse 1.1 API

HappyHorse 1.1

What is HappyHorse 1.1?

HappyHorse 1.1 is Alibaba's June 22, 2026 upgrade to the HappyHorse video family, built by the ATH AI team for production workflows that need controlled motion, stable subjects, and native audio in one pass. It outputs 720p or 1080p MP4 video from text, a first-frame image, or up to nine reference images. The release tightens reference fidelity, motion coherence, and lip-sync over 1.0.

What's new in HappyHorse 1.1

Reference-to-video consistency is the headline upgrade. The model reads up to nine reference images and holds products, characters, and scene elements closer to the sources across the full video.

Motion is smoother in complex action. Alibaba retuned temporal modeling so fast movement and camera motion stay coherent instead of smearing mid-sequence.

Instruction following improved for long prompts. You can describe multi-beat scenes, timed dialogue blocks, and camera moves in one request up to 2,500 characters.

Visual quality is sharper on faces, skin texture, and fine detail. Close-ups and talking-head shots hold up better at 1080p than they did in 1.0.

Native audio generates with the video, including multilingual lip-sync. Write spoken lines in the prompt; audio and mouth movement render in the same pass with no separate audio step.

Multi-image reference-to-video

Up to nine reference images feed one generation. Products, props, and scene elements from the reference images stay visible in the finished video, which is the main reason to pick 1.1 over text-only models.

Native lip-sync from text alone

Text-to-video renders spoken dialogue with matched lip movement and ambient audio in one pass. No input image and no separate audio render step.

Lip-synced image-to-video

A single portrait or key frame becomes a talking performance. The model adds motion and synced speech while keeping the subject's look from the still.

First-frame motion fidelity

Image-to-video preserves the source composition while adding believable body motion. Good for wildlife, product hero shots, and any still that needs natural movement without drifting off-model.

Best for

Short-form drama and serialized character content

Identity has to match reference stills across episodes. Lock a cast with up to nine reference images per scene.

E-commerce and brand ads

Build motion from product photos, packaging shots, and styled reference boards without a physical shoot.

Talking-head explainers and instructors

Scripted dialogue in the prompt drives lip-sync and studio lighting without recording talent.

Marketing hero art in motion

Animate portrait photography or key art into short clips with native audio and camera drift.

Multi-reference scene builds

Several product or character stills must appear together in one video from a single reference-to-video call.

Variants

HappyHorse 1.1 has three input modes on one endpoint. Pick the mode that matches your inputs; resolution, duration, and seed work the same across all three.

Text-to-video

Write a prompt only. Pick 720p or 1080p, duration from 3 to 15 seconds (default 5), and an aspect ratio from 16:9, 9:16, 1:1, 4:3, or 3:4. Audio and lip-sync come from the prompt.
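The text-to-video limits above can be checked on the client before a request is sent. The following is a minimal sketch, not the host's SDK: the field names mode, prompt, resolution, ratio, and duration come from the curl example on this page, while the "720P" spelling (by analogy with "1080P"), the "seed" key, the integer-only duration, and the helper name build_t2v_input are assumptions made for illustration.

```python
# Sketch only: keys mirror the curl example on this page.
# Assumptions: "720P" spelling, the "seed" key, and integer durations.
ALLOWED_RESOLUTIONS = {"720P", "1080P"}
ALLOWED_RATIOS = {"16:9", "9:16", "1:1", "4:3", "3:4"}
MAX_PROMPT_CHARS = 2500      # prompt limit stated on this page
MAX_SEED = 2147483647        # seed range stated in the FAQ


def build_t2v_input(prompt, resolution="1080P", ratio="16:9", duration=5, seed=None):
    """Return the "input" object for a text-to-video request, or raise ValueError."""
    if not prompt or len(prompt) > MAX_PROMPT_CHARS:
        raise ValueError(f"prompt must be 1 to {MAX_PROMPT_CHARS} characters")
    if resolution not in ALLOWED_RESOLUTIONS:
        raise ValueError(f"resolution must be one of {sorted(ALLOWED_RESOLUTIONS)}")
    if ratio not in ALLOWED_RATIOS:
        raise ValueError(f"ratio must be one of {sorted(ALLOWED_RATIOS)}")
    if not isinstance(duration, int) or not 3 <= duration <= 15:
        raise ValueError("duration must be a whole number of seconds from 3 to 15")

    task_input = {
        "mode": "t2v",
        "prompt": prompt,
        "resolution": resolution,
        "ratio": ratio,
        "duration": duration,
    }
    if seed is not None:
        if not isinstance(seed, int) or not 0 <= seed <= MAX_SEED:
            raise ValueError(f"seed must be an integer from 0 to {MAX_SEED}")
        task_input["seed"] = seed  # key name is an assumption
    return task_input
```

The returned dictionary is meant to be placed under "input" in the request body, next to the "model" field.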

Image-to-video

Supply one first-frame image (minimum 300 px per side, aspect ratio between 1:2.5 and 2.5:1, up to 20 MB). Prompt is optional. Output aspect ratio follows the input image; you do not set aspect ratio in this mode.
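A first-frame image can be screened locally against these limits before upload. This is a rough sketch under stated assumptions: the function name check_first_frame is hypothetical, the caller is expected to supply the pixel dimensions and file size, and 20 MB is read conservatively as 20,000,000 bytes because the page does not say whether it means decimal or binary megabytes.

```python
MIN_SIDE_PX = 300
MAX_BYTES = 20_000_000  # assumption: 20 MB read as decimal megabytes (the stricter reading)


def check_first_frame(width, height, size_bytes):
    """Return a list of human-readable problems; an empty list means the image passes."""
    problems = []
    if width < MIN_SIDE_PX or height < MIN_SIDE_PX:
        problems.append(f"each side must be at least {MIN_SIDE_PX} px")
    # Aspect ratio must lie between 1:2.5 and 2.5:1.
    # Integer form of width/height <= 2.5 and height/width <= 2.5 avoids float rounding.
    if 2 * width > 5 * height or 2 * height > 5 * width:
        problems.append("aspect ratio must be between 1:2.5 and 2.5:1")
    if size_bytes > MAX_BYTES:
        problems.append("file must be at most 20 MB")
    return problems
```

Because output aspect ratio follows the input image in this mode, the same width and height also indicate the orientation of the generated video.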

Reference-to-video

Supply 1 to 9 reference images (short side at least 400 px, up to 20 MB each). Refer to subjects in the prompt as character1, character2, and so on, numbered in the order the images are passed; some hosts use [Image 1], [Image 2] labels instead. Set aspect ratio and duration as in text-to-video.
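A common mistake in reference-to-video prompts is naming a subject that has no matching image. The sketch below generates the character1, character2 labels for a given image count and flags labels in a prompt that fall outside that range. The helper names are hypothetical, only the characterN convention is handled (not the [Image N] variant), and 20 MB is again read conservatively as 20,000,000 bytes.

```python
import re

MAX_REFERENCES = 9
MIN_SHORT_SIDE_PX = 400
MAX_BYTES = 20_000_000  # assumption: 20 MB read as decimal megabytes


def reference_labels(count):
    """Labels for `count` reference images, in the order the images are passed."""
    if not 1 <= count <= MAX_REFERENCES:
        raise ValueError(f"reference-to-video takes 1 to {MAX_REFERENCES} images")
    return [f"character{i}" for i in range(1, count + 1)]


def check_reference(width, height, size_bytes):
    """Return a list of problems for one reference image; empty means it passes."""
    problems = []
    if min(width, height) < MIN_SHORT_SIDE_PX:
        problems.append(f"short side must be at least {MIN_SHORT_SIDE_PX} px")
    if size_bytes > MAX_BYTES:
        problems.append("file must be at most 20 MB")
    return problems


def unknown_labels(prompt, count):
    """Return characterN labels used in the prompt that have no matching image."""
    valid = set(reference_labels(count))
    unknown = []
    for match in re.finditer(r"\bcharacter\d+\b", prompt):
        label = match.group(0)
        if label not in valid and label not in unknown:
            unknown.append(label)
    return unknown
```

For example, with three images passed, a prompt that mentions character4 would be flagged before the request is submitted.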

Use cases

A micro-drama studio can lock a cast with up to nine reference stills per generation, then produce new scenes from script prompts while keeping wardrobe and faces stable across shots.

An e-commerce API can turn a flat product photo into a 5-second 1080p demo with natural camera drift and room tone, skipping a physical shoot.

A corporate training app can render a presenter from text alone: timed dialogue in the prompt drives lip-sync and studio lighting without recording talent.

A beauty brand can feed separate stills of a serum bottle, botanical props, and water droplets, then compose them into one spa-lit product video from a single reference-to-video call.

Limitations

Image-to-video does not accept an aspect ratio parameter. If you need vertical output from a still, start with a vertical source frame.

Maximum output resolution is 1080p. There is no 4K path in the current 1.1 API surface.

API examples

Call HappyHorse 1.1 from any language by POSTing to /v1/tasks. Full parameter docs live at docs.unifically.com/models/video/alibaba/happyhorse-1.1-video.

curl -X POST https://api.unifically.com/v1/tasks \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -d '{
    "model": "alibaba/happyhorse-1.1-video",
    "input": {
      "mode": "t2v",
      "prompt": "A golden retriever running through a field of wildflowers at sunset",
      "resolution": "1080P",
      "ratio": "16:9",
      "duration": 5
    }
  }'

A successful submission returns a task_id. Poll GET /v1/tasks/<task_id>, or set a callback_url on the request, to receive the finished result.
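The submit-then-poll flow can be sketched in Python with only the standard library. The URL, headers, model name, and body layout follow the curl example above; everything about the response is an assumption, since this page does not document the response schema. For that reason the caller supplies the is_finished check, and the "status" values shown in the usage note are hypothetical. Sending the Authorization header on the GET request is also assumed.

```python
import json
import time
import urllib.request

API_BASE = "https://api.unifically.com"
MODEL = "alibaba/happyhorse-1.1-video"


def build_submit_request(api_key, task_input):
    """Build (but do not send) the POST /v1/tasks request from the curl example."""
    body = json.dumps({"model": MODEL, "input": task_input}).encode("utf-8")
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}",
    }
    return urllib.request.Request(
        f"{API_BASE}/v1/tasks", data=body, headers=headers, method="POST"
    )


def make_http_fetch(api_key):
    """Return a function that GETs /v1/tasks/<task_id> and parses the JSON reply."""
    def fetch(task_id):
        # Assumption: the GET endpoint takes the same bearer token as the POST.
        request = urllib.request.Request(
            f"{API_BASE}/v1/tasks/{task_id}",
            headers={"Authorization": f"Bearer {api_key}"},
        )
        with urllib.request.urlopen(request, timeout=30) as response:
            return json.loads(response.read().decode("utf-8"))
    return fetch


def poll_task(fetch, task_id, is_finished, interval_s=5.0, max_attempts=60, sleep=time.sleep):
    """Call fetch(task_id) until is_finished(result) is true, or give up."""
    for _ in range(max_attempts):
        result = fetch(task_id)
        if is_finished(result):
            return result
        sleep(interval_s)
    raise TimeoutError(f"task {task_id} did not finish after {max_attempts} polls")
```

A caller would send the request with urllib.request.urlopen(build_submit_request(key, task_input)), read the task_id from the reply, and then call poll_task(make_http_fetch(key), task_id, is_finished) with a check matched to the host's actual schema, for instance a hypothetical lambda r: r.get("status") == "completed". Passing fetch and sleep as parameters keeps the loop testable without network access.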

FAQs

Inputs: text only for text-to-video, one image for image-to-video, or one to nine images for reference-to-video.

Resolution and length: 720p or 1080p, with output length from 3 to 15 seconds. Default duration is 5 seconds.

Aspect ratios: text-to-video and reference-to-video support 16:9, 9:16, 1:1, 4:3, and 3:4. Image-to-video inherits the first-frame aspect ratio automatically.

Audio: native audio renders with the video, including lip-sync when the prompt includes spoken lines.

Referencing subjects: in reference-to-video, name subjects character1, character2, and so on in the order images are passed, or use [Image 1], [Image 2] labels depending on the host schema.

Changes from 1.0: stronger reference consistency, smoother motion, better long-prompt adherence, sharper visuals, and improved audio sync.

Seed: an optional integer from 0 to 2147483647 for reproducible runs, when the host exposes it.