KALEIDO: OPEN-SOURCED MULTI-SUBJECT REFERENCE VIDEO GENERATION MODEL

Hefei University of Technology, Tsinghua University, Zhipu AI

*Corresponding Author Project Lead

Single Subject Video Generation

A cartoon figure with long blonde pigtails leaps playfully on a vast, fluffy pink cloud. Her pink, star-adorned cape billows with the movement, and her chunky boots press into the soft surface with each bounce. The butterfly charm on her necklace sways as she lands and springs up again, a determined expression on her face.

Inside a brightly lit retro diner, a young woman wearing Minnie Mouse ears sits in a red vinyl booth. She leans over a tall, frosty milkshake, her dark hair falling forward. With a small, focused smile, she slowly spins a long spoon in the glass, watching the creamy liquid swirl. She then lifts the spoon, her eyes following the single drop of milkshake clinging to its tip as she brings it close.

In a close-up shot, a stream of shimmering black sand is gently blown away, slowly revealing a rose gold watch fitted with a matte black leather strap. As the last grains of sand clear the dark, patterned dial, the diamond markers and bezel catch a soft light, casting a brilliant sparkle across the frame.

In the clear, turquoise water of a sunlit lagoon, a golden twisted bracelet slowly sinks. Tiny bubbles cling to its textured surface, catching the light as it gently rotates on its descent toward the soft, white sand below.

A young hero in a sleek, black armored suit stands on a high rooftop, overlooking the sprawling city at night. Intricate lines of glowing green energy trace the contours of his high-tech suit, casting an emerald sheen on his focused expression. A simple green domino mask hides his eyes as he scans the streets far below, his gaze sharp and unwavering. The cold night wind causes his dark, heavy cape to billow out behind him. For a moment, he is perfectly still, a silent guardian against the darkness. Then, the circular emblem on his chest flares with a brighter green light as he tenses, his gloved hands clenching into fists. He has found what he was looking for.

A powerful chestnut horse with a shimmering, copper-toned coat gracefully traverses the barren expanse of the moon. Its powerful hooves kick up delicate puffs of fine, grey lunar dust as it prances across the rocky terrain, each stride a blend of strength and elegance. With its long, reddish mane flowing in a silent, ethereal breeze, the horse pauses atop a gentle rise, overlooking the monumental craters that dot the stark landscape. In a moment of quiet reflection, the magnificent creature lifts its gaze to the distant Earth, a radiant blue marble suspended in the black immensity of space. The interplay of harsh light and deep shadow dances across its sleek, muscular form, creating a surreal spectacle of movement against the stately, serene backdrop of the celestial territory. As it resumes its impossible journey, the horse embodies a symbol of grace and freedom, an improbable yet mesmerizing presence amidst the solitude of the moon's ancient surface.

On the concrete sidewalk of a bustling city street, a pair of iridescent sneakers sits expectantly. Sunlight glints off their shimmering surfaces, causing the colors to shift and flow from electric blue to deep purple and vibrant pink, as if they hold a captured galaxy. Their chunky white soles provide a stark, modern contrast to the rustic charm of the old brick buildings that line the avenue. As people hurry past, their blurred shapes warp and slide across the sneakers' reflective exteriors. The distant, rhythmic sound of traffic provides a steady beat to the city's pulse, but the shoes remain perfectly still, a whimsical pause in the urban rush, waiting for their owner to return and continue their journey through the cityscape.

Iron Man's iconic red and gold armor gleams as he gracefully steps onto a futuristic catwalk, the metallic suit partially veiled by a stunning sheer illusion dress. The vibrant red lace is a bold statement, its intricate floral patterns and tiered midi-length skirt contrasting beautifully with the sleek, powerful lines of his armor. As the spotlights dim to a focused beam, he begins to move down the runway, his every step accentuating the delicate fabric's ethereal drape over his formidable silhouette. The sheer lace, with its unique diagonal texturing on the bodice and avant-garde graphic prints on the skirt, catches the ambient glow, creating a mesmerizing tapestry of light and shadow on his polished surface. With each movement, the short-sleeved dress sways enticingly, highlighting the synergy between advanced technology and high fashion. As he reaches the end of the runway, Iron Man pauses, the dress perfectly framing his powerful figure, before he gracefully pivots, the tiered fabric swirling around him, leaving the audience in awe of the seamless blend of sophistication and strength.

A young man with dark, neatly-cut hair stands on a brightly lit subway platform during the evening rush. He clutches a worn leather satchel strap slung over his shoulder, his knuckles white. His eyes, full of concentration, dart back and forth, scanning the electronic departures board overhead that flickers with train times and destinations. He shifts his weight from one foot to the other, a subtle sign of impatience. He pulls his dark jacket tighter around himself as a gust of wind, heralding an approaching train, sweeps through the tunnel. His gaze drops from the board and focuses down the dark tracks, a look of anticipation replacing his focused expression as the faint lights of his train appear in the distance.

A grey and white cat with wide, curious eyes carefully stalks across the top of a grand piano. Its tail gives a slight twitch as it lowers into a crouch, paws silently placed on the glossy black surface. The cat's gaze is fixed intently on a single, slowly descending dust mote dancing in a sunbeam. With a sudden, playful pounce, it bats the air before gracefully landing and continuing its quiet patrol along the piano keys.

A fluffy, lime-green creature with big, expressive black eyes stands under the warm glow of lanterns in a Taipei night market. Holding a dessert in its yellow, four-fingered hand, it tilts its furry head as it eats, gazing directly toward the camera. The background reveals a quiet street with a few scattered food stalls and sparse crowds, creating a peaceful night-time atmosphere.

A small, white dog with a coat of tight, fluffy curls trots energetically across a lush, green lawn, its plumed tail wagging with pure delight. As it explores the vibrant, sun-dappled garden, it suddenly freezes mid-trot, its pinkish-brown nose twitching as it catches the sweet, refreshing scent of watermelon on the breeze. Guided by the irresistible aroma, it makes its way to a wooden picnic table where a bright red slice of juicy watermelon rests. With its curious brown eyes gleaming, the dog eagerly sniffs the colorful fruit before taking a small, tentative nibble. The taste is an instant delight, and it begins to chomp enthusiastically, the sweet juice trickling down its chin as it savors its refreshing feast in the tranquil garden.

Multi Subject Video Generation

In a sun-drenched attic, a fox character in a green coat carefully pushes a vintage toy car across the dusty wooden floor. Seated in the driver's seat of the tiny car is a small penguin plush toy, its blue beanie askew as it wobbles with the movement. The fox gives the car one final, gentle push, sending it rolling slowly toward a ramp made from a stack of old books. The car, with the penguin plush toy inside, smoothly glides up the ramp and comes to a stop at the top.

In a warmly lit room, a fluffy corgi lies on a large, soft rug, its bushy tail twitching gently. A small grey and white kitten, wearing a delicate golden collar, crouches low, eyes fixed on the tail. With a wiggle of its hindquarters, the kitten pounces, batting playfully at the fluffy tip. The corgi lifts its head and turns, looking with curiosity as the kitten gives another gentle tap before scampering away to hide behind a nearby armchair.

In a beautifully appointed, sunlit room, a fluffy brown teddy bear is having a quiet moment of self-care. It sits comfortably on a plush sofa, a slightly lopsided blue ice pack still resting on its head and a "Get Well" patch peeking out from its chest. Cradled carefully in its lap is a large white ceramic bowl with the words "Bon Appétit" elegantly scripted in gold. The bowl is filled to the brim with fresh, deep-blue blueberries. With a determined but gentle motion, the bear uses one of its soft paws to pick out a single plump berry, brings it up to its stitched smile, and then dips its paw back in for another, enjoying its healthy and delicious treat one by one.

Under a canopy of glowing paper lanterns at a crowded night market, a young woman with long dark hair, wrapped in an elegant white cape with toggle fasteners, pauses before a curious stall. Her gaze is fixed on a row of miniature, corked glass bottles. The proprietor, a bizarre, spiky-haired character in a dark suit and a flimsy party hat, leans against a wooden post with an air of profound boredom, listlessly holding an empty martini glass. The woman lifts a hand and points a slender finger towards a small bottle containing a swirling, purple mist. With a theatrical sigh, the character straightens up, places his glass on the counter, and plucks the indicated bottle from its shelf. He holds it out to her between his thumb and forefinger. As she takes it, her fingers gently brush against his. He flinches, his weary eyes widening for a moment. She offers a small, curious smile, before carefully uncorking the bottle. A wisp of shimmering purple light escapes, dancing in the air for a second before vanishing, and a look of quiet wonder spreads across her face. The character watches her, his cynical expression momentarily replaced by one of faint intrigue.

On a city rooftop at dusk, a young man in a blue t-shirt leans against a railing, overlooking the urban landscape. A gentle breeze lifts his dark hair as he turns his head slightly. The fading sunlight glints across the lenses of his black rectangular sunglasses before he looks forward again into the twilight.

In a dimly lit garage, a young woman with long, dark hair kneels on the concrete floor. Wearing a black top and delicate gold earrings, she reaches under a dusty workbench and pulls out a small, black fabric utility bag. She sets it down in front of her, and with a focused gaze, her fingers move to unclip the two plastic buckles on the front flap.

Inside a cozy living room, a young man with a friendly smile sits on a plush grey sofa, the afternoon light warming the space. Wearing a t-shirt and a black baseball cap turned backwards, he holds a soft, brown sloth plushie in his hands. A wide grin spreads across his face as he playfully animates the toy, making its long, furry arms wave slowly in the air as if in a lazy greeting. He chuckles softly to himself, gently bouncing the sloth on his knee. His dark eyes sparkle with amusement as he continues his lighthearted game, pulling the soft toy into a gentle hug against his chest before setting it down on the cushion beside him.

In a sunlit corner of a cluttered antique shop, a man with a wild blonde afro carefully polishes a large, silver gramophone horn. A young woman in denim overalls weaves through stacks of old furniture and stops at a crowded table. She picks up a small, wooden music box and gently turns its crank. The man pauses, setting down his cloth, and turns his head towards her. She looks up from the music box, meeting his curious gaze across the room.

A soft brown sloth plushie sits on the carpeted floor of a bedroom next to a magenta schoolbag. With a determined look on its stitched face, the sloth fumbles with the zipper on the backpack's front pocket, using its soft claws to pull it open. After a moment of rummaging, it triumphantly pulls out a bright yellow, twin-bell alarm clock. Holding the clock in its paws, the plushie's expression seems to brighten, and it suddenly performs a small, joyful hop, its fluffy body bouncing with happiness.

In an endless sea of golden sunflowers swaying gently in the breeze, a man with a warm smile and kind eyes leans forward slightly, reaching out his hand. Opposite him, a young woman with long, dark hair lifts her gaze, the corners of her mouth curving upward in a smile of pure joy and contentment. She gently places her hand in his palm, and their fingers intertwine tightly, a gesture conveying a profound warmth and love that radiates between them. As they lock hands, she leans slightly towards him, her expression softening as she savors the sweetness and peace of the moment. Together, they begin to stroll serenely through the sun-drenched field, their silhouettes casting long, elegant shadows that stretch across the vibrant yellow blossoms.

Perched quietly on a delicate, spiky branch of a majestic Joshua tree, a little yellow bird begins to flap its wings with increasing vigor. Its wings create a gentle rustling sound, cutting through the stillness of the warm desert air. With its neck elongated and proudly poised, the bird lifts off from its perch, darting gracefully into the open vastness of the brilliant blue sky. The air serenely tugs at its soft, vivid feathers, crafting fluid patterns as it spirals upward. Below, a vibrant blue butterfly emerges, its iridescent wings catching the sunlight like polished jewels as it dances through the air. In the expansive landscape, the singular Joshua tree stands tall, its angular branches stretching towards the sun. As the bird soars higher, it notices the butterfly and embarks on a playful interaction—the two elegantly weave between the tree's reaching arms, catching glimpses of each other's vibrant colors. Together, they perform a breathtaking aerial ballet, a flash of yellow and a shimmer of blue, enjoying the vast expanse of their shared domain, free under the warmth of the sun's golden embrace.

In a bustling urban boba shop, a young man leans against a polished concrete counter. He is dressed in a maroon t-shirt and vibrant purple wide-leg shorts. With one hand, he casually holds a cup of bubble tea, gently swirling it, causing the dark pearls to dance within the milky liquid. He lifts his gaze from the cup, looking thoughtfully through the shop's large glass window at the passing city life, a moment of quiet contemplation amidst the afternoon rush.

Abstract

We present Kaleido, a subject-to-video (S2V) generation framework that synthesizes subject-consistent videos conditioned on multiple reference images of target subjects. Despite recent progress in S2V generation, existing approaches remain inadequate at maintaining multi-subject consistency and at background disentanglement, often resulting in low reference fidelity and semantic drift under multi-image conditioning. These shortcomings can be attributed to two main factors. First, existing training datasets lack diversity, high-quality samples, and cross-paired data, i.e., paired samples whose components originate from different instances. Second, current mechanisms for integrating multiple reference images are suboptimal and can confuse one subject with another. To overcome these limitations, we propose a dedicated data construction pipeline, incorporating low-quality sample filtering and diverse data synthesis, to produce consistency-preserving training data. Moreover, we introduce Reference Rotary Positional Encoding (R-RoPE) to process reference images, enabling stable and precise multi-image integration. Extensive experiments across numerous benchmarks demonstrate that Kaleido significantly outperforms previous methods in consistency, fidelity, and generalization, marking an advance in S2V generation.

User Study

Subject-to-video evaluation (left) and user study results comparing Kaleido with VACE, Kling, and Vidu-Q1 (right).

Dataset Construction Pipeline

Multi-stage, scalable S2V data pipeline: video slicing and captioning, subject grounding, quality filtering, background disentanglement, and pose-motion augmentation.
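The stage list in the caption can be read as a chain of functions applied to a pool of samples. The sketch below is only an illustration of that reading, not the authors' pipeline: the stage names, the `quality` field, the 0.5 threshold, and the strictly sequential order are all assumptions, and every stage except the filter is a placeholder that merely records that it ran.

```python
from typing import Any, Callable, Dict, List

Sample = Dict[str, Any]
Stage = Callable[[List[Sample]], List[Sample]]


def tag_stage(name: str) -> Stage:
    """Placeholder stage (hypothetical): records its name, does no real work."""
    def stage(samples: List[Sample]) -> List[Sample]:
        return [{**s, "history": list(s.get("history", [])) + [name]}
                for s in samples]
    return stage


def quality_filter(min_score: float) -> Stage:
    """Drop samples whose assumed 'quality' score is below min_score."""
    def stage(samples: List[Sample]) -> List[Sample]:
        kept = [s for s in samples if s["quality"] >= min_score]
        return tag_stage("quality_filtering")(kept)
    return stage


def build_pipeline() -> List[Stage]:
    """Stages in the order the caption lists them (order is an assumption)."""
    return [
        tag_stage("slicing_captioning"),
        tag_stage("subject_grounding"),
        quality_filter(0.5),  # threshold chosen only for illustration
        tag_stage("background_disentanglement"),
        tag_stage("pose_motion_augmentation"),
    ]


def run_pipeline(samples: List[Sample], stages: List[Stage]) -> List[Sample]:
    """Apply each stage to the output of the previous one."""
    for stage in stages:
        samples = stage(samples)
    return samples
```

Calling `run_pipeline(samples, build_pipeline())` on a list of dictionaries that each carry a `quality` value returns only the samples that pass the filter, each annotated with the stages it went through.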

Framework

Illustration of our subject-to-video framework. (a) Multiple reference images are injected for guided video generation. (b) Video tokens use 3D RoPE positional encoding, while (c) reference images utilize R-RoPE for distinct spatial-temporal positioning.
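The idea in the caption, that reference tokens receive positions distinct from the video's 3D RoPE grid, can be illustrated with a small sketch. This is not Kaleido's implementation: the source does not state how R-RoPE assigns indices, so the offset rule below (each reference image gets its own temporal index after the last frame, with spatial coordinates shifted past the video grid), the even three-way channel split, and all function names are assumptions made for illustration. The rotation itself is the standard RoPE pairwise rotation.

```python
import math
from typing import List, Sequence, Tuple

Position = Tuple[int, int, int]  # (t, h, w) index of one token


def video_positions(frames: int, height: int, width: int) -> List[Position]:
    """Regular 3D grid positions for video tokens."""
    return [(t, h, w)
            for t in range(frames)
            for h in range(height)
            for w in range(width)]


def reference_positions(ref_index: int, ref_height: int, ref_width: int,
                        frames: int, height: int, width: int) -> List[Position]:
    """ASSUMED offset rule (not from the paper): reference k sits at temporal
    index frames + k, with spatial indices shifted beyond the video grid."""
    t = frames + ref_index
    return [(t, height + h, width + w)
            for h in range(ref_height)
            for w in range(ref_width)]


def rope_rotate_1d(vec: Sequence[float], pos: int,
                   base: float = 10000.0) -> List[float]:
    """Standard RoPE: rotate consecutive channel pairs by pos-scaled angles."""
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("channel count must be even")
    out: List[float] = []
    for i in range(dim // 2):
        angle = pos * base ** (-2.0 * i / dim)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(angle) - y * math.sin(angle))
        out.append(x * math.sin(angle) + y * math.cos(angle))
    return out


def rope_rotate_3d(vec: Sequence[float], pos: Position) -> List[float]:
    """Split channels into three equal chunks (an assumption) and rotate
    each chunk by the t, h, and w index respectively."""
    if len(vec) % 6 != 0:
        raise ValueError("channel count must be divisible by 6")
    chunk = len(vec) // 3
    out: List[float] = []
    for axis in range(3):
        out.extend(rope_rotate_1d(vec[axis * chunk:(axis + 1) * chunk],
                                  pos[axis]))
    return out
```

Under this assumed rule, no reference token shares a position with a video token or with a token from another reference image, which is one simple way to keep multiple subjects from being confused in attention. Because RoPE attention scores depend only on position differences, the chosen offsets determine how "far" each reference appears from the video frames.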

BibTeX

@article{DBLP:journals/corr/abs-2510-18573,
  author       = {Zhenxing Zhang and
                  Jiayan Teng and
                  Zhuoyi Yang and
                  Tiankun Cao and
                  Cheng Wang and
                  Xiaotao Gu and
                  Jie Tang and
                  Dan Guo and
                  Meng Wang},
  title        = {Kaleido: Open-Sourced Multi-Subject Reference Video Generation Model},
  journal      = {CoRR},
  volume       = {abs/2510.18573},
  year         = {2025},
  url          = {https://doi.org/10.48550/arXiv.2510.18573},
  doi          = {10.48550/ARXIV.2510.18573},
  eprinttype   = {arXiv},
  eprint       = {2510.18573},
  timestamp    = {Sat, 15 Nov 2025 15:31:50 +0100},
  biburl       = {https://dblp.org/rec/journals/corr/abs-2510-18573.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}