Unified In-Context Framework
UNIC unifies video editing by representing all inputs (noisy video latents, reference video tokens, and multi-modal condition tokens) as a single combined sequence. The native attention of a Diffusion Transformer (DiT) can then learn diverse editing tasks "in-context," which keeps the design flexible and simple.
- Unified model for diverse tasks.
- Input tokens are grouped into three types: noisy video latents, reference video tokens, and multi-modal condition tokens.
- No task-specific adapter modules.
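
The combined-sequence idea can be illustrated with a toy sketch. This is not UNIC's implementation: the function names (`concat_tokens`, `self_attention`, `make_tokens`), the token counts, and the identity projections are all assumptions made for illustration, and a real DiT would use learned projections, multiple heads, and positional information. The sketch only shows that once the three token types share one sequence, plain self-attention lets every noisy latent attend to reference and condition tokens with no adapter module.

```python
import math
import random

def concat_tokens(noisy_latents, reference_tokens, condition_tokens):
    """Join the three token groups into one sequence and record each group's span."""
    sequence = noisy_latents + reference_tokens + condition_tokens
    spans = {}
    start = 0
    for name, group in (("noisy", noisy_latents),
                        ("reference", reference_tokens),
                        ("condition", condition_tokens)):
        spans[name] = (start, start + len(group))
        start += len(group)
    return sequence, spans

def self_attention(sequence):
    """Single-head softmax self-attention with identity projections (toy assumption)."""
    dim = len(sequence[0])
    scale = 1.0 / math.sqrt(dim)
    output = []
    for query in sequence:
        # Every token scores against every other token, regardless of its type.
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in sequence]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        output.append([sum(w * value[d] for w, value in zip(weights, sequence))
                       for d in range(dim)])
    return output

def make_tokens(count, dim, rng):
    """Random stand-ins for real latents or embeddings (hypothetical helper)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]

rng = random.Random(0)
dim = 4
noisy = make_tokens(6, dim, rng)      # noisy video latents
reference = make_tokens(6, dim, rng)  # reference video tokens
condition = make_tokens(3, dim, rng)  # multi-modal condition tokens

sequence, spans = concat_tokens(noisy, reference, condition)
attended = self_attention(sequence)

# Only the noisy-latent positions would feed the denoising output.
start, end = spans["noisy"]
noisy_out = attended[start:end]
```

In this sketch, changing the editing task changes only what goes into `reference` and `condition`; the attention code stays the same, which is the sense in which no task-specific adapter is needed.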