Every day, countless videos are uploaded and processed online, putting enormous strain on computational resources. The problem isn't just the sheer volume of data; it's how this data is structured. Videos consist of raw pixel data, where neighboring pixels often store nearly identical information. This redundancy wastes resources, making it harder for systems to process visual content effectively and efficiently.

To tackle this, we've developed a new approach to compress visual data into a more compact and manageable form. In our paper "VidTok: A Versatile and Open-Source Video Tokenizer," we introduce a method that converts video data into smaller, structured units, or tokens. This technique gives researchers and developers in visual world modeling (a field dedicated to teaching machines to interpret images and videos) a flexible and efficient tool for advancing their work.

How VidTok works

VidTok converts raw video footage into a format that AI models can easily work with and understand, a process called video tokenization. It turns complex visual information into compact, structured tokens, as shown in Figure 1.

Figure 1. An overview of how video tokenizers work, which form the basis of VidTok.

By simplifying videos into manageable chunks, VidTok can enable AI systems to learn from, analyze, and generate video content more efficiently. VidTok offers several potential advantages over previous solutions:

- Supports both discrete and continuous tokens. Not all AI models use the same language for video generation. Some perform best with continuous tokens, which are ideal for high-quality diffusion models, while others rely on discrete tokens, which are better suited for step-by-step generation, as in language-model-style approaches to video. VidTok supports both seamlessly, making it adaptable across a range of AI applications.

- Operates in both causal and noncausal modes. In some scenarios, video understanding depends solely on past frames (causal), while in others it benefits from access to both past and future frames (noncausal). VidTok accommodates both modes, making it suitable for real-time use cases like robotics and video streaming, as well as for high-quality offline video generation.

- Efficient training with high performance. AI-powered video generation typically requires substantial computational resources. VidTok cuts training costs in half through a two-stage training process while still delivering high performance.

Architecture

The VidTok framework builds on a classic 3D encoder-decoder structure but introduces 2D and 1D processing techniques to handle spatial and temporal information more efficiently. Because fully 3D architectures are computationally intensive, VidTok combines them with less resource-intensive 2D and 1D operations to reduce computational costs while maintaining video quality.

Spatial processing. Rather than treating video frames solely as 3D volumes, VidTok applies 2D convolutions (pattern-recognition operations commonly used in image processing) to handle the spatial information within each frame more efficiently.

Temporal processing. To model motion over time, VidTok introduces the AlphaBlender operator, which blends frames smoothly using a learnable parameter. Combined with 1D convolutions (similar operations applied over sequences), this approach captures temporal dynamics without abrupt transitions.

Figure 2 illustrates VidTok's architecture in detail.

Figure 2. VidTok's architecture. It uses a combination of 2D and 1D operations instead of relying solely on 3D techniques, improving efficiency. For smooth frame transitions, VidTok employs the AlphaBlender operator in its temporal processing modules. This approach strikes a balance between computational speed and high-quality video output.
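To make the design concrete, here is a minimal PyTorch sketch of a block that applies a 2D convolution to each frame, a 1D convolution along the time axis, and an AlphaBlender-style learnable blend of the two paths. The module names, channel handling, and sigmoid parameterization are illustrative assumptions, not VidTok's actual implementation (which is available in the GitHub repository).

```python
import torch
import torch.nn as nn

class AlphaBlender(nn.Module):
    """Blends two feature maps with a single learnable mixing parameter."""
    def __init__(self):
        super().__init__()
        self.mix_factor = nn.Parameter(torch.tensor(0.0))

    def forward(self, a, b):
        alpha = torch.sigmoid(self.mix_factor)  # keep the blend weight in (0, 1)
        return alpha * a + (1.0 - alpha) * b

class SpatioTemporalBlock(nn.Module):
    """2D convolution over each frame, 1D convolution over time, then a blend."""
    def __init__(self, channels):
        super().__init__()
        self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.temporal = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.blend = AlphaBlender()

    def forward(self, x):  # x: (batch, channels, time, height, width)
        b, c, t, h, w = x.shape
        # Spatial path: fold time into the batch dimension, apply 2D conv per frame.
        xs = self.spatial(x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w))
        xs = xs.reshape(b, t, c, h, w).permute(0, 2, 1, 3, 4)
        # Temporal path: fold space into the batch dimension, apply 1D conv over time.
        xt = self.temporal(x.permute(0, 3, 4, 1, 2).reshape(b * h * w, c, t))
        xt = xt.reshape(b, h, w, c, t).permute(0, 3, 4, 1, 2)
        # Blend the two paths with a learnable weight.
        return self.blend(xs, xt)
```

Splitting a 3D operation into a 2D pass and a 1D pass in this way is what keeps the cost closer to 2D image models while still modeling motion across frames.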
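Relatedly, the causal mode discussed above is commonly implemented by padding temporal convolutions on the past side only, so no output ever depends on a future frame. A minimal sketch of that idea, again with illustrative names rather than VidTok's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    """1D temporal convolution that only sees the current and past frames."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.pad = kernel_size - 1  # pad only on the past side
        self.conv = nn.Conv1d(channels, channels, kernel_size)

    def forward(self, x):  # x: (batch, channels, time)
        x = F.pad(x, (self.pad, 0))  # left-pad the time axis; no future leakage
        return self.conv(x)
```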
Quantization

To compress video data efficiently, AI systems often use quantization to reduce the amount of information that needs to be stored or transmitted. A traditional method is vector quantization (VQ), which groups values together and matches them to a fixed set of learned patterns, known as a codebook. However, this can lead to inefficient use of the codebook and lower video quality.

For VidTok, we use an approach called finite scalar quantization (FSQ). Instead of grouping values, FSQ quantizes each value independently. This makes the compression process more flexible and accurate, helping preserve video quality while keeping the representation small. Figure 3 shows the difference between the VQ and FSQ approaches.

Figure 3. VQ (left) relies on learning a codebook, while FSQ (right) simplifies the process by independently rounding each value to a fixed set of levels, making optimization easier. VidTok adopts FSQ to enhance training stability and reconstruction quality.
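To make the contrast concrete, the core FSQ step can be written in a few lines: bound each latent channel, round it to a fixed grid of levels, and pass gradients through the rounding with a straight-through estimator. This is a sketch of the general FSQ technique; the function name and the level configuration are assumptions for illustration, not VidTok's exact settings.

```python
import torch

def fsq_quantize(z, levels):
    """Finite scalar quantization: bound each latent channel, then round it to
    a fixed number of uniformly spaced levels. levels[i] is the number of
    levels for channel i (illustrative values, not VidTok's configuration)."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)  # (channels,)
    half = (L - 1) / 2
    offset = (L % 2 == 0).to(z.dtype) * 0.5   # even level counts need a half-step shift
    shift = torch.atanh(offset / half)
    bounded = torch.tanh(z + shift) * half - offset  # squash into the level range
    quantized = torch.round(bounded)                 # snap to the nearest level
    # Straight-through estimator: quantized values in the forward pass,
    # but gradients flow as if no rounding had happened.
    return bounded + (quantized - bounded).detach()

# Example: 6 latent channels with levels [8, 8, 8, 5, 5, 5] give an implicit
# codebook of 8 * 8 * 8 * 5 * 5 * 5 = 64,000 entries, with nothing to learn.
z = torch.randn(2, 6)
z_q = fsq_quantize(z, [8, 8, 8, 5, 5, 5])
```

Unlike VQ, there is no codebook to learn: the set of representable tokens is implied by the level grid, which is what makes optimization simpler and training more stable.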
Training

Training video tokenizers requires significant computing power. VidTok uses a two-stage process:

1. It first trains the full model on low-resolution videos.
2. It then fine-tunes only the decoder on high-resolution videos.

This approach cuts training costs in half, from 3,072 to 1,536 GPU hours, while maintaining video quality. Older tokenizers, trained on full-resolution videos from the start, were slower and more computationally intensive. Because the encoder is left untouched in the second stage, the model can quickly adapt to new types of videos without affecting its token distribution. Additionally, VidTok trains on lower-frame-rate data to better capture motion, improving how it represents movement in videos.
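In PyTorch terms, the second stage amounts to freezing the encoder so its output distribution cannot drift, and optimizing only the decoder's parameters. The sketch below uses toy stand-in modules and a plain reconstruction loss; it illustrates the recipe, not VidTok's actual training code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for a stage-1 pretrained tokenizer; names and shapes are
# illustrative, not VidTok's actual API.
encoder = nn.Conv2d(3, 4, kernel_size=8, stride=8)           # frames -> compact latents
decoder = nn.ConvTranspose2d(4, 3, kernel_size=8, stride=8)  # latents -> frames

# Stage 2: freeze the encoder so the token distribution stays unchanged,
# and fine-tune only the decoder on high-resolution data.
for p in encoder.parameters():
    p.requires_grad = False
encoder.eval()

optimizer = torch.optim.Adam(decoder.parameters(), lr=1e-4)

high_res_batches = [torch.randn(2, 3, 256, 256) for _ in range(4)]  # dummy data
for frames in high_res_batches:
    with torch.no_grad():
        latents = encoder(frames)       # encoder output is fixed
    recon = decoder(latents)
    loss = F.mse_loss(recon, frames)    # reconstruction term only, for brevity
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Because only the decoder's weights receive gradients, the expensive high-resolution pass updates a fraction of the model, which is where the roughly 50% savings in GPU hours comes from.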
Evaluating VidTok

We evaluated VidTok on the MCL-JCV benchmark, a comprehensive video quality assessment dataset, and on an internal dataset. The results demonstrate its superiority over existing state-of-the-art models in video tokenization. The assessment, which covered approximately 5,000 videos of various types, employed four standard metrics to measure video quality:

- Peak Signal-to-Noise Ratio (PSNR)
- Structural Similarity Index Measure (SSIM)
- Learned Perceptual Image Patch Similarity (LPIPS)
- Fréchet Video Distance (FVD)

The following table and Figure 4 illustrate VidTok's performance:

Table 1

The results indicate that VidTok outperforms existing models in both discrete and continuous tokenization scenarios. This improved performance is achieved even when using a smaller model or a more compact codebook, highlighting VidTok's efficiency.

Figure 4. Quantitative comparison of discrete and continuous tokenization performance in VidTok and state-of-the-art methods, evaluated using four metrics: PSNR, SSIM, LPIPS, and FVD. Larger chart areas indicate better overall performance.

VidTok represents a significant development in video tokenization and processing. Its innovative architecture and training approach enable improved performance across various video quality metrics, making it a valuable tool for video analysis and compression tasks. Its capacity to model complex visual dynamics could improve the efficiency of video systems by enabling AI to process compact token representations rather than raw pixels.

VidTok serves as a promising foundation for further research in video processing and representation. The code for VidTok is available on GitHub, and we invite the research community to build on this work and help advance the broader field of video modeling and generation.