Index

[Visual Generation Pretrain]
Towards Scalable Pre-training of Visual Tokenizers for Generation

[Visual Pretrain]
Next-Embedding Prediction Makes Strong Vision Learners
Demystifying CLIP Data

[LLM]
[Base]
Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling
[Verification && Self-Critique]
Enhancing LLM Planning Capabilities through Intrinsic Self-Critique

[Video Action]
Modeling Video Evolution For Action Recognition
Convolutional Two-Stream Network Fusion for Video Action Recognition

Written on February 3, 2026