Project repository: https://github.com/InterviewReady/ai-engineering-resources
Tokenization
- Byte-pair Encoding https://arxiv.org/pdf/1508.07909
- Byte Latent Transformer: Patches Scale Better Than Tokens https://arxiv.org/pdf/2412.09871
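For orientation, here is a minimal Python sketch of the BPE merge-learning loop described in the Sennrich et al. paper, on toy data; it is not the reference implementation, and word-end markers plus real corpus frequencies are omitted.

```python
# Minimal byte-pair encoding sketch: repeatedly merge the most frequent adjacent symbol pair.
from collections import Counter

def learn_bpe_merges(words, num_merges=10):
    # Each word starts as a tuple of characters; merges grow longer subword symbols.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_vocab = Counter()
        for word, freq in vocab.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])  # apply the learned merge
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            merged_vocab[tuple(out)] += freq
        vocab = merged_vocab
    return merges

print(learn_bpe_merges(["lower", "lowest", "newer", "wider"], num_merges=5))
```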
Vectorization
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding https://arxiv.org/pdf/1810.04805
- IMAGEBIND: One Embedding Space To Bind Them All https://arxiv.org/pdf/2305.05665
- SONAR: Sentence-Level Multimodal and Language-Agnostic Representations https://arxiv.org/pdf/2308.11466
- FAISS library https://arxiv.org/pdf/2401.08281
- Facebook Large Concept Models https://arxiv.org/pdf/2412.08821v2
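A minimal sketch of the embed-then-search pattern these papers build on, assuming the `sentence-transformers` and `faiss-cpu` packages are installed; the encoder name below is just an example BERT-style model, not one the papers prescribe.

```python
import faiss
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example BERT-style sentence encoder
docs = [
    "BPE merges frequent symbol pairs.",
    "FAISS indexes dense vectors for similarity search.",
    "ImageBind aligns several modalities in one embedding space.",
]
doc_vecs = model.encode(docs, normalize_embeddings=True)

index = faiss.IndexFlatIP(doc_vecs.shape[1])  # inner product == cosine on normalized vectors
index.add(doc_vecs)

query_vec = model.encode(["vector database for nearest neighbours"], normalize_embeddings=True)
scores, ids = index.search(query_vec, 2)
print([docs[i] for i in ids[0]], scores[0])
```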
Infrastructure
- TensorFlow https://arxiv.org/pdf/1605.08695
- DeepSeek 3FS filesystem https://github.com/deepseek-ai/3FS/blob/main/docs/design_notes.md
- Milvus DB https://www.cs.purdue.edu/homes/csjgwang/pubs/SIGMOD21_Milvus.pdf
- Billion-Scale Similarity Search: FAISS https://arxiv.org/pdf/1702.08734
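A small sketch of the IVF-PQ index family from the billion-scale FAISS paper, on synthetic vectors; real deployments train on a representative sample of the corpus and shard the index, and all numbers here are illustrative.

```python
import numpy as np
import faiss

d, nb, nq = 64, 100_000, 5
rng = np.random.default_rng(0)
xb = rng.standard_normal((nb, d), dtype=np.float32)   # database vectors
xq = rng.standard_normal((nq, d), dtype=np.float32)   # query vectors

nlist, m = 1024, 8                                    # coarse clusters, PQ sub-quantizers
quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, 8)   # 8 bits per sub-vector code
index.train(xb)                                       # learn coarse centroids + PQ codebooks
index.add(xb)
index.nprobe = 16                                     # clusters probed per query (recall/speed knob)
distances, ids = index.search(xq, 10)
print(ids[0])
```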
Core Architecture
- Attention is All You Need https://papers.neurips.cc/paper/7181-attention-is-all-you-need.pdf
- FlashAttention https://arxiv.org/pdf/2205.14135
- Multi-Query Attention https://arxiv.org/pdf/1911.02150
- Grouped-Query Attention https://arxiv.org/pdf/2305.13245
- Google Titans outperform Transformers https://arxiv.org/pdf/2501.00663
- VideoRoPE: Rotary Position Embedding https://arxiv.org/pdf/2502.05173
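A numpy sketch of the scaled dot-product attention these papers start from; FlashAttention, MQA, and GQA change how this is computed and cached, not the underlying equation. Shapes are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)   # (batch, q_len, k_len)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)          # causal/padding mask
    weights = softmax(scores, axis=-1)
    return weights @ V                                 # (batch, q_len, d_v)

Q = np.random.randn(2, 4, 8)   # batch of 2, 4 query positions, head dim 8
K = np.random.randn(2, 6, 8)   # 6 key/value positions
V = np.random.randn(2, 6, 8)
print(scaled_dot_product_attention(Q, K, V).shape)     # (2, 4, 8)
```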
Mixture of Experts (MoE)
- Sparsely-Gated Mixture-of-Experts Layer https://arxiv.org/pdf/1701.06538
- Switch Transformers https://arxiv.org/abs/2101.03961
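A toy numpy sketch of top-k sparse gating in the spirit of the Sparsely-Gated MoE and Switch Transformer papers; the load-balancing loss and capacity limits are omitted, and the "experts" here are plain linear maps.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 4, 2
W_gate = rng.standard_normal((d_model, n_experts))
experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]  # toy expert weights

def moe_layer(x):                                   # x: (tokens, d_model)
    logits = x @ W_gate                             # router scores per expert
    top = np.argsort(-logits, axis=-1)[:, :top_k]   # pick top-k experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = top[t]
        gates = softmax(logits[t, chosen])          # renormalize over the chosen experts
        for g, e in zip(gates, chosen):
            out[t] += g * (x[t] @ experts[e])       # only top-k experts run per token
    return out

print(moe_layer(rng.standard_normal((3, d_model))).shape)   # (3, 16)
```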
RLHF (Reinforcement Learning from Human Feedback)
- Deep Reinforcement Learning with Human Feedback https://arxiv.org/pdf/1706.03741
- Fine-Tuning Language Models with RLHF https://arxiv.org/pdf/1909.08593
- Training Language Models with RLHF https://arxiv.org/pdf/2203.02155
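A toy numpy sketch of the pairwise preference (Bradley-Terry) loss commonly used to train the reward model in these RLHF pipelines; in practice the scores come from a reward-model head over (prompt, response) pairs rather than hard-coded arrays.

```python
import numpy as np

def reward_model_loss(r_chosen, r_rejected):
    # -log sigmoid(r_chosen - r_rejected), averaged over comparison pairs
    diff = r_chosen - r_rejected
    return float(np.mean(np.log1p(np.exp(-diff))))

r_chosen = np.array([1.2, 0.3, 2.0])     # reward scores for human-preferred responses
r_rejected = np.array([0.1, 0.5, 1.1])   # reward scores for rejected responses
print(reward_model_loss(r_chosen, r_rejected))
```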
Chain of Thought
- Chain-of-Thought Prompting Elicits Reasoning in Large Language Models https://arxiv.org/pdf/2201.11903
- Chain of thought https://arxiv.org/pdf/2411.14405v1/
- Demystifying Long Chain-of-Thought Reasoning in LLMs https://arxiv.org/pdf/2502.03373
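A minimal sketch of few-shot chain-of-thought prompting as in the Wei et al. paper: the exemplar includes intermediate reasoning, nudging the model to emit its own steps before the final answer. `call_llm` is a hypothetical stand-in for whatever completion API you use.

```python
# One exemplar with worked-out reasoning (the tennis-ball example from the CoT paper).
COT_EXEMPLAR = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. How many balls does he have?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 6 balls. 5 + 6 = 11. The answer is 11.\n"
)

def build_cot_prompt(question: str) -> str:
    return f"{COT_EXEMPLAR}\nQ: {question}\nA: Let's think step by step."

prompt = build_cot_prompt("A cafeteria had 23 apples, used 20 and bought 6 more. How many now?")
print(prompt)
# answer = call_llm(prompt)   # hypothetical model call
```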
Reasoning
- Transformer Reasoning Capabilities https://arxiv.org/pdf/2405.18512
- Large Language Monkeys: Scaling Inference Compute with Repeated Sampling https://arxiv.org/pdf/2407.21787
- Scaling test-time compute is better than scaling model parameters https://arxiv.org/pdf/2408.03314
- Training Large Language Models to Reason in a Continuous Latent Space https://arxiv.org/pdf/2412.06769
- DeepSeek R1 https://arxiv.org/pdf/2501.12948v1
- A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods https://arxiv.org/pdf/2502.01618
- Latent Reasoning: A Recurrent Depth Approach https://arxiv.org/pdf/2502.05171
- Syntactic and Semantic Control of Large Language Models via Sequential Monte Carlo https://arxiv.org/pdf/2504.13139
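The simplest form of the inference-time scaling studied above is repeated sampling with majority voting; a toy sketch, where `sample_answer` is a hypothetical stand-in for one stochastic model call that returns a final answer string.

```python
from collections import Counter
import random

def sample_answer(question: str) -> str:
    # Placeholder: a real implementation samples the LLM with temperature > 0.
    return random.choice(["11", "11", "12"])

def answer_by_voting(question: str, n_samples: int = 16) -> str:
    votes = Counter(sample_answer(question) for _ in range(n_samples))
    return votes.most_common(1)[0][0]   # keep the most frequent final answer

print(answer_by_voting("Roger has 5 balls and buys 2 cans of 3. Total?"))
```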
Optimizations
- The Era of 1-bit LLMs: All Large Language Models are in 1.58 Bits https://arxiv.org/pdf/2402.17764
- FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision https://arxiv.org/pdf/2407.08608
- ByteDance 1.58 https://arxiv.org/pdf/2412.18653v1
- Transformer Squared https://arxiv.org/pdf/2501.06252
- Inference-Time Scaling for Diffusion Models beyond Scaling Denoising Steps https://arxiv.org/pdf/2501.09732
- 1B LLM outperforms 405B LLM with test-time scaling https://arxiv.org/pdf/2502.06703
- Speculative Decoding https://arxiv.org/pdf/2211.17192
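A toy sketch of the accept/reject loop at the core of speculative decoding: a cheap draft model proposes tokens, the target model scores them, and each drafted token is accepted with probability min(1, p_target/p_draft). The distributions below are made up and fixed, whereas in practice they are computed per position.

```python
import numpy as np

rng = np.random.default_rng(0)

def speculative_step(p_draft, p_target, drafted_tokens):
    accepted = []
    for t in drafted_tokens:
        if rng.random() < min(1.0, p_target[t] / p_draft[t]):
            accepted.append(t)              # target model agrees enough: keep the drafted token
        else:
            residual = np.maximum(p_target - p_draft, 0)
            residual /= residual.sum()      # resample from the corrected distribution and stop
            accepted.append(int(rng.choice(len(p_target), p=residual)))
            break
    return accepted

p_draft = np.array([0.1, 0.6, 0.3])     # toy draft-model distribution over a 3-token vocab
p_target = np.array([0.2, 0.5, 0.3])    # toy target-model distribution
print(speculative_step(p_draft, p_target, drafted_tokens=[1, 1, 2]))
```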
Distillation
- Distilling the Knowledge in a Neural Network https://arxiv.org/pdf/1503.02531
- BYOL - Distilled Architecture https://arxiv.org/pdf/2006.07733
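A numpy sketch of the soft-target loss from Hinton et al.: the student matches the teacher's temperature-softened distribution; in practice this KL term is combined with the ordinary cross-entropy on the hard labels.

```python
import numpy as np

def softmax(x, T=1.0):
    z = x / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    p_teacher = softmax(teacher_logits, T)
    log_p_student = np.log(softmax(student_logits, T))
    # KL(teacher || student), scaled by T^2 so gradients match the hard-label term's scale
    return float(np.sum(p_teacher * (np.log(p_teacher) - log_p_student)) * T * T)

teacher = np.array([4.0, 1.0, -2.0])   # teacher logits for one example
student = np.array([2.5, 1.5, -1.0])   # student logits for the same example
print(distillation_loss(student, teacher))
```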
SSMs (State Space Models)
- RWKV: Reinventing RNNs for the Transformer Era https://arxiv.org/pdf/2305.13048
- Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality https://arxiv.org/pdf/2405.21060
- Distilling Transformers to SSMs https://arxiv.org/pdf/2408.10189
- LoLCATs: On Low-Rank Linearizing of Large Language Models https://arxiv.org/pdf/2410.10254
- Think Slow, Fast https://arxiv.org/pdf/2502.20339
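A toy numpy sketch of the linear recurrence that SSM layers (and RWKV-style RNNs) are built on: h_t = A h_{t-1} + B x_t, y_t = C h_t, giving constant per-token inference cost instead of attention's growing KV cache. Real models use learned, discretized, and often input-dependent parameters; the diagonal A here is just a stable toy choice.

```python
import numpy as np

rng = np.random.default_rng(0)
d_state, d_in = 8, 4
A = np.diag(rng.uniform(0.5, 0.95, d_state))   # stable diagonal state transition (toy choice)
B = rng.standard_normal((d_state, d_in))
C = rng.standard_normal((d_in, d_state))

def ssm_scan(xs):                  # xs: (seq_len, d_in)
    h = np.zeros(d_state)
    ys = []
    for x in xs:
        h = A @ h + B @ x          # recurrent state update
        ys.append(C @ h)           # per-token readout
    return np.stack(ys)

print(ssm_scan(rng.standard_normal((6, d_in))).shape)   # (6, 4)
```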
Competition Models
- Google Math Olympiad 2 https://arxiv.org/pdf/2502.03544
- Competitive Programming with Large Reasoning Models https://arxiv.org/pdf/2502.06807
- Google Math Olympiad 1 https://www.nature.com/articles/s41586-023-06747-5
Hype Makers
- Can AI be made to think critically https://arxiv.org/pdf/2501.04682
- Evolving Deeper LLM Thinking https://arxiv.org/pdf/2501.09891
- LLMs Can Easily Learn to Reason from Demonstrations Structure https://arxiv.org/pdf/2502.07374
Hype Breakers
- Separating communication from intelligence https://arxiv.org/pdf/2301.06627
- Language is not intelligence https://gwern.net/doc/psychology/linguistics/2024-fedorenko.pdf
Image Transformers
- An Image is Worth 16x16 Words https://arxiv.org/pdf/2010.11929
- DeepSeek image generation https://arxiv.org/pdf/2501.17811
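A numpy sketch of the "image as 16x16 words" idea from the ViT paper: cut the image into patches, flatten each patch, and linearly project it to the model dimension to obtain a token sequence. Position embeddings and the class token are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)
img = rng.standard_normal((224, 224, 3))     # H x W x C input image
patch, d_model = 16, 64
W_proj = rng.standard_normal((patch * patch * 3, d_model))

patches = (img.reshape(224 // patch, patch, 224 // patch, patch, 3)
              .transpose(0, 2, 1, 3, 4)
              .reshape(-1, patch * patch * 3))   # (196, 768) flattened 16x16 patches
tokens = patches @ W_proj                        # (196, 64) patch embeddings fed to the Transformer
print(tokens.shape)
```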
Video Transformers
- ViViT: A Video Vision Transformer https://arxiv.org/pdf/2103.15691
- Joint Embedding abstractions with self-supervised video masks https://arxiv.org/pdf/2404.08471
- Facebook VideoJAM AI video generation https://arxiv.org/pdf/2502.02492
Case Studies
- Automated Unit Test Improvement using Large Language Models at Meta https://arxiv.org/pdf/2402.09171
- Retrieval-Augmented Generation with Knowledge Graphs for Customer Service Question Answering https://arxiv.org/pdf/2404.17723v1
- OpenAI o1 System Card https://arxiv.org/pdf/2412.16720
- LLM-powered bug catchers https://arxiv.org/pdf/2501.12862
- Chain-of-Retrieval Augmented Generation https://arxiv.org/pdf/2501.14342
- Swiggy Search https://bytes.swiggy.com/improving-search-relevance-in-hyperlocal-food-delivery-using-small-language-models-ecda2acc24e6
- Swarm by OpenAI https://github.com/openai/swarm
- Netflix Foundation Models https://netflixtechblog.com/foundation-model-for-personalized-recommendation-1a0bd8e02d39
- Model Context Protocol https://www.anthropic.com/news/model-context-protocol
- Uber QueryGPT https://www.uber.com/en-IN/blog/query-gpt/