Timeline of transformers
This is a timeline of transformers.
Big picture
| Time period | Development summary | Key challenges and transitions |
|---|---|---|
| 1990–2016: The RNN era and the road to attention | Sequence modelling is dominated by recurrent neural networks (RNNs), with LSTM becoming the standard architecture for language tasks after 1997. The core problem of the era is the vanishing gradient: RNNs struggle to propagate information across long sequences and cannot be parallelized, limiting both the quality and scale of what can be trained. The seq2seq encoder–decoder paradigm emerges in 2014 as the dominant approach to machine translation, and the attention mechanism is introduced the same year to solve the fixed-size vector bottleneck that limited seq2seq quality. By 2016, attention-based models are achieving state-of-the-art results with far fewer parameters than LSTMs, setting up the hypothesis that recurrence could be removed entirely. | The central unsolved problem is parallelization: RNNs process tokens sequentially and cannot exploit GPU hardware effectively, placing a practical ceiling on model scale. The development of attention mechanisms by Bahdanau et al. (2014) and Luong et al. (2015) provides the key missing ingredient. The decomposable attention paper (2016) demonstrates that attention alone — without recurrence — can match LSTMs on NLP tasks, directly motivating the transformer. |
| 2017–2019: The transformer is introduced and the pretraining paradigm is established | The transformer architecture is introduced in June 2017, replacing recurrence with multi-head self-attention and enabling full parallelization over token sequences. Within a year, the pretraining-then-finetuning paradigm is established by GPT-1 (2018) and BERT (2018), demonstrating that a single large model pretrained on unlabeled text can be fine-tuned to achieve state-of-the-art results across a wide range of NLP tasks. GPT-2 (2019) demonstrates zero-shot generalization, RoBERTa (2019) shows that training procedure matters as much as architecture, and T5 (2019) unifies all NLP tasks into a single text-to-text framework. By the end of 2019, the transformer has displaced all prior architectures as the dominant approach to NLP. | The transformer's quadratic attention complexity with sequence length emerges as the next fundamental bottleneck. Scaling models beyond a single GPU's memory requires new parallelism techniques, addressed by Megatron-LM (2019). The question of how much capability can be obtained by scaling alone — without architectural changes — begins to be asked. |
| 2020–2021: Scaling, generalization, and the expansion beyond language | GPT-3 (2020) demonstrates that scaling alone produces striking few-shot and zero-shot capabilities, establishing the scaling hypothesis as the dominant research direction. The transformer expands decisively beyond NLP: the Vision Transformer (2020) applies transformers to image recognition, DALL-E (2021) extends the architecture to image generation, AlphaFold 2 (2021) applies a transformer variant to protein structure prediction with Nobel Prize-winning results, and the Decision Transformer (2021) extends transformers to reinforcement learning. Efficient attention research (Longformer, Reformer, Sparse Transformers) addresses the quadratic scaling bottleneck. RoPE (2021) introduces a positional encoding that would later become the field's dominant standard. | Transformer training at scale requires enormous GPU clusters and specialized infrastructure, concentrating frontier development at a small number of well-resourced organizations. The field begins to grapple with questions of access, safety, and the environmental cost of large model training. The gap between closed commercial models and publicly available models begins to widen. |
| 2022–2023: The public breakthrough and the open-weight ecosystem | ChatGPT (November 2022) brings transformer-based AI to global public attention for the first time, reaching 100 million users in two months and triggering a wave of investment and competitive development. GPT-4 (2023) achieves human-level performance on professional benchmarks. FlashAttention (2022) and Whisper (2022) become standard infrastructure components. The open-weight ecosystem emerges as a counterweight to closed commercial models: LLaMA (February 2023) and LLaMA 2 (July 2023) from Meta AI, and Mistral 7B (September 2023) from Mistral AI, demonstrate that carefully trained smaller models can match or approach much larger proprietary ones. Multimodal transformers become standard, with GPT-4V (2023) and Gemini (2023) integrating vision and language in widely deployed systems. | The concentration of frontier capability in a small number of closed systems raises concerns about access, competition, and safety. The EU AI Act, US executive orders on AI, and other regulatory responses are accelerated by the public impact of ChatGPT and GPT-4. The open-weight movement creates a bifurcated ecosystem of closed frontier models and open community models, with ongoing tension between the two. |
| 2024–present: Reasoning models, multimodality, and geopolitical disruption | Inference-time scaling emerges as a new axis of capability improvement alongside training-time scaling: OpenAI's o1 (September 2024) demonstrates that allocating more compute at inference time for chain-of-thought reasoning substantially improves performance on complex tasks. Natively multimodal transformers become standard with GPT-4o (2024) and Gemini 2.0, processing text, audio, and images in unified architectures. DeepSeek-V3 and DeepSeek-R1 (late 2024/early 2025) demonstrate that frontier-class transformer training is achievable at a small fraction of previously assumed costs, challenging assumptions about the economics and geopolitics of AI development and establishing China as a peer competitor in frontier transformer research. The open-weight ecosystem matures with Llama 3 (2024) and its derivatives, making frontier-class models freely available for the first time. | The field faces fundamental questions about the limits of both training-time and inference-time scaling, the sustainability of the current pace of capability improvement, and the geopolitical implications of transformer development becoming a global rather than predominantly American enterprise. Safety and alignment research struggles to keep pace with capability gains. |
Full timeline
| Year (month and date) | Event type | Details |
|---|---|---|
| 1990 | Prelude | The Elman network, a type of recurrent neural network (RNN), becomes a well-cited early example of sequence modelling. In theory, information from one token can propagate arbitrarily far down the sequence, but in practice the vanishing-gradient problem leaves the model's state at the end of a long sentence without precise, extractable information about preceding tokens. Addressing this limitation would motivate subsequent work on gating mechanisms and, eventually, the transformer architecture. |
| 1997 (November) | Prelude | Hochreiter and Schmidhuber publish long short-term memory (LSTM), a type of recurrent neural network (RNN) that introduces gating mechanisms to mitigate the vanishing-gradient problem, allowing efficient learning over long sequences. A key architectural element is the use of multiplicative gating units, in which the outputs of some neurons modulate the outputs of others. LSTM becomes the standard architecture for long-sequence modelling and remains so until the publication of the transformer in 2017. Like all RNNs, however, LSTM still processes tokens sequentially, one at a time from first to last, and cannot operate in parallel over all tokens in a sequence, a limitation that the transformer would later overcome.[1] |
| 2014 (September–October) | Prelude | Two concurrently published papers introduce the sequence-to-sequence (seq2seq) encoder–decoder architecture for machine translation, commonly cited as the originators of the paradigm. The first, by Sutskever, Vinyals, and Le at Google Brain, uses two long short-term memory (LSTM) networks with 380 million parameters: an encoder LSTM that converts an input sequence of tokens into a fixed-size vector, and a decoder LSTM that converts that vector into an output sequence.[2] The second, by Cho, van Merriënboer, Bahdanau, and Bengio at Université de Montréal, uses gated recurrent units (GRUs) instead of LSTMs, with 130 million parameters.[3] Both models suffer from a fundamental bottleneck: without an attention mechanism, the entire input sequence must be compressed into a single fixed-size vector before decoding begins, degrading output quality for long inputs. Addressing this bottleneck motivates the attention mechanism introduced later the same year. |
| 2014 (September) | Prelude | Bahdanau, Cho, and Bengio introduce an attention mechanism for seq2seq models, solving the fixed-size vector bottleneck that degraded translation quality for long inputs. Dubbed the "RNN search" model — because it emulates searching through a source sentence during decoding — the mechanism allows the decoder to dynamically attend to different parts of the input sequence at each decoding step, rather than relying on a single compressed vector. This is the first additive attention mechanism, conceptually distinct from the multiplicative gating units used in LSTM. The paper is later recognized as one of the foundational precursors to the transformer: one of the co-authors of the 2017 "Attention Is All You Need" paper, Jakob Uszkoreit, would cite attention-based seq2seq work as the direct motivation for asking whether recurrence could be removed entirely.[4] |
| 2015 (August) | Prelude | Luong, Pham, and Manning at Stanford compare global and local attention architectures for machine translation, building on the additive attention mechanism introduced by Bahdanau et al. the previous year. Their "global" attention model attends to all source tokens at each decoding step, while their "local" attention model uses a sliding window over the source, reducing computation. They find that local attention reduces translation time while mixed attention yields higher quality than global attention alone. The paper establishes attention as a modular, composable component of seq2seq architectures, and the dot-product formulation it introduces would directly influence the scaled dot-product attention used in the 2017 transformer.[5] |
| 2016 (September) | Prelude | Google revamps Google Translate to use Google Neural Machine Translation, replacing the previous model based on statistical machine translation. The new model is a seq2seq model where both the encoder and decoder are 8-layer bidirectional LSTMs. It takes nine months to develop, yet outperforms the statistical approach that took ten years to build, demonstrating the rapid progress enabled by neural sequence modelling. The system would itself be superseded in 2020 when Google Translate upgrades to a transformer-based encoder.[6][7] |
| 2016 (September) | Prelude | Parikh, Täckström, Das, and Uszkoreit publish "A Decomposable Attention Model for Natural Language Inference", applying a self-attention mechanism to feedforward networks — which are easy to parallelize — and achieving state-of-the-art results on natural language inference with an order of magnitude fewer parameters than LSTMs. Uszkoreit suspects from this result that attention without recurrence would be sufficient for language translation — the hypothesis that would become the title "Attention Is All You Need" the following year. His father, Hans Uszkoreit, a well-known computational linguist, is skeptical of the idea at the time, illustrating how the transformer's core premise ran against conventional wisdom in the field.[8][9] |
| 2017 (June 12) | Model launch | Researchers at Google publish "Attention Is All You Need", introducing the original transformer architecture. The model is an encoder–decoder with 100 million parameters, motivated by improving seq2seq for machine translation by removing recurrence entirely and processing all tokens in parallel via a multi-head attention mechanism. Unlike prior RNN-based models, the transformer's lack of recurrence makes it straightforwardly parallelizable on GPUs, enabling training at previously impractical scales — a property that would prove critical to the scaling of large language models in subsequent years. As early as spring 2017, even before the preprint is published, one of the co-authors applies the decoder-only variant of the architecture to generate fictitious Wikipedia articles, foreshadowing the decoder-only designs that would later dominate the field. Four days after publication, most of the same authors publish MultiModel, a multimodal transformer architecture, demonstrating the generality of the design beyond text.[10][11][12] |
| 2018 (June 11) | Model launch | OpenAI releases "Improving Language Understanding by Generative Pre-Training", introducing the Generative Pre-trained Transformer (GPT-1), a decoder-only transformer with 117 million parameters trained by unsupervised language modelling on the BooksCorpus dataset, then fine-tuned on specific downstream tasks. The paper is authored by Radford, Narasimhan, Salimans, and Sutskever. GPT-1 establishes the pretraining-then-finetuning paradigm for NLP that BERT would extend the same year and that would become the dominant framework for large language model development. Starting with GPT-1, the OpenAI GPT series of decoder-only transformers becomes state of the art in natural language generation, a line of work that would culminate in GPT-3 (2020) and eventually ChatGPT (2022).[13] |
| 2018 (October) | Model launch | Google researchers Devlin, Chang, Lee, and Toutanova publish "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", introducing BERT (Bidirectional Encoder Representations from Transformers), an encoder-only transformer pretrained on English Wikipedia (12 GB) and BookCorpus (4 GB) using two self-supervised tasks: masked token prediction, in which random tokens are hidden and the model must recover them from context, and next-sentence prediction. Unlike the decoder-only GPT-1, BERT processes context from both left and right simultaneously, making it better suited for representation learning and classification tasks than for text generation. In its large configuration BERT has 340 million parameters. BERT achieves state-of-the-art results across a wide range of NLP benchmarks upon release and is open-sourced by Google. In October 2019, Google begins using BERT to process English search queries, and by October 2020 almost every English query is processed by a BERT model, marking one of the first large-scale deployments of a transformer model in a consumer product used by billions of people.[14][15] |
| 2019 (February 14) | Model launch | OpenAI releases Generative Pre-trained Transformer 2 (GPT-2), a decoder-only transformer with 1.5 billion parameters in its largest configuration — roughly 10 times larger than GPT-1. GPT-2 is trained on WebText, a dataset of 40 GB of text scraped from outbound links on Reddit with at least three upvotes, comprising approximately 8 million documents. The model demonstrates strong zero-shot performance across a range of language tasks without task-specific fine-tuning, suggesting that sufficiently large language models begin to acquire general capabilities from pretraining alone. OpenAI initially releases only a smaller version of the model, citing concerns about potential misuse for generating disinformation, making it one of the first high-profile cases in which an AI lab delays a model release on safety grounds — a decision that attracts both praise and criticism from the research community.[16][17] |
| 2019 (July) | Model launch | Facebook AI Research (FAIR) and the University of Washington introduce RoBERTa (Robustly Optimized BERT Approach), a retraining of BERT with improved training methodology rather than a new architecture. The key findings are that BERT was significantly undertrained: training for longer, on more data, with larger batches, removing the next-sentence prediction task, and training on longer sequences all yield consistent improvements. RoBERTa has 125 million parameters in its base configuration and is trained on 161 GB of text — ten times more than BERT — comprising English Wikipedia, BookCorpus, CC-News, OpenWebText, and Stories. RoBERTa achieves state-of-the-art results on several NLP benchmarks, demonstrating that careful attention to training procedure can be as important as architectural innovation.[18] |
| 2019 (August) | Model launch | NVIDIA introduces Megatron-LM, a transformer language model with 8.3 billion parameters — at the time the largest transformer ever trained, 24 times the size of BERT and 5.6 times the size of GPT-2. Megatron-LM is trained on 512 GPUs using 8-way model parallelism and 64-way data parallelism on a 37 GB WebText dataset. The key technical contribution is a model parallelism approach that splits parameters across multiple GPUs without requiring a new compiler or code re-wiring, solving a fundamental obstacle to training transformer models beyond the memory capacity of a single GPU. Megatron-LM surpasses state-of-the-art results on wikitext perplexity and Lambada accuracy. The work establishes model parallelism as a viable path to scaling transformers to hundreds of billions of parameters, a technique that NVIDIA and Microsoft would extend two years later with the 530-billion-parameter Megatron-Turing NLG.[19] |
| 2019 (October) | Model launch | Raffel, Shazeer, Roberts, Lee, Narang, Matena, Zhou, Li, and Liu at Google publish "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer", introducing T5 (Text-to-Text Transfer Transformer), an encoder–decoder transformer that reframes every NLP task — translation, summarization, question answering, classification — as a text-to-text problem, with both inputs and outputs always being strings of text. This unification allows a single model architecture and training objective to be applied across all tasks without task-specific output heads or modifications. T5 is pretrained on the Colossal Clean Crawled Corpus (C4), a 750 GB filtered subset of Common Crawl, and released in multiple sizes ranging from 60 million to 11 billion parameters. The paper conducts one of the most systematic studies of transfer learning design choices published to date, evaluating the effect of pretraining objectives, model architectures, dataset sizes, and fine-tuning procedures across more than 50 NLP benchmarks. T5 achieves state-of-the-art results on many of these benchmarks and the text-to-text framing it establishes becomes widely adopted, influencing the design of subsequent instruction-tuned models including ChatGPT.[20] |
| 2020 (June) | Model launch | Google Translate upgrades from an RNN-encoder–RNN-decoder model to a transformer-encoder–RNN-decoder model. A purely transformer-based decoder does not appear to significantly increase translation quality over the RNN decoder, while the RNN decoder is substantially faster, motivating the hybrid approach. The upgrade demonstrates that the transformer's gains are most pronounced on the encoding side. The system had previously been upgraded in 2016 from statistical machine translation to a purely RNN-based seq2seq model; this second upgrade completes the transition of one of the world's most widely used translation systems to transformer-based encoding.[21] |
| 2020 (June 11) | Model launch | OpenAI releases Generative Pre-trained Transformer 3 (GPT-3) in beta, a decoder-only transformer with 175 billion parameters — more than 100 times larger than GPT-2 — trained on a 570 GB filtered subset of Common Crawl, WebText2, Books1, Books2, and English Wikipedia. GPT-3 demonstrates striking few-shot and zero-shot performance across a wide range of tasks including translation, question answering, and arithmetic, without any gradient updates or fine-tuning at inference time. The results suggest that scaling model size and training data alone, without architectural changes, produces qualitative leaps in capability — a finding that would strongly influence the subsequent direction of large language model research. GPT-3 is accessed via API rather than released as open weights, establishing a pattern of controlled commercial access that several other labs would follow. The model becomes the foundation for ChatGPT two years later.[22] |
| 2020 (October) | Model launch | Dosovitskiy, Beyer, Kolesnikov, and colleagues at Google Brain publish "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale", introducing the Vision Transformer (ViT). The model adapts the standard transformer encoder to computer vision by dividing input images into fixed-size 16×16 pixel patches, linearly embedding each patch into a vector, and treating the resulting sequence of vectors as token embeddings — allowing the transformer's self-attention mechanism to operate over image regions rather than words. When pretrained on sufficiently large datasets, ViT matches or exceeds the performance of state-of-the-art convolutional neural networks (CNNs) on image classification benchmarks while requiring substantially less compute to train. The result challenges the long-held assumption that inductive biases specific to image data — such as translation equivariance built into CNNs — are necessary for strong vision performance. ViT's success in turn stimulates a new wave of CNN architectural improvements, as convolutional network designers incorporate design principles from transformers, and spurs the development of transformer-based approaches across object detection, image segmentation, and video understanding.[23] |
| 2020 | Research | A wave of research addresses the quadratic scaling of transformer attention with sequence length, a fundamental bottleneck preventing transformers from processing long documents efficiently. Beltagy, Peters, and Cohan at the Allen Institute for AI introduce the Longformer, which replaces full all-to-all attention with a combination of local sliding-window attention and global attention on selected tokens, reducing complexity from O(n²) to O(n) and enabling processing of documents up to 4,096 tokens, eight times longer than BERT's 512-token limit. Concurrently, Kitaev, Kaiser, and Levskaya at Google introduce the Reformer, which reduces attention complexity to O(n log n) using locality-sensitive hashing to identify which tokens are most likely to attend to each other, and reversible layers to reduce memory consumption. Both build on Sparse Transformers, introduced in 2019 by Child, Gray, Radford, and Sutskever at OpenAI, which use fixed sparse attention patterns to generate long coherent text and images. Together these papers establish efficient attention as a major research direction, motivating subsequent work including FlashAttention (2022) and the long-context models that would become standard by 2024.[24][25][26] |
| 2021 (January) | Model launch | OpenAI introduces DALL-E, a decoder-only transformer that generates images from text prompts by treating both text and image data as sequences of tokens. The model first autoregressively generates a text token sequence, followed by a token representation of an image produced by a variational autoencoder (VAE), which is then decoded into a pixel-level image. With 12 billion parameters, DALL-E is trained on 250 million text–image pairs scraped from the internet. Unlike later image generation models based on diffusion, DALL-E is a pure autoregressive transformer, demonstrating that the architecture developed for NLP can be extended to image generation by reframing pixels as tokens. The model can combine unrelated concepts in plausible ways — generating images of "an armchair in the shape of an avocado" or "a snail made of harp" — suggesting that large-scale transformer pretraining on paired data produces flexible compositional representations extending beyond language.[27] |
| 2021 (April) | Research | Su, Lu, Pan, Murtadha, Wen, and Liu introduce RoPE (Rotary Position Embedding), a novel positional encoding method for transformers published in the RoFormer paper. Unlike the original sinusoidal absolute positional encoding, which adds a fixed position vector to each token embedding before the attention mechanism, RoPE encodes position by rotating token embeddings in a high-dimensional space by an angle proportional to their position, applied directly within the attention computation. The key mathematical property is that the dot product between two rotated vectors depends only on their relative positions rather than their absolute positions, allowing the model to generalize to sequence lengths longer than those seen during training. RoPE is initially overlooked relative to other positional encoding methods but is later adopted by the Llama series, PaLM, and most subsequent open-weight large language models, becoming the dominant positional encoding in the field by 2024 and displacing the original sinusoidal encoding used in the 2017 transformer.[28] |
| 2021 (June) | Model launch | Chen, Lu, Rajeswaran, Lee, Grover, Laskin, Abbeel, Srinivas, and Mordatch introduce the Decision Transformer, reframing offline reinforcement learning as a sequence modelling problem solvable by a standard transformer. Rather than training a policy by optimizing a reward function directly, the Decision Transformer conditions on a desired future return (the return-to-go), past states, and past actions, and autoregressively predicts the next action. This formulation allows a GPT-style causal transformer to be applied to RL with minimal changes to the architecture, trained with the same autoregressive sequence-modelling objective used for language models. The result demonstrates that transformers are not limited to language and vision but constitute a general-purpose sequence modelling framework applicable across domains, a finding that would motivate subsequent work applying transformers to robotics and other sequential decision-making problems.[29] |
| 2021 (July) | Research (application) | DeepMind publishes AlphaFold 2, a transformer-based system for protein structure prediction that achieves accuracy competitive with experimental methods such as X-ray crystallography and cryo-EM, effectively solving a problem that had challenged structural biologists for fifty years. AlphaFold 2 uses a transformer variant called Evoformer, which operates on pairwise representations of amino acid residues rather than token sequences, applying attention over both sequence and structural dimensions simultaneously. The system is pretrained on the Protein Data Bank and fine-tuned using multiple sequence alignments of related proteins. DeepMind subsequently releases predicted structures for over 200 million proteins, nearly every known protein, through the AlphaFold Protein Structure Database, making the results freely available to the scientific community. AlphaFold 2 is widely regarded as one of the most significant scientific applications of the transformer architecture, demonstrating that the attention mechanism generalizes far beyond its origins in natural language processing to fundamental problems in biology. Demis Hassabis and lead author John Jumper are awarded a share of the 2024 Nobel Prize in Chemistry for this work.[30] |
| 2021 (October) | Model launch | NVIDIA and Microsoft jointly introduce Megatron-Turing NLG 530B (MT-NLG), the successor to Microsoft Turing NLG 17B and NVIDIA Megatron-LM 8.3B. With 530 billion parameters — three times larger than GPT-3 — and trained on over 4,000 GPUs, MT-NLG is at the time the largest transformer model ever trained. Its training dataset includes The Pile v1 plus 14 additional sources such as Books3, OpenWebText2, Stack Exchange, PubMed Abstracts, and GitHub, totalling over 825 GB. The model demonstrates state-of-the-art performance on several NLP benchmarks and establishes that collaborative infrastructure between major technology companies can push the scale of transformer training beyond what either could achieve independently.[31] |
| 2022 (April) | Model launch | OpenAI introduces DALL-E 2, a text-to-image generation system that combines a CLIP text encoder — itself a transformer — with a diffusion model decoder, replacing the autoregressive transformer decoder used in the original DALL-E. CLIP (Contrastive Language–Image Pretraining) is a transformer trained on 400 million text–image pairs to align text and image representations in a shared embedding space; DALL-E 2 uses the CLIP text encoder to convert a text prompt into an embedding, which is then used to condition a diffusion model that generates the image. The result produces substantially higher quality and more photorealistic images than the original DALL-E while being more controllable and supporting image editing operations such as inpainting and outpainting. DALL-E 2 demonstrates that transformers need not be the generative backbone of image synthesis — they can instead serve as powerful semantic encoders conditioning non-transformer generative models — and contributes to a broader shift in the field toward multimodal architectures combining transformers with diffusion models. The release triggers significant public and commercial interest in AI image generation, prompting competitive responses including Stable Diffusion and Midjourney.[32] |
| 2022 (June 14) | Model launch | Google releases all Switch Transformer models in T5X/JAX, including the 1.6-trillion-parameter Switch-C and the 395-billion-parameter Switch-XXL. The Switch Transformer replaces the standard feedforward neural network (FFN) layer in the transformer architecture with a mixture of experts layer containing multiple FFNs, routing each token to a single expert. This achieves greater model capacity without a proportional increase in compute per token.[33] |
| 2022 (June) | Model launch | Dao, Fu, Ermon, Rudra, and Ré at Stanford University introduce FlashAttention, an algorithm that implements the transformer attention mechanism efficiently on GPUs by exploiting the memory hierarchy of modern hardware. Standard attention implementations materialize the full attention matrix in GPU high-bandwidth memory (HBM), which is slow; FlashAttention instead computes attention in blocks sized to fit within the much faster on-chip SRAM cache, minimizing data movement between memory levels. The algorithm is mathematically exact — producing identical results to standard attention — while being significantly faster and more memory-efficient, enabling training on substantially longer context windows than were previously practical. FlashAttention achieves up to 7.6× speedup over the standard PyTorch attention implementation on NVIDIA A100 GPUs. An improved version, FlashAttention-2, released in 2023, achieves up to 2× further speedup over FlashAttention and up to 9× over standard PyTorch attention through better work partitioning and parallelism over the sequence length dimension. FlashAttention is rapidly adopted across the large language model ecosystem and becomes a standard component of transformer training infrastructure.[34][35] |
| 2022 (November 30) | Model launch | OpenAI releases ChatGPT, a chatbot based on a fine-tuned variant of GPT-3.5, itself a transformer-based large language model. ChatGPT is fine-tuned using reinforcement learning from human feedback (RLHF), a technique that aligns the model's outputs with human preferences by training a reward model on human comparisons and using it to guide further fine-tuning. The release is unexpected in its scale of public impact: ChatGPT reaches one million users within five days and one hundred million users within two months, making it the fastest-growing consumer application in history at the time. The launch triggers a wave of investment and development around large language models, prompts rapid competitive responses from Google, Meta, and Microsoft, and brings transformer-based AI to widespread public attention for the first time. It is widely regarded as a turning point marking the beginning of a new era in public engagement with AI, with transformers at its technological core.[36][37] |
| 2022 (December) | Model launch | Radford, Kim, Xu, Brockman, McLeavey, and Sutskever at OpenAI publish the paper describing Whisper, a transformer-based speech recognition model whose code and weights OpenAI had open-sourced in September 2022. Whisper is trained on 680,000 hours of multilingual and multitask weakly supervised audio collected from the internet, far more than the labelled datasets used by most prior speech recognition systems. The input audio is split into 30-second chunks, each converted to a log-Mel spectrogram and passed through a small convolutional stem into a transformer encoder, with a transformer decoder generating the output text autoregressively. The large and diverse training set allows Whisper to perform robustly across a wide range of accents, recording conditions, and languages without task-specific fine-tuning, approaching human-level performance on English speech recognition benchmarks. Whisper demonstrates that the encoder–decoder transformer, trained at sufficient scale, extends naturally from NLP and computer vision to audio, reinforcing the architecture's generality across modalities.[38] |
| 2023 (February) | Model launch | Touvron, Lavril, Izacard, Martinet, Lachaux, Lacroix, Rozière, Goyal, Hambro, Azhar, Rodriguez, Joulin, Grave, and Lample at Meta AI release LLaMA (Large Language Model Meta AI), a family of decoder-only transformer language models ranging from 7 billion to 65 billion parameters, trained exclusively on publicly available data including Common Crawl, Wikipedia, GitHub, Books, ArXiv, and StackExchange. Unlike contemporaneous large language models such as GPT-3, Chinchilla, and PaLM, whose weights are not released, LLaMA's weights are made available to the research community under a non-commercial license, making it one of the first high-quality large language models widely available for local deployment and fine-tuning. The 13-billion-parameter LLaMA model outperforms the 175-billion-parameter GPT-3 on most benchmarks despite being more than ten times smaller, supporting the view that training a smaller model on more tokens can substitute for raw parameter scale. LLaMA incorporates several architectural changes relative to the original transformer, including rotary positional embeddings (RoPE), RMSNorm pre-normalization, and SwiGLU activation functions. The release triggers an explosion of community fine-tuned variants, including Alpaca, Vicuna, and WizardLM, and establishes open-weight large language models as a viable alternative to closed commercial systems, significantly democratizing access to capable transformer-based AI.[39] |
| 2023 (March 14) | Model launch | OpenAI releases GPT-4, a large multimodal transformer-based language model accepting both text and image inputs and producing text outputs (image input is initially limited to a research preview). GPT-4 achieves human-level performance on a range of professional and academic benchmarks, scoring around the 90th percentile on a simulated Uniform Bar Examination and the 88th percentile on the LSAT; by comparison, its predecessor GPT-3.5 scored around the bottom 10% of test takers on the same simulated bar exam. It is also separately reported to exceed the passing score on United States Medical Licensing Examination questions. OpenAI does not disclose the model's architecture, parameter count, or training data, marking a significant departure from the research transparency that characterized earlier GPT releases and reflecting intensifying commercial competition in the large language model space. GPT-4 is deployed in ChatGPT and via API, and is integrated into Microsoft's Bing search engine and Microsoft 365 productivity suite, representing one of the largest commercial deployments of a transformer model to date. The release cements the transformer as the dominant architecture underlying frontier AI systems and accelerates regulatory and policy discussions around AI safety and governance worldwide.[40][41] |
| 2023 (July) | Model launch | Meta AI releases LLaMA 2, the successor to LLaMA, a family of decoder-only transformer language models ranging from 7 billion to 70 billion parameters. Unlike its predecessor, LLaMA 2 is released under a permissive commercial license, making it freely available for most commercial use cases and significantly expanding access beyond the research community. LLaMA 2 is trained on 2 trillion tokens — 40% more data than LLaMA — and incorporates reinforcement learning from human feedback (RLHF) fine-tuning to produce LLaMA 2-Chat, a dialogue-optimised variant. The Chat variant achieves performance competitive with proprietary models including ChatGPT on several benchmarks, demonstrating that openly licensed models can approach the capability of closed commercial systems. LLaMA 2 becomes the most widely adopted open-weight large language model of 2023, spawning a large ecosystem of fine-tuned variants and downstream applications, and cementing Meta's position as the primary driver of open-weight transformer development.[42] |
| 2023 (September) | Model launch | Mistral AI, a French AI startup founded by former DeepMind and Meta AI researchers, releases Mistral 7B, a 7-billion-parameter decoder-only transformer language model. Mistral 7B outperforms Meta's LLaMA 2 13B model on most benchmarks despite having nearly half the parameters, demonstrating that architectural efficiency improvements can substitute for raw scale. The model incorporates two techniques not present in the original transformer: grouped-query attention (GQA), which reduces inference memory requirements by sharing key-value heads across query heads, and sliding window attention (SWA), which allows efficient processing of long sequences by limiting each token's attention to a fixed window of preceding tokens. Mistral 7B is released under the Apache 2.0 open-source license with no usage restrictions, making it one of the most permissively licensed high-performance language models available. The release establishes Mistral AI as a significant new entrant in the open-weight transformer ecosystem and demonstrates that small, well-engineered models can challenge much larger ones — a finding that would influence subsequent model development across the industry.[43] |
| 2023 (October) | Model launch | OpenAI releases GPT-4V (GPT-4 with Vision), extending GPT-4 with the ability to accept image inputs alongside text, making it one of the first widely deployed multimodal large language models accessible to the general public. GPT-4V can describe image contents, answer questions about images, interpret charts and diagrams, read handwritten text, and reason about spatial relationships within images. The release marks a significant expansion of the transformer's role beyond text: rather than requiring separate specialist models for vision and language tasks, a single transformer can now handle both within a unified interface. GPT-4V is deployed in ChatGPT for Plus subscribers and via API, and is used in applications ranging from accessibility tools for visually impaired users to scientific image analysis. The release intensifies competitive pressure on Google, whose Gemini multimodal model is announced two months later partly in response.[44] |
| 2023 (December) | Model launch | Google DeepMind releases Gemini, a family of natively multimodal transformer models trained from the ground up to understand and generate text, images, audio, video, and code simultaneously — in contrast to GPT-4V, which added vision capabilities to a primarily text-based model. Gemini is released in three sizes: Ultra, Pro, and Nano, targeting data centre, general deployment, and on-device use cases respectively. Gemini Ultra becomes the first model to surpass human expert performance on the Massive Multitask Language Understanding (MMLU) benchmark, scoring 90.0% against a human expert score of 89.8%. The natively multimodal training approach, in which the model processes all modalities through a unified transformer architecture rather than stitching together separate encoders, represents a significant architectural departure from prior multimodal systems. Gemini Pro is integrated into Google Search, Google Workspace, and the Bard chatbot, which is subsequently renamed Gemini. The release marks Google's most direct competitive response to ChatGPT and signals the consolidation of Google Brain and DeepMind into a single research organisation focused on transformer-based frontier AI.[45] |
| 2024 (March) | Model launch | Anthropic releases the Claude 3 family of transformer-based large language models, comprising three tiers: Haiku (fastest and most compact), Sonnet (balanced), and Opus (most capable). Anthropic reports that Claude 3 Opus outperforms GPT-4 on several standard benchmarks, including MMLU, HumanEval, and GSM8K, making it one of the first non-OpenAI models to claim a lead across a broad range of capability evaluations. All three models support a 200,000-token context window, at the time among the longest available in a commercially deployed model, enabling processing of book-length documents in a single prompt. The Claude 3 release establishes Anthropic as a leading frontier AI lab alongside OpenAI and Google, and demonstrates that the transformer architecture can be pushed to new capability levels by organisations outside the original GPT lineage. Anthropic emphasises AI safety and constitutional AI training methods as distinguishing features of the Claude model family, reflecting the company's founding mission of developing safe and interpretable AI systems.[46] |
| 2024 (April) | Model launch | Meta AI releases Llama 3, a family of decoder-only transformer language models in 8-billion and 70-billion parameter configurations, with a 405-billion parameter version released subsequently in July 2024. Llama 3 is trained on 15 trillion tokens — more than seven times the training data used for Llama 2 — sourced from a significantly expanded and more carefully curated dataset that includes a higher proportion of code, mathematics, and multilingual content. The 70-billion parameter Llama 3 model achieves performance competitive with GPT-4 on several benchmarks, representing a substantial narrowing of the gap between open-weight and closed proprietary models. Llama 3 incorporates grouped-query attention (GQA) across all model sizes, a 128,000-token vocabulary — four times larger than Llama 2's — and is trained with an improved instruction-tuning pipeline. The release further consolidates the open-weight transformer ecosystem around the Llama architecture, with Llama 3 rapidly becoming the dominant base model for fine-tuning, research, and downstream application development across the AI community.[47] |
| 2024 (May) | Model launch | OpenAI releases GPT-4o ("o" for "omni"), a natively multimodal transformer model that processes and generates text, audio, and images within a single unified architecture, in contrast to prior multimodal systems that stitched together separate specialist models for each modality. GPT-4o accepts any combination of text, audio, and image inputs and can produce any combination of text, audio, and image outputs using one neural network trained end-to-end across the three modalities, enabling real-time voice conversation with response latencies as low as 232 milliseconds, comparable to human conversational response times. The model matches GPT-4 Turbo performance on text and coding benchmarks while being significantly faster and less expensive to run, and substantially outperforms prior OpenAI voice systems by processing audio directly rather than routing it through a speech-to-text transcription step that discarded tonal and emotional information. GPT-4o is made available to free ChatGPT users as well as paying subscribers, the first time OpenAI offers a GPT-4-class model without a subscription. The release demonstrates that the transformer architecture can unify previously separate modalities (text, vision, and audio) into a single end-to-end model, pointing toward a new generation of omnimodal AI systems.[48] |
| 2024 (September) | Model launch | OpenAI releases o1 (initially as o1-preview), a transformer-based large language model that introduces a new paradigm of inference-time computation: rather than generating an answer immediately, o1 is trained with reinforcement learning to produce a long internal "chain of thought" before responding. Whereas earlier models are typically prompted to answer directly, o1 can spend more reasoning tokens, and therefore more computation, on harder problems and fewer on simpler ones, effectively trading inference speed for accuracy on complex tasks. OpenAI reports that o1 ranks in the 89th percentile on competitive programming questions (Codeforces), places among the top 500 students in the United States on a qualifier for the USA Mathematical Olympiad (AIME), and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems (GPQA), all tasks on which prior models including GPT-4o performed substantially worse. The release represents a significant departure from the prevailing scaling paradigm, in which capability improvements were primarily driven by larger models and more training data; o1 demonstrates that scaling inference-time computation is a complementary and potentially more efficient path to improved capability. The model establishes reasoning as a new competitive axis in large language model development, prompting rapid responses from Google DeepMind, Anthropic, and Meta AI.[49] |
| 2024 (December)–2025 (January) | Model launch | DeepSeek, a Chinese AI research laboratory, releases DeepSeek-V3 in December 2024 and DeepSeek-R1 in January 2025, two transformer-based large language models that attract significant international attention for their combination of frontier-class performance and exceptionally low reported training cost. DeepSeek-V3 is a mixture of experts (MoE) decoder-only transformer with 671 billion total parameters, of which only 37 billion are activated for each token, trained on 14.8 trillion tokens at a reported cost of approximately $5.6 million in GPU time for the final training run (a figure that excludes prior research and experiments), far below commonly cited estimates for comparable frontier models from OpenAI and Google DeepMind. DeepSeek-R1 is a reasoning model trained primarily with large-scale reinforcement learning: its precursor, DeepSeek-R1-Zero, is trained with reinforcement learning alone, without supervised fine-tuning, while R1 itself adds a small amount of "cold-start" supervised data before reinforcement learning. R1 achieves performance competitive with OpenAI's o1 on mathematics, coding, and reasoning benchmarks. Both models are released with open weights, with DeepSeek-R1 under the permissive MIT license, making a frontier-class reasoning transformer freely available. The releases cause significant disruption in technology markets and policy discussions, raising questions about the resource requirements for frontier AI development and the effectiveness of export controls on AI hardware. DeepSeek demonstrates that transformer models of frontier capability can be trained outside the United States at a fraction of the cost previously assumed necessary, challenging assumptions about the economics and geopolitics of large language model development.[50][51] |
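
The FlashAttention entry (June 2022) in the table above describes computing attention in blocks without ever materializing the full attention matrix, while still producing the same result as standard attention. The following is a minimal pure-Python sketch of the arithmetic that makes this possible (a running maximum and running sum per query, often called an online softmax). It is an illustration only, not the authors' GPU kernel: the function names `standard_attention` and `blockwise_attention` and the `block_size` parameter are hypothetical, and the real algorithm also tiles the queries and runs in on-chip SRAM.

```python
import math

def standard_attention(Q, K, V):
    """Reference version: build each full row of scores, softmax it, weight V."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out

def blockwise_attention(Q, K, V, block_size=2):
    """Illustrative sketch: visit K/V one block at a time, keeping only a
    running max, a running softmax denominator and a running output per query."""
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                      for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale what was accumulated under the old max (0.0 on the first block).
            scale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        out.append([a / running_sum for a in acc])
    return out

# Small made-up inputs: 2 queries, 5 keys/values.
Q = [[1.0, 0.0], [0.5, -1.0]]
K = [[1.0, 2.0], [0.0, 1.0], [-1.0, 0.5], [2.0, -1.0], [0.3, 0.3]]
V = [[1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [3.0, 1.0, 1.0],
     [0.5, 0.5, 0.5], [2.0, 2.0, 2.0]]
reference = standard_attention(Q, K, V)
blocked = blockwise_attention(Q, K, V, block_size=2)
```

Under these assumptions the two functions agree to within floating-point rounding for any block size, which is the sense in which block-wise attention is exact rather than an approximation.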
Meta information on the timeline
How the timeline was built
The initial version of the timeline was written by FIXME.
Funding information for this timeline is available.
Feedback and comments
Feedback for the timeline can be provided at the following places:
- FIXME
What the timeline is still missing
- https://www.youtube.com/watch?v=iH-wmtxHunk
- Generative pre-trained transformer
- https://arxiv.org/search/?query=transformer&searchtype=all&source=header&start=250
Timeline update strategy
See also
External links
References
- ↑ Hochreiter, Sepp; Schmidhuber, Jürgen (November 1997). "Long Short-Term Memory". Neural Computation. 9 (8): 1735–1780. doi:10.1162/neco.1997.9.8.1735. PMID 9377276.
- ↑ Sutskever, Ilya; Vinyals, Oriol; Le, Quoc Viet (14 December 2014). "Sequence to sequence learning with neural networks". arXiv.
- ↑ Cho, Kyunghyun; van Merriënboer, Bart; Gulcehre, Caglar; Bahdanau, Dzmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (October 2014). "Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation". Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). pp. 1724–1734. doi:10.3115/v1/D14-1179.
- ↑ Bahdanau, Dzmitry; Cho, Kyunghyun; Bengio, Yoshua (1 September 2014). "Neural Machine Translation by Jointly Learning to Align and Translate". arXiv.
- ↑ Luong, Minh-Thang; Pham, Hieu; Manning, Christopher D. (17 August 2015). "Effective Approaches to Attention-based Neural Machine Translation". arXiv.
- ↑ Wu, Yonghui; Schuster, Mike; Chen, Zhifeng; Le, Quoc V.; Norouzi, Mohammad; Macherey, Wolfgang; Krikun, Maxim; Cao, Yuan; Gao, Qin (2016-09-01). "Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation". arXiv.
- ↑ Lewis-Kraus, Gideon (2016-12-14). "The Great A.I. Awakening". The New York Times. Archived from the original on 24 May 2023. Retrieved 2023-06-22.
- ↑ Parikh, Ankur P.; Täckström, Oscar; Das, Dipanjan; Uszkoreit, Jakob (25 September 2016). "A Decomposable Attention Model for Natural Language Inference". arXiv.
- ↑ Levy, Steven. "8 Google Employees Invented Modern AI. Here's the Inside Story". Wired. Archived from the original on 20 March 2024. Retrieved 2024-08-06.
- ↑ Vaswani, Ashish; Shazeer, Noam; Parmar, Niki; Uszkoreit, Jakob; Jones, Llion; Gomez, Aidan N; Kaiser, Łukasz; Polosukhin, Illia (2017). "Attention is All you Need" (PDF). Advances in Neural Information Processing Systems. 30.
- ↑ Marche, Stephen (2024-08-23). "Was Linguistic A.I. Created by Accident?". The New Yorker. Retrieved 2024-08-27.
- ↑ Kaiser, Lukasz; Gomez, Aidan N.; Shazeer, Noam; Vaswani, Ashish; Parmar, Niki; Jones, Llion; Uszkoreit, Jakob (2017-06-16). "One Model To Learn Them All". arXiv.
- ↑ Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (11 June 2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. Archived from the original (PDF) on 26 January 2021. Retrieved 23 January 2021.
- ↑ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (11 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv.
- ↑ "Google: BERT now used on almost every English query". Search Engine Land. 2020-10-15. Retrieved 2020-11-24.
- ↑ Radford, Alec; Wu, Jeffrey; Child, Rewon; Luan, David; Amodei, Dario; Sutskever, Ilya (14 February 2019). "Language Models are Unsupervised Multitask Learners" (PDF). OpenAI. Retrieved 2023-03-18.
- ↑ "Better Language Models and Their Implications". OpenAI. 2019-02-14. Archived from the original on 2020-12-19. Retrieved 2019-08-25.
- ↑ Liu, Yinhan; Ott, Myle; Goyal, Naman; Du, Jingfei; Joshi, Mandar; Chen, Danqi; Levy, Omer; Lewis, Mike; Zettlemoyer, Luke; Stoyanov, Veselin (26 July 2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach". arXiv.
- ↑ Shoeybi, Mohammad; Patwary, Mostofa; Puri, Raul; LeGresley, Patrick; Casper, Jared; Catanzaro, Bryan (2019-09-17). "Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism". arXiv.
- ↑ Raffel, Colin; Shazeer, Noam; Roberts, Adam; Lee, Katherine; Narang, Sharan; Matena, Michael; Zhou, Yanqi; Li, Wei; Liu, Peter J. (2020). "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer". Journal of Machine Learning Research. 21 (140): 1–67.
- ↑ Caswell, Isaac; Liang, Bowen (8 June 2020). "Recent Advances in Google Translate". Google Research. Archived from the original on 4 July 2024. Retrieved 2024-08-07.
- ↑ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (28 May 2020). "Language Models are Few-Shot Learners". arXiv.
- ↑ Dosovitskiy, Alexey; Beyer, Lucas; Kolesnikov, Alexander; Weissenborn, Dirk; Zhai, Xiaohua; Unterthiner, Thomas; Dehghani, Mostafa; Minderer, Matthias; Heigold, Georg; Gelly, Sylvain; Uszkoreit, Jakob; Houlsby, Neil (2020-10-22). "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale". arXiv.
- ↑ Beltagy, Iz; Peters, Matthew E.; Cohan, Arman (2020-04-10). "Longformer: The Long-Document Transformer". arXiv.
- ↑ Kitaev, Nikita; Kaiser, Łukasz; Levskaya, Anselm (2020). "Reformer: The Efficient Transformer". arXiv.
- ↑ Child, Rewon; Gray, Scott; Radford, Alec; Sutskever, Ilya (2019-04-23). "Generating Long Sequences with Sparse Transformers". arXiv.
- ↑ Ramesh, Aditya; Pavlov, Mikhail; Goh, Gabriel; Gray, Scott; Voss, Chelsea; Radford, Alec; Chen, Mark; Sutskever, Ilya (2021-02-26). "Zero-Shot Text-to-Image Generation". arXiv.
- ↑ Su, Jianlin; Lu, Yu; Pan, Shengfeng; Murtadha, Ahmed; Wen, Bo; Liu, Yunfeng (2021-04-01). "RoFormer: Enhanced Transformer with Rotary Position Embedding". arXiv.
- ↑ Chen, Lili; Lu, Kevin; Rajeswaran, Aravind; Lee, Kimin; Grover, Aditya; Laskin, Michael; Abbeel, Pieter; Srinivas, Aravind; Mordatch, Igor (2021-06-24). "Decision Transformer: Reinforcement Learning via Sequence Modeling". arXiv.
- ↑ Jumper, John; Evans, Richard; Pritzel, Alexander; Green, Tim; Figurnov, Michael; Ronneberger, Olaf; Tunyasuvunakool, Kathryn; Bates, Russ; Žídek, Augustin; Potapenko, Anna; Bridgland, Alex; Meyer, Clemens; Kohl, Simon A. A.; Ballard, Andrew J.; Cowie, Andrew; Romera-Paredes, Bernardino; Nikolov, Stanislav; Jain, Rishub; Adler, Jonas; Back, Trevor; Petersen, Stig; Reiman, David; Clancy, Ellen; Zielinski, Michal; Steinegger, Martin; Pacholska, Michalina; Berghammer, Tamas; Bodenstein, Sebastian; Silver, David; Vinyals, Oriol; Senior, Andrew W.; Kavukcuoglu, Koray; Kohli, Pushmeet; Hassabis, Demis (2021). "Highly accurate protein structure prediction with AlphaFold". Nature. 596: 583–589. doi:10.1038/s41586-021-03819-2. PMC 8371605. PMID 34265844.
- ↑ Smith, Shaden; Patwary, Mostofa; Norick, Brandon; LeGresley, Patrick; Rajbhandari, Samyam; Casper, Jared; Liu, Zhun; Prabhumoye, Shrimai; Zerveas, George; Korthikanti, Vijay; Zhang, Elton; Child, Rewon; Aminabadi, Reza Yazdani; Bernauer, Julie; Song, Xia; Shoeybi, Mohammad; He, Yuxiong; Houston, Michael; Tiwary, Saurabh; Catanzaro, Bryan (2022-01-28). "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model". arXiv.
- ↑ Ramesh, Aditya; Dhariwal, Prafulla; Nichol, Alex; Chu, Casey; Chen, Mark (2022-04-13). "Hierarchical Text-Conditional Image Generation with CLIP Latents". arXiv.
- ↑ Fedus, William; Zoph, Barret; Shazeer, Noam (2022-06-16). "Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity". arXiv.
- ↑ Dao, Tri; Fu, Dan; Ermon, Stefano; Rudra, Atri; Ré, Christopher (2022). "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness". Advances in Neural Information Processing Systems 35. pp. 16344–16359. doi:10.52202/068431-1189. ISBN 978-1-7138-7108-8.
- ↑ "FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning". Princeton NLP. 2023-07-17. Retrieved 2023-07-18.
- ↑ "Introducing ChatGPT". OpenAI. 2022-11-30. Retrieved 2026-05-16.
- ↑ "The inside story of how ChatGPT was built from the people who made it". MIT Technology Review. Retrieved 2024-08-06.
- ↑ Radford, Alec; Kim, Jong Wook; Xu, Tao; Brockman, Greg; McLeavey, Christine; Sutskever, Ilya (2022). "Robust Speech Recognition via Large-Scale Weak Supervision". arXiv.
- ↑ Touvron, Hugo; Lavril, Thibaut; Izacard, Gautier; Martinet, Xavier; Lachaux, Marie-Anne; Lacroix, Timothée; Rozière, Baptiste; Goyal, Naman; Hambro, Eric; Azhar, Faisal; Rodriguez, Aurelien; Joulin, Armand; Grave, Edouard; Lample, Guillaume (2023-02-27). "LLaMA: Open and Efficient Foundation Language Models". arXiv.
- ↑ OpenAI (2023-03-15). "GPT-4 Technical Report". arXiv.
- ↑ "GPT-4 Technical Report". OpenAI. 2023-03-14. Retrieved 2026-05-27.
- ↑ Touvron, Hugo; Martin, Louis; Stone, Kevin; Albert, Peter; Almahairi, Amjad; Babaei, Yasmine; Bashlykov, Nikolay; Batra, Soumya; Bhargava, Prajjwal; Bhosale, Shruti; Bikel, Dan; Blecher, Lukas; Ferrer, Cristian Canton; Chen, Moya; Cucurull, Guillem; Esiobu, David; Fernandes, Jude; Fu, Jeremy; Fu, Wenyin; Fuller, Brian (2023-07-19). "Llama 2: Open Foundation and Fine-Tuned Chat Models". arXiv.
- ↑ Jiang, Albert Q.; Sablayrolles, Alexandre; Mensch, Arthur; Bamford, Chris; Chaplot, Devendra Singh; de las Casas, Diego; Bressand, Florian; Lengyel, Gianna; Lample, Guillaume; Saulnier, Lucile; Renard Lavaud, Lélio; Lachaux, Marie-Anne; Stock, Pierre; Le Scao, Teven; Lavril, Thibaut; Wang, Thomas; Lacroix, Timothée; El Sayed, William (2023-10-10). "Mistral 7B". arXiv.
- ↑ "GPT-4V(ision) System Card". OpenAI. 2023-09-25. Retrieved 2026-05-28.
- ↑ Gemini Team; Google (2023-12-19). "Gemini: A Family of Highly Capable Multimodal Models". arXiv.
- ↑ "Claude 3 Model Card" (PDF). Anthropic. 2024-03-07. Retrieved 2026-05-28.
- ↑ Meta AI (2024-07-31). "The Llama 3 Herd of Models". arXiv.
- ↑ "Hello GPT-4o". OpenAI. 2024-05-13. Retrieved 2026-05-28.
- ↑ "Learning to Reason with LLMs". OpenAI. 2024-09-12. Retrieved 2026-05-28.
- ↑ DeepSeek-AI (2024-12-27). "DeepSeek-V3 Technical Report". arXiv.
- ↑ DeepSeek-AI (2025-01-22). "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning". arXiv.