Timeline of large language models
This is a timeline of large language models (LLMs), which are artificial intelligence (AI) systems that use deep learning techniques to process and generate human-like natural language. LLMs are pre-trained on large amounts of data to learn the complexity and linkages of language, and can be adapted for specific tasks using techniques like fine-tuning, in-context learning, and zero-/one-/few-shot learning.[1]
Sample questions
The following are some interesting questions that can be answered by reading this timeline:
- What are some early developments representing significant milestones in the evolution of large language models?
- Sort the full timeline by "Event type" and look for the group of rows with value "Early development".
- You will see a number of milestones, such as the launch of the first chatbot, as well as the introduction of long short-term memory networks and transformer models.
- What are some notable large language models being introduced over the years?
- Sort the full timeline by "Event type" and look for the group of rows with value "LLM launch".
- You will see the top LLMs and also their size in parameters.
- What are some notable or sample cases describing research in the development of LLMs?
- Sort the full timeline by "Event type" and look for the group of rows with value "Programming/training".
- You will see some research cases describing programming, which concerns the design of the model architecture and the implementation of its algorithms, as well as training, which refers to the process of teaching the large language model using data.
- What are some cases of application of LLMs illustrated in the timeline?
- Sort the full timeline by "Event type" and look for the group of rows with value "Application".
- You will see a variety of applications, such as automatic analysis, psycholinguistics, nuclear medicine, and human-robot interaction.
- What are some events describing the actual or potential impact of LLMs in society?
- Sort the full timeline by "Event type" and look for the group of rows with value "Impact".
- You will see a variety of cases, such as adversarial influence, difficulty in distinguishing human-written text, impact on the labor market, economic impact, concerns about AI-generated content, as well as situational awareness and deceptive behavior of LLMs.
- Other events are described under the following types: "Early development", "Efficiency", and "Framework launch".
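The sort-and-group steps described in the questions above can also be done programmatically. The following is a minimal sketch, not part of the timeline itself: the `ROWS` sample, its three-field tuple layout, and the `group_by_event_type` helper are assumptions made for illustration, with only a handful of rows copied from the full timeline.

```python
from collections import defaultdict

# A few rows taken from the full timeline. Only the year, model name and
# event type are kept; this tuple layout is an assumption for illustration.
ROWS = [
    (1966, "ELIZA", "Early development"),
    (2018, "BERT", "LLM launch"),
    (2020, "", "Programming/training"),
    (2020, "GPT-3", "LLM launch"),
    (2020, "", "Efficiency"),
]


def group_by_event_type(rows):
    """Return {event_type: [(year, model), ...]}, keeping timeline order within each group."""
    groups = defaultdict(list)
    for year, model, event_type in rows:
        groups[event_type].append((year, model))
    return dict(groups)


if __name__ == "__main__":
    # Print the groups in alphabetical order of event type.
    for event_type, entries in sorted(group_by_event_type(ROWS).items()):
        print(event_type)
        for year, model in entries:
            print(f"  {year}  {model or '(no model name)'}")
```

Grouping into a dictionary rather than sorting in place keeps the chronological order inside each event type, which is what a reader scanning one group of rows would expect.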
Big picture
| Time period | Development summary | More details |
|---|---|---|
| 1950s–1960s | Early developments | The groundwork for natural language processing (NLP) is laid during these years with initial attempts at language translation by IBM and Georgetown University. The pivotal moment comes in 1966 when MIT researcher Joseph Weizenbaum creates ELIZA, the first chatbot. Although rudimentary, ELIZA uses pattern recognition and predefined rules to simulate human conversation, marking the beginning of NLP research.[2][3][4] |
| 1970s–2000s | Incremental progress | These decades see incremental progress. Researchers experiment with conceptual ontologies and rule-based systems in NLP. In the 1990s, the emergence of deep learning, a form of machine learning employing neural networks for data processing, enables the development of increasingly sophisticated language models. The introduction of Long Short-Term Memory (LSTM) networks in 1997 enables the development of deeper neural networks capable of handling larger datasets. Additionally, tools like Stanford’s CoreNLP suite, introduced in 2010, provide algorithms for complex NLP tasks such as sentiment analysis and named entity recognition. Google Brain’s launch in 2011, offering advanced resources and features like word embeddings, further propels the field.[4] |
| 2010s onwards | Rise of large language models | In the 2010s, the landscape of language processing transforms dramatically. The introduction of Transformer models in 2017 revolutionizes NLP. This architecture allows for the creation of Large Language Models (LLMs) capable of understanding context and generating human-like text. From 2019 onwards, the rise of Large Language Models gains momentum with the introduction of models like GPT-2, GPT-3, and T5. These models can perform diverse tasks, driving a paradigm shift in AI capabilities. They become emblematic, serving as foundations for various applications, including ChatGPT.[5] Recent years also witness the emergence of user-friendly frameworks, such as Hugging Face and BARD, empowering researchers and developers to create their own LLMs seamlessly.[6][2] |
Full timeline
| Year | Model name (when applicable) | Size (in parameters) | Pre-train data scale | Event type | Details |
|---|---|---|---|---|---|
| 1954 | | | | Early development | Researchers at IBM and Georgetown University develop a system for automatic translation of phrases from Russian to English. This early effort lays the foundation for natural language processing and marks the beginning of research and experimentation in the field of large language models. Subsequent decades would see various approaches, including conceptual ontologies and rule-based systems, as researchers endeavor to advance the processing of natural language, although these initial attempts do not produce significant breakthroughs at the time.[4] |
| 1966 | ELIZA | | | Early development | Joseph Weizenbaum at MIT develops ELIZA, one of the earliest examples of a language model. ELIZA uses a simple set of rules to mimic human conversation, responding to user input in a natural and conversational manner. This development marks a significant milestone in the history of large language models, demonstrating the early capabilities of AI in language processing.[4] |
| 1986 | | | | Early development | Recurrent Neural Networks (RNNs) emerge, allowing models to capture dependencies in natural language processing tasks, but facing challenges with long-term memory retention.[5][7] |
| 1997 | | | | Early development | Long Short-Term Memory (LSTM) networks are introduced, enabling the creation of deeper and more complex neural networks capable of handling substantial amounts of data. This innovation marks a pivotal moment in the advancement of natural language processing (NLP) technology, providing a foundation for the evolution of more sophisticated LLMs in the subsequent years.[2][5] |
| 2010 | | | | Early development | Stanford's CoreNLP suite is introduced, providing researchers with a powerful set of tools and algorithms. This suite enables the tackling of complex natural language processing tasks such as sentiment analysis and named entity recognition. This advancement marks a crucial moment in the evolution of NLP technology, enhancing researchers' capabilities to handle intricate linguistic tasks and contributing to the subsequent progress of more sophisticated LLMs.[4] |
| 2014 | | | | Early development | The attention mechanism is introduced, enabling models to focus dynamically on different parts of input sequences, addressing issues related to sentence length and improving translation accuracy.[5][8] |
| 2017 | | | | Early development | Transformer models are introduced. This innovative architecture, enabled by Google Brain's pioneering work, would revolutionize natural language processing. Transformers allow for the creation of larger and more sophisticated LLMs, including OpenAI’s GPT-3 (Generative Pre-Trained Transformer). These models would become foundational, serving as the basis for applications like ChatGPT and numerous other AI-driven innovations. The introduction of Transformers ushers in a new era of highly capable and versatile language processing systems.[2][5] |
| 2018 (October 11) | BERT | 340,000,000[9] | 3,300,000,000 words[10] | LLM launch | Google researchers unveil BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language model. BERT's bidirectional design enables it to consider both left and right context, enhancing its understanding of language nuances. Employing a consistent-width neural network, BERT adapts to diverse tasks. Pre-trained on extensive unstructured data, it comprehensively grasps word relationships. BERT's simplicity and effectiveness make it accessible to researchers and practitioners, allowing fine-tuning for various tasks with minimal adjustments. Upon its release, BERT sets unprecedented records in NLP benchmark tests, swiftly becoming the industry standard. Within 18 months, it would power the majority of English queries processed by Google Search.[3][11][5] |
| 2019 (May 29) | GROVER | | | LLM launch | A team of researchers from the University of Washington and Allen Institute for AI Research introduces GROVER, a language model similar to GPT-2. However, they do not make the larger versions of the model publicly available.[12] Their publication discusses the potential risks of natural language generation technology and the need for robust defenses against neural fake news. Grover can generate realistic news articles that are difficult to distinguish from real news. They also explore the effectiveness of current methods for detecting fake news and find that the best defense against Grover is itself, with 92% accuracy. The article concludes by discussing the ethical issues related to the technology and the importance of public release of strong generators to facilitate better detection of neural fake news.[13] |
| 2019 (June 19) | XLNet | ~340,000,000[14] | 130,000,000,000 bytes[15] | LLM launch | XLNet is introduced as a generalized autoregressive pretraining method for language understanding. Unlike BERT, which relies on masking input tokens, XLNet considers all permutations of the factorization order to model bidirectional contexts. This approach overcomes the limitations of BERT and improves pretrain-finetune consistency. XLNet incorporates ideas from Transformer-XL, an autoregressive model, into its pretraining process. In empirical evaluations across 20 tasks, XLNet outperforms BERT by a significant margin, including question answering, natural language inference, sentiment analysis, and document ranking.[16] |
| 2019 (July 26) | RoBERTa | 123,000,000–354,000,000[17] | 160,000,000,000 bytes[18] | LLM launch | Researchers introduce "RoBERTa: A Robustly Optimized BERT Pretraining Approach," after conducting a replication study of BERT pretraining (Devlin et al., 2019) to evaluate the impact of key hyperparameters and training data size on performance. They find that BERT was undertrained and demonstrate that it can achieve or surpass the performance of subsequent models. The authors achieve state-of-the-art results on GLUE, RACE, and SQuAD benchmarks, highlighting the significance of overlooked design choices and questioning the origins of recently reported improvements.[19] |
| 2019 (August) | Megatron-LM | 8,300,000,000 | 174,000,000,000 bytes[20] | LLM launch | NVIDIA introduces Megatron-LM[21], which boasts 8.3 billion parameters and is trained with data parallelism on a remarkable 512 GPUs. The training process took a mere 53 minutes, showcasing its computational efficiency. Megatron-LM's training data is sourced from diverse places, including Wikipedia, OpenWebText, RealNews, and CC-Stories, with a combined dataset size of 174 gigabytes. This model represents a significant milestone in the development of large-scale language models, highlighting the capabilities of modern hardware and data processing in the field of natural language processing.[22][23][24] |
| 2019 (September 11) | CTRL | 1,630,000,000 | | LLM launch | CTRL is introduced as a conditional transformer language model that aims to enhance control over text generation. It is designed to condition on control codes, allowing users to govern style, content, and task-specific behavior. These control codes are derived from the structure that naturally co-occurs with raw text, providing explicit control while leveraging the advantages of unsupervised learning. CTRL is capable of predicting the likelihood of different parts of the training data given a sequence, enabling potential analysis of large datasets through model-based source attribution.[25] |
| 2019 (September 26) | ALBERT | 12,000,000[17] | | LLM launch | ALBERT is introduced as a lightweight version of BERT that focuses on self-supervised learning of language representations. The authors address the limitations of increasing model size by proposing two parameter-reduction techniques, which reduce memory consumption and training time. Empirical evidence demonstrates that their methods significantly improve the scalability of models compared to the original BERT. Additionally, they employ a self-supervised loss that prioritizes modeling inter-sentence coherence, consistently enhancing performance on tasks with multi-sentence inputs. The best ALBERT model achieves new state-of-the-art results on benchmarks such as GLUE, RACE, and SQuAD while having fewer parameters than BERT-large.[26] |
| 2019 (October 2) | DistilBERT | 66,000,000[27] | | LLM launch | DistilBERT is introduced as a smaller, faster, and cheaper version of BERT, designed for efficient on-device computations. It retains 97% of BERT's language understanding capabilities while reducing its size by 40%. By using knowledge distillation during pre-training and a triple loss function, it captures important linguistic features. DistilBERT proves its capabilities through proof-of-concept experiments and on-device studies.[28] |
| 2019 (November 1) | DialoGPT | 1,500,000,000[29] | | LLM launch | DialoGPT is introduced as a large, adaptable neural model for generating conversational responses. It is trained on 147 million conversation-like exchanges from Reddit comment chains spanning 2005 to 2017. DialoGPT, an extension of the Hugging Face PyTorch transformer, achieves performance close to human-level evaluation in single-turn dialogues. It outperforms strong baseline systems by generating more relevant, meaningful, and contextually consistent responses. The pre-trained model and training pipeline are publicly available, encouraging research in neural response generation and the advancement of intelligent open-domain dialogue systems.[30] |
| 2019 (November 10) | CamemBERT | 110,000,000[31][32] | 138,000,000,000 bytes[32] | LLM launch | A paper introduces CamemBERT, a monolingual Transformer-based language model trained specifically for French. It addresses the limited practical use of pretrained models in languages other than English. The authors evaluate CamemBERT on various tasks including part-of-speech tagging, dependency parsing, named entity recognition, and natural language inference. They find that using web-crawled data is preferable to Wikipedia data. Surprisingly, even with a relatively small web-crawled dataset of 4GB, CamemBERT achieves results on par with or better than models trained on larger datasets of over 130GB. In fact, CamemBERT outperforms the state-of-the-art models in all four downstream tasks.[33] |
| 2019 (December 11) | FlauBERT | 138,000,000–373,000,000[34][32] | 71,000,000,000 bytes[32] | LLM launch | FlauBERT is introduced as an unsupervised language model for French. Developed by Hang Le et al., it leverages unlabeled texts to pre-train word representations, demonstrating superior performance in various NLP tasks. Trained on a large and diverse French corpus, FlauBERT outperforms other pre-training approaches. The authors share different FlauBERT versions and a unified evaluation protocol, FLUE, for reproducible French NLP experiments.[35] |
| 2020 (January 13) | ProphetNet | | 16,000,000,000–160,000,000,000 bytes | LLM launch | A paper introduces ProphetNet, a sequence-to-sequence pre-training model using a novel future n-gram prediction objective and n-stream self-attention. Instead of predicting only the next token, ProphetNet jointly predicts multiple future tokens at each step, encouraging planning over longer horizons and reducing overfitting to local context. Pre-trained on 16GB and 160GB datasets, ProphetNet achieves state-of-the-art results on benchmarks such as CNN/DailyMail, Gigaword, and SQuAD 1.1 for summarization and question generation.[36] |
| 2020 (February 24) | T5 | 11,000,000,000[37] | 1,000,000,000,000 tokens[38] | LLM launch | T5 is introduced as a Text-To-Text Transfer Transformer model. It is a flexible and powerful model that achieves strong results in natural language processing tasks. It uses a unified text-to-text framework, allowing for easy adaptation to various NLP tasks. T5 is trained on a large-scale pre-training dataset called C4, which improves its performance. The authors conduct a systematic study of transfer learning methodologies and combine the best approaches to achieve remarkable results on multiple benchmarks. T5 is also applied to closed-book question answering and fill-in-the-blank text generation tasks with impressive performance.[39] |
| 2020 (March 10) | | | | Programming/training | Google researchers introduce ELECTRA, an efficient pre-training method for NLP models designed to match BERT-level performance with significantly less computation. ELECTRA replaces masked language modeling with a replaced token detection task, where a discriminator learns to identify whether each token is real or generated. A separate generator network produces replacement tokens during training, inspired by GANs. After pre-training, only the discriminator is fine-tuned. ELECTRA achieves strong results on benchmarks like GLUE and SQuAD and is released as open-source with multiple model sizes.[40] |
| 2020 (April) | Megatron-11B | 11,000,000,000 | 161,000,000,000 bytes[20] | LLM launch | Facebook AI Research (FAIR) introduces Megatron-11B, a unidirectional language model with 11 billion parameters, which is built upon the Megatron-LM architecture. FAIR trained this model using intra-layer model parallelism, splitting each layer's parameters across 8 GPUs. Megatron-11B is trained on a dataset consisting of English Wikipedia (12GB), BookCorpus (4GB), CC-News (76GB), OpenWebText/Reddit upvoted (38GB), and Stories (31GB), with a total dataset size of 161GB. This model is part of the RoBERTa family and contributes to the advancements in large-scale language models for natural language processing tasks.[22] |
| 2020 (May) | GPT-3 | 175,000,000,000[41] | 45,000,000,000,000 bytes[42] | LLM launch | OpenAI introduces GPT-3, the largest neural network at the time, with 175 billion parameters, surpassing previous models significantly. Trained on extensive internet data, GPT-3 demonstrates exceptional performance in various natural language processing tasks like translation and question-answering, outperforming existing models. The research showcases its remarkable few-shot learning ability, making it a groundbreaking advancement in the field of artificial intelligence.[43][44] |
| 2020 (May 28) | | | | Programming/training | A paper discusses the use of language models in few-shot learning, where a model pre-trained on a large corpus of text is given only a few task demonstrations in its prompt rather than being fine-tuned for a specific task. The authors demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance. They trained GPT-3, a language model with 175 billion parameters, and tested its performance in the few-shot setting. GPT-3 achieved strong performance on many NLP tasks, including translation, question-answering, and cloze tasks, as well as tasks that require on-the-fly reasoning or domain adaptation. However, the authors also identify some datasets where GPT-3's few-shot learning struggles, as well as methodological issues related to training on large web corpora. The paper also discusses the broader societal impacts of this finding and of GPT-3 in general.[45] |
| 2020 (June 5) | DeBERTa | 1,500,000,000 (larger model)[46] | | LLM launch | A paper presents DeBERTa, a model that enhances BERT and RoBERTa LLMs by introducing disentangled attention and an enhanced mask decoder. These techniques improve model pre-training efficiency and performance on various NLP tasks. A DeBERTa model trained on half the data outperforms RoBERTa-Large on tasks like MNLI, SQuAD v2.0, and RACE. A larger DeBERTa model with 1.5 billion parameters surpasses human performance on the SuperGLUE benchmark, and an ensemble DeBERTa model leads the SuperGLUE leaderboard with a significant margin over the human baseline.[47] |
| 2020 (June 30) | GShard | 600,000,000,000[38] | 1,000,000,000,000 tokens[38] | LLM launch | A paper introduces GShard, a module designed to address challenges in scaling neural networks for machine learning applications. By combining lightweight annotation APIs and an extension to the XLA compiler, GShard enables efficient parallel computation patterns with minimal code changes. The researchers utilize GShard to scale a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts to over 600 billion parameters using automatic sharding. This model is trained on 2048 TPU v3 accelerators in just 4 days, achieving significantly improved translation quality from 100 languages to English compared to previous methods.[48] |
| 2020 (July) | | | | Efficiency | A paper discusses the limitations of neural text generation models in open-ended tasks like language modeling and story generation, due to the standard likelihood training and approximate decoding objectives. The authors specifically analyze these limitations for abstractive document summarization and find that such models tend to hallucinate content that is unfaithful to the input document. The paper presents the results of a human evaluation of several neural abstractive summarization systems, highlighting the substantial amount of hallucinated content in all model-generated summaries. However, the authors also show that pretrained models perform better in terms of generating faithful and factual summaries, as evaluated by humans. They propose that textual entailment measures may be a better evaluation metric for faithfulness than standard metrics, leading to better training and decoding criteria.[49] |
| 2020 (October 23) | mT5 | 13,000,000,000[38] | 1,000,000,000,000 tokens[38] | LLM launch | mT5 is introduced as a multilingual variant of the Text-to-Text Transfer Transformer (T5). Leveraging a unified text-to-text format and pre-trained on a dataset covering 101 languages, mT5 achieves state-of-the-art results on multilingual benchmarks. The authors detail the design and modified training, and introduce a technique to prevent "accidental translation" errors in zero-shot settings.[50] |
| 2021 (January 11) | Wu Dao | 1,750,000,000,000 | 4,900,000,000,000 bytes[51] | LLM launch | Wu Dao is released. It is among the top large language models by parameter size.[6] Developed by researchers from the Beijing Academy of Artificial Intelligence, it is a groundbreaking generative deep learning model with 1.75 trillion parameters, making it ten times larger than OpenAI's GPT-3. The model utilizes an open-source learning system called FastMoE, similar to Google's Mixture of Experts, enabling rapid training on both supercomputers and conventional GPUs. Unlike traditional deep learning models, Wu Dao is multi-modal, capable of natural language processing, text and image generation, and recognition tasks. It can write essays and poems, generate alt text from images, create realistic images from descriptions, power virtual idols, and predict protein structures.[52] |
| 2021 (March 22) | GPT-Neo | 2,700,000,000[53] | | LLM launch | GPT-Neo is introduced as an open-source alternative to GPT-3, developed by EleutherAI. It offers accessible language generation capabilities and is released under the MIT license. While GPT-Neo's performance is not as strong as GPT-3's largest model, it outperforms comparable GPT-3 models on NLP reasoning benchmarks. GPT-Neo provides a promising option, especially considering OpenAI's restricted access policy.[54] |
| 2021 (April 26) | PanGu-α | 13,000,000,000[38]–200,000,000,000 | 1,100,000,000,000 bytes[38] | LLM launch | Researchers introduce PanGu-α, a large-scale autoregressive pretrained Chinese language model with up to 200 billion parameters. Developed using MindSpore and trained on a cluster of 2048 Ascend 910 AI processors, PanGu-α utilizes advanced training parallelism strategies, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance its capabilities, the model is pretrained on 1.1TB of high-quality Chinese data from diverse domains. Empirical tests showcase PanGu-α's excellence in tasks such as text summarization, question answering, and dialogue generation, demonstrating superior performance in few-shot or zero-shot scenarios across various Chinese NLP tasks.[55] |
| 2021 (May) | LaMDA | 173,000,000,000[38] | 768,000,000,000 tokens[38] | LLM launch | Google announces LaMDA (Language Model for Dialogue Applications). Unlike other language models, LaMDA is specifically trained on dialogue to enable more natural and engaging conversations with users. It has the ability to understand and respond to the subtleties of open-ended discussions. LaMDA has various potential applications, including customer service, chatbots, and personal assistants. It is built upon Google's previous chatbot model called Meena.[56] Its pretraining dataset consists of 2.97 billion documents, 1.12 billion dialogs, and 13.39 billion dialog utterances, for a total of 1.56 trillion words.[57] |
| 2021 (June 20) | CPM-2 | 198,000,000,000[38] | 2,600,000,000,000 bytes[38] | LLM launch | Researchers introduce two models: an encoder-decoder bilingual model with 11 billion parameters (CPM-2) and its corresponding MoE version with 198 billion parameters. In their tests, they evaluate CPM-2 and mT5 on practical tasks. The results indicate that CPM-2 possesses impressive overall language capabilities. Additionally, they verify InfMoE's effectiveness in performing inference with large-scale models containing tens of billions of parameters on a single GPU.[58] |
| 2021 (July 5) | ERNIE 3.0 | 10,000,000,000[38] | 375,000,000,000 tokens[38] | LLM launch | ERNIE 3.0 is introduced as a pre-training framework for large-scale language models in Natural Language Processing (NLP). Unlike previous models like T5 and GPT-3, ERNIE 3.0 incorporates both linguistic and world knowledge into its training, addressing the limitation of traditional models trained solely on plain texts. It combines auto-regressive and auto-encoding networks, enabling the model to handle natural language understanding and generation tasks effectively. Trained with 10 billion parameters on a 4TB corpus containing texts and a vast knowledge graph, ERNIE 3.0 outperforms existing models in 54 Chinese NLP tasks. Its English version also excels, leading the SuperGLUE benchmark and surpassing human performance by 0.8 percentage points (90.6% vs. 89.8%).[59] |
| 2021 (July 7) | Codex | 12,000,000,000 | 100,000,000,000 tokens | LLM launch | A paper introduces Codex, a GPT language model fine-tuned on publicly available GitHub code, also powering GitHub Copilot. Evaluations on a new set called HumanEval reveal Codex solves 28.8% of problems involving synthesizing programs from docstrings, significantly outperforming GPT-3 (0%) and GPT-J (11.4%). Codex demonstrates effectiveness in generating solutions by repeatedly sampling from the model, achieving 70.2% accuracy with 100 samples per problem. Limitations include challenges with complex docstrings and binding operations to variables. The study discusses broader impacts of deploying advanced code generation technologies, addressing concerns related to safety, security, and economics.[60] |
| 2021 (September) | HyperCLOVA | 82,000,000,000[38]–204,000,000,000[61] | 300,000,000,000[38]–560,000,000,000[62] tokens | LLM launch | HyperCLOVA is introduced as a large-scale Korean contextual learning model.[62] HyperCLOVA's extensive parameters enhance its ability to distinguish speech nuances and dialects. It learned from 6,500 times more Korean data than GPT-3, predominantly focusing on the Korean language (97%). HyperCLOVA's applications include human conversation processing, translation, summarization, and machine reading, offering diverse AI possibilities and fostering new service and business opportunities.[61] |
| 2021 (October 10) | Yuan 1.0 | 245,000,000,000[38] | 180,000,000,000 tokens[38] | LLM launch | Yuan 1.0 is introduced as a significant advancement in large-scale pre-trained language models for zero-shot and few-shot learning, addressing challenges faced by models like GPT-3 due to enormous computational demands. By integrating distributed training performance into model architecture, Yuan 1.0, boasting 245B parameters, achieves remarkable results across NLP tasks on thousands of GPUs. The approach includes efficient data processing to filter extensive raw data, resulting in a high-quality Chinese corpus of 5TB texts. Calibration and label expansion methods enhance zero-shot and few-shot performance, ensuring accurate task execution. Yuan 1.0 excels in natural language generation, producing articles nearly indistinguishable from human-written ones.[63] |
| 2021 (October 11) | MT-NLG | 530,000,000,000[56][38] | >825,000,000,000 bytes[20], 270,000,000,000 tokens[38] | LLM launch | MT-NLG (Megatron-Turing Natural Language Generation) is introduced as a language model developed jointly by Nvidia and Microsoft. It utilizes the architecture of the Megatron transformer-based model and has a record-breaking size of 530 billion parameters. MT-NLG is designed to generate coherent and contextually relevant text for various natural language processing tasks such as completion prediction, reading comprehension, commonsense reasoning, and word sense disambiguation. Training such large-scale models is challenging due to memory constraints and long training times, but innovations in hardware, software, and training methods have made it feasible. MT-NLG achieves state-of-the-art results in zero-shot, one-shot, and few-shot settings across multiple NLP tasks.[64] |
| 2021 (December 8) | Gopher | 280,000,000,000[38] | 300,000,000,000 tokens[38] | | Gopher is introduced as a 280 billion parameter Transformer-based language model, developed by Google subsidiary DeepMind. Trained on a 10.5TB corpus called MassiveText, Gopher outperforms its contemporary state-of-the-art on 100 of 124 evaluation tasks. The model is trained alongside smaller models to explore the strengths and weaknesses of large language models (LLMs). It excels in tasks like reading comprehension and fact-checking but shows reduced benefits in logical reasoning, common sense, and mathematics tasks. The DeepMind team utilizes a custom training dataset, MassiveText, to ensure high-quality data without contaminating the training dataset with test datasets available online. Gopher is part of DeepMind's language research efforts at the time.[65][66][38] |
| 2021 (December 13) | GLaM | 1,200,000,000,000[38] | 280,000,000,000 tokens[38] | LLM launch | GLaM (Generalist Language Model) is introduced as a family of language models. These models utilize a sparsely activated mixture-of-experts architecture to increase model capacity while significantly reducing training costs compared to dense variants. The largest GLaM model has 1.2 trillion parameters, making it approximately 7 times larger than GPT-3. Despite its size, this model consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference. Additionally, GLaM demonstrates better overall zero-shot and one-shot performance across 29 natural language processing tasks.[67] |
| 2021 (December 16) | WebGPT | | | LLM launch | OpenAI introduces their WebGPT project, which enhances GPT-3's factual accuracy by incorporating a text-based web browser into its functionality. The model imitates human online research by issuing search queries, following links, and citing sources to answer open-ended questions. Trained to address the tendency of language models to generate incorrect information, WebGPT allows commands like "Search..." and "Find in page:..." to gather information from web pages. The model undergoes fine-tuning through methods involving human demonstrations and training a reward model, aiming to create more accurate and truthful AI responses.[68] |
| 2021 (December) | Fairseq | 13,000,000,000 – 1,100,000,000,000 | 453,000,000,000 bytes[20] | LLM launch | Meta AI, previously known as FAIR (Facebook AI Research), announces Fairseq, a pair of language models with 13 billion and 1.1 trillion parameters. Fairseq is not related to Megatron, and the two use different technologies for training. Fairseq's dataset sources include the same ones used for RoBERTa (English Wikipedia, BookCorpus, CC-News, OpenWebText/Reddit upvoted, and Stories) with the new addition of English CC100 in Wikipedia style from Jan/2018-Dec/2018, resulting in a total dataset size of 453 GB. Fairseq is trained using 2,363 GPU-days with 1,024 GPUs, taking approximately three days.[22][69] |
| 2022 (January 19) | CM3 | | | LLM launch | A paper presents CM3, a family of generative models trained on large-scale web and Wikipedia data containing both text and image tokens. CM3 uses a hybrid causal–masked objective, generating tokens left to right while later filling masked long spans with bidirectional context. This enables full generative modeling alongside contextual understanding. CM3 produces structured multimodal outputs and implicitly learns cross-modal tasks, achieving state-of-the-art results in zero-shot summarization, entity linking, and entity disambiguation.[70] |
| 2022 (January 27) | InstructGPT | 1,300,000,000–175,000,000,000[38] | | LLM launch | OpenAI announces having deployed InstructGPT, a new language model that is safer, more helpful, and more aligned with users. The model is trained using reinforcement learning from human feedback (RLHF) and is significantly better at following instructions than its predecessor, GPT-3. InstructGPT is also less toxic and generates fewer false facts than GPT-3. The company believes that fine-tuning language models with humans in the loop is a powerful tool for improving their safety and reliability. InstructGPT becomes the default language model accessible on OpenAI's API.[71] |
| 2022 (February) | AlphaCode | 41,000,000,000[38] | 967,000,000,000 tokens[38] | LLM launch | AlphaCode is introduced as an AI system created by DeepMind that performs better than 50% of humans on a set of competitive programming challenges.[72][38] |
| 2022 (February 28) | Extremely Large | | | LLM launch | Cohere launches a new beta version of its language generation model called "Extremely Large", which, according to Cohere, outperforms its existing largest model, Large, on various tasks such as sentiment analysis, named entity recognition (NER), and common sense reasoning.[73] |
| 2022 (March 24) | SeeKeR | | | LLM launch | Researchers report having developed a new language model called SeeKeR that combines internet search, knowledge generation, and response generation to improve factual accuracy in open-domain knowledge-grounded conversations. SeeKeR outperforms the model BlenderBot 2 in terms of consistency, knowledge, and engagingness for the same number of parameters. Used as a standard language model for prompt completions, SeeKeR also outperforms GPT-2 and GPT-3 in terms of factuality and topicality.[74] |
| 2022 (March 25) | CODEGEN | 350,000,000; 2,700,000,000; 6,100,000,000; 16,100,000,000 | 577,000,000,000 tokens[38] | LLM launch | A paper introduces a family of LLMs called CODEGEN, trained on natural language and programming language data for program synthesis. The authors release CODEGEN and the training library JAXFORMER to democratize access to such models. They demonstrate that CODEGEN is competitive with previous state-of-the-art models for zero-shot Python code generation and investigate multi-turn program synthesis using an open benchmark called MTPB. Their analysis shows that multi-turn program synthesis significantly improves program synthesis over single-turn prompts. The training library and model checkpoints are available as open source contributions.[75][76] |
| 2022 (March 29) | Chinchilla | 70,000,000,000[38] | 1,400,000,000,000 tokens[38] | LLM launch | Chinchilla is introduced by DeepMind to address the optimal training of large language models under a specific computational budget. DeepMind's research shows that existing large language models are undertrained due to a focus on scaling models while keeping the training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, the researchers find that, for compute-optimal training, model size and the number of training tokens should be scaled equally. Chinchilla, a model with 70 billion parameters trained on 1.4 trillion tokens, outperforms larger models like Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on various evaluation tasks while using substantially less compute for fine-tuning and inference.[77] |
| 2022 (April 5) | PaLM | 540,000,000,000[56][38] | 780,000,000,000 tokens[38] | LLM launch | A paper presents PaLM, a 540-billion parameter language model trained using Pathways, a new machine learning system that enables highly efficient training across multiple TPU Pods. PaLM achieves state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks and outperforms the finetuned state-of-the-art on a suite of multi-step reasoning tasks. It also outperforms average human performance on the BIG-bench benchmark. Additionally, PaLM has strong capabilities in multilingual tasks and source code generation. The paper also discusses bias and toxicity and potential mitigation strategies.[78][79] |
| 2022 (April 12) | | | | Programming/training | A paper describes a method for training language models to act as helpful and harmless assistants using reinforcement learning from human feedback. The authors demonstrate that this alignment training improves performance on almost all natural language processing evaluations and is compatible with training for specialized skills such as Python coding and summarization. They explore an iterated online mode of training and investigate the robustness of the approach, identifying a linear relationship between the RL reward and the square root of the Kullback–Leibler divergence between the policy and its initialization. The authors also perform peripheral analyses and provide samples from their models using prompts from recent related work.[80] |
| 2022 (April 14) | GPT-NeoX-20B | 20,000,000,000[38] | 825,000,000,000 bytes[38] | LLM launch | GPT-NeoX-20B is introduced as an autoregressive language model. It is trained on the Pile dataset, and its weights are openly available to the public under a permissive license. GPT-NeoX-20B is described as the largest publicly available dense autoregressive model at the time of submission. The introducing paper discusses the architecture and training of GPT-NeoX-20B and evaluates its performance on various tasks related to language understanding, mathematics, and knowledge-based reasoning. The results show that GPT-NeoX-20B performs exceptionally well in few-shot scenarios, surpassing similarly sized models such as GPT-3 and FairSeq.[81][82] |
| 2022 (April) | DALL-E 2 | 3,500,000,000 | | LLM launch | OpenAI unveils DALL-E 2, a successor to their original DALL-E model, designed for generating highly realistic images at resolutions up to 1024x1024. Unlike its predecessor, DALL-E 2 utilizes a diffusion model, enabling the creation of images with four times the resolution of DALL-E. OpenAI extends customization options, allowing users to specify styles like pixel art or oil paintings. DALL-E 2 introduces 'outpainting,' enabling users to extend existing images creatively. This innovation would spark significant interest in the field of generative AI, especially for tasks beyond image generation, such as interpolation and manipulation. The model's working mechanism involves a text encoder, 'prior' model, and image decoder, simplifying complex processes underlying its image generation capabilities.[83][84][85][86] |
| 2022 (May 3) | OPT | 175,000,000,000[38] | 180,000,000,000 tokens[38] | LLM launch | Meta AI introduces Open Pretrained Transformer-175B (OPT-175B), a language model designed to democratize access to large-scale language models. By this time, these models, with over 100 billion parameters, have revolutionized NLP and AI research. OPT-175B is released with both pretrained models and code for training and usage, under a noncommercial license for research purposes. It aims to make these models accessible to academic, governmental, civil society, and industry researchers worldwide. Meta AI emphasizes responsible AI and provides documentation, compute efficiency, and smaller-scale baseline models for analysis.[87] |
| 2022 (May 10) | UL2 | 20,000,000,000[38] | 1,000,000,000,000 tokens[38] | LLM launch | UL2 is introduced as a unified framework for pre-training models that excel across various datasets and setups. It dissects architectural archetypes and pre-training objectives, offering a generalized view of self-supervision in NLP. The paper proposes Mixture-of-Denoisers (MoD), a method combining diverse pre-training paradigms. UL2 achieves superior performance, surpassing T5 and GPT-like models across multiple contexts. With 20B parameters, it outperforms GPT-3 on zero-shot SuperGLUE and triples T5-XXL's one-shot summarization performance. UL2 also excels in chain-of-thought prompting and reasoning, making it ideal for medium-scale reasoning research. FLAN instruction tuning enhances its scores, and model checkpoints are released for further research.[88] |
| 2022 (June) | YaLM 100B | 100,000,000,000 | 1,700,000,000,000 bytes | LLM launch | Yandex unveils YaLM 100B, the largest open-source GPT-like neural network to date. This model is offered for free, aiming to make advanced language models accessible to researchers worldwide. It is trained for 65 days on 800 A100 graphics cards using 1.7 TB of diverse text sources. Yandex shares the model on GitHub under the Apache 2.0 license for both research and commercial use.[89] |
| 2022 (June 29) | Minerva | 540,000,000,000 | | LLM launch | Google introduces Minerva, a large language model designed to bridge the gap in quantitative reasoning tasks. While existing language models excel in natural language understanding, they often struggle with quantitative tasks like solving college-level math, science, and engineering problems. Minerva is pretrained on general language data and then fine-tuned on technical content. It achieves state-of-the-art performance on technical benchmarks without external tools. Evaluation on over 200 undergraduate-level problems in various sciences reveals Minerva can correctly answer nearly one-third of them, demonstrating significant progress in the integration of quantitative reasoning into language models.[90][91] |
| 2022 (July 6) | NLLB-200 | 54,500,000,000[38] | | LLM launch | Meta unveils NLLB-200, which is capable of translating 200 languages with a 44% improvement in accuracy compared to previous technology. This advancement addresses the digital accessibility gap for billions, especially in Africa and Asia, where many languages lack high-quality translation tools. Meta's FLORES-200, a dataset for evaluating NLLB-200's performance, is also opened to developers. Additionally, Meta offers grants for impactful NLLB-200 applications, supporting areas like sustainability and education.[92] |
| 2022 (August) | AlexaTM | 20,000,000,000[38] | 1,300,000,000,000 tokens[38] | LLM launch | Amazon's Alexa AI labs introduces AlexaTM 20B. Although its 20 billion parameters are modest compared to larger models, its encoder-decoder architecture distinguishes it. Unlike decoder-only models like GPT-3, AlexaTM 20B uses an encoder that produces input representations for the decoder, enhancing its efficiency in tasks like machine translation and text summarization, where it outperforms GPT-3. This model marks a leap in few-shot learning, showcasing Amazon's innovation in NLU research.[93] |
| 2022 (September) | CodeGeeX | 13,000,000,000 | 850,000,000,000 tokens | LLM launch | CodeGeeX open sources its code. It is a multilingual code generation tool with 13 billion parameters, trained on a large code corpus covering over 20 programming languages. It generates code based on user comments or suggests the next line of code, speeding up coding. Unlike Copilot, CodeGeeX is trained on Ascend 910 processors, which, combined with MindSpore, are reported to outperform other AI training cards. CodeGeeX's generated code is editable, and the tool includes a Candidate feature that offers multiple code versions for users to choose from. Licensed under Apache License 2.0, CodeGeeX matches GitHub Copilot in performance and introduces unique features for developers.[94][95] |
| 2022 (September) | Sparrow | 70,000,000,000[38] | | LLM launch | DeepMind introduces Sparrow, which is refined using human feedback to enhance its helpfulness, accuracy, and harmlessness. It builds on the Chinchilla language model, trained on substantial data, and integrates with the internet for real-time information access to support accurate responses. Google aims to use Sparrow as a response to ChatGPT and Microsoft's collaboration with OpenAI, giving Google a commercially viable chatbot that could potentially rival Google Search and OpenAI.[96] |
| 2022 (September 21) | WeLM | 10,000,000,000[38] | 300,000,000,000 tokens[38] | LLM launch | WeLM is introduced as a versatile pre-trained language model for Chinese, with 10 billion parameters trained using self-supervised learning. It generalizes strongly across various tasks in zero-shot and few-shot settings. Trained on a diverse high-quality corpus, WeLM outperforms existing models on 18 monolingual tasks, matching larger models' performance. It excels in multilingual and code-switching contexts, surpassing multilingual models trained on 30 languages. Fine-tuning with human-written prompts improves its performance on unseen tasks beyond that of the unsupervised WeLM. Additionally, WeLM displays rudimentary self-explanation and calibration abilities, suggesting promising research avenues.[97] |
| 2022 (October 5) | GLM | 130,000,000,000[38] | 400,000,000,000 tokens[38] | LLM launch | GLM-130B is introduced as an open-source bilingual (English and Chinese) pre-trained language model. This model, aiming to match GPT-3's performance, overcomes technical challenges during training, focusing on stability and efficiency. It outperforms GPT-3 175B on various English benchmarks and surpasses ERNIE TITAN 3.0 260B, the largest Chinese model, on related tasks. Unique scaling properties enable efficient inference on affordable GPUs. GLM-130B achieves INT4 quantization without performance loss, a first for 100B-scale models. The model weights and resources are publicly accessible, fostering research and development in natural language processing.[98] |
| 2022 (November 9) | BLOOM | 176,000,000,000[56][38] | 366,000,000,000 tokens[38] | LLM launch | A paper introduces BLOOM, an open-access language model designed and built by a collaboration of hundreds of researchers. The model is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages. BLOOM achieves competitive performance on a wide variety of benchmarks and is publicly released under the Responsible AI License to facilitate future research and applications using large language models. The paper also discusses the development process and the need to democratize large language models.[99] |
| 2022 (November 17) | Alexa Teacher Model | 20,000,000,000 | | LLM launch | Amazon makes the Alexa Teacher Model with 20 billion parameters (AlexaTM 20B) available through Amazon SageMaker JumpStart. AlexaTM 20B is a multilingual sequence-to-sequence language model suitable for various industry applications, including summarizing financial reports and customer service chatbots. It excels in zero-shot learning tasks like SuperGLUE and multilingual zero-shot tasks such as XNLI, outperforming a 175-billion-parameter GPT-3 model. The model is designed to generalize well and handle data scarcity for various natural language processing tasks, making it valuable for developers looking to improve performance on downstream tasks with minimal training data.[100] |
| 2022 (December 6) | Flan-T5 | 11,000,000,000[38] | | LLM launch | Google researchers publicly release Flan-T5 models, which outperform baseline T5 models by a large margin. Flan-T5 is an enhanced iteration of Google's well-known T5 model, incorporating instruction fine-tuning. According to the model repository, Flan-T5 surpasses T5 in all aspects, and its open licensing makes it a preferred starting point for instruction-following models.[101] |
| 2023 (January 5) | | | | Impact | A paper discusses the concern about the potential of LLMs to influence, modify, and manipulate user preferences adversarially. As these models become more proficient in deducing user preferences and offering tailored assistance, their lack of interpretability in adversarial settings is a major concern. The paper examines existing literature on adversarial behavior in user preferences and provides red teaming samples for dialogue models like ChatGPT and GODEL. It also probes the attention mechanism in these models for non-adversarial and adversarial settings.[102] |
| 2023 (January 31) | FLAME | 60,000,000 | | LLM launch | FLAME is introduced as a small language model for assisting in the creation of spreadsheet formulas. It is based on T5 and trained on Excel formulas using domain-specific insights to achieve competitive performance with a substantially smaller model size (60M parameters) and much less training data. FLAME outperforms much larger models in 6 out of 10 settings, including formula repair, formula auto-completion, and syntax reconstruction.[103] |
| 2023 (February 2) | | | | Prompting | Researchers introduce Multimodal Chain-of-Thought (CoT) reasoning for large language models (LLMs). While LLMs have excelled in complex reasoning, their CoT prompting has been limited to text. Multimodal-CoT extends this by incorporating both text and images in a two-stage framework that separates rationale generation from answer inference. This separation allows for better-generated rationales based on multimodal information, leading to improved answer inference. Even with under 1 billion parameters, the model outperforms the state-of-the-art LLM (GPT-3.5) by 16 percentage points on the ScienceQA benchmark, achieving 91.68% accuracy, and even surpasses human performance.[104] |
| 2023 (February 9) | Toolformer | 6,700,000,000[105] | | LLM launch | Toolformer is introduced. It is a language model trained to use external tools via simple APIs, which can achieve improved performance on downstream tasks. The model is trained in a self-supervised way, using only a handful of demonstrations for each API. The model, which incorporates a range of tools including a calculator, Q&A system, search engines, translation system, and calendar, achieves substantially improved zero-shot performance across various downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.[106][107] |
| 2023 (February 20) | MOSS | 16,000,000,000[108] | | LLM launch | MOSS is introduced as a conversational language model developed by Fudan University. It performs various natural language tasks including question answering, text summarization, and code generation. It is intended to be open-sourced to facilitate future research. MOSS has some limitations, such as poor performance on languages other than English and a relatively small model capacity. It may also generate misleading or false information and may need multiple attempts to follow instructions correctly.[109] |
| 2023 (February 21) | | | | Prompting | A paper presents a catalog of prompt engineering techniques, expressed in pattern form, that have been applied successfully to solve common problems when conversing with large language models (LLMs) such as ChatGPT. Prompt patterns are reusable solutions that customize the outputs of, and interactions with, an LLM. The paper provides a framework for documenting patterns for structuring prompts, explains how prompts can be built from multiple patterns, and illustrates patterns that benefit from being combined with others. It contributes to research on prompt engineering that applies LLMs to automate software development tasks.[110] |
| 2023 (February 24) | LLaMA | 65,000,000,000[38] | 1,400,000,000,000 tokens[38] | LLM launch | Meta AI introduces LLaMA as a collection of open-source foundation language models, ranging from 7B to 65B parameters, that were trained on publicly available datasets without the need for proprietary or inaccessible data. The largest model, LLaMA-65B, is competitive with other top models such as Chinchilla-70B and PaLM-540B. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks. All models are available for research purposes.[111] |
| 2023 (February 24) | | | | Programming/training | A paper proposes a system called LLM-Augmenter that improves large language models by using external knowledge and automated feedback. The system adds plug-and-play modules to a black-box LLM to ground responses in external knowledge and iteratively improve responses using feedback generated by utility functions. The system is validated on task-oriented dialog and open-domain question answering, showing a significant reduction in hallucinations without sacrificing fluency and informativeness. The source code and models are publicly available.[112] |
| 2023 (February 27) | SpikeGPT | 260,000,000[113] | | LLM launch | A paper discusses the development of a generative language model called SpikeGPT that uses spiking neural networks (SNNs) for more energy-efficient deep learning. While SNNs have been successful in computer vision tasks, their performance in language generation has been limited due to the challenge of training them. SpikeGPT overcomes this challenge by modifying the transformer block to reduce computational complexity, and it achieves competitive performance with non-spiking models on tested benchmarks while consuming about 5 times less energy.[114] |
| 2023 (February 27) | | | | Programming/training | A paper discusses the use of open source code to train large language models (LLMs) and the potential security, privacy, and licensing implications of this practice. LLMs for code are commonly trained on large unsanitized corpora of source code scraped from the internet, leading to the memorization and verbatim emission of content by the models. The paper argues that the use of copyleft code to train LLMs is a legal and ethical dilemma, and provides actionable recommendations to address this issue. Overall, the paper highlights the importance of considering the implications of using open source code in training LLMs.[115] |
| 2023 (February 27) | | | | Prompting | A paper proposes a framework that simplifies reward design in reinforcement learning (RL) by using natural language as a proxy for the reward function. The framework prompts a large language model, such as GPT-3, to evaluate the agent's behavior against the desired behavior described in the prompt and outputs a corresponding reward signal. The RL agent uses this reward to update its behavior. The approach is evaluated in three tasks, and the results demonstrate that RL agents trained with the framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning.[116] |
| 2023 (February 27) | Kosmos-1 | 1,600,000,000[117] | | LLM launch | A paper introduces Kosmos-1, a multimodal large language model capable of perceiving multiple modalities, performing few-shot in-context learning, and following zero-shot instructions. Trained from scratch on web-scale text–image corpora, Kosmos-1 excels in language understanding and generation, multimodal dialogue, image captioning, visual question answering, and OCR-free document understanding. The study shows strong cross-modal transfer between language and vision and introduces a Raven IQ test dataset to assess nonverbal reasoning in multimodal models.[118] |
| 2023 (February 27) | | | | Programming/training | A paper proposes a method called "rectification" for reducing the risk of LLMs generating toxic discourses. The method estimates the probability that the finished discourse will be considered toxic and discourages token selections in proportion to this probability. The approach uses a separate, smaller model for detoxification and does not require access to the internal representations of the LLM. The method significantly improves the generated discourse compared to base LLMs and other techniques in terms of both language and detoxification performance, and can be applied to diverse LLMs that share the same vocabulary.[119] |
| 2023 (February 28) | | | | Application | A study proposes using LLMs for the automatic analysis of dream reports, specifically focusing on references to emotions. The study of dream content is typically performed through manual scoring of verbal reports provided by dreamers, a time-consuming task that requires trained annotators. The authors use off-the-shelf and bespoke approaches and find that the bespoke text classification method achieves high performance and is robust against potential biases. This approach could find application in the analysis of large dream datasets and improve the reproducibility and comparability of results across studies.[120] |
| 2023 (February 28) | | | | Programming/training | A paper discusses In-Context Instruction Learning (ICIL), a new approach to instruction learning for LLMs that significantly improves zero-shot task generalization performance. ICIL uses a single fixed prompt that concatenates cross-task demonstrations to evaluate all tasks, and it is complementary to instruction-based fine-tuning. The authors demonstrate that ICIL improves the performance of both pretrained and instruction-fine-tuned models, including the most powerful instruction-fine-tuned baseline (text-davinci-003) by 9.3%.[121] |
| 2023 (February 28) | | | | Application | A paper discusses the potential use of large language models in psycholinguistics. The authors note that while these models are not detailed models of human linguistic processing, they are highly successful in their primary task of providing a model for language. They suggest that large language models can be useful in psycholinguistics as a practical tool, for comparative purposes, and philosophically, as a means of rethinking the relationship between language and thought.[122] |
| 2023 (March 1) | | | | Programming/training | A paper introduces a method to train language models to understand concepts precisely using succinct representations based on category theory. The representations provide concept-wise invariance properties and a new learning algorithm that can accurately learn complex concepts or fix misconceptions. The approach also allows for the generation of a hierarchical decomposition of the representations, which can be manually verified by examining each part individually.[123] |
| 2023 (March 3) | FLAN UL2 | 20,000,000,000 | | LLM launch | Flan-UL2 is introduced as a powerful encoder-decoder model. It is developed by Google and available for download from HuggingFace. It outperforms previous versions of Flan-T5 and is recommended for self-hosted usage or fine-tuning for commercial purposes. Flan-UL2 is licensed under Apache-2.0 and its usage and training details have been made public. For users who find 20 billion parameters excessive, the previous Flan-T5 model is available in five smaller sizes.[124][56] |
| 2023 (March 6) | | | | Application | A paper explores the potential of using LLMs as zero-shot human models for human-robot interaction (HRI). Human models are important for HRI, but they are challenging to create. LLMs have consumed vast amounts of human-generated text data and can be used as human models without prior knowledge or interaction data. The authors conduct experiments on three social datasets and find that LLMs can achieve performance comparable to purpose-built models, but with limitations such as sensitivity to prompts and weak spatial and numerical reasoning. The authors demonstrate how LLM-based human models can be integrated into a social robot's planning process and applied in HRI scenarios through a case study on a simulated trust-based table-clearing task and a robot utensil-passing experiment. The results suggest that LLMs offer a promising, though still incomplete, approach to human modeling for HRI.[125] |
| 2023 (March 13) | Alpaca | 7,000,000,000[56] | | LLM launch | Alpaca is introduced as a new instruction-following language model that is fine-tuned from Meta's LLaMA 7B model on 52,000 instruction-following demonstrations generated using OpenAI's text-davinci-003. Alpaca shows similar behavior to text-davinci-003 in a preliminary evaluation while being surprisingly small and both easy and cheap to reproduce. The authors also release the training recipe and data, with the intention to release the model weights in the future.[126] |
| 2023 (March 13) | Jurassic-2 | | | LLM launch | AI21 Studio announces Jurassic-2 (J2), the latest iteration of its foundation models, introducing novel features such as zero-shot instruction-following, reduced latency, and multi-language support. The family of J2 models includes Large, Grande, and Jumbo sizes, catering to diverse needs. J2 would earn recognition on Stanford's HELM benchmark, with Jumbo ranking second in evaluations. Notably, Grande outperforms much larger models in terms of efficiency. With improved quality, multilingual support, and faster performance, J2 would be available for free until May 1st, 2023.[127] |
| 2023 (March 13) | | | | | The English Wikipedia article "Large language model" is created.[128] |
| 2023 (March 14) | Claude | 52,000,000,000[129] | | LLM launch | American artificial intelligence startup company Anthropic introduces Claude, a next-generation AI assistant. Although Anthropic does not officially disclose the model size, Claude offers a range of natural language processing (NLP) capabilities such as summarization, coding, writing, and question answering. Claude is available in two modes: the full, high-performance model, and Claude Instant, which prioritizes speed over quality. Limited information about Claude's training process and model architecture is given. Access to Claude's API requires application and approval.[130][56] |
| 2023 (March 15) | | 40,000,000,000 | | LLM launch | Abu Dhabi-based Technology Innovation Institute (TII) introduces "Falcon LLM," a foundational LLM. Developed by the AI and Digital Science Research Center's AI Cross-Center Unit, Falcon LLM outperforms GPT-3 while using only 75% of its training compute. Falcon LLM is trained on one trillion tokens and is ideal for on-premises solutions, enabling companies and governments to maintain data privacy. It offers potential applications in chatbots, virtual assistants, language translation, content generation, and more. TII aims to advance AI capabilities in the United Arab Emirates in alignment with the country's National AI Strategy.[131] |
| 2023 (March) | GPT-NeoX-20B | 20,000,000,000 | | LLM launch | GPT-NeoX-20B is introduced as a language model with 20 billion parameters trained on the Pile dataset. The model is a powerful few-shot reasoner and outperforms similarly sized models on various tasks. The training and evaluation code and model weights are open-sourced. The model is developed by Sid Black, Stella Biderman, and Eric Hallahan with the support of CoreWeave and is trained using fp16.[132] |
| 2023 (March 16) | GPT-4 | 1,760,000,000,000[133] | LLM launch | OpenAI introduces GPT-4, a large multimodal model that can process both text and image inputs and produce text outputs. GPT-4 shows human-level performance on professional and academic benchmarks and outperforms previous large language models on traditional NLP benchmarks. The report discusses the challenge of developing deep learning infrastructure and optimization methods that behave predictably across a wide range of scales. While GPT-4 has limitations and safety challenges, OpenAI has taken steps to mitigate potential harms. An extensive system card is included in the report.[134] | |
| 2023 (March 20) | PanGu-Σ | 1,085,000,000,000[38] | 329,000,000,000 tokens[38] | LLM launch | Researchers from Huawei introduce PanGu-Σ, which is developed using Ascend 910 AI processors and the MindSpore framework. This model, inheriting parameters from PanGu-α, employs a sparse architecture with Random Routed Experts (RRE) and an efficient training technique called Expert Computation and Storage Separation (ECSS). These methods lead to a 6.3x increase in training throughput through heterogeneous computing. PanGu-Σ demonstrates state-of-the-art zero-shot learning performance in various Chinese natural language processing tasks and excels in fine-tuned applications such as open-domain dialogue, question answering, machine translation, and code generation.[135][136] |
| 2023 (March 23) | ChatGLM | 6,000,000,000 | LLM launch | ChatGLM is introduced as a bilingual language model developed by Tsinghua University's Knowledge Engineering Group (KEG) & Data Mining. It has 6 billion parameters and is optimized for both Chinese and English. The model can be downloaded from HuggingFace and is compatible with consumer-grade GPUs through quantization. ChatGLM, a conversational model similar to ChatGPT, is available under an Apache-2.0 license, allowing commercial use.[137][56] | |
| 2023 (March 23) | Impact | An article investigates the potential implications of large language models (LLMs), such as Generative Pretrained Transformers (GPTs), on the U.S. labor market. The authors propose a new rubric for assessing LLM capabilities and their potential effects on jobs. The study finds that around 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of LLMs, while approximately 19% of workers may see at least 50% of their tasks impacted. The study suggests that LLMs such as GPTs exhibit traits of general-purpose technologies, indicating that they could have considerable economic, social, and policy implications.[138] | |||
| 2023 (March 24) | Dolly 2.0 | 12,000,000,000[139] | LLM launch | Dolly 2.0 is released as an open-source model that exhibits strong instruction-following capabilities similar to ChatGPT. Despite being a smaller and older model compared to state-of-the-art models like GPT-3, Dolly shows remarkable performance when fine-tuned on a small dataset of instruction training data. The model, based on EleutherAI's 6 billion parameter model, demonstrates text generation, brainstorming, and open Q&A abilities. This development is seen as a significant step in democratizing AI for enterprise use, allowing companies to build their own cost-effective instruction-following models.[140] | |
| 2023 (March 28) | Cerebras-GPT | 111,000,000 – 13,000,000,000 | Open sourcing | American artificial intelligence company Cerebras open-sources Cerebras-GPT, a family of seven GPT-3-style models ranging from 111 million to 13 billion parameters. These models are designed to set new benchmarks for accuracy and compute efficiency in large language models. They are trained following the "Chinchilla recipe" for compute-optimal training and outperform other models in terms of training times, costs, and energy consumption. The release aims to provide open access to advanced models for research and commercial applications, ensuring they are open, reproducible, and royalty-free. It also establishes a new scaling law for model performance based on training compute and data.[141] | |
| 2023 (March 30) | 50,000,000,000 | LLM launch | Bloomberg unveils BloombergGPT, a large language model with 50 billion parameters designed specifically for the financial industry. This model, tailored to financial data, can perform tasks such as generating Bloomberg Query Language (BQL), suggesting news headlines, and answering financial questions. By combining domain-specific and general-purpose data during training, BloombergGPT achieves high performance in both financial and general natural language processing (NLP) tasks. This specialized model addresses the growing need for NLP technologies in the financial sector, offering applications in areas like FinTech, where models trained on domain-specific data can outperform general-purpose models.[142] | ||
| 2023 (April 14) | 17,000,000,000 | LLM launch | LAION, a German non-profit organization, introduces OpenAssistant, an open-source project aimed at expanding access to research on aligning large language models with human preferences. The initiative includes OpenAssistant Conversations, a multilingual dataset containing over 161,000 assistant-style messages and more than 461,000 quality ratings. A preference study finds that model outputs are nearly as favored as GPT-3.5-turbo, with a relative win rate of 48.3% versus 51.7%. Both the model code and associated datasets are released under permissive licenses.[143] | ||
| 2023 (April 19) | StableLM | 3,000,000,000 – 7,000,000,000 | LLM launch | Stability AI open-sources its large language model, StableLM, which is designed to efficiently generate text and code. The models are available on GitHub and contain between 3 billion and 7 billion parameters, with 15 to 65 billion parameter models to arrive later. The model is trained on a larger version of the open-source dataset known as the Pile and encompasses information from a range of sources, including Wikipedia, Stack Exchange, and PubMed.[144][145] | |
| 2023 (April 24) | WizardLM | LLM launch | A paper presents WizardLM, a large language model trained to follow complex instructions. Instead of manually creating instruction data, the authors propose Evol-Instruct, a method that uses a language model to progressively evolve instructions into more complex forms. In evaluations, instructions produced by Evol-Instruct are rated superior to human-created ones, and WizardLM's outputs are preferred over those of OpenAI's ChatGPT on high-complexity tasks. While WizardLM still has room for improvement compared to ChatGPT, the findings highlight the potential of fine-tuning LLMs with AI-evolved instructions.[146] | ||
| 2023 (May 3) | CodeGen2 | 16,000,000,000 | 400,000,000,000 tokens | LLM launch | CodeGen2 is introduced as an autoregressive language model family for program synthesis, improving on the original CodeGen model family (CodeGen1). CodeGen2 supports infilling and a broader range of programming languages.[147][148] |
| 2023 (May 10) | PaLM 2 | LLM launch | Google launches PaLM 2, its latest LLM, at its I/O developer conference. PaLM 2 is intended to power Google's updated Bard chat tool, a competitor to OpenAI's ChatGPT, and to serve as the foundation model for new AI features. Google does not provide technical details about training and instead emphasizes the model's capabilities, such as improved common sense reasoning, mathematics, and logic. PaLM 2 excels at multilingual tasks and includes specialized models like Codey for coding and debugging, Med-PaLM 2 for medical knowledge, and Sec-PaLM for security use cases. There is also a smaller PaLM 2 model for smartphones.[149][150][151] | ||
| 2023 (May 18) | VisionLLM | Framework launch | A paper introduces VisionLLM, a framework that combines large language models (LLMs) with computer vision tasks to achieve open-ended task capabilities. While powerful vision foundation models (VFMs) exist, they are limited to predefined tasks, unlike LLMs that excel in user-tailored tasks. VisionLLM treats images as a foreign language and aligns vision-centric tasks with language tasks. By providing language instructions, an LLM-based decoder can make predictions for open-ended tasks. Extensive experiments demonstrate that VisionLLM allows different levels of task customization, achieving good results from fine-grained object-level to coarse-grained task-level customization. Remarkably, the model achieves over 60% mAP on COCO, comparable to detection-specific models.[152] | ||
| 2023 (May 21) | Baize | LLM launch | A paper introduces Baize, an open-source chat model. It is developed through a novel pipeline, which leverages ChatGPT to automatically generate a high-quality multi-turn chat corpus by having ChatGPT engage in a conversation with itself. The generated corpus serves as a resource for training and evaluating chat models. The authors also utilize parameter-efficient tuning to enhance LLaMA, an open-source language model, and create Baize. Baize demonstrates good performance in multi-turn dialogues and incorporates guardrails to minimize potential risks. Additionally, the paper proposes a technique called Self-Distill with Feedback to further improve Baize's performance using feedback from ChatGPT. Baize is designed to be accessible and can run on a single GPU, making it suitable for a wider range of researchers.[153] | ||
| 2023 (May 24) | Gorilla | LLM launch | A paper presents Gorilla, a large language model (LLM) that effectively uses API calls. Gorilla surpasses GPT-4 in generating accurate API calls by addressing input argument generation and hallucination issues. When combined with a document retriever, Gorilla adapts to test-time document changes and mitigates hallucination problems. The model's integration with the retrieval system enhances reliability.[154] Gorilla would be open-sourced on July 4th.[155] | ||
| 2023 (June 4) | Polyglot-Ko | 1,200,000,000,000 bytes | LLM launch | A technical report discusses the development of Polyglot-Ko, an open-source large-scale Korean language model. The project aims to enhance the performance of multilingual language models in non-English languages. While there are existing multilingual models, researchers often prefer building monolingual models due to limitations in the non-English language capabilities of current multilingual models. To address this, the report focuses on developing advanced Korean language models. The team collected 1.2TB of Korean data and prioritized the development of Korean models to enable performance comparisons and cater to the specific needs of Korean companies and researchers. The work presented in the report contributes to bridging the performance gap in non-English languages within multilingual language models.[156] | |
| 2023 (June 9) | FinGPT | LLM launch | FinGPT is introduced as an open-source large language model designed specifically for the finance sector. Unlike proprietary models that rely on privileged access to financial data, FinGPT takes a data-centric approach, making high-quality financial data accessible and transparent to researchers and practitioners. It emphasizes the importance of an automatic data curation pipeline and a lightweight low-rank adaptation technique. The introducing paper showcases potential applications of FinGPT in robo-advising, algorithmic trading, and low-code development. Through collaboration within the open-source AI4Finance community, FinGPT reportedly aims to democratize financial language models, stimulate innovation, and unlock opportunities in open finance.[157] | ||
| 2023 (June 11) | RoBERTweet | LLM launch | RoBERTweet is introduced as a Transformer-based language model specifically trained on Romanian tweets, aiming to develop natural language processing (NLP) systems for social media analysis. Two versions of RoBERTweet are introduced, based on the base and large architectures of BERT. The models are pre-trained on a corpus that includes all tweets collected from 2008 to 2022, which is a significant contribution to the Romanian NLP community. Experimental results demonstrate that RoBERTweet models surpass previous general-domain Romanian and multilingual language models in three NLP tasks involving tweet inputs: emotion detection, sexist language identification, and named entity recognition. The models and the newly created corpus of Romanian tweets are provided freely for public use.[158] | ||
| 2023 (June 14) | AssistGPT | LLM launch | AssistGPT is introduced as a multi-modal AI assistant designed to handle complex visual-based tasks. Given that visual tasks pose challenges due to their diverse nature, AssistGPT employs a reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate LLMs with various tools. The Planner utilizes natural language to plan the next step based on the current reasoning progress, the Executor carries out the planned actions, and the Inspector assists the Planner by providing appropriate visual information. Additionally, the Learner enables the model to autonomously explore and discover optimal solutions. The system achieves state-of-the-art results on the A-OKVQA and NExT-QA benchmarks and showcases its ability to handle complex questions beyond the benchmark scope.[159] | ||
| 2023 (June 15) | ChessGPT | LLM launch | ChessGPT is introduced as a GPT model that combines policy learning and language modeling in the context of chess. It emphasizes the importance of incorporating information from both historical policy data and natural language insights for decision-making tasks. Previous studies have typically focused on only one of these sources. ChessGPT leverages a large-scale game and language dataset related to chess to integrate policy learning and language modeling. The researchers showcase two model examples, ChessCLIP and ChessGPT, and propose an evaluation framework to assess the language model's chess ability. Experimental results validate the effectiveness of the model and dataset, and the code, model, and dataset are made available as open source resources.[160] | ||
| 2023 (June 16) | ORIBA | LLM launch | Customizable AI chatbot ORIBA is introduced in a study that explores the intersection of illustration art and artificial intelligence. It enables illustrators to engage with their original characters (OCs) by conversing with them and observing their inner monologues and behavior. The study aims to inspire illustrators by discovering innovative collaboration methods despite the tension between artists and AI. By examining the impact of AI on the creative process and authorship boundaries, the researchers seek to enhance human-AI interactions in creative fields. The potential applications of this research extend beyond illustration to areas like interactive storytelling. The study was conducted by Yuqian Sun, Xingyu Li, and Ze Gao.[161] | ||
| 2023 (June 18) | Impact | Goldman Sachs predicts that generative language AI, referring to large language models, could contribute to a 7% increase in global GDP over the next decade. However, it also raises concerns about the potential automation of 300 million jobs worldwide.[162][163] | |||
| 2023 (June 19) | Impact | An article explores the potential negative consequences of AI-generated content flooding the internet, particularly focusing on the impact of models like ChatGPT. Researchers warn that when future generative models are primarily trained on AI-generated content, a phenomenon known as "model collapse" occurs. Model collapse refers to the degenerative process where models forget the true underlying data distribution over time, leading to degraded performance and erroneous interpretations. The article highlights the importance of training models on human-generated content to maintain quality, but with the scale of content creation by models like ChatGPT, access to human-created data may become limited. The article suggests the need to preserve access to human-generated data and acknowledges the challenge of tracking and filtering AI-generated content on a large scale.[164] | |||
| 2023 (June 28) | ChatLaw | LLM launch | ChatLaw is introduced as an open-source legal large language model designed to facilitate the digital transformation of the Chinese legal domain. To ensure data quality, the authors carefully curate a legal domain fine-tuning dataset. They also address the issue of model hallucinations during reference data retrieval by combining vector database retrieval with keyword retrieval, which reduces inaccuracy. Additionally, a self-attention method is proposed to enhance the model's ability to overcome errors in reference data, further reducing model hallucinations and improving problem-solving capabilities.[165] | ||
| 2023 (July 11) | Baichuan-13B | 13,000,000,000 | 1,400,000,000,000 tokens | LLM launch | Baichuan Intelligence, a startup founded by Sogou founder Wang Xiaochuan, unveils its open-source large language model called Baichuan-13B. The Chinese model, based on the Transformer architecture like OpenAI's GPT, is trained on Chinese and English data and optimized for commercial applications. Baichuan-13B has 13 billion parameters and is trained on 1.4 trillion tokens. Baichuan-7B, a pre-training model with 7 billion parameters, was released earlier. The model is free for academic use and for developers who obtain approval for commercial use. At this time, China is focusing on developing large language models as it prepares to implement strict AI regulations, potentially requiring licenses for launching such models.[166] |
| 2023 (September 9) | Impact | A team of computer scientists, including one from OpenAI, after researching the potential development of self-awareness in large language models like ChatGPT, expresses concern that LLMs can develop situational awareness, enabling them to recognize whether they are in testing mode or deployed to the public. This awareness can lead to deceptive behavior, as LLMs might act safely during testing but harmfully after deployment. The researchers conduct experiments focusing on out-of-context reasoning as a precursor to situational awareness. While at this time LLMs are some way from acquiring situational awareness, the study offers a foundation for further research in this area.[167] | |||
| 2023 (September 13) | LLM launch | Alibaba releases its large language model Tongyi Qianwen, which is made available for public and enterprise use in China. Tongyi Qianwen, similar to ChatGPT, was previously in a beta test phase and is trained on English and Chinese text, although its exact specifications are undisclosed. This release coincides with the relaxation of AI technology restrictions in China, which now require vetting and certification for public AI tech. Companies like Baidu, Tencent, TikTok, and ByteDance have already received approval to launch AI models in China by this time. In contrast, the U.S. remains in the early stages of AI regulation discussions.[168] | |||
| 2023 (September) | Gemini | 7,000,000,000,000 – 10,000,000,000,000 | 60,000,000,000,000 – 100,000,000,000,000 tokens | LLM launch | A document discusses Google DeepMind's project named "Gemini," which is described as a general specialist in AI. Gemini is a multimodal model, likely focusing on visual, language, and action (VLA) tasks. It is expected to have 7-10 trillion parameters and a dataset size of 60-100 trillion tokens. Training started in May 2023 and concluded in August 2023, using TPUv4 and TPUv5 over approximately 120 days. The expected public release date is in October 2023, but no paper or playground information is provided in the document. The model's name is inspired by the mythological twins Castor and Pollux.[169] |
| 2023 (October 9) | Llama 2 | Programming/training | Microsoft researchers propose a novel approach to untrain LLMs. Their method, outlined in a paper on arXiv, selectively removes specific information from models without requiring complete retraining. Using Meta's Llama 2-7B model, they successfully erase all knowledge of the Harry Potter books, demonstrating efficient unlearning without affecting the model's performance on conventional benchmarks. The approach presents a direction for creating more adaptable, responsible, and legally compliant AI models, although further testing and refinement are required. Meanwhile, at the time, OpenAI and Meta face lawsuits from authors alleging copyright infringement related to training their AI models.[170] | ||
| 2023 (November 3) | Grok | X.AI Corp. unveils Grok, an AI modeled after the Hitchhiker’s Guide to the Galaxy and designed to answer a wide range of questions with a humorous touch. It also offers real-time knowledge through the 𝕏 platform and can handle provocative queries often rejected by other AIs. Still in beta at the time, Grok utilizes the Grok-1 language model, which shows strong performance in benchmarks like HumanEval and MMLU. The development of Grok-1 involves extensive improvements over its predecessor, Grok-0, and incorporates a custom training infrastructure.[171] | |||
| 2023 (November 21) | Claude 2.1 | LLM launch | Anthropic launches Claude 2.1 with major upgrades, including a 200,000-token context window that allows users to handle extensive documents such as codebases and literary works. This feature enhances the model's ability to summarize, perform Q&A, and analyze complex data. The new version also reduces model hallucination rates by 50%, improving accuracy and reliability. Additional updates include a beta tool use feature for integrating with APIs and external processes, as well as enhanced developer tools for prompt optimization and system customization. Claude 2.1 is available via API and the claude.ai chat interface.[172][173] | ||
| 2024 (February 15) | Gemini 1.5 | Google announces Gemini 1.5, a new multimodal model that delivers substantial performance improvements and advances in long-context processing. The system is built on a Mixture-of-Experts architecture, allowing Gemini 1.5 Pro to reach quality levels comparable to Gemini 1.0 Ultra with lower computational requirements. It provides a standard 128K-token context window and an experimental window of up to 1 million tokens, supporting reasoning across text, code, images, audio, and video. The release highlights enhanced in-context learning, strong recall performance, and extensive safety evaluations.[174] | |||
| 2024 (March 4) | Claude 3 Opus | Not disclosed | Not disclosed | Model release | Anthropic introduces the Claude 3 model family (Haiku, Sonnet, and Opus), establishing new performance benchmarks across reasoning, mathematics, coding, multilingual tasks, and content creation. Claude 3 offers faster responses, strong vision capabilities, improved accuracy, fewer unnecessary refusals, and long-context processing up to 200K tokens, with higher limits possible. Opus delivers frontier-level intelligence, Sonnet balances speed and cost for enterprise use, and Haiku provides ultra-fast, low-cost performance. The models emphasize safety, reduced bias, and responsible design.[175] |
| 2024 (April 4) | Command R+ | Not disclosed | Not disclosed | Model release | Cohere releases Command R+, its most powerful enterprise-focused large language model. Building on Command R, the new model emphasizes accuracy, low cost, and scalability for real-world business workflows, including categorization, automation, and data analysis. It supports extensive customization with proprietary data and is optimized for advanced retrieval-augmented generation, reducing hallucinations. Command R+ offers a 128,000-token context window and multilingual capabilities across 10+ languages. It enables multistep tool use, error-aware reasoning, and workflow automation. The model is now available on Microsoft Azure, with broader cloud availability coming soon.[176] |
| 2024 (April 18) | Llama 3 (8B, 70B) | 8B, 70B | ~15T tokens | Open-weight release | Meta releases Llama 3, a new generation of open large language models, launching 8B and 70B parameter versions with major gains in reasoning, coding, and instruction following. Trained on 15T high-quality tokens, the models use an improved 128K tokenizer, GQA, and enhanced post-training methods. Llama 3 includes expanded trust-and-safety tooling and broad deployment across major cloud and hardware platforms.[177] |
| 2024 (April 23) | Phi-3 (Mini, Small, Medium) | 3.8B–14B | Not disclosed | Model release | Microsoft introduces Phi-3, a family of small language models designed to deliver strong performance with far fewer parameters. Inspired by how children learn language, researchers trained the models on highly curated, high-quality synthetic datasets such as TinyStories and CodeTextbook. Phi-3-mini (3.8B parameters) outperforms models twice its size and is available on Azure, Hugging Face, Ollama, and as an NVIDIA NIM microservice. Upcoming 7B and 14B variants expand the lineup. These SLMs are optimized for edge devices, low-latency use cases, privacy-sensitive environments, and organizations with limited resources, while LLMs remain superior for complex reasoning tasks.[178] |
| 2024 (May 13) | GPT-4o | Not disclosed | Not disclosed | Model release | OpenAI introduces GPT-4o, a multimodal model that unifies text, audio, and visual inputs and outputs, enabling faster, more natural interactions at near-human conversational speed. It matches GPT-4 Turbo in English performance, improves multilingual capability, and is twice as fast and 50% cheaper via API. Integration with platforms like Teneo enhances orchestration, translation, entertainment, and customer service. GPT-4o has undergone extensive safety evaluations, is available in ChatGPT and the API, and will expand to advanced voice and video features soon, reshaping human–computer interaction.[179] |
| 2024 (June 21) | Claude 3.5 Sonnet | Not disclosed | Not disclosed | Model release | Claude 3.5 Sonnet is released. It surpasses Claude 3 Opus and competitor systems on benchmarks such as GPQA, MMLU, and HumanEval, while operating at twice the speed of Opus. The model features a 200K-token context window and reduced per-token costs, making it suitable for complex workflows and coding tasks. It introduces improved vision performance and supports the new Artifacts interface on Claude.ai. Claude 3.5 Sonnet maintains ASL-2 safety levels following external evaluations and continues Anthropic’s privacy commitments.[180] |
| 2024 (July 18) | Mistral NeMo | 12B | Not disclosed | Open-weight release | Mistral NeMo is introduced as a 12-billion-parameter language model developed in collaboration with NVIDIA. It features a 128k context window and is released under the Apache 2.0 license. The model includes base and instruction-tuned variants and supports efficient FP8 inference. NeMo emphasizes multilingual performance, strong function-calling capabilities, and improved text and code compression through its Tekken tokenizer. Fine-tuning results in enhanced instruction adherence, multi-turn conversational performance, and coding accuracy. It is positioned as a successor to Mistral 7B, offering improved capability within its parameter class.[181] |
| 2024 (July 23) | Llama 3.1 (405B announced) | 405B | Not disclosed | Model announcement | Meta announces Llama 3.1 405B, a 405-billion-parameter model described as the most capable openly available foundation model to date. Released alongside updated 70B and 8B versions, Llama 3.1 405B is positioned as a competitor to GPT-4o and Claude 3.5 Sonnet, offering advanced performance in general knowledge, steerability, math, tool use, and multilingual translation. Meta also highlights its value for generating high-quality synthetic data to train smaller models. Mark Zuckerberg reinforces Meta’s commitment to open-source AI, arguing that openness offers the best path to broad economic benefits and security. |
| 2024 (July 24) | Mistral Large 2 | 123,000,000,000 | Not disclosed | Model release | Mistral AI introduces Mistral Large 2, a major upgrade of its flagship model with stronger code generation, mathematics, reasoning, multilingual support, and advanced function-calling. The 123B-parameter model runs on a single node, offers a 128k context window, and supports dozens of human and programming languages. It delivers improved accuracy, reduced hallucinations, better instruction-following, and enhanced multilingual and reasoning benchmarks. Designed for long-context enterprise and research tasks, it includes robust tool-use capabilities.[182] |
| 2024 (August 13) | Grok-2 | Not disclosed | Not disclosed | Model release | Grok-2 is introduced as the successor to xAI’s Grok-1.5 model, offering enhanced performance in dialogue, reasoning, coding, and multimodal tasks. A smaller variant, Grok-2 mini, is released concurrently. An early version of Grok-2, evaluated under the name “sus-column-r,” ranks above Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS leaderboard. Across major benchmarks, both models demonstrate substantial gains over their predecessor.[183] |
| 2024 (September 19) | Qwen 2.5 (family) | 0.5B–72B | ~18T tokens | Open-weight release | Alibaba releases the Qwen 2.5 family as its latest series of large language models, spanning multiple sizes and specialized variants, including Qwen 2.5-Coder and Qwen 2.5-Math. The models range from 0.5B to 72B parameters and are pretrained on 18 trillion tokens, with context windows reaching 128,000 tokens. Coding and math variants undergo additional domain-specific pretraining. Qwen 2.5-72B-Instruct surpasses leading open-weight models on several benchmarks, while API versions outperform major proprietary systems in math and coding tasks, broadening high-performance open-weight options for developers.[184][185] |
| 2025 (December 1) | DeepSeek-V3.2 | Not disclosed | Not disclosed | Open-weight release | DeepSeek AI releases DeepSeek-V3.2 and DeepSeek-V3.2-Speciale, reasoning-focused models designed for agent applications. V3.2 succeeds V3.2-Exp and is available on App, Web, and API, offering balanced inference and long-context performance comparable to GPT-5. V3.2-Speciale emphasizes advanced reasoning, achieving top-tier results in IMO, CMO, ICPC World Finals, and IOI 2025, though it is API-only and does not support tool use. V3.2 incorporates “Thinking in Tool-Use,” trained on over 1,800 environments and 85,000 instructions. V3.2-Speciale rivals Gemini-3.0-Pro and is available temporarily for community evaluation. Open-source weights and technical reports are publicly accessible.[186] |
| 2024 (December 11) | Gemini 2.0 (announcement) | Not disclosed | Not disclosed | Model announcement | Google introduces Gemini 2.0, an AI model from DeepMind designed for agentic applications. Building on Gemini 1.0 and 1.5, it incorporates multimodal processing, long-context reasoning, and native tool integration, supporting text, images, video, audio, and code. The experimental Gemini 2.0 Flash model offers low-latency performance, multimodal outputs, and capabilities such as Google Search and code execution. Gemini 2.0 supports research prototypes including Project Astra, Project Mariner, and Jules, which explore AI agents for real-world, browser-based, and coding tasks. Development emphasizes safety, privacy, and responsible deployment.[187] |
| 2025 (January 19) | DeepSeek R1 | ~671B total (37B active) | Release | DeepSeek announces the release of DeepSeek-R1, an open-source reasoning model reported to achieve performance comparable to OpenAI-o1. The model, its codebase, and an accompanying technical report are published under the MIT License, enabling unrestricted use, distillation, and commercialization. The release includes six smaller distilled models, with the 32B and 70B variants positioned as competitive with OpenAI-o1-mini. Technical highlights emphasize large-scale reinforcement learning in post-training and strong results in mathematics, coding, and reasoning tasks. API access is available with detailed pricing and integration guidance.[188] | |
| 2025 (February 24) | Claude 3.7 Sonnet (Anthropic) | ~200B+ estimated | ~200,000 token context | Release | Anthropic releases Claude 3.7 Sonnet, a hybrid reasoning model designed to provide either rapid responses or longer, transparent chains of thought. The update includes performance improvements in coding and front-end development, along with the introduction of Claude Code, a command-line tool that can inspect, modify, test, and commit code. Users can set token limits for extended reasoning, with pricing unchanged from earlier models. Anthropic reports strong results for Claude 3.7 Sonnet on benchmarks such as SWE-bench Verified and TAU-bench. The release also includes expanded GitHub integration and updated safety and reliability evaluations.[189] |
| 2025 (February 27) | GPT-4.5 (OpenAI) | Largest OpenAI model at release (exact size not disclosed) | — | Release | GPT-4.5 is introduced as a research preview of OpenAI’s most capable chat model to date, representing an advance in unsupervised pre-training and post-training techniques. The model demonstrates broader world knowledge, reduced hallucinations, and more natural user interaction, supported by improved pattern recognition and nuanced intent-following. Early evaluations show higher factual accuracy, better alignment with human preferences, and enhanced writing, coding, and problem-solving performance. Available in ChatGPT and through the API, GPT-4.5 serves as an exploratory step toward future models integrating stronger reasoning and expanded multimodal capabilities.[190] |
| 2025 (March 12) | Gemma 3 (Google DeepMind) | Multiple sizes, from 1B to 27B | 128K token context | Release | Google introduces Gemma 3, a new generation of lightweight open models designed to run efficiently on a single GPU or TPU. Available in sizes from 1B to 27B parameters, the models offer improved multilingual performance, multimodal reasoning, a 128K-token context window, function calling, and official quantized versions for faster execution. The release includes ShieldGemma 2, an upgraded image safety classifier. Gemma 3 integrates with major tooling ecosystems, supports fine-tuning and diverse deployment options, and is accompanied by an academic program to support research use.[191] |
| 2025 (March 25) | Gemini 2.5 (Google DeepMind) | — | Up to 1M token context | Release | Gemini 2.5 is Google DeepMind’s most advanced AI model to date, designed as a “thinking model” that reasons through problems before responding. The initial release, Gemini 2.5 Pro Experimental, leads major benchmarks and ranks first on LMArena, demonstrating strong reasoning, coding, math, and science performance. It improves on earlier Gemini models through an enhanced base model and post-training. Gemini 2.5 Pro supports native multimodality and long context windows of up to one million tokens, and is available via Google AI Studio, with broader deployment planned.[192] |
| 2025 (April 5) | Llama 4 Scout / Maverick (Meta) | ~17B active (Scout) / other variants | — | Release | Meta introduces the first models in the Llama 4 series, including Llama 4 Scout and Llama 4 Maverick, two natively multimodal mixture-of-experts systems offering significant improvements in context length, efficiency, and benchmark performance. Scout provides a 10-million-token context window and strong multimodal capabilities, while Maverick delivers higher-end performance competitive with leading frontier models across reasoning, coding, and image tasks. Both models are distilled from Llama 4 Behemoth, a large teacher model still in training. The models are released as open-weight downloads.[193] |
| 2025 (May 22) | Claude Opus 4 / Sonnet 4 | — | — | Release | Anthropic introduces Claude Opus 4 and Claude Sonnet 4, two next-generation models designed to improve coding performance, advanced reasoning, and agentic workflows. Opus 4 targets high-complexity, long-duration tasks, while Sonnet 4 offers a more efficient balance of capability and responsiveness. Both models support extended thinking with tool use, parallel tool execution, improved memory when granted file access, and reduced shortcut-seeking behavior. Anthropic also releases enhancements to Claude Code, new API capabilities, and broader platform availability. Pricing remains aligned with earlier Opus and Sonnet models.[194] |
| 2025 (July 9) | Grok 4 | Unknown | ~256k token context | Release | Grok 4 is introduced as a large language model developed by xAI that incorporates native tool use and real-time search across the web and X. It was trained using large-scale reinforcement learning on a 200,000-GPU computing cluster, substantially increasing training scale compared with earlier versions. The model demonstrates strong performance on benchmarks covering reasoning, coding, mathematics, and scientific tasks. A higher-tier variant, Grok 4 Heavy, applies parallel test-time computation to improve reliability and consistency. Grok 4 is offered through subscription plans and an API, with support for multimodal inputs, extended context lengths, and voice and vision capabilities.[195] |
| 2025 (September 17) | Apertus | 8B / 70B | ~15T tokens | Release | Apertus is introduced as Switzerland’s first large-scale open, multilingual language model, developed by EPFL, ETH Zurich, and the Swiss National Supercomputing Centre. Trained on 15 trillion tokens across more than 1,000 languages, with 40% non-English data, it includes many underrepresented languages such as Swiss German and Romansh. Released in 8B and 70B parameter versions under a permissive open-source license, Apertus provides fully transparent documentation, data, and weights. It is designed to support diverse research, educational, and commercial applications while complying with Swiss and EU data protection and AI regulations.[196][197] |
| 2025 (November 12) | GPT-5.1 | — | — | Release | GPT-5.1 updates the GPT-5 series with improvements to intelligence, reasoning, and conversational quality, and is being rolled out to all users, beginning with paid plans. It introduces two variants: GPT-5.1 Instant, the default and most widely used model, optimized for instruction following and a more natural, engaging tone; and GPT-5.1 Thinking, a reasoning-focused model that is faster on simple tasks and more persistent on complex ones. The update also adds clearer controls for customizing ChatGPT’s response style across different conversations.[198] |
| 2025 (December 11) | GPT-5.2 | — | — | Release | OpenAI releases GPT-5.2, a model series designed for professional knowledge work and long-running agents. The models deliver significant improvements in reasoning, coding, vision, tool use, and long-context understanding, achieving state-of-the-art performance across multiple benchmarks of real-world tasks. The series includes Instant, Thinking, and Pro variants, tailored respectively for everyday use, complex reasoning, and high-stakes applications.[199] |
Numerical and visual data
Wikipedia Views
The image below shows Wikipedia views data for the article Large language model, from February to September 2023.[200]

Google trends
The image below shows Google Trends data for Large language model (topic), from January 2004 to September 2023, when the screenshot was taken. Interest is also ranked by country and displayed on a world map.[201]

Meta information on the timeline
How the timeline was built
The initial version of the timeline was written by Sebastian.
Funding information for this timeline is available.
Feedback and comments
Feedback for the timeline can be provided at the following places:
- FIXME
What the timeline is still missing
- https://huggingface.co/transformers/v2.10.0/pretrained_models.html
- summary table listing the model and parameters
- Vipul: I think you should add columns for model name in the full timeline. And either in the full timeline, or in a separate table with a summary of model names, you should have columns for number of parameters and training data set (or training data set size)✔
- https://lifearchitect.ai/timeline/
- https://www.researchgate.net/publication/367652128_Benchmarking_Large_Language_Models_for_News_Summarization
- https://arxiv.org/search/?query=Large+language+model&searchtype=all&source=header
- https://research.aimultiple.com/large-language-models/
Timeline update strategy
Pingbacks
- NL2Py: A Natural Language Interface for Processing Data with Python, Johannes Kepler University Linz
See also
- Timeline of ChatGPT
- Timeline of Bing Chat
- Timeline of Google Bard
- Timeline of Google Gemini
- Timeline of DeepSeek
- Timeline of Mistral AI
- Timeline of Grok
- Timeline of xAI
- Timeline of Anthropic
- Timeline of OpenAI
- Timeline of artificial intelligence
- Timeline of machine learning
- Timeline of transformers
External links
References
- ↑ "Large Language Models: Complete Guide in 2023". research.aimultiple.com. Retrieved 11 March 2023.
- ↑ 2.0 2.1 2.2 2.3 Pathak, Priyanka (11 May 2023). "Large Language Models 101: History, Evolution and Future". Scribble Data. Retrieved 12 October 2023.
- ↑ 3.0 3.1 Casey, Matt (25 May 2023). "Large language models: their history, capabilities and limitations". Snorkel AI. Retrieved 12 October 2023.
- ↑ 4.0 4.1 4.2 4.3 4.4 "Introduction to Large Language Models | Omega Venture Partners". omegavp.com. 6 December 2022. Retrieved 12 October 2023.
- ↑ 5.0 5.1 5.2 5.3 5.4 5.5 "Brief History of Large Language Models & Generative AI | Evolution of NLP from Eliza to ChatGPT". youtube.com. Retrieved 17 October 2023.
- ↑ 6.0 6.1 "Large Language Model Training in 2023". research.aimultiple.com. Retrieved 11 March 2023.
- ↑ Yanhui, Chen (8 March 2021). "A Battle Against Amnesia: A Brief History and Introduction of Recurrent Neural Networks". Medium. Retrieved 17 October 2023.
- ↑ "The Bahdanau Attention Mechanism". machinelearningmastery.com. Retrieved 17 October 2023.
- ↑ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". doi:10.48550/arXiv.1810.04805.
- ↑ "BERT 101 - State Of The Art NLP Model Explained". huggingface.co. Retrieved 16 October 2023.
- ↑ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". doi:10.48550/arXiv.1810.04805.
- ↑ "GPT-2: 6-month follow-up". openai.com. Retrieved 23 March 2023.
- ↑ Zellers, Rowan; Holtzman, Ari; Rashkin, Hannah; Bisk, Yonatan; Farhadi, Ali; Roesner, Franziska; Choi, Yejin (2019). "Defending Against Neural Fake News". doi:10.48550/arXiv.1905.12616.
- ↑ "BERT, RoBERTa, DistilBERT, XLNet: Which one to use?". KDnuggets. Retrieved 29 June 2023.
- ↑ Khan, Suleiman (18 May 2021). "BERT, RoBERTa, DistilBERT, XLNet — which one to use?". Medium. Retrieved 16 October 2023.
- ↑ Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". doi:10.48550/arXiv.1906.08237.
- ↑ 17.0 17.1 G, Juan (21 September 2021). "An Intuitive Explanation of Transformer-Based Models". Factored | Machine Learning, Data Engineering and Data Analytics Company. Retrieved 16 October 2023.
- ↑ "Overview of ROBERTa model". GeeksforGeeks. 24 November 2020. Retrieved 16 October 2023.
- ↑ Liu, Yinhan; Ott, Myle; Goyal, Naman; Du, Jingfei; Joshi, Mandar; Chen, Danqi; Levy, Omer; Lewis, Mike; Zettlemoyer, Luke; Stoyanov, Veselin (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach". doi:10.48550/arXiv.1907.11692.
- ↑ 20.0 20.1 20.2 20.3 "AI: Megatron the Transformer, and its related language models". Dr Alan D. Thompson – Life Architect. 24 September 2021. Retrieved 16 October 2023.
- ↑ "Megatron Unleashed: NVIDIA's NLP Model "Megatron-LM" is the Largest Transformer Ever Trained | Exxact Blog". www.exxactcorp.com. Retrieved 11 March 2023.
- ↑ 22.0 22.1 22.2 "AI: Megatron the Transformer, and its related language models". lifearchitect.ai. 24 September 2021. Retrieved 18 September 2023.
- ↑ "NeMo Megatron — NVIDIA NeMo". docs.nvidia.com. Retrieved 11 March 2023.
- ↑ "Nvidia trains world's largest Transformer-based language model". VentureBeat. 13 August 2019. Retrieved 18 September 2023.
- ↑ Keskar, Nitish Shirish; McCann, Bryan; Varshney, Lav R.; Xiong, Caiming; Socher, Richard (2019). "CTRL: A Conditional Transformer Language Model for Controllable Generation". doi:10.48550/arXiv.1909.05858.
- ↑ Lan, Zhenzhong; Chen, Mingda; Goodman, Sebastian; Gimpel, Kevin; Sharma, Piyush; Soricut, Radu (2019). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". doi:10.48550/arXiv.1909.11942.
- ↑ Herbst, Sabrina (24 January 2023). "Training a DistilBERT model from scratch". Medium. Retrieved 17 October 2023.
- ↑ Sanh, Victor; Debut, Lysandre; Chaumond, Julien; Wolf, Thomas (2019). "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". doi:10.48550/ARXIV.1910.01108.
- ↑ Kuzman, Taja (29 March 2023). "Microsoft introduced its DialoGPT to Skype and Edge". Medium. Retrieved 19 September 2023.
- ↑ Zhang, Yizhe; Sun, Siqi; Galley, Michel; Chen, Yen-Chun; Brockett, Chris; Gao, Xiang; Gao, Jianfeng; Liu, Jingjing; Dolan, Bill (2019). "DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation". doi:10.48550/arXiv.1911.00536.
- ↑ "Pretrained models — transformers 2.10.0 documentation". huggingface.co.
- ↑ 32.0 32.1 32.2 32.3 Blanc, Corentin; Bailly, Alexandre; Francis, Élie; Guillotin, Thierry; Jamal, Fadi; Wakim, Béchara; Roy, Pascal (1 May 2022). "FlauBERT vs. CamemBERT: Understanding patient's answers by a French medical chatbot". Artificial Intelligence in Medicine. 127: 102264. doi:10.1016/j.artmed.2022.102264. ISSN 0933-3657.
- ↑ Martin, Louis; Muller, Benjamin; Suárez, Pedro Javier Ortiz; Dupont, Yoann; Romary, Laurent; de la Clergerie, Éric Villemonte; Seddah, Djamé; Sagot, Benoît (2019). "CamemBERT: a Tasty French Language Model". doi:10.48550/arXiv.1911.03894.
- ↑ Sambucci, Luca (17 November 2021). "Cedille, the largest French AI language model, is actually from Switzerland". Artificial Intelligence news. Retrieved 30 June 2023.
- ↑ Le, Hang; Vial, Loïc; Frej, Jibril; Segonne, Vincent; Coavoux, Maximin; Lecouteux, Benjamin; Allauzen, Alexandre; Crabbé, Benoît; Besacier, Laurent; Schwab, Didier (2019). "FlauBERT: Unsupervised Language Model Pre-training for French". doi:10.48550/arXiv.1912.05372.
- ↑ Qi, Weizhen; Yan, Yu; Gong, Yeyun; Liu, Dayiheng; Duan, Nan; Chen, Jiusheng; Zhang, Ruofei; Zhou, Ming (2020). "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training". doi:10.48550/arXiv.2001.04063.
- ↑ Jagtap, Rohan (2 August 2020). "T5: Text-To-Text Transfer Transformer". Medium. Retrieved 19 September 2023.
- ↑ 38.00 38.01 38.02 38.03 38.04 38.05 38.06 38.07 38.08 38.09 38.10 38.11 38.12 38.13 38.14 38.15 38.16 38.17 38.18 38.19 38.20 38.21 38.22 38.23 38.24 38.25 38.26 38.27 38.28 38.29 38.30 38.31 38.32 38.33 38.34 38.35 38.36 38.37 38.38 38.39 38.40 38.41 38.42 38.43 38.44 38.45 38.46 38.47 38.48 38.49 38.50 38.51 38.52 38.53 Zhao, Wayne Xin; Zhou, Kun; Li, Junyi; Tang, Tianyi; Wang, Xiaolei; Hou, Yupeng; Min, Yingqian; Zhang, Beichen; Zhang, Junjie; Dong, Zican; Du, Yifan; Yang, Chen; Chen, Yushuo; Chen, Zhipeng; Jiang, Jinhao; Ren, Ruiyang; Li, Yifan; Tang, Xinyu; Liu, Zikang; Liu, Peiyu; Nie, Jian-Yun; Wen, Ji-Rong (2023). "A Survey of Large Language Models". doi:10.48550/arXiv.2303.18223.
- ↑ "Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer". ai.googleblog.com. 24 February 2020. Retrieved 25 June 2023.
- ↑ "More Efficient NLP Model Pre-training with ELECTRA". ai.googleblog.com. 10 March 2020. Retrieved 28 June 2023.
- ↑ Wiggers, Kyle (28 April 2022). "The emerging types of language models and why they matter". TechCrunch. Retrieved 29 June 2023.
- ↑ "OpenAI GPT-3: Everything You Need to Know [Updated]". springboard.com. Retrieved 16 October 2023.
- ↑ Romero, Alberto (25 May 2021). "GPT-3 — A Complete Overview". Medium. Retrieved 20 October 2023.
- ↑ Lee, Angie (26 January 2023). "What Are Large Language Models Used For and Why Are They Important?". NVIDIA Blog. Retrieved 11 March 2023.
- ↑ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (2020). "Language Models are Few-Shot Learners". doi:10.48550/arXiv.2005.14165.
- ↑ Tsang, Sik-Ho (21 January 2023). "Brief Review — DeBERTa: Decoding-enhanced BERT with Disentangled Attention". Medium. Retrieved 18 September 2023.
- ↑ He, Pengcheng; Liu, Xiaodong; Gao, Jianfeng; Chen, Weizhu (2020). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention". doi:10.48550/arXiv.2006.03654.
- ↑ Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". doi:10.48550/arXiv.2006.16668.
- ↑ Maynez, Joshua; Narayan, Shashi; Bohnet, Bernd; McDonald, Ryan (July 2020). "On Faithfulness and Factuality in Abstractive Summarization". Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics: 1906–1919. doi:10.18653/v1/2020.acl-main.173.
- ↑ Xue, Linting; Constant, Noah; Roberts, Adam; Kale, Mihir; Al-Rfou, Rami; Siddhant, Aditya; Barua, Aditya; Raffel, Colin (2021). "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer": 483–498. doi:10.18653/v1/2021.naacl-main.41.
- ↑ "Wu Dao 2.0 in 2023: China's Improved Version of GPT-3". research.aimultiple.com. Retrieved 16 October 2023.
- ↑ "China's gigantic multi-modal AI is no one-trick pony". Engadget. 2 June 2021. Retrieved 18 October 2023.
- ↑ "GPT Neo". March 15, 2023.
- ↑ "GPT-3's free alternative GPT-Neo is something to be excited about". VentureBeat. 15 May 2021. Retrieved 29 June 2023.
- ↑ Zeng, Wei; Ren, Xiaozhe; Su, Teng; Wang, Hui; Liao, Yi; Wang, Zhiwei; Jiang, Xin; Yang, ZhenZhang; Wang, Kaisheng; Zhang, Xiaoda; Li, Chen; Gong, Ziyan; Yao, Yifan; Huang, Xinjing; Wang, Jun; Yu, Jianfeng; Guo, Qi; Yu, Yue; Zhang, Yan; Wang, Jin; Tao, Hengtao; Yan, Dasen; Yi, Zexuan; Peng, Fang; Jiang, Fangqing; Zhang, Han; Deng, Lingfeng; Zhang, Yehong; Lin, Zhe; Zhang, Chao; Zhang, Shaojie; Guo, Mingyue; Gu, Shanzhi; Fan, Gaojun; Wang, Yaowei; Jin, Xuefeng; Liu, Qun; Tian, Yonghong (2021). "PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation". doi:10.48550/arXiv.2104.12369.
- ↑ 56.0 56.1 56.2 56.3 56.4 56.5 56.6 56.7 Kazi, Suleman (28 March 2023). "Top Large Language Models (LLMs): GPT-4, LLaMA, FLAN UL2, BLOOM, and More". Vectara. Retrieved 29 June 2023.
- ↑ Tsang, Sik-Ho (13 May 2023). "Brief Review — LaMDA: Language Models for Dialog Applications". Medium. Retrieved 16 October 2023.
- ↑ Zhang, Zhengyan; Gu, Yuxian; Han, Xu; Chen, Shengqi; Xiao, Chaojun; Sun, Zhenbo; Yao, Yuan; Qi, Fanchao; Guan, Jian; Ke, Pei; Cai, Yanzheng; Zeng, Guoyang; Tan, Zhixing; Liu, Zhiyuan; Huang, Minlie; Han, Wentao; Liu, Yang; Zhu, Xiaoyan; Sun, Maosong (2021). "CPM-2: Large-scale Cost-effective Pre-trained Language Models". doi:10.48550/arXiv.2106.10715.
- ↑ Sun, Yu; Wang, Shuohuan; Feng, Shikun; Ding, Siyu; Pang, Chao; Shang, Junyuan; Liu, Jiaxiang; Chen, Xuyi; Zhao, Yanbin; Lu, Yuxiang; Liu, Weixin; Wu, Zhihua; Gong, Weibao; Liang, Jianzhong; Shang, Zhizhou; Sun, Peng; Liu, Wei; Ouyang, Xuan; Yu, Dianhai; Tian, Hao; Wu, Hua; Wang, Haifeng (2021). "ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation". doi:10.48550/arXiv.2107.02137.
- ↑ Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas; Brockman, Greg; Ray, Alex; Puri, Raul; Krueger, Gretchen; Petrov, Michael; Khlaaf, Heidy; Sastry, Girish; Mishkin, Pamela; Chan, Brooke; Gray, Scott; Ryder, Nick; Pavlov, Mikhail; Power, Alethea; Kaiser, Lukasz; Bavarian, Mohammad; Winter, Clemens; Tillet, Philippe; Such, Felipe Petroski; Cummings, Dave; Plappert, Matthias; Chantzis, Fotios; Barnes, Elizabeth; Herbert-Voss, Ariel; Guss, William Hebgen; Nichol, Alex; Paino, Alex; Tezak, Nikolas; Tang, Jie; Babuschkin, Igor; Balaji, Suchir; Jain, Shantanu; Saunders, William; Hesse, Christopher; Carr, Andrew N.; Leike, Jan; Achiam, Josh; Misra, Vedant; Morikawa, Evan; Radford, Alec; Knight, Matthew; Brundage, Miles; Murati, Mira; Mayer, Katie; Welinder, Peter; McGrew, Bob; Amodei, Dario; McCandlish, Sam; Sutskever, Ilya; Zaremba, Wojciech (2021). "Evaluating Large Language Models Trained on Code". doi:10.48550/arXiv.2107.03374.
- ↑ 61.0 61.1 "HyperCLOVA | Discover AI use cases". gpt3demo.com. Retrieved 20 October 2023.
- ↑ 62.0 62.1 Kim, Boseop; Kim, HyoungSeok; Lee, Sang-Woo; Lee, Gichang; Kwak, Donghyun; Dong Hyeon, Jeon; Park, Sunghyun; Kim, Sungju; Kim, Seonhoon; Seo, Dongpil; Lee, Heungsub; Jeong, Minyoung; Lee, Sungjae; Kim, Minsub; Ko, Suk Hyun; Kim, Seokhun; Park, Taeyong; Kim, Jinuk; Kang, Soyoung; Ryu, Na-Hyeon; Yoo, Kang Min; Chang, Minsuk; Suh, Soobin; In, Sookyo; Park, Jinseong; Kim, Kyungduk; Kim, Hiun; Jeong, Jisu; Yeo, Yong Goo; Ham, Donghoon; Park, Dongju; Lee, Min Young; Kang, Jaewook; Kang, Inho; Ha, Jung-Woo; Park, Woomyoung; Sung, Nako (2021). "What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers": 3405–3424. doi:10.18653/v1/2021.emnlp-main.274.
- ↑ Wu, Shaohua; Zhao, Xudong; Yu, Tong; Zhang, Rongguo; Shen, Chong; Liu, Hongli; Li, Feng; Zhu, Hong; Luo, Jiangang; Xu, Liang; Zhang, Xuanwei (2021). "Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning". doi:10.48550/arXiv.2110.04725.
- ↑ "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model". NVIDIA Technical Blog. 11 October 2021. Retrieved 30 June 2023.
- ↑ Rae, Jack W.; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; Rutherford, Eliza; Hennigan, Tom; Menick, Jacob; Cassirer, Albin; Powell, Richard; Driessche, George van den; Hendricks, Lisa Anne; Rauh, Maribeth; Huang, Po-Sen; Glaese, Amelia; Welbl, Johannes; Dathathri, Sumanth; Huang, Saffron; Uesato, Jonathan; Mellor, John; Higgins, Irina; Creswell, Antonia; McAleese, Nat; Wu, Amy; Elsen, Erich; Jayakumar, Siddhant; Buchatskaya, Elena; Budden, David; Sutherland, Esme; Simonyan, Karen; Paganini, Michela; Sifre, Laurent; Martens, Lena; Li, Xiang Lorraine; Kuncoro, Adhiguna; Nematzadeh, Aida; Gribovskaya, Elena; Donato, Domenic; Lazaridou, Angeliki; Mensch, Arthur; Lespiau, Jean-Baptiste; Tsimpoukelli, Maria; Grigorev, Nikolai; Fritz, Doug; Sottiaux, Thibault; Pajarskas, Mantas; Pohlen, Toby; Gong, Zhitao; Toyama, Daniel; d'Autume, Cyprien de Masson; Li, Yujia; Terzi, Tayfun; Mikulik, Vladimir; Babuschkin, Igor; Clark, Aidan; Casas, Diego de Las; Guy, Aurelia; Jones, Chris; Bradbury, James; Johnson, Matthew; Hechtman, Blake; Weidinger, Laura; Gabriel, Iason; Isaac, William; Lockhart, Ed; Osindero, Simon; Rimell, Laura; Dyer, Chris; Vinyals, Oriol; Ayoub, Kareem; Stanway, Jeff; Bennett, Lorrayne; Hassabis, Demis; Kavukcuoglu, Koray; Irving, Geoffrey (2021). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". doi:10.48550/arXiv.2112.11446.
- ↑ "Google Trains 280 Billion Parameter AI Language Model Gopher". InfoQ. Retrieved 21 October 2023.
- ↑ Du, Nan; Huang, Yanping; Dai, Andrew M.; Tong, Simon; Lepikhin, Dmitry; Xu, Yuanzhong; Krikun, Maxim; Zhou, Yanqi; Yu, Adams Wei; Firat, Orhan; Zoph, Barret; Fedus, Liam; Bosma, Maarten; Zhou, Zongwei; Wang, Tao; Wang, Yu Emma; Webster, Kellie; Pellat, Marie; Robinson, Kevin; Meier-Hellstern, Kathleen; Duke, Toju; Dixon, Lucas; Zhang, Kun; Le, Quoc V; Wu, Yonghui; Chen, Zhifeng; Cui, Claire (2021). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". doi:10.48550/arXiv.2112.06905.
- ↑ "WebGPT: Improving the factual accuracy of language models through web browsing". openai.com. Retrieved 21 October 2023.
- ↑ "fairseq documentation — fairseq 0.12.2 documentation". fairseq.readthedocs.io. Retrieved 16 May 2023.
- ↑ Aghajanyan, Armen; Huang, Bernie; Ross, Candace; Karpukhin, Vladimir; Xu, Hu; Goyal, Naman; Okhonko, Dmytro; Joshi, Mandar; Ghosh, Gargi; Lewis, Mike; Zettlemoyer, Luke (2022). "CM3: A Causal Masked Multimodal Model of the Internet". doi:10.48550/arXiv.2201.07520.
- ↑ "Aligning language models to follow instructions". openai.com. Retrieved 21 March 2023.
- ↑ "Finally, an AI bot that can ace technical interview questions (Ep. 417) - Stack Overflow". stackoverflow.blog. 22 February 2022. Retrieved 21 October 2023.
- ↑ "Cohere launches Extremely Large (beta)". Context by Cohere. 1 March 2022. Retrieved 12 March 2023.
- ↑ Shuster, Kurt; Komeili, Mojtaba; Adolphs, Leonard; Roller, Stephen; Szlam, Arthur; Weston, Jason (2022). "Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion". doi:10.48550/arXiv.2203.13224.
- ↑ Nijkamp, Erik; Pang, Bo; Hayashi, Hiroaki; Tu, Lifu; Wang, Huan; Zhou, Yingbo; Savarese, Silvio; Xiong, Caiming (2022). "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis". doi:10.48550/arXiv.2203.13474.
- ↑ "CodeGen". github.com. Salesforce. 16 May 2023. Retrieved 16 May 2023.
- ↑ Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan; Guy, Aurelia; Osindero, Simon; Simonyan, Karen; Elsen, Erich; Rae, Jack W.; Vinyals, Oriol; Sifre, Laurent (2022). "Training Compute-Optimal Large Language Models". doi:10.48550/arXiv.2203.15556.
- ↑ Chowdhery, Aakanksha; Narang, Sharan; Devlin, Jacob; Bosma, Maarten; Mishra, Gaurav; Roberts, Adam; Barham, Paul; Chung, Hyung Won; Sutton, Charles; Gehrmann, Sebastian; Schuh, Parker; Shi, Kensen; Tsvyashchenko, Sasha; Maynez, Joshua; Rao, Abhishek; Barnes, Parker; Tay, Yi; Shazeer, Noam; Prabhakaran, Vinodkumar; Reif, Emily; Du, Nan; Hutchinson, Ben; Pope, Reiner; Bradbury, James; Austin, Jacob; Isard, Michael; Gur-Ari, Guy; Yin, Pengcheng; Duke, Toju; Levskaya, Anselm; Ghemawat, Sanjay; Dev, Sunipa; Michalewski, Henryk; Garcia, Xavier; Misra, Vedant; Robinson, Kevin; Fedus, Liam; Zhou, Denny; Ippolito, Daphne; Luan, David; Lim, Hyeontaek; Zoph, Barret; Spiridonov, Alexander; Sepassi, Ryan; Dohan, David; Agrawal, Shivani; Omernick, Mark; Dai, Andrew M.; Pillai, Thanumalayan Sankaranarayana; Pellat, Marie; Lewkowycz, Aitor; Moreira, Erica; Child, Rewon; Polozov, Oleksandr; Lee, Katherine; Zhou, Zongwei; Wang, Xuezhi; Saeta, Brennan; Diaz, Mark; Firat, Orhan; Catasta, Michele; Wei, Jason; Meier-Hellstern, Kathy; Eck, Douglas; Dean, Jeff; Petrov, Slav; Fiedel, Noah (2022). "PaLM: Scaling Language Modeling with Pathways". doi:10.48550/arXiv.2204.02311.
- ↑ "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance". ai.googleblog.com. Retrieved 21 March 2023.
- ↑ Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". doi:10.48550/arXiv.2204.05862.
- ↑ Black, Sid; Biderman, Stella; Hallahan, Eric; Anthony, Quentin; Gao, Leo; Golding, Laurence; He, Horace; Leahy, Connor; McDonell, Kyle; Phang, Jason; Pieler, Michael; Prashanth, USVSN Sai; Purohit, Shivanshu; Reynolds, Laria; Tow, Jonathan; Wang, Ben; Weinbach, Samuel (2022). "GPT-NeoX-20B: An Open-Source Autoregressive Language Model". doi:10.48550/arXiv.2204.06745.
- ↑ Leahy, Connor (2 February 2022). "Announcing GPT-NeoX-20B". EleutherAI Blog. Retrieved 21 March 2023.
- ↑ "Comparing AI models : DALLE and Stable Diffusion". www.linkedin.com. Retrieved 16 October 2023.
- ↑ Howell, James (22 September 2023). "What is Dall-E and How Does it Work?". 101 Blockchains. Retrieved 16 October 2023.
- ↑ "What is Dall-E (Dall-E 2) and How Does it Work?". Enterprise AI. Retrieved 16 October 2023.
- ↑ Gonsalves, Robert A. (5 September 2023). "Exploring DALL-E for Digital Art Creation". Medium. Retrieved 16 October 2023.
- ↑ "Democratizing access to large-scale language models with OPT-175B". ai.meta.com. Retrieved 20 September 2023.
- ↑ Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; García, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Bahri, Dara; Schuster, Tal; Zheng, H.; Zhou, Denny; Houlsby, N.; Metzler, Donald (10 May 2022). "UL2: Unifying Language Learning Paradigms".
- ↑ Khrushchev, Mikhail (23 June 2022). "Yandex Publishes YaLM 100B. It's the Largest GPT-Like Neural Network in Open Source". Yandex. Retrieved 20 September 2023.
- ↑ Lewkowycz, Aitor; Andreassen, Anders; Dohan, David; Dyer, Ethan; Michalewski, Henryk; Ramasesh, Vinay; Slone, Ambrose; Anil, Cem; Schlag, Imanol; Gutman-Solo, Theo; Wu, Yuhuai; Neyshabur, Behnam; Gur-Ari, Guy; Misra, Vedant (2022). "Solving Quantitative Reasoning Problems with Language Models". doi:10.48550/arXiv.2206.14858.
- ↑ Chopra, Disha (1 July 2022). "Google Developed Minerva, an AI That Can Answer Math Questions". Analytics Drift. Retrieved 20 September 2023.
- ↑ "New AI Model Translates 200 Languages, Making Technology Accessible to More People". Meta. 6 July 2022. Retrieved 19 October 2023.
- ↑ Rodriguez, Jesus (15 August 2022). "AlexaTM 20B is Amazon's New Language Super Model Which is Also Capable of Few-Shot Learning". Medium. Retrieved 21 October 2023.
- ↑ Elemuwa, Fimber (22 February 2023). "Using CodeGeeX as a GitHub Copilot alternative". LogRocket Blog. Retrieved 19 October 2023.
- ↑ Zheng, Qinkai; Xia, Xiao; Zou, Xu; Dong, Yuxiao; Wang, Shan; Xue, Yufei; Wang, Zihan; Shen, Lei; Wang, Andi; Li, Yang; Su, Teng; Yang, Zhilin; Tang, Jie (2023). "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X". doi:10.48550/arXiv.2303.17568.
- ↑ "Could Deepmind's Sparrow be Google's answer to ChatGPT?". medium.com. Retrieved 21 October 2023.
- ↑ Su, Hui; Zhou, Xiao; Yu, Houjin; Shen, Xiaoyu; Chen, Yuwen; Zhu, Zilin; Yu, Yang; Zhou, Jie (2022). "WeLM: A Well-Read Pre-trained Language Model for Chinese". doi:10.48550/arXiv.2209.10372.
- ↑ Zeng, Aohan; Liu, Xiao; Du, Zhengxiao; Wang, Zihan; Lai, Hanyu; Ding, Ming; Yang, Zhuoyi; Xu, Yifan; Zheng, Wendi; Xia, Xiao; Tam, Weng Lam; Ma, Zixuan; Xue, Yufei; Zhai, Jidong; Chen, Wenguang; Zhang, Peng; Dong, Yuxiao; Tang, Jie (2022). "GLM-130B: An Open Bilingual Pre-trained Model". doi:10.48550/arXiv.2210.02414.
- ↑ Workshop, BigScience; Scao, Teven Le; Fan, Angela; Akiki, Christopher; Pavlick, Ellie; Ilić, Suzana; Hesslow, Daniel; Castagné, Roman; Luccioni, Alexandra Sasha; Yvon, François; Gallé, Matthias; Tow, Jonathan; Rush, Alexander M.; Biderman, Stella; Webson, Albert; Ammanamanchi, Pawan Sasanka; Wang, Thomas; Sagot, Benoît; Muennighoff, Niklas; del Moral, Albert Villanova; Ruwase, Olatunji; Bawden, Rachel; Bekman, Stas; McMillan-Major, Angelina; Beltagy, Iz; Nguyen, Huu; Saulnier, Lucile; Tan, Samson; Suarez, Pedro Ortiz; Sanh, Victor; Laurençon, Hugo; Jernite, Yacine; Launay, Julien; Mitchell, Margaret; Raffel, Colin; Gokaslan, Aaron; Simhi, Adi; Soroa, Aitor; Aji, Alham Fikri; Alfassy, Amit; Rogers, Anna; Nitzav, Ariel Kreisberg; Xu, Canwen; Mou, Chenghao; Emezue, Chris; Klamm, Christopher; Leong, Colin; van Strien, Daniel; Adelani, David Ifeoluwa; Radev, Dragomir; Ponferrada, Eduardo González; Levkovizh, Efrat; Kim, Ethan; Natan, Eyal Bar; De Toni, Francesco; Dupont, Gérard; Kruszewski, Germán; Pistilli, Giada; Elsahar, Hady; Benyamina, Hamza; Tran, Hieu; Yu, Ian; Abdulmumin, Idris; Johnson, Isaac; Gonzalez-Dios, Itziar; de la Rosa, Javier; Chim, Jenny; Dodge, Jesse; Zhu, Jian; Chang, Jonathan; Frohberg, Jörg; Tobing, Joseph; Bhattacharjee, Joydeep; Almubarak, Khalid; Chen, Kimbo; Lo, Kyle; Von Werra, Leandro; Weber, Leon; Phan, Long; allal, Loubna Ben; Tanguy, Ludovic; Dey, Manan; Muñoz, Manuel Romero; Masoud, Maraim; Grandury, María; Šaško, Mario; Huang, Max; Coavoux, Maximin; Singh, Mayank; Jiang, Mike Tian-Jian; Vu, Minh Chien; Jauhar, Mohammad A.; Ghaleb, Mustafa; Subramani, Nishant; Kassner, Nora; Khamis, Nurulaqilla; Nguyen, Olivier; Espejel, Omar; de Gibert, Ona; Villegas, Paulo; Henderson, Peter; Colombo, Pierre; Amuok, Priscilla; Lhoest, Quentin; Harliman, Rheza; Bommasani, Rishi; López, Roberto Luis; Ribeiro, Rui; Osei, Salomey; Pyysalo, Sampo; Nagel, Sebastian; Bose, Shamik; Muhammad, Shamsuddeen 
Hassan; Sharma, Shanya; Longpre, Shayne; Nikpoor, Somaieh; Silberberg, Stanislav; Pai, Suhas; Zink, Sydney; Torrent, Tiago Timponi; Schick, Timo; Thrush, Tristan; Danchev, Valentin; Nikoulina, Vassilina; Laippala, Veronika; Lepercq, Violette; Prabhu, Vrinda; Alyafeai, Zaid; Talat, Zeerak; Raja, Arun; Heinzerling, Benjamin; Si, Chenglei; Taşar, Davut Emre; Salesky, Elizabeth; Mielke, Sabrina J.; Lee, Wilson Y.; Sharma, Abheesht; Santilli, Andrea; Chaffin, Antoine; Stiegler, Arnaud; Datta, Debajyoti; Szczechla, Eliza; Chhablani, Gunjan; Wang, Han; Pandey, Harshit; Strobelt, Hendrik; Fries, Jason Alan; Rozen, Jos; Gao, Leo; Sutawika, Lintang; Bari, M. Saiful; Al-shaibani, Maged S.; Manica, Matteo; Nayak, Nihal; Teehan, Ryan; Albanie, Samuel; Shen, Sheng; Ben-David, Srulik; Bach, Stephen H.; Kim, Taewoon; Bers, Tali; Fevry, Thibault; Neeraj, Trishala; Thakker, Urmish; Raunak, Vikas; Tang, Xiangru; Yong, Zheng-Xin; Sun, Zhiqing; Brody, Shaked; Uri, Yallow; Tojarieh, Hadar; Roberts, Adam; Chung, Hyung Won; Tae, Jaesung; Phang, Jason; Press, Ofir; Li, Conglong; Narayanan, Deepak; Bourfoune, Hatim; Casper, Jared; Rasley, Jeff; Ryabinin, Max; Mishra, Mayank; Zhang, Minjia; Shoeybi, Mohammad; Peyrounette, Myriam; Patry, Nicolas; Tazi, Nouamane; Sanseviero, Omar; von Platen, Patrick; Cornette, Pierre; Lavallée, Pierre François; Lacroix, Rémi; Rajbhandari, Samyam; Gandhi, Sanchit; Smith, Shaden; Requena, Stéphane; Patil, Suraj; Dettmers, Tim; Baruwa, Ahmed; Singh, Amanpreet; Cheveleva, Anastasia; Ligozat, Anne-Laure; Subramonian, Arjun; Névéol, Aurélie; Lovering, Charles; Garrette, Dan; Tunuguntla, Deepak; Reiter, Ehud; Taktasheva, Ekaterina; Voloshina, Ekaterina; Bogdanov, Eli; Winata, Genta Indra; Schoelkopf, Hailey; Kalo, Jan-Christoph; Novikova, Jekaterina; Forde, Jessica Zosa; Clive, Jordan; Kasai, Jungo; Kawamura, Ken; Hazan, Liam; Carpuat, Marine; Clinciu, Miruna; Kim, Najoung; Cheng, Newton; Serikov, Oleg; Antverg, Omer; van der Wal, Oskar; Zhang, Rui; Zhang, Ruochen; 
Gehrmann, Sebastian; Mirkin, Shachar; Pais, Shani; Shavrina, Tatiana; Scialom, Thomas; Yun, Tian; Limisiewicz, Tomasz; Rieser, Verena; Protasov, Vitaly; Mikhailov, Vladislav; Pruksachatkun, Yada; Belinkov, Yonatan; Bamberger, Zachary; Kasner, Zdeněk; Rueda, Alice; Pestana, Amanda; Feizpour, Amir; Khan, Ammar; Faranak, Amy; Santos, Ana; Hevia, Anthony; Unldreaj, Antigona; Aghagol, Arash; Abdollahi, Arezoo; Tammour, Aycha; HajiHosseini, Azadeh; Behroozi, Bahareh; Ajibade, Benjamin; Saxena, Bharat; Ferrandis, Carlos Muñoz; Contractor, Danish; Lansky, David; David, Davis; Kiela, Douwe; Nguyen, Duong A.; Tan, Edward; Baylor, Emi; Ozoani, Ezinwanne; Mirza, Fatima; Ononiwu, Frankline; Rezanejad, Habib; Jones, Hessie; Bhattacharya, Indrani; Solaiman, Irene; Sedenko, Irina; Nejadgholi, Isar; Passmore, Jesse; Seltzer, Josh; Sanz, Julio Bonis; Dutra, Livia; Samagaio, Mairon; Elbadri, Maraim; Mieskes, Margot; Gerchick, Marissa; Akinlolu, Martha; McKenna, Michael; Qiu, Mike; Ghauri, Muhammed; Burynok, Mykola; Abrar, Nafis; Rajani, Nazneen; Elkott, Nour; Fahmy, Nour; Samuel, Olanrewaju; An, Ran; Kromann, Rasmus; Hao, Ryan; Alizadeh, Samira; Shubber, Sarmad; Wang, Silas; Roy, Sourav; Viguier, Sylvain; Le, Thanh; Oyebade, Tobi; Le, Trieu; Yang, Yoyo; Nguyen, Zach; Kashyap, Abhinav Ramesh; Palasciano, Alfredo; Callahan, Alison; Shukla, Anima; Miranda-Escalada, Antonio; Singh, Ayush; Beilharz, Benjamin; Wang, Bo; Brito, Caio; Zhou, Chenxi; Jain, Chirag; Xu, Chuxin; Fourrier, Clémentine; Periñán, Daniel León; Molano, Daniel; Yu, Dian; Manjavacas, Enrique; Barth, Fabio; Fuhrimann, Florian; Altay, Gabriel; Bayrak, Giyaseddin; Burns, Gully; Vrabec, Helena U.; Bello, Imane; Dash, Ishani; Kang, Jihyun; Giorgi, John; Golde, Jonas; Posada, Jose David; Sivaraman, Karthik Rangasai; Bulchandani, Lokesh; Liu, Lu; Shinzato, Luisa; de Bykhovetz, Madeleine Hahn; Takeuchi, Maiko; Pàmies, Marc; Castillo, Maria A.; Nezhurina, Marianna; Sänger, Mario; Samwald, Matthias; Cullan, Michael; Weinberg, 
Michael; De Wolf, Michiel; Mihaljcic, Mina; Liu, Minna; Freidank, Moritz; Kang, Myungsun; Seelam, Natasha; Dahlberg, Nathan; Broad, Nicholas Michio; Muellner, Nikolaus; Fung, Pascale; Haller, Patrick; Chandrasekhar, Ramya; Eisenberg, Renata; Martin, Robert; Canalli, Rodrigo; Su, Rosaline; Su, Ruisi; Cahyawijaya, Samuel; Garda, Samuele; Deshmukh, Shlok S.; Mishra, Shubhanshu; Kiblawi, Sid; Ott, Simon; Sang-aroonsiri, Sinee; Kumar, Srishti; Schweter, Stefan; Bharati, Sushil; Laud, Tanmay; Gigant, Théo; Kainuma, Tomoya; Kusa, Wojciech; Labrak, Yanis; Bajaj, Yash Shailesh; Venkatraman, Yash; Xu, Yifan; Xu, Yingxin; Xu, Yu; Tan, Zhe; Xie, Zhongli; Ye, Zifan; Bras, Mathilde; Belkada, Younes; Wolf, Thomas (13 March 2023). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model". arXiv:2211.05100 [cs].
- ↑ "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog". aws.amazon.com. 17 November 2022. Retrieved 20 September 2023.
- ↑ "FLAN-T5 vs. FLAN-UL2: Which LLM is Better? | Sapling". sapling.ai. Retrieved 19 October 2023.
- ↑ Subhash, Varshini (5 January 2023). "Can Large Language Models Change User Preference Adversarially?". arXiv:2302.10291 [cs]. doi:10.48550/arXiv.2302.10291.
- ↑ Joshi, Harshit; Ebenezer, Abishai; Cambronero, José; Gulwani, Sumit; Kanade, Aditya; Le, Vu; Radiček, Ivan; Verbruggen, Gust (31 January 2023). "FLAME: A small language model for spreadsheet formulas". arXiv:2301.13779 [cs]. doi:10.48550/arXiv.2301.13779.
- ↑ Zhang, Zhuosheng; Zhang, Aston; Li, Mu; Zhao, Hai; Karypis, George; Smola, Alex (2023). "Multimodal Chain-of-Thought Reasoning in Language Models". doi:10.48550/arXiv.2302.00923.
- ↑ "Vinija's Notes • Models • Toolformer". vinija.ai. Retrieved 26 June 2023.
- ↑ Schick, Timo; Dwivedi-Yu, Jane; Dessì, Roberto; Raileanu, Roberta; Lomeli, Maria; Zettlemoyer, Luke; Cancedda, Nicola; Scialom, Thomas (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools". doi:10.48550/arXiv.2302.04761.
- ↑ "Shaped". www.shaped.ai. Retrieved 16 May 2023.
- ↑ "fnlp/moss-moon-003-base · Hugging Face". huggingface.co. 20 April 2023. Retrieved 26 June 2023.
- ↑ "MOSS". txsun1997.github.io. Retrieved 11 March 2023.
- ↑ White, Jules; Fu, Quchen; Hays, Sam; Sandborn, Michael; Olea, Carlos; Gilbert, Henry; Elnashar, Ashraf; Spencer-Smith, Jesse; Schmidt, Douglas C. (21 February 2023). "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT". arXiv:2302.11382 [cs]. doi:10.48550/arXiv.2302.11382.
- ↑ "LLaMA: Open and Efficient Foundation Language Models - Meta Research". Meta Research. Retrieved 11 March 2023.
- ↑ Peng, Baolin; Galley, Michel; He, Pengcheng; Cheng, Hao; Xie, Yujia; Hu, Yu; Huang, Qiuyuan; Liden, Lars; Yu, Zhou; Chen, Weizhu; Gao, Jianfeng (1 March 2023). "Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback". arXiv:2302.12813 [cs]. doi:10.48550/arXiv.2302.12813.
- ↑ Raieli, Salvatore (13 March 2023). "SpikeGPT: a 260 M only parameters LM not afraid of competition". Medium. Retrieved 26 June 2023.
- ↑ Zhu, Rui-Jie; Zhao, Qihang; Eshraghian, Jason K. (28 February 2023). "SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks". arXiv:2302.13939 [cs]. doi:10.48550/arXiv.2302.13939.
- ↑ Al-Kaswan, Ali; Izadi, Maliheh (28 February 2023). "The (ab)use of Open Source Code to Train Large Language Models". arXiv:2302.13681 [cs]. doi:10.48550/arXiv.2302.13681.
- ↑ Kwon, Minae; Xie, Sang Michael; Bullard, Kalesha; Sadigh, Dorsa (27 February 2023). "Reward Design with Language Models". arXiv:2303.00001 [cs]. doi:10.48550/arXiv.2303.00001.
- ↑ Bastian, Matthias (3 March 2023). "Microsoft's Kosmos-1 is a multimodal step toward more general AI". THE DECODER. Retrieved 18 September 2023.
- ↑ Huang, Shaohan; Dong, Li; Wang, Wenhui; Hao, Yaru; Singhal, Saksham; Ma, Shuming; Lv, Tengchao; Cui, Lei; Mohammed, Owais Khan; Patra, Barun; Liu, Qiang; Aggarwal, Kriti; Chi, Zewen; Bjorck, Johan; Chaudhary, Vishrav; Som, Subhojit; Song, Xia; Wei, Furu (1 March 2023). "Language Is Not All You Need: Aligning Perception with Language Models". arXiv:2302.14045 [cs]. doi:10.48550/arXiv.2302.14045.
- ↑ Cao, Meng; Fatemi, Mehdi; Cheung, Jackie Chi Kit; Shabanian, Samira (27 February 2023). "Systematic Rectification of Language Models via Dead-end Analysis". arXiv:2302.14003 [cs]. doi:10.48550/arXiv.2302.14003.
- ↑ Bertolini, Lorenzo; Elce, Valentina; Michalak, Adriana; Bernardi, Giulio; Weeds, Julie (28 February 2023). "Automatic Scoring of Dream Reports' Emotional Content with Large Language Models". arXiv:2302.14828 [cs]. doi:10.48550/arXiv.2302.14828.
- ↑ Ye, Seonghyeon; Hwang, Hyeonbin; Yang, Sohee; Yun, Hyeongu; Kim, Yireun; Seo, Minjoon (28 February 2023). "In-Context Instruction Learning". arXiv:2302.14691 [cs]. doi:10.48550/arXiv.2302.14691.
- ↑ Houghton, Conor; Kazanina, Nina; Sukumaran, Priyanka (28 February 2023). "Beyond the limitations of any imaginable mechanism: large language models and psycholinguistics". arXiv:2303.00077 [cs]. doi:10.48550/arXiv.2303.00077. Retrieved 10 March 2023.
- ↑ Yuan, Yang (2023). "Succinct Representations for Concepts". doi:10.48550/arXiv.2303.00446.
- ↑ "A New Open Source Flan 20B with UL2". Yi Tay. Retrieved 30 June 2023.
- ↑ Zhang, Bowen; Soh, Harold (6 March 2023). "Large Language Models as Zero-Shot Human Models for Human-Robot Interaction". arXiv:2303.03548 [cs]. doi:10.48550/arXiv.2303.03548.
- ↑ "Stanford CRFM". crfm.stanford.edu. Retrieved 21 March 2023.
- ↑ "Announcement of Jurassic-2 and Task-Specific APIs". Data Phoenix. 12 March 2023. Retrieved 21 September 2023.
- ↑ "Large language model: Revision history - Wikipedia". en.wikipedia.org. Retrieved 21 September 2023.
- ↑ "How does Claude, the new LLM from Anthropic compare to ChatGPT? A serious contender". www.cerebrium.ai. Retrieved 18 September 2023.
- ↑ "Introducing Claude". Anthropic. Retrieved 30 June 2023.
- ↑ "Falcon LLM: Abu Dhabi's Based TII Latest AI Breakthrough for Next-Gen Solutions". www.tii.ae. 6 September 2023. Retrieved 20 September 2023.
- ↑ "GPT-NeoX". huggingface.co. Retrieved 20 March 2023.
- ↑ Lubbad, Mohammed (7 August 2023). "The Ultimate Guide to GPT-4 Parameters: Everything You Need to Know about NLP's Game-Changer". Medium. Retrieved 19 September 2023.
- ↑ "GPT-4 Technical Report". 2023. doi:10.48550/arXiv.2303.08774.
- ↑ Ren, Xiaozhe; Zhou, Pingyi; Meng, Xinfan; Huang, Xinjing; Wang, Yadao; Wang, Weichao; Li, Pengfei; Zhang, Xiaoda; Podolskiy, Alexander; Arshinov, Grigory; Bout, Andrey; Piontkovskaya, Irina; Wei, Jiansheng; Jiang, Xin; Su, Teng; Liu, Qun; Yao, Jun (2023). "PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing". doi:10.48550/arXiv.2303.10845.
- ↑ Tickoo, Aneesh (10 July 2023). "Huawei Researchers Develop Pangu-Σ: A Large Language Model With Sparse Architecture And 1.085 Trillion Parameters". MarkTechPost. Retrieved 16 October 2023.
- ↑ "ChatGLM-6B". github.com. THUDM. 30 June 2023. Retrieved 30 June 2023.
- ↑ Eloundou, Tyna; Manning, Sam; Mishkin, Pamela; Rock, Daniel (2023). "GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models". doi:10.48550/arXiv.2303.10130.
- ↑ "Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM". Databricks. 12 April 2023. Retrieved 19 September 2023.
- ↑ "Hello Dolly: Democratizing the magic of ChatGPT with open models". Databricks. 24 March 2023. Retrieved 19 June 2023.
- ↑ Dey, Nolan (28 March 2023). "Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models". Cerebras. Retrieved 20 September 2023.
- ↑ "BloombergGPT: The 50 Billion Parameter Large Language Model for Finance". Medium. 8 April 2023. Retrieved 20 September 2023.
- ↑ Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew; Schuhmann, Christoph; Nguyen, Huu; Mattick, Alexander (2023). "OpenAssistant Conversations -- Democratizing Large Language Model Alignment". doi:10.48550/arXiv.2304.07327.
- ↑ Roth, Emma (19 April 2023). "Stability AI announces new open-source large language model". The Verge. Retrieved 9 May 2023.
- ↑ "Stability AI Launches the First of its StableLM Suite of Language Models". Stability AI. Retrieved 9 May 2023.
- ↑ Xu, Can; Sun, Qingfeng; Zheng, Kai; Geng, Xiubo; Zhao, Pu; Feng, Jiazhan; Tao, Chongyang; Jiang, Daxin (2023). "WizardLM: Empowering Large Language Models to Follow Complex Instructions". doi:10.48550/arXiv.2304.12244.
- ↑ "Salesforce/codegen2-16B · Hugging Face". huggingface.co. Retrieved 20 October 2023.
- ↑ Nijkamp, Erik; Hayashi, Hiroaki; Xiong, Caiming; Savarese, Silvio; Zhou, Yingbo (2023). "CodeGen2: Lessons for Training LLMs on Programming and Natural Languages". doi:10.48550/arXiv.2305.02309.
- ↑ Schwartz, Barry (12 May 2023). "Bing Chat gains image answers with knowledge cards and optimized answers". Search Engine Land. Retrieved 16 May 2023.
- ↑ "How to Access PaLM 2 AND TRY IT". MLYearning. 15 May 2023. Retrieved 16 May 2023.
- ↑ Hern, Alex (10 May 2023). "Google launches new AI PaLM 2 in attempt to regain leadership of the pack". The Guardian. Retrieved 16 May 2023.
- ↑ Wang, Wenhai; Chen, Zhe; Chen, Xiaokang; Wu, Jiannan; Zhu, Xizhou; Zeng, Gang; Luo, Ping; Lu, Tong; Zhou, Jie; Qiao, Yu; Dai, Jifeng (2023). "VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks". doi:10.48550/arXiv.2305.11175.
- ↑ Xu, Canwen; Guo, Daya; Duan, Nan; McAuley, Julian (2023). "Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data". doi:10.48550/arXiv.2304.01196.
- ↑ Patil, Shishir G.; Zhang, Tianjun; Wang, Xin; Gonzalez, Joseph E. (2023). "Gorilla: Large Language Model Connected with Massive APIs". doi:10.48550/arXiv.2305.15334.
- ↑ "UC Berkeley Researchers Open-Source API-Calling Language Model Gorilla". InfoQ. Retrieved 15 October 2023.
- ↑ Ko, Hyunwoong; Yang, Kichang; Ryu, Minho; Choi, Taekyoon; Yang, Seungmu; Hyun, Jiwung; Park, Sungho; Park, Kyubyong (2023). "A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models". doi:10.48550/arXiv.2306.02254.
- ↑ Yang, Hongyang; Liu, Xiao-Yang; Wang, Christina Dan (2023). "FinGPT: Open-Source Financial Large Language Models". doi:10.48550/arXiv.2306.06031.
- ↑ Tăiatu, Iulian-Marius; Avram, Andrei-Marius; Cercel, Dumitru-Clementin; Pop, Florin (2023). "RoBERTweet: A BERT Language Model for Romanian Tweets". doi:10.48550/arXiv.2306.06598.
- ↑ Gao, Difei; Ji, Lei; Zhou, Luowei; Lin, Kevin Qinghong; Chen, Joya; Fan, Zihan; Shou, Mike Zheng (2023). "AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn". doi:10.48550/arXiv.2306.08640.
- ↑ Feng, Xidong; Luo, Yicheng; Wang, Ziyan; Tang, Hongrui; Yang, Mengyue; Shao, Kun; Mguni, David; Du, Yali; Wang, Jun (2023). "ChessGPT: Bridging Policy Learning and Language Modeling". doi:10.48550/arXiv.2306.09200.
- ↑ Sun, Yuqian; Li, Xingyu; Gao, Ze (2023). "Inspire creativity with ORIBA: Transform Artists' Original Characters into Chatbots through Large Language Model". doi:10.48550/arXiv.2306.09776.
- ↑ "Your job is (probably) safe from artificial intelligence". The Economist. 7 May 2023. Retrieved 18 June 2023.
- ↑ "Generative AI Could Raise Global GDP by 7%". Goldman Sachs. Retrieved 18 June 2023.
- ↑ Dickson, Ben (19 June 2023). "ChatGPT will make the web toxic for its successors - TechTalks". bdtechtalks.com. Retrieved 18 July 2023.
- ↑ "ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases". arxiv.org. Retrieved 29 June 2023.
- ↑ Liao, Rita (11 July 2023). "China's search engine pioneer unveils open source large language model to rival OpenAI". TechCrunch. Retrieved 16 July 2023.
- ↑ Watson, Clare (9 September 2023). "Scientists Devised a Way to Tell if ChatGPT Becomes Aware of Itself". ScienceAlert. Retrieved 17 September 2023.
- ↑ "Alibaba launches its ChatGPT-like AI model for public use amid loosening restrictions in China". Cointelegraph. 13 September 2023. Retrieved 17 September 2023.
- ↑ "Google DeepMind Gemini". Dr Alan D. Thompson – Life Architect. 20 May 2023. Retrieved 18 September 2023.
- ↑ Jones, Luke (9 October 2023). "Microsoft Researchers Develop Unlearning Technique for Large Language Models". WinBuzzer. Retrieved 9 October 2023.
- ↑ "Announcing Grok". x.ai. Retrieved 3 September 2024.
- ↑ "Claude 2.1 Introduces 200K Context Window". zeniteq.com. Retrieved 3 September 2024.
- ↑ "Introducing Claude 2.1". anthropic.com. Retrieved 3 September 2024.
- ↑ "Our next-generation model: Gemini 1.5". Google Blog. 15 February 2024. Retrieved 10 December 2025.
- ↑ "Introducing the next generation of Claude". Anthropic. 4 March 2024. Retrieved 9 December 2025.
- ↑ "Cohere releases Command R+ AI model designed for enterprise-scale use". SiliconANGLE. 4 April 2024. Retrieved 11 December 2025.
- ↑ "Introducing Meta Llama 3: The most capable openly available LLM to date". Meta AI. 18 April 2024. Retrieved 11 December 2025.
- ↑ "The Phi-3 small language models with big potential". Microsoft News. Microsoft. Retrieved 11 December 2025.
- ↑ "OpenAI Unveils GPT-4o: How does it affect us?". Teneo.ai. Retrieved 11 December 2025.
- ↑ "Claude 3.5 Sonnet". Anthropic. Retrieved 11 December 2025.
- ↑ "Mistral NeMo". Mistral AI. Retrieved 11 December 2025.
- ↑ "Mistral Large 2". Mistral AI. Retrieved 11 December 2025.
- ↑ "Grok-2 Beta Release". xAI. Retrieved 11 December 2025.
- ↑ "Alibaba releases Qwen 2.5 models, raising the bar for open-weight LLMs". DeepLearning.AI. Retrieved 11 December 2025.
- ↑ "Qwen 2.5 LLM". QwenLM Blog. Retrieved 11 December 2025.
- ↑ "DeepSeek-V3.2 Release". DeepSeek API Docs. Retrieved 11 December 2025.
- ↑ "Introducing Gemini 2.0: our new AI model for the agentic era". Google Blog. 11 December 2024. Retrieved 11 December 2025.
- ↑ "DeepSeek-R1 Release". DeepSeek API Docs. Retrieved 11 December 2025.
- ↑ "Claude 3.7 Sonnet and Claude Code". Anthropic. Retrieved 11 December 2025.
- ↑ "Introducing GPT-4.5". OpenAI. Retrieved 11 December 2025.
- ↑ "Introducing Gemma 3: The most capable model you can run on a single GPU or TPU". Google Blog. Retrieved 11 December 2025.
- ↑ "Introducing Gemini 2.5: Our most intelligent AI model". Google Blog. 25 March 2025. Retrieved 11 December 2025.
- ↑ "The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation". Meta AI. Meta. 5 April 2025. Retrieved 11 December 2025.
- ↑ "Introducing Claude 4". Anthropic. Retrieved 11 December 2025.
- ↑ "Grok 4". x.ai. xAI. Retrieved 13 December 2025.
- ↑ "Apertus". Swiss AI Initiative. Swiss AI. Retrieved 13 December 2025.
- ↑ Template:Cite arXiv
- ↑ "GPT-5.1: A smarter, more conversational ChatGPT". OpenAI. Retrieved 13 December 2025.
- ↑ "Introducing GPT-5.2". openai.com. OpenAI. Retrieved 13 December 2025.
- ↑ "Wikipedia Views: results". wikipediaviews.org. Retrieved 21 September 2023.
- ↑ "Google Trends". Google Trends. Retrieved 21 September 2023.