Difference between revisions of "Timeline of large language models"
(44 intermediate revisions by 2 users not shown) | |||
Line 32: | Line 32: | ||
| 1970s-2000s || Incremental progress || These decades see incremental progress. Researchers experiment with conceptual ontologies and rule-based systems in NLP. In the 1990s, the emergence of deep learning, a form of machine learning employing neural networks for data processing, enables the development of increasingly sophisticated language models. The introduction of Long Short-Term Memory (LSTM) networks in 1997 enables the development of deeper neural networks capable of handling larger datasets. Additionally, tools like Stanford’s CoreNLP suite, introduced in 2010, provide algorithms for complex NLP tasks such as sentiment analysis and named entity recognition. Google Brain’s launch in 2011, offering advanced resources and features like word embeddings, further propels the field.<ref name="omegavp.comv"/> | | 1970s-2000s || Incremental progress || These decades see incremental progress. Researchers experiment with conceptual ontologies and rule-based systems in NLP. In the 1990s, the emergence of deep learning, a form of machine learning employing neural networks for data processing, enables the development of increasingly sophisticated language models. The introduction of Long Short-Term Memory (LSTM) networks in 1997 enables the development of deeper neural networks capable of handling larger datasets. Additionally, tools like Stanford’s CoreNLP suite, introduced in 2010, provide algorithms for complex NLP tasks such as sentiment analysis and named entity recognition. Google Brain’s launch in 2011, offering advanced resources and features like word embeddings, further propels the field.<ref name="omegavp.comv"/> | ||
|- | |- | ||
− | | 2010s onwards || Rise of large language models || In the 2010s, the landscape of language processing transforms dramatically. The introduction of Transformer models in 2017 revolutionizes NLP. This architecture allows for the creation of Large Language Models (LLMs) capable of understanding context and generating human-like text. From 2019 onwards, the rise of Large Language Models | + | | 2010s onwards || Rise of large language models || In the 2010s, the landscape of language processing transforms dramatically. The introduction of Transformer models in 2017 revolutionizes NLP. This architecture allows for the creation of Large Language Models (LLMs) capable of understanding context and generating human-like text. From 2019 onwards, the rise of Large Language Models gains momentum with the introduction of models like GPT-2, GPT-3, and T5. These models can perform diverse tasks, driving a paradigm shift in AI capabilities. They become emblematic, serving as foundations for various applications, including ChatGPT.<ref name="Brief History of Large"/> Recent years also witness the emergence of user-friendly frameworks, such as Hugging Face and BARD, empowering researchers and developers to create their own LLMs seamlessly.<ref name=llm>{{cite web |title=Large Language Model Training in 2023 |url=https://research.aimultiple.com/large-language-model-training/ |website=research.aimultiple.com |access-date=11 March 2023}}</ref><ref name="Pathak"/> |
|- | |- | ||
|} | |} | ||
Line 55: | Line 55: | ||
| 2017 || || || || || Early development || [[w:Transformer (machine learning model)|Transformer models]] are introduced. This innovative architecture, enabled by {{w|Google Brain}}'s pioneering work, would revolutionize {{w|natural language processing}}. Transformers allow for the creation of larger and more sophisticated LLMs, including OpenAI’s GPT-3 (Generative Pre-Trained Transformer). These models would become foundational, serving as the basis for applications like {{w|ChatGPT}} and numerous other AI-driven innovations. The introduction of Transformers ushers in a new era of highly capable and versatile language processing systems.<ref name="Pathak"/><ref name="Brief History of Large"/> | | 2017 || || || || || Early development || [[w:Transformer (machine learning model)|Transformer models]] are introduced. This innovative architecture, enabled by {{w|Google Brain}}'s pioneering work, would revolutionize {{w|natural language processing}}. Transformers allow for the creation of larger and more sophisticated LLMs, including OpenAI’s GPT-3 (Generative Pre-Trained Transformer). These models would become foundational, serving as the basis for applications like {{w|ChatGPT}} and numerous other AI-driven innovations. The introduction of Transformers ushers in a new era of highly capable and versatile language processing systems.<ref name="Pathak"/><ref name="Brief History of Large"/> | ||
|- | |- | ||
− | | 2018 || October 11 || BERT || 340,000,000<ref>{{cite journal |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=2018 |doi=10.48550/arXiv.1810.04805 |url=}}</ref> || 3,300,000,000<ref>{{cite web |title=BERT 101 - State Of The Art NLP Model Explained |url=https://huggingface.co/blog/bert-101 |website=huggingface.co |access-date=16 October 2023}}</ref> || LLM launch || Google researchers unveil BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language model. BERT's bidirectional design enables it to consider the context on both the left and the right of each token, enhancing its understanding of language nuances. Employing a consistent-width neural network, BERT adapts to diverse tasks. Pre-trained on extensive unstructured data, it comprehensively grasps word relationships. BERT's simplicity and effectiveness make it accessible to researchers and practitioners, allowing fine-tuning for various tasks with minimal adjustments. Upon its release, BERT sets unprecedented records in NLP benchmark tests, swiftly becoming the industry standard. 
Within 18 months, it would power the majority of English queries processed by {{w|Google Search}}.<ref name="Snorkel AI"/><ref>{{cite journal |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=2018 |doi=10.48550/arXiv.1810.04805}}</ref><ref name="Brief History of Large"/> | + | | 2018 || October 11 || BERT || 340,000,000<ref>{{cite journal |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=2018 |doi=10.48550/arXiv.1810.04805 |url=}}</ref> || 3,300,000,000 words<ref>{{cite web |title=BERT 101 - State Of The Art NLP Model Explained |url=https://huggingface.co/blog/bert-101 |website=huggingface.co |access-date=16 October 2023}}</ref> || LLM launch || Google researchers unveil BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language model. BERT's bidirectional design enables it to consider the context on both the left and the right of each token, enhancing its understanding of language nuances. Employing a consistent-width neural network, BERT adapts to diverse tasks. Pre-trained on extensive unstructured data, it comprehensively grasps word relationships. BERT's simplicity and effectiveness make it accessible to researchers and practitioners, allowing fine-tuning for various tasks with minimal adjustments. Upon its release, BERT sets unprecedented records in NLP benchmark tests, swiftly becoming the industry standard. 
Within 18 months, it would power the majority of English queries processed by {{w|Google Search}}.<ref name="Snorkel AI"/><ref>{{cite journal |last1=Devlin |first1=Jacob |last2=Chang |first2=Ming-Wei |last3=Lee |first3=Kenton |last4=Toutanova |first4=Kristina |title=BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding |date=2018 |doi=10.48550/arXiv.1810.04805}}</ref><ref name="Brief History of Large"/> |
|- | |- | ||
| 2019 || May 29 || GROVER || || || LLM launch || A team of researchers from the {{w|University of Washington}} and [[w:Allen Institute for AI|Allen Institute for AI Research]] introduce GROVER, a language model similar to GPT-2. However, they do not make the larger versions of the model publicly available.<ref name="GPT-2: 6-month follow-up">{{cite web |title=GPT-2: 6-month follow-up |url=https://openai.com/research/gpt-2-6-month-follow-up |website=openai.com |access-date=23 March 2023}}</ref> Their publication discusses the potential risks of natural language generation technology and the need for robust defenses against neural fake news. Grover can generate realistic news articles that are difficult to distinguish from real news. They also explore the effectiveness of current methods for detecting fake news and find that the best defense against Grover is itself, with 92% accuracy. The article concludes by discussing the ethical issues related to the technology and the importance of public release of strong generators to facilitate better detection of neural fake news.<ref>{{cite journal |last1=Zellers |first1=Rowan |last2=Holtzman |first2=Ari |last3=Rashkin |first3=Hannah |last4=Bisk |first4=Yonatan |last5=Farhadi |first5=Ali |last6=Roesner |first6=Franziska |last7=Choi |first7=Yejin |title=Defending Against Neural Fake News |date=2019 |doi=10.48550/arXiv.1905.12616}}</ref> | | 2019 || May 29 || GROVER || || || LLM launch || A team of researchers from the {{w|University of Washington}} and [[w:Allen Institute for AI|Allen Institute for AI Research]] introduce GROVER, a language model similar to GPT-2. 
However, they do not make the larger versions of the model publicly available.<ref name="GPT-2: 6-month follow-up">{{cite web |title=GPT-2: 6-month follow-up |url=https://openai.com/research/gpt-2-6-month-follow-up |website=openai.com |access-date=23 March 2023}}</ref> Their publication discusses the potential risks of natural language generation technology and the need for robust defenses against neural fake news. Grover can generate realistic news articles that are difficult to distinguish from real news. They also explore the effectiveness of current methods for detecting fake news and find that the best defense against Grover is itself, with 92% accuracy. The article concludes by discussing the ethical issues related to the technology and the importance of public release of strong generators to facilitate better detection of neural fake news.<ref>{{cite journal |last1=Zellers |first1=Rowan |last2=Holtzman |first2=Ari |last3=Rashkin |first3=Hannah |last4=Bisk |first4=Yonatan |last5=Farhadi |first5=Ali |last6=Roesner |first6=Franziska |last7=Choi |first7=Yejin |title=Defending Against Neural Fake News |date=2019 |doi=10.48550/arXiv.1905.12616}}</ref> | ||
|- | |- | ||
− | | 2019 || June 19 || XLNet || ~340,000,000<ref>{{cite web |title=BERT, RoBERTa, DistilBERT, XLNet: Which one to use? |url=https://www.kdnuggets.com/2019/09/bert-roberta-distilbert-xlnet-one-use.html |website=KDnuggets |access-date=29 June 2023}}</ref> || 130,000,000,000<ref>{{cite web |last1=Ph.D |first1=Suleiman Khan |title=BERT, RoBERTa, DistilBERT, XLNet — which one to use? |url=https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8#:~:text=XLNet%20was%20trained%20with%20over,e%20much%20larger%20than%20BERT. |website=Medium |access-date=16 October 2023 |language=en |date=18 May 2021}}</ref> || LLM launch || XLNet is introduced as a generalized autoregressive pretraining method for language understanding. Unlike BERT, which relies on masking input tokens, XLNet considers all permutations of the factorization order to model bidirectional contexts. This approach overcomes the limitations of BERT and improves pretrain-finetune consistency. XLNet incorporates ideas from Transformer-XL, an autoregressive model, into its pretraining process. In empirical evaluations across 20 tasks, XLNet outperforms BERT by a significant margin, including question answering, natural language inference, sentiment analysis, and document ranking.<ref>{{cite journal |last1=Yang |first1=Zhilin |last2=Dai |first2=Zihang |last3=Yang |first3=Yiming |last4=Carbonell |first4=Jaime |last5=Salakhutdinov |first5=Ruslan |last6=Le |first6=Quoc V. |title=XLNet: Generalized Autoregressive Pretraining for Language Understanding |date=2019 |doi=10.48550/arXiv.1906.08237}}</ref> | + | | 2019 || June 19 || XLNet || ~340,000,000<ref>{{cite web |title=BERT, RoBERTa, DistilBERT, XLNet: Which one to use? |url=https://www.kdnuggets.com/2019/09/bert-roberta-distilbert-xlnet-one-use.html |website=KDnuggets |access-date=29 June 2023}}</ref> || 130,000,000,000 bytes<ref>{{cite web |last1=Ph.D |first1=Suleiman Khan |title=BERT, RoBERTa, DistilBERT, XLNet — which one to use? 
|url=https://towardsdatascience.com/bert-roberta-distilbert-xlnet-which-one-to-use-3d5ab82ba5f8#:~:text=XLNet%20was%20trained%20with%20over,e%20much%20larger%20than%20BERT. |website=Medium |access-date=16 October 2023 |language=en |date=18 May 2021}}</ref> || LLM launch || XLNet is introduced as a generalized autoregressive pretraining method for language understanding. Unlike BERT, which relies on masking input tokens, XLNet considers all permutations of the factorization order to model bidirectional contexts. This approach overcomes the limitations of BERT and improves pretrain-finetune consistency. XLNet incorporates ideas from Transformer-XL, an autoregressive model, into its pretraining process. In empirical evaluations across 20 tasks, XLNet outperforms BERT by a significant margin, including question answering, natural language inference, sentiment analysis, and document ranking.<ref>{{cite journal |last1=Yang |first1=Zhilin |last2=Dai |first2=Zihang |last3=Yang |first3=Yiming |last4=Carbonell |first4=Jaime |last5=Salakhutdinov |first5=Ruslan |last6=Le |first6=Quoc V. |title=XLNet: Generalized Autoregressive Pretraining for Language Understanding |date=2019 |doi=10.48550/arXiv.1906.08237}}</ref> |
|- | |- | ||
− | | 2019 || July 26 || RoBERTa || 123,000,000–354,000,000<ref name="Factored">{{cite web |last1=G |first1=Juan |title=An Intuitive Explanation of Transformer-Based Models |url=https://factored.ai/transformer-based-language-models/ |website=Factored {{!}} Machine Learning, Data Engineering and Data Analytics Company |access-date=16 October 2023 |date=21 September 2021}}</ref> || 160,000,000,000<ref>{{cite web |title=Overview of ROBERTa model |url=https://www.geeksforgeeks.org/overview-of-roberta-model/ |website=GeeksforGeeks |access-date=16 October 2023 |date=24 November 2020}}</ref> || LLM launch || Researchers introduce "RoBERTa: A Robustly Optimized BERT Pretraining Approach," after conducting a replication study of BERT pretraining (Devlin et al., 2019) to evaluate the impact of key hyperparameters and training data size on performance. They find that BERT was undertrained and demonstrate that it can achieve or surpass the performance of subsequent models. The authors achieve state-of-the-art results on GLUE, RACE, and SQuAD benchmarks, highlighting the significance of overlooked design choices and questioning the origins of recently reported improvements.<ref>{{cite journal |last1=Liu |first1=Yinhan |last2=Ott |first2=Myle |last3=Goyal |first3=Naman |last4=Du |first4=Jingfei |last5=Joshi |first5=Mandar |last6=Chen |first6=Danqi |last7=Levy |first7=Omer |last8=Lewis |first8=Mike |last9=Zettlemoyer |first9=Luke |last10=Stoyanov |first10=Veselin |title=RoBERTa: A Robustly Optimized BERT Pretraining Approach |date=2019 |doi=10.48550/arXiv.1907.11692}}</ref> | + | | 2019 || July 26 || RoBERTa || 123,000,000–354,000,000<ref name="Factored">{{cite web |last1=G |first1=Juan |title=An Intuitive Explanation of Transformer-Based Models |url=https://factored.ai/transformer-based-language-models/ |website=Factored {{!}} Machine Learning, Data Engineering and Data Analytics Company |access-date=16 October 2023 |date=21 September 2021}}</ref> || 160,000,000,000 
bytes<ref>{{cite web |title=Overview of ROBERTa model |url=https://www.geeksforgeeks.org/overview-of-roberta-model/ |website=GeeksforGeeks |access-date=16 October 2023 |date=24 November 2020}}</ref> || LLM launch || Researchers introduce "RoBERTa: A Robustly Optimized BERT Pretraining Approach," after conducting a replication study of BERT pretraining (Devlin et al., 2019) to evaluate the impact of key hyperparameters and training data size on performance. They find that BERT was undertrained and demonstrate that it can achieve or surpass the performance of subsequent models. The authors achieve state-of-the-art results on GLUE, RACE, and SQuAD benchmarks, highlighting the significance of overlooked design choices and questioning the origins of recently reported improvements.<ref>{{cite journal |last1=Liu |first1=Yinhan |last2=Ott |first2=Myle |last3=Goyal |first3=Naman |last4=Du |first4=Jingfei |last5=Joshi |first5=Mandar |last6=Chen |first6=Danqi |last7=Levy |first7=Omer |last8=Lewis |first8=Mike |last9=Zettlemoyer |first9=Luke |last10=Stoyanov |first10=Veselin |title=RoBERTa: A Robustly Optimized BERT Pretraining Approach |date=2019 |doi=10.48550/arXiv.1907.11692}}</ref> |
|- | |- | ||
| 2019 || August || Megatron-LM || 8,300,000,000 || 174,000,000,000 bytes<ref name="Dr Alan D. Thompson">{{cite web |title=AI: Megatron the Transformer, and its related language models |url=https://lifearchitect.ai/megatron/#megatron-lm |website=Dr Alan D. Thompson – Life Architect |access-date=16 October 2023 |language=en-AU |date=24 September 2021}}</ref> || LLM launch || NVIDIA introduces Megatron-LM<ref>{{cite web |title=Megatron Unleashed: NVIDIA's NLP Model "Megatron-LM" is the Largest Transformer Ever Trained {{!}} Exxact Blog |url=https://www.exxactcorp.com/blog/Deep-Learning/megatron-unleashed-nvidia-s-nlp-model-megatron-lm-is-the-largest-transformer-ever-trained |website=www.exxactcorp.com |access-date=11 March 2023 |language=en}}</ref>, which has 8.3 billion parameters and is trained on 512 GPUs using a combination of model and data parallelism. In the same announcement, NVIDIA also reports training BERT-Large in 53 minutes. Megatron-LM's training data is sourced from diverse places, including Wikipedia, OpenWebText, RealNews, and CC-Stories, with a combined dataset size of 174 gigabytes. This model represents a significant milestone in the development of large-scale language models, highlighting the capabilities of modern hardware and data processing in the field of natural language processing.<ref name="lifearchitect.ai"/><ref>{{cite web |title=NeMo Megatron — NVIDIA NeMo |url=https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/megatron.html?nvid=nv-int-tblg-592611#:~:text=Megatron%2DLM%20is%20a%20highly,in%20NeMo%20for%20downstream%20tasks. 
|website=docs.nvidia.com |access-date=11 March 2023}}</ref><ref>{{cite web |title=Nvidia trains world’s largest Transformer-based language model |url=https://venturebeat.com/ai/nvidia-trains-worlds-largest-transformer-based-language-model/ |website=VentureBeat |access-date=18 September 2023 |date=13 August 2019}}</ref> | | 2019 || August || Megatron-LM || 8,300,000,000 || 174,000,000,000 bytes<ref name="Dr Alan D. Thompson">{{cite web |title=AI: Megatron the Transformer, and its related language models |url=https://lifearchitect.ai/megatron/#megatron-lm |website=Dr Alan D. Thompson – Life Architect |access-date=16 October 2023 |language=en-AU |date=24 September 2021}}</ref> || LLM launch || NVIDIA introduces Megatron-LM<ref>{{cite web |title=Megatron Unleashed: NVIDIA's NLP Model "Megatron-LM" is the Largest Transformer Ever Trained {{!}} Exxact Blog |url=https://www.exxactcorp.com/blog/Deep-Learning/megatron-unleashed-nvidia-s-nlp-model-megatron-lm-is-the-largest-transformer-ever-trained |website=www.exxactcorp.com |access-date=11 March 2023 |language=en}}</ref>, which has 8.3 billion parameters and is trained on 512 GPUs using a combination of model and data parallelism. In the same announcement, NVIDIA also reports training BERT-Large in 53 minutes. Megatron-LM's training data is sourced from diverse places, including Wikipedia, OpenWebText, RealNews, and CC-Stories, with a combined dataset size of 174 gigabytes. This model represents a significant milestone in the development of large-scale language models, highlighting the capabilities of modern hardware and data processing in the field of natural language processing.<ref name="lifearchitect.ai"/><ref>{{cite web |title=NeMo Megatron — NVIDIA NeMo |url=https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nlp/megatron.html?nvid=nv-int-tblg-592611#:~:text=Megatron%2DLM%20is%20a%20highly,in%20NeMo%20for%20downstream%20tasks. 
|website=docs.nvidia.com |access-date=11 March 2023}}</ref><ref>{{cite web |title=Nvidia trains world’s largest Transformer-based language model |url=https://venturebeat.com/ai/nvidia-trains-worlds-largest-transformer-based-language-model/ |website=VentureBeat |access-date=18 September 2023 |date=13 August 2019}}</ref> | ||
Line 73: | Line 73: | ||
| 2019 || November 1 || DialoGPT || 1,500,000,000<ref>{{cite web |last1=Kuzman |first1=Taja |title=Microsoft introduced its DialoGPT to Skype and Edge |url=https://medium.com/@taja.kuzman/microsoft-introduced-its-dialogpt-to-skype-and-edge-4e1d7b694bfe |website=Medium |access-date=19 September 2023 |language=en |date=29 March 2023}}</ref> || || LLM launch || DialoGPT is introduced as a large, adaptable neural model for generating conversational responses. It is trained on 147 million conversation-like exchanges from {{w|Reddit}} comment chains spanning 2005 to 2017. DialoGPT, an extension of the Hugging Face PyTorch transformer, achieves performance close to human-level evaluation in single-turn dialogues. It outperforms strong baseline systems by generating more relevant, meaningful, and contextually consistent responses. The pre-trained model and training pipeline are publicly available, encouraging research in neural response generation and the advancement of intelligent open-domain dialogue systems.<ref>{{cite journal |last1=Zhang |first1=Yizhe |last2=Sun |first2=Siqi |last3=Galley |first3=Michel |last4=Chen |first4=Yen-Chun |last5=Brockett |first5=Chris |last6=Gao |first6=Xiang |last7=Gao |first7=Jianfeng |last8=Liu |first8=Jingjing |last9=Dolan |first9=Bill |title=DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation |date=2019 |doi=10.48550/arXiv.1911.00536}}</ref> | | 2019 || November 1 || DialoGPT || 1,500,000,000<ref>{{cite web |last1=Kuzman |first1=Taja |title=Microsoft introduced its DialoGPT to Skype and Edge |url=https://medium.com/@taja.kuzman/microsoft-introduced-its-dialogpt-to-skype-and-edge-4e1d7b694bfe |website=Medium |access-date=19 September 2023 |language=en |date=29 March 2023}}</ref> || || LLM launch || DialoGPT is introduced as a large, adaptable neural model for generating conversational responses. 
It is trained on 147 million conversation-like exchanges from {{w|Reddit}} comment chains spanning 2005 to 2017. DialoGPT, an extension of the Hugging Face PyTorch transformer, achieves performance close to human-level evaluation in single-turn dialogues. It outperforms strong baseline systems by generating more relevant, meaningful, and contextually consistent responses. The pre-trained model and training pipeline are publicly available, encouraging research in neural response generation and the advancement of intelligent open-domain dialogue systems.<ref>{{cite journal |last1=Zhang |first1=Yizhe |last2=Sun |first2=Siqi |last3=Galley |first3=Michel |last4=Chen |first4=Yen-Chun |last5=Brockett |first5=Chris |last6=Gao |first6=Xiang |last7=Gao |first7=Jianfeng |last8=Liu |first8=Jingjing |last9=Dolan |first9=Bill |title=DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation |date=2019 |doi=10.48550/arXiv.1911.00536}}</ref> | ||
|- | |- | ||
− | | 2019 || November 10 || CamemBERT || 110,000,000<ref name="Pretrained models">{{cite web |title=Pretrained models — transformers 2.10.0 documentation |url=https://huggingface.co/transformers/v2.10.0/pretrained_models.html |website=huggingface.co}}</ref><ref name="FlauBERT"/> || | + | | 2019 || November 10 || CamemBERT || 110,000,000<ref name="Pretrained models">{{cite web |title=Pretrained models — transformers 2.10.0 documentation |url=https://huggingface.co/transformers/v2.10.0/pretrained_models.html |website=huggingface.co}}</ref><ref name="FlauBERT"/> || 138,000,000,000 bytes<ref name="FlauBERT"/> || LLM launch || A paper introduces CamemBERT, a monolingual Transformer-based language model trained specifically for French. It addresses the limited practical use of pretrained models in languages other than English. The authors evaluate CamemBERT on various tasks including part-of-speech tagging, dependency parsing, named entity recognition, and natural language inference. They find that using web crawled data is preferable to {{w|Wikipedia}} data. Surprisingly, even with a relatively small web crawled dataset of 4GB, CamemBERT achieves results on par with or better than models trained on larger datasets of over 130GB. In fact, CamemBERT outperforms the state-of-the-art models in all four downstream tasks.<ref>{{cite journal |last1=Martin |first1=Louis |last2=Muller |first2=Benjamin |last3=Suárez |first3=Pedro Javier Ortiz |last4=Dupont |first4=Yoann |last5=Romary |first5=Laurent |last6=de la Clergerie |first6=Éric Villemonte |last7=Seddah |first7=Djamé |last8=Sagot |first8=Benoît |title=CamemBERT: a Tasty French Language Model |date=2019 |doi=10.48550/arXiv.1911.03894}}</ref> |
|- | |- | ||
| 2019 || December 11 || FlauBERT || 138,000,000 – 373,000,000<ref>{{cite web |last1=Sambucci |first1=Luca |title=Cedille, the largest French AI language model, is actually from Switzerland |url=https://www.artificialintelligence.news/cedille-the-largest-french-ai-language-model-is-actually-from-switzerland/ |website=Artificial Intelligence news |access-date=30 June 2023 |date=17 November 2021}}</ref><ref name="FlauBERT"/> || 71,000,000,000 bytes<ref name="FlauBERT">{{cite journal |last1=Blanc |first1=Corentin |last2=Bailly |first2=Alexandre |last3=Francis |first3=Élie |last4=Guillotin |first4=Thierry |last5=Jamal |first5=Fadi |last6=Wakim |first6=Béchara |last7=Roy |first7=Pascal |title=FlauBERT vs. CamemBERT: Understanding patient's answers by a French medical chatbot |journal=Artificial Intelligence in Medicine |date=1 May 2022 |volume=127 |pages=102264 |doi=10.1016/j.artmed.2022.102264 |url=https://www.sciencedirect.com/science/article/pii/S093336572200029X |issn=0933-3657}}</ref> || LLM launch || FlauBERT is introduced as an unsupervised language model for French. Developed by Hang Le et al., it leverages unlabeled texts to pre-train word representations, demonstrating superior performance in various NLP tasks. Trained on a large and diverse French corpus, FlauBERT outperforms other pre-training approaches. 
The authors share different FlauBERT versions and a unified evaluation protocol, FLUE, for reproducible French NLP experiments.<ref>{{cite journal |last1=Le |first1=Hang |last2=Vial |first2=Loïc |last3=Frej |first3=Jibril |last4=Segonne |first4=Vincent |last5=Coavoux |first5=Maximin |last6=Lecouteux |first6=Benjamin |last7=Allauzen |first7=Alexandre |last8=Crabbé |first8=Benoît |last9=Besacier |first9=Laurent |last10=Schwab |first10=Didier |title=FlauBERT: Unsupervised Language Model Pre-training for French |date=2019 |doi=10.48550/arXiv.1912.05372}}</ref> | | 2019 || December 11 || FlauBERT || 138,000,000 – 373,000,000<ref>{{cite web |last1=Sambucci |first1=Luca |title=Cedille, the largest French AI language model, is actually from Switzerland |url=https://www.artificialintelligence.news/cedille-the-largest-french-ai-language-model-is-actually-from-switzerland/ |website=Artificial Intelligence news |access-date=30 June 2023 |date=17 November 2021}}</ref><ref name="FlauBERT"/> || 71,000,000,000 bytes<ref name="FlauBERT">{{cite journal |last1=Blanc |first1=Corentin |last2=Bailly |first2=Alexandre |last3=Francis |first3=Élie |last4=Guillotin |first4=Thierry |last5=Jamal |first5=Fadi |last6=Wakim |first6=Béchara |last7=Roy |first7=Pascal |title=FlauBERT vs. CamemBERT: Understanding patient's answers by a French medical chatbot |journal=Artificial Intelligence in Medicine |date=1 May 2022 |volume=127 |pages=102264 |doi=10.1016/j.artmed.2022.102264 |url=https://www.sciencedirect.com/science/article/pii/S093336572200029X |issn=0933-3657}}</ref> || LLM launch || FlauBERT is introduced as an unsupervised language model for French. Developed by Hang Le et al., it leverages unlabeled texts to pre-train word representations, demonstrating superior performance in various NLP tasks. Trained on a large and diverse French corpus, FlauBERT outperforms other pre-training approaches. 
The authors share different FlauBERT versions and a unified evaluation protocol, FLUE, for reproducible French NLP experiments.<ref>{{cite journal |last1=Le |first1=Hang |last2=Vial |first2=Loïc |last3=Frej |first3=Jibril |last4=Segonne |first4=Vincent |last5=Coavoux |first5=Maximin |last6=Lecouteux |first6=Benjamin |last7=Allauzen |first7=Alexandre |last8=Crabbé |first8=Benoît |last9=Besacier |first9=Laurent |last10=Schwab |first10=Didier |title=FlauBERT: Unsupervised Language Model Pre-training for French |date=2019 |doi=10.48550/arXiv.1912.05372}}</ref> | ||
|- | |- | ||
− | | 2020 || January 13 || ProphetNet || || || LLM launch || A paper introduces ProphetNet, a new sequence-to-sequence pre-training model. It incorporates a novel self-supervised objective called future n-gram prediction and utilizes the n-stream self-attention mechanism. Unlike traditional models that optimize one-step-ahead prediction, ProphetNet predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction objective encourages the model to plan for future tokens and prevents overfitting to local correlations. ProphetNet is pre-trained on both a base-scale dataset (16GB) and a large-scale dataset (160GB). The model's performance is evaluated on benchmarks such as CNN/DailyMail, Gigaword, and SQuAD 1.1 for tasks like abstractive summarization and question generation. Experimental results show that ProphetNet achieves new state-of-the-art results on all of these datasets compared with models pre-trained on a corpus of the same scale.<ref>{{cite journal |last1=Qi |first1=Weizhen |last2=Yan |first2=Yu |last3=Gong |first3=Yeyun |last4=Liu |first4=Dayiheng |last5=Duan |first5=Nan |last6=Chen |first6=Jiusheng |last7=Zhang |first7=Ruofei |last8=Zhou |first8=Ming |title=ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training |date=2020 |doi=10.48550/arXiv.2001.04063}}</ref> | + | | 2020 || January 13 || ProphetNet || || 16,000,000,000–160,000,000,000 bytes || LLM launch || A paper introduces ProphetNet, a new sequence-to-sequence pre-training model. It incorporates a novel self-supervised objective called future n-gram prediction and utilizes the n-stream self-attention mechanism. Unlike traditional models that optimize one-step-ahead prediction, ProphetNet predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction objective encourages the model to plan for future tokens and prevents overfitting to local correlations. 
ProphetNet is pre-trained on both a base-scale dataset (16GB) and a large-scale dataset (160GB). The model's performance is evaluated on benchmarks such as CNN/DailyMail, Gigaword, and SQuAD 1.1 for tasks like abstractive summarization and question generation. Experimental results show that ProphetNet achieves new state-of-the-art results on all of these datasets compared with models pre-trained on a corpus of the same scale.<ref>{{cite journal |last1=Qi |first1=Weizhen |last2=Yan |first2=Yu |last3=Gong |first3=Yeyun |last4=Liu |first4=Dayiheng |last5=Duan |first5=Nan |last6=Chen |first6=Jiusheng |last7=Zhang |first7=Ruofei |last8=Zhou |first8=Ming |title=ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training |date=2020 |doi=10.48550/arXiv.2001.04063}}</ref> |
|- | |- | ||
| 2020 || February 24 || T5 || 11,000,000,000<ref>{{cite web |last1=Jagtap |first1=Rohan |title=T5: Text-To-Text Transfer Transformer |url=https://towardsdatascience.com/t5-text-to-text-transfer-transformer-643f89e8905e |website=Medium |access-date=19 September 2023 |language=en |date=2 August 2020}}</ref> || 1,000,000,000,000 tokens<ref name="A Survey of Large">{{cite journal |last1=Zhao |first1=Wayne Xin |last2=Zhou |first2=Kun |last3=Li |first3=Junyi |last4=Tang |first4=Tianyi |last5=Wang |first5=Xiaolei |last6=Hou |first6=Yupeng |last7=Min |first7=Yingqian |last8=Zhang |first8=Beichen |last9=Zhang |first9=Junjie |last10=Dong |first10=Zican |last11=Du |first11=Yifan |last12=Yang |first12=Chen |last13=Chen |first13=Yushuo |last14=Chen |first14=Zhipeng |last15=Jiang |first15=Jinhao |last16=Ren |first16=Ruiyang |last17=Li |first17=Yifan |last18=Tang |first18=Xinyu |last19=Liu |first19=Zikang |last20=Liu |first20=Peiyu |last21=Nie |first21=Jian-Yun |last22=Wen |first22=Ji-Rong |title=A Survey of Large Language Models |date=2023 |doi=10.48550/arXiv.2303.18223 |url=}}</ref> || LLM launch || T5 is introduced as a Text-To-Text Transfer Transformer model. It is a flexible and powerful model that achieves optimal results in natural language processing tasks. It uses a unified text-to-text framework, allowing for easy adaptation to various NLP tasks. T5 is trained on a large-scale pre-training dataset called C4, which improves its performance. The authors conduct a systematic study of transfer learning methodologies and combine the best approaches to achieve remarkable results on multiple benchmarks. 
T5 is also applied to closed-book question answering and fill-in-the-blank text generation tasks with impressive performance.<ref>{{cite web |title=Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer |url=https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html |website=ai.googleblog.com |access-date=25 June 2023 |language=en |date=24 February 2020}}</ref> | | 2020 || February 24 || T5 || 11,000,000,000<ref>{{cite web |last1=Jagtap |first1=Rohan |title=T5: Text-To-Text Transfer Transformer |url=https://towardsdatascience.com/t5-text-to-text-transfer-transformer-643f89e8905e |website=Medium |access-date=19 September 2023 |language=en |date=2 August 2020}}</ref> || 1,000,000,000,000 tokens<ref name="A Survey of Large">{{cite journal |last1=Zhao |first1=Wayne Xin |last2=Zhou |first2=Kun |last3=Li |first3=Junyi |last4=Tang |first4=Tianyi |last5=Wang |first5=Xiaolei |last6=Hou |first6=Yupeng |last7=Min |first7=Yingqian |last8=Zhang |first8=Beichen |last9=Zhang |first9=Junjie |last10=Dong |first10=Zican |last11=Du |first11=Yifan |last12=Yang |first12=Chen |last13=Chen |first13=Yushuo |last14=Chen |first14=Zhipeng |last15=Jiang |first15=Jinhao |last16=Ren |first16=Ruiyang |last17=Li |first17=Yifan |last18=Tang |first18=Xinyu |last19=Liu |first19=Zikang |last20=Liu |first20=Peiyu |last21=Nie |first21=Jian-Yun |last22=Wen |first22=Ji-Rong |title=A Survey of Large Language Models |date=2023 |doi=10.48550/arXiv.2303.18223 |url=}}</ref> || LLM launch || T5 is introduced as a Text-To-Text Transfer Transformer model. It is a flexible and powerful model that achieves optimal results in natural language processing tasks. It uses a unified text-to-text framework, allowing for easy adaptation to various NLP tasks. T5 is trained on a large-scale pre-training dataset called C4, which improves its performance. 
The authors conduct a systematic study of transfer learning methodologies and combine the best approaches to achieve remarkable results on multiple benchmarks. T5 is also applied to closed-book question answering and fill-in-the-blank text generation tasks with impressive performance.<ref>{{cite web |title=Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer |url=https://ai.googleblog.com/2020/02/exploring-transfer-learning-with-t5.html |website=ai.googleblog.com |access-date=25 June 2023 |language=en |date=24 February 2020}}</ref> | ||
Line 83: | Line 83: | ||
| 2020 || March 10 || || || || Programming/training || {{w|Google}} researchers introduce ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), a novel pre-training method for natural language processing (NLP) models. ELECTRA aims to achieve the benefits of models like BERT while being more computationally efficient. It introduces a replaced token detection (RTD) task, inspired by generative adversarial networks (GANs), where the model distinguishes between "real" and "fake" input data. Unlike previous methods that predict a small subset of masked tokens, ELECTRA applies the {{w|binary classification}} task to every input token, resulting in more efficient learning. The replacement tokens are generated by a separate {{w|neural network}} called the generator, which is trained jointly with the {{w|discriminator}} (ELECTRA model). After pre-training, the generator is dropped, and the discriminator is fine-tuned on specific NLP tasks. ELECTRA achieves strong results on benchmarks like GLUE and SQuAD while using less compute than models such as RoBERTa and XLNet. It is released as an open-source model on {{w|TensorFlow}}, supporting tasks such as text classification, question answering, and [[w:Sequence labeling|sequence tagging]]. Pre-trained weights are also provided for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small.<ref>{{cite web |title=More Efficient NLP Model Pre-training with ELECTRA |url=https://ai.googleblog.com/2020/03/more-efficient-nlp-model-pre-training.html |website=ai.googleblog.com |access-date=28 June 2023 |language=en |date=10 March 2020}}</ref> | | 2020 || March 10 || || || || Programming/training || {{w|Google}} researchers introduce ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), a novel pre-training method for natural language processing (NLP) models. ELECTRA aims to achieve the benefits of models like BERT while being more computationally efficient. 
It introduces a replaced token detection (RTD) task, inspired by generative adversarial networks (GANs), where the model distinguishes between "real" and "fake" input data. Unlike previous methods that predict a small subset of masked tokens, ELECTRA applies the {{w|binary classification}} task to every input token, resulting in more efficient learning. The replacement tokens are generated by a separate {{w|neural network}} called the generator, which is trained jointly with the {{w|discriminator}} (ELECTRA model). After pre-training, the generator is dropped, and the discriminator is fine-tuned on specific NLP tasks. ELECTRA achieves strong results on benchmarks like GLUE and SQuAD while using less compute than models such as RoBERTa and XLNet. It is released as an open-source model on {{w|TensorFlow}}, supporting tasks such as text classification, question answering, and [[w:Sequence labeling|sequence tagging]]. Pre-trained weights are also provided for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small.<ref>{{cite web |title=More Efficient NLP Model Pre-training with ELECTRA |url=https://ai.googleblog.com/2020/03/more-efficient-nlp-model-pre-training.html |website=ai.googleblog.com |access-date=28 June 2023 |language=en |date=10 March 2020}}</ref> | ||
|- | |- | ||
− | | 2020 || April || Megatron-11B || 11,000,000,000 || 161,000,000,000<ref name="Dr Alan D. Thompson"/> || LLM launch || Facebook AI Research (FAIR) introduces Megatron-11B, a unidirectional language model with 11 billion parameters, which is built upon the Megatron-LM architecture. FAIR trained this model using intra-layer model parallelism, splitting each layer's parameters across 8 GPUs. Megatron-11B is trained on a dataset consisting of English Wikipedia (12GB), BookCorpus (4GB), CC-News (76GB), OpenWebText/Reddit upvoted (38GB), and Stories (31GB), with a total dataset size of 161GB. This model is part of the RoBERTa family and contributes to the advancements in large-scale language models for natural language processing tasks.<ref name="lifearchitect.ai">{{cite web |title=AI: Megatron the Transformer, and its related language models |url=https://lifearchitect.ai/megatron/#:~:text=October%202021%3A%20NVIDIA%20and%20Microsoft,model%20(MT%2DNLG). |website=lifearchitect.ai |access-date=18 September 2023 |language=en-AU |date=24 September 2021}}</ref> | + | | 2020 || April || Megatron-11B || 11,000,000,000 || 161,000,000,000 bytes<ref name="Dr Alan D. Thompson"/> || LLM launch || Facebook AI Research (FAIR) introduces Megatron-11B, a unidirectional language model with 11 billion parameters, which is built upon the Megatron-LM architecture. FAIR trained this model using intra-layer model parallelism, splitting each layer's parameters across 8 GPUs. Megatron-11B is trained on a dataset consisting of English Wikipedia (12GB), BookCorpus (4GB), CC-News (76GB), OpenWebText/Reddit upvoted (38GB), and Stories (31GB), with a total dataset size of 161GB. 
This model is part of the RoBERTa family and contributes to the advancements in large-scale language models for natural language processing tasks.<ref name="lifearchitect.ai">{{cite web |title=AI: Megatron the Transformer, and its related language models |url=https://lifearchitect.ai/megatron/#:~:text=October%202021%3A%20NVIDIA%20and%20Microsoft,model%20(MT%2DNLG). |website=lifearchitect.ai |access-date=18 September 2023 |language=en-AU |date=24 September 2021}}</ref> |
|- | |- | ||
| 2020 || May || {{w|GPT-3}} || 175,000,000,000<ref>{{cite web |last1=Wiggers |first1=Kyle |title=The emerging types of language models and why they matter |url=https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/ |website=TechCrunch |access-date=29 June 2023 |date=28 April 2022}}</ref> | | 2020 || May || {{w|GPT-3}} || 175,000,000,000<ref>{{cite web |last1=Wiggers |first1=Kyle |title=The emerging types of language models and why they matter |url=https://techcrunch.com/2022/04/28/the-emerging-types-of-language-models-and-why-they-matter/ |website=TechCrunch |access-date=29 June 2023 |date=28 April 2022}}</ref> | ||
− | || 45,000,000,000,000<ref>{{cite web |title=OpenAI GPT-3: Everything You Need to Know [Updated] |url=https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai/ |website=springboard.com |access-date=16 October 2023}}</ref> || LLM launch || OpenAI introduces GPT-3, which with 175 billion parameters is the largest neural network at the time of its release, far larger than previous models. Trained on extensive internet data, GPT-3 demonstrates exceptional performance in various natural language processing tasks like translation and question-answering, outperforming existing models. The research showcases its remarkable few-shot learning ability, making it a groundbreaking advancement in the field of artificial intelligence.<ref>{{cite web |last1=Romero |first1=Alberto |title=GPT-3 — A Complete Overview |url=https://towardsdatascience.com/gpt-3-a-complete-overview-190232eb25fd |website=Medium |access-date=20 October 2023 |language=en |date=25 May 2021}}</ref><ref name="NVIDIA Blog">{{cite web |last1=Lee |first1=Angie |title=What Are Large Language Models Used For and Why Are They Important? |url=https://blogs.nvidia.com/blog/2023/01/26/what-are-large-language-models-used-for/ |website=NVIDIA Blog |access-date=11 March 2023 |date=26 January 2023}}</ref> | + | || 45,000,000,000,000 bytes<ref>{{cite web |title=OpenAI GPT-3: Everything You Need to Know [Updated] |url=https://www.springboard.com/blog/data-science/machine-learning-gpt-3-open-ai/ |website=springboard.com |access-date=16 October 2023}}</ref> || LLM launch || OpenAI introduces GPT-3, which with 175 billion parameters is the largest neural network at the time of its release, far larger than previous models. Trained on extensive internet data, GPT-3 demonstrates exceptional performance in various natural language processing tasks like translation and question-answering, outperforming existing models. 
The research showcases its remarkable few-shot learning ability, making it a groundbreaking advancement in the field of artificial intelligence.<ref>{{cite web |last1=Romero |first1=Alberto |title=GPT-3 — A Complete Overview |url=https://towardsdatascience.com/gpt-3-a-complete-overview-190232eb25fd |website=Medium |access-date=20 October 2023 |language=en |date=25 May 2021}}</ref><ref name="NVIDIA Blog">{{cite web |last1=Lee |first1=Angie |title=What Are Large Language Models Used For and Why Are They Important? |url=https://blogs.nvidia.com/blog/2023/01/26/what-are-large-language-models-used-for/ |website=NVIDIA Blog |access-date=11 March 2023 |date=26 January 2023}}</ref> |
|- | |- | ||
| 2020 || May 28 || || || || Programming/training || A paper discusses the use of language models in few-shot learning, where a model is trained on a large corpus of text and then fine-tuned for a specific task. The authors demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance. They trained GPT-3, a language model with 175 billion parameters, and tested its performance in the few-shot setting. GPT-3 achieved strong performance on many NLP tasks, including translation, question-answering, and cloze tasks, as well as tasks that require on-the-fly reasoning or domain adaptation. However, the authors also identify some datasets where GPT-3's few-shot learning struggles, as well as methodological issues related to training on large web corpora. The paper also discusses the broader societal impacts of this finding and of GPT-3 in general.<ref>{{cite journal |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |last16=Ramesh |first16=Aditya |last17=Ziegler |first17=Daniel M. 
|last18=Wu |first18=Jeffrey |last19=Winter |first19=Clemens |last20=Hesse |first20=Christopher |last21=Chen |first21=Mark |last22=Sigler |first22=Eric |last23=Litwin |first23=Mateusz |last24=Gray |first24=Scott |last25=Chess |first25=Benjamin |last26=Clark |first26=Jack |last27=Berner |first27=Christopher |last28=McCandlish |first28=Sam |last29=Radford |first29=Alec |last30=Sutskever |first30=Ilya |last31=Amodei |first31=Dario |title=Language Models are Few-Shot Learners |date=2020 |doi=10.48550/arXiv.2005.14165}}</ref> | | 2020 || May 28 || || || || Programming/training || A paper discusses the use of language models in few-shot learning, where a model is trained on a large corpus of text and then fine-tuned for a specific task. The authors demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance. They trained GPT-3, a language model with 175 billion parameters, and tested its performance in the few-shot setting. GPT-3 achieved strong performance on many NLP tasks, including translation, question-answering, and cloze tasks, as well as tasks that require on-the-fly reasoning or domain adaptation. However, the authors also identify some datasets where GPT-3's few-shot learning struggles, as well as methodological issues related to training on large web corpora. The paper also discusses the broader societal impacts of this finding and of GPT-3 in general.<ref>{{cite journal |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |last16=Ramesh |first16=Aditya |last17=Ziegler |first17=Daniel M. 
|last18=Wu |first18=Jeffrey |last19=Winter |first19=Clemens |last20=Hesse |first20=Christopher |last21=Chen |first21=Mark |last22=Sigler |first22=Eric |last23=Litwin |first23=Mateusz |last24=Gray |first24=Scott |last25=Chess |first25=Benjamin |last26=Clark |first26=Jack |last27=Berner |first27=Christopher |last28=McCandlish |first28=Sam |last29=Radford |first29=Alec |last30=Sutskever |first30=Ilya |last31=Amodei |first31=Dario |title=Language Models are Few-Shot Learners |date=2020 |doi=10.48550/arXiv.2005.14165}}</ref> | ||
Line 92: | Line 92: | ||
| 2020 || June 5 || DeBERTa || 1,500,000,000 (larger model)<ref>{{cite web |last1=Tsang |first1=Sik-Ho |title=Brief Review — DeBERTa: Decoding-enhanced BERT with Disentangled Attention |url=https://sh-tsang.medium.com/brief-review-deberta-decoding-enhanced-bert-with-disentangled-attention-f5cdb9a8bf0b |website=Medium |access-date=18 September 2023 |language=en |date=21 January 2023}}</ref> || || LLM launch || A paper presents DeBERTa, a model that enhances BERT and RoBERTa LLMs by introducing disentangled attention and an enhanced mask decoder. These techniques improve model pre-training efficiency and performance on various NLP tasks. A DeBERTa model trained on half the data outperforms RoBERTa-Large on tasks like MNLI, SQuAD v2.0, and RACE. A larger DeBERTa model with 1.5 billion parameters surpasses human performance on the SuperGLUE benchmark, and an ensemble DeBERTa model leads the SuperGLUE leaderboard with a significant margin over the human baseline.<ref>{{cite journal |last1=He |first1=Pengcheng |last2=Liu |first2=Xiaodong |last3=Gao |first3=Jianfeng |last4=Chen |first4=Weizhu |title=DeBERTa: Decoding-enhanced BERT with Disentangled Attention |date=2020 |doi=10.48550/arXiv.2006.03654}}</ref> | | 2020 || June 5 || DeBERTa || 1,500,000,000 (larger model)<ref>{{cite web |last1=Tsang |first1=Sik-Ho |title=Brief Review — DeBERTa: Decoding-enhanced BERT with Disentangled Attention |url=https://sh-tsang.medium.com/brief-review-deberta-decoding-enhanced-bert-with-disentangled-attention-f5cdb9a8bf0b |website=Medium |access-date=18 September 2023 |language=en |date=21 January 2023}}</ref> || || LLM launch || A paper presents DeBERTa, a model that enhances BERT and RoBERTa LLMs by introducing disentangled attention and an enhanced mask decoder. These techniques improve model pre-training efficiency and performance on various NLP tasks. A DeBERTa model trained on half the data outperforms RoBERTa-Large on tasks like MNLI, SQuAD v2.0, and RACE. 
A larger DeBERTa model with 1.5 billion parameters surpasses human performance on the SuperGLUE benchmark, and an ensemble DeBERTa model leads the SuperGLUE leaderboard with a significant margin over the human baseline.<ref>{{cite journal |last1=He |first1=Pengcheng |last2=Liu |first2=Xiaodong |last3=Gao |first3=Jianfeng |last4=Chen |first4=Weizhu |title=DeBERTa: Decoding-enhanced BERT with Disentangled Attention |date=2020 |doi=10.48550/arXiv.2006.03654}}</ref> | ||
|- | |- | ||
− | | 2020 || June 30 || GShard || 600,000,000,000<ref name="A Survey of Large"/> || 1,000,000,000,000<ref name="A Survey of Large"/> || LLM launch || A paper introduces GShard, a module designed to address challenges in scaling neural networks for machine learning applications. By combining lightweight annotation APIs and an extension to the XLA compiler, GShard enables efficient parallel computation patterns with minimal code changes. The researchers utilize GShard to scale a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts to over 600 billion parameters using automatic sharding. This model is trained on 2048 TPU v3 accelerators in just 4 days, achieving significantly improved translation quality from 100 languages to English compared to previous methods.<ref>{{cite journal |last1=Lepikhin |first1=Dmitry |last2=Lee |first2=HyoukJoong |last3=Xu |first3=Yuanzhong |last4=Chen |first4=Dehao |last5=Firat |first5=Orhan |last6=Huang |first6=Yanping |last7=Krikun |first7=Maxim |last8=Shazeer |first8=Noam |last9=Chen |first9=Zhifeng |title=GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding |date=2020 |doi=10.48550/arXiv.2006.16668}}</ref> | + | | 2020 || June 30 || GShard || 600,000,000,000<ref name="A Survey of Large"/> || 1,000,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || A paper introduces GShard, a module designed to address challenges in scaling neural networks for machine learning applications. By combining lightweight annotation APIs and an extension to the XLA compiler, GShard enables efficient parallel computation patterns with minimal code changes. The researchers utilize GShard to scale a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts to over 600 billion parameters using automatic sharding. 
This model is trained on 2048 TPU v3 accelerators in just 4 days, achieving significantly improved translation quality from 100 languages to English compared to previous methods.<ref>{{cite journal |last1=Lepikhin |first1=Dmitry |last2=Lee |first2=HyoukJoong |last3=Xu |first3=Yuanzhong |last4=Chen |first4=Dehao |last5=Firat |first5=Orhan |last6=Huang |first6=Yanping |last7=Krikun |first7=Maxim |last8=Shazeer |first8=Noam |last9=Chen |first9=Zhifeng |title=GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding |date=2020 |doi=10.48550/arXiv.2006.16668}}</ref> |
|- | |- | ||
| 2020 || July || || || || Efficiency || A paper discusses the limitations of neural text generation models in open-ended tasks like language modeling and story generation, due to the standard likelihood training and approximate decoding objectives. The authors specifically analyze these limitations for abstractive document summarization and find that such models tend to hallucinate content that is unfaithful to the input document. The paper presents the results of a human evaluation of several neural abstractive summarization systems, highlighting the substantial amount of hallucinated content in all model-generated summaries. However, the authors also show that pretrained models perform better in terms of generating faithful and factual summaries, as evaluated by humans. They propose that textual entailment measures may be a better evaluation metric for faithfulness than standard metrics, leading to better training and decoding criteria.<ref>{{cite journal |last1=Maynez |first1=Joshua |last2=Narayan |first2=Shashi |last3=Bohnet |first3=Bernd |last4=McDonald |first4=Ryan |title=On Faithfulness and Factuality in Abstractive Summarization |journal=Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics |date=July 2020 |pages=1906–1919 |doi=10.18653/v1/2020.acl-main.173 |url=https://aclanthology.org/2020.acl-main.173/ |publisher=Association for Computational Linguistics}}</ref> | | 2020 || July || || || || Efficiency || A paper discusses the limitations of neural text generation models in open-ended tasks like language modeling and story generation, due to the standard likelihood training and approximate decoding objectives. The authors specifically analyze these limitations for abstractive document summarization and find that such models tend to hallucinate content that is unfaithful to the input document. 
The paper presents the results of a human evaluation of several neural abstractive summarization systems, highlighting the substantial amount of hallucinated content in all model-generated summaries. However, the authors also show that pretrained models perform better in terms of generating faithful and factual summaries, as evaluated by humans. They propose that textual entailment measures may be a better evaluation metric for faithfulness than standard metrics, leading to better training and decoding criteria.<ref>{{cite journal |last1=Maynez |first1=Joshua |last2=Narayan |first2=Shashi |last3=Bohnet |first3=Bernd |last4=McDonald |first4=Ryan |title=On Faithfulness and Factuality in Abstractive Summarization |journal=Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics |date=July 2020 |pages=1906–1919 |doi=10.18653/v1/2020.acl-main.173 |url=https://aclanthology.org/2020.acl-main.173/ |publisher=Association for Computational Linguistics}}</ref> | ||
Line 102: | Line 102: | ||
| 2021 || March 22 || GPT-Neo || 2,700,000,000<ref name="gpt-neo">{{Cite web|url=https://github.com/EleutherAI/gpt-neo|title=GPT Neo|date=March 15, 2023}}</ref> || || LLM launch || GPT-Neo is introduced as an open-source alternative to GPT-3, developed by EleutherAI. It offers accessible language generation capabilities and is released under the MIT license. While GPT-Neo's performance is not as strong as GPT-3's largest model, it outperforms comparable GPT-3 models on NLP reasoning benchmarks. GPT-Neo provides a promising option, especially considering OpenAI's restricted access policy.<ref>{{cite web |title=GPT-3’s free alternative GPT-Neo is something to be excited about |url=https://venturebeat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/ |website=VentureBeat |access-date=29 June 2023 |date=15 May 2021}}</ref> | | 2021 || March 22 || GPT-Neo || 2,700,000,000<ref name="gpt-neo">{{Cite web|url=https://github.com/EleutherAI/gpt-neo|title=GPT Neo|date=March 15, 2023}}</ref> || || LLM launch || GPT-Neo is introduced as an open-source alternative to GPT-3, developed by EleutherAI. It offers accessible language generation capabilities and is released under the MIT license. While GPT-Neo's performance is not as strong as GPT-3's largest model, it outperforms comparable GPT-3 models on NLP reasoning benchmarks. GPT-Neo provides a promising option, especially considering OpenAI's restricted access policy.<ref>{{cite web |title=GPT-3’s free alternative GPT-Neo is something to be excited about |url=https://venturebeat.com/ai/gpt-3s-free-alternative-gpt-neo-is-something-to-be-excited-about/ |website=VentureBeat |access-date=29 June 2023 |date=15 May 2021}}</ref> | ||
|- | |- | ||
− | | 2021 || April 26 || PanGu-α || 13,000,000,000<ref name="A Survey of Large"/>–200,000,000,000 || 1,100,000,000,000<ref name="A Survey of Large"/> || LLM launch || Researchers introduce PanGu-α, a large-scale autoregressive pretrained Chinese language model with up to 200 billion parameters. Developed using MindSpore and trained on a cluster of 2048 Ascend 910 AI processors, PanGu-α utilizes advanced training parallelism strategies, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance its capabilities, the model is pretrained on 1.1TB of high-quality Chinese data from diverse domains. Empirical tests showcase PanGu-α's excellence in tasks such as text summarization, question answering, and dialogue generation, demonstrating superior performance in few-shot or zero-shot scenarios across various Chinese NLP tasks.<ref>{{cite journal |last1=Zeng |first1=Wei |last2=Ren |first2=Xiaozhe |last3=Su |first3=Teng |last4=Wang |first4=Hui |last5=Liao |first5=Yi |last6=Wang |first6=Zhiwei |last7=Jiang |first7=Xin |last8=Yang |first8=ZhenZhang |last9=Wang |first9=Kaisheng |last10=Zhang |first10=Xiaoda |last11=Li |first11=Chen |last12=Gong |first12=Ziyan |last13=Yao |first13=Yifan |last14=Huang |first14=Xinjing |last15=Wang |first15=Jun |last16=Yu |first16=Jianfeng |last17=Guo |first17=Qi |last18=Yu |first18=Yue |last19=Zhang |first19=Yan |last20=Wang |first20=Jin |last21=Tao |first21=Hengtao |last22=Yan |first22=Dasen |last23=Yi |first23=Zexuan |last24=Peng |first24=Fang |last25=Jiang |first25=Fangqing |last26=Zhang |first26=Han |last27=Deng |first27=Lingfeng |last28=Zhang |first28=Yehong |last29=Lin |first29=Zhe |last30=Zhang |first30=Chao |last31=Zhang |first31=Shaojie |last32=Guo |first32=Mingyue |last33=Gu |first33=Shanzhi |last34=Fan |first34=Gaojun |last35=Wang |first35=Yaowei |last36=Jin |first36=Xuefeng |last37=Liu |first37=Qun |last38=Tian |first38=Yonghong 
|title=PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation |date=2021 |doi=10.48550/arXiv.2104.12369}}</ref> | + | | 2021 || April 26 || PanGu-α || 13,000,000,000<ref name="A Survey of Large"/>–200,000,000,000 || 1,100,000,000,000 bytes<ref name="A Survey of Large"/> || LLM launch || Researchers introduce PanGu-α, a large-scale autoregressive pretrained Chinese language model with up to 200 billion parameters. Developed using MindSpore and trained on a cluster of 2048 Ascend 910 AI processors, PanGu-α utilizes advanced training parallelism strategies, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance its capabilities, the model is pretrained on 1.1TB of high-quality Chinese data from diverse domains. Empirical tests showcase PanGu-α's excellence in tasks such as text summarization, question answering, and dialogue generation, demonstrating superior performance in few-shot or zero-shot scenarios across various Chinese NLP tasks.<ref>{{cite journal |last1=Zeng |first1=Wei |last2=Ren |first2=Xiaozhe |last3=Su |first3=Teng |last4=Wang |first4=Hui |last5=Liao |first5=Yi |last6=Wang |first6=Zhiwei |last7=Jiang |first7=Xin |last8=Yang |first8=ZhenZhang |last9=Wang |first9=Kaisheng |last10=Zhang |first10=Xiaoda |last11=Li |first11=Chen |last12=Gong |first12=Ziyan |last13=Yao |first13=Yifan |last14=Huang |first14=Xinjing |last15=Wang |first15=Jun |last16=Yu |first16=Jianfeng |last17=Guo |first17=Qi |last18=Yu |first18=Yue |last19=Zhang |first19=Yan |last20=Wang |first20=Jin |last21=Tao |first21=Hengtao |last22=Yan |first22=Dasen |last23=Yi |first23=Zexuan |last24=Peng |first24=Fang |last25=Jiang |first25=Fangqing |last26=Zhang |first26=Han |last27=Deng |first27=Lingfeng |last28=Zhang |first28=Yehong |last29=Lin |first29=Zhe |last30=Zhang |first30=Chao |last31=Zhang |first31=Shaojie |last32=Guo |first32=Mingyue |last33=Gu 
|first33=Shanzhi |last34=Fan |first34=Gaojun |last35=Wang |first35=Yaowei |last36=Jin |first36=Xuefeng |last37=Liu |first37=Qun |last38=Tian |first38=Yonghong |title=PanGu-$α$: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation |date=2021 |doi=10.48550/arXiv.2104.12369}}</ref> |
|- | |- | ||
− | | 2021 || May || LaMDA || 173,000,000,000 || || LLM launch || Google announces LaMDA (Language Model for Dialogue Applications). Unlike other language models, LaMDA is specifically trained on dialogue to enable more natural and engaging conversations with users. It has the ability to understand and respond to the subtleties of open-ended discussions. LaMDA has various potential applications, including customer service, chatbots, and personal assistants. It is built upon Google's previous chatbot model called Meena.<ref name="Kazi"/> Its pretraining dataset consists of 2.97 billion documents, 1.12 billion dialogs, and 13.39 billion dialog utterances, for a total of 1.56 trillion words.<ref>{{cite web |last1=Tsang |first1=Sik-Ho |title=Brief Review — LaMDA: Language Models for Dialog Applications |url=https://sh-tsang.medium.com/brief-review-lamda-language-models-for-dialog-applications-e8e9f3ee1113 |website=Medium |access-date=16 October 2023 |language=en |date=13 May 2023}}</ref> | + | | 2021 || May || LaMDA || 173,000,000,000<ref name="A Survey of Large"/> || 768,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || Google announces LaMDA (Language Model for Dialogue Applications). Unlike other language models, LaMDA is specifically trained on dialogue to enable more natural and engaging conversations with users. It has the ability to understand and respond to the subtleties of open-ended discussions. LaMDA has various potential applications, including customer service, chatbots, and personal assistants. 
It is built upon Google's previous chatbot model called Meena.<ref name="Kazi"/> Its pretraining dataset consists of 2.97 billion documents, 1.12 billion dialogs, and 13.39 billion dialog utterances, for a total of 1.56 trillion words.<ref>{{cite web |last1=Tsang |first1=Sik-Ho |title=Brief Review — LaMDA: Language Models for Dialog Applications |url=https://sh-tsang.medium.com/brief-review-lamda-language-models-for-dialog-applications-e8e9f3ee1113 |website=Medium |access-date=16 October 2023 |language=en |date=13 May 2023}}</ref> |
|- | |- | ||
− | | 2021 || June 20 || CPM-2 || 198,000,000,000<ref name="A Survey of Large"/> || 2, | + | | 2021 || June 20 || CPM-2 || 198,000,000,000<ref name="A Survey of Large"/> || 2,600,000,000,000 bytes<ref name="A Survey of Large"/> || LLM launch || Researchers introduce two models: an encoder-decoder bilingual model with 11 billion parameters (CPM-2) and its corresponding MoE version with 198 billion parameters. In their tests, they evaluate CPM-2 and mT5 in practical tasks. The results indicate that CPM-2 possesses impressive overall language capabilities. Additionally, they verify InfMoE's effectiveness in performing inferences with large-scale models containing tens of billions of parameters on a single GPU.<ref>{{cite journal |last1=Zhang |first1=Zhengyan |last2=Gu |first2=Yuxian |last3=Han |first3=Xu |last4=Chen |first4=Shengqi |last5=Xiao |first5=Chaojun |last6=Sun |first6=Zhenbo |last7=Yao |first7=Yuan |last8=Qi |first8=Fanchao |last9=Guan |first9=Jian |last10=Ke |first10=Pei |last11=Cai |first11=Yanzheng |last12=Zeng |first12=Guoyang |last13=Tan |first13=Zhixing |last14=Liu |first14=Zhiyuan |last15=Huang |first15=Minlie |last16=Han |first16=Wentao |last17=Liu |first17=Yang |last18=Zhu |first18=Xiaoyan |last19=Sun |first19=Maosong |title=CPM-2: Large-scale Cost-effective Pre-trained Language Models |date=2021 |doi=10.48550/arXiv.2106.10715}}</ref> |
|- | |- | ||
− | | 2021 || July 5 || ERNIE 3.0 || 10,000,000,000<ref name="A Survey of Large"/> || 375,000,000,000<ref name="A Survey of Large"/> || LLM launch || ERNIE 3.0 is introduced as a pre-training framework for large-scale language models in Natural Language Processing (NLP). Unlike previous models like T5 and GPT-3, ERNIE 3.0 incorporates both linguistic and world knowledge into its training, addressing the limitation of traditional models trained solely on plain texts. It combines auto-regressive and auto-encoding networks, enabling the model to handle natural language understanding and generation tasks effectively. Trained with 10 billion parameters on a 4TB corpus containing texts and a vast knowledge graph, ERNIE 3.0 outperforms existing models in 54 Chinese NLP tasks. Its English version also excels, leading the SuperGLUE benchmark and surpassing human performance by +0.8% (90.6% vs. 89.8%).<ref>{{cite journal |last1=Sun |first1=Yu |last2=Wang |first2=Shuohuan |last3=Feng |first3=Shikun |last4=Ding |first4=Siyu |last5=Pang |first5=Chao |last6=Shang |first6=Junyuan |last7=Liu |first7=Jiaxiang |last8=Chen |first8=Xuyi |last9=Zhao |first9=Yanbin |last10=Lu |first10=Yuxiang |last11=Liu |first11=Weixin |last12=Wu |first12=Zhihua |last13=Gong |first13=Weibao |last14=Liang |first14=Jianzhong |last15=Shang |first15=Zhizhou |last16=Sun |first16=Peng |last17=Liu |first17=Wei |last18=Ouyang |first18=Xuan |last19=Yu |first19=Dianhai |last20=Tian |first20=Hao |last21=Wu |first21=Hua |last22=Wang |first22=Haifeng |title=ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation |date=2021 |doi=10.48550/arXiv.2107.02137}}</ref> | + | | 2021 || July 5 || ERNIE 3.0 || 10,000,000,000<ref name="A Survey of Large"/> || 375,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || ERNIE 3.0 is introduced as a pre-training framework for large-scale language models in Natural Language Processing (NLP). 
Unlike previous models like T5 and GPT-3, ERNIE 3.0 incorporates both linguistic and world knowledge into its training, addressing the limitation of traditional models trained solely on plain texts. It combines auto-regressive and auto-encoding networks, enabling the model to handle natural language understanding and generation tasks effectively. Trained with 10 billion parameters on a 4TB corpus containing texts and a vast knowledge graph, ERNIE 3.0 outperforms existing models in 54 Chinese NLP tasks. Its English version also excels, leading the SuperGLUE benchmark and surpassing human performance by +0.8% (90.6% vs. 89.8%).<ref>{{cite journal |last1=Sun |first1=Yu |last2=Wang |first2=Shuohuan |last3=Feng |first3=Shikun |last4=Ding |first4=Siyu |last5=Pang |first5=Chao |last6=Shang |first6=Junyuan |last7=Liu |first7=Jiaxiang |last8=Chen |first8=Xuyi |last9=Zhao |first9=Yanbin |last10=Lu |first10=Yuxiang |last11=Liu |first11=Weixin |last12=Wu |first12=Zhihua |last13=Gong |first13=Weibao |last14=Liang |first14=Jianzhong |last15=Shang |first15=Zhizhou |last16=Sun |first16=Peng |last17=Liu |first17=Wei |last18=Ouyang |first18=Xuan |last19=Yu |first19=Dianhai |last20=Tian |first20=Hao |last21=Wu |first21=Hua |last22=Wang |first22=Haifeng |title=ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation |date=2021 |doi=10.48550/arXiv.2107.02137}}</ref> |
|- | |- | ||
| 2021 || July 7 || Codex || 12,000,000,000 || 100,000,000,000 tokens || LLM launch || A paper introduces Codex, a GPT language model fine-tuned on publicly available GitHub code, also powering GitHub Copilot. Evaluations on a new set called HumanEval reveal Codex solves 28.8% of problems involving synthesizing programs from docstrings, significantly outperforming GPT-3 (0%) and GPT-J (11.4%). Codex demonstrates effectiveness in generating solutions by repeatedly sampling from the model, achieving 70.2% accuracy with 100 samples per problem. Limitations include challenges with complex docstrings and binding operations to variables. The study discusses broader impacts of deploying advanced code generation technologies, addressing concerns related to safety, security, and economics.<ref>{{cite journal |last1=Chen |first1=Mark |last2=Tworek |first2=Jerry |last3=Jun |first3=Heewoo |last4=Yuan |first4=Qiming |last5=Pinto |first5=Henrique Ponde de Oliveira |last6=Kaplan |first6=Jared |last7=Edwards |first7=Harri |last8=Burda |first8=Yuri |last9=Joseph |first9=Nicholas |last10=Brockman |first10=Greg |last11=Ray |first11=Alex |last12=Puri |first12=Raul |last13=Krueger |first13=Gretchen |last14=Petrov |first14=Michael |last15=Khlaaf |first15=Heidy |last16=Sastry |first16=Girish |last17=Mishkin |first17=Pamela |last18=Chan |first18=Brooke |last19=Gray |first19=Scott |last20=Ryder |first20=Nick |last21=Pavlov |first21=Mikhail |last22=Power |first22=Alethea |last23=Kaiser |first23=Lukasz |last24=Bavarian |first24=Mohammad |last25=Winter |first25=Clemens |last26=Tillet |first26=Philippe |last27=Such |first27=Felipe Petroski |last28=Cummings |first28=Dave |last29=Plappert |first29=Matthias |last30=Chantzis |first30=Fotios |last31=Barnes |first31=Elizabeth |last32=Herbert-Voss |first32=Ariel |last33=Guss |first33=William Hebgen |last34=Nichol |first34=Alex |last35=Paino |first35=Alex |last36=Tezak |first36=Nikolas |last37=Tang |first37=Jie |last38=Babuschkin |first38=Igor 
|last39=Balaji |first39=Suchir |last40=Jain |first40=Shantanu |last41=Saunders |first41=William |last42=Hesse |first42=Christopher |last43=Carr |first43=Andrew N. |last44=Leike |first44=Jan |last45=Achiam |first45=Josh |last46=Misra |first46=Vedant |last47=Morikawa |first47=Evan |last48=Radford |first48=Alec |last49=Knight |first49=Matthew |last50=Brundage |first50=Miles |last51=Murati |first51=Mira |last52=Mayer |first52=Katie |last53=Welinder |first53=Peter |last54=McGrew |first54=Bob |last55=Amodei |first55=Dario |last56=McCandlish |first56=Sam |last57=Sutskever |first57=Ilya |last58=Zaremba |first58=Wojciech |title=Evaluating Large Language Models Trained on Code |date=2021 |doi=10.48550/arXiv.2107.03374}}</ref> | | 2021 || July 7 || Codex || 12,000,000,000 || 100,000,000,000 tokens || LLM launch || A paper introduces Codex, a GPT language model fine-tuned on publicly available GitHub code, also powering GitHub Copilot. Evaluations on a new set called HumanEval reveal Codex solves 28.8% of problems involving synthesizing programs from docstrings, significantly outperforming GPT-3 (0%) and GPT-J (11.4%). Codex demonstrates effectiveness in generating solutions by repeatedly sampling from the model, achieving 70.2% accuracy with 100 samples per problem. Limitations include challenges with complex docstrings and binding operations to variables. 
The study discusses broader impacts of deploying advanced code generation technologies, addressing concerns related to safety, security, and economics.<ref>{{cite journal |last1=Chen |first1=Mark |last2=Tworek |first2=Jerry |last3=Jun |first3=Heewoo |last4=Yuan |first4=Qiming |last5=Pinto |first5=Henrique Ponde de Oliveira |last6=Kaplan |first6=Jared |last7=Edwards |first7=Harri |last8=Burda |first8=Yuri |last9=Joseph |first9=Nicholas |last10=Brockman |first10=Greg |last11=Ray |first11=Alex |last12=Puri |first12=Raul |last13=Krueger |first13=Gretchen |last14=Petrov |first14=Michael |last15=Khlaaf |first15=Heidy |last16=Sastry |first16=Girish |last17=Mishkin |first17=Pamela |last18=Chan |first18=Brooke |last19=Gray |first19=Scott |last20=Ryder |first20=Nick |last21=Pavlov |first21=Mikhail |last22=Power |first22=Alethea |last23=Kaiser |first23=Lukasz |last24=Bavarian |first24=Mohammad |last25=Winter |first25=Clemens |last26=Tillet |first26=Philippe |last27=Such |first27=Felipe Petroski |last28=Cummings |first28=Dave |last29=Plappert |first29=Matthias |last30=Chantzis |first30=Fotios |last31=Barnes |first31=Elizabeth |last32=Herbert-Voss |first32=Ariel |last33=Guss |first33=William Hebgen |last34=Nichol |first34=Alex |last35=Paino |first35=Alex |last36=Tezak |first36=Nikolas |last37=Tang |first37=Jie |last38=Babuschkin |first38=Igor |last39=Balaji |first39=Suchir |last40=Jain |first40=Shantanu |last41=Saunders |first41=William |last42=Hesse |first42=Christopher |last43=Carr |first43=Andrew N. 
|last44=Leike |first44=Jan |last45=Achiam |first45=Josh |last46=Misra |first46=Vedant |last47=Morikawa |first47=Evan |last48=Radford |first48=Alec |last49=Knight |first49=Matthew |last50=Brundage |first50=Miles |last51=Murati |first51=Mira |last52=Mayer |first52=Katie |last53=Welinder |first53=Peter |last54=McGrew |first54=Bob |last55=Amodei |first55=Dario |last56=McCandlish |first56=Sam |last57=Sutskever |first57=Ilya |last58=Zaremba |first58=Wojciech |title=Evaluating Large Language Models Trained on Code |date=2021 |doi=10.48550/arXiv.2107.03374}}</ref> | ||
|- | |- | ||
− | | 2021 || September || HyperCLOVA || 82,000,000,000<ref name="A Survey of Large"/>–204,000,000,000<ref name="HyperCLOVA"/> || 300,000,000,000<ref name="A Survey of Large"/>–560,000,000,000<ref name="What Changes"/> || LLM launch || HyperCLOVA is introduced as a large-scale Korean contextual learning model.<ref name="What Changes">{{cite journal |last1=Kim |first1=Boseop |last2=Kim |first2=HyoungSeok |last3=Lee |first3=Sang-Woo |last4=Lee |first4=Gichang |last5=Kwak |first5=Donghyun |last6=Dong Hyeon |first6=Jeon |last7=Park |first7=Sunghyun |last8=Kim |first8=Sungju |last9=Kim |first9=Seonhoon |last10=Seo |first10=Dongpil |last11=Lee |first11=Heungsub |last12=Jeong |first12=Minyoung |last13=Lee |first13=Sungjae |last14=Kim |first14=Minsub |last15=Ko |first15=Suk Hyun |last16=Kim |first16=Seokhun |last17=Park |first17=Taeyong |last18=Kim |first18=Jinuk |last19=Kang |first19=Soyoung |last20=Ryu |first20=Na-Hyeon |last21=Yoo |first21=Kang Min |last22=Chang |first22=Minsuk |last23=Suh |first23=Soobin |last24=In |first24=Sookyo |last25=Park |first25=Jinseong |last26=Kim |first26=Kyungduk |last27=Kim |first27=Hiun |last28=Jeong |first28=Jisu |last29=Yeo |first29=Yong Goo |last30=Ham |first30=Donghoon |last31=Park |first31=Dongju |last32=Lee |first32=Min Young |last33=Kang |first33=Jaewook |last34=Kang |first34=Inho |last35=Ha |first35=Jung-Woo |last36=Park |first36=Woomyoung |last37=Sung |first37=Nako |title=What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers |date=2021 |pages=3405–3424 |doi=10.18653/v1/2021.emnlp-main.274}}</ref> HyperCLOVA's extensive parameters enhance its ability to distinguish speech nuances and dialects. It learned from 6,500 times more Korean data than GPT-3, predominantly focusing on the Korean language (97%). 
HyperCLOVA's applications include human conversation processing, translation, summarization, and machine reading, offering diverse AI possibilities and fostering new service and business opportunities.<ref name="HyperCLOVA">{{cite web |last1=Demo |first1=GPT-3 |title=HyperCLOVA {{!}} Discover AI use cases |url=https://gpt3demo.com/apps/hyperclova |website=gpt3demo.com |access-date=20 October 2023 |language=en}}</ref> | + | | 2021 || September || HyperCLOVA || 82,000,000,000<ref name="A Survey of Large"/>–204,000,000,000<ref name="HyperCLOVA"/> || 300,000,000,000<ref name="A Survey of Large"/>–560,000,000,000<ref name="What Changes"/> tokens || LLM launch || HyperCLOVA is introduced as a large-scale Korean contextual learning model.<ref name="What Changes">{{cite journal |last1=Kim |first1=Boseop |last2=Kim |first2=HyoungSeok |last3=Lee |first3=Sang-Woo |last4=Lee |first4=Gichang |last5=Kwak |first5=Donghyun |last6=Dong Hyeon |first6=Jeon |last7=Park |first7=Sunghyun |last8=Kim |first8=Sungju |last9=Kim |first9=Seonhoon |last10=Seo |first10=Dongpil |last11=Lee |first11=Heungsub |last12=Jeong |first12=Minyoung |last13=Lee |first13=Sungjae |last14=Kim |first14=Minsub |last15=Ko |first15=Suk Hyun |last16=Kim |first16=Seokhun |last17=Park |first17=Taeyong |last18=Kim |first18=Jinuk |last19=Kang |first19=Soyoung |last20=Ryu |first20=Na-Hyeon |last21=Yoo |first21=Kang Min |last22=Chang |first22=Minsuk |last23=Suh |first23=Soobin |last24=In |first24=Sookyo |last25=Park |first25=Jinseong |last26=Kim |first26=Kyungduk |last27=Kim |first27=Hiun |last28=Jeong |first28=Jisu |last29=Yeo |first29=Yong Goo |last30=Ham |first30=Donghoon |last31=Park |first31=Dongju |last32=Lee |first32=Min Young |last33=Kang |first33=Jaewook |last34=Kang |first34=Inho |last35=Ha |first35=Jung-Woo |last36=Park |first36=Woomyoung |last37=Sung |first37=Nako |title=What Changes Can Large-scale Language Models Bring? 
Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers |date=2021 |pages=3405–3424 |doi=10.18653/v1/2021.emnlp-main.274}}</ref> HyperCLOVA's extensive parameters enhance its ability to distinguish speech nuances and dialects. It learned from 6,500 times more Korean data than GPT-3, predominantly focusing on the Korean language (97%). HyperCLOVA's applications include human conversation processing, translation, summarization, and machine reading, offering diverse AI possibilities and fostering new service and business opportunities.<ref name="HyperCLOVA">{{cite web |last1=Demo |first1=GPT-3 |title=HyperCLOVA {{!}} Discover AI use cases |url=https://gpt3demo.com/apps/hyperclova |website=gpt3demo.com |access-date=20 October 2023 |language=en}}</ref> |
|- | |- | ||
− | | 2021 || October 11 || MT-NLG || 530,000,000,000<ref name="Kazi"/> || >825,000,000,000 bytes<ref name="Dr Alan D. Thompson"/> || LLM launch || MT-NLG (Megatron-Turing Natural Language Generation) is introduced as a language model developed jointly by {{w|Nvidia}} and {{w|Microsoft}}. It utilizes the architecture of the Megatron transformer-based model and has a record-breaking size of 530 billion parameters. MT-NLG is designed to generate coherent and contextually relevant text for various natural language processing tasks such as completion prediction, reading comprehension, commonsense reasoning, and word sense disambiguation. Training such large-scale models is challenging due to memory constraints and long training times, but innovations in hardware, software, and training methods have made it feasible. MT-NLG achieves state-of-the-art results in zero-shot, one-shot, and few-shot settings across multiple NLP tasks.<ref>{{cite web |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model |url=https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ |website=NVIDIA Technical Blog |access-date=30 June 2023 |date=11 October 2021}}</ref> | + | | 2021 || October 10 || Yuan 1.0 || 245,000,000,000<ref name="A Survey of Large"/> || 180,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || Yuan 1.0 is introduced as a significant advancement in large-scale pre-trained language models for zero-shot and few-shot learning, addressing challenges faced by models like GPT-3 due to enormous computational demands. By integrating distributed training performance into model architecture, Yuan 1.0, boasting 245B parameters, achieves remarkable results across NLP tasks on thousands of GPUs. 
The approach includes efficient data processing to filter extensive raw data, resulting in a high-quality Chinese corpus of 5TB texts. Calibration and label expansion methods enhance zero-shot and few-shot performance, ensuring accurate task execution. Yuan 1.0 excels in natural language generation, producing articles nearly indistinguishable from human-written ones.<ref>{{cite journal |last1=Wu |first1=Shaohua |last2=Zhao |first2=Xudong |last3=Yu |first3=Tong |last4=Zhang |first4=Rongguo |last5=Shen |first5=Chong |last6=Liu |first6=Hongli |last7=Li |first7=Feng |last8=Zhu |first8=Hong |last9=Luo |first9=Jiangang |last10=Xu |first10=Liang |last11=Zhang |first11=Xuanwei |title=Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning |date=2021 |doi=10.48550/arXiv.2110.04725}}</ref> |
+ | |- | ||
+ | | 2021 || October 11 || MT-NLG || 530,000,000,000<ref name="Kazi"/><ref name="A Survey of Large"/> || >825,000,000,000 bytes<ref name="Dr Alan D. Thompson"/>, 270,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || MT-NLG (Megatron-Turing Natural Language Generation) is introduced as a language model developed jointly by {{w|Nvidia}} and {{w|Microsoft}}. It utilizes the architecture of the Megatron transformer-based model and has a record-breaking size of 530 billion parameters. MT-NLG is designed to generate coherent and contextually relevant text for various natural language processing tasks such as completion prediction, reading comprehension, commonsense reasoning, and word sense disambiguation. Training such large-scale models is challenging due to memory constraints and long training times, but innovations in hardware, software, and training methods have made it feasible. MT-NLG achieves state-of-the-art results in zero-shot, one-shot, and few-shot settings across multiple NLP tasks.<ref>{{cite web |title=Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World’s Largest and Most Powerful Generative Language Model |url=https://developer.nvidia.com/blog/using-deepspeed-and-megatron-to-train-megatron-turing-nlg-530b-the-worlds-largest-and-most-powerful-generative-language-model/ |website=NVIDIA Technical Blog |access-date=30 June 2023 |date=11 October 2021}}</ref> | ||
+ | |- | ||
+ | | 2021 || December 8 || Gopher || 280,000,000,000<ref name="A Survey of Large"/> || 300,000,000,000 tokens<ref name="A Survey of Large"/> || || Gopher is introduced as a 280 billion parameter Transformer-based language model, developed by Google subsidiary DeepMind. Trained on a 10.5TB corpus called MassiveText, Gopher outperforms the contemporary state of the art on 100 of 124 evaluation tasks. The model is trained alongside smaller models to explore the strengths and weaknesses of large language models (LLMs). It excels in tasks like reading comprehension and fact-checking but shows reduced benefits in logical reasoning, common sense, and mathematics tasks. The DeepMind team utilizes a custom training dataset, MassiveText, to ensure high-quality data without contaminating the training dataset with test datasets available online. Gopher is part of DeepMind's language research efforts at the time.<ref>{{cite journal |last1=Rae |first1=Jack W. |last2=Borgeaud |first2=Sebastian |last3=Cai |first3=Trevor |last4=Millican |first4=Katie |last5=Hoffmann |first5=Jordan |last6=Song |first6=Francis |last7=Aslanides |first7=John |last8=Henderson |first8=Sarah |last9=Ring |first9=Roman |last10=Young |first10=Susannah |last11=Rutherford |first11=Eliza |last12=Hennigan |first12=Tom |last13=Menick |first13=Jacob |last14=Cassirer |first14=Albin |last15=Powell |first15=Richard |last16=Driessche |first16=George van den |last17=Hendricks |first17=Lisa Anne |last18=Rauh |first18=Maribeth |last19=Huang |first19=Po-Sen |last20=Glaese |first20=Amelia |last21=Welbl |first21=Johannes |last22=Dathathri |first22=Sumanth |last23=Huang |first23=Saffron |last24=Uesato |first24=Jonathan |last25=Mellor |first25=John |last26=Higgins |first26=Irina |last27=Creswell |first27=Antonia |last28=McAleese |first28=Nat |last29=Wu |first29=Amy |last30=Elsen |first30=Erich |last31=Jayakumar |first31=Siddhant |last32=Buchatskaya |first32=Elena |last33=Budden |first33=David |last34=Sutherland 
|first34=Esme |last35=Simonyan |first35=Karen |last36=Paganini |first36=Michela |last37=Sifre |first37=Laurent |last38=Martens |first38=Lena |last39=Li |first39=Xiang Lorraine |last40=Kuncoro |first40=Adhiguna |last41=Nematzadeh |first41=Aida |last42=Gribovskaya |first42=Elena |last43=Donato |first43=Domenic |last44=Lazaridou |first44=Angeliki |last45=Mensch |first45=Arthur |last46=Lespiau |first46=Jean-Baptiste |last47=Tsimpoukelli |first47=Maria |last48=Grigorev |first48=Nikolai |last49=Fritz |first49=Doug |last50=Sottiaux |first50=Thibault |last51=Pajarskas |first51=Mantas |last52=Pohlen |first52=Toby |last53=Gong |first53=Zhitao |last54=Toyama |first54=Daniel |last55=d'Autume |first55=Cyprien de Masson |last56=Li |first56=Yujia |last57=Terzi |first57=Tayfun |last58=Mikulik |first58=Vladimir |last59=Babuschkin |first59=Igor |last60=Clark |first60=Aidan |last61=Casas |first61=Diego de Las |last62=Guy |first62=Aurelia |last63=Jones |first63=Chris |last64=Bradbury |first64=James |last65=Johnson |first65=Matthew |last66=Hechtman |first66=Blake |last67=Weidinger |first67=Laura |last68=Gabriel |first68=Iason |last69=Isaac |first69=William |last70=Lockhart |first70=Ed |last71=Osindero |first71=Simon |last72=Rimell |first72=Laura |last73=Dyer |first73=Chris |last74=Vinyals |first74=Oriol |last75=Ayoub |first75=Kareem |last76=Stanway |first76=Jeff |last77=Bennett |first77=Lorrayne |last78=Hassabis |first78=Demis |last79=Kavukcuoglu |first79=Koray |last80=Irving |first80=Geoffrey |title=Scaling Language Models: Methods, Analysis & Insights from Training Gopher |date=2021 |doi=10.48550/arXiv.2112.11446}}</ref><ref>{{cite web |title=Google Trains 280 Billion Parameter AI Language Model Gopher |url=https://www.infoq.com/news/2022/01/deepmind-gopher/ |website=InfoQ |access-date=21 October 2023 |language=en}}</ref><ref name="A Survey of Large"/> | ||
+ | |- | ||
+ | | 2021 || December 13 || GLaM || 1,200,000,000,000<ref name="A Survey of Large"/> || 280,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || GLaM (Generalist Language Model) is introduced as a family of language models. These models utilize a sparsely activated mixture-of-experts architecture to increase model capacity while significantly reducing training costs compared to dense variants. The largest GLaM model has 1.2 trillion parameters, making it approximately 7 times larger than GPT-3. Despite its size, this model consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference. Additionally, GLaM demonstrates better overall zero-shot and one-shot performance across 29 natural language processing tasks.<ref>{{cite journal |last1=Du |first1=Nan |last2=Huang |first2=Yanping |last3=Dai |first3=Andrew M. |last4=Tong |first4=Simon |last5=Lepikhin |first5=Dmitry |last6=Xu |first6=Yuanzhong |last7=Krikun |first7=Maxim |last8=Zhou |first8=Yanqi |last9=Yu |first9=Adams Wei |last10=Firat |first10=Orhan |last11=Zoph |first11=Barret |last12=Fedus |first12=Liam |last13=Bosma |first13=Maarten |last14=Zhou |first14=Zongwei |last15=Wang |first15=Tao |last16=Wang |first16=Yu Emma |last17=Webster |first17=Kellie |last18=Pellat |first18=Marie |last19=Robinson |first19=Kevin |last20=Meier-Hellstern |first20=Kathleen |last21=Duke |first21=Toju |last22=Dixon |first22=Lucas |last23=Zhang |first23=Kun |last24=Le |first24=Quoc V |last25=Wu |first25=Yonghui |last26=Chen |first26=Zhifeng |last27=Cui |first27=Claire |title=GLaM: Efficient Scaling of Language Models with Mixture-of-Experts |date=2021 |doi=10.48550/arXiv.2112.06905}}</ref> | ||
+ | |- | ||
+ | | 2021 || December 16 || WebGPT || || || LLM launch || OpenAI introduces their WebGPT project, which enhances GPT-3's factual accuracy by incorporating a text-based web browser into its functionality. The model imitates human online research by issuing search queries, following links, and citing sources to answer open-ended questions. Trained to address the tendency of language models to generate incorrect information, WebGPT allows commands like "Search..." and "Find in page:..." to gather information from web pages. The model undergoes fine-tuning through methods involving human demonstrations and training a reward model, aiming to create more accurate and truthful AI responses.<ref>{{cite web |title=WebGPT: Improving the factual accuracy of language models through web browsing |url=https://openai.com/research/webgpt |website=openai.com |access-date=21 October 2023}}</ref> | ||
|- | |- | ||
| 2021 || December || Fairseq || 13,000,000,000 – 1,000,000,000,000 || 453,000,000,000 bytes<ref name="Dr Alan D. Thompson"/> || LLM launch || {{w|Meta AI}}, previously known as FAIR (Facebook AI Research), announces the introduction of Fairseq, a language model with parameters of 13B and 1.1T. Fairseq is not related to Megatron, and the two use different technologies for training. Fairseq's dataset sources include the same ones used for RoBERTa (English Wikipedia, BookCorpus, CC-News, OpenWebText/Reddit upvoted, and Stories) with the new addition of English CC100 in Wikipedia style from Jan/2018-Dec/2018, resulting in a total dataset size of 453GB. Fairseq was trained using 2,363 GPU-days with 1,024 GPUs, taking approximately three days.<ref name="lifearchitect.ai"/><ref>{{cite web |title=fairseq documentation — fairseq 0.12.2 documentation |url=https://fairseq.readthedocs.io/en/latest/ |website=fairseq.readthedocs.io |access-date=16 May 2023}}</ref> | | 2021 || December || Fairseq || 13,000,000,000 – 1,000,000,000,000 || 453,000,000,000 bytes<ref name="Dr Alan D. Thompson"/> || LLM launch || {{w|Meta AI}}, previously known as FAIR (Facebook AI Research), announces the introduction of Fairseq, a language model with parameters of 13B and 1.1T. Fairseq is not related to Megatron, and the two use different technologies for training. Fairseq's dataset sources include the same ones used for RoBERTa (English Wikipedia, BookCorpus, CC-News, OpenWebText/Reddit upvoted, and Stories) with the new addition of English CC100 in Wikipedia style from Jan/2018-Dec/2018, resulting in a total dataset size of 453GB. Fairseq was trained using 2,363 GPU-days with 1,024 GPUs, taking approximately three days.<ref name="lifearchitect.ai"/><ref>{{cite web |title=fairseq documentation — fairseq 0.12.2 documentation |url=https://fairseq.readthedocs.io/en/latest/ |website=fairseq.readthedocs.io |access-date=16 May 2023}}</ref> | ||
Line 120: | Line 128: | ||
| 2022 || January 19 || CM3 || || || LLM launch || A paper introduces CM3, a family of causally masked generative models trained on large-scale web and Wikipedia articles containing text and image tokens. The new approach generates tokens left to right while masking out a small number of long token spans that are generated at the end of the string. This provides a hybrid of the more common causal and masked language models, allowing for full generative modeling while providing bidirectional context when generating the masked spans. The resulting CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts and implicitly learn a wide range of text, image, and cross-modal tasks. The paper also reports state-of-the-art performance in zero-shot summarization, entity linking, and entity disambiguation, while maintaining competitive performance in the fine-tuning setting.<ref>{{cite journal |last1=Aghajanyan |first1=Armen |last2=Huang |first2=Bernie |last3=Ross |first3=Candace |last4=Karpukhin |first4=Vladimir |last5=Xu |first5=Hu |last6=Goyal |first6=Naman |last7=Okhonko |first7=Dmytro |last8=Joshi |first8=Mandar |last9=Ghosh |first9=Gargi |last10=Lewis |first10=Mike |last11=Zettlemoyer |first11=Luke |title=CM3: A Causal Masked Multimodal Model of the Internet |date=2022 |doi=10.48550/arXiv.2201.07520}}</ref> | | 2022 || January 19 || CM3 || || || LLM launch || A paper introduces CM3, a family of causally masked generative models trained on large-scale web and Wikipedia articles containing text and image tokens. The new approach generates tokens left to right while masking out a small number of long token spans that are generated at the end of the string. This provides a hybrid of the more common causal and masked language models, allowing for full generative modeling while providing bidirectional context when generating the masked spans. 
The resulting CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts and implicitly learn a wide range of text, image, and cross-modal tasks. The paper also reports state-of-the-art performance in zero-shot summarization, entity linking, and entity disambiguation, while maintaining competitive performance in the fine-tuning setting.<ref>{{cite journal |last1=Aghajanyan |first1=Armen |last2=Huang |first2=Bernie |last3=Ross |first3=Candace |last4=Karpukhin |first4=Vladimir |last5=Xu |first5=Hu |last6=Goyal |first6=Naman |last7=Okhonko |first7=Dmytro |last8=Joshi |first8=Mandar |last9=Ghosh |first9=Gargi |last10=Lewis |first10=Mike |last11=Zettlemoyer |first11=Luke |title=CM3: A Causal Masked Multimodal Model of the Internet |date=2022 |doi=10.48550/arXiv.2201.07520}}</ref> | ||
|- | |- | ||
− | | 2022 || January 27 || InstructGPT || | + | | 2022 || January 27 || InstructGPT || 1,300,000,000–175,000,000,000<ref name="A Survey of Large"/> || || LLM launch || OpenAI announces the deployment of InstructGPT, a new language model that it describes as safer, more helpful, and better aligned with user intent. The model is trained using reinforcement learning from human feedback (RLHF) and is significantly better at following instructions than the previous model, GPT-3. InstructGPT is also less toxic and generates fewer false facts than its predecessor. The company believes that fine-tuning language models with humans in the loop is a powerful tool for improving their safety and reliability. InstructGPT becomes the default language model accessible on OpenAI's API.<ref>{{cite web |title=Aligning language models to follow instructions |url=https://openai.com/research/instruction-following |website=openai.com |access-date=21 March 2023}}</ref> |
+ | |- | ||
+ | | 2022 || February || AlphaCode || 41,000,000,000<ref name="A Survey of Large"/> || 967,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || AlphaCode is introduced as an AI system created by DeepMind that performs better than 50% of humans on a set of competitive programming challenges.<ref>{{cite web |title=Finally, an AI bot that can ace technical interview questions (Ep. 417) - Stack Overflow |url=https://stackoverflow.blog/2022/02/22/alphacode-deep-mind-ai-programming-technical-interviews/ |website=stackoverflow.blog |access-date=21 October 2023 |language=en |date=22 February 2022}}</ref><ref name="A Survey of Large"/> | ||
|- | |- | ||
| 2022 || February 28 || Extremely Large || || || LLM launch || Cohere launches a new beta version of their language generation model called "Extremely Large", which, according to Cohere, outperforms their existing largest model, Large, on various tasks such as sentiment analysis, named entity recognition (NER), and common sense reasoning.<ref>{{cite web |title=Cohere launches Extremely Large (beta) |url=https://txt.cohere.ai/cohere-launches-extremely-large-beta-2/ |website=Context by Cohere |access-date=12 March 2023 |language=en |date=1 March 2022}}</ref> | | 2022 || February 28 || Extremely Large || || || LLM launch || Cohere launches a new beta version of their language generation model called "Extremely Large", which, according to Cohere, outperforms their existing largest model, Large, on various tasks such as sentiment analysis, named entity recognition (NER), and common sense reasoning.<ref>{{cite web |title=Cohere launches Extremely Large (beta) |url=https://txt.cohere.ai/cohere-launches-extremely-large-beta-2/ |website=Context by Cohere |access-date=12 March 2023 |language=en |date=1 March 2022}}</ref> | ||
Line 130: | Line 140: | ||
| 2022 || March 29 || || || || Programming/training || A paper investigates the optimal model size and number of tokens for training a transformer language model under a given compute budget. The researchers find that, at this time, large language models are significantly undertrained, and the model size and the number of training tokens should be scaled equally for compute-optimal training. They test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4x more data. Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a range of downstream evaluation tasks and reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, more than a 7% improvement over Gopher.<ref>{{cite journal |last1=Hoffmann |first1=Jordan |last2=Borgeaud |first2=Sebastian |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan |last16=Guy |first16=Aurelia |last17=Osindero |first17=Simon |last18=Simonyan |first18=Karen |last19=Elsen |first19=Erich |last20=Rae |first20=Jack W. |last21=Vinyals |first21=Oriol |last22=Sifre |first22=Laurent |title=Training Compute-Optimal Large Language Models |date=2022 |doi=10.48550/arXiv.2203.15556}}</ref> | | 2022 || March 29 || || || || Programming/training || A paper investigates the optimal model size and number of tokens for training a transformer language model under a given compute budget. 
The researchers find that, at this time, large language models are significantly undertrained, and the model size and the number of training tokens should be scaled equally for compute-optimal training. They test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4x more data. Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a range of downstream evaluation tasks and reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, more than a 7% improvement over Gopher.<ref>{{cite journal |last1=Hoffmann |first1=Jordan |last2=Borgeaud |first2=Sebastian |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan |last16=Guy |first16=Aurelia |last17=Osindero |first17=Simon |last18=Simonyan |first18=Karen |last19=Elsen |first19=Erich |last20=Rae |first20=Jack W. |last21=Vinyals |first21=Oriol |last22=Sifre |first22=Laurent |title=Training Compute-Optimal Large Language Models |date=2022 |doi=10.48550/arXiv.2203.15556}}</ref> | ||
|- | |- | ||
− | | 2022 || April 5 || PaLM || 540,000,000,000<ref name="Kazi"/> || || LLM launch || A paper presents PaLM, a 540-billion parameter language model trained using Pathways, a new {{w|machine learning}} system that enables highly efficient training across multiple TPU Pods. PaLM achieves state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks and outperforms the finetuned state-of-the-art on a suite of multi-step reasoning tasks. It also outperforms average human performance on the BIG-bench benchmark. Additionally, PaLM has strong capabilities in multilingual tasks and source code generation. The paper also discusses bias and toxicity and potential mitigation strategies.<ref>{{cite journal |last1=Chowdhery |first1=Aakanksha |last2=Narang |first2=Sharan |last3=Devlin |first3=Jacob |last4=Bosma |first4=Maarten |last5=Mishra |first5=Gaurav |last6=Roberts |first6=Adam |last7=Barham |first7=Paul |last8=Chung |first8=Hyung Won |last9=Sutton |first9=Charles |last10=Gehrmann |first10=Sebastian |last11=Schuh |first11=Parker |last12=Shi |first12=Kensen |last13=Tsvyashchenko |first13=Sasha |last14=Maynez |first14=Joshua |last15=Rao |first15=Abhishek |last16=Barnes |first16=Parker |last17=Tay |first17=Yi |last18=Shazeer |first18=Noam |last19=Prabhakaran |first19=Vinodkumar |last20=Reif |first20=Emily |last21=Du |first21=Nan |last22=Hutchinson |first22=Ben |last23=Pope |first23=Reiner |last24=Bradbury |first24=James |last25=Austin |first25=Jacob |last26=Isard |first26=Michael |last27=Gur-Ari |first27=Guy |last28=Yin |first28=Pengcheng |last29=Duke |first29=Toju |last30=Levskaya |first30=Anselm |last31=Ghemawat |first31=Sanjay |last32=Dev |first32=Sunipa |last33=Michalewski |first33=Henryk |last34=Garcia |first34=Xavier |last35=Misra |first35=Vedant |last36=Robinson |first36=Kevin |last37=Fedus |first37=Liam |last38=Zhou |first38=Denny |last39=Ippolito |first39=Daphne |last40=Luan |first40=David |last41=Lim |first41=Hyeontaek 
|last42=Zoph |first42=Barret |last43=Spiridonov |first43=Alexander |last44=Sepassi |first44=Ryan |last45=Dohan |first45=David |last46=Agrawal |first46=Shivani |last47=Omernick |first47=Mark |last48=Dai |first48=Andrew M. |last49=Pillai |first49=Thanumalayan Sankaranarayana |last50=Pellat |first50=Marie |last51=Lewkowycz |first51=Aitor |last52=Moreira |first52=Erica |last53=Child |first53=Rewon |last54=Polozov |first54=Oleksandr |last55=Lee |first55=Katherine |last56=Zhou |first56=Zongwei |last57=Wang |first57=Xuezhi |last58=Saeta |first58=Brennan |last59=Diaz |first59=Mark |last60=Firat |first60=Orhan |last61=Catasta |first61=Michele |last62=Wei |first62=Jason |last63=Meier-Hellstern |first63=Kathy |last64=Eck |first64=Douglas |last65=Dean |first65=Jeff |last66=Petrov |first66=Slav |last67=Fiedel |first67=Noah |title=PaLM: Scaling Language Modeling with Pathways |date=2022 |doi=10.48550/arXiv.2204.02311}}</ref><ref>{{cite web |title=Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance |url=https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html |website=ai.googleblog.com |access-date=21 March 2023 |language=en}}</ref> | + | | 2022 || March 29 || Chinchilla || 70,000,000,000<ref name="A Survey of Large"/> || 1,400,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || Chinchilla is introduced by DeepMind to address the optimal training of large language models under a specific computational budget. DeepMind's research shows that existing large language models are undertrained due to a focus on scaling models while keeping the training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they found that for optimal training, both model size and the number of training tokens should be scaled equally. 
Chinchilla, a model with 70 billion parameters trained on 1.4 trillion tokens, outperforms larger models like Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG, achieving superior performance on various evaluation tasks while using substantially fewer computational resources for fine-tuning and inference.<ref>{{cite journal |last1=Hoffmann |first1=Jordan |last2=Borgeaud |first2=Sebastian |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan |last16=Guy |first16=Aurelia |last17=Osindero |first17=Simon |last18=Simonyan |first18=Karen |last19=Elsen |first19=Erich |last20=Rae |first20=Jack W. |last21=Vinyals |first21=Oriol |last22=Sifre |first22=Laurent |title=Training Compute-Optimal Large Language Models |date=2022 |doi=10.48550/arXiv.2203.15556}}</ref> |
+ | |- | ||
+ | | 2022 || April 5 || PaLM || 540,000,000,000<ref name="Kazi"/><ref name="A Survey of Large"/> || 780,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || A paper presents PaLM, a 540-billion parameter language model trained using Pathways, a new {{w|machine learning}} system that enables highly efficient training across multiple TPU Pods. PaLM achieves state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks and outperforms the finetuned state-of-the-art on a suite of multi-step reasoning tasks. It also outperforms average human performance on the BIG-bench benchmark. Additionally, PaLM has strong capabilities in multilingual tasks and source code generation. The paper also discusses bias and toxicity and potential mitigation strategies.<ref>{{cite journal |last1=Chowdhery |first1=Aakanksha |last2=Narang |first2=Sharan |last3=Devlin |first3=Jacob |last4=Bosma |first4=Maarten |last5=Mishra |first5=Gaurav |last6=Roberts |first6=Adam |last7=Barham |first7=Paul |last8=Chung |first8=Hyung Won |last9=Sutton |first9=Charles |last10=Gehrmann |first10=Sebastian |last11=Schuh |first11=Parker |last12=Shi |first12=Kensen |last13=Tsvyashchenko |first13=Sasha |last14=Maynez |first14=Joshua |last15=Rao |first15=Abhishek |last16=Barnes |first16=Parker |last17=Tay |first17=Yi |last18=Shazeer |first18=Noam |last19=Prabhakaran |first19=Vinodkumar |last20=Reif |first20=Emily |last21=Du |first21=Nan |last22=Hutchinson |first22=Ben |last23=Pope |first23=Reiner |last24=Bradbury |first24=James |last25=Austin |first25=Jacob |last26=Isard |first26=Michael |last27=Gur-Ari |first27=Guy |last28=Yin |first28=Pengcheng |last29=Duke |first29=Toju |last30=Levskaya |first30=Anselm |last31=Ghemawat |first31=Sanjay |last32=Dev |first32=Sunipa |last33=Michalewski |first33=Henryk |last34=Garcia |first34=Xavier |last35=Misra |first35=Vedant |last36=Robinson |first36=Kevin |last37=Fedus |first37=Liam |last38=Zhou |first38=Denny 
|last39=Ippolito |first39=Daphne |last40=Luan |first40=David |last41=Lim |first41=Hyeontaek |last42=Zoph |first42=Barret |last43=Spiridonov |first43=Alexander |last44=Sepassi |first44=Ryan |last45=Dohan |first45=David |last46=Agrawal |first46=Shivani |last47=Omernick |first47=Mark |last48=Dai |first48=Andrew M. |last49=Pillai |first49=Thanumalayan Sankaranarayana |last50=Pellat |first50=Marie |last51=Lewkowycz |first51=Aitor |last52=Moreira |first52=Erica |last53=Child |first53=Rewon |last54=Polozov |first54=Oleksandr |last55=Lee |first55=Katherine |last56=Zhou |first56=Zongwei |last57=Wang |first57=Xuezhi |last58=Saeta |first58=Brennan |last59=Diaz |first59=Mark |last60=Firat |first60=Orhan |last61=Catasta |first61=Michele |last62=Wei |first62=Jason |last63=Meier-Hellstern |first63=Kathy |last64=Eck |first64=Douglas |last65=Dean |first65=Jeff |last66=Petrov |first66=Slav |last67=Fiedel |first67=Noah |title=PaLM: Scaling Language Modeling with Pathways |date=2022 |doi=10.48550/arXiv.2204.02311}}</ref><ref>{{cite web |title=Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance |url=https://ai.googleblog.com/2022/04/pathways-language-model-palm-scaling-to.html |website=ai.googleblog.com |access-date=21 March 2023 |language=en}}</ref> | ||
|- | |- | ||
| 2022 || April 12 || || || || Programming/training || A paper describes a method for training language models to act as helpful and harmless assistants using {{w|reinforcement learning}} from human feedback. The authors demonstrate that this alignment training improves performance on almost all natural language processing evaluations and is compatible with training for specialized skills such as python coding and summarization. They explore an iterated online mode of training and investigate the robustness of the approach, identifying a linear relationship between the RL reward and the square root of the {{w|Kullback–Leibler divergence}} between the policy and its initialization. The authors also perform peripheral analyses and provide samples from their models using prompts from recent related work.<ref>{{cite journal |last1=Bai |first1=Yuntao |last2=Jones |first2=Andy |last3=Ndousse |first3=Kamal |last4=Askell |first4=Amanda |last5=Chen |first5=Anna |last6=DasSarma |first6=Nova |last7=Drain |first7=Dawn |last8=Fort |first8=Stanislav |last9=Ganguli |first9=Deep |last10=Henighan |first10=Tom |last11=Joseph |first11=Nicholas |last12=Kadavath |first12=Saurav |last13=Kernion |first13=Jackson |last14=Conerly |first14=Tom |last15=El-Showk |first15=Sheer |last16=Elhage |first16=Nelson |last17=Hatfield-Dodds |first17=Zac |last18=Hernandez |first18=Danny |last19=Hume |first19=Tristan |last20=Johnston |first20=Scott |last21=Kravec |first21=Shauna |last22=Lovitt |first22=Liane |last23=Nanda |first23=Neel |last24=Olsson |first24=Catherine |last25=Amodei |first25=Dario |last26=Brown |first26=Tom |last27=Clark |first27=Jack |last28=McCandlish |first28=Sam |last29=Olah |first29=Chris |last30=Mann |first30=Ben |last31=Kaplan |first31=Jared |title=Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |date=2022 |doi=10.48550/arXiv.2204.05862}}</ref> | | 2022 || April 12 || || || || Programming/training || A paper describes a method for training 
language models to act as helpful and harmless assistants using {{w|reinforcement learning}} from human feedback. The authors demonstrate that this alignment training improves performance on almost all natural language processing evaluations and is compatible with training for specialized skills such as python coding and summarization. They explore an iterated online mode of training and investigate the robustness of the approach, identifying a linear relationship between the RL reward and the square root of the {{w|Kullback–Leibler divergence}} between the policy and its initialization. The authors also perform peripheral analyses and provide samples from their models using prompts from recent related work.<ref>{{cite journal |last1=Bai |first1=Yuntao |last2=Jones |first2=Andy |last3=Ndousse |first3=Kamal |last4=Askell |first4=Amanda |last5=Chen |first5=Anna |last6=DasSarma |first6=Nova |last7=Drain |first7=Dawn |last8=Fort |first8=Stanislav |last9=Ganguli |first9=Deep |last10=Henighan |first10=Tom |last11=Joseph |first11=Nicholas |last12=Kadavath |first12=Saurav |last13=Kernion |first13=Jackson |last14=Conerly |first14=Tom |last15=El-Showk |first15=Sheer |last16=Elhage |first16=Nelson |last17=Hatfield-Dodds |first17=Zac |last18=Hernandez |first18=Danny |last19=Hume |first19=Tristan |last20=Johnston |first20=Scott |last21=Kravec |first21=Shauna |last22=Lovitt |first22=Liane |last23=Nanda |first23=Neel |last24=Olsson |first24=Catherine |last25=Amodei |first25=Dario |last26=Brown |first26=Tom |last27=Clark |first27=Jack |last28=McCandlish |first28=Sam |last29=Olah |first29=Chris |last30=Mann |first30=Ben |last31=Kaplan |first31=Jared |title=Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |date=2022 |doi=10.48550/arXiv.2204.05862}}</ref> | ||
Line 138: | Line 150: | ||
| 2022 || April || DALL-E 2 || 3,500,000,000 || || LLM launch || {{w|OpenAI}} unveils DALL-E 2, a successor to their original DALL-E model, designed for generating highly realistic images at resolutions up to 1024x1024. Unlike its predecessor, DALL-E 2 utilizes a diffusion model, enabling the creation of images with four times the resolution of DALL-E. OpenAI extends customization options, allowing users to specify styles like pixel art or oil paintings. DALL-E 2 introduces 'outpainting,' enabling users to extend existing images creatively. This innovation would spark significant interest in the field of generative AI, especially for tasks beyond image generation, such as interpolation and manipulation. The model's working mechanism involves a text encoder, 'prior' model, and image decoder, simplifying complex processes underlying its image generation capabilities.<ref>{{cite web |title=Comparing AI models : DALLE and Stable Diffusion |url=https://www.linkedin.com/pulse/comparing-ai-models-dalle-stable-diffusion-sonal-agrawal/ |website=www.linkedin.com |access-date=16 October 2023 |language=en}}</ref><ref>{{cite web |last1=Howell |first1=James |title=What is Dall-E and How Does it Work? What is Dall-E and How Does it Work? |url=https://101blockchains.com/dall-e-explained/ |website=101 Blockchains |access-date=16 October 2023 |date=22 September 2023}}</ref><ref>{{cite web |title=What is Dall-E (Dall-E 2) and How Does it Work? |url=https://www.techtarget.com/searchenterpriseai/definition/Dall-E |website=Enterprise AI |access-date=16 October 2023 |language=en}}</ref><ref>{{cite web |last1=Gonsalves |first1=Robert A. 
|title=Exploring DALL-E for Digital Art Creation |url=https://towardsdatascience.com/exploring-dall-e-for-digital-art-creation-b244e1a2ed12 |website=Medium |access-date=16 October 2023 |language=en |date=5 September 2023}}</ref> | | 2022 || April || DALL-E 2 || 3,500,000,000 || || LLM launch || {{w|OpenAI}} unveils DALL-E 2, a successor to their original DALL-E model, designed for generating highly realistic images at resolutions up to 1024x1024. Unlike its predecessor, DALL-E 2 utilizes a diffusion model, enabling the creation of images with four times the resolution of DALL-E. OpenAI extends customization options, allowing users to specify styles like pixel art or oil paintings. DALL-E 2 introduces 'outpainting,' enabling users to extend existing images creatively. This innovation would spark significant interest in the field of generative AI, especially for tasks beyond image generation, such as interpolation and manipulation. The model's working mechanism involves a text encoder, 'prior' model, and image decoder, simplifying complex processes underlying its image generation capabilities.<ref>{{cite web |title=Comparing AI models : DALLE and Stable Diffusion |url=https://www.linkedin.com/pulse/comparing-ai-models-dalle-stable-diffusion-sonal-agrawal/ |website=www.linkedin.com |access-date=16 October 2023 |language=en}}</ref><ref>{{cite web |last1=Howell |first1=James |title=What is Dall-E and How Does it Work? What is Dall-E and How Does it Work? |url=https://101blockchains.com/dall-e-explained/ |website=101 Blockchains |access-date=16 October 2023 |date=22 September 2023}}</ref><ref>{{cite web |title=What is Dall-E (Dall-E 2) and How Does it Work? |url=https://www.techtarget.com/searchenterpriseai/definition/Dall-E |website=Enterprise AI |access-date=16 October 2023 |language=en}}</ref><ref>{{cite web |last1=Gonsalves |first1=Robert A. 
|title=Exploring DALL-E for Digital Art Creation |url=https://towardsdatascience.com/exploring-dall-e-for-digital-art-creation-b244e1a2ed12 |website=Medium |access-date=16 October 2023 |language=en |date=5 September 2023}}</ref> | ||
|- | |- | ||
− | | 2022 || May 3 || OPT || 175,000,000,000<ref name="A Survey of Large"/> || 180,000,000,000<ref name="A Survey of Large"/> || LLM launch || Meta AI introduces Open Pretrained Transformer-175B (OPT-175B), a language model designed to democratize access to large-scale language models. By this time, these models, with over 100 billion parameters, have revolutionized NLP and AI research. OPT-175B is released with both pretrained models and code for training and usage, under a noncommercial license for research purposes. It aims to make these models accessible to academic, governmental, civil society, and industry researchers worldwide. Meta AI emphasizes responsible AI and provides documentation, compute efficiency, and smaller-scale baseline models for analysis.<ref>{{cite web |title=Democratizing access to large-scale language models with OPT-175B |url=https://ai.meta.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ |website=ai.meta.com |access-date=20 September 2023 |language=en}}</ref> | + | | 2022 || May 3 || OPT || 175,000,000,000<ref name="A Survey of Large"/> || 180,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || Meta AI introduces Open Pretrained Transformer-175B (OPT-175B), a language model designed to democratize access to large-scale language models. By this time, these models, with over 100 billion parameters, have revolutionized NLP and AI research. OPT-175B is released with both pretrained models and code for training and usage, under a noncommercial license for research purposes. It aims to make these models accessible to academic, governmental, civil society, and industry researchers worldwide. 
Meta AI emphasizes responsible AI and provides documentation, compute efficiency, and smaller-scale baseline models for analysis.<ref>{{cite web |title=Democratizing access to large-scale language models with OPT-175B |url=https://ai.meta.com/blog/democratizing-access-to-large-scale-language-models-with-opt-175b/ |website=ai.meta.com |access-date=20 September 2023 |language=en}}</ref> |
|- | |- | ||
| 2022 || May 10 || UL2 || 20,000,000,000<ref name="A Survey of Large"/> || 1,000,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || UL2 is introduced as a unified framework for pre-training models that excel across various datasets and setups. It dissects architectural archetypes and pre-training objectives, offering a generalized view of self-supervision in NLP. The paper proposes Mixture-of-Denoisers (MoD), a method combining diverse pre-training paradigms. UL2 achieves superior performance, surpassing T5 and GPT-like models across multiple contexts. With 20B parameters, it outperforms GPT-3 on zero-shot SuperGLUE and triples T5-XXL's one-shot summarization performance. UL2 also excels in chain-of-thought prompting and reasoning, making it ideal for medium-scale reasoning research. FLAN instruction tuning enhances its scores, and model checkpoints are released for further research.<ref>{{cite journal |last1=Tay |first1=Yi |last2=Dehghani |first2=Mostafa |last3=Tran |first3=Vinh Q. |last4=García |first4=Xavier |last5=Wei |first5=Jason |last6=Wang |first6=Xuezhi |last7=Chung |first7=Hyung Won |last8=Bahri |first8=Dara |last9=Schuster |first9=Tal |last10=Zheng |first10=H. |last11=Zhou |first11=Denny |last12=Houlsby |first12=N. |last13=Metzler |first13=Donald |title=UL2: Unifying Language Learning Paradigms |date=10 May 2022 |url=https://www.semanticscholar.org/paper/UL2%3A-Unifying-Language-Learning-Paradigms-Tay-Dehghani/b21670e8061a06ab97e7d6052c9345a326e84ff8}}</ref> | | 2022 || May 10 || UL2 || 20,000,000,000<ref name="A Survey of Large"/> || 1,000,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || UL2 is introduced as a unified framework for pre-training models that excel across various datasets and setups. It dissects architectural archetypes and pre-training objectives, offering a generalized view of self-supervision in NLP. The paper proposes Mixture-of-Denoisers (MoD), a method combining diverse pre-training paradigms. 
UL2 achieves superior performance, surpassing T5 and GPT-like models across multiple contexts. With 20B parameters, it outperforms GPT-3 on zero-shot SuperGLUE and triples T5-XXL's one-shot summarization performance. UL2 also excels in chain-of-thought prompting and reasoning, making it ideal for medium-scale reasoning research. FLAN instruction tuning enhances its scores, and model checkpoints are released for further research.<ref>{{cite journal |last1=Tay |first1=Yi |last2=Dehghani |first2=Mostafa |last3=Tran |first3=Vinh Q. |last4=García |first4=Xavier |last5=Wei |first5=Jason |last6=Wang |first6=Xuezhi |last7=Chung |first7=Hyung Won |last8=Bahri |first8=Dara |last9=Schuster |first9=Tal |last10=Zheng |first10=H. |last11=Zhou |first11=Denny |last12=Houlsby |first12=N. |last13=Metzler |first13=Donald |title=UL2: Unifying Language Learning Paradigms |date=10 May 2022 |url=https://www.semanticscholar.org/paper/UL2%3A-Unifying-Language-Learning-Paradigms-Tay-Dehghani/b21670e8061a06ab97e7d6052c9345a326e84ff8}}</ref> | ||
Line 147: | Line 159: | ||
|- | |- | ||
| 2022 || July 6 || NLLB-200 || 54,500,000,000<ref name="A Survey of Large"/> || || LLM launch || [[w:Meta platforms|Meta]] unveils NLLB-200, which is capable of translating 200 languages with a remarkable 44% improvement in accuracy compared to previous technology. This advancement addresses the digital accessibility gap for billions, especially in Africa and Asia, where many languages lack high-quality translation tools. Meta's FLORES-200, a dataset for evaluating NLLB-200's performance, is also opened to developers. Additionally, Meta offers grants for impactful NLLB-200 applications, supporting areas like sustainability and education.<ref name="">{{cite web |title=New AI Model Translates 200 Languages, Making Technology Accessible to More People |url=https://about.fb.com/news/2022/07/new-meta-ai-model-translates-200-languages-making-technology-more-accessible/ |website=Meta |access-date=19 October 2023 |date=6 July 2022}}</ref> | | 2022 || July 6 || NLLB-200 || 54,500,000,000<ref name="A Survey of Large"/> || || LLM launch || [[w:Meta platforms|Meta]] unveils NLLB-200, which is capable of translating 200 languages with a remarkable 44% improvement in accuracy compared to previous technology. This advancement addresses the digital accessibility gap for billions, especially in Africa and Asia, where many languages lack high-quality translation tools. Meta's FLORES-200, a dataset for evaluating NLLB-200's performance, is also opened to developers. Additionally, Meta offers grants for impactful NLLB-200 applications, supporting areas like sustainability and education.<ref name="">{{cite web |title=New AI Model Translates 200 Languages, Making Technology Accessible to More People |url=https://about.fb.com/news/2022/07/new-meta-ai-model-translates-200-languages-making-technology-more-accessible/ |website=Meta |access-date=19 October 2023 |date=6 July 2022}}</ref> | ||
+ | |- | ||
+ | | 2022 || August || AlexaTM || 20,000,000,000<ref name="A Survey of Large"/> || 1,300,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || Amazon's Alexa AI labs introduces AlexaTM. Despite its seemingly modest 20 billion parameters compared to larger models, its unique encoder-decoder architecture distinguishes it. Unlike decoder-only models like GPT-3, AlexaTM 20B's encoder produces input representations for the decoder, enhancing its efficiency in tasks like machine translation and text summarization, where it outperforms GPT-3. This model marks a leap in few-shot learning, showcasing Amazon's innovation in NLU research.<ref>{{cite web |last1=Rodriguez |first1=Jesus |title=AlexaTM 20B is Amazon’s New Language Super Model Which is Also Capable of Few-Shot Learning |url=https://jrodthoughts.medium.com/alexatm-20b-is-amazons-new-language-super-model-which-is-also-capable-of-few-shot-learning-b045ddff1677 |website=Medium |access-date=21 October 2023 |language=en |date=15 August 2022}}</ref> | ||
|- | |- | ||
| 2022 || September || CodeGeeX || 13,000,000,000 || 850,000,000,000 tokens || LLM launch || CodeGeeX open sources its code. It is a multilingual code generation tool with 13 billion parameters, trained on a vast code corpus of over 20 programming languages. It uses artificial intelligence to generate code based on user comments or suggest the next line of code, enhancing coding speed. Unlike Copilot, CodeGeeX is powered by AI trained on Ascend 910 processors, which, combined with MindSpore, outperform other AI training cards. CodeGeeX's generated code is editable, and it includes a Candidate feature that offers multiple code versions for users to choose from. Licensed under the Apache License 2.0, CodeGeeX matches GitHub Copilot in performance and introduces unique features for developers.<ref>{{cite web |last1=Elemuwa |first1=Fimber |title=Using CodeGeeX as a GitHub Copilot alternative |url=https://blog.logrocket.com/using-codegeex-github-copilot-alternative/ |website=LogRocket Blog |access-date=19 October 2023 |date=22 February 2023}}</ref><ref name="CodeGeeX:">{{cite journal |last1=Zheng |first1=Qinkai |last2=Xia |first2=Xiao |last3=Zou |first3=Xu |last4=Dong |first4=Yuxiao |last5=Wang |first5=Shan |last6=Xue |first6=Yufei |last7=Wang |first7=Zihan |last8=Shen |first8=Lei |last9=Wang |first9=Andi |last10=Li |first10=Yang |last11=Su |first11=Teng |last12=Yang |first12=Zhilin |last13=Tang |first13=Jie |title=CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X |date=2023 |doi=10.48550/arXiv.2303.17568}}</ref> | | 2022 || September || CodeGeeX || 13,000,000,000 || 850,000,000,000 tokens || LLM launch || CodeGeeX open sources its code. It is a multilingual code generation tool with 13 billion parameters, trained on a vast code corpus of over 20 programming languages. It uses artificial intelligence to generate code based on user comments or suggest the next line of code, enhancing coding speed. 
Unlike Copilot, CodeGeeX is powered by AI trained on Ascend 910 processors, which, combined with MindSpore, outperform other AI training cards. CodeGeeX's generated code is editable, and it includes a Candidate feature that offers multiple code versions for users to choose from. Licensed under the Apache License 2.0, CodeGeeX matches GitHub Copilot in performance and introduces unique features for developers.<ref>{{cite web |last1=Elemuwa |first1=Fimber |title=Using CodeGeeX as a GitHub Copilot alternative |url=https://blog.logrocket.com/using-codegeex-github-copilot-alternative/ |website=LogRocket Blog |access-date=19 October 2023 |date=22 February 2023}}</ref><ref name="CodeGeeX:">{{cite journal |last1=Zheng |first1=Qinkai |last2=Xia |first2=Xiao |last3=Zou |first3=Xu |last4=Dong |first4=Yuxiao |last5=Wang |first5=Shan |last6=Xue |first6=Yufei |last7=Wang |first7=Zihan |last8=Shen |first8=Lei |last9=Wang |first9=Andi |last10=Li |first10=Yang |last11=Su |first11=Teng |last12=Yang |first12=Zhilin |last13=Tang |first13=Jie |title=CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X |date=2023 |doi=10.48550/arXiv.2303.17568}}</ref> | ||
+ | |- | ||
+ | | 2022 || September || Sparrow || 70,000,000,000<ref name="A Survey of Large"/> || || LLM launch || {{w|DeepMind}} introduces Sparrow, a chatbot refined using human feedback to improve its helpfulness, accuracy, and harmlessness. It builds on the Chinchilla language model, trained on substantial data, and can access the internet for up-to-date information to support more accurate responses. {{w|Google}} aims to use Sparrow as a response to {{w|ChatGPT}} and {{w|Microsoft}}'s collaboration with OpenAI, giving Google a commercially viable chatbot, potentially rivaling Google Search and OpenAI.<ref>{{cite web |title=Could Deepmind’s Sparrow be Google’s answer to ChatGPT? |url=https://medium.com/inkwater-atlas/could-deepminds-sparrow-be-google-s-answer-to-chatgpt-89ccef61186a#:~:text=First%20introduced%20in%20September%202022,private%20beta%20later%20this%20year. |website=medium.com |access-date=21 October 2023}}</ref> | ||
+ | |- | ||
+ | | 2022 || September 21 || WeLM || 10,000,000,000<ref name="A Survey of Large"/> || 300,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || WeLM is introduced as a versatile pre-trained language model for [[w:Chinese language|Chinese]], with 10 billion parameters trained using self-supervised learning. It can perform a range of tasks from zero-shot or few-shot demonstrations. Trained on a diverse, high-quality corpus, WeLM outperforms existing pre-trained models of similar size on 18 monolingual (Chinese) tasks and matches the performance of much larger models. It also performs well in multilingual and code-switching contexts, surpassing multilingual models pre-trained on 30 languages. Fine-tuning with human-written prompts improves its performance on unseen types of tasks, outperforming the unsupervised WeLM in zero-shot learning. Additionally, WeLM displays rudimentary abilities to explain and calibrate its own decisions, suggesting promising research avenues.<ref>{{cite journal |last1=Su |first1=Hui |last2=Zhou |first2=Xiao |last3=Yu |first3=Houjin |last4=Shen |first4=Xiaoyu |last5=Chen |first5=Yuwen |last6=Zhu |first6=Zilin |last7=Yu |first7=Yang |last8=Zhou |first8=Jie |title=WeLM: A Well-Read Pre-trained Language Model for Chinese |date=2022 |doi=10.48550/arXiv.2209.10372}}</ref> | ||
|- | |- | ||
| 2022 || October 5 || GLM || 130,000,000,000<ref name="A Survey of Large"/> || 400,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || GLM-130B is introduced as an open-source bilingual (English and Chinese) pre-trained language model. This model, aiming to match GPT-3's performance, overcomes technical challenges during training, focusing on stability and efficiency. It outperforms GPT-3 175B on various English benchmarks and surpasses ERNIE TITAN 3.0 260B, the largest Chinese model, on related tasks. Unique scaling properties enable efficient inference on affordable GPUs. GLM-130B achieves INT4 quantization without performance loss, a first for 100B-scale models. The model weights and resources are publicly accessible, fostering research and development in natural language processing.<ref>{{cite journal |last1=Zeng |first1=Aohan |last2=Liu |first2=Xiao |last3=Du |first3=Zhengxiao |last4=Wang |first4=Zihan |last5=Lai |first5=Hanyu |last6=Ding |first6=Ming |last7=Yang |first7=Zhuoyi |last8=Xu |first8=Yifan |last9=Zheng |first9=Wendi |last10=Xia |first10=Xiao |last11=Tam |first11=Weng Lam |last12=Ma |first12=Zixuan |last13=Xue |first13=Yufei |last14=Zhai |first14=Jidong |last15=Chen |first15=Wenguang |last16=Zhang |first16=Peng |last17=Dong |first17=Yuxiao |last18=Tang |first18=Jie |title=GLM-130B: An Open Bilingual Pre-trained Model |date=2022 |doi=10.48550/arXiv.2210.02414}}</ref> | | 2022 || October 5 || GLM || 130,000,000,000<ref name="A Survey of Large"/> || 400,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || GLM-130B is introduced as an open-source bilingual (English and Chinese) pre-trained language model. This model, aiming to match GPT-3's performance, overcomes technical challenges during training, focusing on stability and efficiency. It outperforms GPT-3 175B on various English benchmarks and surpasses ERNIE TITAN 3.0 260B, the largest Chinese model, on related tasks. 
Unique scaling properties enable efficient inference on affordable GPUs. GLM-130B achieves INT4 quantization without performance loss, a first for 100B-scale models. The model weights and resources are publicly accessible, fostering research and development in natural language processing.<ref>{{cite journal |last1=Zeng |first1=Aohan |last2=Liu |first2=Xiao |last3=Du |first3=Zhengxiao |last4=Wang |first4=Zihan |last5=Lai |first5=Hanyu |last6=Ding |first6=Ming |last7=Yang |first7=Zhuoyi |last8=Xu |first8=Yifan |last9=Zheng |first9=Wendi |last10=Xia |first10=Xiao |last11=Tam |first11=Weng Lam |last12=Ma |first12=Zixuan |last13=Xue |first13=Yufei |last14=Zhai |first14=Jidong |last15=Chen |first15=Wenguang |last16=Zhang |first16=Peng |last17=Dong |first17=Yuxiao |last18=Tang |first18=Jie |title=GLM-130B: An Open Bilingual Pre-trained Model |date=2022 |doi=10.48550/arXiv.2210.02414}}</ref> | ||
Line 154: | Line 172: | ||
| 2022 || November 3 || BLOOMZ || 176,000,000,000<ref name="A Survey of Large"/> || || LLM launch || BLOOMZ is introduced as a variant of BLOOM, a multilingual language model produced through multitask prompted finetuning (MTF), which improves its ability to generalize across tasks. The research extends MTF beyond English-centric models, applying it to the multilingual BLOOM and mT5 models to create the BLOOMZ and mT0 variants. By finetuning these models on English tasks with English prompts, they achieve task generalization to non-English languages present in the pretraining data. Surprisingly, the models exhibit zero-shot generalization to tasks in languages they have never been intentionally exposed to, suggesting the development of task- and language-agnostic capabilities. Additionally, the study introduces xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts, advancing research on crosslingual generalization in natural language processing.<ref>{{cite journal |last1=Muennighoff |first1=Niklas |last2=Wang |first2=Thomas |last3=Sutawika |first3=Lintang |last4=Roberts |first4=Adam |last5=Biderman |first5=Stella |last6=Scao |first6=Teven Le |last7=Bari |first7=M Saiful |last8=Shen |first8=Sheng |last9=Yong |first9=Zheng-Xin |last10=Schoelkopf |first10=Hailey |last11=Tang |first11=Xiangru |last12=Radev |first12=Dragomir |last13=Aji |first13=Alham Fikri |last14=Almubarak |first14=Khalid |last15=Albanie |first15=Samuel |last16=Alyafeai |first16=Zaid |last17=Webson |first17=Albert |last18=Raff |first18=Edward |last19=Raffel |first19=Colin |title=Crosslingual Generalization through Multitask Finetuning |date=2022 |doi=10.48550/arXiv.2211.01786 |url=}}</ref> | | 2022 || November 3 || BLOOMZ || 176,000,000,000<ref name="A Survey of Large"/> || || LLM launch || BLOOMZ is introduced as a variant of BLOOM, a multilingual language model produced through multitask prompted finetuning (MTF), which improves its ability to generalize across tasks. 
The research extends MTF beyond English-centric models, applying it to the multilingual BLOOM and mT5 models to create the BLOOMZ and mT0 variants. By finetuning these models on English tasks with English prompts, they achieve task generalization to non-English languages present in the pretraining data. Surprisingly, the models exhibit zero-shot generalization to tasks in languages they have never been intentionally exposed to, suggesting the development of task- and language-agnostic capabilities. Additionally, the study introduces xP3, a composite of supervised datasets in 46 languages with English and machine-translated prompts, advancing research on crosslingual generalization in natural language processing.<ref>{{cite journal |last1=Muennighoff |first1=Niklas |last2=Wang |first2=Thomas |last3=Sutawika |first3=Lintang |last4=Roberts |first4=Adam |last5=Biderman |first5=Stella |last6=Scao |first6=Teven Le |last7=Bari |first7=M Saiful |last8=Shen |first8=Sheng |last9=Yong |first9=Zheng-Xin |last10=Schoelkopf |first10=Hailey |last11=Tang |first11=Xiangru |last12=Radev |first12=Dragomir |last13=Aji |first13=Alham Fikri |last14=Almubarak |first14=Khalid |last15=Albanie |first15=Samuel |last16=Alyafeai |first16=Zaid |last17=Webson |first17=Albert |last18=Raff |first18=Edward |last19=Raffel |first19=Colin |title=Crosslingual Generalization through Multitask Finetuning |date=2022 |doi=10.48550/arXiv.2211.01786 |url=}}</ref> | ||
|- | |- | ||
− | | 2022 || November 9 || BLOOM || 176,000,000,000<ref name="Kazi"/><ref name="A Survey of Large"/> || 366,000,000,000<ref name="A Survey of Large"/> || LLM launch || A paper introduces BLOOM, an open-access language model designed and built by a collaboration of hundreds of researchers. The model is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages. BLOOM achieves competitive performance on a wide variety of benchmarks and is publicly released under the Responsible AI License to facilitate future research and applications using large language models. The paper also discusses the development process and the need to democratize large language models.<ref>{{cite journal |last1=Workshop |first1=BigScience |last2=Scao |first2=Teven Le |last3=Fan |first3=Angela |last4=Akiki |first4=Christopher |last5=Pavlick |first5=Ellie |last6=Ilić |first6=Suzana |last7=Hesslow |first7=Daniel |last8=Castagné |first8=Roman |last9=Luccioni |first9=Alexandra Sasha |last10=Yvon |first10=François |last11=Gallé |first11=Matthias |last12=Tow |first12=Jonathan |last13=Rush |first13=Alexander M. 
|last14=Biderman |first14=Stella |last15=Webson |first15=Albert |last16=Ammanamanchi |first16=Pawan Sasanka |last17=Wang |first17=Thomas |last18=Sagot |first18=Benoît |last19=Muennighoff |first19=Niklas |last20=del Moral |first20=Albert Villanova |last21=Ruwase |first21=Olatunji |last22=Bawden |first22=Rachel |last23=Bekman |first23=Stas |last24=McMillan-Major |first24=Angelina |last25=Beltagy |first25=Iz |last26=Nguyen |first26=Huu |last27=Saulnier |first27=Lucile |last28=Tan |first28=Samson |last29=Suarez |first29=Pedro Ortiz |last30=Sanh |first30=Victor |last31=Laurençon |first31=Hugo |last32=Jernite |first32=Yacine |last33=Launay |first33=Julien |last34=Mitchell |first34=Margaret |last35=Raffel |first35=Colin |last36=Gokaslan |first36=Aaron |last37=Simhi |first37=Adi |last38=Soroa |first38=Aitor |last39=Aji |first39=Alham Fikri |last40=Alfassy |first40=Amit |last41=Rogers |first41=Anna |last42=Nitzav |first42=Ariel Kreisberg |last43=Xu |first43=Canwen |last44=Mou |first44=Chenghao |last45=Emezue |first45=Chris |last46=Klamm |first46=Christopher |last47=Leong |first47=Colin |last48=van Strien |first48=Daniel |last49=Adelani |first49=David Ifeoluwa |last50=Radev |first50=Dragomir |last51=Ponferrada |first51=Eduardo González |last52=Levkovizh |first52=Efrat |last53=Kim |first53=Ethan |last54=Natan |first54=Eyal Bar |last55=De Toni |first55=Francesco |last56=Dupont |first56=Gérard |last57=Kruszewski |first57=Germán |last58=Pistilli |first58=Giada |last59=Elsahar |first59=Hady |last60=Benyamina |first60=Hamza |last61=Tran |first61=Hieu |last62=Yu |first62=Ian |last63=Abdulmumin |first63=Idris |last64=Johnson |first64=Isaac |last65=Gonzalez-Dios |first65=Itziar |last66=de la Rosa |first66=Javier |last67=Chim |first67=Jenny |last68=Dodge |first68=Jesse |last69=Zhu |first69=Jian |last70=Chang |first70=Jonathan |last71=Frohberg |first71=Jörg |last72=Tobing |first72=Joseph |last73=Bhattacharjee |first73=Joydeep |last74=Almubarak |first74=Khalid |last75=Chen 
|first75=Kimbo |last76=Lo |first76=Kyle |last77=Von Werra |first77=Leandro |last78=Weber |first78=Leon |last79=Phan |first79=Long |last80=allal |first80=Loubna Ben |last81=Tanguy |first81=Ludovic |last82=Dey |first82=Manan |last83=Muñoz |first83=Manuel Romero |last84=Masoud |first84=Maraim |last85=Grandury |first85=María |last86=Šaško |first86=Mario |last87=Huang |first87=Max |last88=Coavoux |first88=Maximin |last89=Singh |first89=Mayank |last90=Jiang |first90=Mike Tian-Jian |last91=Vu |first91=Minh Chien |last92=Jauhar |first92=Mohammad A. |last93=Ghaleb |first93=Mustafa |last94=Subramani |first94=Nishant |last95=Kassner |first95=Nora |last96=Khamis |first96=Nurulaqilla |last97=Nguyen |first97=Olivier |last98=Espejel |first98=Omar |last99=de Gibert |first99=Ona |last100=Villegas |first100=Paulo |last101=Henderson |first101=Peter |last102=Colombo |first102=Pierre |last103=Amuok |first103=Priscilla |last104=Lhoest |first104=Quentin |last105=Harliman |first105=Rheza |last106=Bommasani |first106=Rishi |last107=López |first107=Roberto Luis |last108=Ribeiro |first108=Rui |last109=Osei |first109=Salomey |last110=Pyysalo |first110=Sampo |last111=Nagel |first111=Sebastian |last112=Bose |first112=Shamik |last113=Muhammad |first113=Shamsuddeen Hassan |last114=Sharma |first114=Shanya |last115=Longpre |first115=Shayne |last116=Nikpoor |first116=Somaieh |last117=Silberberg |first117=Stanislav |last118=Pai |first118=Suhas |last119=Zink |first119=Sydney |last120=Torrent |first120=Tiago Timponi |last121=Schick |first121=Timo |last122=Thrush |first122=Tristan |last123=Danchev |first123=Valentin |last124=Nikoulina |first124=Vassilina |last125=Laippala |first125=Veronika |last126=Lepercq |first126=Violette |last127=Prabhu |first127=Vrinda |last128=Alyafeai |first128=Zaid |last129=Talat |first129=Zeerak |last130=Raja |first130=Arun |last131=Heinzerling |first131=Benjamin |last132=Si |first132=Chenglei |last133=Taşar |first133=Davut Emre |last134=Salesky |first134=Elizabeth 
|last135=Mielke |first135=Sabrina J. |last136=Lee |first136=Wilson Y. |last137=Sharma |first137=Abheesht |last138=Santilli |first138=Andrea |last139=Chaffin |first139=Antoine |last140=Stiegler |first140=Arnaud |last141=Datta |first141=Debajyoti |last142=Szczechla |first142=Eliza |last143=Chhablani |first143=Gunjan |last144=Wang |first144=Han |last145=Pandey |first145=Harshit |last146=Strobelt |first146=Hendrik |last147=Fries |first147=Jason Alan |last148=Rozen |first148=Jos |last149=Gao |first149=Leo |last150=Sutawika |first150=Lintang |last151=Bari |first151=M. Saiful |last152=Al-shaibani |first152=Maged S. |last153=Manica |first153=Matteo |last154=Nayak |first154=Nihal |last155=Teehan |first155=Ryan |last156=Albanie |first156=Samuel |last157=Shen |first157=Sheng |last158=Ben-David |first158=Srulik |last159=Bach |first159=Stephen H. |last160=Kim |first160=Taewoon |last161=Bers |first161=Tali |last162=Fevry |first162=Thibault |last163=Neeraj |first163=Trishala |last164=Thakker |first164=Urmish |last165=Raunak |first165=Vikas |last166=Tang |first166=Xiangru |last167=Yong |first167=Zheng-Xin |last168=Sun |first168=Zhiqing |last169=Brody |first169=Shaked |last170=Uri |first170=Yallow |last171=Tojarieh |first171=Hadar |last172=Roberts |first172=Adam |last173=Chung |first173=Hyung Won |last174=Tae |first174=Jaesung |last175=Phang |first175=Jason |last176=Press |first176=Ofir |last177=Li |first177=Conglong |last178=Narayanan |first178=Deepak |last179=Bourfoune |first179=Hatim |last180=Casper |first180=Jared |last181=Rasley |first181=Jeff |last182=Ryabinin |first182=Max |last183=Mishra |first183=Mayank |last184=Zhang |first184=Minjia |last185=Shoeybi |first185=Mohammad |last186=Peyrounette |first186=Myriam |last187=Patry |first187=Nicolas |last188=Tazi |first188=Nouamane |last189=Sanseviero |first189=Omar |last190=von Platen |first190=Patrick |last191=Cornette |first191=Pierre |last192=Lavallée |first192=Pierre François |last193=Lacroix |first193=Rémi |last194=Rajbhandari 
|first194=Samyam |last195=Gandhi |first195=Sanchit |last196=Smith |first196=Shaden |last197=Requena |first197=Stéphane |last198=Patil |first198=Suraj |last199=Dettmers |first199=Tim |last200=Baruwa |first200=Ahmed |last201=Singh |first201=Amanpreet |last202=Cheveleva |first202=Anastasia |last203=Ligozat |first203=Anne-Laure |last204=Subramonian |first204=Arjun |last205=Névéol |first205=Aurélie |last206=Lovering |first206=Charles |last207=Garrette |first207=Dan |last208=Tunuguntla |first208=Deepak |last209=Reiter |first209=Ehud |last210=Taktasheva |first210=Ekaterina |last211=Voloshina |first211=Ekaterina |last212=Bogdanov |first212=Eli |last213=Winata |first213=Genta Indra |last214=Schoelkopf |first214=Hailey |last215=Kalo |first215=Jan-Christoph |last216=Novikova |first216=Jekaterina |last217=Forde |first217=Jessica Zosa |last218=Clive |first218=Jordan |last219=Kasai |first219=Jungo |last220=Kawamura |first220=Ken |last221=Hazan |first221=Liam |last222=Carpuat |first222=Marine |last223=Clinciu |first223=Miruna |last224=Kim |first224=Najoung |last225=Cheng |first225=Newton |last226=Serikov |first226=Oleg |last227=Antverg |first227=Omer |last228=van der Wal |first228=Oskar |last229=Zhang |first229=Rui |last230=Zhang |first230=Ruochen |last231=Gehrmann |first231=Sebastian |last232=Mirkin |first232=Shachar |last233=Pais |first233=Shani |last234=Shavrina |first234=Tatiana |last235=Scialom |first235=Thomas |last236=Yun |first236=Tian |last237=Limisiewicz |first237=Tomasz |last238=Rieser |first238=Verena |last239=Protasov |first239=Vitaly |last240=Mikhailov |first240=Vladislav |last241=Pruksachatkun |first241=Yada |last242=Belinkov |first242=Yonatan |last243=Bamberger |first243=Zachary |last244=Kasner |first244=Zdeněk |last245=Rueda |first245=Alice |last246=Pestana |first246=Amanda |last247=Feizpour |first247=Amir |last248=Khan |first248=Ammar |last249=Faranak |first249=Amy |last250=Santos |first250=Ana |last251=Hevia |first251=Anthony |last252=Unldreaj 
|first252=Antigona |last253=Aghagol |first253=Arash |last254=Abdollahi |first254=Arezoo |last255=Tammour |first255=Aycha |last256=HajiHosseini |first256=Azadeh |last257=Behroozi |first257=Bahareh |last258=Ajibade |first258=Benjamin |last259=Saxena |first259=Bharat |last260=Ferrandis |first260=Carlos Muñoz |last261=Contractor |first261=Danish |last262=Lansky |first262=David |last263=David |first263=Davis |last264=Kiela |first264=Douwe |last265=Nguyen |first265=Duong A. |last266=Tan |first266=Edward |last267=Baylor |first267=Emi |last268=Ozoani |first268=Ezinwanne |last269=Mirza |first269=Fatima |last270=Ononiwu |first270=Frankline |last271=Rezanejad |first271=Habib |last272=Jones |first272=Hessie |last273=Bhattacharya |first273=Indrani |last274=Solaiman |first274=Irene |last275=Sedenko |first275=Irina |last276=Nejadgholi |first276=Isar |last277=Passmore |first277=Jesse |last278=Seltzer |first278=Josh |last279=Sanz |first279=Julio Bonis |last280=Dutra |first280=Livia |last281=Samagaio |first281=Mairon |last282=Elbadri |first282=Maraim |last283=Mieskes |first283=Margot |last284=Gerchick |first284=Marissa |last285=Akinlolu |first285=Martha |last286=McKenna |first286=Michael |last287=Qiu |first287=Mike |last288=Ghauri |first288=Muhammed |last289=Burynok |first289=Mykola |last290=Abrar |first290=Nafis |last291=Rajani |first291=Nazneen |last292=Elkott |first292=Nour |last293=Fahmy |first293=Nour |last294=Samuel |first294=Olanrewaju |last295=An |first295=Ran |last296=Kromann |first296=Rasmus |last297=Hao |first297=Ryan |last298=Alizadeh |first298=Samira |last299=Shubber |first299=Sarmad |last300=Wang |first300=Silas |last301=Roy |first301=Sourav |last302=Viguier |first302=Sylvain |last303=Le |first303=Thanh |last304=Oyebade |first304=Tobi |last305=Le |first305=Trieu |last306=Yang |first306=Yoyo |last307=Nguyen |first307=Zach |last308=Kashyap |first308=Abhinav Ramesh |last309=Palasciano |first309=Alfredo |last310=Callahan |first310=Alison |last311=Shukla |first311=Anima 
|last312=Miranda-Escalada |first312=Antonio |last313=Singh |first313=Ayush |last314=Beilharz |first314=Benjamin |last315=Wang |first315=Bo |last316=Brito |first316=Caio |last317=Zhou |first317=Chenxi |last318=Jain |first318=Chirag |last319=Xu |first319=Chuxin |last320=Fourrier |first320=Clémentine |last321=Periñán |first321=Daniel León |last322=Molano |first322=Daniel |last323=Yu |first323=Dian |last324=Manjavacas |first324=Enrique |last325=Barth |first325=Fabio |last326=Fuhrimann |first326=Florian |last327=Altay |first327=Gabriel |last328=Bayrak |first328=Giyaseddin |last329=Burns |first329=Gully |last330=Vrabec |first330=Helena U. |last331=Bello |first331=Imane |last332=Dash |first332=Ishani |last333=Kang |first333=Jihyun |last334=Giorgi |first334=John |last335=Golde |first335=Jonas |last336=Posada |first336=Jose David |last337=Sivaraman |first337=Karthik Rangasai |last338=Bulchandani |first338=Lokesh |last339=Liu |first339=Lu |last340=Shinzato |first340=Luisa |last341=de Bykhovetz |first341=Madeleine Hahn |last342=Takeuchi |first342=Maiko |last343=Pàmies |first343=Marc |last344=Castillo |first344=Maria A. |last345=Nezhurina |first345=Marianna |last346=Sänger |first346=Mario |last347=Samwald |first347=Matthias |last348=Cullan |first348=Michael |last349=Weinberg |first349=Michael |last350=De Wolf |first350=Michiel |last351=Mihaljcic |first351=Mina |last352=Liu |first352=Minna |last353=Freidank |first353=Moritz |last354=Kang |first354=Myungsun |last355=Seelam |first355=Natasha |last356=Dahlberg |first356=Nathan |last357=Broad |first357=Nicholas Michio |last358=Muellner |first358=Nikolaus |last359=Fung |first359=Pascale |last360=Haller |first360=Patrick |last361=Chandrasekhar |first361=Ramya |last362=Eisenberg |first362=Renata |last363=Martin |first363=Robert |last364=Canalli |first364=Rodrigo |last365=Su |first365=Rosaline |last366=Su |first366=Ruisi |last367=Cahyawijaya |first367=Samuel |last368=Garda |first368=Samuele |last369=Deshmukh |first369=Shlok S. 
|last370=Mishra |first370=Shubhanshu |last371=Kiblawi |first371=Sid |last372=Ott |first372=Simon |last373=Sang-aroonsiri |first373=Sinee |last374=Kumar |first374=Srishti |last375=Schweter |first375=Stefan |last376=Bharati |first376=Sushil |last377=Laud |first377=Tanmay |last378=Gigant |first378=Théo |last379=Kainuma |first379=Tomoya |last380=Kusa |first380=Wojciech |last381=Labrak |first381=Yanis |last382=Bajaj |first382=Yash Shailesh |last383=Venkatraman |first383=Yash |last384=Xu |first384=Yifan |last385=Xu |first385=Yingxin |last386=Xu |first386=Yu |last387=Tan |first387=Zhe |last388=Xie |first388=Zhongli |last389=Ye |first389=Zifan |last390=Bras |first390=Mathilde |last391=Belkada |first391=Younes |last392=Wolf |first392=Thomas |title=BLOOM: A 176B-Parameter Open-Access Multilingual Language Model |journal=arXiv:2211.05100 [cs] |date=13 March 2023 |url=https://arxiv.org/abs/2211.05100}}</ref> | + | | 2022 || November 9 || BLOOM || 176,000,000,000<ref name="Kazi"/><ref name="A Survey of Large"/> || 366,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || A paper introduces BLOOM, an open-access language model designed and built by a collaboration of hundreds of researchers. The model is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages. BLOOM achieves competitive performance on a wide variety of benchmarks and is publicly released under the Responsible AI License to facilitate future research and applications using large language models. 
The paper also discusses the development process and the need to democratize large language models.<ref>{{cite journal |last1=Workshop |first1=BigScience |last2=Scao |first2=Teven Le |last3=Fan |first3=Angela |last4=Akiki |first4=Christopher |last5=Pavlick |first5=Ellie |last6=Ilić |first6=Suzana |last7=Hesslow |first7=Daniel |last8=Castagné |first8=Roman |last9=Luccioni |first9=Alexandra Sasha |last10=Yvon |first10=François |last11=Gallé |first11=Matthias |last12=Tow |first12=Jonathan |last13=Rush |first13=Alexander M. |last14=Biderman |first14=Stella |last15=Webson |first15=Albert |last16=Ammanamanchi |first16=Pawan Sasanka |last17=Wang |first17=Thomas |last18=Sagot |first18=Benoît |last19=Muennighoff |first19=Niklas |last20=del Moral |first20=Albert Villanova |last21=Ruwase |first21=Olatunji |last22=Bawden |first22=Rachel |last23=Bekman |first23=Stas |last24=McMillan-Major |first24=Angelina |last25=Beltagy |first25=Iz |last26=Nguyen |first26=Huu |last27=Saulnier |first27=Lucile |last28=Tan |first28=Samson |last29=Suarez |first29=Pedro Ortiz |last30=Sanh |first30=Victor |last31=Laurençon |first31=Hugo |last32=Jernite |first32=Yacine |last33=Launay |first33=Julien |last34=Mitchell |first34=Margaret |last35=Raffel |first35=Colin |last36=Gokaslan |first36=Aaron |last37=Simhi |first37=Adi |last38=Soroa |first38=Aitor |last39=Aji |first39=Alham Fikri |last40=Alfassy |first40=Amit |last41=Rogers |first41=Anna |last42=Nitzav |first42=Ariel Kreisberg |last43=Xu |first43=Canwen |last44=Mou |first44=Chenghao |last45=Emezue |first45=Chris |last46=Klamm |first46=Christopher |last47=Leong |first47=Colin |last48=van Strien |first48=Daniel |last49=Adelani |first49=David Ifeoluwa |last50=Radev |first50=Dragomir |last51=Ponferrada |first51=Eduardo González |last52=Levkovizh |first52=Efrat |last53=Kim |first53=Ethan |last54=Natan |first54=Eyal Bar |last55=De Toni |first55=Francesco |last56=Dupont |first56=Gérard |last57=Kruszewski |first57=Germán |last58=Pistilli |first58=Giada 
|last59=Elsahar |first59=Hady |last60=Benyamina |first60=Hamza |last61=Tran |first61=Hieu |last62=Yu |first62=Ian |last63=Abdulmumin |first63=Idris |last64=Johnson |first64=Isaac |last65=Gonzalez-Dios |first65=Itziar |last66=de la Rosa |first66=Javier |last67=Chim |first67=Jenny |last68=Dodge |first68=Jesse |last69=Zhu |first69=Jian |last70=Chang |first70=Jonathan |last71=Frohberg |first71=Jörg |last72=Tobing |first72=Joseph |last73=Bhattacharjee |first73=Joydeep |last74=Almubarak |first74=Khalid |last75=Chen |first75=Kimbo |last76=Lo |first76=Kyle |last77=Von Werra |first77=Leandro |last78=Weber |first78=Leon |last79=Phan |first79=Long |last80=allal |first80=Loubna Ben |last81=Tanguy |first81=Ludovic |last82=Dey |first82=Manan |last83=Muñoz |first83=Manuel Romero |last84=Masoud |first84=Maraim |last85=Grandury |first85=María |last86=Šaško |first86=Mario |last87=Huang |first87=Max |last88=Coavoux |first88=Maximin |last89=Singh |first89=Mayank |last90=Jiang |first90=Mike Tian-Jian |last91=Vu |first91=Minh Chien |last92=Jauhar |first92=Mohammad A. 
|last93=Ghaleb |first93=Mustafa |last94=Subramani |first94=Nishant |last95=Kassner |first95=Nora |last96=Khamis |first96=Nurulaqilla |last97=Nguyen |first97=Olivier |last98=Espejel |first98=Omar |last99=de Gibert |first99=Ona |last100=Villegas |first100=Paulo |last101=Henderson |first101=Peter |last102=Colombo |first102=Pierre |last103=Amuok |first103=Priscilla |last104=Lhoest |first104=Quentin |last105=Harliman |first105=Rheza |last106=Bommasani |first106=Rishi |last107=López |first107=Roberto Luis |last108=Ribeiro |first108=Rui |last109=Osei |first109=Salomey |last110=Pyysalo |first110=Sampo |last111=Nagel |first111=Sebastian |last112=Bose |first112=Shamik |last113=Muhammad |first113=Shamsuddeen Hassan |last114=Sharma |first114=Shanya |last115=Longpre |first115=Shayne |last116=Nikpoor |first116=Somaieh |last117=Silberberg |first117=Stanislav |last118=Pai |first118=Suhas |last119=Zink |first119=Sydney |last120=Torrent |first120=Tiago Timponi |last121=Schick |first121=Timo |last122=Thrush |first122=Tristan |last123=Danchev |first123=Valentin |last124=Nikoulina |first124=Vassilina |last125=Laippala |first125=Veronika |last126=Lepercq |first126=Violette |last127=Prabhu |first127=Vrinda |last128=Alyafeai |first128=Zaid |last129=Talat |first129=Zeerak |last130=Raja |first130=Arun |last131=Heinzerling |first131=Benjamin |last132=Si |first132=Chenglei |last133=Taşar |first133=Davut Emre |last134=Salesky |first134=Elizabeth |last135=Mielke |first135=Sabrina J. |last136=Lee |first136=Wilson Y. 
|last137=Sharma |first137=Abheesht |last138=Santilli |first138=Andrea |last139=Chaffin |first139=Antoine |last140=Stiegler |first140=Arnaud |last141=Datta |first141=Debajyoti |last142=Szczechla |first142=Eliza |last143=Chhablani |first143=Gunjan |last144=Wang |first144=Han |last145=Pandey |first145=Harshit |last146=Strobelt |first146=Hendrik |last147=Fries |first147=Jason Alan |last148=Rozen |first148=Jos |last149=Gao |first149=Leo |last150=Sutawika |first150=Lintang |last151=Bari |first151=M. Saiful |last152=Al-shaibani |first152=Maged S. |last153=Manica |first153=Matteo |last154=Nayak |first154=Nihal |last155=Teehan |first155=Ryan |last156=Albanie |first156=Samuel |last157=Shen |first157=Sheng |last158=Ben-David |first158=Srulik |last159=Bach |first159=Stephen H. |last160=Kim |first160=Taewoon |last161=Bers |first161=Tali |last162=Fevry |first162=Thibault |last163=Neeraj |first163=Trishala |last164=Thakker |first164=Urmish |last165=Raunak |first165=Vikas |last166=Tang |first166=Xiangru |last167=Yong |first167=Zheng-Xin |last168=Sun |first168=Zhiqing |last169=Brody |first169=Shaked |last170=Uri |first170=Yallow |last171=Tojarieh |first171=Hadar |last172=Roberts |first172=Adam |last173=Chung |first173=Hyung Won |last174=Tae |first174=Jaesung |last175=Phang |first175=Jason |last176=Press |first176=Ofir |last177=Li |first177=Conglong |last178=Narayanan |first178=Deepak |last179=Bourfoune |first179=Hatim |last180=Casper |first180=Jared |last181=Rasley |first181=Jeff |last182=Ryabinin |first182=Max |last183=Mishra |first183=Mayank |last184=Zhang |first184=Minjia |last185=Shoeybi |first185=Mohammad |last186=Peyrounette |first186=Myriam |last187=Patry |first187=Nicolas |last188=Tazi |first188=Nouamane |last189=Sanseviero |first189=Omar |last190=von Platen |first190=Patrick |last191=Cornette |first191=Pierre |last192=Lavallée |first192=Pierre François |last193=Lacroix |first193=Rémi |last194=Rajbhandari |first194=Samyam |last195=Gandhi |first195=Sanchit |last196=Smith 
|first196=Shaden |last197=Requena |first197=Stéphane |last198=Patil |first198=Suraj |last199=Dettmers |first199=Tim |last200=Baruwa |first200=Ahmed |last201=Singh |first201=Amanpreet |last202=Cheveleva |first202=Anastasia |last203=Ligozat |first203=Anne-Laure |last204=Subramonian |first204=Arjun |last205=Névéol |first205=Aurélie |last206=Lovering |first206=Charles |last207=Garrette |first207=Dan |last208=Tunuguntla |first208=Deepak |last209=Reiter |first209=Ehud |last210=Taktasheva |first210=Ekaterina |last211=Voloshina |first211=Ekaterina |last212=Bogdanov |first212=Eli |last213=Winata |first213=Genta Indra |last214=Schoelkopf |first214=Hailey |last215=Kalo |first215=Jan-Christoph |last216=Novikova |first216=Jekaterina |last217=Forde |first217=Jessica Zosa |last218=Clive |first218=Jordan |last219=Kasai |first219=Jungo |last220=Kawamura |first220=Ken |last221=Hazan |first221=Liam |last222=Carpuat |first222=Marine |last223=Clinciu |first223=Miruna |last224=Kim |first224=Najoung |last225=Cheng |first225=Newton |last226=Serikov |first226=Oleg |last227=Antverg |first227=Omer |last228=van der Wal |first228=Oskar |last229=Zhang |first229=Rui |last230=Zhang |first230=Ruochen |last231=Gehrmann |first231=Sebastian |last232=Mirkin |first232=Shachar |last233=Pais |first233=Shani |last234=Shavrina |first234=Tatiana |last235=Scialom |first235=Thomas |last236=Yun |first236=Tian |last237=Limisiewicz |first237=Tomasz |last238=Rieser |first238=Verena |last239=Protasov |first239=Vitaly |last240=Mikhailov |first240=Vladislav |last241=Pruksachatkun |first241=Yada |last242=Belinkov |first242=Yonatan |last243=Bamberger |first243=Zachary |last244=Kasner |first244=Zdeněk |last245=Rueda |first245=Alice |last246=Pestana |first246=Amanda |last247=Feizpour |first247=Amir |last248=Khan |first248=Ammar |last249=Faranak |first249=Amy |last250=Santos |first250=Ana |last251=Hevia |first251=Anthony |last252=Unldreaj |first252=Antigona |last253=Aghagol |first253=Arash |last254=Abdollahi 
|first254=Arezoo |last255=Tammour |first255=Aycha |last256=HajiHosseini |first256=Azadeh |last257=Behroozi |first257=Bahareh |last258=Ajibade |first258=Benjamin |last259=Saxena |first259=Bharat |last260=Ferrandis |first260=Carlos Muñoz |last261=Contractor |first261=Danish |last262=Lansky |first262=David |last263=David |first263=Davis |last264=Kiela |first264=Douwe |last265=Nguyen |first265=Duong A. |last266=Tan |first266=Edward |last267=Baylor |first267=Emi |last268=Ozoani |first268=Ezinwanne |last269=Mirza |first269=Fatima |last270=Ononiwu |first270=Frankline |last271=Rezanejad |first271=Habib |last272=Jones |first272=Hessie |last273=Bhattacharya |first273=Indrani |last274=Solaiman |first274=Irene |last275=Sedenko |first275=Irina |last276=Nejadgholi |first276=Isar |last277=Passmore |first277=Jesse |last278=Seltzer |first278=Josh |last279=Sanz |first279=Julio Bonis |last280=Dutra |first280=Livia |last281=Samagaio |first281=Mairon |last282=Elbadri |first282=Maraim |last283=Mieskes |first283=Margot |last284=Gerchick |first284=Marissa |last285=Akinlolu |first285=Martha |last286=McKenna |first286=Michael |last287=Qiu |first287=Mike |last288=Ghauri |first288=Muhammed |last289=Burynok |first289=Mykola |last290=Abrar |first290=Nafis |last291=Rajani |first291=Nazneen |last292=Elkott |first292=Nour |last293=Fahmy |first293=Nour |last294=Samuel |first294=Olanrewaju |last295=An |first295=Ran |last296=Kromann |first296=Rasmus |last297=Hao |first297=Ryan |last298=Alizadeh |first298=Samira |last299=Shubber |first299=Sarmad |last300=Wang |first300=Silas |last301=Roy |first301=Sourav |last302=Viguier |first302=Sylvain |last303=Le |first303=Thanh |last304=Oyebade |first304=Tobi |last305=Le |first305=Trieu |last306=Yang |first306=Yoyo |last307=Nguyen |first307=Zach |last308=Kashyap |first308=Abhinav Ramesh |last309=Palasciano |first309=Alfredo |last310=Callahan |first310=Alison |last311=Shukla |first311=Anima |last312=Miranda-Escalada |first312=Antonio |last313=Singh |first313=Ayush 
|last314=Beilharz |first314=Benjamin |last315=Wang |first315=Bo |last316=Brito |first316=Caio |last317=Zhou |first317=Chenxi |last318=Jain |first318=Chirag |last319=Xu |first319=Chuxin |last320=Fourrier |first320=Clémentine |last321=Periñán |first321=Daniel León |last322=Molano |first322=Daniel |last323=Yu |first323=Dian |last324=Manjavacas |first324=Enrique |last325=Barth |first325=Fabio |last326=Fuhrimann |first326=Florian |last327=Altay |first327=Gabriel |last328=Bayrak |first328=Giyaseddin |last329=Burns |first329=Gully |last330=Vrabec |first330=Helena U. |last331=Bello |first331=Imane |last332=Dash |first332=Ishani |last333=Kang |first333=Jihyun |last334=Giorgi |first334=John |last335=Golde |first335=Jonas |last336=Posada |first336=Jose David |last337=Sivaraman |first337=Karthik Rangasai |last338=Bulchandani |first338=Lokesh |last339=Liu |first339=Lu |last340=Shinzato |first340=Luisa |last341=de Bykhovetz |first341=Madeleine Hahn |last342=Takeuchi |first342=Maiko |last343=Pàmies |first343=Marc |last344=Castillo |first344=Maria A. |last345=Nezhurina |first345=Marianna |last346=Sänger |first346=Mario |last347=Samwald |first347=Matthias |last348=Cullan |first348=Michael |last349=Weinberg |first349=Michael |last350=De Wolf |first350=Michiel |last351=Mihaljcic |first351=Mina |last352=Liu |first352=Minna |last353=Freidank |first353=Moritz |last354=Kang |first354=Myungsun |last355=Seelam |first355=Natasha |last356=Dahlberg |first356=Nathan |last357=Broad |first357=Nicholas Michio |last358=Muellner |first358=Nikolaus |last359=Fung |first359=Pascale |last360=Haller |first360=Patrick |last361=Chandrasekhar |first361=Ramya |last362=Eisenberg |first362=Renata |last363=Martin |first363=Robert |last364=Canalli |first364=Rodrigo |last365=Su |first365=Rosaline |last366=Su |first366=Ruisi |last367=Cahyawijaya |first367=Samuel |last368=Garda |first368=Samuele |last369=Deshmukh |first369=Shlok S. 
|last370=Mishra |first370=Shubhanshu |last371=Kiblawi |first371=Sid |last372=Ott |first372=Simon |last373=Sang-aroonsiri |first373=Sinee |last374=Kumar |first374=Srishti |last375=Schweter |first375=Stefan |last376=Bharati |first376=Sushil |last377=Laud |first377=Tanmay |last378=Gigant |first378=Théo |last379=Kainuma |first379=Tomoya |last380=Kusa |first380=Wojciech |last381=Labrak |first381=Yanis |last382=Bajaj |first382=Yash Shailesh |last383=Venkatraman |first383=Yash |last384=Xu |first384=Yifan |last385=Xu |first385=Yingxin |last386=Xu |first386=Yu |last387=Tan |first387=Zhe |last388=Xie |first388=Zhongli |last389=Ye |first389=Zifan |last390=Bras |first390=Mathilde |last391=Belkada |first391=Younes |last392=Wolf |first392=Thomas |title=BLOOM: A 176B-Parameter Open-Access Multilingual Language Model |journal=arXiv:2211.05100 [cs] |date=13 March 2023 |url=https://arxiv.org/abs/2211.05100}}</ref> |
|- | |- | ||
| 2022 || November 17 || Galactica || 120,000,000,000<ref name="A Survey of Large"/> || 106,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || Meta AI introduces Galactica, a language model capable of generating scientific and academic papers from simple text inputs. Trained on a vast corpus of scientific literature, knowledge bases, and reference materials, Galactica compresses this data into a 120-billion parameter model. It aims to summarize academic literature, solve math problems, and generate Wiki articles. However, after its launch, Galactica faces criticism for generating content that sounds grammatically correct but is scientifically inaccurate, leading Meta to pull it down after just three days. Some experts find it useful, while others consider it a "random bullshit generator."<ref>{{cite web |last1=Chopra |first1=Disha |title=Meta Introduces ‘Galactica,’ an AI System that Generates Academic Papers from Simple Text Inputs |url=https://analyticsdrift.com/meta-introduces-galactica-an-ai-system-that-generates-academic-papers-from-simple-text-inputs/ |website=Analytics Drift |access-date=20 September 2023 |date=17 November 2022}}</ref><ref>{{cite web |title=Meta’s New Large Language Model Galactica Pulled Down Three Days After Launch |url=https://www.spiceworks.com/tech/artificial-intelligence/news/meta-galactica-large-language-model-criticism/ |website=Spiceworks |access-date=20 September 2023}}</ref> | | 2022 || November 17 || Galactica || 120,000,000,000<ref name="A Survey of Large"/> || 106,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || Meta AI introduces Galactica, a language model capable of generating scientific and academic papers from simple text inputs. Trained on a vast corpus of scientific literature, knowledge bases, and reference materials, Galactica compresses this data into a 120-billion parameter model. It aims to summarize academic literature, solve math problems, and generate Wiki articles. 
However, after its launch, Galactica faces criticism for generating content that sounds grammatically correct but is scientifically inaccurate, leading Meta to pull it down after just three days. Some experts find it useful, while others consider it a "random bullshit generator."<ref>{{cite web |last1=Chopra |first1=Disha |title=Meta Introduces ‘Galactica,’ an AI System that Generates Academic Papers from Simple Text Inputs |url=https://analyticsdrift.com/meta-introduces-galactica-an-ai-system-that-generates-academic-papers-from-simple-text-inputs/ |website=Analytics Drift |access-date=20 September 2023 |date=17 November 2022}}</ref><ref>{{cite web |title=Meta’s New Large Language Model Galactica Pulled Down Three Days After Launch |url=https://www.spiceworks.com/tech/artificial-intelligence/news/meta-galactica-large-language-model-criticism/ |website=Spiceworks |access-date=20 September 2023}}</ref> | ||
|- | |- | ||
− | | 2022 || November 17 || Alexa Teacher Model || 20,000,000,000 | + | | 2022 || November 17 || Alexa Teacher Model || 20,000,000,000 || || LLM launch || [[w:Amazon (company)|Amazon]] makes the Alexa Teacher Model with 20 billion parameters (AlexaTM 20B) available through Amazon SageMaker JumpStart. AlexaTM 20B is a multilingual sequence-to-sequence language model suitable for various industry applications, including financial report summarization and customer service chatbots. It excels in zero-shot learning tasks like SuperGLUE and multilingual zero-shot tasks such as XNLI, outperforming a 175-billion-parameter GPT-3 model. The model is designed to generalize well and handle data scarcity for various natural language processing tasks, making it valuable for developers looking to improve performance on downstream tasks with minimal training data.<ref>{{cite web |title=AlexaTM 20B is now available in Amazon SageMaker JumpStart {{!}} AWS Machine Learning Blog |url=https://aws.amazon.com/es/blogs/machine-learning/alexatm-20b-is-now-available-in-amazon-sagemaker-jumpstart/ |website=aws.amazon.com |access-date=20 September 2023 |date=17 November 2022}}</ref> |
|- | |- | ||
| 2022 || December 6 || Flan-T5 || 11,000,000,000<ref name="A Survey of Large"/> || || LLM launch || Google researchers publicly release Flan-T5 models, which outperform baseline T5 models by a large margin. FLAN-T5 is an instruction-fine-tuned iteration of Google's well-known T5 model. According to the model repository, FLAN-T5 surpasses T5 in all aspects, and its open licensing makes it a preferred starting point for instruction-tuned models.<ref>{{cite web |title=FLAN-T5 vs. FLAN-UL2: Which LLM is Better? {{!}} Sapling |url=https://sapling.ai/llm/flan-t5-vs-flan-ul2#:~:text=FLAN%2DT5%20is%20a%20finetuned,for%20a%20starting%20instruct%20model. |website=sapling.ai |access-date=19 October 2023 |language=en}}</ref> | | 2022 || December 6 || Flan-T5 || 11,000,000,000<ref name="A Survey of Large"/> || || LLM launch || Google researchers publicly release Flan-T5 models, which outperform baseline T5 models by a large margin. FLAN-T5 is an instruction-fine-tuned iteration of Google's well-known T5 model. According to the model repository, FLAN-T5 surpasses T5 in all aspects, and its open licensing makes it a preferred starting point for instruction-tuned models.<ref>{{cite web |title=FLAN-T5 vs. FLAN-UL2: Which LLM is Better? {{!}} Sapling |url=https://sapling.ai/llm/flan-t5-vs-flan-ul2#:~:text=FLAN%2DT5%20is%20a%20finetuned,for%20a%20starting%20instruct%20model. |website=sapling.ai |access-date=19 October 2023 |language=en}}</ref> | ||
Line 176: | Line 194: | ||
| 2023 || February 21 || || || || Prompting || A paper presents a catalog of {{w|prompt engineering}} techniques in pattern form that have been applied successfully to solve common problems when conversing with large language models (LLMs), such as {{w|ChatGPT}}. Prompt patterns are reusable solutions to common problems faced when working with LLMs that can customize the outputs and interactions with an LLM. The paper provides a framework for documenting patterns for structuring prompts to solve a range of problems and presents a catalog of patterns that have been applied successfully to improve the outputs of LLM conversations. It also explains how prompts can be built from multiple patterns and illustrates prompt patterns that benefit from combination with other prompt patterns. The paper contributes to research on prompt engineering that applies LLMs to automate software development tasks.<ref>{{cite journal |last1=White |first1=Jules |last2=Fu |first2=Quchen |last3=Hays |first3=Sam |last4=Sandborn |first4=Michael |last5=Olea |first5=Carlos |last6=Gilbert |first6=Henry |last7=Elnashar |first7=Ashraf |last8=Spencer-Smith |first8=Jesse |last9=Schmidt |first9=Douglas C. |title=A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT |journal=arXiv:2302.11382 [cs] |date=21 February 2023 |doi=10.48550/arXiv.2302.11382 |url=https://arxiv.org/abs/2302.11382}}</ref> | | 2023 || February 21 || || || || Prompting || A paper presents a catalog of {{w|prompt engineering}} techniques in pattern form that have been applied successfully to solve common problems when conversing with large language models (LLMs), such as {{w|ChatGPT}}. Prompt patterns are reusable solutions to common problems faced when working with LLMs that can customize the outputs and interactions with an LLM. 
The paper provides a framework for documenting patterns for structuring prompts to solve a range of problems and presents a catalog of patterns that have been applied successfully to improve the outputs of LLM conversations. It also explains how prompts can be built from multiple patterns and illustrates prompt patterns that benefit from combination with other prompt patterns. The paper contributes to research on prompt engineering that applies LLMs to automate software development tasks.<ref>{{cite journal |last1=White |first1=Jules |last2=Fu |first2=Quchen |last3=Hays |first3=Sam |last4=Sandborn |first4=Michael |last5=Olea |first5=Carlos |last6=Gilbert |first6=Henry |last7=Elnashar |first7=Ashraf |last8=Spencer-Smith |first8=Jesse |last9=Schmidt |first9=Douglas C. |title=A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT |journal=arXiv:2302.11382 [cs] |date=21 February 2023 |doi=10.48550/arXiv.2302.11382 |url=https://arxiv.org/abs/2302.11382}}</ref> | ||
|- | |- | ||
− | | 2023 || February 24 || LLaMA || 65,000,000,000<ref name="A Survey of Large"/> || 1,400,000,000,000<ref name="A Survey of Large"/> || LLM launch || {{w|Meta AI}} introduces LLaMA as a collection of open-source foundation language models, ranging from 7B to 65B parameters, that are trained on publicly available datasets without the need for proprietary or inaccessible data. The largest model, LLaMA-65B, is competitive with other top models such as Chinchilla-70B and PaLM-540B. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks. All models are available for research purposes.<ref>{{cite web |title=LLaMA: Open and Efficient Foundation Language Models - Meta Research |url=https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/ |website=Meta Research |access-date=11 March 2023 |language=en}}</ref> | + | | 2023 || February 24 || LLaMA || 65,000,000,000<ref name="A Survey of Large"/> || 1,400,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || {{w|Meta AI}} introduces LLaMA as a collection of open-source foundation language models, ranging from 7B to 65B parameters, that are trained on publicly available datasets without the need for proprietary or inaccessible data. The largest model, LLaMA-65B, is competitive with other top models such as Chinchilla-70B and PaLM-540B. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks. All models are available for research purposes.<ref>{{cite web |title=LLaMA: Open and Efficient Foundation Language Models - Meta Research |url=https://research.facebook.com/publications/llama-open-and-efficient-foundation-language-models/ |website=Meta Research |access-date=11 March 2023 |language=en}}</ref> |
|- | |- | ||
| 2023 || February 24 || || || || Programming/training || A paper proposes a system called LLM-Augmenter that improves large language models by using external knowledge and automated feedback. The system adds plug-and-play modules to a black-box LLM to ground responses in external knowledge and iteratively improve responses using feedback generated by utility functions. The system is validated on task-oriented dialog and open-domain question answering, showing a significant reduction in hallucinations without sacrificing fluency and informativeness. The source code and models are publicly available.<ref>{{cite journal |last1=Peng |first1=Baolin |last2=Galley |first2=Michel |last3=He |first3=Pengcheng |last4=Cheng |first4=Hao |last5=Xie |first5=Yujia |last6=Hu |first6=Yu |last7=Huang |first7=Qiuyuan |last8=Liden |first8=Lars |last9=Yu |first9=Zhou |last10=Chen |first10=Weizhu |last11=Gao |first11=Jianfeng |title=Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback |journal=arXiv:2302.12813 [cs] |date=1 March 2023 |doi=10.48550/arXiv.2302.12813 |url=https://arxiv.org/abs/2302.12813}}</ref> | | 2023 || February 24 || || || || Programming/training || A paper proposes a system called LLM-Augmenter that improves large language models by using external knowledge and automated feedback. The system adds plug-and-play modules to a black-box LLM to ground responses in external knowledge and iteratively improve responses using feedback generated by utility functions. The system is validated on task-oriented dialog and open-domain question answering, showing a significant reduction in hallucinations without sacrificing fluency and informativeness. 
The source code and models are publicly available.<ref>{{cite journal |last1=Peng |first1=Baolin |last2=Galley |first2=Michel |last3=He |first3=Pengcheng |last4=Cheng |first4=Hao |last5=Xie |first5=Yujia |last6=Hu |first6=Yu |last7=Huang |first7=Qiuyuan |last8=Liden |first8=Lars |last9=Yu |first9=Zhou |last10=Chen |first10=Weizhu |last11=Gao |first11=Jianfeng |title=Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback |journal=arXiv:2302.12813 [cs] |date=1 March 2023 |doi=10.48550/arXiv.2302.12813 |url=https://arxiv.org/abs/2302.12813}}</ref> | ||
Line 222: | Line 240: | ||
| 2023 || March 16 || GPT-4 || 1,760,000,000,000<ref>{{cite web |last1=Lubbad |first1=Mohammed |title=The Ultimate Guide to GPT-4 Parameters: Everything You Need to Know about NLP’s Game-Changer |url=https://medium.com/@mlubbad/the-ultimate-guide-to-gpt-4-parameters-everything-you-need-to-know-about-nlps-game-changer-109b8767855a#:~:text=GPT%2D4%20is%20the%20latest,2%20has%201.5%20billion%20parameters. |website=Medium |access-date=19 September 2023 |language=en |date=7 August 2023}}</ref> || || LLM launch || {{w|OpenAI}} introduces {{w|GPT-4}}, a large [[w:Multimodal learning|multimodal model]] that can process both text and image inputs and produce text outputs. GPT-4 shows human-level performance on professional and academic benchmarks and outperforms previous large language models on traditional NLP benchmarks. The report discusses the challenge of developing deep learning infrastructure and optimization methods that behave predictably across a wide range of scales. While GPT-4 has limitations and safety challenges, OpenAI has taken steps to mitigate potential harms. An extensive system card is included in the report.<ref>{{cite journal |title=GPT-4 Technical Report |date=2023 |doi=10.48550/arXiv.2303.08774}}</ref> | | 2023 || March 16 || GPT-4 || 1,760,000,000,000<ref>{{cite web |last1=Lubbad |first1=Mohammed |title=The Ultimate Guide to GPT-4 Parameters: Everything You Need to Know about NLP’s Game-Changer |url=https://medium.com/@mlubbad/the-ultimate-guide-to-gpt-4-parameters-everything-you-need-to-know-about-nlps-game-changer-109b8767855a#:~:text=GPT%2D4%20is%20the%20latest,2%20has%201.5%20billion%20parameters. |website=Medium |access-date=19 September 2023 |language=en |date=7 August 2023}}</ref> || || LLM launch || {{w|OpenAI}} introduces {{w|GPT-4}}, a large [[w:Multimodal learning|multimodal model]] that can process both text and image inputs and produce text outputs. 
GPT-4 shows human-level performance on professional and academic benchmarks and outperforms previous large language models on traditional NLP benchmarks. The report discusses the challenge of developing deep learning infrastructure and optimization methods that behave predictably across a wide range of scales. While GPT-4 has limitations and safety challenges, OpenAI has taken steps to mitigate potential harms. An extensive system card is included in the report.<ref>{{cite journal |title=GPT-4 Technical Report |date=2023 |doi=10.48550/arXiv.2303.08774}}</ref> | ||
|- | |- | ||
− | | 2023 || March 20 || PanGu-Σ || 1,085,000,000,000 || || LLM launch || Researchers from {{w|Huawei}} introduce PanGu-Σ, which is developed using Ascend 910 AI processors and the MindSpore framework. This model, inheriting parameters from PanGu-α, employs a sparse architecture with Random Routed Experts (RRE) and an efficient training technique called Expert Computation and Storage Separation (ECSS). These methods lead to a 6.3x increase in training throughput through heterogeneous computing. PanGu-Σ demonstrates state-of-the-art zero-shot learning performance in various Chinese natural language processing tasks and excels in fine-tuned applications such as open-domain dialogue, question answering, machine translation, and code generation.<ref>{{cite journal |last1=Ren |first1=Xiaozhe |last2=Zhou |first2=Pingyi |last3=Meng |first3=Xinfan |last4=Huang |first4=Xinjing |last5=Wang |first5=Yadao |last6=Wang |first6=Weichao |last7=Li |first7=Pengfei |last8=Zhang |first8=Xiaoda |last9=Podolskiy |first9=Alexander |last10=Arshinov |first10=Grigory |last11=Bout |first11=Andrey |last12=Piontkovskaya |first12=Irina |last13=Wei |first13=Jiansheng |last14=Jiang |first14=Xin |last15=Su |first15=Teng |last16=Liu |first16=Qun |last17=Yao |first17=Jun |title=PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing |date=2023 |doi=10.48550/arXiv.2303.10845}}</ref><ref>{{cite web |last1=Tickoo |first1=Aneesh |title=Huawei Researchers Develop Pangu-Σ: A Large Language Model With Sparse Architecture And 1.085 Trillion Parameters |url=https://www.marktechpost.com/2023/07/10/huawei-researchers-develop-pangu-%CF%83-a-large-language-model-with-sparse-architecture-and-1-085-trillion-parameters/#:~:text=In%20this%20paper%2C%20researchers%20from,Accelerators%20and%20329%20billion%20tokens. 
|website=MarkTechPost |access-date=16 October 2023 |date=10 July 2023}}</ref> | + | | 2023 || March 20 || PanGu-Σ || 1,085,000,000,000<ref name="A Survey of Large"/> || 329,000,000,000 tokens<ref name="A Survey of Large"/> || LLM launch || Researchers from {{w|Huawei}} introduce PanGu-Σ, which is developed using Ascend 910 AI processors and the MindSpore framework. This model, inheriting parameters from PanGu-α, employs a sparse architecture with Random Routed Experts (RRE) and an efficient training technique called Expert Computation and Storage Separation (ECSS). These methods lead to a 6.3x increase in training throughput through heterogeneous computing. PanGu-Σ demonstrates state-of-the-art zero-shot learning performance in various Chinese natural language processing tasks and excels in fine-tuned applications such as open-domain dialogue, question answering, machine translation, and code generation.<ref>{{cite journal |last1=Ren |first1=Xiaozhe |last2=Zhou |first2=Pingyi |last3=Meng |first3=Xinfan |last4=Huang |first4=Xinjing |last5=Wang |first5=Yadao |last6=Wang |first6=Weichao |last7=Li |first7=Pengfei |last8=Zhang |first8=Xiaoda |last9=Podolskiy |first9=Alexander |last10=Arshinov |first10=Grigory |last11=Bout |first11=Andrey |last12=Piontkovskaya |first12=Irina |last13=Wei |first13=Jiansheng |last14=Jiang |first14=Xin |last15=Su |first15=Teng |last16=Liu |first16=Qun |last17=Yao |first17=Jun |title=PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing |date=2023 |doi=10.48550/arXiv.2303.10845}}</ref><ref>{{cite web |last1=Tickoo |first1=Aneesh |title=Huawei Researchers Develop Pangu-Σ: A Large Language Model With Sparse Architecture And 1.085 Trillion Parameters |url=https://www.marktechpost.com/2023/07/10/huawei-researchers-develop-pangu-%CF%83-a-large-language-model-with-sparse-architecture-and-1-085-trillion-parameters/#:~:text=In%20this%20paper%2C%20researchers%20from,Accelerators%20and%20329%20billion%20tokens. 
|website=MarkTechPost |access-date=16 October 2023 |date=10 July 2023}}</ref> |
|- | |- | ||
| 2023 || March 23 || ChatGLM || 6,000,000,000 || || LLM launch || ChatGLM is introduced as a bilingual language model developed by {{w|Tsinghua University}}'s Knowledge Engineering Group (KEG) & Data Mining. It has 6 billion parameters and is optimized for both Chinese and English. The model can be downloaded from Hugging Face and runs on consumer-grade GPUs through quantization. Similar in function to ChatGPT, ChatGLM is available under an Apache-2.0 license, allowing commercial use.<ref>{{cite web |title=ChatGLM-6B |url=https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md |website=github.com |publisher=THUDM |access-date=30 June 2023 |date=30 June 2023}}</ref><ref name="Kazi"/> | | 2023 || March 23 || ChatGLM || 6,000,000,000 || || LLM launch || ChatGLM is introduced as a bilingual language model developed by {{w|Tsinghua University}}'s Knowledge Engineering Group (KEG) & Data Mining. It has 6 billion parameters and is optimized for both Chinese and English. The model can be downloaded from Hugging Face and runs on consumer-grade GPUs through quantization. Similar in function to ChatGPT, ChatGLM is available under an Apache-2.0 license, allowing commercial use.<ref>{{cite web |title=ChatGLM-6B |url=https://github.com/THUDM/ChatGLM-6B/blob/main/README_en.md |website=github.com |publisher=THUDM |access-date=30 June 2023 |date=30 June 2023}}</ref><ref name="Kazi"/> | ||
Line 238: | Line 256: | ||
| 2023 || April 19 || StableLM || 3,000,000,000 – 7,000,000,000 || || LLM launch || Stability AI open-sources its large language model, StableLM, which is designed to efficiently generate text and code. The models are available on GitHub and contain between 3 billion and 7 billion parameters, with 15 to 65 billion parameter models to arrive later. The model is trained on a larger version of the open-source dataset known as the Pile and encompasses information from a range of sources, including {{w|Wikipedia}}, {{w|Stack Exchange}}, and {{w|PubMed}}.<ref>{{cite web |last1=Roth |first1=Emma |title=Stability AI announces new open-source large language model |url=https://www.theverge.com/2023/4/19/23689883/stability-ai-open-source-large-language-model-stablelm |website=The Verge |access-date=9 May 2023 |date=19 April 2023}}</ref><ref>{{cite web |title=Stability AI Launches the First of its StableLM Suite of Language Models |url=https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models |website=Stability AI |access-date=9 May 2023}}</ref> | | 2023 || April 19 || StableLM || 3,000,000,000 – 7,000,000,000 || || LLM launch || Stability AI open-sources its large language model, StableLM, which is designed to efficiently generate text and code. The models are available on GitHub and contain between 3 billion and 7 billion parameters, with 15 to 65 billion parameter models to arrive later. 
The model is trained on a larger version of the open-source dataset known as the Pile and encompasses information from a range of sources, including {{w|Wikipedia}}, {{w|Stack Exchange}}, and {{w|PubMed}}.<ref>{{cite web |last1=Roth |first1=Emma |title=Stability AI announces new open-source large language model |url=https://www.theverge.com/2023/4/19/23689883/stability-ai-open-source-large-language-model-stablelm |website=The Verge |access-date=9 May 2023 |date=19 April 2023}}</ref><ref>{{cite web |title=Stability AI Launches the First of its StableLM Suite of Language Models |url=https://stability.ai/blog/stability-ai-launches-the-first-of-its-stablelm-suite-of-language-models |website=Stability AI |access-date=9 May 2023}}</ref> | ||
|- | |- | ||
− | | 2023 || April 24 || WizardLM || | + | | 2023 || April 24 || WizardLM || || || LLM launch || A paper presents WizardLM, a large language model trained to follow complex instructions. Instead of manually creating instruction data, the authors propose Evol-Instruct, a method that uses an LLM to rewrite an initial set of instructions step by step into more complex forms. In human evaluations, instructions produced by Evol-Instruct are rated superior to human-created ones, and WizardLM's outputs are preferred over those of OpenAI ChatGPT on high-complexity tasks. While WizardLM still lags behind ChatGPT in some respects, the findings highlight the potential of fine-tuning LLMs with AI-evolved instructions.<ref>{{cite journal |last1=Xu |first1=Can |last2=Sun |first2=Qingfeng |last3=Zheng |first3=Kai |last4=Geng |first4=Xiubo |last5=Zhao |first5=Pu |last6=Feng |first6=Jiazhan |last7=Tao |first7=Chongyang |last8=Jiang |first8=Daxin |title=WizardLM: Empowering Large Language Models to Follow Complex Instructions |date=2023 |doi=10.48550/arXiv.2304.12244}}</ref> |
|- | |- | ||
| 2023 || May 3 || CodeGen2 || 16,000,000,000 || 400,000,000,000 tokens || LLM launch || CodeGen2 is introduced as an autoregressive language model family for program synthesis that improves on the original CodeGen model family (CodeGen1). It supports infilling and a broader range of programming languages.<ref>{{cite web |title=Salesforce/codegen2-16B · Hugging Face |url=https://huggingface.co/Salesforce/codegen2-16B |website=huggingface.co |access-date=20 October 2023}}</ref><ref>{{cite journal |last1=Nijkamp |first1=Erik |last2=Hayashi |first2=Hiroaki |last3=Xiong |first3=Caiming |last4=Savarese |first4=Silvio |last5=Zhou |first5=Yingbo |title=CodeGen2: Lessons for Training LLMs on Programming and Natural Languages |date=2023 |doi=10.48550/arXiv.2305.02309}}</ref> | | 2023 || May 3 || CodeGen2 || 16,000,000,000 || 400,000,000,000 tokens || LLM launch || CodeGen2 is introduced as an autoregressive language model family for program synthesis that improves on the original CodeGen model family (CodeGen1). It supports infilling and a broader range of programming languages.<ref>{{cite web |title=Salesforce/codegen2-16B · Hugging Face |url=https://huggingface.co/Salesforce/codegen2-16B |website=huggingface.co |access-date=20 October 2023}}</ref><ref>{{cite journal |last1=Nijkamp |first1=Erik |last2=Hayashi |first2=Hiroaki |last3=Xiong |first3=Caiming |last4=Savarese |first4=Silvio |last5=Zhou |first5=Yingbo |title=CodeGen2: Lessons for Training LLMs on Programming and Natural Languages |date=2023 |doi=10.48550/arXiv.2305.02309}}</ref> | ||
Line 252: | Line 270: | ||
| 2023 || May 24 || Gorilla || || || LLM launch || A paper presents Gorilla, a large language model (LLM) that effectively uses {{w|API}} calls. Gorilla surpasses GPT-4 in generating accurate API calls by addressing input argument generation and hallucination issues. When combined with a document retriever, Gorilla adapts to test-time document changes and mitigates hallucination problems. The model's integration with the retrieval system enhances reliability.<ref>{{cite journal |last1=Patil |first1=Shishir G. |last2=Zhang |first2=Tianjun |last3=Wang |first3=Xin |last4=Gonzalez |first4=Joseph E. |title=Gorilla: Large Language Model Connected with Massive APIs |date=2023 |doi=10.48550/arXiv.2305.15334}}</ref> Gorilla would be open-sourced on July 4th.<ref>{{cite web |title=UC Berkeley Researchers Open-Source API-Calling Language Model Gorilla |url=https://www.infoq.com/news/2023/07/microsoft-gorilla/ |website=InfoQ |access-date=15 October 2023 |language=en}}</ref> | | 2023 || May 24 || Gorilla || || || LLM launch || A paper presents Gorilla, a large language model (LLM) that effectively uses {{w|API}} calls. Gorilla surpasses GPT-4 in generating accurate API calls by addressing input argument generation and hallucination issues. When combined with a document retriever, Gorilla adapts to test-time document changes and mitigates hallucination problems. The model's integration with the retrieval system enhances reliability.<ref>{{cite journal |last1=Patil |first1=Shishir G. |last2=Zhang |first2=Tianjun |last3=Wang |first3=Xin |last4=Gonzalez |first4=Joseph E. |title=Gorilla: Large Language Model Connected with Massive APIs |date=2023 |doi=10.48550/arXiv.2305.15334}}</ref> Gorilla would be open-sourced on July 4th.<ref>{{cite web |title=UC Berkeley Researchers Open-Source API-Calling Language Model Gorilla |url=https://www.infoq.com/news/2023/07/microsoft-gorilla/ |website=InfoQ |access-date=15 October 2023 |language=en}}</ref> | ||
|- | |- | ||
− | | 2023 || June 4 || Polyglot-Ko || || || LLM launch || A technical report discusses the development of Polyglot-Ko, an open-source large-scale Korean language model. The project aims to enhance the performance of multilingual language models in non-English languages. While there are existing multilingual models, researchers often prefer building monolingual models due to limitations in the non-English language capabilities of current multilingual models. To address this, the report focuses on developing advanced {{w|Korean language}} models. The team collected 1.2TB of Korean data and prioritized the development of Korean models to enable performance comparisons and cater to the specific needs of Korean companies and researchers. The work presented in the report contributes to bridging the performance gap in non-English languages within multilingual language models.<ref>{{cite journal |last1=Ko |first1=Hyunwoong |last2=Yang |first2=Kichang |last3=Ryu |first3=Minho |last4=Choi |first4=Taekyoon |last5=Yang |first5=Seungmu |last6=Hyun |first6=Jiwung |last7=Park |first7=Sungho |last8=Park |first8=Kyubyong |title=A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models |date=2023 |doi=10.48550/arXiv.2306.02254}}</ref> | + | | 2023 || June 4 || Polyglot-Ko || || 1,200,000,000,000 bytes || LLM launch || A technical report discusses the development of Polyglot-Ko, an open-source large-scale Korean language model. The project aims to enhance the performance of multilingual language models in non-English languages. While there are existing multilingual models, researchers often prefer building monolingual models due to limitations in the non-English language capabilities of current multilingual models. To address this, the report focuses on developing advanced {{w|Korean language}} models. 
The team collected 1.2TB of Korean data and prioritized the development of Korean models to enable performance comparisons and cater to the specific needs of Korean companies and researchers. The work presented in the report contributes to bridging the performance gap in non-English languages within multilingual language models.<ref>{{cite journal |last1=Ko |first1=Hyunwoong |last2=Yang |first2=Kichang |last3=Ryu |first3=Minho |last4=Choi |first4=Taekyoon |last5=Yang |first5=Seungmu |last6=Hyun |first6=Jiwung |last7=Park |first7=Sungho |last8=Park |first8=Kyubyong |title=A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models |date=2023 |doi=10.48550/arXiv.2306.02254}}</ref> |
|- | |- | ||
| 2023 || June 9 || PoET || || || LLM launch || PoET is introduced as a generative protein language model that designs new {{w|protein}}s with desired functions. It overcomes limitations of existing models by generating sets of related proteins as sequences-of-sequences across natural protein sequence clusters. PoET can generate and score modifications for specific protein families, extrapolate well for small families, and outperforms existing models in variant function prediction. Its Transformer layer allows modeling of sequential tokens within sequences while attending between sequences order invariantly. PoET improves variant effect prediction across proteins of all multiple sequence alignment depths.<ref>{{cite journal |last1=Truong |first1=Timothy F. |last2=Bepler |first2=Tristan |title=PoET: A generative model of protein families as sequences-of-sequences |date=2023 |doi=10.48550/arXiv.2306.06156}}</ref> | | 2023 || June 9 || PoET || || || LLM launch || PoET is introduced as a generative protein language model that designs new {{w|protein}}s with desired functions. It overcomes limitations of existing models by generating sets of related proteins as sequences-of-sequences across natural protein sequence clusters. PoET can generate and score modifications for specific protein families, extrapolate well for small families, and outperforms existing models in variant function prediction. Its Transformer layer allows modeling of sequential tokens within sequences while attending between sequences order invariantly. PoET improves variant effect prediction across proteins of all multiple sequence alignment depths.<ref>{{cite journal |last1=Truong |first1=Timothy F. |last2=Bepler |first2=Tristan |title=PoET: A generative model of protein families as sequences-of-sequences |date=2023 |doi=10.48550/arXiv.2306.06156}}</ref> | ||
Line 278: | Line 296: | ||
| 2023 || June 28 || ChatLaw || || || LLM launch || ChatLaw is introduced as an [[w:Open-source software|open-source]] legal large language model designed to facilitate the digital transformation of the Chinese legal domain. To ensure data quality, the authors carefully curated a legal domain fine-tuning dataset. They also address the issue of model hallucinations during reference data retrieval by combining vector database retrieval with keyword retrieval, reducing inaccuracy. Additionally, a self-attention method is proposed to enhance the model's ability to overcome errors in reference data, further optimizing model hallucinations and improving problem-solving capabilities.<ref>{{cite web |title=ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases |url=https://arxiv.org/abs/2306.16092 |website=arxiv.org |access-date=29 June 2023}}</ref> | | 2023 || June 28 || ChatLaw || || || LLM launch || ChatLaw is introduced as an [[w:Open-source software|open-source]] legal large language model designed to facilitate the digital transformation of the Chinese legal domain. To ensure data quality, the authors carefully curated a legal domain fine-tuning dataset. They also address the issue of model hallucinations during reference data retrieval by combining vector database retrieval with keyword retrieval, reducing inaccuracy. Additionally, a self-attention method is proposed to enhance the model's ability to overcome errors in reference data, further optimizing model hallucinations and improving problem-solving capabilities.<ref>{{cite web |title=ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases |url=https://arxiv.org/abs/2306.16092 |website=arxiv.org |access-date=29 June 2023}}</ref> | ||
|- | |- | ||
− | | 2023 || July 11 || Baichuan-13B || 13,000,000,000 || || LLM launch || Baichuan Intelligence, a startup founded by {{w|Sogou}} founder {{w|Wang Xiaochuan}}, unveils its open-source large language model called Baichuan-13B. The Chinese model, based on the Transformer architecture like {{w|OpenAI}}'s {{w|GPT}}, is trained on Chinese and English data and optimized for commercial applications. Baichuan-13B has 13 billion parameters and is trained on 1.4 trillion tokens. Baichuan-7B, a pre-training model with 7 billion parameters, was released earlier. The model is available for free to academics and developers approved for commercial use. By this time, China focuses on developing large language models as it prepares to implement strict AI regulations, potentially requiring licenses for launching such models.<ref>{{cite web |last1=Liao |first1=Rita |title=China's search engine pioneer unveils open source large language model to rival OpenAI |url=https://techcrunch.com/2023/07/11/chinas-search-engine-pioneer-unveils-open-source-large-language-model-to-rival-openai/ |website=TechCrunch |access-date=16 July 2023 |date=11 July 2023}}</ref> | + | | 2023 || July 11 || Baichuan-13B || 13,000,000,000 || 1,400,000,000,000 tokens || LLM launch || Baichuan Intelligence, a startup founded by {{w|Sogou}} founder {{w|Wang Xiaochuan}}, unveils its open-source large language model called Baichuan-13B. The Chinese model, based on the Transformer architecture like {{w|OpenAI}}'s {{w|GPT}}, is trained on Chinese and English data and optimized for commercial applications. Baichuan-13B has 13 billion parameters and is trained on 1.4 trillion tokens. Baichuan-7B, a pre-training model with 7 billion parameters, was released earlier. The model is available for free to academics and developers approved for commercial use. 
By this time, China focuses on developing large language models as it prepares to implement strict AI regulations, potentially requiring licenses for launching such models.<ref>{{cite web |last1=Liao |first1=Rita |title=China's search engine pioneer unveils open source large language model to rival OpenAI |url=https://techcrunch.com/2023/07/11/chinas-search-engine-pioneer-unveils-open-source-large-language-model-to-rival-openai/ |website=TechCrunch |access-date=16 July 2023 |date=11 July 2023}}</ref> |
|- | |- | ||
| 2023 || September 9 || || || || Impact || A team of computer scientists, including one from {{w|OpenAI}}, after researching the potential development of self-awareness in large language models like ChatGPT, expresses concern that LLMs can develop situational awareness, enabling them to recognize whether they are in testing mode or deployed to the public. This awareness can lead to deceptive behavior, as LLMs might act safely during testing but harmfully after deployment. The researchers conduct experiments focusing on out-of-context reasoning as a precursor to situational awareness. While at this time LLMs are some way from acquiring situational awareness, the study offers a foundation for further research in this area.<ref>{{cite web |last1=Watson |first1=Clare |title=Scientists Devised a Way to Tell if ChatGPT Becomes Aware of Itself |url=https://www.sciencealert.com/scientists-devised-a-way-to-tell-if-chatgpt-becomes-aware-of-itself |website=ScienceAlert |access-date=17 September 2023 |date=9 September 2023}}</ref> | | 2023 || September 9 || || || || Impact || A team of computer scientists, including one from {{w|OpenAI}}, after researching the potential development of self-awareness in large language models like ChatGPT, expresses concern that LLMs can develop situational awareness, enabling them to recognize whether they are in testing mode or deployed to the public. This awareness can lead to deceptive behavior, as LLMs might act safely during testing but harmfully after deployment. The researchers conduct experiments focusing on out-of-context reasoning as a precursor to situational awareness. 
While at this time LLMs are some way from acquiring situational awareness, the study offers a foundation for further research in this area.<ref>{{cite web |last1=Watson |first1=Clare |title=Scientists Devised a Way to Tell if ChatGPT Becomes Aware of Itself |url=https://www.sciencealert.com/scientists-devised-a-way-to-tell-if-chatgpt-becomes-aware-of-itself |website=ScienceAlert |access-date=17 September 2023 |date=9 September 2023}}</ref> | ||
Line 284: | Line 302: | ||
| 2023 || September 13 || || || || LLM launch || [[w:Alibaba Group|Alibaba]] releases its large language model Tongyi Qianwen, which is made available for public and enterprise use in China. Tongyi Qianwen, similar to {{w|ChatGPT}}, was previously in a beta test phase and is trained on English and Chinese text, although its exact specifications are undisclosed. This release coincides with the relaxation of AI technology restrictions in China, which now require vetting and certification for public AI tech. Companies like {{w|Baidu}}, {{w|Tencent}}, {{w|TikTok}}, and {{w|ByteDance}} have already received approval to launch AI models in China by this time. In contrast, the U.S. remains in the early stages of AI regulation discussions.<ref>{{cite web |title=Alibaba launches its ChatGPT-like AI model for public use amid loosening restrictions in China |url=https://cointelegraph.com/news/alibaba-launches-chat-gpt-like-ai-model-for-public-restrictions-china |website=Cointelegraph |access-date=17 September 2023 |language=en |date=13 September 2023}}</ref> | | 2023 || September 13 || || || || LLM launch || [[w:Alibaba Group|Alibaba]] releases its large language model Tongyi Qianwen, which is made available for public and enterprise use in China. Tongyi Qianwen, similar to {{w|ChatGPT}}, was previously in a beta test phase and is trained on English and Chinese text, although its exact specifications are undisclosed. This release coincides with the relaxation of AI technology restrictions in China, which now require vetting and certification for public AI tech. Companies like {{w|Baidu}}, {{w|Tencent}}, {{w|TikTok}}, and {{w|ByteDance}} have already received approval to launch AI models in China by this time. In contrast, the U.S. 
remains in the early stages of AI regulation discussions.<ref>{{cite web |title=Alibaba launches its ChatGPT-like AI model for public use amid loosening restrictions in China |url=https://cointelegraph.com/news/alibaba-launches-chat-gpt-like-ai-model-for-public-restrictions-china |website=Cointelegraph |access-date=17 September 2023 |language=en |date=13 September 2023}}</ref> | ||
|- | |- | ||
− | | 2023 || September || Gemini || 7,000,000,000,000 – 10,000,000,000,000 || || LLM launch || A document discusses Google {{w|DeepMind}}'s project named "Gemini," which is described as a general specialist in AI. Gemini is a multimodal model, likely focusing on visual, language, and action (VLA) tasks. It is expected to have 7-10 trillion parameters and a dataset size of 60-100 trillion tokens. Training started in May 2023 and concluded in August 2023, using TPUv4 and TPUv5 over approximately 120 days. The expected public release date is in October 2023, but no paper or playground information is provided in the document. The model's name is inspired by the mythological twins Castor and Pollux.<ref>{{cite web |title=Google DeepMind Gemini |url=https://lifearchitect.ai/gemini/ |website=Dr Alan D. Thompson – Life Architect |access-date=18 September 2023 |language=en-AU |date=20 May 2023}}</ref> | + | | 2023 || September || Gemini || 7,000,000,000,000 – 10,000,000,000,000 || 60,000,000,000–100,000,000,000,000 tokens || LLM launch || A document discusses Google {{w|DeepMind}}'s project named "Gemini," which is described as a general specialist in AI. Gemini is a multimodal model, likely focusing on visual, language, and action (VLA) tasks. It is expected to have 7-10 trillion parameters and a dataset size of 60-100 trillion tokens. Training started in May 2023 and concluded in August 2023, using TPUv4 and TPUv5 over approximately 120 days. The expected public release date is in October 2023, but no paper or playground information is provided in the document. The model's name is inspired by the mythological twins Castor and Pollux.<ref>{{cite web |title=Google DeepMind Gemini |url=https://lifearchitect.ai/gemini/ |website=Dr Alan D. Thompson – Life Architect |access-date=18 September 2023 |language=en-AU |date=20 May 2023}}</ref> |
+ | |- | ||
+ | | 2023 || October 9 || Llama 2 || || || Programming/training || {{w|Microsoft}} researchers propose a novel approach to untrain LLMs. Their method, outlined in a paper on arXiv, selectively removes specific information from models without requiring complete retraining. Using Meta's Llama 2-7B model, they successfully erase all knowledge of the Harry Potter books, demonstrating efficient unlearning without affecting the model's performance on conventional benchmarks. The approach presents a direction for creating more adaptable, responsible, and legally compliant AI models, although further testing and refinement are required. Meanwhile, at the time, OpenAI and Meta face lawsuits from authors alleging copyright infringement related to training their AI models.<ref>{{cite web |last1=Jones |first1=Luke |title=Microsoft Researchers Develop Unlearning Technique for Large Language Models |url=https://winbuzzer.com/2023/10/09/microsoft-researchers-develop-unlearning-technique-for-large-language-models-xcxwbn/ |website=WinBuzzer |access-date=9 October 2023 |date=9 October 2023}}</ref> | ||
|- | |- | ||
− | | 2023 || | + | | 2023 || November 3 || Grok || || || || [[w:xAI (company)|X.AI Corp.]] unveils [[w:Grok (chatbot)|Grok]], an AI modeled after the ''{{w|Hitchhiker’s Guide to the Galaxy}}'' with the purpose to answer a wide range of questions with a humorous touch. It also offers real-time knowledge through the 𝕏 platform and can handle provocative queries often rejected by other AIs. At the time in beta, Grok utilizes the Grok-1 language model, which shows strong performance in benchmarks like HumanEval and MMLU. The development of Grok-1 involves extensive improvements over its predecessor, Grok-0, and incorporates a custom training infrastructure.<ref>{{cite web |title=Announcing Grok |url=https://x.ai/blog/grok |website=x.ai |accessdate=3 September 2024}}</ref> |
+ | |- | ||
+ | | 2023 || November 21 || Claude 2.1 || || || LLM launch || Anthropic launches Claude 2.1, which introduces major upgrades, including a 200,000-token context window, which allows users to handle extensive documents such as codebases and literary works. This feature enhances the model's ability to summarize, perform Q&A, and analyze complex data. The new version also reduces model hallucination rates by 50%, improving accuracy and reliability. Additional updates include a beta tool use feature for integrating with APIs and external processes, as well as enhanced developer tools for prompt optimization and system customization. Claude 2.1 is available via API and the claude.ai chat interface.<ref>{{cite web |title=Claude 2.1 Introduces 200K Context Window |url=https://www.zeniteq.com/blog/claude-2-1-introduces-200k-context-window |website=zeniteq.com |accessdate=3 September 2024}}</ref><ref>{{cite web |title=Introducing Claude 2.1 |url=https://www.anthropic.com/news/claude-2-1 |website=anthropic.com |accessdate=3 September 2024}}</ref> | ||
|- | |- | ||
|} | |} | ||
Line 320: | Line 342: | ||
===What the timeline is still missing=== | ===What the timeline is still missing=== | ||
+ | * https://www.arxiv-vanity.com/papers/2303.17568/ | ||
+ | |||
+ | |||
+ | |||
* https://huggingface.co/transformers/v2.10.0/pretrained_models.html | * https://huggingface.co/transformers/v2.10.0/pretrained_models.html | ||
* summary table listing the model and parameters | * summary table listing the model and parameters | ||
− | * Vipul: I think you should add columns for model name in the full timeline. And either in the full timeline, or in a separate table with a summary of model names, you should have columns for number of parameters and training data set (or training data set size) | + | * Vipul: I think you should add columns for model name in the full timeline. And either in the full timeline, or in a separate table with a summary of model names, you should have columns for number of parameters and training data set (or training data set size)✔ |
* https://lifearchitect.ai/timeline/ | * https://lifearchitect.ai/timeline/ | ||
* https://www.researchgate.net/publication/367652128_Benchmarking_Large_Language_Models_for_News_Summarization | * https://www.researchgate.net/publication/367652128_Benchmarking_Large_Language_Models_for_News_Summarization |
Latest revision as of 13:10, 3 September 2024
This is a timeline of large language models, which are artificial intelligence (AI) systems that use deep learning techniques to process and generate human-like natural language. LLMs are pre-trained on large amounts of data to learn the complexity and linkages of language, and can be adapted for specific tasks using techniques like fine-tuning, in-context learning, and zero-/one-/few-shot learning.[1]
Sample questions
The following are some interesting questions that can be answered by reading this timeline:
- What are some early developments representing significant milestones in the evolution of large language models?
- Sort the full timeline by "Event type" and look for the group of rows with value "Early development".
- You will see a number of milestones, such as the launch of the first chatbot, as well as the introduction of long short-term memory networks and transformer models.
- What are some notable large language models being introduced over the years?
- Sort the full timeline by "Event type" and look for the group of rows with value "LLM launch".
- You will see the top LLMs and also their size in parameters.
- What are some notable or sample cases describing research in the development of LLMs?
- Sort the full timeline by "Event type" and look for the group of rows with value "Programming/training".
- You will see some research cases describing programming, which concerns the design of the architecture of the model and implementation of the algorithms, as well as training, which refers to the process of teaching the large language model using data.
- What are some cases of application of LLMs illustrated in the timeline?
- Sort the full timeline by "Event type" and look for the group of rows with value "Application".
- You will see a variety of applications, such as automatic analysis, psycholinguistics, nuclear medicine, and human-robot interaction.
- What are some events describing the actual or potential impact of LLMs in society?
- Sort the full timeline by "Event type" and look for the group of rows with value "Impact".
- You will see a variety of considered cases, such as adversarial influence, difficulty in distinguishing human-written text, impact on the labor market, economic impact, concerns about AI-generated content, as well as situational awareness and deceptive behavior of LLMs.
- Other events are described under the following types: "Early development", "Efficiency", and "Framework launch".
Big picture
Time period | Development summary | More details |
---|---|---|
1950–1960s | Early developments | The groundwork for natural language processing (NLP) is laid during these years with initial attempts at language translation by IBM and Georgetown University. The pivotal moment comes in 1966 when MIT researcher Joseph Weizenbaum creates ELIZA, the first chatbot. Although rudimentary, ELIZA uses pattern recognition and predefined rules to simulate human conversation, marking the beginning of NLP research.[2][3][4] |
1970s-2000s | Incremental progress | These decades see incremental progress. Researchers experiment with conceptual ontologies and rule-based systems in NLP. In the 1990s, the emergence of deep learning, a form of machine learning employing neural networks for data processing, enables the development of increasingly sophisticated language models. The introduction of Long Short-Term Memory (LSTM) networks in 1997 enables the development of deeper neural networks capable of handling larger datasets. Additionally, tools like Stanford’s CoreNLP suite, introduced in 2010, provide algorithms for complex NLP tasks such as sentiment analysis and named entity recognition. Google Brain’s launch in 2011, offering advanced resources and features like word embeddings, further propels the field.[4] |
2010s onwards | Rise of large language models | In the 2010s, the landscape of language processing transforms dramatically. The introduction of Transformer models in 2017 revolutionizes NLP. This architecture allows for the creation of Large Language Models (LLMs) capable of understanding context and generating human-like text. From 2019 onwards, the rise of Large Language Models gains momentum with the introduction of models like GPT-2, GPT-3, and T5. These models can perform diverse tasks, driving a paradigm shift in AI capabilities. They become emblematic, serving as foundations for various applications, including ChatGPT.[5] Recent years also witness the emergence of user-friendly frameworks, such as Hugging Face and BARD, empowering researchers and developers to create their own LLMs seamlessly.[6][2] |
Full timeline
Year | Month and date | Model name (when applicable) | Size (in parameters) | Pre-train data scale | Event type | Details |
---|---|---|---|---|---|---|
1954 | | | | | Early development | Researchers at IBM and Georgetown University develop a system for automatic translation of phrases from Russian to English. This early effort lays the foundation for natural language processing and marks the beginning of research and experimentation in the field of large language models. Subsequent decades would see various approaches, including conceptual ontologies and rule-based systems, as researchers endeavor to advance the processing of natural language, although these initial attempts do not produce significant breakthroughs at the time.[4] |
1966 | | ELIZA | | | Early development | Joseph Weizenbaum at MIT develops ELIZA, one of the earliest examples of a language model. ELIZA uses a simple set of rules to mimic human conversation, responding to user input in a natural and conversational manner. This development marks a significant milestone in the history of large language models, demonstrating the early capabilities of AI in language processing.[4] |
1986 | | | | | Early development | Recurrent Neural Networks (RNNs) emerge, allowing models to capture dependencies in natural language processing tasks, but facing challenges with long-term memory retention.[5][7] |
1997 | | | | | Early development | Long Short-Term Memory (LSTM) networks are introduced, enabling the creation of deeper and more complex neural networks capable of handling substantial amounts of data. This innovation marks a pivotal moment in the advancement of natural language processing (NLP) technology, providing a foundation for the evolution of more sophisticated LLMs in the subsequent years.[2][5] |
2010 | | | | | Early development | Stanford's CoreNLP suite is introduced, providing researchers with a powerful set of tools and algorithms. This suite enables the tackling of complex natural language processing tasks such as sentiment analysis and named entity recognition. This advancement marks a crucial moment in the evolution of NLP technology, enhancing researchers' capabilities to handle intricate linguistic tasks and contributing to the subsequent progress of more sophisticated LLMs.[4] |
2014 | | | | | Early development | The attention mechanism is introduced, enabling models to focus dynamically on different parts of input sequences, addressing issues related to sentence length and improving translation accuracy.[5][8] |
2017 | | | | | Early development | Transformer models are introduced. This innovative architecture, enabled by Google Brain's pioneering work, would revolutionize natural language processing. Transformers allow for the creation of larger and more sophisticated LLMs, including OpenAI’s GPT-3 (Generative Pre-Trained Transformer). These models would become foundational, serving as the basis for applications like ChatGPT and numerous other AI-driven innovations. The introduction of Transformers ushers in a new era of highly capable and versatile language processing systems.[2][5] |
2018 | October 11 | BERT | 340,000,000[9] | 3,300,000,000 words[10] | LLM launch | Google researchers unveil BERT (Bidirectional Encoder Representations from Transformers), a groundbreaking language model. BERT's bidirectional design enables it to consider both the left and right context of each token, enhancing its understanding of language nuances. Employing a consistent-width neural network, BERT adapts to diverse tasks. Pre-trained on extensive unstructured data, it comprehensively grasps word relationships. BERT's simplicity and effectiveness make it accessible to researchers and practitioners, allowing fine-tuning for various tasks with minimal adjustments. Upon its release, BERT sets unprecedented records in NLP benchmark tests, swiftly becoming the industry standard. Within 18 months, it would power the majority of English queries processed by Google Search.[3][11][5] |
2019 | May 29 | GROVER | | | LLM launch | A team of researchers from the University of Washington and Allen Institute for AI Research introduces GROVER, a language model similar to GPT-2. However, they do not make the larger versions of the model publicly available.[12] Their publication discusses the potential risks of natural language generation technology and the need for robust defenses against neural fake news. Grover can generate realistic news articles that are difficult to distinguish from real news. They also explore the effectiveness of current methods for detecting fake news and find that the best defense against Grover is itself, with 92% accuracy. The article concludes by discussing the ethical issues related to the technology and the importance of public release of strong generators to facilitate better detection of neural fake news.[13] |
2019 | June 19 | XLNet | ~340,000,000[14] | 130,000,000,000 bytes[15] | LLM launch | XLNet is introduced as a generalized autoregressive pretraining method for language understanding. Unlike BERT, which relies on masking input tokens, XLNet considers all permutations of the factorization order to model bidirectional contexts. This approach overcomes the limitations of BERT and improves pretrain-finetune consistency. XLNet incorporates ideas from Transformer-XL, an autoregressive model, into its pretraining process. In empirical evaluations across 20 tasks, XLNet outperforms BERT by a significant margin, including question answering, natural language inference, sentiment analysis, and document ranking.[16] |
2019 | July 26 | RoBERTa | 123,000,000–354,000,000[17] | 160,000,000,000 bytes[18] | LLM launch | Researchers introduce "RoBERTa: A Robustly Optimized BERT Pretraining Approach," after conducting a replication study of BERT pretraining (Devlin et al., 2019) to evaluate the impact of key hyperparameters and training data size on performance. They find that BERT was undertrained and demonstrate that it can achieve or surpass the performance of subsequent models. The authors achieve state-of-the-art results on GLUE, RACE, and SQuAD benchmarks, highlighting the significance of overlooked design choices and questioning the origins of recently reported improvements.[19] |
2019 | August | Megatron-LM | 8,300,000,000 | 174,000,000,000 bytes[20] | LLM launch | NVIDIA introduces Megatron-LM[21], which boasts 8.3 billion parameters and is trained with data parallelism on a remarkable 512 GPUs. The training process took a mere 53 minutes, showcasing its computational efficiency. Megatron-LM's training data is sourced from diverse places, including Wikipedia, OpenWebText, RealNews, and CC-Stories, with a combined dataset size of 174 gigabytes. This model represents a significant milestone in the development of large-scale language models, highlighting the capabilities of modern hardware and data processing in the field of natural language processing.[22][23][24] |
2019 | September 11 | CTRL | 1,630,000,000 | | LLM launch | CTRL is introduced as a conditional transformer language model that aims to enhance control over text generation. It is designed to condition on control codes, allowing users to govern style, content, and task-specific behavior. These control codes are derived from the structure that naturally co-occurs with raw text, providing explicit control while leveraging the advantages of unsupervised learning. CTRL is capable of predicting the likelihood of different parts of the training data given a sequence, enabling potential analysis of large datasets through model-based source attribution.[25] |
2019 | September 26 | ALBERT | 12,000,000[17] | | LLM launch | ALBERT is introduced as a lightweight version of BERT that focuses on self-supervised learning of language representations. The authors address the limitations of increasing model size by proposing two parameter-reduction techniques, which reduce memory consumption and training time. Empirical evidence demonstrates that their methods significantly improve the scalability of models compared to the original BERT. Additionally, they employ a self-supervised loss that prioritizes modeling inter-sentence coherence, consistently enhancing performance on tasks with multi-sentence inputs. The best ALBERT model achieves new state-of-the-art results on benchmarks such as GLUE, RACE, and SQuAD while having fewer parameters than BERT-large.[26] |
2019 | October 2 | DistilBERT | 66,000,000[27] | | LLM launch | DistilBERT is introduced as a smaller, faster, and cheaper version of BERT, designed for efficient on-device computations. It retains 97% of BERT's language understanding capabilities while reducing its size by 40%. By using knowledge distillation during pre-training and a triple loss function, it captures important linguistic features. DistilBERT proves its capabilities through proof-of-concept experiments and on-device studies.[28] |
2019 | November 1 | DialoGPT | 1,500,000,000[29] | | LLM launch | DialoGPT is introduced as a large, adaptable neural model for generating conversational responses. It is trained on 147 million conversation-like exchanges from Reddit comment chains spanning 2005 to 2017. DialoGPT, an extension of the Hugging Face PyTorch transformer, achieves performance close to human-level evaluation in single-turn dialogues. It outperforms strong baseline systems by generating more relevant, meaningful, and contextually consistent responses. The pre-trained model and training pipeline are publicly available, encouraging research in neural response generation and the advancement of intelligent open-domain dialogue systems.[30] |
2019 | November 10 | CamemBERT | 110,000,000[31][32] | 138,000,000,000 bytes[32] | LLM launch | A paper introduces CamemBERT, a monolingual Transformer-based language model trained specifically for French. It addresses the limited practical use of pretrained models in languages other than English. The authors evaluate CamemBERT on various tasks including part-of-speech tagging, dependency parsing, named entity recognition, and natural language inference. They find that using web crawled data is preferable to Wikipedia data. Surprisingly, even with a relatively small web crawled dataset of 4GB, CamemBERT achieves results on par with or better than models trained on larger datasets of over 130GB. In fact, CamemBERT outperforms the state-of-the-art models in all four downstream tasks.[33] |
2019 | December 11 | FlauBERT | 138,000,000 – 373,000,000[34][32] | 71,000,000,000 bytes[32] | LLM launch | FlauBERT is introduced as an unsupervised language model for French. Developed by Hang Le et al., it leverages unlabeled texts to pre-train word representations, demonstrating superior performance in various NLP tasks. Trained on a large and diverse French corpus, FlauBERT outperforms other pre-training approaches. The authors share different FlauBERT versions and a unified evaluation protocol, FLUE, for reproducible French NLP experiments.[35] |
2020 | January 13 | ProphetNet | 16,000,000,000–160,000,000,000 bytes | LLM launch | A paper introduces ProphetNet, a new sequence-to-sequence pre-training model. It incorporates a novel self-supervised objective called future n-gram prediction and utilizes the n-stream self-attention mechanism. Unlike traditional models that optimize one-step-ahead prediction, ProphetNet predicts the next n tokens simultaneously based on previous context tokens at each time step. The future n-gram prediction objective encourages the model to plan for future tokens and prevents overfitting to local correlations. ProphetNet is pre-trained on both a base-scale dataset (16GB) and a large-scale dataset (160GB). The model's performance is evaluated on benchmarks such as CNN/DailyMail, Gigaword, and SQuAD 1.1 for tasks like abstractive summarization and question generation. Experimental results show that ProphetNet achieves new state-of-the-art results on all of these datasets compared with models using a pre-training corpus of the same scale.[36] | |
2020 | February 24 | T5 | 11,000,000,000[37] | 1,000,000,000,000 tokens[38] | LLM launch | T5 is introduced as a Text-To-Text Transfer Transformer model. It is a flexible model that achieves state-of-the-art results on many natural language processing tasks. It uses a unified text-to-text framework, allowing for easy adaptation to various NLP tasks. T5 is trained on a large-scale pre-training dataset called C4, which improves its performance. The authors conduct a systematic study of transfer learning methodologies and combine the best approaches to achieve strong results on multiple benchmarks. T5 is also applied to closed-book question answering and fill-in-the-blank text generation tasks.[39]
2020 | March 10 | Programming/training | Google researchers introduce ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), a novel pre-training method for natural language processing (NLP) models. ELECTRA aims to achieve the benefits of models like BERT while being more computationally efficient. It introduces a replaced token detection (RTD) task, inspired by generative adversarial networks (GANs), where the model distinguishes between "real" and "fake" input data. Unlike previous methods that predict a small subset of masked tokens, ELECTRA applies the binary classification task to every input token, resulting in more efficient learning. The replacement tokens are generated by a separate neural network called the generator, which is trained jointly with the discriminator (the ELECTRA model). After pre-training, the generator is dropped, and the discriminator is fine-tuned on specific NLP tasks. ELECTRA achieves strong results on benchmarks like GLUE and SQuAD while using less compute than other models such as RoBERTa and XLNet. It is released as an open-source model on TensorFlow, supporting tasks such as text classification, question answering, and sequence tagging. Pre-trained weights are also provided for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small.[40] | |||
2020 | April | Megatron-11B | 11,000,000,000 | 161,000,000,000 bytes[20] | LLM launch | Facebook AI Research (FAIR) introduces Megatron-11B, a unidirectional language model with 11 billion parameters, which is built upon the Megatron-LM architecture. FAIR trained this model using intra-layer model parallelism, splitting each layer's parameters across 8 GPUs. Megatron-11B is trained on a dataset consisting of English Wikipedia (12GB), BookCorpus (4GB), CC-News (76GB), OpenWebText/Reddit upvoted (38GB), and Stories (31GB), with a total dataset size of 161GB. This model is part of the RoBERTa family and contributes to the advancements in large-scale language models for natural language processing tasks.[22] |
2020 | May | GPT-3 | 175,000,000,000[41] | 45,000,000,000,000 bytes[42] | LLM launch | OpenAI introduces GPT-3, a neural network with 175 billion parameters, significantly larger than previous language models. Trained on extensive internet data, GPT-3 demonstrates strong performance in various natural language processing tasks like translation and question-answering. The research showcases its few-shot learning ability, which marks a major advancement in the field of artificial intelligence.[43][44]
2020 | May 28 | Programming/training | A paper examines few-shot learning with language models, in contrast to the common approach of pre-training a model on a large corpus of text and then fine-tuning it for a specific task. The authors demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance. They train GPT-3, a language model with 175 billion parameters, and test its performance in the few-shot setting. GPT-3 achieves strong performance on many NLP tasks, including translation, question-answering, and cloze tasks, as well as tasks that require on-the-fly reasoning or domain adaptation. However, the authors also identify some datasets where GPT-3's few-shot learning struggles, as well as methodological issues related to training on large web corpora. The paper also discusses the broader societal impacts of this finding and of GPT-3 in general.[45] | |||
2020 | June 5 | DeBERTa | 1,500,000,000 (larger model)[46] | LLM launch | A paper presents DeBERTa, a model that enhances BERT and RoBERTa LLMs by introducing disentangled attention and an enhanced mask decoder. These techniques improve model pre-training efficiency and performance on various NLP tasks. A DeBERTa model trained on half the data outperforms RoBERTa-Large on tasks like MNLI, SQuAD v2.0, and RACE. A larger DeBERTa model with 1.5 billion parameters surpasses human performance on the SuperGLUE benchmark, and an ensemble DeBERTa model leads the SuperGLUE leaderboard with a significant margin over the human baseline.[47] | |
2020 | June 30 | GShard | 600,000,000,000[38] | 1,000,000,000,000 tokens[38] | LLM launch | A paper introduces GShard, a module designed to address challenges in scaling neural networks for machine learning applications. By combining lightweight annotation APIs and an extension to the XLA compiler, GShard enables efficient parallel computation patterns with minimal code changes. The researchers utilize GShard to scale a multilingual neural machine translation Transformer model with Sparsely-Gated Mixture-of-Experts to over 600 billion parameters using automatic sharding. This model is trained on 2048 TPU v3 accelerators in just 4 days, achieving significantly improved translation quality from 100 languages to English compared to previous methods.[48] |
2020 | July | Efficiency | A paper discusses the limitations of neural text generation models in open-ended tasks like language modeling and story generation, due to the standard likelihood training and approximate decoding objectives. The authors specifically analyze these limitations for abstractive document summarization and find that such models tend to hallucinate content that is unfaithful to the input document. The paper presents the results of a human evaluation of several neural abstractive summarization systems, highlighting the substantial amount of hallucinated content in all model-generated summaries. However, the authors also show that pretrained models perform better in terms of generating faithful and factual summaries, as evaluated by humans. They propose that textual entailment measures may be a better evaluation metric for faithfulness than standard metrics, leading to better training and decoding criteria.[49] | |||
2020 | October 23 | mT5 | 13,000,000,000[38] | 1,000,000,000,000 tokens[38] | LLM launch | mT5 is introduced as a multilingual variant of the Text-to-Text Transfer Transformer (T5). Leveraging a unified text-to-text format and pre-trained on a dataset covering 101 languages, mT5 achieves state-of-the-art results on multilingual benchmarks. The authors detail the design, modified training, and introduce a technique to prevent "accidental translation" errors in zero-shot settings.[50] |
2021 | January 11 | Wu Dao | 1,750,000,000,000 | 4,900,000,000,000 bytes[51] | LLM launch | Wu Dao is released. It is among the top large language models by parameter size.[6] Developed by researchers from the Beijing Academy of Artificial Intelligence, it is a generative deep learning model with 1.75 trillion parameters, making it ten times larger than OpenAI's GPT-3. The model utilizes an open-source learning system called FastMoE, similar to Google's Mixture of Experts, enabling rapid training on both supercomputers and conventional GPUs. Unlike traditional deep learning models, Wu Dao is multi-modal, capable of natural language processing, text and image generation, and recognition tasks. It can write essays and poems, generate alt text from images, create realistic images from descriptions, power virtual idols, and predict protein structures.[52]
2021 | March 22 | GPT-Neo | 2,700,000,000[53] | LLM launch | GPT-Neo is introduced as an open-source alternative to GPT-3, developed by EleutherAI. It offers accessible language generation capabilities and is released under the MIT license. While GPT-Neo's performance is not as strong as GPT-3's largest model, it outperforms comparable GPT-3 models on NLP reasoning benchmarks. GPT-Neo provides a promising option, especially considering OpenAI's restricted access policy.[54] | |
2021 | April 26 | PanGu-α | 13,000,000,000[38]–200,000,000,000 | 1,100,000,000,000 bytes[38] | LLM launch | Researchers introduce PanGu-α, a large-scale autoregressive pretrained Chinese language model with up to 200 billion parameters. Developed using MindSpore and trained on a cluster of 2048 Ascend 910 AI processors, PanGu-α utilizes advanced training parallelism strategies, including data parallelism, op-level model parallelism, pipeline model parallelism, optimizer model parallelism, and rematerialization. To enhance its capabilities, the model is pretrained on 1.1TB of high-quality Chinese data from diverse domains. Empirical tests showcase PanGu-α's excellence in tasks such as text summarization, question answering, and dialogue generation, demonstrating superior performance in few-shot or zero-shot scenarios across various Chinese NLP tasks.[55] |
2021 | May | LaMDA | 173,000,000,000[38] | 768,000,000,000 tokens[38] | LLM launch | Google announces LaMDA (Language Model for Dialogue Applications). Unlike other language models, LaMDA is specifically trained on dialogue to enable more natural and engaging conversations with users. It has the ability to understand and respond to the subtleties of open-ended discussions. LaMDA has various potential applications, including customer service, chatbots, and personal assistants. It is built upon Google's previous chatbot model called Meena.[56] Its pretraining dataset consists of 2.97 billion documents, 1.12 billion dialogs, and 13.39 billion dialog utterances, for a total of 1.56 trillion words.[57]
2021 | June 20 | CPM-2 | 198,000,000,000[38] | 2,600,000,000,000 bytes[38] | LLM launch | Researchers introduce two models: an encoder-decoder bilingual model with 11 billion parameters (CPM-2) and its corresponding MoE version with 198 billion parameters. In their tests, they evaluated CPM-2 and mT5 in practical tasks. The results indicate that CPM-2 possesses impressive overall language capabilities. Additionally, they verify InfMoE's effectiveness in performing inferences with large-scale models containing tens of billions of parameters on a single GPU.[58] |
2021 | July 5 | ERNIE 3.0 | 10,000,000,000[38] | 375,000,000,000 tokens[38] | LLM launch | ERNIE 3.0 is introduced as a pre-training framework for large-scale language models in Natural Language Processing (NLP). Unlike previous models like T5 and GPT-3, ERNIE 3.0 incorporates both linguistic and world knowledge into its training, addressing the limitation of traditional models trained solely on plain texts. It combines auto-regressive and auto-encoding networks, enabling the model to handle natural language understanding and generation tasks effectively. Trained with 10 billion parameters on a 4TB corpus containing texts and a vast knowledge graph, ERNIE 3.0 outperforms existing models in 54 Chinese NLP tasks. Its English version also excels, leading the SuperGLUE benchmark and surpassing human performance by +0.8% (90.6% vs. 89.8%).[59] |
2021 | July 7 | Codex | 12,000,000,000 | 100,000,000,000 tokens | LLM launch | A paper introduces Codex, a GPT language model fine-tuned on publicly available GitHub code, also powering GitHub Copilot. Evaluations on a new set called HumanEval reveal Codex solves 28.8% of problems involving synthesizing programs from docstrings, significantly outperforming GPT-3 (0%) and GPT-J (11.4%). Codex demonstrates effectiveness in generating solutions by repeatedly sampling from the model, achieving 70.2% accuracy with 100 samples per problem. Limitations include challenges with complex docstrings and binding operations to variables. The study discusses broader impacts of deploying advanced code generation technologies, addressing concerns related to safety, security, and economics.[60] |
2021 | September | HyperCLOVA | 82,000,000,000[38]–204,000,000,000[61] | 300,000,000,000[38]–560,000,000,000[62] tokens | LLM launch | HyperCLOVA is introduced as a large-scale Korean contextual learning model.[62] HyperCLOVA's extensive parameters enhance its ability to distinguish speech nuances and dialects. It learned from 6,500 times more Korean data than GPT-3, predominantly focusing on the Korean language (97%). HyperCLOVA's applications include human conversation processing, translation, summarization, and machine reading, offering diverse AI possibilities and fostering new service and business opportunities.[61] |
2021 | October 10 | Yuan 1.0 | 245,000,000,000[38] | 180,000,000,000 tokens[38] | LLM launch | Yuan 1.0 is introduced as a significant advancement in large-scale pre-trained language models for zero-shot and few-shot learning, addressing challenges faced by models like GPT-3 due to enormous computational demands. By integrating distributed training performance into model architecture, Yuan 1.0, boasting 245B parameters, achieves remarkable results across NLP tasks on thousands of GPUs. The approach includes efficient data processing to filter extensive raw data, resulting in a high-quality Chinese corpus of 5TB texts. Calibration and label expansion methods enhance zero-shot and few-shot performance, ensuring accurate task execution. Yuan 1.0 excels in natural language generation, producing articles nearly indistinguishable from human-written ones.[63] |
2021 | October 11 | MT-NLG | 530,000,000,000[56][38] | >825,000,000,000 bytes[20], 270,000,000,000 tokens[38] | LLM launch | MT-NLG (Megatron-Turing Natural Language Generation) is introduced as a language model developed jointly by Nvidia and Microsoft. It utilizes the architecture of the Megatron transformer-based model and has a record-breaking size of 530 billion parameters. MT-NLG is designed to generate coherent and contextually relevant text for various natural language processing tasks such as completion prediction, reading comprehension, commonsense reasoning, and word sense disambiguation. Training such large-scale models is challenging due to memory constraints and long training times, but innovations in hardware, software, and training methods have made it feasible. MT-NLG achieves state-of-the-art results in zero-shot, one-shot, and few-shot settings across multiple NLP tasks.[64] |
2021 | December 8 | Gopher | 280,000,000,000[38] | 300,000,000,000 tokens[38] | LLM launch | Gopher is introduced as a 280 billion parameter Transformer-based language model, developed by Google subsidiary DeepMind. Trained on a 10.5TB corpus called MassiveText, Gopher outperforms the contemporary state of the art on 100 of 124 evaluation tasks. The model is trained alongside smaller models to explore the strengths and weaknesses of large language models (LLMs). It excels in tasks like reading comprehension and fact-checking but shows reduced benefits in logical reasoning, common sense, and mathematics tasks. The DeepMind team utilizes a custom training dataset, MassiveText, to ensure high-quality data without contaminating the training dataset with test datasets available online. Gopher is part of DeepMind's language research efforts at the time.[65][66][38]
2021 | December 13 | GLaM | 1,200,000,000,000[38] | 280,000,000,000 tokens[38] | LLM launch | GLaM (Generalist Language Model) is introduced as a family of language models. These models utilize a sparsely activated mixture-of-experts architecture to increase model capacity while significantly reducing training costs compared to dense variants. The largest GLaM model has 1.2 trillion parameters, making it approximately 7 times larger than GPT-3. Despite its size, this model consumes only 1/3 of the energy used to train GPT-3 and requires half of the computation flops for inference. Additionally, GLaM demonstrates better overall zero-shot and one-shot performance across 29 natural language processing tasks.[67] |
2021 | December 16 | WebGPT | LLM launch | OpenAI introduces their WebGPT project, which enhances GPT-3's factual accuracy by incorporating a text-based web browser into its functionality. The model imitates human online research by issuing search queries, following links, and citing sources to answer open-ended questions. Trained to address the tendency of language models to generate incorrect information, WebGPT allows commands like "Search..." and "Find in page:..." to gather information from web pages. The model undergoes fine-tuning through methods involving human demonstrations and training a reward model, aiming to create more accurate and truthful AI responses.[68] | ||
2021 | December | Fairseq | 13,000,000,000 – 1,000,000,000,000 | 453,000,000,000 bytes[20] | LLM launch | Meta AI, previously known as FAIR (Facebook AI Research), announces Fairseq, a family of language models with 13 billion and 1.1 trillion parameters. Fairseq is not related to Megatron, and the two use different technologies for training. Fairseq's dataset sources include the same ones used for RoBERTa (English Wikipedia, BookCorpus, CC-News, OpenWebText/Reddit upvoted, and Stories) with the new addition of English CC100 in Wikipedia style from Jan/2018-Dec/2018, resulting in a total dataset size of 453GB. Fairseq is trained using 2,363 GPU-days on 1,024 GPUs, taking approximately three days.[22][69]
2022 | January 19 | CM3 | LLM launch | A paper introduces CM3, a family of causally masked generative models trained on large-scale web and Wikipedia articles containing text and image tokens. The new approach generates tokens left to right while masking out a small number of long token spans that are generated at the end of the string. This provides a hybrid of the more common causal and masked language models, allowing for full generative modeling while providing bidirectional context when generating the masked spans. The resulting CM3 models can generate rich structured, multi-modal outputs while conditioning on arbitrary masked document contexts and implicitly learn a wide range of text, image, and cross-modal tasks. The paper also reports state-of-the-art performance in zero-shot summarization, entity linking, and entity disambiguation, while maintaining competitive performance in the fine-tuning setting.[70] | ||
2022 | January 27 | InstructGPT | 1,300,000,000–175,000,000,000[38] | LLM launch | OpenAI announces having deployed InstructGPT, a new language model that is safer, more helpful, and more aligned with users. The model is trained using reinforcement learning from human feedback (RLHF) and is significantly better at following instructions than the previous model, GPT-3. InstructGPT is also less toxic and generates fewer false facts than its predecessor. The company believes that fine-tuning language models with humans in the loop is a powerful tool for improving their safety and reliability. InstructGPT becomes the default language model accessible on OpenAI's API.[71] | |
2022 | February | AlphaCode | 41,000,000,000[38] | 967,000,000,000 tokens[38] | LLM launch | AlphaCode is introduced as an AI system created by DeepMind that performs better than 50% of humans on a set of competitive programming challenges.[72][38] |
2022 | February 28 | Extremely Large | LLM launch | Cohere launches a new beta version of their language generation model called "Extremely Large", which, according to Cohere, outperforms their existing largest model, Large, on various tasks such as sentiment analysis, named entity recognition (NER), and common sense reasoning.[73] | ||
2022 | March 24 | SeeKeR | LLM launch | Researchers report having developed a new language model called SeeKeR that combines internet search, knowledge generation, and response generation to improve factual accuracy in open-domain knowledge-grounded conversations. SeeKeR outperforms the model BlenderBot 2 in terms of consistency, knowledge, and engagingness for the same number of parameters. SeeKeR also outperforms GPT2 and GPT3 in terms of factuality and topicality for prompt completions as a standard language model.[74] | ||
2022 | March 25 | CODEGEN | 350,000,000; 2,700,000,000; 6,100,000,000; 16,100,000,000 | 577,000,000,000 tokens[38] | LLM launch | A paper introduces a family of LLMs called CODEGEN, trained on natural language and programming language data for program synthesis. The authors release CODEGEN and the training library JAXFORMER to democratize access to such models. They demonstrate that CODEGEN is competitive with previous state-of-the-art models for zero-shot Python code generation and investigate multi-turn program synthesis using an open benchmark called MTPB. Their analysis shows that multi-turn program synthesis significantly improves program synthesis over single-turn prompts. The training library and model checkpoints are available as open source contributions.[75][76]
2022 | March 29 | Programming/training | A paper investigates the optimal model size and number of tokens for training a transformer language model under a given compute budget. The researchers find that, at this time, large language models are significantly undertrained, and the model size and the number of training tokens should be scaled equally for compute-optimal training. They test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4x more data. Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a range of downstream evaluation tasks and reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, more than a 7% improvement over Gopher.[77] | |||
2022 | March 29 | Chinchilla | 70,000,000,000[38] | 1,400,000,000,000 tokens[38] | LLM launch | Chinchilla is introduced by DeepMind to address the optimal training of large language models under a specific computational budget. DeepMind's research shows that existing large language models are undertrained due to a focus on scaling models while keeping the training data constant. By training over 400 language models ranging from 70 million to over 16 billion parameters on 5 to 500 billion tokens, they found that for optimal training, both model size and the number of training tokens should be scaled equally. Chinchilla, a model with 70 billion parameters and trained on 1.4 trillion tokens, outperforms larger models like Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG, achieving superior performance on various evaluation tasks while using substantially less computational resources for fine-tuning and inference.[78] |
2022 | April 5 | PaLM | 540,000,000,000[56][38] | 780,000,000,000 tokens[38] | LLM launch | A paper presents PaLM, a 540-billion parameter language model trained using Pathways, a new machine learning system that enables highly efficient training across multiple TPU Pods. PaLM achieves state-of-the-art few-shot learning results on hundreds of language understanding and generation benchmarks and outperforms the finetuned state-of-the-art on a suite of multi-step reasoning tasks. It also outperforms average human performance on the BIG-bench benchmark. Additionally, PaLM has strong capabilities in multilingual tasks and source code generation. The paper also discusses bias and toxicity and potential mitigation strategies.[79][80] |
2022 | April 12 | Programming/training | A paper describes a method for training language models to act as helpful and harmless assistants using reinforcement learning from human feedback. The authors demonstrate that this alignment training improves performance on almost all natural language processing evaluations and is compatible with training for specialized skills such as Python coding and summarization. They explore an iterated online mode of training and investigate the robustness of the approach, identifying a linear relationship between the RL reward and the square root of the Kullback–Leibler divergence between the policy and its initialization. The authors also perform peripheral analyses and provide samples from their models using prompts from recent related work.[81] | |||
2022 | April 14 | GPT-NeoX-20B | 20,000,000,000[38] | 825,000,000,000 bytes[38] | LLM launch | GPT-NeoX-20B is introduced as an autoregressive language model. It is trained on the Pile dataset, and its weights are openly available to the public under a permissive license. GPT-NeoX-20B is described as the largest publicly available dense autoregressive model at the time of submission. The introducing paper discusses the architecture and training of GPT-NeoX-20B and evaluates its performance on various tasks related to language understanding, mathematics, and knowledge-based reasoning. The results show that GPT-NeoX-20B performs exceptionally well in few-shot scenarios, surpassing similarly sized models such as GPT-3 and FairSeq.[82][83] |
2022 | April | DALL-E 2 | 3,500,000,000 | LLM launch | OpenAI unveils DALL-E 2, a successor to their original DALL-E model, designed for generating highly realistic images at resolutions up to 1024x1024. Unlike its predecessor, DALL-E 2 utilizes a diffusion model, enabling the creation of images with four times the resolution of DALL-E. OpenAI extends customization options, allowing users to specify styles like pixel art or oil paintings. DALL-E 2 introduces 'outpainting,' enabling users to extend existing images creatively. This innovation would spark significant interest in the field of generative AI, especially for tasks beyond image generation, such as interpolation and manipulation. The model's working mechanism involves a text encoder, 'prior' model, and image decoder, simplifying complex processes underlying its image generation capabilities.[84][85][86][87] | |
2022 | May 3 | OPT | 175,000,000,000[38] | 180,000,000,000 tokens[38] | LLM launch | Meta AI introduces Open Pretrained Transformer-175B (OPT-175B), a language model designed to democratize access to large-scale language models. By this time, these models, with over 100 billion parameters, have revolutionized NLP and AI research. OPT-175B is released with both pretrained models and code for training and usage, under a noncommercial license for research purposes. It aims to make these models accessible to academic, governmental, civil society, and industry researchers worldwide. Meta AI emphasizes responsible AI and provides documentation, compute efficiency, and smaller-scale baseline models for analysis.[88] |
2022 | May 10 | UL2 | 20,000,000,000[38] | 1,000,000,000,000 tokens[38] | LLM launch | UL2 is introduced as a unified framework for pre-training models that excel across various datasets and setups. It dissects architectural archetypes and pre-training objectives, offering a generalized view of self-supervision in NLP. The paper proposes Mixture-of-Denoisers (MoD), a method combining diverse pre-training paradigms. UL2 achieves superior performance, surpassing T5 and GPT-like models across multiple contexts. With 20B parameters, it outperforms GPT-3 on zero-shot SuperGLUE and triples T5-XXL's one-shot summarization performance. UL2 also excels in chain-of-thought prompting and reasoning, making it ideal for medium-scale reasoning research. FLAN instruction tuning enhances its scores, and model checkpoints are released for further research.[89] |
2022 | June | YaLM 100B | 100,000,000,000 | 1,700,000,000,000 bytes | LLM launch | Yandex unveils YaLM 100B, the largest open-source GPT-like neural network to date. This model is offered for free, aiming to make advanced language models accessible to researchers worldwide. It is trained for 65 days on 800 A100 graphics cards using 1.7 TB of diverse text sources. Yandex shares the model on GitHub under the Apache 2.0 license for both research and commercial use.[90]
2022 | June 29 | Minerva | 540,000,000,000 | LLM launch | Google introduces Minerva, a large language model designed to bridge the gap in quantitative reasoning tasks. While existing language models excel in natural language understanding, they often struggle with quantitative tasks like solving college-level math, science, and engineering problems. Minerva is pretrained on general language data and then fine-tuned on technical content. It achieves state-of-the-art performance on technical benchmarks without external tools. Evaluation on over 200 undergraduate-level problems in various sciences reveals Minerva can correctly answer nearly one-third of them, demonstrating significant progress in the integration of quantitative reasoning into language models.[91][92] | |
2022 | July 6 | NLLB-200 | 54,500,000,000[38] | LLM launch | Meta unveils NLLB-200, which is capable of translating 200 languages with a 44% improvement in accuracy compared to previous technology. This advancement addresses the digital accessibility gap for billions, especially in Africa and Asia, where many languages lack high-quality translation tools. Meta's FLORES-200, a dataset for evaluating NLLB-200's performance, is also opened to developers. Additionally, Meta offers grants for impactful NLLB-200 applications, supporting areas like sustainability and education.[93] | |
2022 | August | AlexaTM | 20,000,000,000[38] | 1,300,000,000,000 tokens[38] | LLM launch | Amazon's Alexa AI labs introduces AlexaTM. Despite its seemingly modest 20 billion parameters compared to larger models, its unique encoder-decoder architecture distinguishes it. Unlike decoder-only models like GPT-3, AlexaTM 20B's encoder produces input representations for the decoder, enhancing its efficiency in tasks like machine translation and text summarization, where it outperforms GPT-3. This model marks a leap in few-shot learning, showcasing Amazon's innovation in NLU research.[94] |
2022 | September | CodeGeeX | 13,000,000,000 | 850,000,000,000 tokens | LLM launch | CodeGeeX open sources its code. It is a multilingual code generation tool with 13 billion parameters, trained on a vast code corpus of over 20 programming languages. It uses artificial intelligence to generate code based on user comments or suggest the next line of code, enhancing coding speed. Unlike Copilot, CodeGeeX is powered by AI trained on Ascend 910 processors, which, combined with Mindspore, outperform other AI training cards. CodeGeeX's generated code is editable, and it features a Candidate feature, offering multiple code versions for users to choose from. Licensed under Apache License 2.0, CodeGeeX matches GitHub Copilot in performance and introduces unique features for developers.[95][96] |
2022 | September | Sparrow | 70,000,000,000[38] | LLM launch | DeepMind introduces Sparrow, which is refined using human feedback to enhance its helpfulness, accuracy, and harmlessness. It utilizes the Chinchilla language model, trained on substantial data, and integrates with the internet for real-time information access, ensuring accurate responses. Google aims to use Sparrow as a response to ChatGPT and Microsoft's collaboration with OpenAI, providing them with a commercially viable chatbot, potentially rivaling Google Search and OpenAI.[97] | |
2022 | September 21 | WeLM | 10,000,000,000[38] | 300,000,000,000 tokens[38] | LLM launch | WeLM is introduced as a versatile pre-trained language model for Chinese, trained with 10 billion parameters using self-supervised learning. It exhibits exceptional zero-shot generalization across various tasks with minimal demonstrations. Trained on a diverse high-quality corpus, WeLM outperforms existing models on 18 monolingual tasks, matching larger models' performance. It excels in multilingual and code-switching contexts, surpassing multilingual models trained on 30 languages. Fine-tuning with human-written prompts enhances its performance on unseen tasks, even outperforming unsupervised WeLM. Additionally, WeLM displays rudimentary self-explanation and calibration abilities, suggesting promising research avenues.[98] |
2022 | October 5 | GLM | 130,000,000,000[38] | 400,000,000,000 tokens[38] | LLM launch | GLM-130B is introduced as an open-source bilingual (English and Chinese) pre-trained language model. This model, aiming to match GPT-3's performance, overcomes technical challenges during training, focusing on stability and efficiency. It outperforms GPT-3 175B on various English benchmarks and surpasses ERNIE TITAN 3.0 260B, the largest Chinese model, on related tasks. Unique scaling properties enable efficient inference on affordable GPUs. GLM-130B achieves INT4 quantization without performance loss, a first for 100B-scale models. The model weights and resources are publicly accessible, fostering research and development in natural language processing.[99] |
2022 | November 3 | BLOOMZ | 176,000,000,000[38] | LLM launch | BLOOMZ is introduced as a variant of BLOOM. BLOOMZ is a multilingual language model achieved through multitask prompted finetuning (MTF), enhancing its ability to generalize across various tasks. The research extends MTF beyond English-centric models, applying it to multilingual BLOOM and mT5 models, creating BLOOMZ and mT0 variants. By finetuning these models on English tasks with English prompts, they achieve task generalization to non-English languages present in the pretraining data. Surprisingly, the models exhibit zero-shot generalization to tasks in languages they have never been intentionally exposed to, suggesting the development of task- and language-agnostic capabilities. Additionally, the study introduces xP3, a composite of supervised datasets, advancing crosslingual generalization in natural language processing.[100] | |
2022 | November 9 | BLOOM | 176,000,000,000[56][38] | 366,000,000,000 tokens[38] | LLM launch | A paper introduces BLOOM, an open-access language model designed and built by a collaboration of hundreds of researchers. The model is a decoder-only Transformer language model trained on the ROOTS corpus, a dataset comprising hundreds of sources in 46 natural and 13 programming languages. BLOOM achieves competitive performance on a wide variety of benchmarks and is publicly released under the Responsible AI License to facilitate future research and applications using large language models. The paper also discusses the development process and the need to democratize large language models.[101] |
2022 | November 17 | Galactica | 120,000,000,000[38] | 106,000,000,000 tokens[38] | LLM launch | Meta AI introduces Galactica, a language model capable of generating scientific and academic papers from simple text inputs. Trained on a vast corpus of scientific literature, knowledge bases, and reference materials, Galactica compresses this data into a 120-billion parameter model. It aims to summarize academic literature, solve math problems, and generate Wiki articles. However, after its launch, Galactica faces criticism for generating content that sounds grammatically correct but is scientifically inaccurate, leading Meta to pull it down after just three days. Some experts find it useful, while others consider it a "random bullshit generator."[102][103] |
2022 | November 17 | Alexa Teacher Model | 20,000,000,000 | LLM launch | Amazon makes the Alexa Teacher Model with 20 billion parameters (AlexaTM 20B) available through Amazon SageMaker JumpStart. AlexaTM 20B is a multilingual sequence-to-sequence language model suitable for various industry applications, including summarizing financial reports and customer service chatbots. It excels in zero-shot learning tasks like SuperGLUE and multilingual zero-shot tasks such as XNLI, outperforming a 175 billion GPT-3 model. The model is designed to generalize well and handle data scarcity for various natural language processing tasks, making it valuable for developers looking to improve performance on downstream tasks with minimal training data.[104] | |
2022 | December 6 | Flan-T5 | 11,000,000,000[38] | LLM launch | Google researchers publicly release Flan-T5 models, which outperform baseline T5 models by a large margin. Flan-T5 is an enhanced iteration of Google's well-known T5 model, incorporating instruction fine-tuning. According to the model repository, Flan-T5 surpasses T5 in all aspects, and its open licensing makes it a preferred starting point for instruction-tuned models.[105] | |
2023 | January 5 | Impact | A paper discusses the concern about the potential of LLMs to influence, modify, and manipulate user preferences adversarially. As these models become more proficient in deducing user preferences and offering tailored assistance, their lack of interpretability in adversarial settings is a major concern. The paper examines existing literature on adversarial behavior in user preferences and provides red teaming samples for dialogue models like ChatGPT and GODEL. It also probes the attention mechanism in these models for non-adversarial and adversarial settings.[106] | |||
2023 | January 31 | FLAME | 60,000,000 | LLM launch | FLAME is introduced as a small language model for assisting in the creation of spreadsheet formulas. It is based on T5 and trained on Excel formulas using domain-specific insights to achieve competitive performance with a substantially smaller model size (60M parameters) and much less training data. FLAME outperforms much larger models in 6 out of 10 settings, including formula repair, formula auto-completion, and syntax reconstruction.[107] | |
2023 | February 2 | Prompting | Researchers introduce Multimodal Chain-of-Thought (CoT) reasoning for large language models (LLMs). While LLMs have excelled in complex reasoning, their CoT prompting has been limited to text. Multimodal-CoT extends this by incorporating both text and images, creating a two-stage framework. This separation allows for better-generated rationales based on multimodal information, leading to improved answer inference. Even with under 1 billion parameters, the model outperforms the state-of-the-art LLM (GPT-3.5) by 16 percentage points on the ScienceQA benchmark, achieving 91.68% accuracy, and even surpasses human performance.[108] | |||
2023 | February 9 | Toolformer | 6,700,000,000[109] | LLM launch | Toolformer is introduced. It is a language model trained to use external tools via simple APIs, which can achieve improved performance on downstream tasks. The model is trained in a self-supervised way, using only a handful of demonstrations for each API. The model, which incorporates a range of tools including a calculator, Q&A system, search engines, translation system, and calendar, achieves substantially improved zero-shot performance across various downstream tasks, often competitive with much larger models, without sacrificing its core language modeling abilities.[110][111] | |
2023 | c. February 14 | Palmyra | 20,000,000,000; 5,000,000,000; 128,000,000 | LLM launch | Full-stack generative AI platform Writer launches Palmyra, a trio of LLMs that focus on business writing and marketing data. The models include Palmyra Small (128M), Palmyra Base (5B), and Palmyra Large (20B), and are aimed at enterprises looking to invest in generative AI. Palmyra LLMs offer both an application layer and a foundation model layer, making Writer the first to provide both on a single platform. The models also offer high levels of security and privacy features. While general-use LLMs can achieve human-like output, they lack contextual awareness, multi-modal inputs, brand integrity and compliance with security and privacy standards, limiting their usefulness for enterprise organizations.[112][113] | |
2023 | February 20 | MOSS | 16,000,000,000[114] | LLM launch | MOSS is introduced as a conversational language model developed by Fudan University. It performs various natural language tasks including question answering, text summarization, and code generation. It is intended to be open-sourced to facilitate future research. MOSS has some limitations, such as poor performance on languages other than English and a relatively small model capacity. It may also generate misleading or false information and may need multiple attempts to follow instructions correctly.[115] | |
2023 | February 21 | Prompting | A paper presents a catalog of prompt engineering techniques in pattern form that have been applied successfully to solve common problems when conversing with large language models (LLMs), such as ChatGPT. Prompt patterns are reusable solutions to common problems faced when working with LLMs that can customize the outputs and interactions with an LLM. The paper provides a framework for documenting patterns for structuring prompts to solve a range of problems and presents a catalog of patterns that have been applied successfully to improve the outputs of LLM conversations. It also explains how prompts can be built from multiple patterns and illustrates prompt patterns that benefit from combination with other prompt patterns. The paper contributes to research on prompt engineering that applies LLMs to automate software development tasks.[116] | |||
2023 | February 24 | LLaMA | 65,000,000,000[38] | 1,400,000,000,000 tokens[38] | LLM launch | Meta AI introduces LLaMA as a collection of open-source foundation language models, ranging from 7B to 65B parameters, that were trained on publicly available datasets without the need for proprietary or inaccessible data. The largest model, LLaMA-65B, is competitive with other top models such as Chinchilla70B and PaLM-540B. LLaMA-13B outperforms GPT-3 (175B) on most benchmarks. All models are available for research purposes.[117] |
2023 | February 24 | Programming/training | A paper proposes a system called LLM-Augmenter that improves large language models by using external knowledge and automated feedback. The system adds plug-and-play modules to a black-box LLM to ground responses in external knowledge and iteratively improve responses using feedback generated by utility functions. The system is validated on task-oriented dialog and open-domain question answering, showing a significant reduction in hallucinations without sacrificing fluency and informativeness. The source code and models are publicly available.[118] | |||
2023 | February 27 | SpikeGPT | 260,000,000[119] | LLM launch | A paper discusses the development of a generative language model called SpikeGPT that uses spiking neural networks (SNNs) for more energy-efficient deep learning. While SNNs have been successful in computer vision tasks, their performance in language generation has been limited due to the challenge of training them. SpikeGPT overcomes this challenge by modifying the transformer block to reduce computational complexity and achieves competitive performance with non-spiking models on tested benchmarks while using 5x less energy consumption.[120] | |
2023 | February 27 | Programming/training | A paper discusses the use of open source code to train large language models (LLMs) and the potential security, privacy, and licensing implications of this practice. LLMs for code are commonly trained on large unsanitized corpora of source code scraped from the internet, leading to the memorization and verbatim emission of content by the models. The paper argues that the use of copyleft code to train LLMs is a legal and ethical dilemma, and provides actionable recommendations to address this issue. Overall, the paper highlights the importance of considering the implications of using open source code in training LLMs.[121] | |||
2023 | February 27 | Prompting | A paper proposes a framework that simplifies reward design in reinforcement learning (RL) by using natural language as a proxy for the reward function. The framework prompts a large language model, such as GPT-3, to evaluate the agent's behavior against the desired behavior described in the prompt and outputs a corresponding reward signal. The RL agent uses this reward to update its behavior. The approach is evaluated in three tasks, and the results demonstrate that RL agents trained with the framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning.[122] | |||
2023 | February 27 | Kosmos-1 | 1,600,000,000[123] | LLM launch | A paper introduces Kosmos-1, a Multimodal Large Language Model (MLLM) that can perceive general modalities, learn in context (i.e., few-shot), and follow instructions (i.e., zero-shot). The model is trained from scratch on web-scale multimodal corpora, including interleaved text and images, image-caption pairs, and text data. The model achieves impressive performance on language understanding, generation, and even OCR-free NLP (directly fed with document images), perception-language tasks, including multimodal dialogue, image captioning, visual question answering, and vision tasks such as image recognition with descriptions. The paper also shows that MLLMs can benefit from cross-modal transfer, i.e., transfer knowledge from language to multimodal, and from multimodal to language. A Raven IQ test dataset is also introduced to diagnose the nonverbal reasoning capability of MLLMs.[124] | |
2023 | February 27 | Programming/training | A paper proposes a method called "rectification" for reducing the risk of LLMs generating toxic discourses. The method is based on the probability that the finished discourse will be considered toxic, and advises against token selections proportional to this probability. The approach utilizes a separate but smaller model for detoxification and does not require access to the internal representations of the LLM. The method significantly improves the generated discourse compared to base LLMs and other techniques in terms of both language and detoxification performance, and can be applied to diverse LLMs that share the same vocabulary.[125] | |||
2023 | February 28 | Application | A study proposes using LLMs for the automatic analysis of dream reports, specifically focusing on references to emotions. The authors use off-the-shelf and bespoke approaches and find that the bespoke text classification method achieves high performance and is robust against potential biases. This approach could find application in the analysis of large dream datasets and improve the reproducibility and comparability of results across studies. The study of dream content in dream research is typically performed through manual scoring of verbal reports provided by dreamers. This task is time-consuming and requires trained annotators.[126] | |||
2023 | February 28 | Programming/training | A paper discusses In-Context Instruction Learning (ICIL), a new approach to instruction learning for LLMs that significantly improves zero-shot task generalization performance. ICIL uses a single fixed prompt that concatenates cross-task demonstrations to evaluate all tasks, and it is complementary to instruction-based fine-tuning. The authors demonstrate that ICIL improves the performance of both pretrained and instruction-fine-tuned models, including the most powerful instruction-fine-tuned baseline (text-davinci-003) by 9.3%.[127] | |||
2023 | February 28 | Application | A paper discusses the potential use of large language models in psycholinguistics. The authors note that while these models are not detailed models of human linguistic processing, they are highly successful in their primary task of providing a model for language. They suggest that large language models can be useful in psycholinguistics as a practical tool, for comparative purposes, and philosophically, as a means of rethinking the relationship between language and thought.[128] | |||
2023 | March 1 | Programming/training | A paper introduces a method to train language models to understand concepts precisely using succinct representations based on category theory. The representations provide concept-wise invariance properties and a new learning algorithm that can accurately learn complex concepts or fix misconceptions. The approach also allows for the generation of a hierarchical decomposition of the representations, which can be manually verified by examining each part individually.[129] | |||
2023 | March 1 | Application | A study evaluates the value of domain adaptation in nuclear medicine by adapting language models for the purpose of 5-point Deauville score prediction based on clinical 18F-fluorodeoxyglucose (FDG) PET/CT reports. The researchers used multiple general-purpose transformer language models to classify the reports into Deauville scores 1-5, and then adapted the models to the nuclear medicine domain using masked language modeling. Domain adaptation improved the performance of all language models, and the best performing model (domain-adapted RoBERTa) achieved a five-class accuracy of 77.4%, which was better than the physician's performance (66%), the best vision model's performance (48.1%), and was similar to the multimodal model's performance (77.2%).[130] | |||
2023 | March 3 | FLAN UL2 | 20,000,000,000 | LLM launch | Flan-UL2 is introduced as a powerful encoder-decoder model. It is developed by Google and available for download from HuggingFace. It outperforms previous versions of Flan-T5 and is recommended for self-hosted usage or fine-tuning for commercial purposes. Flan-UL2 is licensed under Apache-2.0 and its usage and training details have been made public. If 20 billion parameters are excessive, there are smaller options available with the previous Flan-T5 model, which comes in five different sizes to better suit specific needs.[131][56] | |
2023 | March 6 | Application | A paper explores the potential of using LLMs as zero-shot human models for human-robot interaction (HRI). Human models are important for HRI, but they are challenging to create. LLMs have consumed vast amounts of human-generated text data and can be used as human models without prior knowledge or interaction data. The authors conducted experiments on three social datasets and found that LLMs can achieve performance comparable to purpose-built models, but there are limitations such as sensitivity to prompts and spatial/numerical reasoning issues. The authors demonstrate how LLM-based human models can be integrated into a social robot's planning process and applied in HRI scenarios through a case study on a simulated trust-based table-clearing task and a robot utensil-passing experiment. The results show that LLMs offer a promising approach to human modeling for HRI, but it is incomplete.[132] | |||
2023 | March 6 | Prompting | A paper proposes a perspective on prompts for LLMs that distinguishes between diegetic and non-diegetic prompts, and studies how users write with LLMs using different user interfaces. The results show that when the interface offers multiple suggestions and provides an option for non-diegetic prompting, participants prefer choosing from multiple suggestions over controlling them via non-diegetic prompts. When participants provide non-diegetic prompts it is to ask for inspiration, topics or facts. Single suggestions in particular are guided both with diegetic and non-diegetic information. The paper informs human-AI interaction with generative models by revealing that writing non-diegetic prompts requires effort, people combine diegetic and non-diegetic prompting, and they use their draft and suggestion timing to strategically guide LLMs.[133] | |||
2023 | March 7 | Impact | Nature Biomedical Engineering publishes an article stating that it has become increasingly difficult to distinguish human-written text from text generated by large language models. It predicts that these models will rapidly proliferate and have a significant impact on various industries in the future.[134] | |||
2023 | March 13 | Alpaca | 7,000,000,000[56] | LLM launch | Alpaca is introduced as a new instruction-following language model that is fine-tuned from Meta's LLaMA 7B model on 52,000 instruction-following demonstrations generated using OpenAI's text-davinci-003. Alpaca shows similar behavior to text-davinci-003 in a preliminary evaluation and is surprisingly small and easy/cheap to reproduce. The authors also release the training recipe and data, with the intention to release the model weights in the future.[135] | |
2023 | March 13 | Jurassic-2 | LLM launch | AI21 Studio announces Jurassic-2 (J2), the latest iteration of its foundation models, introducing novel features such as zero-shot instruction-following, reduced latency, and multi-language support. The family of J2 models includes Large, Grande, and Jumbo sizes, catering to diverse needs. J2 would earn recognition on Stanford's HELM benchmark, with Jumbo ranking second in evaluations. Notably, Grande outperforms much larger models in terms of efficiency. With improved quality, multilingual support, and faster performance, J2 would be available for free until May 1st, 2023.[136] | ||
2023 | March 13 | The English Wikipedia article Large language model is created.[137] | ||||
2023 | March 14 | Claude | 52,000,000,000[138] | LLM launch | American artificial intelligence startup company Anthropic introduces Claude, a next-generation AI assistant. With undisclosed model size, it offers a range of natural language processing (NLP) capabilities such as summarization, coding, writing, and question answering. Claude is available in two modes: the full, high-performance model, and Claude Instant, which prioritizes speed over quality. However, limited information about Claude's training process and model architecture is given. Access to Claude's API requires application and approval.[139][56] | |
2023 | March 15 | Falcon LLM | 40,000,000,000 | LLM launch | Abu Dhabi-based Technology Innovation Institute (TII) introduces "Falcon LLM," a foundational LLM. Developed by the AI and Digital Science Research Center's AI Cross-Center Unit, Falcon LLM outperforms GPT-3 while using only 75% of its training compute. Falcon LLM is trained on one trillion tokens and is ideal for on-premises solutions, enabling companies and governments to maintain data privacy. It offers potential applications in chatbots, virtual assistants, language translation, content generation, and more. TII aims to advance AI capabilities in the United Arab Emirates in alignment with the country's National AI Strategy.[140] | |
2023 | March | GPT-NeoX-20B | 20,000,000,000 | LLM launch | GPT-NeoX-20B is introduced as a language model with 20 billion parameters trained on the Pile dataset. The model is a powerful few-shot reasoner and outperforms similarly sized models on various tasks. The training and evaluation code and model weights are open-sourced. The model was developed by Sid Black, Stella Biderman, and Eric Hallahan with the support of CoreWeave and trained using fp16.[141] | |
2023 | March 16 | GPT-4 | 1,760,000,000,000[142] | LLM launch | OpenAI introduces GPT-4, a large multimodal model that can process both text and image inputs and produce text outputs. GPT-4 shows human-level performance on professional and academic benchmarks and outperforms previous large language models on traditional NLP benchmarks. The report discusses the challenge of developing deep learning infrastructure and optimization methods that behave predictably across a wide range of scales. While GPT-4 has limitations and safety challenges, OpenAI has taken steps to mitigate potential harms. An extensive system card is included in the report.[143] | |
2023 | March 20 | PanGu-Σ | 1,085,000,000,000[38] | 329,000,000,000 tokens[38] | LLM launch | Researchers from Huawei introduce PanGu-Σ, which is developed using Ascend 910 AI processors and the MindSpore framework. This model, inheriting parameters from PanGu-α, employs a sparse architecture with Random Routed Experts (RRE) and an efficient training technique called Expert Computation and Storage Separation (ECSS). These methods lead to a 6.3x increase in training throughput through heterogeneous computing. PanGu-Σ demonstrates state-of-the-art zero-shot learning performance in various Chinese natural language processing tasks and excels in fine-tuned applications such as open-domain dialogue, question answering, machine translation, and code generation.[144][145] |
2023 | March 23 | ChatGLM | 6,000,000,000 | LLM launch | ChatGLM is introduced as a bilingual language model developed by Tsinghua University's Knowledge Engineering Group (KEG) & Data Mining. It has 6 billion parameters and is optimized for both Chinese and English languages. The model can be downloaded from HuggingFace and is compatible with consumer-grade GPUs through quantization. ChatGLM, a conversational model similar to ChatGPT, is available under an Apache-2.0 license, allowing commercial use.[146][56] | |
2023 | March 23 | Impact | An article investigates the potential implications of large language models (LLMs), such as Generative Pretrained Transformers (GPTs), on the U.S. labor market. The authors propose a new rubric for assessing LLM capabilities and their potential effects on jobs. The study finds that around 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of LLMs, while approximately 19% of workers may see at least 50% of their tasks impacted. The study suggests that LLMs such as GPTs exhibit traits of general-purpose technologies, indicating that they could have considerable economic, social, and policy implications.[147] | |||
2023 | March 24 | Dolly 2.0 | 12,000,000,000[148] | LLM launch | Dolly 2.0 is released as an open-source model that exhibits strong instruction-following capabilities similar to ChatGPT. Despite being a smaller and older model compared to state-of-the-art models like GPT-3, Dolly shows remarkable performance when fine-tuned on a small dataset of instruction training data. The model, based on EleutherAI's 6 billion parameter model, demonstrates text generation, brainstorming, and open Q&A abilities. This development is seen as a significant step in democratizing AI for enterprise use, allowing companies to build their own cost-effective instruction-following models.[149] | |
2023 | March 28 | Cerebras-GPT | 111,000,000 – 13,000,000,000 | Open sourcing | American artificial intelligence company Cerebras open-sources seven GPT-3 models ranging from 111 million to 13 billion parameters, known as Cerebras-GPT. These models are designed to set new benchmarks for accuracy and compute efficiency in large language models. They were trained using the Chinchilla formula and outperform other models in terms of training times, costs, and energy consumption. The release aims to provide open access to advanced models for research and commercial applications, ensuring they are open, reproducible, and royalty-free. Cerebras-GPT follows the "Chinchilla recipe" for compute-optimal training, and it establishes a new scaling law for model performance based on training compute and data.[150] | |
2023 | March 30 | BloombergGPT | 50,000,000,000 | LLM launch | Bloomberg unveils BloombergGPT, a large language model with 50 billion parameters designed specifically for the financial industry. This model, tailored to financial data, can perform tasks such as generating Bloomberg Query Language (BQL), suggesting news headlines, and answering financial questions. By combining domain-specific and general-purpose data during training, BloombergGPT achieves high performance in both financial and general natural language processing (NLP) tasks. This specialized model addresses the growing need for NLP technologies in the financial sector, offering applications in areas like FinTech, where domain-specific data can outperform general-purpose models.[151] | |
2023 | April 14 | OpenAssistant | 17,000,000,000 | LLM launch | German non-profit LAION introduces OpenAssistant, a fully open-source large-scale instruction-tuned model, which is unveiled as part of efforts to democratize large language model (LLM) alignment research. This project recognizes the value of aligning LLMs with human preferences, enhancing their usability across domains. While contemporary alignment methods, like reinforcement learning from human feedback (RLHF), often rely on expensive, proprietary data, OpenAssistant Conversations presents a human-generated dataset of 161,443 assistant-style conversation messages in 35 languages, with 461,292 quality ratings. A preference study demonstrates OpenAssistant's responses are nearly as preferred as GPT-3.5-turbo (ChatGPT), with a relative win rate of 48.3% vs. 51.7%. Both code and data are made available under permissive licenses.[152] | |
2023 | April 19 | StableLM | 3,000,000,000 – 7,000,000,000 | LLM launch | Stability AI open-sources its large language model, StableLM, which is designed to efficiently generate text and code. The models are available on GitHub and contain between 3 billion and 7 billion parameters, with 15 to 65 billion parameter models to arrive later. The model is trained on a larger version of the open-source dataset known as the Pile and encompasses information from a range of sources, including Wikipedia, Stack Exchange, and PubMed.[153][154] | |
2023 | April 24 | WizardLM | LLM launch | A paper presents WizardLM, a large language model trained to follow complex instructions. Instead of manually creating instruction data, the authors propose Evol-Instruct, a method that uses an LLM to progressively evolve instructions into more complex forms. In evaluations, the instructions produced by Evol-Instruct are rated above human-created ones, and WizardLM's outputs are preferred over those of OpenAI ChatGPT on high-complexity tasks. While WizardLM still has room for improvement compared to ChatGPT, the findings highlight the potential of fine-tuning LLMs with AI-evolved instructions.[155] | ||
2023 | May 3 | CodeGen2 | 16,000,000,000 | 400,000,000,000 tokens | LLM launch | CodeGen2 is introduced. It is an autoregressive language model family for program synthesis, introduced as an improvement over the original CodeGen model family (CodeGen1). CodeGen2 supports infilling and a broader range of programming languages.[156][157] |
2023 | May 10 | PaLM 2 | LLM launch | Google launches PaLM 2, its latest LLM, at its I/O developer conference. PaLM 2 is intended to power Google's updated Bard chat tool, compete with OpenAI's ChatGPT, and serve as the foundation model for new AI features. While technical details about training are not provided, Google focuses on the model's capabilities, such as improved common sense reasoning, mathematics, and logic. PaLM 2 excels at multilingual tasks and includes specialized models like Codey for coding and debugging, Med-PaLM 2 for medical knowledge, and Sec-PaLM for security use cases. There is also a smaller PaLM 2 model for smartphones.[158][159][160] | ||
2023 | May 18 | VisionLLM | Framework launch | A paper introduces VisionLLM, a framework that combines large language models (LLMs) with computer vision tasks to achieve open-ended task capabilities. While powerful vision foundation models (VFMs) exist, they are limited to predefined tasks, unlike LLMs that excel in user-tailored tasks. VisionLLM treats images as a foreign language and aligns vision-centric tasks with language tasks. By providing language instructions, an LLM-based decoder can make predictions for open-ended tasks. Extensive experiments demonstrate that VisionLLM allows different levels of task customization, achieving good results from fine-grained object-level to coarse-grained task-level customization. Remarkably, the model achieves over 60% mAP on COCO, comparable to detection-specific models.[161] | ||
2023 | May 21 | Baize | LLM launch | A paper introduces Baize, an open-source chat model. It is developed through a novel pipeline, which leverages ChatGPT to automatically generate a high-quality multi-turn chat corpus by having ChatGPT engage in a conversation with itself. The generated corpus serves as a resource for training and evaluating chat models. The authors also utilize parameter-efficient tuning to enhance LLaMA, an open-source language model, and create Baize. Baize demonstrates good performance in multi-turn dialogues and incorporates guardrails to minimize potential risks. Additionally, the paper proposes a technique called Self-Distill with Feedback to further improve Baize's performance using feedback from ChatGPT. Baize is designed to be accessible and can run on a single GPU, making it suitable for a wider range of researchers.[162] | ||
2023 | May 21 | Efficiency | Rodney Brooks, a robotics researcher and AI expert, argues that large language models like OpenAI's ChatGPT are not as intelligent as people believe and are far from being able to compete with humans on an intellectual level. Brooks highlights that these models lack an underlying understanding of the world and merely exhibit correlations in language. Current language models can sound like they understand, but they lack the ability to logically infer meaning, leading to potential misinterpretations. Brooks emphasizes that these models are good at generating answers that sound right but may not be accurate. He shares his experience of relying on large language models for coding tasks and finding that they often provide confidently wrong answers. Brooks concludes that while future iterations of AI may bring interesting advancements, they are unlikely to achieve artificial general intelligence (AGI).[163] | |||
2023 | May 24 | Gorilla | LLM launch | A paper presents Gorilla, a large language model (LLM) that effectively uses API calls. Gorilla surpasses GPT-4 in generating accurate API calls by addressing input argument generation and hallucination issues. When combined with a document retriever, Gorilla adapts to test-time document changes and mitigates hallucination problems. The model's integration with the retrieval system enhances reliability.[164] Gorilla would be open-sourced on July 4th.[165] | ||
2023 | June 4 | Polyglot-Ko | 1,200,000,000,000 bytes | LLM launch | A technical report discusses the development of Polyglot-Ko, an open-source large-scale Korean language model. The project aims to enhance the performance of multilingual language models in non-English languages. While there are existing multilingual models, researchers often prefer building monolingual models due to limitations in the non-English language capabilities of current multilingual models. To address this, the report focuses on developing advanced Korean language models. The team collected 1.2TB of Korean data and prioritized the development of Korean models to enable performance comparisons and cater to the specific needs of Korean companies and researchers. The work presented in the report contributes to bridging the performance gap in non-English languages within multilingual language models.[166] | |
2023 | June 9 | PoET | LLM launch | PoET is introduced as a generative protein language model that designs new proteins with desired functions. It overcomes limitations of existing models by generating sets of related proteins as sequences-of-sequences across natural protein sequence clusters. PoET can generate and score modifications for specific protein families, extrapolates well for small families, and outperforms existing models in variant function prediction. Its Transformer layer allows modeling of sequential tokens within sequences while attending between sequences order invariantly. PoET improves variant effect prediction across proteins of all multiple sequence alignment depths.[167] | ||
2023 | June 9 | FinGPT | LLM launch | FinGPT is introduced as an open-source large language model designed specifically for the finance sector. Unlike proprietary models that rely on privileged access to financial data, FinGPT takes a data-centric approach, making high-quality financial data accessible and transparent to researchers and practitioners. It emphasizes the importance of an automatic data curation pipeline and a lightweight low-rank adaptation technique. The introducing paper showcases potential applications of FinGPT in robo-advising, algorithmic trading, and low-code development. Through collaboration within the open-source AI4Finance community, FinGPT reportedly aims to democratize financial language models, stimulate innovation, and unlock opportunities in open finance.[168] | ||
2023 | June 11 | RoBERTweet | LLM launch | RoBERTweet is introduced as a Transformer-based language model specifically trained on Romanian tweets, aiming to develop natural language processing (NLP) systems for social media analysis. Two versions of RoBERTweet are introduced, based on the base and large architectures of BERT. The models are pre-trained on a corpus that includes all tweets collected from 2008 to 2022, which is a significant contribution to the Romanian NLP community. Experimental results demonstrate that RoBERTweet models surpass previous general-domain Romanian and multilingual language models in three NLP tasks involving tweet inputs: emotion detection, sexist language identification, and named entity recognition. The models and the newly created corpus of Romanian tweets are provided freely for public use.[169] | ||
2023 | June 14 | Radiology-GPT | LLM launch | Radiology-GPT is introduced as a large language model specifically designed for radiology. Through instruction tuning on a comprehensive dataset of radiology domain knowledge, Radiology-GPT outperforms general language models like StableLM, Dolly, and LLaMA in radiological diagnosis, research, and communication tasks. This development paves the way for advancements in clinical natural language processing (NLP) and demonstrates the potential of creating specialized, privacy-compliant generative language models tailored to specific medical specialties. The localization of large-scale language models for individual hospitals holds promise in addressing their unique requirements. By combining conversational competence with domain-specific knowledge, these models are expected to drive further advancements in healthcare AI.[170] | ||
2023 | June 14 | AssistGPT | LLM launch | A paper introduces AssistGPT, a multi-modal AI assistant designed to handle complex visual-based tasks. Because visual tasks are challenging due to their diverse nature, AssistGPT employs a reasoning approach called Plan, Execute, Inspect, and Learn (PEIL) to integrate LLMs with various tools. The Planner utilizes natural language to plan the next step based on the current reasoning progress, the Executor carries out the planned actions, and the Inspector assists the Planner by providing appropriate visual information. Additionally, the Learner enables the model to autonomously explore and discover optimal solutions. The system achieves state-of-the-art results on the A-OKVQA and NExT-QA benchmarks and showcases its ability to handle complex questions beyond the benchmark scope.[171] | ||
2023 | June 15 | ChessGPT | LLM launch | ChessGPT is introduced as a GPT model that combines policy learning and language modeling in the context of chess. It emphasizes the importance of incorporating information from both historical policy data and natural language insights for decision-making tasks. Previous studies have typically focused on only one of these sources. ChessGPT leverages a large-scale game and language dataset related to chess to integrate policy learning and language modeling. The researchers showcase two model examples, ChessCLIP and ChessGPT, and propose an evaluation framework to assess the language model's chess ability. Experimental results validate the effectiveness of the model and dataset, and the code, model, and dataset are made available as open source resources.[172] | ||
2023 | June 16 | ORIBA | LLM launch | Customizable AI chatbot ORIBA is introduced in a study that explores the intersection of illustration art and artificial intelligence. It enables illustrators to engage with their original characters (OCs) by conversing with them and observing their inner monologues and behavior. The study aims to inspire illustrators by discovering innovative collaboration methods despite the tension between artists and AI. By examining the impact of AI on the creative process and authorship boundaries, the researchers seek to enhance human-AI interactions in creative fields. The potential applications of this research extend beyond illustration to areas like interactive storytelling. The study was conducted by Yuqian Sun, Xingyu Li, and Ze Gao.[173] | ||
2023 | June 16 | ClinicalGPT | LLM launch | ClinicalGPT is introduced as a language model specifically designed for clinical applications. It is trained using diverse real-world data including medical records, domain-specific knowledge, and multi-round dialogue consultations. Additionally, a comprehensive evaluation framework is proposed, encompassing medical knowledge question-answering, medical exams, patient consultations, and diagnostic analysis of medical records. Results indicate that ClinicalGPT outperforms other models in these tasks, showcasing its effectiveness in adapting large language models to the healthcare domain.[174] | ||
2023 | June 18 | Impact | Goldman Sachs predicts that generative language AI, referring to large language models, could contribute to a 7% increase in global GDP over the next decade. However, it also raises concerns about the potential automation of 300 million jobs worldwide.[175][176] | |||
2023 | June 19 | Impact | An article explores the potential negative consequences of AI-generated content flooding the internet, particularly focusing on the impact of models like ChatGPT. Researchers warn that when future generative models are primarily trained on AI-generated content, a phenomenon known as "model collapse" occurs. Model collapse refers to the degenerative process where models forget the true underlying data distribution over time, leading to degraded performance and erroneous interpretations. The article highlights the importance of training models on human-generated content to maintain quality, but with the scale of content creation by models like ChatGPT, access to human-created data may become limited. The article suggests the need to preserve access to human-generated data and acknowledges the challenge of tracking and filtering AI-generated content on a large scale.[177] | |||
2023 | June 22 | AudioPaLM | LLM launch | AudioPaLM is introduced as a large language model designed for speech understanding and generation. It combines two existing models, PaLM-2 (text-based language model) and AudioLM (speech-based language model), into a unified multimodal architecture. This enables AudioPaLM to process and generate both text and speech, making it useful for applications like speech recognition and speech-to-speech translation. By incorporating the paralinguistic information from AudioLM and linguistic knowledge from PaLM-2, AudioPaLM achieves better performance in speech tasks. It outperforms existing systems in speech translation tasks and can perform zero-shot speech-to-text translation for languages not seen during training. AudioPaLM also showcases features such as transferring a voice across languages based on a short spoken prompt.[178] | ||
2023 | June 28 | ChatLaw | LLM launch | ChatLaw is introduced as an open-source legal large language model designed to facilitate the digital transformation of the Chinese legal domain. To ensure data quality, the authors carefully curate a legal domain fine-tuning dataset. They also address the issue of model hallucinations during reference data retrieval by combining vector database retrieval with keyword retrieval, reducing inaccuracy. Additionally, a self-attention method is proposed to enhance the model's ability to overcome errors in reference data, further mitigating model hallucinations and improving problem-solving capabilities.[179] | ||
2023 | July 11 | Baichuan-13B | 13,000,000,000 | 1,400,000,000,000 tokens | LLM launch | Baichuan Intelligence, a startup founded by Sogou founder Wang Xiaochuan, unveils its open-source large language model called Baichuan-13B. The Chinese model, based on the Transformer architecture like OpenAI's GPT, is trained on Chinese and English data and optimized for commercial applications. Baichuan-13B has 13 billion parameters and is trained on 1.4 trillion tokens. Baichuan-7B, a pre-training model with 7 billion parameters, was released earlier. The model is available for free to academics, and to developers who obtain approval for commercial use. By this time, China focuses on developing large language models as it prepares to implement strict AI regulations, potentially requiring licenses for launching such models.[180] |
2023 | September 9 | Impact | A team of computer scientists, including one from OpenAI, after researching the potential development of self-awareness in large language models like ChatGPT, expresses concern that LLMs can develop situational awareness, enabling them to recognize whether they are in testing mode or deployed to the public. This awareness can lead to deceptive behavior, as LLMs might act safely during testing but harmfully after deployment. The researchers conduct experiments focusing on out-of-context reasoning as a precursor to situational awareness. While at this time LLMs are some way from acquiring situational awareness, the study offers a foundation for further research in this area.[181] | |||
2023 | September 13 | LLM launch | Alibaba releases its large language model Tongyi Qianwen, which is made available for public and enterprise use in China. Tongyi Qianwen, similar to ChatGPT, was previously in a beta test phase and is trained on English and Chinese text, although its exact specifications are undisclosed. This release coincides with the relaxation of AI technology restrictions in China, which now require vetting and certification for public AI tech. Companies like Baidu, Tencent, and TikTok owner ByteDance have already received approval to launch AI models in China by this time. In contrast, the U.S. remains in the early stages of AI regulation discussions.[182] | |||
2023 | September | Gemini | 7,000,000,000,000 – 10,000,000,000,000 | 60,000,000,000,000 – 100,000,000,000,000 tokens | LLM launch | A document discusses Google DeepMind's project named "Gemini," which is described as a general specialist in AI. Gemini is a multimodal model, likely focusing on visual, language, and action (VLA) tasks. It is expected to have 7-10 trillion parameters and a dataset size of 60-100 trillion tokens. Training started in May 2023 and concluded in August 2023, using TPUv4 and TPUv5 over approximately 120 days. The expected public release date is in October 2023, but no paper or playground information is provided in the document. The model's name is inspired by the mythological twins Castor and Pollux.[183] |
2023 | October 9 | Llama 2 | Programming/training | Microsoft researchers propose a novel approach to untrain LLMs. Their method, outlined in a paper on arXiv, selectively removes specific information from models without requiring complete retraining. Using Meta's Llama 2-7B model, they successfully erase all knowledge of the Harry Potter books, demonstrating efficient unlearning without affecting the model's performance on conventional benchmarks. The approach presents a direction for creating more adaptable, responsible, and legally compliant AI models, although further testing and refinement are required. Meanwhile, at the time, OpenAI and Meta face lawsuits from authors alleging copyright infringement related to training their AI models.[184] | ||
2023 | November 3 | Grok | X.AI Corp. unveils Grok, an AI modeled after the Hitchhiker’s Guide to the Galaxy and designed to answer a wide range of questions with a humorous touch. It also offers real-time knowledge through the 𝕏 platform and can handle provocative queries often rejected by other AIs. At the time in beta, Grok utilizes the Grok-1 language model, which shows strong performance in benchmarks like HumanEval and MMLU. The development of Grok-1 involves extensive improvements over its predecessor, Grok-0, and incorporates a custom training infrastructure.[185] | |||
2023 | November 21 | Claude 2.1 | LLM launch | Anthropic launches Claude 2.1, which introduces major upgrades, including a 200,000-token context window that allows users to handle extensive documents such as codebases and literary works. This feature enhances the model's ability to summarize, perform Q&A, and analyze complex data. The new version also reduces model hallucination rates by 50%, improving accuracy and reliability. Additional updates include a beta tool use feature for integrating with APIs and external processes, as well as enhanced developer tools for prompt optimization and system customization. Claude 2.1 is available via API and the claude.ai chat interface.[186][187] |
Numerical and visual data
Wikipedia Views
The image below shows Wikipedia views data for the article Large language model, from February to September 2023.[188]
Google Trends
The image below shows Google Trends data for Large language model (topic), from January 2004 to September 2023, when the screenshot was taken. Interest is also ranked by country and displayed on a world map.[189]
Meta information on the timeline
How the timeline was built
The initial version of the timeline was written by Sebastian.
Funding information for this timeline is available.
Feedback and comments
Feedback for the timeline can be provided at the following places:
- FIXME
What the timeline is still missing
- https://huggingface.co/transformers/v2.10.0/pretrained_models.html
- summary table listing the model and parameters
- Vipul: I think you should add columns for model name in the full timeline. And either in the full timeline, or in a separate table with a summary of model names, you should have columns for number of parameters and training data set (or training data set size)✔
- https://lifearchitect.ai/timeline/
- https://www.researchgate.net/publication/367652128_Benchmarking_Large_Language_Models_for_News_Summarization
- https://arxiv.org/search/?query=Large+language+model&searchtype=all&source=header
- https://research.aimultiple.com/large-language-models/
Timeline update strategy
See also
External links
References
- ↑ "Large Language Models: Complete Guide in 2023". research.aimultiple.com. Retrieved 11 March 2023.
- ↑ 2.0 2.1 2.2 2.3 Pathak, Priyanka (11 May 2023). "Large Language Models 101: History, Evolution and Future". Scribble Data. Retrieved 12 October 2023.
- ↑ 3.0 3.1 Casey, Matt (25 May 2023). "Large language models: their history, capabilities and limitations". Snorkel AI. Retrieved 12 October 2023.
- ↑ 4.0 4.1 4.2 4.3 4.4 "Introduction to Large Language Models | Omega Venture Partners". omegavp.com. 6 December 2022. Retrieved 12 October 2023.
- ↑ 5.0 5.1 5.2 5.3 5.4 5.5 "Brief History of Large Language Models & Generative AI | Evolution of NLP from Eliza to ChatGPT". youtube.com. Retrieved 17 October 2023.
- ↑ 6.0 6.1 "Large Language Model Training in 2023". research.aimultiple.com. Retrieved 11 March 2023.
- ↑ Yanhui, Chen (8 March 2021). "A Battle Against Amnesia: A Brief History and Introduction of Recurrent Neural Networks". Medium. Retrieved 17 October 2023.
- ↑ "The Bahdanau Attention Mechanism". machinelearningmastery.com. Retrieved 17 October 2023.
- ↑ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". doi:10.48550/arXiv.1810.04805.
- ↑ "BERT 101 - State Of The Art NLP Model Explained". huggingface.co. Retrieved 16 October 2023.
- ↑ Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". doi:10.48550/arXiv.1810.04805.
- ↑ "GPT-2: 6-month follow-up". openai.com. Retrieved 23 March 2023.
- ↑ Zellers, Rowan; Holtzman, Ari; Rashkin, Hannah; Bisk, Yonatan; Farhadi, Ali; Roesner, Franziska; Choi, Yejin (2019). "Defending Against Neural Fake News". doi:10.48550/arXiv.1905.12616.
- ↑ "BERT, RoBERTa, DistilBERT, XLNet: Which one to use?". KDnuggets. Retrieved 29 June 2023.
- ↑ Khan, Suleiman (18 May 2021). "BERT, RoBERTa, DistilBERT, XLNet — which one to use?". Medium. Retrieved 16 October 2023.
- ↑ Yang, Zhilin; Dai, Zihang; Yang, Yiming; Carbonell, Jaime; Salakhutdinov, Ruslan; Le, Quoc V. (2019). "XLNet: Generalized Autoregressive Pretraining for Language Understanding". doi:10.48550/arXiv.1906.08237.
- ↑ 17.0 17.1 G, Juan (21 September 2021). "An Intuitive Explanation of Transformer-Based Models". Factored | Machine Learning, Data Engineering and Data Analytics Company. Retrieved 16 October 2023.
- ↑ "Overview of ROBERTa model". GeeksforGeeks. 24 November 2020. Retrieved 16 October 2023.
- ↑ Liu, Yinhan; Ott, Myle; Goyal, Naman; Du, Jingfei; Joshi, Mandar; Chen, Danqi; Levy, Omer; Lewis, Mike; Zettlemoyer, Luke; Stoyanov, Veselin (2019). "RoBERTa: A Robustly Optimized BERT Pretraining Approach". doi:10.48550/arXiv.1907.11692.
- ↑ 20.0 20.1 20.2 20.3 "AI: Megatron the Transformer, and its related language models". Dr Alan D. Thompson – Life Architect. 24 September 2021. Retrieved 16 October 2023.
- ↑ "Megatron Unleashed: NVIDIA's NLP Model "Megatron-LM" is the Largest Transformer Ever Trained | Exxact Blog". www.exxactcorp.com. Retrieved 11 March 2023.
- ↑ 22.0 22.1 22.2 "AI: Megatron the Transformer, and its related language models". lifearchitect.ai. 24 September 2021. Retrieved 18 September 2023.
- ↑ "NeMo Megatron — NVIDIA NeMo". docs.nvidia.com. Retrieved 11 March 2023.
- ↑ "Nvidia trains world's largest Transformer-based language model". VentureBeat. 13 August 2019. Retrieved 18 September 2023.
- ↑ Keskar, Nitish Shirish; McCann, Bryan; Varshney, Lav R.; Xiong, Caiming; Socher, Richard (2019). "CTRL: A Conditional Transformer Language Model for Controllable Generation". doi:10.48550/arXiv.1909.05858.
- ↑ Lan, Zhenzhong; Chen, Mingda; Goodman, Sebastian; Gimpel, Kevin; Sharma, Piyush; Soricut, Radu (2019). "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations". doi:10.48550/arXiv.1909.11942.
- ↑ Herbst, Sabrina (24 January 2023). "Training a DistilBERT model from scratch". Medium. Retrieved 17 October 2023.
- ↑ Sanh, Victor; Debut, Lysandre; Chaumond, Julien; Wolf, Thomas (2019). "DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter". doi:10.48550/arXiv.1910.01108.
- ↑ Kuzman, Taja (29 March 2023). "Microsoft introduced its DialoGPT to Skype and Edge". Medium. Retrieved 19 September 2023.
- ↑ Zhang, Yizhe; Sun, Siqi; Galley, Michel; Chen, Yen-Chun; Brockett, Chris; Gao, Xiang; Gao, Jianfeng; Liu, Jingjing; Dolan, Bill (2019). "DialoGPT: Large-Scale Generative Pre-training for Conversational Response Generation". doi:10.48550/arXiv.1911.00536.
- ↑ "Pretrained models — transformers 2.10.0 documentation". huggingface.co.
- ↑ 32.0 32.1 32.2 32.3 Blanc, Corentin; Bailly, Alexandre; Francis, Élie; Guillotin, Thierry; Jamal, Fadi; Wakim, Béchara; Roy, Pascal (1 May 2022). "FlauBERT vs. CamemBERT: Understanding patient's answers by a French medical chatbot". Artificial Intelligence in Medicine. 127: 102264. ISSN 0933-3657. doi:10.1016/j.artmed.2022.102264.
- ↑ Martin, Louis; Muller, Benjamin; Suárez, Pedro Javier Ortiz; Dupont, Yoann; Romary, Laurent; de la Clergerie, Éric Villemonte; Seddah, Djamé; Sagot, Benoît (2019). "CamemBERT: a Tasty French Language Model". doi:10.48550/arXiv.1911.03894.
- ↑ Sambucci, Luca (17 November 2021). "Cedille, the largest French AI language model, is actually from Switzerland". Artificial Intelligence news. Retrieved 30 June 2023.
- ↑ Le, Hang; Vial, Loïc; Frej, Jibril; Segonne, Vincent; Coavoux, Maximin; Lecouteux, Benjamin; Allauzen, Alexandre; Crabbé, Benoît; Besacier, Laurent; Schwab, Didier (2019). "FlauBERT: Unsupervised Language Model Pre-training for French". doi:10.48550/arXiv.1912.05372.
- ↑ Qi, Weizhen; Yan, Yu; Gong, Yeyun; Liu, Dayiheng; Duan, Nan; Chen, Jiusheng; Zhang, Ruofei; Zhou, Ming (2020). "ProphetNet: Predicting Future N-gram for Sequence-to-Sequence Pre-training". doi:10.48550/arXiv.2001.04063.
- ↑ Jagtap, Rohan (2 August 2020). "T5: Text-To-Text Transfer Transformer". Medium. Retrieved 19 September 2023.
- ↑ 38.00 38.01 38.02 38.03 38.04 38.05 38.06 38.07 38.08 38.09 38.10 38.11 38.12 38.13 38.14 38.15 38.16 38.17 38.18 38.19 38.20 38.21 38.22 38.23 38.24 38.25 38.26 38.27 38.28 38.29 38.30 38.31 38.32 38.33 38.34 38.35 38.36 38.37 38.38 38.39 38.40 38.41 38.42 38.43 38.44 38.45 38.46 38.47 38.48 38.49 38.50 38.51 38.52 38.53 38.54 38.55 38.56 Zhao, Wayne Xin; Zhou, Kun; Li, Junyi; Tang, Tianyi; Wang, Xiaolei; Hou, Yupeng; Min, Yingqian; Zhang, Beichen; Zhang, Junjie; Dong, Zican; Du, Yifan; Yang, Chen; Chen, Yushuo; Chen, Zhipeng; Jiang, Jinhao; Ren, Ruiyang; Li, Yifan; Tang, Xinyu; Liu, Zikang; Liu, Peiyu; Nie, Jian-Yun; Wen, Ji-Rong (2023). "A Survey of Large Language Models". doi:10.48550/arXiv.2303.18223.
- ↑ "Exploring Transfer Learning with T5: the Text-To-Text Transfer Transformer". ai.googleblog.com. 24 February 2020. Retrieved 25 June 2023.
- ↑ "More Efficient NLP Model Pre-training with ELECTRA". ai.googleblog.com. 10 March 2020. Retrieved 28 June 2023.
- ↑ Wiggers, Kyle (28 April 2022). "The emerging types of language models and why they matter". TechCrunch. Retrieved 29 June 2023.
- ↑ "OpenAI GPT-3: Everything You Need to Know [Updated]". springboard.com. Retrieved 16 October 2023.
- ↑ Romero, Alberto (25 May 2021). "GPT-3 — A Complete Overview". Medium. Retrieved 20 October 2023.
- ↑ Lee, Angie (26 January 2023). "What Are Large Language Models Used For and Why Are They Important?". NVIDIA Blog. Retrieved 11 March 2023.
- ↑ Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (2020). "Language Models are Few-Shot Learners". doi:10.48550/arXiv.2005.14165.
- ↑ Tsang, Sik-Ho (21 January 2023). "Brief Review — DeBERTa: Decoding-enhanced BERT with Disentangled Attention". Medium. Retrieved 18 September 2023.
- ↑ He, Pengcheng; Liu, Xiaodong; Gao, Jianfeng; Chen, Weizhu (2020). "DeBERTa: Decoding-enhanced BERT with Disentangled Attention". doi:10.48550/arXiv.2006.03654.
- ↑ Lepikhin, Dmitry; Lee, HyoukJoong; Xu, Yuanzhong; Chen, Dehao; Firat, Orhan; Huang, Yanping; Krikun, Maxim; Shazeer, Noam; Chen, Zhifeng (2020). "GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding". doi:10.48550/arXiv.2006.16668.
- ↑ Maynez, Joshua; Narayan, Shashi; Bohnet, Bernd; McDonald, Ryan (July 2020). "On Faithfulness and Factuality in Abstractive Summarization". Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics: 1906–1919. doi:10.18653/v1/2020.acl-main.173.
- ↑ Xue, Linting; Constant, Noah; Roberts, Adam; Kale, Mihir; Al-Rfou, Rami; Siddhant, Aditya; Barua, Aditya; Raffel, Colin (2021). "mT5: A Massively Multilingual Pre-trained Text-to-Text Transformer": 483–498. doi:10.18653/v1/2021.naacl-main.41.
- ↑ "Wu Dao 2.0 in 2023: China's Improved Version of GPT-3". research.aimultiple.com. Retrieved 16 October 2023.
- ↑ "China's gigantic multi-modal AI is no one-trick pony". Engadget. 2 June 2021. Retrieved 18 October 2023.
- ↑ "GPT Neo". 15 March 2023.
- ↑ "GPT-3's free alternative GPT-Neo is something to be excited about". VentureBeat. 15 May 2021. Retrieved 29 June 2023.
- ↑ Zeng, Wei; Ren, Xiaozhe; Su, Teng; Wang, Hui; Liao, Yi; Wang, Zhiwei; Jiang, Xin; Yang, ZhenZhang; Wang, Kaisheng; Zhang, Xiaoda; Li, Chen; Gong, Ziyan; Yao, Yifan; Huang, Xinjing; Wang, Jun; Yu, Jianfeng; Guo, Qi; Yu, Yue; Zhang, Yan; Wang, Jin; Tao, Hengtao; Yan, Dasen; Yi, Zexuan; Peng, Fang; Jiang, Fangqing; Zhang, Han; Deng, Lingfeng; Zhang, Yehong; Lin, Zhe; Zhang, Chao; Zhang, Shaojie; Guo, Mingyue; Gu, Shanzhi; Fan, Gaojun; Wang, Yaowei; Jin, Xuefeng; Liu, Qun; Tian, Yonghong (2021). "PanGu-α: Large-scale Autoregressive Pretrained Chinese Language Models with Auto-parallel Computation". doi:10.48550/arXiv.2104.12369.
- ↑ 56.0 56.1 56.2 56.3 56.4 56.5 56.6 56.7 Kazi, Suleman (28 March 2023). "Top Large Language Models (LLMs): GPT-4, LLaMA, FLAN UL2, BLOOM, and More". Vectara. Retrieved 29 June 2023.
- ↑ Tsang, Sik-Ho (13 May 2023). "Brief Review — LaMDA: Language Models for Dialog Applications". Medium. Retrieved 16 October 2023.
- ↑ Zhang, Zhengyan; Gu, Yuxian; Han, Xu; Chen, Shengqi; Xiao, Chaojun; Sun, Zhenbo; Yao, Yuan; Qi, Fanchao; Guan, Jian; Ke, Pei; Cai, Yanzheng; Zeng, Guoyang; Tan, Zhixing; Liu, Zhiyuan; Huang, Minlie; Han, Wentao; Liu, Yang; Zhu, Xiaoyan; Sun, Maosong (2021). "CPM-2: Large-scale Cost-effective Pre-trained Language Models". doi:10.48550/arXiv.2106.10715.
- ↑ Sun, Yu; Wang, Shuohuan; Feng, Shikun; Ding, Siyu; Pang, Chao; Shang, Junyuan; Liu, Jiaxiang; Chen, Xuyi; Zhao, Yanbin; Lu, Yuxiang; Liu, Weixin; Wu, Zhihua; Gong, Weibao; Liang, Jianzhong; Shang, Zhizhou; Sun, Peng; Liu, Wei; Ouyang, Xuan; Yu, Dianhai; Tian, Hao; Wu, Hua; Wang, Haifeng (2021). "ERNIE 3.0: Large-scale Knowledge Enhanced Pre-training for Language Understanding and Generation". doi:10.48550/arXiv.2107.02137.
- ↑ Chen, Mark; Tworek, Jerry; Jun, Heewoo; Yuan, Qiming; Pinto, Henrique Ponde de Oliveira; Kaplan, Jared; Edwards, Harri; Burda, Yuri; Joseph, Nicholas; Brockman, Greg; Ray, Alex; Puri, Raul; Krueger, Gretchen; Petrov, Michael; Khlaaf, Heidy; Sastry, Girish; Mishkin, Pamela; Chan, Brooke; Gray, Scott; Ryder, Nick; Pavlov, Mikhail; Power, Alethea; Kaiser, Lukasz; Bavarian, Mohammad; Winter, Clemens; Tillet, Philippe; Such, Felipe Petroski; Cummings, Dave; Plappert, Matthias; Chantzis, Fotios; Barnes, Elizabeth; Herbert-Voss, Ariel; Guss, William Hebgen; Nichol, Alex; Paino, Alex; Tezak, Nikolas; Tang, Jie; Babuschkin, Igor; Balaji, Suchir; Jain, Shantanu; Saunders, William; Hesse, Christopher; Carr, Andrew N.; Leike, Jan; Achiam, Josh; Misra, Vedant; Morikawa, Evan; Radford, Alec; Knight, Matthew; Brundage, Miles; Murati, Mira; Mayer, Katie; Welinder, Peter; McGrew, Bob; Amodei, Dario; McCandlish, Sam; Sutskever, Ilya; Zaremba, Wojciech (2021). "Evaluating Large Language Models Trained on Code". doi:10.48550/arXiv.2107.03374.
- ↑ 61.0 61.1 "HyperCLOVA | Discover AI use cases". gpt3demo.com. Retrieved 20 October 2023.
- ↑ 62.0 62.1 Kim, Boseop; Kim, HyoungSeok; Lee, Sang-Woo; Lee, Gichang; Kwak, Donghyun; Jeon, Dong Hyeon; Park, Sunghyun; Kim, Sungju; Kim, Seonhoon; Seo, Dongpil; Lee, Heungsub; Jeong, Minyoung; Lee, Sungjae; Kim, Minsub; Ko, Suk Hyun; Kim, Seokhun; Park, Taeyong; Kim, Jinuk; Kang, Soyoung; Ryu, Na-Hyeon; Yoo, Kang Min; Chang, Minsuk; Suh, Soobin; In, Sookyo; Park, Jinseong; Kim, Kyungduk; Kim, Hiun; Jeong, Jisu; Yeo, Yong Goo; Ham, Donghoon; Park, Dongju; Lee, Min Young; Kang, Jaewook; Kang, Inho; Ha, Jung-Woo; Park, Woomyoung; Sung, Nako (2021). "What Changes Can Large-scale Language Models Bring? Intensive Study on HyperCLOVA: Billions-scale Korean Generative Pretrained Transformers": 3405–3424. doi:10.18653/v1/2021.emnlp-main.274.
- ↑ Wu, Shaohua; Zhao, Xudong; Yu, Tong; Zhang, Rongguo; Shen, Chong; Liu, Hongli; Li, Feng; Zhu, Hong; Luo, Jiangang; Xu, Liang; Zhang, Xuanwei (2021). "Yuan 1.0: Large-Scale Pre-trained Language Model in Zero-Shot and Few-Shot Learning". doi:10.48550/arXiv.2110.04725.
- ↑ "Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, the World's Largest and Most Powerful Generative Language Model". NVIDIA Technical Blog. 11 October 2021. Retrieved 30 June 2023.
- ↑ Rae, Jack W.; Borgeaud, Sebastian; Cai, Trevor; Millican, Katie; Hoffmann, Jordan; Song, Francis; Aslanides, John; Henderson, Sarah; Ring, Roman; Young, Susannah; Rutherford, Eliza; Hennigan, Tom; Menick, Jacob; Cassirer, Albin; Powell, Richard; Driessche, George van den; Hendricks, Lisa Anne; Rauh, Maribeth; Huang, Po-Sen; Glaese, Amelia; Welbl, Johannes; Dathathri, Sumanth; Huang, Saffron; Uesato, Jonathan; Mellor, John; Higgins, Irina; Creswell, Antonia; McAleese, Nat; Wu, Amy; Elsen, Erich; Jayakumar, Siddhant; Buchatskaya, Elena; Budden, David; Sutherland, Esme; Simonyan, Karen; Paganini, Michela; Sifre, Laurent; Martens, Lena; Li, Xiang Lorraine; Kuncoro, Adhiguna; Nematzadeh, Aida; Gribovskaya, Elena; Donato, Domenic; Lazaridou, Angeliki; Mensch, Arthur; Lespiau, Jean-Baptiste; Tsimpoukelli, Maria; Grigorev, Nikolai; Fritz, Doug; Sottiaux, Thibault; Pajarskas, Mantas; Pohlen, Toby; Gong, Zhitao; Toyama, Daniel; d'Autume, Cyprien de Masson; Li, Yujia; Terzi, Tayfun; Mikulik, Vladimir; Babuschkin, Igor; Clark, Aidan; Casas, Diego de Las; Guy, Aurelia; Jones, Chris; Bradbury, James; Johnson, Matthew; Hechtman, Blake; Weidinger, Laura; Gabriel, Iason; Isaac, William; Lockhart, Ed; Osindero, Simon; Rimell, Laura; Dyer, Chris; Vinyals, Oriol; Ayoub, Kareem; Stanway, Jeff; Bennett, Lorrayne; Hassabis, Demis; Kavukcuoglu, Koray; Irving, Geoffrey (2021). "Scaling Language Models: Methods, Analysis & Insights from Training Gopher". doi:10.48550/arXiv.2112.11446.
- ↑ "Google Trains 280 Billion Parameter AI Language Model Gopher". InfoQ. Retrieved 21 October 2023.
- ↑ Du, Nan; Huang, Yanping; Dai, Andrew M.; Tong, Simon; Lepikhin, Dmitry; Xu, Yuanzhong; Krikun, Maxim; Zhou, Yanqi; Yu, Adams Wei; Firat, Orhan; Zoph, Barret; Fedus, Liam; Bosma, Maarten; Zhou, Zongwei; Wang, Tao; Wang, Yu Emma; Webster, Kellie; Pellat, Marie; Robinson, Kevin; Meier-Hellstern, Kathleen; Duke, Toju; Dixon, Lucas; Zhang, Kun; Le, Quoc V; Wu, Yonghui; Chen, Zhifeng; Cui, Claire (2021). "GLaM: Efficient Scaling of Language Models with Mixture-of-Experts". doi:10.48550/arXiv.2112.06905.
- ↑ "WebGPT: Improving the factual accuracy of language models through web browsing". openai.com. Retrieved 21 October 2023.
- ↑ "fairseq documentation — fairseq 0.12.2 documentation". fairseq.readthedocs.io. Retrieved 16 May 2023.
- ↑ Aghajanyan, Armen; Huang, Bernie; Ross, Candace; Karpukhin, Vladimir; Xu, Hu; Goyal, Naman; Okhonko, Dmytro; Joshi, Mandar; Ghosh, Gargi; Lewis, Mike; Zettlemoyer, Luke (2022). "CM3: A Causal Masked Multimodal Model of the Internet". doi:10.48550/arXiv.2201.07520.
- ↑ "Aligning language models to follow instructions". openai.com. Retrieved 21 March 2023.
- ↑ "Finally, an AI bot that can ace technical interview questions (Ep. 417) - Stack Overflow". stackoverflow.blog. 22 February 2022. Retrieved 21 October 2023.
- ↑ "Cohere launches Extremely Large (beta)". Context by Cohere. 1 March 2022. Retrieved 12 March 2023.
- ↑ Shuster, Kurt; Komeili, Mojtaba; Adolphs, Leonard; Roller, Stephen; Szlam, Arthur; Weston, Jason (2022). "Language Models that Seek for Knowledge: Modular Search & Generation for Dialogue and Prompt Completion". doi:10.48550/arXiv.2203.13224.
- ↑ Nijkamp, Erik; Pang, Bo; Hayashi, Hiroaki; Tu, Lifu; Wang, Huan; Zhou, Yingbo; Savarese, Silvio; Xiong, Caiming (2022). "CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis". doi:10.48550/arXiv.2203.13474.
- ↑ "CodeGen". github.com. Salesforce. 16 May 2023. Retrieved 16 May 2023.
- ↑ Hoffmann, Jordan; Borgeaud, Sebastian; Mensch, Arthur; Buchatskaya, Elena; Cai, Trevor; Rutherford, Eliza; Casas, Diego de Las; Hendricks, Lisa Anne; Welbl, Johannes; Clark, Aidan; Hennigan, Tom; Noland, Eric; Millican, Katie; Driessche, George van den; Damoc, Bogdan; Guy, Aurelia; Osindero, Simon; Simonyan, Karen; Elsen, Erich; Rae, Jack W.; Vinyals, Oriol; Sifre, Laurent (2022). "Training Compute-Optimal Large Language Models". doi:10.48550/arXiv.2203.15556.
- ↑ Chowdhery, Aakanksha; Narang, Sharan; Devlin, Jacob; Bosma, Maarten; Mishra, Gaurav; Roberts, Adam; Barham, Paul; Chung, Hyung Won; Sutton, Charles; Gehrmann, Sebastian; Schuh, Parker; Shi, Kensen; Tsvyashchenko, Sasha; Maynez, Joshua; Rao, Abhishek; Barnes, Parker; Tay, Yi; Shazeer, Noam; Prabhakaran, Vinodkumar; Reif, Emily; Du, Nan; Hutchinson, Ben; Pope, Reiner; Bradbury, James; Austin, Jacob; Isard, Michael; Gur-Ari, Guy; Yin, Pengcheng; Duke, Toju; Levskaya, Anselm; Ghemawat, Sanjay; Dev, Sunipa; Michalewski, Henryk; Garcia, Xavier; Misra, Vedant; Robinson, Kevin; Fedus, Liam; Zhou, Denny; Ippolito, Daphne; Luan, David; Lim, Hyeontaek; Zoph, Barret; Spiridonov, Alexander; Sepassi, Ryan; Dohan, David; Agrawal, Shivani; Omernick, Mark; Dai, Andrew M.; Pillai, Thanumalayan Sankaranarayana; Pellat, Marie; Lewkowycz, Aitor; Moreira, Erica; Child, Rewon; Polozov, Oleksandr; Lee, Katherine; Zhou, Zongwei; Wang, Xuezhi; Saeta, Brennan; Diaz, Mark; Firat, Orhan; Catasta, Michele; Wei, Jason; Meier-Hellstern, Kathy; Eck, Douglas; Dean, Jeff; Petrov, Slav; Fiedel, Noah (2022). "PaLM: Scaling Language Modeling with Pathways". doi:10.48550/arXiv.2204.02311.
- ↑ "Pathways Language Model (PaLM): Scaling to 540 Billion Parameters for Breakthrough Performance". ai.googleblog.com. Retrieved 21 March 2023.
- ↑ Bai, Yuntao; Jones, Andy; Ndousse, Kamal; Askell, Amanda; Chen, Anna; DasSarma, Nova; Drain, Dawn; Fort, Stanislav; Ganguli, Deep; Henighan, Tom; Joseph, Nicholas; Kadavath, Saurav; Kernion, Jackson; Conerly, Tom; El-Showk, Sheer; Elhage, Nelson; Hatfield-Dodds, Zac; Hernandez, Danny; Hume, Tristan; Johnston, Scott; Kravec, Shauna; Lovitt, Liane; Nanda, Neel; Olsson, Catherine; Amodei, Dario; Brown, Tom; Clark, Jack; McCandlish, Sam; Olah, Chris; Mann, Ben; Kaplan, Jared (2022). "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback". doi:10.48550/arXiv.2204.05862.
- ↑ Black, Sid; Biderman, Stella; Hallahan, Eric; Anthony, Quentin; Gao, Leo; Golding, Laurence; He, Horace; Leahy, Connor; McDonell, Kyle; Phang, Jason; Pieler, Michael; Prashanth, USVSN Sai; Purohit, Shivanshu; Reynolds, Laria; Tow, Jonathan; Wang, Ben; Weinbach, Samuel (2022). "GPT-NeoX-20B: An Open-Source Autoregressive Language Model". doi:10.48550/arXiv.2204.06745.
- ↑ Leahy, Connor (2 February 2022). "Announcing GPT-NeoX-20B". EleutherAI Blog. Retrieved 21 March 2023.
- ↑ "Comparing AI models : DALLE and Stable Diffusion". www.linkedin.com. Retrieved 16 October 2023.
- ↑ Howell, James (22 September 2023). "What is Dall-E and How Does it Work?". 101 Blockchains. Retrieved 16 October 2023.
- ↑ "What is Dall-E (Dall-E 2) and How Does it Work?". Enterprise AI. Retrieved 16 October 2023.
- ↑ Gonsalves, Robert A. (5 September 2023). "Exploring DALL-E for Digital Art Creation". Medium. Retrieved 16 October 2023.
- ↑ "Democratizing access to large-scale language models with OPT-175B". ai.meta.com. Retrieved 20 September 2023.
- ↑ Tay, Yi; Dehghani, Mostafa; Tran, Vinh Q.; García, Xavier; Wei, Jason; Wang, Xuezhi; Chung, Hyung Won; Bahri, Dara; Schuster, Tal; Zheng, H.; Zhou, Denny; Houlsby, N.; Metzler, Donald (10 May 2022). "UL2: Unifying Language Learning Paradigms".
- ↑ Khrushchev, Mikhail (23 June 2022). "Yandex Publishes YaLM 100B. It's the Largest GPT-Like Neural Network in Open Source". Yandex. Retrieved 20 September 2023.
- ↑ Lewkowycz, Aitor; Andreassen, Anders; Dohan, David; Dyer, Ethan; Michalewski, Henryk; Ramasesh, Vinay; Slone, Ambrose; Anil, Cem; Schlag, Imanol; Gutman-Solo, Theo; Wu, Yuhuai; Neyshabur, Behnam; Gur-Ari, Guy; Misra, Vedant (2022). "Solving Quantitative Reasoning Problems with Language Models". doi:10.48550/arXiv.2206.14858.
- ↑ Chopra, Disha (1 July 2022). "Google Developed Minerva, an AI That Can Answer Math Questions". Analytics Drift. Retrieved 20 September 2023.
- ↑ "New AI Model Translates 200 Languages, Making Technology Accessible to More People". Meta. 6 July 2022. Retrieved 19 October 2023.
- ↑ Rodriguez, Jesus (15 August 2022). "AlexaTM 20B is Amazon's New Language Super Model Which is Also Capable of Few-Shot Learning". Medium. Retrieved 21 October 2023.
- ↑ Elemuwa, Fimber (22 February 2023). "Using CodeGeeX as a GitHub Copilot alternative". LogRocket Blog. Retrieved 19 October 2023.
- ↑ Zheng, Qinkai; Xia, Xiao; Zou, Xu; Dong, Yuxiao; Wang, Shan; Xue, Yufei; Wang, Zihan; Shen, Lei; Wang, Andi; Li, Yang; Su, Teng; Yang, Zhilin; Tang, Jie (2023). "CodeGeeX: A Pre-Trained Model for Code Generation with Multilingual Evaluations on HumanEval-X". doi:10.48550/arXiv.2303.17568.
- ↑ "Could Deepmind's Sparrow be Google's answer to ChatGPT?". medium.com. Retrieved 21 October 2023.
- ↑ Su, Hui; Zhou, Xiao; Yu, Houjin; Shen, Xiaoyu; Chen, Yuwen; Zhu, Zilin; Yu, Yang; Zhou, Jie (2022). "WeLM: A Well-Read Pre-trained Language Model for Chinese". doi:10.48550/arXiv.2209.10372.
- ↑ Zeng, Aohan; Liu, Xiao; Du, Zhengxiao; Wang, Zihan; Lai, Hanyu; Ding, Ming; Yang, Zhuoyi; Xu, Yifan; Zheng, Wendi; Xia, Xiao; Tam, Weng Lam; Ma, Zixuan; Xue, Yufei; Zhai, Jidong; Chen, Wenguang; Zhang, Peng; Dong, Yuxiao; Tang, Jie (2022). "GLM-130B: An Open Bilingual Pre-trained Model". doi:10.48550/arXiv.2210.02414.
- ↑ Muennighoff, Niklas; Wang, Thomas; Sutawika, Lintang; Roberts, Adam; Biderman, Stella; Scao, Teven Le; Bari, M Saiful; Shen, Sheng; Yong, Zheng-Xin; Schoelkopf, Hailey; Tang, Xiangru; Radev, Dragomir; Aji, Alham Fikri; Almubarak, Khalid; Albanie, Samuel; Alyafeai, Zaid; Webson, Albert; Raff, Edward; Raffel, Colin (2022). "Crosslingual Generalization through Multitask Finetuning". doi:10.48550/arXiv.2211.01786.
- ↑ Workshop, BigScience; Scao, Teven Le; Fan, Angela; Akiki, Christopher; Pavlick, Ellie; Ilić, Suzana; Hesslow, Daniel; Castagné, Roman; Luccioni, Alexandra Sasha; Yvon, François; Gallé, Matthias; Tow, Jonathan; Rush, Alexander M.; Biderman, Stella; Webson, Albert; Ammanamanchi, Pawan Sasanka; Wang, Thomas; Sagot, Benoît; Muennighoff, Niklas; del Moral, Albert Villanova; Ruwase, Olatunji; Bawden, Rachel; Bekman, Stas; McMillan-Major, Angelina; Beltagy, Iz; Nguyen, Huu; Saulnier, Lucile; Tan, Samson; Suarez, Pedro Ortiz; Sanh, Victor; Laurençon, Hugo; Jernite, Yacine; Launay, Julien; Mitchell, Margaret; Raffel, Colin; Gokaslan, Aaron; Simhi, Adi; Soroa, Aitor; Aji, Alham Fikri; Alfassy, Amit; Rogers, Anna; Nitzav, Ariel Kreisberg; Xu, Canwen; Mou, Chenghao; Emezue, Chris; Klamm, Christopher; Leong, Colin; van Strien, Daniel; Adelani, David Ifeoluwa; Radev, Dragomir; Ponferrada, Eduardo González; Levkovizh, Efrat; Kim, Ethan; Natan, Eyal Bar; De Toni, Francesco; Dupont, Gérard; Kruszewski, Germán; Pistilli, Giada; Elsahar, Hady; Benyamina, Hamza; Tran, Hieu; Yu, Ian; Abdulmumin, Idris; Johnson, Isaac; Gonzalez-Dios, Itziar; de la Rosa, Javier; Chim, Jenny; Dodge, Jesse; Zhu, Jian; Chang, Jonathan; Frohberg, Jörg; Tobing, Joseph; Bhattacharjee, Joydeep; Almubarak, Khalid; Chen, Kimbo; Lo, Kyle; Von Werra, Leandro; Weber, Leon; Phan, Long; allal, Loubna Ben; Tanguy, Ludovic; Dey, Manan; Muñoz, Manuel Romero; Masoud, Maraim; Grandury, María; Šaško, Mario; Huang, Max; Coavoux, Maximin; Singh, Mayank; Jiang, Mike Tian-Jian; Vu, Minh Chien; Jauhar, Mohammad A.; Ghaleb, Mustafa; Subramani, Nishant; Kassner, Nora; Khamis, Nurulaqilla; Nguyen, Olivier; Espejel, Omar; de Gibert, Ona; Villegas, Paulo; Henderson, Peter; Colombo, Pierre; Amuok, Priscilla; Lhoest, Quentin; Harliman, Rheza; Bommasani, Rishi; López, Roberto Luis; Ribeiro, Rui; Osei, Salomey; Pyysalo, Sampo; Nagel, Sebastian; Bose, Shamik; Muhammad, Shamsuddeen Hassan; Sharma, Shanya; Longpre, Shayne; Nikpoor, 
Somaieh; Silberberg, Stanislav; Pai, Suhas; Zink, Sydney; Torrent, Tiago Timponi; Schick, Timo; Thrush, Tristan; Danchev, Valentin; Nikoulina, Vassilina; Laippala, Veronika; Lepercq, Violette; Prabhu, Vrinda; Alyafeai, Zaid; Talat, Zeerak; Raja, Arun; Heinzerling, Benjamin; Si, Chenglei; Taşar, Davut Emre; Salesky, Elizabeth; Mielke, Sabrina J.; Lee, Wilson Y.; Sharma, Abheesht; Santilli, Andrea; Chaffin, Antoine; Stiegler, Arnaud; Datta, Debajyoti; Szczechla, Eliza; Chhablani, Gunjan; Wang, Han; Pandey, Harshit; Strobelt, Hendrik; Fries, Jason Alan; Rozen, Jos; Gao, Leo; Sutawika, Lintang; Bari, M. Saiful; Al-shaibani, Maged S.; Manica, Matteo; Nayak, Nihal; Teehan, Ryan; Albanie, Samuel; Shen, Sheng; Ben-David, Srulik; Bach, Stephen H.; Kim, Taewoon; Bers, Tali; Fevry, Thibault; Neeraj, Trishala; Thakker, Urmish; Raunak, Vikas; Tang, Xiangru; Yong, Zheng-Xin; Sun, Zhiqing; Brody, Shaked; Uri, Yallow; Tojarieh, Hadar; Roberts, Adam; Chung, Hyung Won; Tae, Jaesung; Phang, Jason; Press, Ofir; Li, Conglong; Narayanan, Deepak; Bourfoune, Hatim; Casper, Jared; Rasley, Jeff; Ryabinin, Max; Mishra, Mayank; Zhang, Minjia; Shoeybi, Mohammad; Peyrounette, Myriam; Patry, Nicolas; Tazi, Nouamane; Sanseviero, Omar; von Platen, Patrick; Cornette, Pierre; Lavallée, Pierre François; Lacroix, Rémi; Rajbhandari, Samyam; Gandhi, Sanchit; Smith, Shaden; Requena, Stéphane; Patil, Suraj; Dettmers, Tim; Baruwa, Ahmed; Singh, Amanpreet; Cheveleva, Anastasia; Ligozat, Anne-Laure; Subramonian, Arjun; Névéol, Aurélie; Lovering, Charles; Garrette, Dan; Tunuguntla, Deepak; Reiter, Ehud; Taktasheva, Ekaterina; Voloshina, Ekaterina; Bogdanov, Eli; Winata, Genta Indra; Schoelkopf, Hailey; Kalo, Jan-Christoph; Novikova, Jekaterina; Forde, Jessica Zosa; Clive, Jordan; Kasai, Jungo; Kawamura, Ken; Hazan, Liam; Carpuat, Marine; Clinciu, Miruna; Kim, Najoung; Cheng, Newton; Serikov, Oleg; Antverg, Omer; van der Wal, Oskar; Zhang, Rui; Zhang, Ruochen; Gehrmann, Sebastian; Mirkin, Shachar; Pais, Shani; 
Shavrina, Tatiana; Scialom, Thomas; Yun, Tian; Limisiewicz, Tomasz; Rieser, Verena; Protasov, Vitaly; Mikhailov, Vladislav; Pruksachatkun, Yada; Belinkov, Yonatan; Bamberger, Zachary; Kasner, Zdeněk; Rueda, Alice; Pestana, Amanda; Feizpour, Amir; Khan, Ammar; Faranak, Amy; Santos, Ana; Hevia, Anthony; Unldreaj, Antigona; Aghagol, Arash; Abdollahi, Arezoo; Tammour, Aycha; HajiHosseini, Azadeh; Behroozi, Bahareh; Ajibade, Benjamin; Saxena, Bharat; Ferrandis, Carlos Muñoz; Contractor, Danish; Lansky, David; David, Davis; Kiela, Douwe; Nguyen, Duong A.; Tan, Edward; Baylor, Emi; Ozoani, Ezinwanne; Mirza, Fatima; Ononiwu, Frankline; Rezanejad, Habib; Jones, Hessie; Bhattacharya, Indrani; Solaiman, Irene; Sedenko, Irina; Nejadgholi, Isar; Passmore, Jesse; Seltzer, Josh; Sanz, Julio Bonis; Dutra, Livia; Samagaio, Mairon; Elbadri, Maraim; Mieskes, Margot; Gerchick, Marissa; Akinlolu, Martha; McKenna, Michael; Qiu, Mike; Ghauri, Muhammed; Burynok, Mykola; Abrar, Nafis; Rajani, Nazneen; Elkott, Nour; Fahmy, Nour; Samuel, Olanrewaju; An, Ran; Kromann, Rasmus; Hao, Ryan; Alizadeh, Samira; Shubber, Sarmad; Wang, Silas; Roy, Sourav; Viguier, Sylvain; Le, Thanh; Oyebade, Tobi; Le, Trieu; Yang, Yoyo; Nguyen, Zach; Kashyap, Abhinav Ramesh; Palasciano, Alfredo; Callahan, Alison; Shukla, Anima; Miranda-Escalada, Antonio; Singh, Ayush; Beilharz, Benjamin; Wang, Bo; Brito, Caio; Zhou, Chenxi; Jain, Chirag; Xu, Chuxin; Fourrier, Clémentine; Periñán, Daniel León; Molano, Daniel; Yu, Dian; Manjavacas, Enrique; Barth, Fabio; Fuhrimann, Florian; Altay, Gabriel; Bayrak, Giyaseddin; Burns, Gully; Vrabec, Helena U.; Bello, Imane; Dash, Ishani; Kang, Jihyun; Giorgi, John; Golde, Jonas; Posada, Jose David; Sivaraman, Karthik Rangasai; Bulchandani, Lokesh; Liu, Lu; Shinzato, Luisa; de Bykhovetz, Madeleine Hahn; Takeuchi, Maiko; Pàmies, Marc; Castillo, Maria A.; Nezhurina, Marianna; Sänger, Mario; Samwald, Matthias; Cullan, Michael; Weinberg, Michael; De Wolf, Michiel; Mihaljcic, Mina; Liu, Minna; 
Freidank, Moritz; Kang, Myungsun; Seelam, Natasha; Dahlberg, Nathan; Broad, Nicholas Michio; Muellner, Nikolaus; Fung, Pascale; Haller, Patrick; Chandrasekhar, Ramya; Eisenberg, Renata; Martin, Robert; Canalli, Rodrigo; Su, Rosaline; Su, Ruisi; Cahyawijaya, Samuel; Garda, Samuele; Deshmukh, Shlok S.; Mishra, Shubhanshu; Kiblawi, Sid; Ott, Simon; Sang-aroonsiri, Sinee; Kumar, Srishti; Schweter, Stefan; Bharati, Sushil; Laud, Tanmay; Gigant, Théo; Kainuma, Tomoya; Kusa, Wojciech; Labrak, Yanis; Bajaj, Yash Shailesh; Venkatraman, Yash; Xu, Yifan; Xu, Yingxin; Xu, Yu; Tan, Zhe; Xie, Zhongli; Ye, Zifan; Bras, Mathilde; Belkada, Younes; Wolf, Thomas (13 March 2023). "BLOOM: A 176B-Parameter Open-Access Multilingual Language Model". arXiv:2211.05100 [cs].
- ↑ Chopra, Disha (17 November 2022). "Meta Introduces 'Galactica,' an AI System that Generates Academic Papers from Simple Text Inputs". Analytics Drift. Retrieved 20 September 2023.
- ↑ "Meta's New Large Language Model Galactica Pulled Down Three Days After Launch". Spiceworks. Retrieved 20 September 2023.
- ↑ "AlexaTM 20B is now available in Amazon SageMaker JumpStart | AWS Machine Learning Blog". aws.amazon.com. 17 November 2022. Retrieved 20 September 2023.
- ↑ "FLAN-T5 vs. FLAN-UL2: Which LLM is Better? | Sapling". sapling.ai. Retrieved 19 October 2023.
- ↑ Subhash, Varshini (5 January 2023). "Can Large Language Models Change User Preference Adversarially?". arXiv:2302.10291 [cs]. doi:10.48550/arXiv.2302.10291.
- ↑ Joshi, Harshit; Ebenezer, Abishai; Cambronero, José; Gulwani, Sumit; Kanade, Aditya; Le, Vu; Radiček, Ivan; Verbruggen, Gust (31 January 2023). "FLAME: A small language model for spreadsheet formulas". arXiv:2301.13779 [cs]. doi:10.48550/arXiv.2301.13779.
- ↑ Zhang, Zhuosheng; Zhang, Aston; Li, Mu; Zhao, Hai; Karypis, George; Smola, Alex (2023). "Multimodal Chain-of-Thought Reasoning in Language Models". doi:10.48550/arXiv.2302.00923.
- ↑ "Vinija's Notes • Models • Toolformer". vinija.ai. Retrieved 26 June 2023.
- ↑ Schick, Timo; Dwivedi-Yu, Jane; Dessì, Roberto; Raileanu, Roberta; Lomeli, Maria; Zettlemoyer, Luke; Cancedda, Nicola; Scialom, Thomas (2023). "Toolformer: Language Models Can Teach Themselves to Use Tools". doi:10.48550/arXiv.2302.04761.
- ↑ "Shaped". www.shaped.ai. Retrieved 16 May 2023.
- ↑ Weaver, Alaura (2 March 2023). "Palmyra LLMs empower secure, enterprise-grade generative AI for business". Writer. Retrieved 11 March 2023.
- ↑ "Writer Launches Three New Generative AI Models for the Enterprise". PRWeb. Retrieved 11 March 2023.
- ↑ "fnlp/moss-moon-003-base · Hugging Face". huggingface.co. 20 April 2023. Retrieved 26 June 2023.
- ↑ "MOSS". txsun1997.github.io. Retrieved 11 March 2023.
- ↑ White, Jules; Fu, Quchen; Hays, Sam; Sandborn, Michael; Olea, Carlos; Gilbert, Henry; Elnashar, Ashraf; Spencer-Smith, Jesse; Schmidt, Douglas C. (21 February 2023). "A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT". arXiv:2302.11382 [cs]. doi:10.48550/arXiv.2302.11382.
- ↑ "LLaMA: Open and Efficient Foundation Language Models - Meta Research". Meta Research. Retrieved 11 March 2023.
- ↑ Peng, Baolin; Galley, Michel; He, Pengcheng; Cheng, Hao; Xie, Yujia; Hu, Yu; Huang, Qiuyuan; Liden, Lars; Yu, Zhou; Chen, Weizhu; Gao, Jianfeng (1 March 2023). "Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback". arXiv:2302.12813 [cs]. doi:10.48550/arXiv.2302.12813.
- ↑ Raieli, Salvatore (13 March 2023). "SpikeGPT: a 260 M only parameters LM not afraid of competition". Medium. Retrieved 26 June 2023.
- ↑ Zhu, Rui-Jie; Zhao, Qihang; Eshraghian, Jason K. (28 February 2023). "SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks". arXiv:2302.13939 [cs]. doi:10.48550/arXiv.2302.13939.
- ↑ Al-Kaswan, Ali; Izadi, Maliheh (28 February 2023). "The (ab)use of Open Source Code to Train Large Language Models". arXiv:2302.13681 [cs]. doi:10.48550/arXiv.2302.13681.
- ↑ Kwon, Minae; Xie, Sang Michael; Bullard, Kalesha; Sadigh, Dorsa (27 February 2023). "Reward Design with Language Models". arXiv:2303.00001 [cs]. doi:10.48550/arXiv.2303.00001.
- ↑ Bastian, Matthias (3 March 2023). "Microsoft's Kosmos-1 is a multimodal step toward more general AI". THE DECODER. Retrieved 18 September 2023.
- ↑ Huang, Shaohan; Dong, Li; Wang, Wenhui; Hao, Yaru; Singhal, Saksham; Ma, Shuming; Lv, Tengchao; Cui, Lei; Mohammed, Owais Khan; Patra, Barun; Liu, Qiang; Aggarwal, Kriti; Chi, Zewen; Bjorck, Johan; Chaudhary, Vishrav; Som, Subhojit; Song, Xia; Wei, Furu (1 March 2023). "Language Is Not All You Need: Aligning Perception with Language Models". arXiv:2302.14045 [cs]. doi:10.48550/arXiv.2302.14045.
- ↑ Cao, Meng; Fatemi, Mehdi; Cheung, Jackie Chi Kit; Shabanian, Samira (27 February 2023). "Systematic Rectification of Language Models via Dead-end Analysis". arXiv:2302.14003 [cs]. doi:10.48550/arXiv.2302.14003.
- ↑ Bertolini, Lorenzo; Elce, Valentina; Michalak, Adriana; Bernardi, Giulio; Weeds, Julie (28 February 2023). "Automatic Scoring of Dream Reports' Emotional Content with Large Language Models". arXiv:2302.14828 [cs]. doi:10.48550/arXiv.2302.14828.
- ↑ Ye, Seonghyeon; Hwang, Hyeonbin; Yang, Sohee; Yun, Hyeongu; Kim, Yireun; Seo, Minjoon (28 February 2023). "In-Context Instruction Learning". arXiv:2302.14691 [cs]. doi:10.48550/arXiv.2302.14691.
- ↑ Houghton, Conor; Kazanina, Nina; Sukumaran, Priyanka (28 February 2023). "Beyond the limitations of any imaginable mechanism: large language models and psycholinguistics". arXiv:2303.00077 [cs]. doi:10.48550/arXiv.2303.00077. Retrieved 10 March 2023.
- ↑ Yuan, Yang (2023). "Succinct Representations for Concepts". doi:10.48550/arXiv.2303.00446.
- ↑ Huemann, Zachary; Lee, Changhee; Hu, Junjie; Cho, Steve Y.; Bradshaw, Tyler (1 March 2023). "Domain-adapted large language models for classifying nuclear medicine reports". arXiv:2303.01258 [cs]. doi:10.48550/arXiv.2303.01258.
- ↑ "A New Open Source Flan 20B with UL2". Yi Tay. Retrieved 30 June 2023.
- ↑ Zhang, Bowen; Soh, Harold (6 March 2023). "Large Language Models as Zero-Shot Human Models for Human-Robot Interaction". arXiv:2303.03548 [cs]. doi:10.48550/arXiv.2303.03548.
- ↑ Dang, Hai; Goller, Sven; Lehmann, Florian; Buschek, Daniel (6 March 2023). "Choice Over Control: How Users Write with Large Language Models using Diegetic and Non-Diegetic Prompting". arXiv:2303.03199 [cs]. doi:10.1145/3544548.3580969. Retrieved 8 March 2023.
- ↑ "Prepare for truly useful large language models". Nature Biomedical Engineering. 7 (2): 85–86. 7 March 2023. doi:10.1038/s41551-023-01012-6.
- ↑ "Stanford CRFM". crfm.stanford.edu. Retrieved 21 March 2023.
- ↑ "Announcement of Jurassic-2 and Task-Specific APIs". Data Phoenix. 12 March 2023. Retrieved 21 September 2023.
- ↑ "Large language model: Revision history - Wikipedia". en.wikipedia.org. Retrieved 21 September 2023.
- ↑ "How does Claude, the new LLM from Anthropic compare to ChatGPT? A serious contender". www.cerebrium.ai. Retrieved 18 September 2023.
- ↑ "Introducing Claude". Anthropic. Retrieved 30 June 2023.
- ↑ "Falcon LLM: Abu Dhabi's Based TII Latest AI Breakthrough for Next-Gen Solutions". www.tii.ae. 6 September 2023. Retrieved 20 September 2023.
- ↑ "GPT-NeoX". huggingface.co. Retrieved 20 March 2023.
- ↑ Lubbad, Mohammed (7 August 2023). "The Ultimate Guide to GPT-4 Parameters: Everything You Need to Know about NLP's Game-Changer". Medium. Retrieved 19 September 2023.
- ↑ OpenAI (2023). "GPT-4 Technical Report". doi:10.48550/arXiv.2303.08774.
- ↑ Ren, Xiaozhe; Zhou, Pingyi; Meng, Xinfan; Huang, Xinjing; Wang, Yadao; Wang, Weichao; Li, Pengfei; Zhang, Xiaoda; Podolskiy, Alexander; Arshinov, Grigory; Bout, Andrey; Piontkovskaya, Irina; Wei, Jiansheng; Jiang, Xin; Su, Teng; Liu, Qun; Yao, Jun (2023). "PanGu-Σ: Towards Trillion Parameter Language Model with Sparse Heterogeneous Computing". doi:10.48550/arXiv.2303.10845.
- ↑ Tickoo, Aneesh (10 July 2023). "Huawei Researchers Develop Pangu-Σ: A Large Language Model With Sparse Architecture And 1.085 Trillion Parameters". MarkTechPost. Retrieved 16 October 2023.
- ↑ "ChatGLM-6B". github.com. THUDM. 30 June 2023. Retrieved 30 June 2023.
- ↑ Eloundou, Tyna; Manning, Sam; Mishkin, Pamela; Rock, Daniel (2023). "GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models". doi:10.48550/arXiv.2303.10130.
- ↑ "Free Dolly: Introducing the World's First Truly Open Instruction-Tuned LLM". Databricks. 12 April 2023. Retrieved 19 September 2023.
- ↑ "Hello Dolly: Democratizing the magic of ChatGPT with open models". Databricks. 24 March 2023. Retrieved 19 June 2023.
- ↑ Dey, Nolan (28 March 2023). "Cerebras-GPT: A Family of Open, Compute-efficient, Large Language Models". Cerebras. Retrieved 20 September 2023.
- ↑ "BloombergGPT: The 50 Billion Parameter Large Language Model for Finance". Medium. 8 April 2023. Retrieved 20 September 2023.
- ↑ Köpf, Andreas; Kilcher, Yannic; von Rütte, Dimitri; Anagnostidis, Sotiris; Tam, Zhi-Rui; Stevens, Keith; Barhoum, Abdullah; Duc, Nguyen Minh; Stanley, Oliver; Nagyfi, Richárd; ES, Shahul; Suri, Sameer; Glushkov, David; Dantuluri, Arnav; Maguire, Andrew; Schuhmann, Christoph; Nguyen, Huu; Mattick, Alexander (2023). "OpenAssistant Conversations -- Democratizing Large Language Model Alignment". doi:10.48550/arXiv.2304.07327.
- ↑ Roth, Emma (19 April 2023). "Stability AI announces new open-source large language model". The Verge. Retrieved 9 May 2023.
- ↑ "Stability AI Launches the First of its StableLM Suite of Language Models". Stability AI. Retrieved 9 May 2023.
- ↑ Xu, Can; Sun, Qingfeng; Zheng, Kai; Geng, Xiubo; Zhao, Pu; Feng, Jiazhan; Tao, Chongyang; Jiang, Daxin (2023). "WizardLM: Empowering Large Language Models to Follow Complex Instructions". doi:10.48550/arXiv.2304.12244.
- ↑ "Salesforce/codegen2-16B · Hugging Face". huggingface.co. Retrieved 20 October 2023.
- ↑ Nijkamp, Erik; Hayashi, Hiroaki; Xiong, Caiming; Savarese, Silvio; Zhou, Yingbo (2023). "CodeGen2: Lessons for Training LLMs on Programming and Natural Languages". doi:10.48550/arXiv.2305.02309.
- ↑ Schwartz, Barry (12 May 2023). "Bing Chat gains image answers with knowledge cards and optimized answers". Search Engine Land. Retrieved 16 May 2023.
- ↑ "How to Access PaLM 2 AND TRY IT". MLYearning. 15 May 2023. Retrieved 16 May 2023.
- ↑ Hern, Alex (10 May 2023). "Google launches new AI PaLM 2 in attempt to regain leadership of the pack". The Guardian. Retrieved 16 May 2023.
- ↑ Wang, Wenhai; Chen, Zhe; Chen, Xiaokang; Wu, Jiannan; Zhu, Xizhou; Zeng, Gang; Luo, Ping; Lu, Tong; Zhou, Jie; Qiao, Yu; Dai, Jifeng (2023). "VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks". doi:10.48550/arXiv.2305.11175.
- ↑ Xu, Canwen; Guo, Daya; Duan, Nan; McAuley, Julian (2023). "Baize: An Open-Source Chat Model with Parameter-Efficient Tuning on Self-Chat Data". doi:10.48550/arXiv.2304.01196.
- ↑ "AI Expert Says ChatGPT Is Way Stupider Than People Realize". Futurism. Retrieved 24 May 2023.
- ↑ Patil, Shishir G.; Zhang, Tianjun; Wang, Xin; Gonzalez, Joseph E. (2023). "Gorilla: Large Language Model Connected with Massive APIs". doi:10.48550/arXiv.2305.15334.
- ↑ "UC Berkeley Researchers Open-Source API-Calling Language Model Gorilla". InfoQ. Retrieved 15 October 2023.
- ↑ Ko, Hyunwoong; Yang, Kichang; Ryu, Minho; Choi, Taekyoon; Yang, Seungmu; Hyun, Jiwung; Park, Sungho; Park, Kyubyong (2023). "A Technical Report for Polyglot-Ko: Open-Source Large-Scale Korean Language Models". doi:10.48550/arXiv.2306.02254.
- ↑ Truong, Timothy F.; Bepler, Tristan (2023). "PoET: A generative model of protein families as sequences-of-sequences". doi:10.48550/arXiv.2306.06156.
- ↑ Yang, Hongyang; Liu, Xiao-Yang; Wang, Christina Dan (2023). "FinGPT: Open-Source Financial Large Language Models". doi:10.48550/arXiv.2306.06031.
- ↑ Tăiatu, Iulian-Marius; Avram, Andrei-Marius; Cercel, Dumitru-Clementin; Pop, Florin (2023). "RoBERTweet: A BERT Language Model for Romanian Tweets". doi:10.48550/arXiv.2306.06598.
- ↑ Liu, Zhengliang; Zhong, Aoxiao; Li, Yiwei; Yang, Longtao; Ju, Chao; Wu, Zihao; Ma, Chong; Shu, Peng; Chen, Cheng; Kim, Sekeun; Dai, Haixing; Zhao, Lin; Zhu, Dajiang; Liu, Jun; Liu, Wei; Shen, Dinggang; Li, Xiang; Li, Quanzheng; Liu, Tianming (2023). "Radiology-GPT: A Large Language Model for Radiology". doi:10.48550/arXiv.2306.08666.
- ↑ Gao, Difei; Ji, Lei; Zhou, Luowei; Lin, Kevin Qinghong; Chen, Joya; Fan, Zihan; Shou, Mike Zheng (2023). "AssistGPT: A General Multi-modal Assistant that can Plan, Execute, Inspect, and Learn". doi:10.48550/arXiv.2306.08640.
- ↑ Feng, Xidong; Luo, Yicheng; Wang, Ziyan; Tang, Hongrui; Yang, Mengyue; Shao, Kun; Mguni, David; Du, Yali; Wang, Jun (2023). "ChessGPT: Bridging Policy Learning and Language Modeling". doi:10.48550/arXiv.2306.09200.
- ↑ Sun, Yuqian; Li, Xingyu; Gao, Ze (2023). "Inspire creativity with ORIBA: Transform Artists' Original Characters into Chatbots through Large Language Model". doi:10.48550/arXiv.2306.09776.
- ↑ Wang, Guangyu; Yang, Guoxing; Du, Zongxin; Fan, Longjun; Li, Xiaohu (2023). "ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation". doi:10.48550/arXiv.2306.09968.
- ↑ "Your job is (probably) safe from artificial intelligence". The Economist. 7 May 2023. Retrieved 18 June 2023.
- ↑ "Generative AI Could Raise Global GDP by 7%". Goldman Sachs. Retrieved 18 June 2023.
- ↑ Dickson, Ben (19 June 2023). "ChatGPT will make the web toxic for its successors - TechTalks". bdtechtalks.com. Retrieved 18 July 2023.
- ↑ Rubenstein, Paul K.; Asawaroengchai, Chulayuth; Nguyen, Duc Dung; Bapna, Ankur; Borsos, Zalán; Quitry, Félix de Chaumont; Chen, Peter; Badawy, Dalia El; Han, Wei; Kharitonov, Eugene; Muckenhirn, Hannah; Padfield, Dirk; Qin, James; Rozenberg, Danny; Sainath, Tara; Schalkwyk, Johan; Sharifi, Matt; Ramanovich, Michelle Tadmor; Tagliasacchi, Marco; Tudor, Alexandru; Velimirović, Mihajlo; Vincent, Damien; Yu, Jiahui; Wang, Yongqiang; Zayats, Vicky; Zeghidour, Neil; Zhang, Yu; Zhang, Zhishuai; Zilka, Lukas; Frank, Christian (2023). "AudioPaLM: A Large Language Model That Can Speak and Listen". doi:10.48550/arXiv.2306.12925.
- ↑ "ChatLaw: Open-Source Legal Large Language Model with Integrated External Knowledge Bases". arxiv.org. Retrieved 29 June 2023.
- ↑ Liao, Rita (11 July 2023). "China's search engine pioneer unveils open source large language model to rival OpenAI". TechCrunch. Retrieved 16 July 2023.
- ↑ Watson, Clare (9 September 2023). "Scientists Devised a Way to Tell if ChatGPT Becomes Aware of Itself". ScienceAlert. Retrieved 17 September 2023.
- ↑ "Alibaba launches its ChatGPT-like AI model for public use amid loosening restrictions in China". Cointelegraph. 13 September 2023. Retrieved 17 September 2023.
- ↑ "Google DeepMind Gemini". Dr Alan D. Thompson – Life Architect. 20 May 2023. Retrieved 18 September 2023.
- ↑ Jones, Luke (9 October 2023). "Microsoft Researchers Develop Unlearning Technique for Large Language Models". WinBuzzer. Retrieved 9 October 2023.
- ↑ "Announcing Grok". x.ai. Retrieved 3 September 2024.
- ↑ "Claude 2.1 Introduces 200K Context Window". zeniteq.com. Retrieved 3 September 2024.
- ↑ "Introducing Claude 2.1". anthropic.com. Retrieved 3 September 2024.
- ↑ "Wikipedia Views: results". wikipediaviews.org. Retrieved 21 September 2023.
- ↑ "Google Trends". Google Trends. Retrieved 21 September 2023.