Timeline of transformers

This is a timeline of transformers.

Sample questions

The following are some interesting questions that can be answered by reading this timeline:

Big picture

Time period	Development summary	More details

Full timeline

Year	Month and date	Event type	Details
2017	June		Google researchers first describe the transformer architecture, in the paper "Attention Is All You Need". The architecture would later underpin large language models and the chatbots built on them.
2018	June 11		OpenAI releases a paper entitled Improving Language Understanding by Generative Pre-Training, in which it introduces the Generative Pre-trained Transformer (GPT).^[1]
2018	November		Google open-sources the BERT (Bidirectional Encoder Representations from Transformers) model, which was trained in just four days. The model has 345M parameters and was trained on a 16GB dataset made up of English Wikipedia (12GB) and BookCorpus (4GB).^[2]
2019	February 14		OpenAI releases Generative Pre-trained Transformer 2 (GPT-2).
2019	July		Facebook AI (FAIR) and the University of Washington jointly introduce RoBERTa (Robustly optimized BERT approach), a language model based on the BERT architecture. RoBERTa has 125M parameters in its base configuration. It was trained on the original BERT dataset, consisting of English Wikipedia (12GB) and BookCorpus (4GB), plus CC-News (76GB), which includes 63 million English news articles from Sep/2016-Feb/2019, OpenWebText/Reddit upvoted (38GB), and Stories (31GB), which consists of 1M story documents from the CC. The total size of the dataset used to train RoBERTa is 161GB.^[2]
2019	August		NVIDIA introduces Megatron-LM, a transformer language model with 8.3 billion parameters, trained on 512 GPUs using 8-way model parallelism and 64-way data parallelism. At the time of its introduction it is the largest transformer model ever trained, 24x the size of BERT and 5.6x the size of GPT-2. For training, NVIDIA uses a 37 GB WebText dataset downloaded from Reddit, split 95:5 into training and validation sets. Megatron-LM surpasses previous state-of-the-art results in WikiText perplexity and LAMBADA accuracy. NVIDIA addresses the problem of training massive transformer models with model parallelism that splits the parameters across multiple GPUs without requiring a new compiler or code rewiring.^[3]
2020	April		Facebook AI Research (FAIR) introduces Megatron-11b, a unidirectional language model with 11B parameters that is based on Megatron-LM. The model is trained using intra-layer model parallelism, with each layer's parameters split across eight GPUs. Like RoBERTa, Megatron-11b is trained on the original BERT dataset, consisting of English Wikipedia (12GB) and BookCorpus (4GB), plus CC-News (76GB), which includes 63 million English news articles from Sep/2016-Feb/2019, OpenWebText/Reddit upvoted (38GB), and Stories (31GB), which consists of 1M story documents from the CC. The total size of the dataset used to train Megatron-11b is 161GB. Megatron-11b demonstrates strong performance on a range of language tasks, including language modeling and text generation.^[2]
2020	June 11		OpenAI releases Generative Pre-trained Transformer 3 (GPT-3) in beta.
2021	October	Transformer launch	NVIDIA and Microsoft jointly introduce Megatron-Turing NLG 530B, also known as MT-NLG, the successor to Microsoft Turing NLG 17B and NVIDIA Megatron-LM 8.3B. With 530 billion parameters, MT-NLG is about three times the size of GPT-3, and it is trained on over 4,000 GPUs. The dataset sources used to train the MT-NLG model include The Pile v1 plus an additional 14 datasets, such as Books3, OpenWebText2, Stack Exchange, PubMed Abstracts, and GitHub, among others. The total size of the dataset is estimated to be over 825GB, with a potential size of 1.86TB (1,863GB).^[2]
2021	December	Transformer launch	Meta AI (previously known as Facebook AI Research) introduces the language model Fairseq. The model is not related to Megatron and uses different technologies for training. Fairseq has 13B parameters and was trained using 2,363 GPU-days, which works out to between two and three days assuming 1,024 GPUs. Like RoBERTa and Megatron-11b, Fairseq was trained on the original BERT dataset, consisting of English Wikipedia (12GB) and BookCorpus (4GB), plus CC-News (76GB), OpenWebText/Reddit upvoted (38GB), and Stories (31GB), which consists of 1M story documents from the CC. A new addition is English CC100 in Wikipedia style from Jan/2018-Dec/2018, which adds 292GB. The total size of the dataset used to train Fairseq is 453GB.^[2]
2022	June 14		Google releases all Switch Transformer models in T5X/JAX, including the 1.6T-parameter Switch-C and the 395B-parameter Switch-XXL models.^[4] The Switch Transformer layer replaces the regular feed-forward network (FFN) layer of the transformer architecture. It is distinctive in having multiple FFNs, also known as experts, in each layer, as opposed to the single FFN found in a standard transformer layer.^[5]
2023	February 15		A paper presents a pilot study that evaluates the cognitive abilities of two recently released generative transformer models, ChatGPT and DALL-E 2, in decision-making and spatial reasoning. The study constructs input prompts following neutral a priori guidelines and finds that DALL-E 2 can generate at least one correct image for each spatial reasoning prompt, but most images generated are incorrect. Similarly, ChatGPT demonstrates some level of rational decision-making, but many of its decisions violate at least one of the axioms under the classical Von Neumann-Morgenstern utility theorem. ChatGPT's outputs tend to be unpredictable: it can make irrational decisions for simpler problems while drawing correct conclusions for more complex bet structures. The paper discusses the challenges of scaling up such cognitive evaluations for generative models and conducting them with a closed set of answer keys.^[6]
2023	February 18		A paper evaluates the performance of Generative Pre-trained Transformer (GPT) models for machine translation, covering the quality of different GPT models, the effect of prompting strategies, robustness to domain shifts, and document-level translation. The experiment includes eighteen translation directions involving high- and low-resource languages, as well as non-English-centric translations. The results show that GPT models achieve competitive translation quality for high-resource languages, while having limited capabilities for low-resource languages. Hybrid approaches, which combine GPT models with other translation systems, can further enhance the translation quality.^[7]
2023	March 10		A paper introduces Exphormer, a framework for building scalable graph transformers using a sparse attention mechanism based on virtual global nodes and expander graphs. These mechanisms allow for linear complexity in the size of the graph while maintaining theoretical properties and competitive empirical results on various graph datasets. Exphormer is shown to be able to scale to larger datasets than previous graph transformer architectures.^[8]
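
The June 2022 entry above describes the Switch Transformer layer as holding several FFN experts instead of one. The following is a minimal sketch of that idea in plain Python, not Google's T5X/JAX implementation. It assumes top-1 routing, in which a learned router sends each token to the single expert it scores highest and the expert output is scaled by the router probability. The names SwitchFFN, Expert, and route, the random weights, and the tiny dimensions are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


def relu(vec):
    return [max(0.0, v) for v in vec]


class Expert:
    """One feed-forward network: d_model -> d_ff -> d_model (hypothetical sizes)."""

    def __init__(self, d_model, d_ff, rng):
        self.w_in = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(d_ff)]
        self.w_out = [[rng.uniform(-0.5, 0.5) for _ in range(d_ff)] for _ in range(d_model)]

    def __call__(self, token):
        return matvec(self.w_out, relu(matvec(self.w_in, token)))


class SwitchFFN:
    """Several experts plus a router; a standard layer would hold one FFN."""

    def __init__(self, d_model, d_ff, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one row of weights per expert (assumed linear router).
        self.router = [[rng.uniform(-0.5, 0.5) for _ in range(d_model)] for _ in range(num_experts)]
        self.experts = [Expert(d_model, d_ff, rng) for _ in range(num_experts)]

    def route(self, token):
        """Return (index of the highest-probability expert, its probability)."""
        probs = softmax(matvec(self.router, token))
        best = max(range(len(probs)), key=probs.__getitem__)
        return best, probs[best]

    def __call__(self, tokens):
        outputs = []
        for token in tokens:
            index, gate = self.route(token)
            # Only the chosen expert runs for this token (top-1 routing).
            outputs.append([gate * y for y in self.experts[index](token)])
        return outputs


layer = SwitchFFN(d_model=4, d_ff=8, num_experts=3)
tokens = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
outputs = layer(tokens)
for token in tokens:
    expert_index, gate = layer.route(token)
    print(f"token {token} -> expert {expert_index} (gate {gate:.3f})")
```

Because only one expert runs per token, adding experts increases the parameter count without increasing the per-token computation by the same factor, which is one way a layer of this kind can reach very large parameter counts.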

Meta information on the timeline

How the timeline was built

The initial version of the timeline was written by FIXME.

Funding information for this timeline is available.

Feedback and comments

Feedback for the timeline can be provided at the following places:

FIXME

What the timeline is still missing

Timeline update strategy

External links

References

[1] Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (11 June 2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. p. 12. Archived from the original (PDF) on 26 January 2021. Retrieved 23 January 2021.
[2] "AI: Megatron the Transformer, and its related language models". lifearchitect.ai. 24 September 2021. Retrieved 12 March 2023.
[3] "Megatron Unleashed: NVIDIA's NLP Model "Megatron-LM" is the Largest Transformer Ever Trained | Exxact Blog". www.exxactcorp.com. Retrieved 11 March 2023.
[4] "https://twitter.com/LiamFedus/status/1536791574612303872". Twitter. Retrieved 8 March 2023.
[5] Davis, Jonathan (2 May 2021). "Understanding Google's Switch Transformer". Medium. Retrieved 8 March 2023.
[6] Tang, Zhisheng; Kejriwal, Mayank (15 February 2023). "A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning". arXiv:2302.09068 [cs]. doi:10.48550/arXiv.2302.09068. Retrieved 7 March 2023.
[7] Hendy, Amr; Abdelrehim, Mohamed; Sharaf, Amr; Raunak, Vikas; Gabr, Mohamed; Matsushita, Hitokazu; Kim, Young Jin; Afify, Mohamed; Awadalla, Hany Hassan (17 February 2023). "How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation". arXiv:2302.09210 [cs]. doi:10.48550/arXiv.2302.09210.
[8] Shirzad, Hamed; Velingker, Ameya; Venkatachalam, Balaji; Sutherland, Danica J.; Sinop, Ali Kemal (2023). "Exphormer: Sparse Transformers for Graphs". doi:10.48550/arXiv.2303.06147.

