Timeline of transformers

This is a timeline of transformers, the neural network architecture underlying language models such as BERT and GPT.

Full timeline

Year	Month and date	Event type	Details
2017	June		Google researchers first describe the transformer architecture, in the paper Attention Is All You Need. The architecture would later turbocharge the power of chatbots.
2018	June 11		OpenAI releases a paper entitled Improving Language Understanding by Generative Pre-Training, in which it introduces the Generative Pre-trained Transformer (GPT).^[1]
2018	November		Google open-sources BERT (Bidirectional Encoder Representations from Transformers), a model that was trained in just four days. The model has 345M parameters and was trained on a dataset that includes English Wikipedia (12GB) and BookCorpus (4GB), for a total size of 16GB.^[2]
2019	February 14		OpenAI releases Generative Pre-trained Transformer 2 (GPT-2).
2019	July		Facebook AI Research (FAIR) and the University of Washington jointly introduce RoBERTa (Robustly optimized BERT approach), a language model based on the BERT architecture. RoBERTa has 125M parameters in its base configuration. It was trained on a large and diverse dataset that includes the original BERT dataset, consisting of English Wikipedia (12GB) and BookCorpus (4GB), as well as CC-News (76GB), which includes 63 million English news articles from September 2016 to February 2019, OpenWebText/Reddit upvoted (38GB), and Stories (31GB), which consists of 1M story documents from Common Crawl. The total size of the dataset used to train RoBERTa is 161GB.^[2]
2019	July 26		A paper describes DLGNet, a transformer-based model for dialogue response generation. By this time, the use of transformer-based models such as GPT-2 has shown significant improvements in capturing long-range structures in language modeling tasks. DLGNet uses a combination of the long-range transformer architecture, injection of random informative paddings, joint modeling of dialogue context and response, and 100% tokenization coverage from byte pair encoding. DLGNet outperforms state-of-the-art multi-turn dialogue models on Movie Triples and Ubuntu Dialogue datasets, achieving best performance to date based on several metrics, including BLEU, ROUGE, and distinct n-gram.^[3]
2019	August		NVIDIA introduces Megatron-LM, a transformer language model with 8.3 billion parameters, trained on 512 GPUs using 8-way model parallelism and 64-way data parallelism. It is the largest transformer model trained up to that time, at 24x the size of BERT and 5.6x the size of GPT-2. NVIDIA trains it on a 37 GB WebText dataset downloaded from Reddit, split 95:5 into training and validation sets. Megatron-LM surpasses previous state-of-the-art results in WikiText perplexity and LAMBADA accuracy. NVIDIA addresses the problem of training massive transformer models with model parallelism that splits the parameters across multiple GPUs without requiring a new compiler or code rewriting.^[4]
2020	April		Facebook AI Research (FAIR) introduces Megatron-11b, a unidirectional language model with 11B parameters based on Megatron-LM. The model is trained using intra-layer model parallelism, with each layer's parameters split across eight GPUs. Like RoBERTa, Megatron-11b is trained on a dataset that includes the original BERT dataset, consisting of English Wikipedia (12GB) and BookCorpus (4GB), as well as CC-News (76GB), which includes 63 million English news articles from September 2016 to February 2019, OpenWebText/Reddit upvoted (38GB), and Stories (31GB), which consists of 1M story documents from Common Crawl. The total size of the dataset used to train Megatron-11b is 161GB. Megatron-11b demonstrates strong performance on a range of language tasks, including language modeling and text generation, which underscores the potential of large-scale language models to advance natural language processing research.^[2]
2020	June 11		OpenAI releases Generative Pre-trained Transformer 3 (GPT-3) in beta.
2021	October	Transformer launch	NVIDIA and Microsoft jointly introduce Megatron-Turing NLG 530B, also known as MT-NLG, which is the successor to Microsoft Turing NLG 17B and NVIDIA Megatron-LM 8.3B. The MT-NLG model is three times larger than GPT-3, with 530 billion parameters, and is trained on over 4,000 GPUs. The dataset sources used to train the MT-NLG model include The Pile v1 plus an additional 14 datasets, such as Books3, OpenWebText2, Stack Exchange, PubMed Abstracts, and GitHub, among others. The total size of the dataset is estimated to be over 825GB, with a potential size of 1.86TB (1,863GB). The MT-NLG model represents a significant advancement in natural language generation and has the potential to be used in a wide range of applications.^[2]
2021	December	Transformer launch	Meta's AI lab (previously known as Facebook AI Research) introduces the language model Fairseq. The model is not related to Megatron and uses different technologies for training. Fairseq has 13B parameters and was trained using 2,363 GPU-days, which on an assumed 1,024 GPUs corresponds to between two and three days. Like RoBERTa and Megatron-11b, Fairseq was trained on a dataset that includes the original BERT dataset, which consists of English Wikipedia (12GB) and BookCorpus (4GB), as well as CC-News (76GB), OpenWebText/Reddit upvoted (38GB), and Stories (31GB), which consists of 1M story documents from Common Crawl. A new addition to the dataset is English CC100 in Wikipedia style from January 2018 to December 2018, which adds 292GB. The total size of the dataset used to train Fairseq is 453GB.^[2]
2022	June 14		Google releases all Switch Transformer models in T5X/JAX, including the 1.6T-parameter Switch-C and the 395B-parameter Switch-XXL models.^[5] The Switch Transformer replaces the regular feed-forward network (FFN) layer of the transformer architecture with a layer that holds multiple FFNs, also known as experts, as opposed to the single FFN found in a standard transformer layer.^[6]
2023	February 15		A paper presents a pilot study that evaluates the cognitive abilities of two recently released generative transformer models, ChatGPT and DALL-E 2, in decision-making and spatial reasoning. The study constructs input prompts following neutral a priori guidelines and finds that DALL-E 2 can generate at least one correct image for each spatial reasoning prompt, but most images generated are incorrect. Similarly, ChatGPT demonstrates some level of rational decision-making, but many of its decisions violate at least one of the axioms under the classical Von Neumann-Morgenstern utility theorem. ChatGPT's outputs tend to be unpredictable: it can make irrational decisions for simpler problems while drawing correct conclusions for more complex bet structures. The paper discusses the challenges of scaling up such cognitive evaluations for generative models and conducting them with a closed set of answer keys.^[7]
2023	February 18		A paper evaluates the performance of Generative Pre-trained Transformer (GPT) models for machine translation, covering the quality of different GPT models, the effect of prompting strategies, robustness towards domain shifts, and document-level translation. The experiments include eighteen translation directions involving high- and low-resource languages, as well as non-English-centric translations. The results show that GPT models achieve competitive translation quality for high-resource languages but have limited capabilities for low-resource languages. Hybrid approaches, which combine GPT models with other translation systems, can further enhance translation quality.^[8]
2023	March 7		A paper discusses using a single transformer neural network to predict ImageNet parameters of other neural networks, which can be used for initialization to boost training of diverse ImageNet models in PyTorch. This approach aims to democratize pretraining and can lead to faster convergence and competitive final performance in other datasets.^[9]
2023	March 7		The Graph Decision Transformer (GDT) is introduced as a novel offline reinforcement learning approach that models input sequences as causal graphs, allowing for better long-term dependency learning. GDT uses a graph transformer to process the graph inputs with relation-enhanced mechanisms and can optionally use a sequence transformer for fine-grained spatial information. The experiments demonstrate that GDT outperforms or matches state-of-the-art offline RL methods on image-based Atari and OpenAI Gym tasks.^[10]
2023	March 7		An article proposes a new framework called Event Voxel Set Transformer (EVSTr) for spatiotemporal representation learning on event streams, which are sparse and asynchronous event data from neuromorphic vision sensors. Unlike conventional methods that project event data into dense frames, EVSTr first converts it into a voxel set and hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder with a multi-scale neighbor embedding layer (MNEL) for local information aggregation and a voxel self-attention layer (VSAL) for global representation modeling. The framework incorporates a long-term temporal structure through a segmental consensus strategy for modeling motion patterns over a sequence of segmented voxel sets. EVSTr achieves state-of-the-art performance on object classification and action recognition tasks while maintaining low model complexity, and a new dataset (NeuroHAR) is presented for addressing the lack of real-world event-based datasets for action recognition.^[11]
2023	March 9		A paper proposes a method for diagnosing major depressive disorder (MDD) and predicting drug response in patients with MDD using EEG signals. The method utilizes transformers, which the paper describes as modified recursive neural networks with a novel architecture, to evaluate the time dependency of time series. The transformer model outperforms other deep learning schemes such as CNN, LSTM, and CNN-LSTM, achieving high accuracy and recall rates for both MDD diagnosis and drug response prediction. The proposed method has the potential to assist healthcare professionals in early diagnosis and treatment of MDD patients. The main novelty of the research is the use of transformers for EEG signal analysis.^[12]
2023	March 10		A paper introduces Exphormer, a framework for building scalable graph transformers using a sparse attention mechanism based on virtual global nodes and expander graphs. These mechanisms allow for linear complexity in the size of the graph while maintaining theoretical properties and competitive empirical results on various graph datasets. Exphormer is shown to be able to scale to larger datasets than previous graph transformer architectures.^[13]
2023	March 10	Research (application)	A paper focuses on measuring and detecting viral social media posts, particularly on Twitter. The authors use Twitter's "Viral Tweets" topic to create labeled datasets of tweets and propose a new metric to accurately represent viral tweets. They also develop a transformer-based model to predict viral tweets and provide access to the code and tweet IDs.^[14]
2023	March 13		An article discusses a transformer-based world model that is able to efficiently learn from real-world episodes in a reinforcement learning setting. The model uses a transformer to attend to compact latent states, taken actions, and experienced or predicted rewards at different time steps. This allows the world model to directly access previous states and learn long-term dependencies while remaining computationally efficient. The transformer-based world model generates new and meaningful experiences that can be used to train a policy, which outperforms previous reinforcement learning algorithms on the Atari 100k benchmark.^[15]
2023	March 13		A paper introduces a new approach for image dehazing using a depth-consistency self-prompt Transformer. The approach is designed to enforce depth consistency between hazy input images and their clear counterparts, which is important for effective dehazing. The proposed approach involves generating a prompt based on deep features extracted from the input images, which is then embedded into the model using a prompt embedding module. A prompt attention module is also introduced to pay more attention to haze residuals for better removal. The approach is shown to outperform state-of-the-art methods on both synthetic and real-world datasets in terms of perception metrics. Additionally, the paper proposes a new continuous self-prompt inference that can iteratively correct the dehazing model towards better haze-free image generation.^[16]
2023	March 13		A paper presents DEHRFormer, a real-time transformer for simultaneous depth estimation and haze removal from varicolored haze scenes. The proposed approach includes a single encoder and two task-specific decoders, with the transformer decoders designed to decode coupling features from the encoder and project them into a clean image and a depth map, respectively. A novel learning paradigm is introduced, which utilizes contrastive learning and domain consistency learning to tackle the weak-generalization problem for real-world dehazing while predicting the same depth map from the same scene with varicolored haze. Experiments demonstrate that DEHRFormer outperforms previous depth estimation networks and dehazing approaches across diverse varicolored haze scenes.^[17]
2023	March 17		CoLT5 is introduced as a new Transformer model that uses conditional computation to process long documents more efficiently by allocating more resources to important tokens. It achieves state-of-the-art performance on the long-input SCROLLS benchmark and can handle inputs of up to 64k tokens.^[18]
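
The June 2022 entry above describes the Switch Transformer as a layer holding several feed-forward networks (experts) instead of one. The following is a minimal sketch of that idea in plain Python, not Google's implementation. It assumes top-1 routing: a small router scores the experts for each token, the highest-scoring expert processes the token, and the output is scaled by the router probability. The class name SwitchFFN, the helper functions, the random initialization, and all sizes are hypothetical choices made for illustration.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def make_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters (assumption)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class SwitchFFN:
    """Hypothetical sketch: several expert FFNs with top-1 routing per token."""

    def __init__(self, d_model, d_hidden, num_experts, seed=0):
        rng = random.Random(seed)
        # Router: one score per expert for each token.
        self.router = make_matrix(num_experts, d_model, rng)
        # Each expert is an ordinary two-layer feed-forward network.
        self.experts = [
            (make_matrix(d_hidden, d_model, rng), make_matrix(d_model, d_hidden, rng))
            for _ in range(num_experts)
        ]

    def forward_token(self, token):
        """Route one token vector to a single expert and return (output, expert index)."""
        probs = softmax(matvec(self.router, token))
        expert_index = max(range(len(probs)), key=probs.__getitem__)
        w_in, w_out = self.experts[expert_index]
        hidden = [max(0.0, h) for h in matvec(w_in, token)]  # ReLU
        output = matvec(w_out, hidden)
        # Scale by the router probability of the chosen expert.
        gate = probs[expert_index]
        return [gate * y for y in output], expert_index


if __name__ == "__main__":
    layer = SwitchFFN(d_model=4, d_hidden=8, num_experts=3, seed=1)
    output, chosen = layer.forward_token([0.1, -0.2, 0.3, 0.4])
    print("chosen expert:", chosen)
    print("output:", output)
```

Because only one expert runs per token, adding experts increases the parameter count while the work done per token stays roughly constant, which is how such a layer can reach very large parameter counts.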

Meta information on the timeline

How the timeline was built

The initial version of the timeline was written by FIXME.

Funding information for this timeline is available.

Feedback and comments

Feedback for the timeline can be provided at the following places:

FIXME

What the timeline is still missing

Timeline update strategy

External links

References

[1] Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (11 June 2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. p. 12. Archived from the original (PDF) on 26 January 2021. Retrieved 23 January 2021.
[2] "AI: Megatron the Transformer, and its related language models". lifearchitect.ai. 24 September 2021. Retrieved 12 March 2023.
[3] Olabiyi, Oluwatobi; Mueller, Erik T. (2019). "DLGNet: A Transformer-based Model for Dialogue Response Generation". doi:10.48550/arXiv.1908.01841.
[4] "Megatron Unleashed: NVIDIA's NLP Model "Megatron-LM" is the Largest Transformer Ever Trained | Exxact Blog". www.exxactcorp.com. Retrieved 11 March 2023.
[5] "https://twitter.com/LiamFedus/status/1536791574612303872". Twitter. Retrieved 8 March 2023.
[6] Davis, Jonathan (2 May 2021). "Understanding Google's Switch Transformer". Medium. Retrieved 8 March 2023.
[7] Tang, Zhisheng; Kejriwal, Mayank (15 February 2023). "A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning". arXiv:2302.09068 [cs]. doi:10.48550/arXiv.2302.09068. Retrieved 7 March 2023.
[8] Hendy, Amr; Abdelrehim, Mohamed; Sharaf, Amr; Raunak, Vikas; Gabr, Mohamed; Matsushita, Hitokazu; Kim, Young Jin; Afify, Mohamed; Awadalla, Hany Hassan (17 February 2023). "How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation". arXiv:2302.09210 [cs]. doi:10.48550/arXiv.2302.09210.
[9] Knyazev, Boris; Hwang, Doha; Lacoste-Julien, Simon (2023). "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?". doi:10.48550/arXiv.2303.04143.
[10] Hu, Shengchao; Shen, Li; Zhang, Ya; Tao, Dacheng (2023). "Graph Decision Transformer". doi:10.48550/arXiv.2303.03747.
[11] Xie, Bochen; Deng, Yongjian; Shao, Zhanpeng; Liu, Hai; Xu, Qingsong; Li, Youfu (2023). "Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams". doi:10.48550/arXiv.2303.03856.
[12] Saeedi, Abdolkarim; Maghsoudi, Arash; Rahatabad, Fereidoun Nowshiravan (2023). "Depression Diagnosis and Drug Response Prediction via Recurrent Neural Networks and Transformers Utilizing EEG Signals". doi:10.48550/arXiv.2303.06033.
[13] Shirzad, Hamed; Velingker, Ameya; Venkatachalam, Balaji; Sutherland, Danica J.; Sinop, Ali Kemal (2023). "Exphormer: Sparse Transformers for Graphs". doi:10.48550/arXiv.2303.06147.
[14] Elmas, Tuğrulcan; Selim, Stephane; Houssiaux, Célia (2023). "Measuring and Detecting Virality on Social Media: The Case of Twitter's Viral Tweets Topic". doi:10.48550/arXiv.2303.06120.
[15] Robine, Jan; Höftmann, Marc; Uelwer, Tobias; Harmeling, Stefan (2023). "Transformer-based World Models Are Happy With 100k Interactions". doi:10.48550/arXiv.2303.07109.
[16] Wang, Cong; Pan, Jinshan; Lin, Wanyu; Dong, Jiangxin; Wu, Xiao-Ming (2023). "SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency". doi:10.48550/arXiv.2303.07033.
[17] Chen, Sixiang; Ye, Tian; Shi, Jun; Liu, Yun; Jiang, JingXia; Chen, Erkang; Chen, Peng (2023). "DEHRFormer: Real-time Transformer for Depth Estimation and Haze Removal from Varicolored Haze Scenes". doi:10.48550/arXiv.2303.06905.
[18] Ainslie, Joshua; Lei, Tao; de Jong, Michiel; Ontañón, Santiago; Brahma, Siddhartha; Zemlyanskiy, Yury; Uthus, David; Guo, Mandy; Lee-Thorp, James; Tay, Yi; Sung, Yun-Hsuan; Sanghai, Sumit (2023). "CoLT5: Faster Long-Range Transformers with Conditional Computation". doi:10.48550/arXiv.2303.09752.
