Timeline of transformers
This is a timeline of transformers.
|Year||Month and date||Event type||Details|
|2017||June||Google researchers first describe the transformer architecture, in the paper "Attention Is All You Need"; the architecture would go on to power modern chatbots.|
|2018||June 11||OpenAI releases a paper entitled Improving Language Understanding by Generative Pre-Training, in which it introduces the Generative Pre-trained Transformer (GPT).|
|2018||November||Google open-sources the BERT (Bidirectional Encoder Representations from Transformers) model, which was trained in just four days. The model has 345M parameters and was trained on a dataset that includes English Wikipedia (12GB) and BookCorpus (4GB), for a total size of 16GB.|
|2019||February 14||OpenAI releases Generative Pre-trained Transformer 2 (GPT-2).|
|2019||July||Researchers from Facebook AI (FAIR) and the University of Washington jointly introduce RoBERTa (Robustly optimized BERT approach), a language model based on the BERT architecture. RoBERTa has 125M parameters in its base configuration. It was trained on a large and diverse dataset comprising the original BERT data, English Wikipedia (12GB) and BookCorpus (4GB), plus CC-News (76GB, 63 million English news articles from September 2016 to February 2019), OpenWebText/Reddit upvoted (38GB), and Stories (31GB, 1M story documents from Common Crawl). The total size of the dataset used to train RoBERTa is 161GB.|
|2019||July 26||A paper describes DLGNet, a transformer-based model for dialogue response generation. By this time, the use of transformer-based models such as GPT-2 has shown significant improvements in capturing long-range structures in language modeling tasks. DLGNet uses a combination of the long-range transformer architecture, injection of random informative paddings, joint modeling of dialogue context and response, and 100% tokenization coverage from byte pair encoding. DLGNet outperforms state-of-the-art multi-turn dialogue models on Movie Triples and Ubuntu Dialogue datasets, achieving best performance to date based on several metrics, including BLEU, ROUGE, and distinct n-gram.|
|2019||August||NVIDIA introduces Megatron-LM, a transformer language model with 8.3 billion parameters, trained on 512 GPUs with 8-way model parallelism and 64-way data parallelism. At the time, it is the largest transformer model ever trained, at 24x the size of BERT and 5.6x the size of GPT-2. NVIDIA trains it on a 37GB WebText dataset downloaded from Reddit, split 95:5 into training and validation sets. Megatron-LM surpasses previous state-of-the-art results in WikiText perplexity and LAMBADA accuracy. NVIDIA addresses the problem of training massive transformer models with model parallelism, which splits the parameters across multiple GPUs without requiring a new compiler or code re-wiring.|
|2020||April||Facebook AI Research (FAIR) introduces Megatron-11b, a unidirectional language model with 11B parameters based on Megatron-LM. The model was trained using intra-layer model parallelism, with each layer's parameters split across eight GPUs. Like RoBERTa, Megatron-11b was trained on the original BERT data, English Wikipedia (12GB) and BookCorpus (4GB), plus CC-News (76GB, 63 million English news articles from September 2016 to February 2019), OpenWebText/Reddit upvoted (38GB), and Stories (31GB, 1M story documents from Common Crawl). The total size of the dataset used to train Megatron-11b is 161GB. Megatron-11b demonstrates strong performance on a range of language tasks, including language modeling and text generation, underscoring the potential of large-scale language models to advance natural language processing research.|
|2020||June 11||OpenAI releases Generative Pre-trained Transformer 3 (GPT-3) in beta.|
|2021||October||Transformer launch||NVIDIA and Microsoft jointly introduce Megatron-Turing NLG 530B, also known as MT-NLG, which is the successor to Microsoft Turing NLG 17B and NVIDIA Megatron-LM 8.3B. The MT-NLG model is three times larger than GPT-3, with 530 billion parameters, and is trained on over 4,000 GPUs. The dataset sources used to train the MT-NLG model include The Pile v1 plus an additional 14 datasets, such as Books3, OpenWebText2, Stack Exchange, PubMed Abstracts, and GitHub, among others. The total size of the dataset is estimated to be over 825GB, with a potential size of 1.86TB (1,863GB). The MT-NLG model represents a significant advancement in natural language generation and has the potential to be used in a wide range of applications.|
|2021||December||Transformer launch||Meta AI (previously known as Facebook AI Research) introduces the language model Fairseq. The model is not related to Megatron and uses different technologies for training. Fairseq has 13B parameters and was trained using 2,363 GPU-days, assuming 1,024 GPUs for a total of around three days. Like RoBERTa and Megatron-11b, Fairseq was trained on the original BERT data, English Wikipedia (12GB) and BookCorpus (4GB), plus CC-News (76GB), OpenWebText/Reddit upvoted (38GB), and Stories (31GB, 1M story documents from Common Crawl). A new addition is English CC100 in Wikipedia style from January 2018 to December 2018, which adds 292GB. The total size of the dataset used to train Fairseq is 453GB.|
|2022||June 14||Google releases all Switch Transformer models in T5X/JAX, including the 1.6T param Switch-C and the 395B param Switch-XXL models. The Switch Transformer is a type of neural network layer used in the transformer architecture, which replaces the regular feed-forward neural network (FFN) layer. The Switch Transformer is unique because it has multiple FFNs, also known as experts, in each layer, as opposed to the single FFN found in a standard transformer layer.|
|2023||February 15||A paper presents a pilot study that evaluates the cognitive abilities of two recently released generative transformer models, ChatGPT and DALL-E 2, in decision-making and spatial reasoning. The study constructs input prompts following neutral a priori guidelines and finds that DALL-E 2 can generate at least one correct image for each spatial reasoning prompt, but most images generated are incorrect. Similarly, ChatGPT demonstrates some level of rational decision-making, but many of its decisions violate at least one of the axioms under the classical Von Neumann-Morgenstern utility theorem. ChatGPT's outputs tend to be unpredictable: it can make irrational decisions on simpler problems while drawing correct conclusions for more complex bet structures. The paper discusses the challenges of scaling up such cognitive evaluations for generative models and conducting them with a closed set of answer keys.|
|2023||February 18||A paper evaluates the performance of Generative Pre-trained Transformer (GPT) models for machine translation, covering the quality of different GPT models, the effect of prompting strategies, robustness to domain shifts, and document-level translation. The experiments cover eighteen translation directions involving high- and low-resource languages, as well as non-English-centric translations. The results show that GPT models achieve competitive translation quality for high-resource languages, while having limited capabilities for low-resource languages. Hybrid approaches, which combine GPT models with other translation systems, can further enhance translation quality. The paper aims to help researchers and practitioners understand the potential and limitations of GPT models for translation.|
|2023||March 7||A paper discusses using a single transformer neural network to predict ImageNet parameters of other neural networks, which can be used for initialization to boost training of diverse ImageNet models in PyTorch. This approach aims to democratize pretraining and can lead to faster convergence and competitive final performance in other datasets.|
|2023||March 7||The Graph Decision Transformer (GDT) is introduced as a novel offline reinforcement learning approach that models input sequences as causal graphs, allowing for better long-term dependency learning. GDT uses a graph transformer to process the graph inputs with relation-enhanced mechanisms and can optionally use a sequence transformer for fine-grained spatial information. The experiments demonstrate that GDT outperforms or matches state-of-the-art offline RL methods on image-based Atari and OpenAI Gym tasks.|
|2023||March 7||An article proposes a new framework called Event Voxel Set Transformer (EVSTr) for spatiotemporal representation learning on event streams, which are sparse and asynchronous event data from neuromorphic vision sensors. Unlike conventional methods that project event data into dense frames, EVSTr first converts it into a voxel set and hierarchically aggregates voxel features to obtain robust representations. The core of EVSTr is an event voxel transformer encoder with a multi-scale neighbor embedding layer (MNEL) for local information aggregation and a voxel self-attention layer (VSAL) for global representation modeling. The framework incorporates a long-term temporal structure through a segmental consensus strategy for modeling motion patterns over a sequence of segmented voxel sets. EVSTr achieves state-of-the-art performance on object classification and action recognition tasks while maintaining low model complexity, and a new dataset (NeuroHAR) is presented for addressing the lack of real-world event-based datasets for action recognition.|
|2023||March 9||A paper proposes a method for diagnosing major depressive disorder (MDD) and predicting drug response in patients with MDD using EEG signals. The method utilizes transformers, which the authors describe as modified recursive neural networks with a novel architecture, to effectively evaluate the time dependency of time series. The transformer model outperforms other deep learning schemes such as CNN, LSTM, and CNN-LSTM, achieving high accuracy and recall rates for both MDD diagnosis and drug response prediction. The proposed method has the potential to assist healthcare professionals in early diagnosis and treatment of MDD patients. The main novelty of the research is the use of transformers for EEG signal analysis.|
|2023||March 10||A paper introduces Exphormer, a framework for building scalable graph transformers using a sparse attention mechanism based on virtual global nodes and expander graphs. These mechanisms allow for linear complexity in the size of the graph while maintaining theoretical properties and competitive empirical results on various graph datasets. Exphormer is shown to be able to scale to larger datasets than previous graph transformer architectures.|
|2023||March 10||Research (application)||A paper focuses on measuring and detecting viral social media posts, particularly on Twitter. The authors use Twitter's "Viral Tweets" topic to create labeled datasets of tweets and propose a new metric to accurately represent viral tweets. They also develop a transformers-based model to predict viral tweets and provide access to the code and tweet IDs.|
|2023||March 13||An article discusses a transformer-based world model that is able to efficiently learn from real-world episodes in a reinforcement learning setting. The model uses a transformer to attend to compact latent states, taken actions, and experienced or predicted rewards at different time steps. This allows the world model to directly access previous states and learn long-term dependencies while remaining computationally efficient. The transformer-based world model generates new and meaningful experiences that can be used to train a policy, which outperforms previous reinforcement learning algorithms on the Atari 100k benchmark.|
|2023||March 13||A paper introduces a new approach for image dehazing using a depth-consistency self-prompt Transformer. The approach is designed to enforce depth consistency between hazy input images and their clear counterparts, which is important for effective dehazing. The proposed approach involves generating a prompt based on deep features extracted from the input images, which is then embedded into the model using a prompt embedding module. A prompt attention module is also introduced to pay more attention to haze residuals for better removal. The approach is shown to outperform state-of-the-art methods on both synthetic and real-world datasets in terms of perception metrics. Additionally, the paper proposes a new continuous self-prompt inference that can iteratively correct the dehazing model towards better haze-free image generation.|
|2023||March 13||A paper presents DEHRFormer, a real-time transformer for simultaneous depth estimation and haze removal from varicolored haze scenes. The proposed approach includes a single encoder and two task-specific decoders, with the transformer decoders designed to decode coupling features from the encoder and project them into clean image and depth map, respectively. A novel learning paradigm is introduced, which utilizes contrastive learning and domain consistency learning to tackle the weak-generalization problem for real-world dehazing while predicting the same depth map from the same scene with varicolored haze. Experiments demonstrate that DEHRFormer outperforms previous depth estimation networks and dehazing approaches across diverse varicolored haze scenes.|
|2023||March 17||CoLT5 is introduced as a new Transformer model that uses conditional computation to process long documents more efficiently by allocating more resources to important tokens. It achieves state-of-the-art performance on the long-input SCROLLS benchmark and can handle inputs up to 64k in length.|
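The June 2022 entry above describes the Switch Transformer as a layer holding several feed-forward networks (experts) in place of the single FFN of a standard transformer layer. The following is a minimal sketch of that idea in plain Python, not Google's T5X/JAX implementation. It assumes a learned router that sends each token to its single highest-scoring expert and scales the expert output by the router probability; that routing rule, along with the names `make_expert`, `switch_layer`, and `router_weights` and all sizes, is an illustrative assumption rather than something stated in the timeline.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def make_expert(dim, hidden, rng):
    """Build one tiny feed-forward expert: linear -> ReLU -> linear (no biases)."""
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(dim)]

    def expert(x):
        h = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in w_in]
        return [sum(w * hi for w, hi in zip(row, h)) for row in w_out]

    return expert


def switch_layer(tokens, router_weights, experts):
    """Send each token to exactly one expert (assumed top-1 routing)."""
    outputs, choices = [], []
    for x in tokens:
        # One router score per expert for this token.
        logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
        probs = softmax(logits)
        k = max(range(len(experts)), key=probs.__getitem__)
        y = experts[k](x)
        # Scale the chosen expert's output by its router probability.
        outputs.append([probs[k] * yi for yi in y])
        choices.append(k)
    return outputs, choices


# Hypothetical toy sizes, chosen only for illustration.
rng = random.Random(0)
DIM, HIDDEN, NUM_EXPERTS = 4, 8, 3
experts = [make_expert(DIM, HIDDEN, rng) for _ in range(NUM_EXPERTS)]
router_weights = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]

outputs, choices = switch_layer(tokens, router_weights, experts)
print(choices)  # index of the expert that processed each token
```

Because each token runs through only one expert, adding experts increases the parameter count without increasing the per-token computation, which is how a layout of this kind can reach very large parameter counts.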
Meta information on the timeline
What the timeline is still missing
- Generative pre-trained transformer
References
- Radford, Alec; Narasimhan, Karthik; Salimans, Tim; Sutskever, Ilya (11 June 2018). "Improving Language Understanding by Generative Pre-Training" (PDF). OpenAI. p. 12. Archived from the original (PDF) on 26 January 2021. Retrieved 23 January 2021.
- "AI: Megatron the Transformer, and its related language models". lifearchitect.ai. 24 September 2021. Retrieved 12 March 2023.
- Olabiyi, Oluwatobi; Mueller, Erik T. (2019). "DLGNet: A Transformer-based Model for Dialogue Response Generation". doi:10.48550/arXiv.1908.01841.
- "Megatron Unleashed: NVIDIA's NLP Model "Megatron-LM" is the Largest Transformer Ever Trained | Exxact Blog". www.exxactcorp.com. Retrieved 11 March 2023.
- "https://twitter.com/LiamFedus/status/1536791574612303872". Twitter. Retrieved 8 March 2023.
- Davis, Jonathan (2 May 2021). "Understanding Google's Switch Transformer". Medium. Retrieved 8 March 2023.
- Tang, Zhisheng; Kejriwal, Mayank (15 February 2023). "A Pilot Evaluation of ChatGPT and DALL-E 2 on Decision Making and Spatial Reasoning". arXiv:2302.09068 [cs]. doi:10.48550/arXiv.2302.09068. Retrieved 7 March 2023.
- Hendy, Amr; Abdelrehim, Mohamed; Sharaf, Amr; Raunak, Vikas; Gabr, Mohamed; Matsushita, Hitokazu; Kim, Young Jin; Afify, Mohamed; Awadalla, Hany Hassan (17 February 2023). "How Good Are GPT Models at Machine Translation? A Comprehensive Evaluation". arXiv:2302.09210 [cs]. doi:10.48550/arXiv.2302.09210.
- Knyazev, Boris; Hwang, Doha; Lacoste-Julien, Simon (2023). "Can We Scale Transformers to Predict Parameters of Diverse ImageNet Models?". doi:10.48550/arXiv.2303.04143.
- Hu, Shengchao; Shen, Li; Zhang, Ya; Tao, Dacheng (2023). "Graph Decision Transformer". doi:10.48550/arXiv.2303.03747.
- Xie, Bochen; Deng, Yongjian; Shao, Zhanpeng; Liu, Hai; Xu, Qingsong; Li, Youfu (2023). "Event Voxel Set Transformer for Spatiotemporal Representation Learning on Event Streams". doi:10.48550/arXiv.2303.03856.
- Saeedi, Abdolkarim; Maghsoudi, Arash; Rahatabad, Fereidoun Nowshiravan (2023). "Depression Diagnosis and Drug Response Prediction via Recurrent Neural Networks and Transformers Utilizing EEG Signals". doi:10.48550/arXiv.2303.06033.
- Shirzad, Hamed; Velingker, Ameya; Venkatachalam, Balaji; Sutherland, Danica J.; Sinop, Ali Kemal (2023). "Exphormer: Sparse Transformers for Graphs". doi:10.48550/arXiv.2303.06147.
- Elmas, Tuğrulcan; Selim, Stephane; Houssiaux, Célia (2023). "Measuring and Detecting Virality on Social Media: The Case of Twitter's Viral Tweets Topic". doi:10.48550/arXiv.2303.06120.
- Robine, Jan; Höftmann, Marc; Uelwer, Tobias; Harmeling, Stefan (2023). "Transformer-based World Models Are Happy With 100k Interactions". doi:10.48550/arXiv.2303.07109.
- Wang, Cong; Pan, Jinshan; Lin, Wanyu; Dong, Jiangxin; Wu, Xiao-Ming (2023). "SelfPromer: Self-Prompt Dehazing Transformers with Depth-Consistency". doi:10.48550/arXiv.2303.07033.
- Chen, Sixiang; Ye, Tian; Shi, Jun; Liu, Yun; Jiang, JingXia; Chen, Erkang; Chen, Peng (2023). "DEHRFormer: Real-time Transformer for Depth Estimation and Haze Removal from Varicolored Haze Scenes". doi:10.48550/arXiv.2303.06905.
- Ainslie, Joshua; Lei, Tao; de Jong, Michiel; Ontañón, Santiago; Brahma, Siddhartha; Zemlyanskiy, Yury; Uthus, David; Guo, Mandy; Lee-Thorp, James; Tay, Yi; Sung, Yun-Hsuan; Sanghai, Sumit (2023). "CoLT5: Faster Long-Range Transformers with Conditional Computation". doi:10.48550/arXiv.2303.09752.