Difference between revisions of "Talk:Timeline of large language models"

 
! Year !! Month and date !! Model name !! Number of parameters !! Event type !! Details
 
|-
| 2020 || March 10 || ELECTRA || || || {{w|Google}} researchers introduce ELECTRA (Efficiently Learning an Encoder that Classifies Token Replacements Accurately), a novel pre-training method for natural language processing (NLP) models. ELECTRA aims to achieve the benefits of models like BERT while being more computationally efficient. It introduces a replaced token detection (RTD) task, inspired by generative adversarial networks (GANs), where the model distinguishes between "real" and "fake" input data. Unlike previous methods that predict a small subset of masked tokens, ELECTRA applies the {{w|binary classification}} task to every input token, resulting in more efficient learning. The replacement tokens are generated by a separate {{w|neural network}} called the generator, which is trained jointly with the {{w|discriminator}} (ELECTRA model). After pre-training, the generator is dropped, and the discriminator is fine-tuned on specific NLP tasks. ELECTRA achieves state-of-the-art results on benchmarks like GLUE and SQuAD while using less compute than models like RoBERTa and XLNet. It is released as an open-source model on {{w|TensorFlow}}, supporting tasks such as text classification, question answering, and [[w:Sequence labeling|sequence tagging]]. Pre-trained weights are also provided for ELECTRA-Large, ELECTRA-Base, and ELECTRA-Small.<ref>{{cite web |title=More Efficient NLP Model Pre-training with ELECTRA |url=https://ai.googleblog.com/2020/03/more-efficient-nlp-model-pre-training.html |website=ai.googleblog.com |access-date=28 June 2023 |language=en |date=10 March 2020}}</ref>
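As an illustrative sketch (not code from the paper or release), the replaced-token-detection objective amounts to a binary cross-entropy scored over every input position, rather than over the ~15% of positions a masked language model predicts:

```python
import math

def rtd_loss(p_replaced, is_replaced):
    """Replaced-token-detection loss: binary cross-entropy over EVERY
    input position. p_replaced holds the discriminator's probability
    that each token was swapped in by the generator; is_replaced holds
    the true labels (1 = replaced, 0 = original)."""
    assert len(p_replaced) == len(is_replaced)
    total = 0.0
    for p, y in zip(p_replaced, is_replaced):
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(p_replaced)

# Toy example: four tokens, two of which were replaced by the generator.
loss = rtd_loss([0.9, 0.1, 0.2, 0.8], [1, 0, 0, 1])
```

Because every token contributes to the loss, each training example yields a denser learning signal than masked-token prediction, which is the source of ELECTRA's compute efficiency.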
|-
| 2018 || April 1 || Marian || || Early development || A paper introduces Marian, a highly efficient Neural Machine Translation (NMT) framework written entirely in C++. The framework includes an integrated automatic differentiation engine based on dynamic computation graphs. The authors discuss the design of the encoder-decoder framework and demonstrate that Marian, as a research-friendly toolkit, achieves fast training and translation speeds, making it a valuable tool for NMT research and development.<ref>{{cite journal |last1=Junczys-Dowmunt |first1=Marcin |last2=Grundkiewicz |first2=Roman |last3=Dwojak |first3=Tomasz |last4=Hoang |first4=Hieu |last5=Heafield |first5=Kenneth |last6=Neckermann |first6=Tom |last7=Seide |first7=Frank |last8=Germann |first8=Ulrich |last9=Aji |first9=Alham Fikri |last10=Bogoychev |first10=Nikolay |last11=Martins |first11=André F. T. |last12=Birch |first12=Alexandra |title=Marian: Fast Neural Machine Translation in C++ |date=2018 |doi=10.48550/arXiv.1804.00344}}</ref> NMT models, like those used in Marian, form a significant component of large language models.
|-
 
| 2022 || March 29 || || || || A paper investigates the optimal model size and number of tokens for training a transformer language model under a given compute budget. The researchers find that current large language models are significantly undertrained, and the model size and the number of training tokens should be scaled equally for compute-optimal training. They test this hypothesis by training a predicted compute-optimal model, Chinchilla, that uses the same compute budget as Gopher but with 70B parameters and 4x more data. Chinchilla outperforms Gopher, GPT-3, Jurassic-1, and Megatron-Turing NLG on a range of downstream evaluation tasks and reaches a state-of-the-art average accuracy of 67.5% on the MMLU benchmark, more than a 7% improvement over Gopher.<ref>{{cite journal |last1=Hoffmann |first1=Jordan |last2=Borgeaud |first2=Sebastian |last3=Mensch |first3=Arthur |last4=Buchatskaya |first4=Elena |last5=Cai |first5=Trevor |last6=Rutherford |first6=Eliza |last7=Casas |first7=Diego de Las |last8=Hendricks |first8=Lisa Anne |last9=Welbl |first9=Johannes |last10=Clark |first10=Aidan |last11=Hennigan |first11=Tom |last12=Noland |first12=Eric |last13=Millican |first13=Katie |last14=Driessche |first14=George van den |last15=Damoc |first15=Bogdan |last16=Guy |first16=Aurelia |last17=Osindero |first17=Simon |last18=Simonyan |first18=Karen |last19=Elsen |first19=Erich |last20=Rae |first20=Jack W. |last21=Vinyals |first21=Oriol |last22=Sifre |first22=Laurent |title=Training Compute-Optimal Large Language Models |date=2022 |doi=10.48550/arXiv.2203.15556}}</ref>
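A back-of-envelope sketch of the paper's scaling rule, using the common approximation C ≈ 6·N·D for training FLOPs and an illustrative tokens-per-parameter ratio of roughly 20 (the exact ratio used here is an assumption for illustration, not a figure quoted from the paper):

```python
import math

def compute_optimal(compute_budget_flops, tokens_per_param=20):
    """Split a FLOP budget C between parameters N and training tokens D,
    assuming C ~= 6*N*D and that N and D should scale in equal
    proportion, as the Chinchilla paper argues. With D = r*N, the
    budget gives C = 6*r*N**2, so N = sqrt(C / (6*r))."""
    n = math.sqrt(compute_budget_flops / (6 * tokens_per_param))
    d = tokens_per_param * n
    return n, d

# A budget near Chinchilla's own run lands close to its ~70B parameters
# and ~1.4T training tokens.
n, d = compute_optimal(5.76e23)
```

The point of the rule is that, for a fixed budget, a smaller model trained on more tokens (Chinchilla) beats a larger undertrained one (Gopher).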
 
|-
 
| 2020 || May 28 || || || || A paper discusses the use of language models in few-shot learning, where a model is trained on a large corpus of text and then fine-tuned for a specific task. The authors demonstrate that scaling up language models greatly improves task-agnostic, few-shot performance. They trained GPT-3, a language model with 175 billion parameters, and tested its performance in the few-shot setting. GPT-3 achieved strong performance on many NLP tasks, including translation, question-answering, and cloze tasks, as well as tasks that require on-the-fly reasoning or domain adaptation. However, the authors also identify some datasets where GPT-3's few-shot learning struggles, as well as methodological issues related to training on large web corpora. The paper also discusses the broader societal impacts of this finding and of GPT-3 in general.<ref>{{cite journal |last1=Brown |first1=Tom B. |last2=Mann |first2=Benjamin |last3=Ryder |first3=Nick |last4=Subbiah |first4=Melanie |last5=Kaplan |first5=Jared |last6=Dhariwal |first6=Prafulla |last7=Neelakantan |first7=Arvind |last8=Shyam |first8=Pranav |last9=Sastry |first9=Girish |last10=Askell |first10=Amanda |last11=Agarwal |first11=Sandhini |last12=Herbert-Voss |first12=Ariel |last13=Krueger |first13=Gretchen |last14=Henighan |first14=Tom |last15=Child |first15=Rewon |last16=Ramesh |first16=Aditya |last17=Ziegler |first17=Daniel M. |last18=Wu |first18=Jeffrey |last19=Winter |first19=Clemens |last20=Hesse |first20=Christopher |last21=Chen |first21=Mark |last22=Sigler |first22=Eric |last23=Litwin |first23=Mateusz |last24=Gray |first24=Scott |last25=Chess |first25=Benjamin |last26=Clark |first26=Jack |last27=Berner |first27=Christopher |last28=McCandlish |first28=Sam |last29=Radford |first29=Alec |last30=Sutskever |first30=Ilya |last31=Amodei |first31=Dario |title=Language Models are Few-Shot Learners |date=2020 |doi=10.48550/arXiv.2005.14165}}</ref>
 
|-
 
| 2020 || July || || || || A paper discusses the limitations of neural text generation models in open-ended tasks like language modeling and story generation, due to the standard likelihood training and approximate decoding objectives. The authors specifically analyze these limitations for abstractive document summarization and find that such models tend to hallucinate content that is unfaithful to the input document. The paper presents the results of a human evaluation of several neural abstractive summarization systems, highlighting the substantial amount of hallucinated content in all model-generated summaries. However, the authors also show that pretrained models perform better in terms of generating faithful and factual summaries, as evaluated by humans. They propose that textual entailment measures may be a better evaluation metric for faithfulness than standard metrics, leading to better training and decoding criteria.<ref>{{cite journal |last1=Maynez |first1=Joshua |last2=Narayan |first2=Shashi |last3=Bohnet |first3=Bernd |last4=McDonald |first4=Ryan |title=On Faithfulness and Factuality in Abstractive Summarization |journal=Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics |date=July 2020 |pages=1906–1919 |doi=10.18653/v1/2020.acl-main.173 |url=https://aclanthology.org/2020.acl-main.173/ |publisher=Association for Computational Linguistics}}</ref>
 
|-
 
| 2022 || April 12 || || || || A paper describes a method for training language models to act as helpful and harmless assistants using {{w|reinforcement learning}} from human feedback. The authors demonstrate that this alignment training improves performance on almost all natural language processing evaluations and is compatible with training for specialized skills such as python coding and summarization. They explore an iterated online mode of training and investigate the robustness of the approach, identifying a linear relationship between the RL reward and the square root of the {{w|Kullback–Leibler divergence}} between the policy and its initialization. The authors also perform peripheral analyses and provide samples from their models using prompts from recent related work.<ref>{{cite journal |last1=Bai |first1=Yuntao |last2=Jones |first2=Andy |last3=Ndousse |first3=Kamal |last4=Askell |first4=Amanda |last5=Chen |first5=Anna |last6=DasSarma |first6=Nova |last7=Drain |first7=Dawn |last8=Fort |first8=Stanislav |last9=Ganguli |first9=Deep |last10=Henighan |first10=Tom |last11=Joseph |first11=Nicholas |last12=Kadavath |first12=Saurav |last13=Kernion |first13=Jackson |last14=Conerly |first14=Tom |last15=El-Showk |first15=Sheer |last16=Elhage |first16=Nelson |last17=Hatfield-Dodds |first17=Zac |last18=Hernandez |first18=Danny |last19=Hume |first19=Tristan |last20=Johnston |first20=Scott |last21=Kravec |first21=Shauna |last22=Lovitt |first22=Liane |last23=Nanda |first23=Neel |last24=Olsson |first24=Catherine |last25=Amodei |first25=Dario |last26=Brown |first26=Tom |last27=Clark |first27=Jack |last28=McCandlish |first28=Sam |last29=Olah |first29=Chris |last30=Mann |first30=Ben |last31=Kaplan |first31=Jared |title=Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback |date=2022 |doi=10.48550/arXiv.2204.05862}}</ref>
 
 
|-
 
| 2022 || June 2 || || || || {{w|OpenAI}} publishes a blog post on the development of best practices for organizations developing or deploying large language models. The principles include prohibiting misuse of language models, mitigating unintentional harm by evaluating models, minimizing sources of bias, and collaborating with stakeholders. These practices are meant to mitigate the risks of language models and achieve their full potential to augment human capabilities. The authors express hope that other organizations will adopt these principles and advance public discussion on language model development and deployment. Endorsement from other organizations reflects growing public concern over the safety of LLMs.<ref>{{cite web |title=Best practices for deploying language models |url=https://openai.com/blog/best-practices-for-deploying-language-models |website=openai.com |access-date=17 March 2023}}</ref>
 
 
|-
 
| 2023 || February 17 || || || Research || A paper surveys the state of the art in hybrid language model architectures and strategies for complex question-answering (QA, CQA, CPS). While very large language models are good at leveraging public data on standard problems, they may require specific architecture, knowledge, skills, tasks, methods, sensitive data, performance, human approval, and versatile feedback to tackle more specific complex questions or problems. The paper identifies the key elements used with LLMs to solve complex questions or problems and discusses challenges associated with complex QA. The paper also reviews current solutions and promising strategies, using elements such as hybrid LLM architectures, human-in-the-loop reinforcement learning, prompting adaptation, neuro-symbolic and structured knowledge grounding, {{w|program synthesis}}, and others.<ref>{{cite journal |last1=Daull |first1=Xavier |last2=Bellot |first2=Patrice |last3=Bruno |first3=Emmanuel |last4=Martin |first4=Vincent |last5=Murisasco |first5=Elisabeth |title=Complex QA and language models hybrid architectures, Survey |journal=arXiv:2302.09051 [cs] |date=17 February 2023 |doi=10.48550/arXiv.2302.09051 |url=https://arxiv.org/abs/2302.09051}}</ref>
 
|-
 
| 2023 || February 21 || || || Research || A paper presents a catalog of {{w|prompt engineering}} techniques in pattern form that have been applied successfully to solve common problems when conversing with large language models (LLMs), such as {{w|ChatGPT}}. Prompt patterns are reusable solutions to common problems faced when working with LLMs that can customize the outputs and interactions with an LLM. The paper provides a framework for documenting patterns for structuring prompts to solve a range of problems and presents a catalog of patterns that have been applied successfully to improve the outputs of LLM conversations. It also explains how prompts can be built from multiple patterns and illustrates prompt patterns that benefit from combination with other prompt patterns. The paper contributes to research on prompt engineering that applies LLMs to automate software development tasks.<ref>{{cite journal |last1=White |first1=Jules |last2=Fu |first2=Quchen |last3=Hays |first3=Sam |last4=Sandborn |first4=Michael |last5=Olea |first5=Carlos |last6=Gilbert |first6=Henry |last7=Elnashar |first7=Ashraf |last8=Spencer-Smith |first8=Jesse |last9=Schmidt |first9=Douglas C. |title=A Prompt Pattern Catalog to Enhance Prompt Engineering with ChatGPT |journal=arXiv:2302.11382 [cs] |date=21 February 2023 |doi=10.48550/arXiv.2302.11382 |url=https://arxiv.org/abs/2302.11382}}</ref>
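As an illustrative sketch, a prompt pattern can be captured as a small reusable template; the "persona" wording below shows the general idea and is not a verbatim pattern from the paper's catalog:

```python
def persona_prompt(persona, task):
    """Instantiate a simple persona-style prompt pattern: a reusable
    template that fixes the model's role before stating the task, so
    the same structure can be re-applied across many problems."""
    return (f"From now on, act as {persona}. "
            f"Provide the outputs that {persona} would produce.\n\n"
            f"Task: {task}")

p = persona_prompt("a senior security reviewer",
                   "audit this function for injection flaws")
```

Patterns like this can also be composed, e.g. by prepending a persona template to an output-formatting template in the same prompt.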
 
|-
 
| 2023 || February 24 || || || Research || A paper proposes a system called LLM-Augmenter that improves large language models by using external knowledge and automated feedback. The system adds plug-and-play modules to a black-box LLM to ground responses in external knowledge and iteratively improve responses using feedback generated by utility functions. The system is validated on task-oriented dialog and open-domain question answering, showing a significant reduction in hallucinations without sacrificing fluency and informativeness. The source code and models are publicly available.<ref>{{cite journal |last1=Peng |first1=Baolin |last2=Galley |first2=Michel |last3=He |first3=Pengcheng |last4=Cheng |first4=Hao |last5=Xie |first5=Yujia |last6=Hu |first6=Yu |last7=Huang |first7=Qiuyuan |last8=Liden |first8=Lars |last9=Yu |first9=Zhou |last10=Chen |first10=Weizhu |last11=Gao |first11=Jianfeng |title=Check Your Facts and Try Again: Improving Large Language Models with External Knowledge and Automated Feedback |journal=arXiv:2302.12813 [cs] |date=1 March 2023 |doi=10.48550/arXiv.2302.12813 |url=https://arxiv.org/abs/2302.12813}}</ref>
 
|-
 
| 2023 || February 27 || || || || A paper proposes a framework that simplifies reward design in {{w|reinforcement learning}} (RL) by using natural language as a proxy for the reward function. The framework prompts a large language model, such as GPT-3, to evaluate the agent's behavior against the desired behavior described in the prompt and outputs a corresponding reward signal. The RL agent uses this reward to update its behavior. The approach is evaluated in three tasks, and the results demonstrate that RL agents trained with the framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via {{w|supervised learning}}.<ref>{{cite journal |last1=Kwon |first1=Minae |last2=Xie |first2=Sang Michael |last3=Bullard |first3=Kalesha |last4=Sadigh |first4=Dorsa |title=Reward Design with Language Models |journal=arXiv:2303.00001 [cs] |date=27 February 2023 |doi=10.48550/arXiv.2303.00001 |url=https://arxiv.org/abs/2303.00001}}</ref>
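A minimal sketch of the idea, with a toy stand-in for the language model (the function names, prompt wording, and binary reward mapping are illustrative assumptions, not the paper's implementation):

```python
def llm_reward(behavior_description, goal_prompt, query_llm):
    """Use a language model as a proxy reward function: ask whether the
    agent's behavior matches the user's natural-language objective and
    map a yes/no answer to a binary reward. `query_llm` is a placeholder
    for any LLM call, here replaced by a toy function."""
    question = (f"{goal_prompt}\n\n"
                f"Agent behavior: {behavior_description}\n"
                f"Does the behavior match the objective? Answer yes or no.")
    answer = query_llm(question)
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0

# Toy stand-in for an LLM: approves behaviors that mention splitting.
reward = llm_reward("agent splits resources evenly",
                    "Objective: the agent should share fairly.",
                    lambda q: "Yes" if "splits" in q else "No")
```

The RL agent would then update on this scalar reward exactly as it would on a hand-engineered reward signal.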
 
|-
 
| 2023 || February 27 || || || || A paper proposes a method called "rectification" for reducing the risk of LLMs generating toxic discourses. The method is based on the probability that the finished discourse will be considered toxic, and advises against token selections proportional to this probability. The approach utilizes a separate but smaller model for detoxification and does not require access to the internal representations of the LLM. The method significantly improves the generated discourse compared to base LLMs and other techniques in terms of both language and detoxification performance, and can be applied to diverse LLMs that share the same vocabulary.<ref>{{cite journal |last1=Cao |first1=Meng |last2=Fatemi |first2=Mehdi |last3=Cheung |first3=Jackie Chi Kit |last4=Shabanian |first4=Samira |title=Systematic Rectification of Language Models via Dead-end Analysis |journal=arXiv:2302.14003 [cs] |date=27 February 2023 |doi=10.48550/arXiv.2302.14003 |url=https://arxiv.org/abs/2302.14003}}</ref>
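A minimal sketch of the idea (the probabilities and the down-weighting rule are illustrative assumptions, not the paper's exact dead-end analysis): each candidate token is penalized in proportion to the estimated probability that selecting it leads to a toxic completion, and the distribution is renormalized:

```python
def rectify(token_probs, toxicity_risk):
    """Down-weight each candidate next token by the estimated probability
    that choosing it leads to a toxic completion, then renormalize.
    In the paper's setting the risk estimates come from a separate,
    smaller detoxification model, not the LLM itself."""
    adjusted = [p * (1.0 - r) for p, r in zip(token_probs, toxicity_risk)]
    total = sum(adjusted)
    return [a / total for a in adjusted]

# Token 0 starts out most likely (0.5) but carries high toxicity risk,
# so after rectification token 1 becomes the preferred choice.
probs = rectify([0.5, 0.3, 0.2], [0.9, 0.1, 0.0])
```

Because the adjustment only needs output probabilities, the same detoxification model can steer any LLM that shares its vocabulary.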
 
|-
 
| 2023 || February 27 || || || Research || A paper discusses the use of open source code to train large language models (LLMs) and the potential security, privacy, and licensing implications of this practice. LLMs for code are commonly trained on large unsanitized corpora of source code scraped from the internet, leading to the memorization and verbatim emission of content by the models. The paper argues that the use of {{w|copyleft}} code to train LLMs is a legal and ethical dilemma, and provides actionable recommendations to address this issue. Overall, the paper highlights the importance of considering the implications of using [[w:Open-source software|open source code]] in training LLMs.<ref>{{cite journal |last1=Al-Kaswan |first1=Ali |last2=Izadi |first2=Maliheh |title=The (ab)use of Open Source Code to Train Large Language Models |journal=arXiv:2302.13681 [cs] |date=28 February 2023 |doi=10.48550/arXiv.2302.13681 |url=https://arxiv.org/abs/2302.13681}}</ref>
 
 
|-
 
| 2023 || February 28 || || || || GEMBA (GPT Estimation Metric Based Assessment) is presented as a GPT-based metric for evaluating translation quality both with and without a reference translation. The authors evaluate four prompt variants in two modes and investigate seven versions of GPT models, including ChatGPT. Their method achieves state-of-the-art accuracy in both modes compared to human labels and provides insight into the usefulness of pre-trained, generative large language models for translation quality assessment.<ref>{{cite journal |last1=Kocmi |first1=Tom |last2=Federmann |first2=Christian |title=Large Language Models Are State-of-the-Art Evaluators of Translation Quality |journal=arXiv:2302.14520 [cs] |date=28 February 2023 |doi=10.48550/arXiv.2302.14520 |url=https://arxiv.org/abs/2302.14520}}</ref><ref>{{cite web |title=Large Language Models Are State-of-the-Art Evaluators of Translation Quality |url=https://www.arxiv-vanity.com/papers/2302.14520/ |website=arxiv-vanity.com |access-date=16 May 2023}}</ref>
 
|-
 
| 2023 || February 28 || || || Research || A paper discusses the potential use of large language models in {{w|psycholinguistics}}. The authors note that while these models are not detailed models of human linguistic processing, they are highly successful in their primary task of providing a model for language. They suggest that large language models can be useful in psycholinguistics as a practical tool, for comparative purposes, and philosophically, as a means of rethinking the relationship between language and thought.<ref>{{cite journal |last1=Houghton |first1=Conor |last2=Kazanina |first2=Nina |last3=Sukumaran |first3=Priyanka |title=Beyond the limitations of any imaginable mechanism: large language models and psycholinguistics |journal=arXiv:2303.00077 [cs] |date=28 February 2023 |doi=10.48550/arXiv.2303.00077 |url=https://arxiv.org/abs/2303.00077 |access-date=10 March 2023}}</ref>
 
|-
 
| 2023 || February 28 || || || || A study proposes using LLMs for the automatic analysis of dream reports, specifically focusing on references to emotions. The authors use off-the-shelf and bespoke approaches and find that the bespoke text classification method achieves high performance and is robust against potential biases. This approach could find application in the analysis of large dream datasets and improve the reproducibility and comparability of results across studies. The study of dream content in dream research is typically performed through manual scoring of verbal reports provided by dreamers. This task is time-consuming and requires trained annotators.<ref>{{cite journal |last1=Bertolini |first1=Lorenzo |last2=Elce |first2=Valentina |last3=Michalak |first3=Adriana |last4=Bernardi |first4=Giulio |last5=Weeds |first5=Julie |title=Automatic Scoring of Dream Reports' Emotional Content with Large Language Models |journal=arXiv:2302.14828 [cs] |date=28 February 2023 |doi=10.48550/arXiv.2302.14828 |url=https://arxiv.org/abs/2302.14828}}</ref>
 
|-
 
| 2023 || February 28 || || || || A paper discusses In-Context Instruction Learning (ICIL), a new approach to instruction learning for LLMs that significantly improves zero-shot task generalization performance. ICIL uses a single fixed prompt that concatenates cross-task demonstrations to evaluate all tasks, and it is complementary to instruction-based fine-tuning. The authors demonstrate that ICIL improves the performance of both pretrained and instruction-fine-tuned models, including the most powerful instruction-fine-tuned baseline (text-davinci-003) by 9.3%.<ref>{{cite journal |last1=Ye |first1=Seonghyeon |last2=Hwang |first2=Hyeonbin |last3=Yang |first3=Sohee |last4=Yun |first4=Hyeongu |last5=Kim |first5=Yireun |last6=Seo |first6=Minjoon |title=In-Context Instruction Learning |journal=arXiv:2302.14691 [cs] |date=28 February 2023 |doi=10.48550/arXiv.2302.14691 |url=https://arxiv.org/abs/2302.14691}}</ref>
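A minimal sketch of how such a fixed prompt might be assembled (the field labels and demonstrations are illustrative assumptions, not the paper's exact format):

```python
def build_icil_prompt(demonstrations, task_instruction, task_input):
    """In-context instruction learning: one FIXED prompt concatenates
    cross-task demonstrations (instruction, input, output triples) and
    then appends the unseen target task. The same demonstration block
    is reused to evaluate every task, zero-shot."""
    parts = []
    for instruction, inp, out in demonstrations:
        parts.append(f"Instruction: {instruction}\nInput: {inp}\nOutput: {out}")
    parts.append(f"Instruction: {task_instruction}\nInput: {task_input}\nOutput:")
    return "\n\n".join(parts)

demos = [("Translate to French.", "Good morning", "Bonjour"),
         ("Classify the sentiment.", "I loved it", "positive")]
prompt = build_icil_prompt(demos, "Summarize in one word.",
                           "The plot was long and dull")
```

Because the demonstration block never changes, no per-task prompt engineering or fine-tuning is required at evaluation time.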
 
|-
 
| 2023 || March 1 || || || Research || A paper introduces a method to train language models like ChatGPT to understand concepts precisely using succinct representations based on {{w|category theory}}. The representations provide concept-wise invariance properties and a new learning algorithm that can accurately learn complex concepts or fix misconceptions. The approach also allows for the generation of a hierarchical decomposition of the representations, which can be manually verified by examining each part individually.<ref>{{cite journal |last1=Yuan |first1=Yang |title=Succinct Representations for Concepts |date=2023 |doi=10.48550/arXiv.2303.00446}}</ref>
 
|-
 
| 2023 || March 1 || || || || A study evaluates the value of domain adaptation in nuclear medicine by adapting language models for the purpose of 5-point Deauville score prediction based on clinical 18F-fluorodeoxyglucose (FDG) PET/CT reports. The researchers used multiple general-purpose transformer language models to classify the reports into Deauville scores 1-5, and then adapted the models to the nuclear medicine domain using masked language modeling. Domain adaptation improved the performance of all language models, and the best performing model (domain-adapted RoBERTa) achieved a five-class accuracy of 77.4%, which was better than the physician's performance (66%), the best vision model's performance (48.1%), and was similar to the multimodal model's performance (77.2%).<ref>{{cite journal |last1=Huemann |first1=Zachary |last2=Lee |first2=Changhee |last3=Hu |first3=Junjie |last4=Cho |first4=Steve Y. |last5=Bradshaw |first5=Tyler |title=Domain-adapted large language models for classifying nuclear medicine reports |journal=arXiv:2303.01258 [cs] |date=1 March 2023 |doi=10.48550/arXiv.2303.01258 |url=https://arxiv.org/abs/2303.01258}}</ref>
 
|-
 
| 2023 || March 6 || || || Research || A paper explores the potential of using LLMs as zero-shot human models for {{w|human-robot interaction}} (HRI). Human models are important for HRI, but they are challenging to create. LLMs have consumed vast amounts of human-generated text data and can be used as human models without prior knowledge or interaction data. The authors conducted experiments on three social datasets and found that LLMs can achieve performance comparable to purpose-built models, but there are limitations such as sensitivity to prompts and spatial/numerical reasoning issues. The authors demonstrate how LLM-based human models can be integrated into a {{w|social robot}}'s planning process and applied in HRI scenarios through a case study on a simulated trust-based table-clearing task and a robot utensil-passing experiment. The results show that LLMs offer a promising, though still incomplete, approach to human modeling for HRI.<ref>{{cite journal |last1=Zhang |first1=Bowen |last2=Soh |first2=Harold |title=Large Language Models as Zero-Shot Human Models for Human-Robot Interaction |journal=arXiv:2303.03548 [cs] |date=6 March 2023 |doi=10.48550/arXiv.2303.03548 |url=https://arxiv.org/abs/2303.03548}}</ref>
 
 
|-
 
| 2023 || March 3 || Two stage framework<ref>{{cite web |title=Prophet |url=https://github.com/MILVLG/prophet |website=github.com |publisher=Vision and Language Group@ MIL |access-date=16 May 2023 |date=16 May 2023}}</ref> || || Research || A paper proposes a framework called Prophet that uses answer heuristics to prompt LLMs for knowledge-based visual question answering (VQA). Previous methods used LLMs to acquire necessary knowledge for answering, but these methods did not fully activate the capacity of LLMs due to insufficient input information. Prophet trains a vanilla VQA model on a knowledge-based VQA dataset without external knowledge and extracts two types of answer heuristics: answer candidates and answer-aware examples. These answer heuristics are encoded into prompts to enhance the capacity of LLMs. Prophet outperforms existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, delivering 61.1% and 55.7% accuracies on their testing sets, respectively.<ref>{{cite journal |last1=Shao |first1=Zhenwei |last2=Yu |first2=Zhou |last3=Wang |first3=Meng |last4=Yu |first4=Jun |title=Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering |journal=arXiv:2303.01903 [cs] |date=3 March 2023 |doi=10.48550/arXiv.2303.01903 |url=https://arxiv.org/abs/2303.01903}}</ref>
 
 
|-
| 2023 || March 6 || || || Research || A paper proposes a perspective on prompts for LLMs that distinguishes between diegetic and non-diegetic prompts, and studies how users write with LLMs using different user interfaces. The results show that when the interface offered multiple suggestions and provided an option for non-diegetic prompting, participants preferred choosing from multiple suggestions over controlling them via non-diegetic prompts. When participants provided non-diegetic prompts it was to ask for inspiration, topics or facts. Single suggestions in particular were guided both with diegetic and non-diegetic information. The paper informs human-AI interaction with generative models by revealing that writing non-diegetic prompts requires effort, people combine diegetic and non-diegetic prompting, and they use their draft and suggestion timing to strategically guide LLMs.<ref>{{cite journal |last1=Dang |first1=Hai |last2=Goller |first2=Sven |last3=Lehmann |first3=Florian |last4=Buschek |first4=Daniel |title=Choice Over Control: How Users Write with Large Language Models using Diegetic and Non-Diegetic Prompting |journal=arXiv:2303.03199 [cs] |date=6 March 2023 |doi=10.1145/3544548.3580969 |url=https://doi.org/10.48550/arXiv.2303.03199 |access-date=8 March 2023}}</ref>
|-
| 2023 || March 7 || SynthIE || || || A paper presents SynthIE as a novel approach that leverages LLMs for synthetic data generation, even for tasks where LLMs cannot directly solve the problem. It operates by prompting the LLM to generate text for a given structured output, exploiting task asymmetry to create high-quality, large-scale data. This methodology is demonstrated in the challenging domain of closed information extraction, where ground-truth data is scarce. SynthIE produces a dataset of 1.8 million data points, surpassing existing datasets in quality through human evaluation. The resulting SynthIE models, fine-tuned on this data, outperform comparable models by significant margins, achieving a 57-point improvement in micro F1 and a 79-point improvement in macro F1. All associated resources are publicly available.<ref>{{cite journal |last1=Josifoski |first1=Martin |last2=Sakota |first2=Marija |last3=Peyrard |first3=Maxime |last4=West |first4=Robert |title=Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction |journal=arXiv:2303.04132 [cs] |date=7 March 2023 |doi=10.48550/arXiv.2303.04132 |url=https://arxiv.org/abs/2303.04132}}</ref>
|-
 
| 2023 || March 14 || || || || Google shares health AI updates, including progress on Med-PaLM 2, its expert-level medical large language model (LLM) research, which demonstrates consistently expert-level performance on medical exam questions, scoring 85%. The company partners with Jacaranda Health and Chang Gung Memorial Hospital to build AI models that help simplify the acquisition and interpretation of ultrasound images, identifying information such as gestational age in expecting mothers and early signs of breast cancer. It also partners with Mayo Clinic to extend the reach of its AI model, with the goal of helping more patients receive radiotherapy treatment sooner. Additionally, Google works with partners on the ground to bring its research on AI-powered chest X-ray screening for tuberculosis (TB) into the care setting.<ref>{{cite web |title=Our latest health AI research updates |url=https://blog.google/technology/health/ai-llm-medpalm-research-thecheckup/ |website=Google |access-date=21 March 2023 |language=en-us |date=14 March 2023}}</ref>
 
|-
 
| 2023 || March 23 || || || || An article investigates the potential implications of {{w|large language model}}s (LLMs), such as {{w|Generative Pretrained Transformer}}s (GPTs), on the U.S. labor market. The authors propose a new rubric for assessing LLM capabilities and their potential effects on jobs. The study finds that around 80% of the U.S. workforce could have at least 10% of their work tasks affected by the introduction of LLMs, while approximately 19% of workers may see at least 50% of their tasks impacted. The study suggests that LLMs such as GPTs exhibit traits of general-purpose technologies, indicating that they could have considerable economic, social, and policy implications.<ref>{{cite journal |last1=Eloundou |first1=Tyna |last2=Manning |first2=Sam |last3=Mishkin |first3=Pamela |last4=Rock |first4=Daniel |title=GPTs are GPTs: An Early Look at the Labor Market Impact Potential of Large Language Models |date=2023 |doi=10.48550/arXiv.2303.10130}}</ref>

|-

| 2023 || May 21 || || || || Rodney Brooks, a robotics researcher and AI expert, argues that large language models like OpenAI's ChatGPT are not as intelligent as people believe and are far from being able to compete with humans on an intellectual level. Brooks highlights that these models lack an underlying understanding of the world and merely exhibit correlations in language. Current language models can sound like they understand, but they lack the ability to logically infer meaning, leading to potential misinterpretations. Brooks emphasizes that these models are good at generating answers that sound right but may not be accurate. He shares his experience of relying on large language models for coding tasks and finding that they often provide confidently wrong answers. Brooks concludes that while future iterations of AI may bring interesting advancements, they are unlikely to achieve {{w|artificial general intelligence}} (AGI).<ref>{{cite web |title=AI Expert Says ChatGPT Is Way Stupider Than People Realize |url=https://futurism.com/the-byte/ai-expert-chatgpt-way-stupider |website=Futurism |access-date=24 May 2023}}</ref>
 
|}

Latest revision as of 12:25, 12 October 2023


Extended Timeline

These events were removed from the main timeline.

Year Month and date Model name Number of parameters Event type Details
2018 April 1 Marian Early development A paper introduces Marian, a highly efficient Neural Machine Translation (NMT) framework written entirely in C++. The framework includes an integrated automatic differentiation engine based on dynamic computation graphs. The authors discuss the design of the encoder-decoder framework and demonstrate that Marian, as a research-friendly toolkit, achieves fast training and translation speeds, making it a valuable tool for NMT research and development.[1] NMT models, like those used in Marian, form a significant component of large language models.
2022 June 2 OpenAI publishes a blog post on the development of best practices for organizations developing or deploying large language models. The principles include prohibiting misuse of language models, mitigating unintentional harm by evaluating models, minimizing sources of bias, and collaborating with stakeholders. These practices are meant to mitigate the risks of language models and achieve their full potential to augment human capabilities. The authors express hope that other organizations will adopt these principles and advance public discussion on language model development and deployment. Support from other organizations reflects growing public concern over the safety of LLMs.[2]
2022 September Competition Nvidia announces the launch of its BioNeMo LLM service to help researchers build new artificial intelligence models for biology.[3]
2023 February 9 A paper presents a collaborative design framework that combines interactive evolution and LLMs to simulate the human design process. The framework uses interactive evolution to exploit user feedback and LLMs for a complex creative task of recombining and varying ideas. The process begins with a brief and a set of candidate designs, generated by a language model or proposed by users. Users provide feedback to an interactive genetic algorithm that selects, recombines, and mutates the most promising designs. The framework was evaluated on three game design tasks with human designers collaborating remotely.[4]
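One generation of such a loop can be sketched as follows, with a stub standing in for the LLM and numeric ratings standing in for interactive user feedback. All names and the prompt wording are illustrative assumptions, not the authors' code.

```python
import random

def evolve_one_generation(designs: list[str], ratings: list[float],
                          llm, n_children: int = 4) -> list[str]:
    """Select the better-rated designs, then let the LLM recombine pairs of
    them into new candidates (the LLM acts as crossover/mutation operator)."""
    ranked = [d for _, d in sorted(zip(ratings, designs), reverse=True)]
    parents = ranked[: max(2, len(ranked) // 2)]  # truncation selection
    children = []
    for _ in range(n_children):
        a, b = random.sample(parents, 2)
        children.append(llm(
            "Combine these two game design ideas into one new idea:\n"
            f"1. {a}\n2. {b}"))
    return children

# Stub LLM for demonstration; a real system would call a language model here.
fake_llm = lambda prompt: "hybrid of: " + prompt.splitlines()[-1]
out = evolve_one_generation(
    ["tower defense on a moebius strip",
     "rhythm game with gravity flips",
     "card game about weather"],
    [0.9, 0.7, 0.2], fake_llm, n_children=2)
```

In the paper's setting, the ratings would come from remote human designers each round, closing the interactive loop.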
2023 February 14 Research A paper presents a framework called ChatCAD, which integrates LLMs with computer-aided diagnosis (CAD) networks for medical images. ChatCAD uses LLMs to enhance the output of multiple CAD networks by summarizing and reorganizing the information presented in natural language text format. This approach merges the strengths of LLMs' medical domain knowledge and logical reasoning with the vision understanding capability of existing medical-image CAD models. The goal is to create a more user-friendly and understandable system for patients compared to conventional CAD systems. The paper suggests that LLMs can also be used to improve the performance of vision-based medical-image CAD models in the future.[5]
2023 February 17 Research A paper surveys the state of the art of hybrid language models architectures and strategies for complex question-answering (QA, CQA, CPS). While very large language models are good at leveraging public data on standard problems, they may require specific architecture, knowledge, skills, tasks, methods, sensitive data, performance, human approval, and versatile feedback to tackle more specific complex questions or problems. The paper identifies the key elements used with LLMs to solve complex questions or problems and discusses challenges associated with complex QA. The paper also reviews current solutions and promising strategies, using elements such as hybrid LLM architectures, human-in-the-loop reinforcement learning, prompting adaptation, neuro-symbolic and structured knowledge grounding, program synthesis, and others.[6]
2023 February 28 GEMBA (GPT Estimation Metric Based Assessment) is presented as a GPT-based metric for evaluating translation quality both with and without a reference translation. The authors evaluate four prompt variants in two modes and investigate seven versions of GPT models, including ChatGPT. Their method achieves state-of-the-art accuracy in both modes compared to human labels and provides insight into the usefulness of pre-trained, generative large language models for translation quality assessment.[7][8]
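A GEMBA-style query reduces to a plain scoring prompt plus a reply parser; the sketch below uses illustrative wording rather than the paper's exact template, and the actual GPT API call is omitted.

```python
import re

def build_gemba_prompt(source: str, translation: str,
                       src_lang: str, tgt_lang: str) -> str:
    """Assemble a direct-assessment prompt asking for a 0-100 quality score."""
    return (
        f"Score the following translation from {src_lang} to {tgt_lang} "
        "on a continuous scale from 0 to 100, where 0 means no meaning "
        "preserved and 100 means a perfect translation.\n\n"
        f'{src_lang} source: "{source}"\n'
        f'{tgt_lang} translation: "{translation}"\n'
        "Score:"
    )

def parse_score(llm_reply: str) -> float:
    """Extract the first number in the model's reply as the quality score."""
    match = re.search(r"\d+(?:\.\d+)?", llm_reply)
    if match is None:
        raise ValueError("no numeric score in model reply")
    return float(match.group())

prompt = build_gemba_prompt("Der Hund schläft.", "The dog is sleeping.",
                            "German", "English")
score = parse_score("92")  # "92" stands in for a hypothetical model reply
```

The reference-based mode would simply add a reference-translation line to the prompt.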
2023 March 3 Two stage framework[9] Research A paper proposes a framework called Prophet that uses answer heuristics to prompt LLMs for knowledge-based visual question answering (VQA). Previous methods used LLMs to acquire necessary knowledge for answering, but these methods did not fully activate the capacity of LLMs due to insufficient input information. Prophet trains a vanilla VQA model on a knowledge-based VQA dataset without external knowledge and extracts two types of answer heuristics: answer candidates and answer-aware examples. These answer heuristics are encoded into prompts to enhance the capacity of LLMs. Prophet outperforms existing state-of-the-art methods on two challenging knowledge-based VQA datasets, OK-VQA and A-OKVQA, delivering 61.1% and 55.7% accuracies on their testing sets, respectively.[10]
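Prophet's second stage amounts to packing the two answer heuristics into a single prompt for a frozen LLM. The template and function names below are illustrative assumptions, not the authors' code, and the LLM call itself is omitted.

```python
def format_entry(caption, question, candidates, answer=None) -> str:
    """Render one entry: image context, question, and the VQA model's
    answer candidates with confidence scores; answer-aware examples
    additionally carry the ground-truth answer."""
    cand = ", ".join(f"{a} ({conf:.2f})" for a, conf in candidates)
    text = (f"Context: {caption}\n"
            f"Question: {question}\n"
            f"Candidates: {cand}\n"
            "Answer:")
    return text + (f" {answer}" if answer is not None else "")

def build_prophet_prompt(examples, test_entry) -> str:
    """Concatenate answer-aware in-context examples, then the test question."""
    blocks = [format_entry(*ex) for ex in examples]
    blocks.append(format_entry(*test_entry))
    return "\n\n".join(blocks)

prompt = build_prophet_prompt(
    [("a man on a wave", "what is he doing?",
      [("surfing", 0.92), ("swimming", 0.05)], "surfing")],
    ("a bowl of red fruit", "what fruit is this?",
     [("strawberry", 0.71), ("cherry", 0.20)]),
)
```

The LLM completes the final "Answer:" line, with the candidates and examples steering it toward the vanilla VQA model's knowledge.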
2023 March 7 SynthIE A paper presents SynthIE as a novel approach that leverages LLMs for synthetic data generation, even for tasks where LLMs cannot directly solve the problem. It operates by prompting the LLM to generate text for a given structured output, exploiting task asymmetry to create high-quality, large-scale data. This methodology is demonstrated in the challenging domain of closed information extraction, where ground-truth data is scarce. SynthIE produces a dataset of 1.8 million data points, surpassing existing datasets in quality through human evaluation. The resulting SynthIE models, fine-tuned on this data, outperform comparable models by significant margins, achieving a 57-point improvement in micro F1 and a 79-point improvement in macro F1. All associated resources are publicly available.[11]
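SynthIE's output-to-input direction can be sketched as follows; the prompt template is an illustrative assumption, and the LLM call that would turn the prompt into fluent text is omitted.

```python
def triplets_to_prompt(triplets: list[tuple[str, str, str]]) -> str:
    """Ask the LLM to write text expressing a given set of
    (subject, relation, object) triplets -- the reverse of the
    extraction task the synthetic data will be used to train."""
    facts = "\n".join(f"({s}; {r}; {o})" for s, r, o in triplets)
    return ("Write a short, fluent text that expresses exactly the "
            "following facts and nothing else:\n" + facts + "\nText:")

# The resulting (generated_text, triplets) pair then becomes a synthetic
# training example for the forward model, which maps text back to triplets.
prompt = triplets_to_prompt(
    [("Marie Curie", "award received", "Nobel Prize in Physics")])
```

The asymmetry being exploited is that verbalizing given triplets is much easier for an LLM than extracting exhaustive triplets from free text.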
2023 March 14 Google shares health AI updates, including progress on Med-PaLM 2, its expert-level medical large language model (LLM) research, which demonstrates consistently expert-level performance on medical exam questions, scoring 85%. The company partners with Jacaranda Health and Chang Gung Memorial Hospital to build AI models that help simplify the acquisition and interpretation of ultrasound images, identifying information such as gestational age in expecting mothers and early signs of breast cancer. It also partners with Mayo Clinic to extend the reach of its AI model, with the goal of helping more patients receive radiotherapy treatment sooner. Additionally, Google works with partners on the ground to bring its research on AI-powered chest X-ray screening for tuberculosis (TB) into the care setting.[12]
  1. Junczys-Dowmunt, Marcin; Grundkiewicz, Roman; Dwojak, Tomasz; Hoang, Hieu; Heafield, Kenneth; Neckermann, Tom; Seide, Frank; Germann, Ulrich; Aji, Alham Fikri; Bogoychev, Nikolay; Martins, André F. T.; Birch, Alexandra (2018). "Marian: Fast Neural Machine Translation in C++". doi:10.48550/arXiv.1804.00344. 
  2. "Best practices for deploying language models". openai.com. Retrieved 17 March 2023. 
  3. "Nvidia boosts generative AI for biology with BioNeMo". VentureBeat. 12 January 2023. Retrieved 11 March 2023. 
  4. Lanzi, Pier Luca; Loiacono, Daniele (9 February 2023). "ChatGPT and Other Large Language Models as Evolutionary Engines for Online Interactive Collaborative Game Design". arXiv:2303.02155 [cs]. doi:10.48550/arXiv.2303.02155. 
  5. Wang, Sheng; Zhao, Zihao; Ouyang, Xi; Wang, Qian; Shen, Dinggang (2023). "ChatCAD: Interactive Computer-Aided Diagnosis on Medical Image using Large Language Models". doi:10.48550/arXiv.2302.07257. 
  6. Daull, Xavier; Bellot, Patrice; Bruno, Emmanuel; Martin, Vincent; Murisasco, Elisabeth (17 February 2023). "Complex QA and language models hybrid architectures, Survey". arXiv:2302.09051 [cs]. doi:10.48550/arXiv.2302.09051. 
  7. Kocmi, Tom; Federmann, Christian (28 February 2023). "Large Language Models Are State-of-the-Art Evaluators of Translation Quality". arXiv:2302.14520 [cs]. doi:10.48550/arXiv.2302.14520. 
  8. "Large Language Models Are State-of-the-Art Evaluators of Translation Quality". arxiv-vanity.com. Retrieved 16 May 2023. 
  9. "Prophet". github.com. Vision and Language Group@ MIL. 16 May 2023. Retrieved 16 May 2023. 
  10. Shao, Zhenwei; Yu, Zhou; Wang, Meng; Yu, Jun (3 March 2023). "Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering". arXiv:2303.01903 [cs]. doi:10.48550/arXiv.2303.01903. 
  11. Josifoski, Martin; Sakota, Marija; Peyrard, Maxime; West, Robert (7 March 2023). "Exploiting Asymmetry for Synthetic Training Data Generation: SynthIE and the Case of Information Extraction". arXiv:2303.04132 [cs]. doi:10.48550/arXiv.2303.04132. 
  12. "Our latest health AI research updates". Google. 14 March 2023. Retrieved 21 March 2023.