Learning to Plan for Language Modeling from Unlabeled Data (2024)


Nathan Cornille, Marie-Francine Moens & Florian Mai
Department of Computer Science
KU Leuven
Leuven, Belgium
{nathan.cornille,sien.moens,florian.mai}@kuleuven.be

Abstract

By training to predict the next token in an unlabeled corpus, large language models learn to perform many tasks without any labeled data. However, their next-token-prediction objective arguably limits their performance in scenarios that require planning, such as writing a coherent article. In this paper, we train a module for planning the future writing process via a self-supervised learning objective. Given the textual context, this planning module learns to predict future abstract writing actions, which correspond to centroids in a clustered text embedding space. By conditioning on these actions, our model extends the successful language model formula to more abstract planning in an unsupervised way. Empirically, we demonstrate that our method improves language modeling performance in general, particularly with respect to the text structure. Because our framework uses a planner module that is unsupervised and external to the language model, new planner modules can be trained at large scale and easily be shared with the community.

1 Introduction

Despite their simple learning objective, language models (LMs) can solve a surprising range of challenging generation tasks such as summarization (Zhang et al., 2024b), style transfer (Reif et al., 2022), and machine translation (Zhu et al., 2023). This is widely attributed to the fact that the self-supervised next-token-prediction objective allows for training large models on unprecedented amounts of unlabeled data (Radford et al., 2019), unlocking new capabilities at sufficient scale (Wei et al., 2022a), e.g. in-context learning (Brown et al., 2020) and chain-of-thought reasoning (Wei et al., 2022b). The dual-process theory posits that human intelligence comprises intuitive (type 1) and deliberate (type 2) reasoning systems (Evans, 1984; Kahneman, 2011). Bengio et al. (2021) describe current self-supervised deep learning systems as most successful at system 1 tasks such as perception, reading, and talking. However, it is currently unclear how to transfer the success of self-supervision to system 2 tasks that require planning, such as writing a coherent article.

Most popular approaches for eliciting deliberate reasoning capabilities in LMs are based on either advanced prompting techniques such as chain-of-thought (Wei et al., 2022b), plan-and-solve (Wang et al., 2023a), or tree-of-thought (Long, 2023), or on finetuning on task-specific reasoning data (Havrilla et al., 2024). While these techniques already lead to substantial improvements, they ignore the key ingredient that made LMs so successful in the first place: the ability to obtain ever-improving performance through scalable self-supervised pretraining. Hence, it is widely recognized that combining planning with self-supervised foundation models is a promising research direction (Yang et al., 2023).

In the context of sequential decision making, planning is the process of determining a decision policy from a model of the environment (Sutton & Barto, 2018). In this paper, we propose a method for learning to plan future writing from unlabeled data. Concretely, as illustrated in Figure 1, we first transform unlabeled text into a sequence of abstract writing actions, which we obtain from a clustered text embedding space. We then train an external planner network which predicts the next writing action from the current context. Finally, we integrate the new learned planning capability into a regular LM by finetuning it on the self-supervised next-token-prediction task in a way that allows it to extract relevant information from predicted writing actions. The code is available on GitHub.

In summary, we report the following findings:

  1. Our model makes effective use of abstract writing actions predicted by the planner.

  2. An external planner outperforms an LM-internal planner.

  3. The LM improvements stem specifically from a better ability to anticipate and match the text structure, enabled by planning.

2 Related Work

2.1 Deliberative Reasoning and Planning in LMs

Existing work on deliberative reasoning in LMs can be roughly categorized into three approaches: 1) Prompting-based approaches such as chain-of-thought (Wei et al., 2022b; Kojima et al., 2022) and plan-and-solve (Wang et al., 2023a) elicit reasoning and planning capabilities from pretrained LMs via advanced prompting methods but without any additional finetuning (Wei et al., 2022b; Kojima et al., 2022; Wang et al., 2022a;b; 2023a; Zheng et al., 2023b; Kim et al., 2023; Xue et al., 2023; Zhao et al., 2023; Sun et al., 2024; Yao et al., 2023; Hao et al., 2023). 2) Through a combination with external tools, LMs can learn to use e.g. calculators when appropriate (Schick et al., 2023; Chen et al., 2023; Paranjape et al., 2023) or receive feedback from domain-specific solvers (Nye et al., 2021; Valmeekam et al., 2023). 3) Finally, finetuning on task-specific reasoning data can result in substantially improved in-domain performance for e.g. mathematical reasoning (Hendrycks et al., 2021; Cobbe et al., 2021; Havrilla et al., 2024), code generation (Shen et al., 2023), and navigation tasks (Carta et al., 2023). While these three approaches lead to substantial performance gains on individual reasoning tasks, they deviate from the paradigm of large-scale self-supervised training on unlabeled data that gave rise to the dominance of LMs, which our method addresses. The Planning Tokens method (Wang et al., 2023b) is perhaps most relevant to our work: here, the dataset is augmented with planning tokens that represent high-level information similar to our writing actions. However, planning tokens are generated by the LM itself rather than by an external planner, and are hence limited in their flexibility and ultimately performance. Most importantly, their method is again applied to a narrow mathematical reasoning dataset rather than a general-purpose corpus. Nonetheless, in our experiments, we compare our external planner to the LM-internal planning approach by Wang et al. (2023b).

2.2 Hierarchical Text Processing in Deep Learning

By planning in the space of abstract writing actions derived from text units (e.g., sentences), our method explicitly addresses the hierarchical nature of language. This is a common theme in NLP models; for example, hierarchical attention mechanisms have been employed for summarization (Nallapati et al., 2016), text classification (Yang et al., 2016), and machine translation (Miculicich et al., 2018). Several works have decomposed text generation tasks into sentence-level and word-level prediction tasks (Tan et al., 2017; Perez-Beltrachini et al., 2019; Marfurt & Henderson, 2021; Jhamtani & Berg-Kirkpatrick, 2020), but in contrast to our work they do not condition on abstract writing actions. Ippolito et al. (2020) propose to do language modeling at the sentence level, while we ultimately still perform language modeling at the token level.

2.3 Language Modeling

Language models have been an important subject of study for decades. While initial models were count-based n-gram LMs, the models that are ubiquitous today are neural LMs (Bengio et al., 2000). Algorithmic advances are one major source of progress in neural language modeling (Ho et al., 2024), through e.g. recurrent LMs (Mikolov et al., 2010; 2011), specialized attention mechanisms such as pointer networks (Vinyals et al., 2015; Merity et al., 2017), and eventually transformer-based (Vaswani et al., 2017) language models (Radford, 2018). While the contribution of our paper is on the algorithmic side, it is important to remember that, as Ho et al. (2024) identify, progress in recent years has been driven mostly by scaling up similar architectures (Radford et al., 2019; Brown et al., 2020; Hoffmann et al., 2022), which gives rise to the key motivation of our paper: new algorithms for planning and reasoning in LMs need to scale to unlabeled data to be relevant long-term.

Related methods

Among the important algorithmic improvements, Retrieval Augmented Generation (RAG) (Guu et al., 2020; Lewis et al., 2020) informs next-token generation by querying a large database of text. A special case of RAG, nearest neighbor language models (Khandelwal et al., 2020), retrieve the closest matches to the past based on similarity in a latent space, transform the distances into a probability distribution, and interpolate it with the LM's distribution without any additional finetuning. Orthogonal to the above, we propose a method that predicts a sentence into the future in order to inform the generation of the next tokens via finetuning from unlabeled data. In Controllable Text Generation (Zhang et al., 2024a; Mudgal et al., 2024; Guo et al., 2022; Deng & Raffel, 2023; Yang & Klein, 2021), generation is guided by user-provided attributes such as topic, style, or keywords. Our method can be seen as a form of controlled text generation, where the guiding (latent) attributes are predicted by an external model rather than provided by the user. Aiming at separating the global context from the local context, topic-guided LMs (Lau et al., 2017; Wang et al., 2019) condition the generation on topic models inferred from the context. Recent results indicate that this does not improve language modeling because well-trained LMs already possess a good awareness of the current topic (Zheng et al., 2023a). In contrast, our planner predicts future writing actions, which often carry information that is structural rather than topical.

3 Methodology

We consider the general task of language modeling, i.e., estimating the probability $p(x_1 x_2 \dots x_n)$ for any text sequence $X = x_1 \dots x_n \in \mathcal{X}$. As is standard practice, we factorize $p(x_1 \dots x_n) = \prod_{i=1}^{n} p(x_i \mid x_1 \dots x_{i-1})$. However, in contrast to a standard LM, we condition our LM on additional writing actions $a \in \mathcal{A}$, which are predicted by an external planner module $\operatorname{P}: \mathcal{X} \rightarrow \mathcal{A}$. To this end, we split the input $X$ into text units $X = t_1 \dots t_m$ consisting of one or more tokens $t_j = x^j_1 \dots x^j_{n_j}$. A text unit could for example be a paragraph or chapter, or even be determined dynamically. However, in this paper we choose sentences for simplicity. We associate one writing action $a_j$ with each text unit $t_j$, which we write as $X' = a_1 t_1 \dots a_m t_m$. (We choose this notation for convenience; the actions $a_j$ are not in textual form.) Hence, we frame language modeling as estimating
$$\prod_{j=1}^{m} \prod_{i=1}^{n_j} p\left(x_i^j \mid a_1 t_1 \dots a_{j-1} t_{j-1} a_j x_1^j \dots x_{i-1}^j\right),$$
i.e., the probability of the next token in the current text unit given the sequence of previous text units with their corresponding writing actions and the previous tokens in the current text unit, via our language model $p_\theta$.

Our method consists of three core contributions, as sketched in Figure 1: First, we show how to derive an abstract writing action for every text unit in our unlabeled data, providing us with training data. Second, we propose a simple planner for predicting the next writing action. Finally, we finetune the language model by conditioning it on predicted actions.

Figure 1: Overview of our method: (a) generating abstract writing actions from unlabeled data via clustered text embeddings, (b) predicting future writing actions with a planner module, and (c) finetuning the language model conditioned on predicted writing actions.

3.1 Generating Abstract Writing Actions from Unlabeled Data

Given a training corpus $\mathcal{X}$ with articles $X = t_1 \dots t_n \in \mathcal{X}$, we first embed every text unit into a low-dimensional vector $\mathbf{z}_j = \operatorname{E}(t_j) \in \mathbb{R}^d$ via a text encoder $\operatorname{E}$ (e.g. Sentence-BERT (Reimers & Gurevych, 2019)). In principle, we could now associate each text unit with its corresponding embedding, i.e., choose the action embedding $\overrightarrow{\mathbf{a}_j} = \mathbf{z}_j$. However, while this would retain most information, $a_j$ would then be very difficult to predict due to the large number of options. Instead, we posit that it is necessary to compute abstractions that correspond to common writing patterns. Similar to Wang et al. (2023b), we employ a simple k-means clustering step in the embedding space to narrow down the possible actions. We reuse the resulting cluster centroids as the action embeddings $\overrightarrow{\mathbf{a}} \in \mathbb{R}^d$. Since the embeddings carry semantic information, we expect the resulting centroids to correspond to general writing patterns. We set $k = 1024$ by default. In Section 5.3, we investigate the effect of $k$ and what our clusterings have learned.
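
A minimal sketch of this step, assuming a Sentence-BERT checkpoint such as all-mpnet-base-v2 as the frozen encoder E and scikit-learn's k-means; the specific checkpoint and clustering settings are illustrative assumptions, not necessarily the paper's exact configuration.

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
import numpy as np

encoder = SentenceTransformer("all-mpnet-base-v2")  # frozen text encoder E (assumed checkpoint)

def fit_action_space(sentences, k=1024, seed=0):
    """Embed all text units and cluster them; the centroids become the action embeddings."""
    Z = encoder.encode(sentences, normalize_embeddings=True)           # (N, d)
    km = KMeans(n_clusters=k, random_state=seed, n_init="auto").fit(Z)
    return km.cluster_centers_                                         # (k, d)

def label_with_actions(sentences, centroids):
    """Assign each text unit the index of its nearest centroid, i.e., its oracle writing action."""
    Z = encoder.encode(sentences, normalize_embeddings=True)           # (m, d)
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)                                        # oracle actions a_1 ... a_m
```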

3.2 Predicting Future Writing Actions with a Planner Module

During inference, we do not have access to the true writing actions, so they have to be predicted. We could simply treat writing actions as tokens and try to predict them along with every other token by finetuning the language model on the new data, as is done in Wang et al. (2023b). However, we argue that it is beneficial to treat the prediction of writing actions differently from tokens, namely through an external prediction module, for three reasons: (1) Importance: Writing actions weigh higher in importance than individual tokens, as they carry information regarding entire text units (e.g., sentences, paragraphs) rather than subword units. (2) Planning: The predicted writing action has significant influence over the future generated text, demanding elaborate planning to ensure the right choice is made, which next-token prediction as in LLMs cannot provide. (3) Modularity: An external planner is not specific to any language model. Hence, it can be trained once and henceforth be used for conditioning various language models downstream.

In this paper, we propose a simple planner that only predicts the next writing action. However, our framework can encompass planners of arbitrary complexity, allowing for long-term prediction multiple steps ahead and the integration of a look-ahead search, similar to successful reinforcement learning approaches to playing games(Schrittwieser etal., 2020).

We model action prediction as a classification task, training the planner via cross-entropy to predict the next action obtained from unlabeled data in the first step (Section 3.1). The default choice for such a text classification setup is to embed the whole context at once. However, in preliminary experiments, we found it helpful to embed the input in the same space as the output. As illustrated in Figure 1(b), our planner first embeds each text unit $t_j$ independently into a single vector $\mathbf{z}_j$ using the same (frozen) text encoder as in Section 3.1:

$$\mathbf{Z}_{i-1} := \left\{ \operatorname{E}(t_1) + \mathbf{p}_1, \dots, \operatorname{E}(t_{i-1}) + \mathbf{p}_{i-1} \right\}.$$

Here, $\mathbf{p}$ is an absolute position embedding to integrate sequence information (Vaswani et al., 2017). Given the embedding matrix $\mathbf{Z}_{i-1} \in \mathcal{Z} = \left(\mathbb{R}^d\right)^+$ with $\left(\mathbf{Z}_{i-1}\right)_j$ representing vector $\mathbf{z}_j$, we predict the subsequent writing action $a_i$. Since the embedding is a set of vectors, we employ a simple Transformer encoder with mean pooling and a linear classifier head:

$$\mathbf{Z}'_{i-1} = \operatorname{Transformer}\left(\mathbf{Z}_{i-1}\right); \quad \mathbf{o}_{i-1} = \frac{1}{i-1}\sum_{j=1}^{i-1}\left(\mathbf{Z}'_{i-1}\right)_j; \quad \mathbf{pr}_{\mathcal{A}} = \operatorname{softmax}\left(\mathbf{W}_o \mathbf{o}_{i-1} + \mathbf{b}\right).$$

Here, $\mathbf{pr}_{\mathcal{A}}$ is the predicted probability distribution over actions. Since the semantic relationships between actions are encoded in their embeddings $\overrightarrow{\mathbf{a}} \in \mathbb{R}^d$, which we obtained from clustering, we initialize the output matrix $\mathbf{W}_o \in \mathbb{R}^{|\mathcal{A}| \times d}$ with the action embeddings.
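
A minimal PyTorch sketch of this planner; the encoder depth, number of heads, and maximum number of text units are illustrative assumptions rather than the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class Planner(nn.Module):
    def __init__(self, centroids, d=768, n_layers=2, n_heads=8, max_units=128):
        super().__init__()
        k = centroids.shape[0]
        self.pos = nn.Embedding(max_units, d)              # absolute position embeddings p_j
        layer = nn.TransformerEncoderLayer(d_model=d, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d, k)                        # W_o and b
        with torch.no_grad():                              # initialize W_o with the action embeddings
            self.head.weight.copy_(torch.as_tensor(centroids, dtype=torch.float32))

    def forward(self, unit_embs):
        """unit_embs: (batch, i-1, d) frozen sentence embeddings E(t_1), ..., E(t_{i-1})."""
        positions = torch.arange(unit_embs.size(1), device=unit_embs.device)
        h = self.encoder(unit_embs + self.pos(positions))  # contextualize the text-unit embeddings
        o = h.mean(dim=1)                                  # mean pooling over text units
        return self.head(o)                                # logits over the k writing actions

# Training uses standard cross-entropy against the oracle action of the next text unit, e.g.
# loss = nn.functional.cross_entropy(planner(context_embeddings), next_action_ids)
```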

3.3 Fine-tuning the Language Model Conditioned on Predicted Writing Actions

State-of-the-art LMs are already powerful next-token predictors that possess various capabilities. When training these models to make use of the information in provided writing actions, we want to finetune them in a minimally invasive way in order to avoid catastrophic forgetting and overspecialization to a particular training domain. To achieve this, we propose an adapter-style conditioning module.

For an input sequence of length $P$, for each position $p \in \{1, \dots, P\}$, we select an action to be merged into the model. Specifically, we select the action whose centroid embedding is closest to the sentence embedding of the text unit to which the token at the next position $p+1$ corresponds. This is done because the final-layer prediction at position $p$ estimates the probability of the token at position $p+1$. Call this action $a_{p+1}$. Following Zhang et al. (2023), we insert actions only in the last $L$ layers of the language model. In each layer $l$, we first embed the action using an embedding table $E_A^l$. We will show in Section 5.3 that it is beneficial to initialize $E_A^l$ with the action embeddings $\overrightarrow{\mathbf{a}}_{p+1} \in \mathbb{R}^d$ that correspond to the centroids of the clustering phase. We project this embedding once more with a linear layer $\mathbf{W}_A^l \in \mathbb{R}^{d' \times d}$ to get the action representation $\mathbf{r}_p^l$ at position $p$ and layer $l$:

$$\mathbf{r}_p^l = \mathbf{W}_A^l E_A^l(a_{p+1}).$$

The action representation is then added element-wise to the $d'$-dimensional feature representation inside the transformer at layer $l$ and position $p$, just after the multiplication with the attention weights and just before the output projection. This is similar to the Llama-Adapter (Zhang et al., 2023) with an adaptation prompt sequence length of 1, except that we do not use gating, which we found not to be helpful. With action information inserted into the model as explained above, we finetune the LM by minimizing the cross-entropy for next-token prediction as

$$\mathcal{L} = -\,\mathbb{E}_{X = [t_1 \dots t_{j-1} x_1^j \dots x_i^j] \sim P(\mathcal{X})} \log p_{\theta}\left(x_i^j \mid \hat{a}_1 t_1 \dots \hat{a}_{j-1} t_{j-1} \hat{a}_j x_1^j \dots x_{i-1}^j\right),$$

where $P(\mathcal{X})$ is the distribution of sequences and $\hat{a}$ denotes planner-predicted writing actions.
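
A simplified sketch of this conditioning for GPT-2, assuming the `centroids` array from Section 3.1 is available: for brevity it adds the projected action representation to the output of each of the last L transformer blocks via forward hooks, rather than inside the attention computation as described above, and the value of L and the per-position action bookkeeping are assumptions of the sketch.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel

class ActionAdapter(nn.Module):
    """Per-layer action embedding table E_A^l plus projection W_A^l."""
    def __init__(self, centroids, d_model):
        super().__init__()
        k, d = centroids.shape
        self.embed = nn.Embedding(k, d)                              # E_A^l, initialized with centroids
        self.embed.weight.data.copy_(torch.as_tensor(centroids, dtype=torch.float32))
        self.proj = nn.Linear(d, d_model, bias=False)                # W_A^l

    def forward(self, action_ids):      # (batch, seq_len) holding a_{p+1} for every position p
        return self.proj(self.embed(action_ids))                     # (batch, seq_len, d_model)

lm = GPT2LMHeadModel.from_pretrained("gpt2")
L = 6                                   # condition only the last L blocks (illustrative value)
adapters = nn.ModuleList(ActionAdapter(centroids, lm.config.n_embd) for _ in range(L))
batch_actions = {}                      # fill with {"ids": LongTensor} before each forward pass

def make_hook(adapter):
    def hook(module, inputs, output):
        hidden = output[0] + adapter(batch_actions["ids"])           # add r_p^l at every position p
        return (hidden,) + output[1:]
    return hook

for block, adapter in zip(lm.transformer.h[-L:], adapters):
    block.register_forward_hook(make_hook(adapter))
```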

An alternative approach is to finetune the language model with oracle writing actions, which we also evaluate in Section 5.1. On the one hand, we expect that tuning with oracle actions will increase the mismatch between training (oracle actions) and testing (predicted actions), but on the other hand the more reliable oracle actions might induce the LM to rely more strongly on them.

4 Experimental Setup

4.1 Hypotheses

The goal of our paper is to demonstrate the viability of our method for learning an external planning algorithm from unlabeled data whose predictions inform language modeling. Hence, our experiments are designed to test the following hypotheses: (H1) Our method successfully integrates the information from predicted writing actions into language modeling. (H2) The external planner is more successful than an internal planner. (H3) Our method particularly improves performance in terms of text structure.

4.2 Experimental Details

In all our experiments, we train and evaluate our models on a subset of English Wikipedia articles. To this end, we utilize a cleaned Wikipedia dump from March 2022 that is publicly available via Huggingface (https://huggingface.co/datasets/wikipedia; we use the "20220301.en" version). We train on a subset of 285,310 articles (corresponding to roughly 300M tokens), and split off 1,000 articles for early-stopping validation and 1,000 articles for testing. Due to the high computational cost of training a model, most of our experiments are performed with the small GPT-2 model (Radford et al., 2019). However, in order to show that our method is viable for larger, more powerful models, we additionally perform an experiment with the recent OLMo 1B (Groeneveld et al., 2024). The full list of hyperparameters is provided in Appendix A.
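
For reference, the corpus can be loaded with the Hugging Face datasets library roughly as follows; the split sizes are taken from the description above, while the shuffling seed and the ordering of the splits are assumptions of this sketch.

```python
from datasets import load_dataset

wiki = load_dataset("wikipedia", "20220301.en", split="train")  # cleaned March 2022 dump
wiki = wiki.shuffle(seed=42)                                    # seed is an assumption

test_articles  = wiki.select(range(0, 1_000))                   # held-out test articles
val_articles   = wiki.select(range(1_000, 2_000))               # early-stopping validation
train_articles = wiki.select(range(2_000, 2_000 + 285_310))     # roughly 300M training tokens
```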

In Section 5.2 we compare our external planner against the LM-internal planner by Wang et al. (2023b). Since this model uses finetuning to adapt the LM itself to predict writing actions, a fair comparison of internal and external planners requires adapting our method to this scenario, which we describe in detail in Appendix C. Briefly, the adaptation consists of inserting externally predicted writing actions as tokens, which allows every token that follows to condition on them via attention. Moreover, the adaptation allows finetuning of the original LM parameters.

4.3 Evaluation

Primary evaluation

The primary objective of our work is to improve language modeling, whose standard metric is perplexity, $\mathrm{PPL} = \exp(\mathcal{L})$.

Surface-level generation evaluation

Because perplexity does not evaluate actual generations of the model, we also report ROUGE-2 (F1) (Lin, 2004) and MAUVE (Pillutla et al., 2021) scores. ROUGE-2 measures the overlap between generated text and real text in terms of bigrams. Concretely, given a real text $X = x_1 \dots x_n = t_1 \dots t_m$, we first use the trained models to generate text continuations $\hat{X} = t_1 \dots t_i \hat{t}_{i+1} \dots \hat{t}_{m'}$ given the prefix $t_1 \dots t_i$ for some $i$. We then compare true and generated continuations at different lengths; we report the average over lengths in Section 5.1 and the individual results in Appendix B.
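
As a concrete reference point, bigram-overlap F1 in the style of ROUGE-2 can be computed as below; in practice a standard implementation such as the rouge-score package would be used, so this is only an illustrative sketch.

```python
from collections import Counter

def rouge2_f1(reference: str, generated: str) -> float:
    """Bigram-overlap F1 between a reference continuation and a generated one."""
    def bigrams(text):
        toks = text.split()
        return Counter(zip(toks, toks[1:]))
    ref, gen = bigrams(reference), bigrams(generated)
    overlap = sum((ref & gen).values())   # clipped bigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(gen.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```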

MAUVE is a metric designed to evaluate open-ended text generation. It aims to measure the divergence between the probability distribution of the model and the true data distribution. To do so, it takes a set of true articles and generated articles, and deduces a tractable probability distribution from each by embedding and clustering the articles to form histograms. It then calculates a divergence frontier (Djolonga et al., 2020) between the two distributions, which is analogous to a precision-recall curve applied to generative models. The area under the curve is the MAUVE score. Because MAUVE does not require pairwise comparisons, we generate texts unconditionally, i.e. without context. We generate 1024 tokens, which corresponds roughly to the average article length.

Abstract generation evaluation

Additionally, we hypothesize that our planning improves language modeling performance in particular with respect to text structure. To this end, we measure how well model-generated texts correspond to ground-truth texts in terms of writing action sequences. We use the procedure described in Section 3.1 to obtain the sequence of writing actions that corresponds to a real or generated text.

We provide two metrics for this: Levenshtein distance (Levenshtein et al., 1966) and latent perplexity (Deng et al., 2022). The Levenshtein distance (also referred to as edit distance) measures the number of insertions, deletions, and substitutions necessary to make the sequences equal. Similar to ROUGE-2, it compares generated text to ground-truth text in a pairwise manner. Hence, we report results for the same text continuations as for ROUGE-2.
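
A standard dynamic-programming implementation of this edit distance over writing-action sequences, shown here only to make the metric concrete:

```python
def levenshtein(a, b):
    """Number of insertions, deletions, and substitutions needed to turn sequence a into b."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (x != y)))   # substitution (free if actions match)
        prev = curr
    return prev[-1]

# e.g. levenshtein([3, 17, 42], [3, 42, 42]) == 1  (one substitution)
```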

For latent perplexity, we train a simple HMM-based critic to model the distribution of sequences of abstract writing actions in ground-truth text. The latent perplexity is then the perplexity that this critic assigns to the sequence of abstract writing actions in the generated text. Latent perplexity evaluates open-ended text generation, and we thus report it on the same unconditionally generated texts as MAUVE.

Planner evaluation

While the end goal of our work is improving language modeling, the way in which we aim to do that hinges on two components: first, how accurately the planner module can predict oracle writing actions, and second, how successfully we can finetune the language model to make use of the information provided by the planner. To measure the former in isolation, we report two metrics: accuracy (i.e., the percentage of writing actions predicted correctly) and average rank. The planner predicts a score for each possible action. To calculate the average rank, we order the predicted scores of each action from high to low and observe which rank the oracle action has on average.
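
A small sketch of these two planner metrics, assuming `scores` holds the planner's predicted scores for each evaluation example and `oracle` the corresponding oracle action indices:

```python
import numpy as np

def planner_metrics(scores: np.ndarray, oracle: np.ndarray):
    """scores: (N, k) action scores; oracle: (N,) oracle action indices."""
    accuracy = float((scores.argmax(axis=1) == oracle).mean())
    order = (-scores).argsort(axis=1)                     # actions sorted from high to low score
    ranks = (order == oracle[:, None]).argmax(axis=1) + 1  # rank 1 = highest-scoring action
    return accuracy, float(ranks.mean())
```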

4.4 Baselines

Since our method involves finetuning a language model via the action adapter (cf. Figure 1(c)), improved performance over the language model with no finetuning (henceforth called None) is expected due to adaptation to the finetuning dataset (Wikipedia). However, since the goal of this paper is to identify how well the language model is able to make use of the information in the predicted writing actions (called Predicted), we would like to isolate the benefit introduced by the predicted writing actions alone in our main experiments. To this end, we employ the Fixed baseline, which is finetuned on the same dataset but always receives the same (uninformative) writing action regardless of the context. Finally, we employ the Oracle baseline, which always receives the true next writing action, to serve as an upper bound for planners trained on our extracted writing actions. The different baselines correspond to different writing actions being provided to the language model, but each baseline always uses the same type during training and testing. For Predicted, however, it is possible to train using either oracle actions (Predicted-OA) or planner-predicted actions (Predicted-PA).

Wang et al. (2023b) train an LM-internal planner, to which we compare our external planner. To enable a fair comparison that isolates the effect of the planner being internal or external, we re-implement their approach in our framework. The implementation details can be found in Appendix C.

5 Results

5.1 Main Result

Table 1 shows the results. The main result is that the LM trained with planner-predicted actions (Predicted-PA) outperforms the Fixed baseline by 1.14 PPL for GPT-2 and 0.35 PPL for OLMo. It also outperforms Fixed on both surface-level (ROUGE-2, MAUVE) and abstract (Latent PPL, Edit) generation metrics. The results also show a trade-off between finetuning the LM on planner-predicted actions (Predicted-PA) and oracle actions (Predicted-OA). Finetuning on oracle actions damages perplexity, as it introduces a mismatch between training and testing data. Despite this, it is sometimes better in terms of generation metrics. An explanation can be found in the different extent to which the resulting LMs actually follow the plans proposed by their planner, i.e., whether they produce sentences that belong to the cluster corresponding to the action used in generating the sentence. We find that Predicted-OA follows the plan about 40% of the time, compared to only 20% for Predicted-PA. This makes sense, as oracle actions are more informative than predicted actions. Hence, there is a trade-off between reducing the train/test mismatch to achieve low perplexity and following the plan to achieve high generation quality. Finding a balance between these two is a promising direction for future work.
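
The plan-following rate mentioned above can be estimated roughly as follows, reusing the frozen encoder and centroids from Section 3.1; the exact bookkeeping of which action conditioned which generated sentence is an assumption of this sketch.

```python
import numpy as np

def plan_following_rate(generated_sentences, conditioned_actions, centroids, encoder):
    """Fraction of generated sentences whose nearest centroid matches the conditioning action."""
    Z = encoder.encode(generated_sentences, normalize_embeddings=True)       # (N, d)
    dists = np.linalg.norm(Z[:, None, :] - centroids[None, :, :], axis=-1)   # (N, k)
    realized = dists.argmin(axis=1)                                          # realized actions
    return float((realized == np.asarray(conditioned_actions)).mean())
```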

We also observe that there remain differences of 1.63 PPL and 0.47 PPL with respect to Oracle. While it is not possible to train a planner that is exactly equivalent to oracle access, this does suggest that a planner that is better at matching oracle actions can further improve language modeling performance. Moreover, in Appendix D we find that, while the oracle action is much better than a random choice, it is not optimal. Hence, further perplexity gains could be made by learning to consistently predict those actions that lead to an even lower perplexity than the oracle actions.

Table 1: Main results. Arrows indicate whether higher (↑) or lower (↓) is better.

| Base LM | Model | PPL ↓ | MAUVE ↑ | Latent PPL ↓ | ROUGE-2 ↑ | Edit ↓ |
|---|---|---|---|---|---|---|
| GPT-2 Small | Oracle | 23.92 | - | - | - | - |
| GPT-2 Small | None (baseline) | 32.89 | 0.20 | 438.10 | 0.013 | 4.26 |
| GPT-2 Small | Fixed (baseline) | 26.69 | 0.38 | 352.8 | 0.015 | 3.88 |
| GPT-2 Small | Predicted-OA (proposed) | 26.94 | 0.45 | 91.6 | 0.019 | 3.69 |
| GPT-2 Small | Predicted-PA (proposed) | 25.55 | 0.44 | 205.9 | 0.017 | 3.78 |
| OLMo 1B | Oracle | 10.99 | - | - | - | - |
| OLMo 1B | None (baseline) | 13.08 | 0.14 | 374.22 | 0.019 | 4.83 |
| OLMo 1B | Fixed (baseline) | 11.81 | 0.45 | 250.90 | 0.022 | 3.42 |
| OLMo 1B | Predicted-OA (proposed) | 11.99 | 0.43 | 74.2 | 0.026 | 3.39 |
| OLMo 1B | Predicted-PA (proposed) | 11.46 | 0.56 | 178.2 | 0.025 | 3.31 |

5.2 External vs. Internal Planner

Conducting a fair comparison between the internal and external planner as described in Appendix C, we find that the external planner performs 1.22 PPL better than the internal planner by Wang et al. (2023b) if we restrict finetuning to parameter-efficient finetuning with LoRA, and 0.81 PPL better with full finetuning.

5.3 Ablations

Model Architecture

In Table 2, we perform an ablation showing the contribution of several architecture components. Having a multi-vector representation improves the performance of the planner, which in turn is slightly beneficial when conditioning the language model on predicted actions. As expected, initializing the action embeddings with the respective cluster centroids is also crucial to the performance of the model. While initializing the planner with action embeddings helped in preliminary experiments at small scale, the effect disappears with more training data.

Clustering

Figure 2 shows the model behavior as we increase the number of possible actions. The perplexity improves up to 8192 actions, after which it deteriorates again. This is explained by the tradeoff between informativeness and predictability: more clusters correspond to less compression of the information contained in the sentence embeddings. On the other hand, more available actions make the prediction task harder, as indicated by an increasing average rank.

In Tables 6 and 7 in the Appendix, we analyze the ten largest clusters qualitatively. We find that they often correspond well to structures typically found in a Wikipedia article. For example, an article about a person typically contains information about the person's origin, education, and professional life in a certain order, which a planner would be able to predict in terms of writing actions. Some actions, however, are topical rather than structural. For example, cluster #7 contains mentions of Italian people and places. This indicates that our procedure for generating abstract writing actions (see Section 3.1) can be improved further, e.g. through an encoder whose representations contain more structural rather than topical information.

Table 2: Architecture ablation (planner accuracy, average rank, and LM perplexity).

| Model | Acc. ↑ | Rank ↓ | PPL ↓ |
|---|---|---|---|
| full model | 24.7 | 36.4 | 25.55 |
| no sent rep | 21.9 | 38.3 | 25.59 |
| no P init | 24.3 | 37.3 | 25.51 |
| no AD init | 24.7 | 36.4 | 27.29 |

Figure 2: Planner accuracy, average rank, and language model perplexity as a function of the number of actions (clusters).

The largest clusters with content labels and example sentences:

| # | Content | Example sentence |
|---|---|---|
| 1 | Demographics | The rural population was 693,566 (89.03%) and urban 85,496 (10.97%). |
| 2 | Plot | During a private argument about whether or not they should move, Elaine reveals she is pregnant. |
| 3 | Business | 2010: Snecma and GE formed CFM Materials as a 50/50 joint venture. |
| 4 | Geography | Khuzama is situated in Jakhama circle of Kohima District in Nagaland. |
| 5 | Schools | The district consists of an elementary school, a middle school, and a high school. |
| 6 | Education | He has a bachelor's degree from the University of Florida. |
| 7 | Italy | Maffeo Olivieri (1484, Brescia — 1543 or 1544) was an Italian sculptor and wood carver. |
| 8 | Acronyms | Comput. |
| 9 | Origin | Shokler was born in Cincinnati on April 27, 1896. |
| 10 | Career | In 1954 he played an impressionable navigator opposite Gregory Peck in The Purple Plain. |

6 Discussion

The experimental results (Section 5.1) confirm that our method is able to enhance language model performance by integrating externally predicted writing actions (H1) inferred from unlabeled data. This finding holds both for small LMs and for more powerful, larger LMs. Naturally, the effect is smaller for larger models because their general language modeling ability is already stronger.

As evidenced by the experimental results of a fair comparison (Section 5.2), the externality of the planner (H2) is beneficial. Besides performance, an external planner enables greater modularity, such that the language model and the planner can be developed and reused independently.

The enhanced LM ability of our method can be attributed to a stronger ability to plan its writing structurally (H3): conditioning on predicted writing actions causes the model to follow the ground-truth text structure, as indicated by a lower Levenshtein distance and latent perplexity in comparison to finetuning without a planner (Section 5.1). Our qualitative analysis of clusters (Section 5.3) suggests that writing actions indeed correspond to common structures in the dataset.

Finally, our method's performance depends on the technical contributions that we have made, such as a better neural architecture for predicting writing actions (Section 5.3) and conditioning the LM on predicted actions rather than ground-truth actions during training (Section 5.1).

6.1 Limitations

Our study is limited in some ways. First, while we try to show the generality of our method through experiments with a more powerful, medium-sized OLMo 1B model, due to computational constraints we cannot train models of even larger scale and strength. Moreover, while we trained on the Wikipedia corpus, which is often considered general-purpose, its diversity is still considerably more limited than that of the datasets modern LLMs are trained on, which typically subsume a snapshot of the web. Finally, since LMs only achieve nontrivial performance on zero- and few-shot tasks after large amounts of pretraining compute, we do not evaluate the benefit of our method on such downstream tasks. In the future, we want to apply our method to an LLM of medium size (e.g. OLMo 7B) finetuned on a diverse dataset (e.g. Dolma (Soldaini et al., 2024)).

Second, while we improve over a standard architecture with a model better suited for predicting future text, there are several obvious avenues for improvement that remain unexplored.Since we obtain abstract writing actions via clustering in a text embedding space, they do not necessarily encode information that is best suited to help language modeling.We view learning actions tailored to this use case as a promising research direction.Moreover, we want to explore planning multiple text units in the future and using a search method such as beam search or Monte Carlo Tree Search at inference time to improve planning and find the action that optimizes downstream LM performance.

7 Conclusion

Large Language Models have impressive capabilities, but they lack the ability to plan their writing well.Our proposed method utilizes an independently trained, external planner module, whose outputs can be used to inform and improve the generation process of the language model.Crucially, the planner is trained from unlabeled data in a self-supervised fashion.These qualities enable researchers to develop and train new planners at large scale and publicly share them, allowing anyone to integrate new planning capabilities with the language model of their choice.Hence, we envision a future where progress on planning algorithms advances at a similar speed as language models today.

Acknowledgments

This research was financed by the CALCULUS project—Commonsense and Anticipation enriched Learning of Continuous representations—European Research Council Advanced Grant H2020-ERC-2017-ADG 788506, http://calculus-project.eu/.

References

  • Arthur & Vassilvitskii (2007)David Arthur and Sergei Vassilvitskii.k-means++: the advantages of careful seeding.In SODA, pp. 1027–1035. SIAM, 2007.
  • Bengio etal. (2000)Yoshua Bengio, Réjean Ducharme, and Pascal Vincent.A neural probabilistic language model.pp. 932–938, 2000.
  • Bengio etal. (2021)Yoshua Bengio, Yann LeCun, and GeoffreyE. Hinton.Deep learning for AI.Commun. ACM, 64(7):58–65, 2021.
  • Brown et al. (2020) Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. In NeurIPS, 2020.
  • Browne etal. (2012)Cameron Browne, EdwardJack Powley, Daniel Whitehouse, SimonM. Lucas, PeterI.Cowling, Philipp Rohlfshagen, Stephen Tavener, DiegoPerez Liebana, SpyridonSamothrakis, and Simon Colton.A survey of monte carlo tree search methods.IEEE Trans. Comput. Intell. AI Games, 4(1):1–43, 2012.
  • Carta etal. (2023)Thomas Carta, Clément Romac, Thomas Wolf, Sylvain Lamprier, OlivierSigaud, and Pierre-Yves Oudeyer.Grounding large language models in interactive environments withonline reinforcement learning.In ICML, volume 202 of Proceedings of Machine LearningResearch, pp. 3676–3713. PMLR, 2023.
  • Chen etal. (2023)Zhipeng Chen, Kun Zhou, Beichen Zhang, Zheng Gong, Xin Zhao, and Ji-Rong Wen.Chatcot: Tool-augmented chain-of-thought reasoning on chat-basedlarge language models.In EMNLP (Findings), pp. 14777–14790. Association forComputational Linguistics, 2023.
  • Cobbe etal. (2021)Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, LukaszKaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano,Christopher Hesse, and John Schulman.Training Verifiers to Solve Math Word Problems, November2021.URL http://arxiv.org/abs/2110.14168.arXiv:2110.14168 [cs].
  • Deng & Raffel (2023)Haikang Deng and Colin Raffel.Reward-augmented decoding: Efficient controlled text generation witha unidirectional reward model.In Houda Bouamor, Juan Pino, and Kalika Bali (eds.),Proceedings of the 2023 Conference on Empirical Methods in NaturalLanguage Processing, pp. 11781–11791, Singapore, December 2023.Association for Computational Linguistics.doi: 10.18653/v1/2023.emnlp-main.721.URL https://aclanthology.org/2023.emnlp-main.721.
  • Deng etal. (2022)Yuntian Deng, Volodymyr Kuleshov, and Alexander Rush.Model criticism for long-form text generation.In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.),Proceedings of the 2022 Conference on Empirical Methods in NaturalLanguage Processing, pp. 11887–11912, Abu Dhabi, United Arab Emirates,December 2022. Association for Computational Linguistics.doi: 10.18653/v1/2022.emnlp-main.815.URL https://aclanthology.org/2022.emnlp-main.815.
  • Djolonga etal. (2020)Josip Djolonga, Mario Lucic, Marco Cuturi, Olivier Bachem, Olivier Bousquet,and Sylvain Gelly.Precision-recall curves using information divergence frontiers.In International Conference on Artificial Intelligence andStatistics, pp. 2550–2559. PMLR, 2020.
  • Evans (1984)Jonathan StBT Evans.Heuristic and analytic processes in reasoning.British Journal of Psychology, 75(4):451–468, 1984.
  • Groeneveld et al. (2024) Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, et al. OLMo: Accelerating the science of language models. arXiv preprint arXiv:2402.00838, 2024.
  • Guo etal. (2022)Han Guo, Bowen Tan, Zhengzhong Liu, Eric Xing, and Zhiting Hu.Efficient (soft) Q-learning for text generation with limited gooddata.In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.),Findings of the Association for Computational Linguistics: EMNLP 2022,pp. 6969–6991, Abu Dhabi, United Arab Emirates, December 2022. Associationfor Computational Linguistics.doi: 10.18653/v1/2022.findings-emnlp.518.URL https://aclanthology.org/2022.findings-emnlp.518.
  • Guu etal. (2020)Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang.Retrieval augmented language model pre-training.In ICML, volume 119 of Proceedings of Machine LearningResearch, pp. 3929–3938. PMLR, 2020.
  • Hao etal. (2023)Shibo Hao, YiGu, Haodi Ma, JoshuaJiahua Hong, Zhen Wang, DaisyZhe Wang, andZhiting Hu.Reasoning with language model is planning with world model.In EMNLP, pp. 8154–8173. Association for ComputationalLinguistics, 2023.
  • Havrilla etal. (2024)Alex Havrilla, Yuqing Du, SharathChandra Raparthy, Christoforos Nalmpantis,Jane Dwivedi-Yu, Maksym Zhuravinskyi, Eric Hambro, Sainbayar Sukhbaatar, andRoberta Raileanu.Teaching large language models to reason with reinforcement learning.arXiv preprint arXiv:2403.04642, 2024.
  • Hendrycks etal. (2021)Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, DawnSong, and Jacob Steinhardt.Measuring massive multitask language understanding.In ICLR. OpenReview.net, 2021.
  • Ho etal. (2024)Anson Ho, Tamay Besiroglu, Ege Erdil, David Owen, Robi Rahman, ZifanCarl Guo,David Atkinson, Neil Thompson, and Jaime Sevilla.Algorithmic progress in language models.arXiv preprint arXiv:2403.05812, 2024.
  • Hoffmann etal. (2022)Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, TrevorCai, Eliza Rutherford, Diego deLasCasas, LisaAnne Hendricks, JohannesWelbl, Aidan Clark, Tom Hennigan, Eric Noland, Katherine Millican, Georgevanden Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan,Erich Elsen, Oriol Vinyals, JackW. Rae, and Laurent Sifre.An empirical analysis of compute-optimal large language modeltraining.In NeurIPS, 2022.
  • Honnibal etal. (2020)Matthew Honnibal, Ines Montani, Sofie VanLandeghem, Adriane Boyd, etal.spacy: Industrial-strength natural language processing in python.2020.
  • Hu etal. (2022)EdwardJ. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li,Shean Wang, LuWang, and Weizhu Chen.Lora: Low-rank adaptation of large language models.In ICLR. OpenReview.net, 2022.
  • Ippolito etal. (2020)Daphne Ippolito, David Grangier, Douglas Eck, and Chris Callison-Burch.Toward better storylines with sentence-level language models.In ACL, pp. 7472–7478. Association for ComputationalLinguistics, 2020.
  • Jhamtani & Berg-Kirkpatrick (2020)Harsh Jhamtani and Taylor Berg-Kirkpatrick.Narrative text generation with a latent discrete plan.In EMNLP (Findings), volume EMNLP 2020 of Findingsof ACL, pp. 3637–3650. Association for Computational Linguistics, 2020.
  • Kahneman (2011)Daniel Kahneman.Thinking, fast and slow.macmillan, 2011.
  • Khandelwal etal. (2020)Urvashi Khandelwal, Omer Levy, Dan Jurafsky, Luke Zettlemoyer, and Mike Lewis.Generalization through memorization: Nearest neighbor languagemodels.In ICLR. OpenReview.net, 2020.
  • Kim etal. (2023)Geunwoo Kim, Pierre Baldi, and Stephen McAleer.Language models can solve computer tasks.In NeurIPS, 2023.
  • Kojima etal. (2022)Takeshi Kojima, ShixiangShane Gu, Machel Reid, Yutaka Matsuo, and YusukeIwasawa.Large language models are zero-shot reasoners.In NeurIPS, 2022.
  • Lau etal. (2017)JeyHan Lau, Timothy Baldwin, and Trevor Cohn.Topically driven neural language model.In ACL (1), pp. 355–365. Association for ComputationalLinguistics, 2017.
  • Levenshtein etal. (1966)VladimirI Levenshtein etal.Binary codes capable of correcting deletions, insertions, andreversals.In Soviet physics doklady, volume10, pp. 707–710. SovietUnion, 1966.
  • Lewis etal. (2020)Patrick S.H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, VladimirKarpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih,Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela.Retrieval-augmented generation for knowledge-intensive NLP tasks.In NeurIPS, 2020.
  • Lin (2004)Chin-Yew Lin.Rouge: A package for automatic evaluation of summaries.In Text summarization branches out, pp. 74–81, 2004.
  • Long (2023)Jieyi Long.Large language model guided tree-of-thought, 2023.URL https://arxiv.org/abs/2305.08291.
  • Marfurt & Henderson (2021)Andreas Marfurt and James Henderson.Sentence-level planning for especially abstractive summarization.In Proceedings of the Third Workshop on New Frontiers inSummarization, pp. 1–14, 2021.
  • Merity etal. (2017)Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher.Pointer sentinel mixture models.In ICLR (Poster). OpenReview.net, 2017.
  • Miculicich etal. (2018)Lesly Miculicich, Dhananjay Ram, Nikolaos Pappas, and James Henderson.Document-level neural machine translation with hierarchical attentionnetworks.In EMNLP, pp. 2947–2954. Association for ComputationalLinguistics, 2018.
  • Mikolov etal. (2010)Tomás Mikolov, Martin Karafiát, Lukás Burget, JanCernocký, and Sanjeev Khudanpur.Recurrent neural network based language model.In INTERSPEECH, pp. 1045–1048. ISCA, 2010.
  • Mikolov etal. (2011)Tomás Mikolov, Stefan Kombrink, Lukás Burget, Jan Cernocký,and Sanjeev Khudanpur.Extensions of recurrent neural network language model.In ICASSP, pp. 5528–5531. IEEE, 2011.
  • Mudgal etal. (2024)Sidharth Mudgal, Jong Lee, Harish Ganapathy, YaGuang Li, Tao Wang, YanpingHuang, Zhifeng Chen, Heng-Tze Cheng, Michael Collins, Trevor Strohman, JilinChen, Alex Beutel, and Ahmad Beirami.Controlled decoding from language models.In Forty-first International Conference on Machine Learning,2024.URL https://openreview.net/forum?id=bVIcZb7Qa0.
  • Nallapati etal. (2016)Ramesh Nallapati, Bowen Zhou, CíceroNogueira dos Santos, ÇaglarGülçehre, and Bing Xiang.Abstractive text summarization using sequence-to-sequence rnns andbeyond.In CoNLL, pp. 280–290. ACL, 2016.
  • Nye etal. (2021)MaxwellI. Nye, MichaelHenry Tessler, JoshuaB. Tenenbaum, and BrendenM.Lake.Improving coherence and consistency in neural sequence models withdual-system, neuro-symbolic reasoning.In NeurIPS, pp. 25192–25204, 2021.
  • Paranjape etal. (2023)Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, LukeZettlemoyer, and MarcoTulio Ribeiro.Art: Automatic multi-step reasoning and tool-use for large languagemodels, 2023.URL https://arxiv.org/abs/2303.09014.
  • Paszke etal. (2019)Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, GregoryChanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, etal.Pytorch: An imperative style, high-performance deep learning library.Advances in neural information processing systems, 32, 2019.
  • Pedregosa et al. (2011) F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
  • Perez-Beltrachini etal. (2019)Laura Perez-Beltrachini, Yang Liu, and Mirella Lapata.Generating summaries with topic templates and structuredconvolutional decoders.In ACL (1), pp. 5107–5116. Association forComputational Linguistics, 2019.
  • Pillutla etal. (2021)Krishna Pillutla, Swabha Swayamdipta, Rowan Zellers, John Thickstun, SeanWelleck, Yejin Choi, and Zaid Harchaoui.Mauve: Measuring the gap between neural text and human text usingdivergence frontiers.Advances in Neural Information Processing Systems,34:4816–4828, 2021.
  • Radford (2018)Alec Radford.Improving Language Understanding by GenerativePre-Training.2018.URLhttps://www.semanticscholar.org/paper/Improving-Language-Understanding-by-Generative-Radford/cd18800a0fe0b668a1cc19f2ec95b5003d0a5035.
  • Radford etal. (2019)Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, IlyaSutskever, etal.Language models are unsupervised multitask learners.OpenAI blog, 1(8):9, 2019.
  • Reif etal. (2022)Emily Reif, Daphne Ippolito, Ann Yuan, Andy Coenen, Chris Callison-Burch, andJason Wei.A recipe for arbitrary text style transfer with large languagemodels.In ACL (2), pp. 837–848. Association for ComputationalLinguistics, 2022.
  • Reimers & Gurevych (2019)Nils Reimers and Iryna Gurevych.Sentence-bert: Sentence embeddings using siamese bert-networks.In EMNLP/IJCNLP (1), pp. 3980–3990. Association forComputational Linguistics, 2019.
  • Schick etal. (2023)Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, MariaLomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom.Toolformer: Language models can teach themselves to use tools.In NeurIPS, 2023.
  • Schrittwieser etal. (2020)Julian Schrittwieser, Ioannis Antonoglou, Thomas Hubert, Karen Simonyan,Laurent Sifre, Simon Schmitt, Arthur Guez, Edward Lockhart, Demis Hassabis,Thore Graepel, TimothyP. Lillicrap, and David Silver.Mastering atari, go, chess and shogi by planning with a learnedmodel.Nat., 588(7839):604–609, 2020.
  • Shen etal. (2023)BoShen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, AnFu, MuhanZeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, etal.Pangu-coder2: Boosting large language models for code with rankingfeedback.arXiv preprint arXiv:2307.14936, 2023.
  • Soldaini etal. (2024)Luca Soldaini, Rodney Kinney, Aksh*ta Bhagia, Dustin Schwenk, David Atkinson,Russell Authur, Ben Bogin, Khyathi Chandu, Jennifer Dumas, Yanai Elazar,Valentin Hofmann, AnanyaHarsh Jha, Sachin Kumar, LiLucy, Xinxi Lyu, NathanLambert, Ian Magnusson, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik,Crystal Nam, MatthewE. Peters, Abhilasha Ravichander, Kyle Richardson,Zejiang Shen, Emma Strubell, Nishant Subramani, Oyvind Tafjord, Pete Walsh,Luke Zettlemoyer, NoahA. Smith, Hannaneh Hajishirzi, IzBeltagy, DirkGroeneveld, Jesse Dodge, and Kyle Lo.Dolma: An Open Corpus of Three Trillion Tokens for Language ModelPretraining Research.arXiv preprint, 2024.URL https://arxiv.org/abs/2402.00159.
  • Song etal. (2020)Kaitao Song, XuTan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu.Mpnet: Masked and permuted pre-training for language understanding.In NeurIPS, 2020.
  • Sun etal. (2024)Simeng Sun, Yang Liu, Shuohang Wang, Dan Iter, Chenguang Zhu, and Mohit Iyyer.PEARL: prompting large language models to plan and execute actionsover long documents.In EACL (1), pp. 469–486. Association for ComputationalLinguistics, 2024.
  • Sutton & Barto (2018)RichardS Sutton and AndrewG Barto.Reinforcement learning: An introduction.MIT press, 2018.
  • Tan etal. (2017)Jiwei Tan, Xiaojun Wan, and Jianguo Xiao.Abstractive document summarization with a graph-based attentionalneural model.In ACL (1), pp. 1171–1181. Association forComputational Linguistics, 2017.
  • Valmeekam etal. (2023)Karthik Valmeekam, Matthew Marquez, Sarath Sreedharan, and SubbaraoKambhampati.On the planning abilities of large language models - A criticalinvestigation.In NeurIPS, 2023.
  • Vaswani etal. (2017)Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones,AidanN. Gomez, Lukasz Kaiser, and Illia Polosukhin.Attention is all you need.In NIPS, pp. 5998–6008, 2017.
  • Vinyals etal. (2015)Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly.Pointer networks.In NIPS, pp. 2692–2700, 2015.
  • Wang etal. (2023a)Lei Wang, Wanyu Xu, Yihuai Lan, Zhiqiang Hu, Yunshi Lan, RoyKa-Wei Lee, andEe-Peng Lim.Plan-and-solve prompting: Improving zero-shot chain-of-thoughtreasoning by large language models.In ACL (1), pp. 2609–2634. Association forComputational Linguistics, 2023a.
  • Wang etal. (2019)Wenlin Wang, Zhe Gan, Hongteng Xu, Ruiyi Zhang, Guoyin Wang, Dinghan Shen,Changyou Chen, and Lawrence Carin.Topic-guided variational auto-encoder for text generation.In NAACL-HLT (1), pp. 166–177. Association forComputational Linguistics, 2019.
  • Wang etal. (2023b)Xinyi Wang, Lucas Caccia, Oleksiy Ostapenko, Xingdi Yuan, and AlessandroSordoni.Guiding Language Model Reasoning with Planning Tokens,December 2023b.URL http://arxiv.org/abs/2310.05707.arXiv:2310.05707 [cs].
  • Wang etal. (2022a)Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, EdChi, and Denny Zhou.Rationale-augmented ensembles in language models, 2022a.URL https://arxiv.org/abs/2207.00747.
  • Wang etal. (2022b)Xuezhi Wang, Jason Wei, Dale Schuurmans, QuocV Le, EdH. Chi, Sharan Narang,Aakanksha Chowdhery, and Denny Zhou.Self-consistency improves chain of thought reasoning in languagemodels, 2022b.URL https://openreview.net/forum?id=1PL1NIMMrw.
  • Wei etal. (2022a)Jason Wei, YiTay, Rishi Bommasani, Colin Raffel, Barret Zoph, SebastianBorgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, EdH.Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and WilliamFedus.Emergent abilities of large language models.Trans. Mach. Learn. Res., 2022, 2022a.
  • Wei etal. (2022b)Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia,EdH. Chi, QuocV. Le, and Denny Zhou.Chain-of-thought prompting elicits reasoning in large languagemodels.In NeurIPS, 2022b.
  • Wolf etal. (2020)Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue,Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz,Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite,Julien Plu, Canwen Xu, TevenLe Scao, Sylvain Gugger, Mariama Drame, QuentinLhoest, and AlexanderM. Rush.Transformers: State-of-the-art natural language processing.In EMNLP (Demos), pp. 38–45. Association forComputational Linguistics, 2020.
  • Xue etal. (2023)Tianci Xue, Ziqi Wang, Zhenhailong Wang, Chi Han, Pengfei Yu, and Heng Ji.Rcot: Detecting and rectifying factual inconsistency in reasoning byreversing chain-of-thought, 2023.URL https://arxiv.org/abs/2305.11499.
  • Yang & Klein (2021)Kevin Yang and Dan Klein.FUDGE: Controlled text generation with future discriminators.In Kristina Toutanova, Anna Rumshisky, Luke Zettlemoyer, DilekHakkani-Tur, IzBeltagy, Steven Bethard, Ryan Cotterell, Tanmoy Chakraborty,and Yichao Zhou (eds.), Proceedings of the 2021 Conference of the NorthAmerican Chapter of the Association for Computational Linguistics: HumanLanguage Technologies, pp. 3511–3535, Online, June 2021. Association forComputational Linguistics.doi: 10.18653/v1/2021.naacl-main.276.URL https://aclanthology.org/2021.naacl-main.276.
  • Yang etal. (2023)Sherry Yang, Ofir Nachum, Yilun Du, Jason Wei, Pieter Abbeel, and DaleSchuurmans.Foundation models for decision making: Problems, methods, andopportunities.arXiv preprint arXiv:2303.04129, 2023.
  • Yang etal. (2016)Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, AlexanderJ. Smola, andEduardH. Hovy.Hierarchical attention networks for document classification.In HLT-NAACL, pp. 1480–1489. The Association forComputational Linguistics, 2016.
  • Yao etal. (2023)Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, andKarthik Narasimhan.Tree of thoughts: Deliberate problem solving with large languagemodels.In NeurIPS, 2023.
  • Zhang etal. (2024a)Hanqing Zhang, Haolin Song, Shaoyu Li, Ming Zhou, and Dawei Song.A survey of controllable text generation using transformer-basedpre-trained language models.ACM Comput. Surv., 56(3):64:1–64:37,2024a.
  • Zhang etal. (2023)Renrui Zhang, Jiaming Han, Aojun Zhou, Xiangfei Hu, Shilin Yan, Pan Lu,Hongsheng Li, Peng Gao, and YuQiao.Llama-adapter: Efficient fine-tuning of language models withzero-init attention.arXiv preprint arXiv:2303.16199, 2023.
  • Zhang etal. (2024b)Tianyi Zhang, Faisal Ladhak, Esin Durmus, Percy Liang, Kathleen McKeown, andTatsunoriB. Hashimoto.Benchmarking Large Language Models for News Summarization.Transactions of the Association for Computational Linguistics,12:39–57, 01 2024b.ISSN 2307-387X.doi: 10.1162/tacl˙a˙00632.URL https://doi.org/10.1162/tacl_a_00632.
  • Zhao etal. (2023)Ruochen Zhao, Xingxuan Li, Shafiq Joty, Chengwei Qin, and Lidong Bing.Verify-and-edit: A knowledge-enhanced chain-of-thought framework.In ACL (1), pp. 5823–5840. Association forComputational Linguistics, 2023.
  • Zheng etal. (2023a)Carolina Zheng, Keyon Vafa, and David Blei.Revisiting topic-guided language models.Transactions on Machine Learning Research, 2023a.
  • Zheng etal. (2023b)Chuanyang Zheng, Zhengying Liu, Enze Xie, Zhenguo Li, and YuLi.Progressive-hint prompting improves reasoning in large languagemodels, 2023b.URL https://arxiv.org/abs/2304.09797.
  • Zhu etal. (2023)Wenhao Zhu, Hongyi Liu, Qingxiu Dong, Jingjing Xu, Lingpeng Kong, Jiajun Chen,Lei Li, and Shujian Huang.Multilingual machine translation with large language models:Empirical results and analysis.arXiv preprint arXiv:2304.04675, 2023.

Appendix A Hyperparameters and Experimental Details

Implementation

Data processing

All of our experiments are based on a cleaned Wikipedia dump from March of 2022 that is publicly available via Huggingface (https://huggingface.co/datasets/wikipedia). We do not apply any further preprocessing to the dataset.

For generating abstract writing actions (Section 3.1), we first use spaCy (Honnibal et al., 2020) to split every article into sentences. Then, we use MPNet-base-v2 (Song et al., 2020) via the SentenceTransformer library (Reimers & Gurevych, 2019) (https://huggingface.co/sentence-transformers/all-mpnet-base-v2) to encode sentences into embeddings. For clustering, we use the existing k-means implementation from Scikit-Learn (Pedregosa et al., 2011) with k-means++ initialization (Arthur & Vassilvitskii, 2007) and run it for the default maximum of 300 iterations.
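For concreteness, the following is a minimal sketch of this preprocessing pipeline; the en_core_web_sm spaCy pipeline, the variable names, and the commented-out fitting call are illustrative assumptions, while the embedding model and clustering settings follow the description above and Table 4.

    import spacy
    from sentence_transformers import SentenceTransformer
    from sklearn.cluster import KMeans

    nlp = spacy.load("en_core_web_sm")   # assumed English pipeline; the text above only states that spaCy is used
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

    # Fitting the action space on the training corpus (1024 clusters, k-means++ init,
    # default max_iter=300), given an array all_sentence_embeddings of shape (num_sentences, 768):
    # kmeans = KMeans(n_clusters=1024, init="k-means++", max_iter=300).fit(all_sentence_embeddings)

    def article_to_actions(article_text: str, kmeans: KMeans) -> list[int]:
        """Split an article into sentences and map each sentence to an abstract writing action."""
        sentences = [s.text for s in nlp(article_text).sents]
        embeddings = encoder.encode(sentences)        # one 768-dimensional vector per sentence
        return kmeans.predict(embeddings).tolist()    # cluster index = writing action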

Neural network implementation

All methods compared in our paper are implemented within the same PyTorch-based (Paszke et al., 2019) framework. Language models are loaded via the Huggingface Transformers library (Wolf et al., 2020). Experiments with the GPT-2 model (https://huggingface.co/openai-community/gpt2) were run on a single 12GB GPU, whereas experiments including OLMo 1B (https://huggingface.co/allenai/OLMo-1B) were run on a single 24GB GPU.

Evaluation

To compute the Levenshtein distance, we use the python-Levenshtein package (https://pypi.org/project/python-Levenshtein/) with default settings. For ROUGE and MAUVE, we use the Python packages 'rouge' and 'mauve-text', respectively. For latent perplexity, we made minor changes to the implementation by Deng et al. (2022), available at https://github.com/florianmai/criticize_text_generation.
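The following is a hedged sketch of how these packages can be invoked through their default APIs; the placeholder strings are illustrative, and any settings beyond the defaults may differ from what was used for the reported numbers.

    import Levenshtein                  # pip install python-Levenshtein
    import mauve                        # pip install mauve-text
    from rouge import Rouge             # pip install rouge

    generated = "generated continuation ..."   # placeholder strings
    reference = "true continuation ..."

    # Edit distance between the two strings with default settings.
    edit_distance = Levenshtein.distance(generated, reference)

    # ROUGE-2 F1 between a generated text and its reference.
    rouge2_f = Rouge().get_scores(generated, reference)[0]["rouge-2"]["f"]

    # MAUVE compares a collection of model generations against human references.
    mauve_out = mauve.compute_mauve(p_text=[reference], q_text=[generated])
    print(edit_distance, rouge2_f, mauve_out.mauve)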

Hyperparameters

Table 4 shows hyperparameters used for our experiments.

Hyperparameter                                  Value
Context window size                             128
Train / test / val split sizes                  285310 / 1000 / 1000
K-means initialization                          k-means++
Number of tokens generated for edit distance    128
Default action count                            1024
Action embedding dimension                      768
Language Model Finetuning
    Batch size                                  32
    Learning rate (lr)                          1e-4
Planner Training
    Batch size                                  32
    Learning rate (lr)                          1e-4

Appendix B Detailed ROUGE and Edit Distance Results

Table 5 shows the ROUGE-2 scores and edit distances between the true and generated continuations when the generations are cut off at different lengths. Because the edit distance scales linearly with the number of tokens, we normalize the results: for 128 tokens we report the actual edit distance, for 256 tokens the edit distance divided by 2, and so on.

Base-LM   Metric      Model           128      256      512      1024     2048
GPT-2     ROUGE-2 ↑   None            0.0125   0.0134   0.0136   0.013    0.0116
                      Fixed           0.0148   0.0167   0.0165   0.0154   0.01
                      Predicted-OA    0.0185   0.0205   0.0211   0.0196   0.02
                      Predicted-PA    0.0166   0.0178   0.0184   0.0169   0.01
          Edit ↓      None            4.82     4.42     4.12     4.01     3.94
                      Fixed           4.63     4.09     3.76     3.50     3.39
                      Predicted-OA    4.39     3.88     3.55     3.35     3.25
                      Predicted-PA    4.45     3.92     3.67     3.50     3.37
OLMo      ROUGE-2 ↑   None            0.0212   0.021    0.0194   0.0168   0.0141
                      Fixed           0.0242   0.0247   0.023    0.0203   0.0165
                      Predicted-OA    0.0275   0.0296   0.0286   0.0255   0.0209
                      Predicted-PA    0.0277   0.0284   0.0273   0.024    0.0192
          Edit ↓      None            5.01     4.68     4.71     4.84     4.92
                      Fixed           4.21     3.69     3.38     3.07     2.77
                      Predicted-OA    4.15     3.64     3.29     3.06     2.84
                      Predicted-PA    4.2      3.64     3.25     2.89     2.55
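As a small illustration of the length normalization described above, the following sketch (with our own function name) divides the raw distance by the ratio of the generation length to the 128-token base length.

    import Levenshtein

    # The raw edit distance grows roughly linearly with the generation length,
    # so we divide by (num_tokens / 128): 128 -> /1, 256 -> /2, 512 -> /4, ...
    def normalized_edit_distance(generated: str, reference: str, num_tokens: int) -> float:
        raw = Levenshtein.distance(generated, reference)
        return raw / (num_tokens / 128)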

Appendix C Internal vs External Planner Setup

We compare our approach, which uses an external planner, to using the language model itself as the planner, as is done in Wang et al. (2023b).

If $V$ is the vocabulary size, $W$ the number of writing actions, and $h$ the hidden dimension, using the language model as planner entails expanding the $V \times h$-dimensional final prediction layer to a $(V+W) \times h$-dimensional layer. Just training those extra $W \times h$ parameters does not suffice to let the language model learn to predict planning tokens, which is why Wang et al. (2023b) also finetune the existing language model parameters (either full finetuning, or parameter-efficient finetuning with LoRA (Hu et al., 2022)).

Besides having an external planner, our model also differs from Wang et al. (2023b) in the way predicted actions are presented to the model. We present the predicted writing actions to the language model with the adapter-style conditioning module detailed in Section 3.3, whereas Wang et al. (2023b) insert the actions like tokens in between sentences. The insert style requires extending the input embedding matrix of the language model in the same way as the final prediction layer above.
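For concreteness, the following is a minimal sketch of this extension in Huggingface Transformers; the <ACT_i> special-token naming and the use of resize_token_embeddings illustrate the general recipe rather than the exact implementation of Wang et al. (2023b).

    from transformers import GPT2LMHeadModel, GPT2TokenizerFast

    W = 1024   # number of writing actions (default action count, Table 4)
    tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")

    V = len(tokenizer)   # original vocabulary size
    tokenizer.add_special_tokens(
        {"additional_special_tokens": [f"<ACT_{i}>" for i in range(W)]}
    )
    # GPT-2 ties its input embeddings to the final prediction layer, so resizing
    # extends both to (V + W) x h at once.
    model.resize_token_embeddings(len(tokenizer))

    # As noted above, training only the W new rows is not sufficient; the existing
    # parameters (or LoRA adapters) are finetuned as well.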

Consider a sequence of tokens $X = x_1^1 \ldots x_{n_1}^1 \, x_1^2 \ldots x_{n_2}^2 \,\ldots\, x_1^m \ldots x_{n_m}^m$, where $x_i^j$ is the $i$-th token of the $j$-th text unit in the sequence. Let $A = a_1 \ldots a_m$ be the corresponding sequence of writing actions, where each $a_i \in \{0, \dots, W-1\}$. Let $\text{idx-}x_i^j$ denote the index of $x_i^j$ in the vocabulary. We then pass the following indices to the language model:

$$(a_1 + V)\ \text{idx-}x_1^1 \ldots \text{idx-}x_{n_1}^1 \ \ldots\ldots\ (a_m + V)\ \text{idx-}x_1^m \ldots \text{idx-}x_{n_m}^m$$

To isolate the effect of using an external planner versus using the language model as planner, we evaluate our model in an adapted setting where we also finetune the original parameters and condition on planner-predicted actions in insert style rather than adapter style.

Because learning to plan and learning to condition on the plan are intertwined for the internal planner, it is not possible to first train a planner separately and then condition on planner-predicted actions during training. Hence, we condition the internal planner on oracle actions during training.

When training the model to condition on oracle actions, there is a slight mismatch with how it will be used at generation time (when it uses self-predicted actions): during training with oracle actions, the action embedding always corresponds to the centroid closest to the embedding of the text unit that follows it, whereas this does not necessarily hold during generation. One option to reduce this mismatch during generation would be to retroactively change past predicted actions in the context of the language model to match the generated sentence that follows them. We leave the evaluation of this option for future work.

Appendix D Are Oracle Actions Optimal?

An oracle action is the action whose cluster centroid is closest to the concrete embedding of the true next sentence. Table 1 showed that conditioning on an oracle code consistently leads to lower perplexity than conditioning on a fixed or predicted code. A natural question is whether providing the language model with the oracle action leads to the lowest perplexity among all actions.

To test this, we first evaluate every action at every step, resulting in a ranked list of actions per step. Then, we compute the perplexity one would obtain when always choosing the $k$-th best action. We plot the results for every $k$ in Figure 9.
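The following sketch outlines this procedure; perplexity_with_action is a placeholder for a single evaluation of the action-conditioned language model, and averaging per-step perplexities is a simplification of the exact aggregation.

    import numpy as np

    # perplexity_with_action(context, target, a) stands for one evaluation of the
    # action-conditioned language model: the perplexity of the true next text unit
    # when conditioning on action a.
    def ranked_perplexities(steps, W, perplexity_with_action):
        per_step = []
        for context, target in steps:
            ppl = np.array([perplexity_with_action(context, target, a) for a in range(W)])
            per_step.append(np.sort(ppl))          # ascending; column 0 = best action
        return np.stack(per_step)                  # shape: (num_steps, W)

    def avg_ppl_of_kth_best(sorted_ppl, k):
        # Produces one point of the curve in Figure 9 for each k in 0 ... W-1.
        return sorted_ppl[:, k].mean()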

[Figure 9: average perplexity obtained when always choosing the k-th best action, for every k.]

The figure shows that the average perplexity using oracle codes is equivalent to using the 18th best action (out of 1024) in each step, meaning that on average there are 17 actions that would lead to a lower perplexity than the oracle action.

The fact that better-than-oracle actions exist does not guarantee that it is feasible to train a planner that consistently finds them. One possible alternative explanation is that the better-than-oracle actions are better purely by chance: if one made 1024 noisy variations of the oracle action embedding, some of those would also beat the oracle embedding by chance alone. We investigate this alternative hypothesis by adding Gaussian noise ($\mu=0$, $\sigma$ set to the empirical standard deviation across the action embeddings over all layers) to the oracle action embeddings.
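A minimal sketch of this noise ablation, simplified to a single action-embedding matrix rather than the per-layer embeddings described above (shapes are illustrative):

    import torch

    action_embeddings = torch.randn(1024, 768)   # stand-in for the learned action embeddings
    sigma = action_embeddings.std()              # empirical standard deviation

    def noisy_oracle_variants(oracle_embedding: torch.Tensor, n: int = 1024) -> torch.Tensor:
        noise = torch.randn(n, oracle_embedding.shape[-1]) * sigma   # mu = 0
        return oracle_embedding.unsqueeze(0) + noise                 # n perturbed copies of the oracle embedding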

Figure 4 shows that indeed about half of the noise variations outperform the oracle embedding. However, it also shows that the lowest perplexity achievable by selecting the best noise variation is significantly higher than that of selecting the best alternative action. This suggests that random luck is not the full explanation for the success of the better-than-oracle actions, and that it might be feasible to predict them consistently.

[Figure: perplexity of noise-perturbed oracle embeddings compared to the ranked alternative actions, as discussed above.]

If it is indeed feasible, this suggests that the planner can be made more effective by training it to predict the value of an action directly (in terms of the perplexity it leads to), rather than only training it to match oracle actions. A planner that can successfully estimate the values of its actions could then be combined with an algorithm like MCTS (Browne et al., 2012) at inference time for even greater performance gains. We consider this a promising avenue for future work.

Appendix E Examples of Clusters

Table 7 shows an extension of Table 3, i.e., multiple examples per action cluster.

#  | Content      | Example sentences
1  | Demographics | 1: According to the 2011 census, it had 841 inhabitants, all of whom were Albanian. 2: The population was 23 as of 2002. 3: Based on the 2007 national census conducted by the Central Statistical Agency of Ethiopia (CSA), this woreda has a total population of 112,396 of whom 56,245 are men and 56,151 women, no urban inhabitants were reported. 4: 1790 - Population: 16,014. 5: Its population as of 1990 was 344; its population as of 2000 was 680.
2  | Plot         | 1: Feeling angry and betrayed, Nadine separates from Chloe, but they make amends. 2: The pair ultimately decide to move in together after Britney kicks out her previous roommate. 3: After Coré’s death, Shane becomes strange and distant. 4: When Nico finds out that he never showed the police Antonia’s diary she questions their relationship and Philipp’s love for her again. 5: Louise reveals that she has been helping both Bob and Gene with their pranks without the other knowing and announces that she is quitting.
3  | Business     | 1: Then in May 1998, all of PolyGram and its associated labels were purchased by Seagram which announced its plan to integrate PolyGram with UMG to produce an estimated cost savings, within a couple of years, of between US$275 million and $300 million annually. 2: As of April 2020, the ownership of the pipeline was as represented in the table below. 3: Wendy’s International is owned by The Wendy’s Company. 4: October 2007: Nakheel announced the sale of Ireland, and Shanghai in October 2007. 5: In 2016, Endeavor – a Hollywood-based entertainment group – acquired a reported 70%-controlling stake in Frieze, which includes its publishing, art fair and music interests.
4  | Geography    | 1: Districts of Khyber Pakhtunkhwa 2: and in Kyrenia District. 3: Birhors are found mainly in the area covered by the old Hazaribagh, Ranchi and Singhbhum districts before these were broken down into numerous smaller units, in Jharkhand. 4: Perumpanachy falls under Madapally Panchayat and Chanaganacherry Thaluk. 5: It is about from Kuknur and from Lakkundi.
5  | Schools      | 1: The old high school building is now being reused as the middle school. 2: 2 High School is one of the four-star high schools in Jiangsu Province. 3: The school has complete grade school levels, and high school levels. 4: The sole entrance and exit to the school is along Texas State Highway Loop 1604. 5: Students that attend the school come from the central parts of Prince County.
6  | Education    | 1: In 1957, he graduated with a thesis on Henry James. 2: In 1996, he graduated with a BA in Political Science and Government, Philosophy, and Business Administration. 3: He studied at Dalton School in New York and The Putney School in Vermont, later on graduating from Harvard University in 1965, in Far Eastern history. 4: Guinle meanwhile earned a degree in Accountancy from the University of Morón in 2000. 5: In 1986, Darren Barrett attended Berklee College of Music on a full scholarship, receiving a BA in Professional Music in 1990.
7  | Italy        | 1: Non qui, Barbara, nessuno ci sta guardando (1989) 2: Marchesa Olga Benucci Granieri Solaro — Monaldo wife, mother of Vittoria and Costanza and mother-in-law of Martino, Olga is a very caring woman to the image of his own family. 3: Ugo Gregoretti (28 September 1930 – 5 July 2019) was an Italian film, television and stage director, actor, screenwriter, author and television host. 4: This was the second Garibaldi group that succeeded the first group of the same name. 5: Gaetano Marco ”Guy” Nardulli (born May 31, 1974 in Norridge, Illinois) is an American actor and producer who is most associated for his character role as a street fighter turned MMA professional fighter in the 2006 movie -And Then It Breaks with actress Anne Dudek.
8  | Acronyms     | 1: Soc. 2: Dir. 3: Ber. 4: Ed. 5: 2nd ed.
9  | Origin       | 1: Ashley Sessa was born and raised in Schwenksville, Pennsylvania. 2: Born on August 9, 1883, in Reading, Pennsylvania, Daisy Elizabeth Adams was educated in Reading, Pennsylvania. 3: He was born in Keokuk, Iowa in 1877. 4: Ada E. Schnitzer was born November 22, 1887, in Hannibal, Missouri, the daughter of O. C. Schnitzer. 5: Morse was born in Abington, Massachusetts on August 19, 1911.
10 | Career       | 1: He appeared in the 1995 Seinfeld episode ”The Beard” playing Robert’s boss. 2: Tisch also made an appearance on the reality show Shark Tank in season 5. 3: He starred as the surgeon Theodor Hristea, who, after some of his belongings are stolen, involves himself in the inquiry and directs the interrogation of a seemingly innocent man. 4: He also appeared, uncredited, in the movie It’s a Wonderful Life (1946), playing piano in the scene where George Bailey gets thrown out of Nick’s Bar. 5: Clint Eastwood was among the first of many actors to adopt this wandering ronin with no name persona for foreign films, which he used to great effect in his Western roles, especially in Spaghetti Westerns directed by Sergio Leone where he played the Man with No Name, a character similar to Mifune’s seemingly-nameless ronin in Yojimbo.

Appendix F Complexity Analysis

In the following, we discuss the number of parameters that our method requires as well as the computational complexity.

Parameter count

The base language model has approximately 130M (GPT-2) or approximately 1B (OLMo-1B) parameters. The adapter needed for finetuning adds about 14M parameters. The planner consists of a sentence transformer with 110M frozen parameters plus a small Transformer encoder with 6M additional parameters. Having a separate planner with its own sentence representations means we can reuse the same planner for different language models. As a memory-efficient alternative, we can use the sentence representations of the LM (which have slightly worse overall performance), losing the portability advantage.

Model complexity

The complexity (number of computations) of our proposed sentence-based planner (Section 3.2) is $\mathcal{O}(N^2/M)$ for the representation function (the computation of $\mathbf{Z}'_{i-1}$) plus $\mathcal{O}(M^3)$ for the prediction function, where $N$ is the total number of tokens and $M$ the number of sentences. Since the representation function (110M parameters vs. 6M for the prediction function) represents the vast majority of the complexity, invoking the planner is roughly $M$ times less expensive than the language model (GPT-2: 128M parameters, OLMo: 1B parameters), which has complexity $\mathcal{O}(N^2)$.

At training time, because the planner is external to the language model, it can be trained once and henceforth be integrated with a variety of language models. Therefore, we focus on inference-time cost here. The planner consists of a representation function (a sentence transformer) and a prediction function (a Transformer encoder on top of sentence embeddings).

Representation function: The bulk of the compute of the planner is in the representation function (110M parameters). Since our architecture encodes sentences in the previous context independently, sentence embeddings can be efficiently cached to avoid re-encoding when moving from one sentence to another. Computing the sentence embedding with a Transformer-based architecture has quadratic complexity in the length of the sentence. Assuming a simplified view where each sentence has length $N/M$, where $N$ is the total number of tokens in an article and $M$ is the number of sentences, encoding the necessary representations for all calls to the planner in an article has complexity $\mathcal{O}(M \cdot (N/M)^2) = \mathcal{O}(N^2/M)$. In contrast, a non-sentence-based representation (see ablation 'no sent rep' in Table 2) has complexity $\mathcal{O}(N^2)$. Since there are many sentences in an article, the sentence-based planner is considerably more efficient.
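A minimal sketch of this caching argument, using the same sentence transformer as in Appendix A (the caching utility itself is our illustration):

    from functools import lru_cache
    from sentence_transformers import SentenceTransformer

    # Frozen representation function (same model as in Appendix A).
    encoder = SentenceTransformer("sentence-transformers/all-mpnet-base-v2")

    @lru_cache(maxsize=None)
    def sentence_embedding(sentence: str):
        # Each distinct sentence is encoded once; subsequent planner calls within
        # the same article reuse the cached embedding.
        return encoder.encode([sentence])[0]

    def planner_context_embeddings(context_sentences: list[str]):
        return [sentence_embedding(s) for s in context_sentences]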

Prediction function: The additional Transformer encoder is lightweight (6M parameters) and has complexity $\mathcal{O}(M'^2)$, where $M'$ denotes the number of sentences in the current context. Overall, this brings the complexity to $\mathcal{O}(M^3)$ across the article. This could potentially be brought down to $\mathcal{O}(M^2)$ by using a Transformer decoder instead of a Transformer encoder for the prediction function, which allows for caching due to causal masking.

Appendix G Color-coded Example Article

In order to give the reader an impression of how abstract writing actions are distributed within an article, below we print an article that is color-coded with different actions. Different text colors correspond to different actions.

G.1 Article

In 1917, Weeks defeated incumbent John J. Mullen by 230 votes to become Mayor of Everett. The Boston Daily Globe described the race between Mullen and Weeks as ”one of the bitterest campaigns in years” and in his inaugural address, Weeks referred to his predecessor as a ”caterwauling demagogue” and vowed to overturn many of his acts, including firing of Police Chief William E. Hill and the closure of the Everett Tuberculous Hospital. In 1918, Christopher Harrison defeated Weeks by 390 votes, with Mullen, who supported Harrison after being eliminated in the preliminary election, taking credit for ”putting [him] over”.

In 1922, Weeks was the Progressive Party candidate for United States Senate. He finished sixth with less than 1% of the vote. In 1923 Weeks moved to Reading, Massachusetts. However, in 1933, he returned to Everett to run for Mayor. He made the runoff election, but was defeated by another former Mayor, James A. Roche. During the 1934 gubenatoral election Weeks supported Democrat James Michael Curley. In 1935 Curley appointed Weeks to the State Alcoholic Beverage Control Commission. In 1941, Weeks again ran for Mayor of Everett. He finished last in a four candidate primary.

Legal career: In 1922, Weeks defended George H. Mansfield, a former Everett resident who was charged with murdering his lover, Alice Jones. The medical examiner later ruled that Jones committed suicide and District Attorney Thomas C. O’Brien asked the grand jury to return no bill against Mansfield. In 1924, Weeks served as a special counsel for defendants accused of being part of a extortion ring led by former Middlesex County district attorney William J. Corcoran. They were found guilty and Corcoran was sentenced to 7 to 10 years in prison. Weeks also defended Corcoran when he and Daniel H. Coakley were also charged with conspiracy to extort later that year. They were found not guilty on all counts. In 1927 Weeks represented Jerry Gedzium, a convicted murder who was appealing his death sentence. The conviction was upheld by the Massachusetts Supreme Judicial Court.

See also: 1916 Massachusetts legislature

References

1880 births | 1972 deaths | Mayors of Everett, Massachusetts | Massachusetts lawyers | Massachusetts Progressives (1912) | Massachusetts Republicans | Members of the Massachusetts House of Representatives | People from Reading, Massachusetts
