Natural Language Processing with Deep Learning CS224N/Ling284 Christopher Manning (based on a lecture by Antoine Bosselut) Lecture 12: Neural Language Generation
Today: A bit more on projects and Natural Language Generation • A few more final project thoughts and tips 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 2
a. Care with datasets in model development • Many publicly available datasets are released with a train/dev/test structure • If there is no dev set or you want a separate tune set, then you should create one by splitting the training data • Weigh the usefulness of a bigger held-out set against the reduction in train-set size • Cross-validation (q.v.) is a technique for maximizing data when you don’t have much • You build (estimate or train) a model on a training set • You measure progress and avoid overfitting using an independent dev or validation set • If you do that a lot, you overfit to the dev set; it can help to have a second dev2 set • A fixed test set ensures that all systems are assessed against the same gold data. • This is generally good and advised – even if using CV in model development • But it can be problematic when the test set turns out to have unusual properties that distort progress on the task. 3
The need for independent partitions of the data set • The train, tune, dev, and test sets need to be completely distinct • Be alert even to small overlaps, like repeated material due to email replies, etc. • It is invalid to give results testing on material you have trained on • You will get falsely good performance – we almost always overfit on train • You may need an independent tuning set • Hyperparameters that should be set for independent data won’t be set correctly if the tune set is the same as the train set • If you keep running on the same evaluation set, you begin to overfit to it • Effectively you are “training” on the evaluation set … you are learning things that do and don’t work on that particular eval set and you only keep the things that “work” … on that particular eval set • To get a valid measure of system performance you need another independent test set that you haven’t trained or tuned on … hence dev2 and final test sets • We're all on the honor system to do test-set runs only when development is complete • Use the final test set extremely few times … ideally only once 4
b. Getting your neural network to train • Start with a positive attitude! • Neural networks want to learn! • If the network isn’t learning, you’re doing something to prevent it from learning successfully! • Realize the grim reality: • There are lots of things that can cause neural nets to not learn at all or to not learn very well • Finding and fixing them (“debugging and tuning”) can often take a lot more time than implementing your model 😰 • It’s hard to work out what these things are • But experience, experimental care, examining carefully what’s happening inside the model, and rules of thumb all help! 5
Experimental strategy • Work incrementally! • Start with a very simple model and get it to work! • It’s very hard to fix a complex but broken model • Add bells and whistles one-by-one and get the model working with each (if you can) • E.g., from BiDAF: at first leave out the character CNN and the final prediction LSTM and get that working; indeed, maybe you could also leave out the modeling layer at first • Initially run your model on a tiny amount of data • You will see bugs much more easily on a tiny dataset … and it trains really quickly • Something like 4–10 examples is good • Often synthetic data is useful for this • Make sure you can get 100% on this data (testing on train) – a minimal sketch of this sanity check is shown below • Otherwise, your model is definitely either not powerful enough or it is broken 6
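To make the "overfit a tiny dataset" check concrete, here is a minimal PyTorch sketch (my own illustration, not code from the lecture); the 8 random examples and the small classifier are hypothetical placeholders for your data and model:

```python
# Hedged sketch of the tiny-dataset sanity check: a working model should hit 100%
# train accuracy on a handful of examples; if it can't, something is broken.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(8, 16)                      # ~4-10 (possibly synthetic) examples
y = torch.randint(0, 2, (8,))

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

train_acc = (model(X).argmax(-1) == y).float().mean().item()
print(f"tiny-set train accuracy: {train_acc:.2f}")   # should reach 1.00
```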
Experimental strategy • Then, train and run your model on a large dataset • It should still score close to 100% on the training data after optimization • Otherwise, you probably want to consider a more powerful model! • Overfitting to training data is not something to fear when doing deep learning • These models are usually good at generalizing because of the way distributed representations share statistical strength, regardless of overfitting to training data • But, still, you now want good generalization performance: • Regularize your model until it doesn’t overfit on dev data • Strategies like L2 regularization or early stopping of training can be useful • But normally generous dropout is the secret to success (a minimal sketch of this recipe follows below) 7
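A minimal PyTorch sketch (my own illustration, with a made-up model and random stand-in data) of that regularization recipe: dropout in the network, L2 regularization via weight decay, and early stopping on dev loss:

```python
# Hedged sketch: dropout + L2 (weight decay) + early stopping on a dev set.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X, y = torch.randn(200, 64), torch.randint(0, 2, (200,))                # placeholder data
train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32)
dev_loader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                      nn.Dropout(p=0.5),                                # generous dropout
                      nn.Linear(128, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 regularization
loss_fn = nn.CrossEntropyLoss()

best_dev, bad_epochs, patience = float("inf"), 0, 3
for epoch in range(50):
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        dev_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in dev_loader)
    if dev_loss < best_dev:
        best_dev, bad_epochs = dev_loss, 0               # still improving on dev
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                       # early stopping
            break
```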
Details matter! • Look at your data, collect summary statistics • Look at your model’s outputs, do error analysis • Find ways to examine and visualize internal representations; see if they’re sensible • Attention distributions are often particularly visualizable • Tuning hyperparameters, learning rates, getting initialization right, etc. is often important to the success of neural nets 8
c. Finding data for your projects • Some people collect their own data for a project – we like that! • You may have a project that uses “unsupervised” data • You can annotate a small amount of data • You can find a website that effectively provides annotations, such as likes, stars, ratings, responses, etc. • This lets you learn about real-world challenges of applying ML/NLP! • But be careful about scoping things so that this doesn’t take most of your time!!! • Some people have existing data from a research project or company • Fine to use, provided you can include data samples in your submission, report, etc. • Most people make use of an existing, curated dataset built by previous researchers • You get a fast start and there is obvious prior work and baselines 9
Linguistic Data Consortium • https://catalog.ldc.upenn.edu/ • Stanford licenses this data; you can get access. Sign up/ask questions at: https://linguistics.stanford.edu/resources/resources-corpora • Treebanks, named entities, coreference data, lots of clean newswire text, lots of speech with transcription, parallel MT data, etc. • Look at their catalog • Don’t use it for non-Stanford purposes! 10
Many, many more • There are now many other datasets available online for all sorts of purposes • Look at Kaggle • Look at research papers to see what data they use • Traditional lists of datasets: • https://machinelearningmastery.com/datasets-natural-language-processing/ • https://github.com/niderhoff/nlp-datasets • Lots of particular things: • For machine translation, look at: http://statmt.org – check out the WMT shared tasks • For dependency parsing: Universal Dependencies data: https://universaldependencies.org • https://gluebenchmark.com/tasks – a collection of NLU tasks • https://nlp.stanford.edu/sentiment/ – the Stanford Sentiment Treebank • https://research.fb.com/downloads/babi/ (Facebook bAbI-related controlled NLU/reasoning) • Ask on Ed or talk to course staff 11
🤗 Huggingface Datasets • https://huggingface.co/datasets 12
Paperswithcode Datasets • https://www.paperswithcode.com/datasets?mod=texts&page=1 13
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 14
What is natural language generation? • Natural language generation is one side of natural language processing: NLP = Natural Language Understanding (NLU) + Natural Language Generation (NLG) • Any task involving language production for human consumption requires natural language generation • NLG focuses on systems that produce coherent and useful language output for human consumption • Deep Learning is powering (some) next-gen NLG systems! 15
Uses of natural language generation • Machine Translation systems use NLG for output • Digital assistant (dialogue) systems use NLG • Summarization systems (for research articles, email, meetings, documents) use NLG 16
More interesting NLG uses • Creative stories (Rashkin et al., EMNLP 2020) • Data-to-text (Parikh et al., EMNLP 2020), e.g., “Craig finished his eleven NFL seasons with 8,189 rushing yards and 566 receptions for 4,911 receiving yards.” • Visual description (Krause et al., CVPR 2017) 17
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 18
Basics of natural language generation (review of lecture 6) • In autoregressive text generation models, at each time step t, our model takes in a sequence of tokens of text as input, $\{y\}_{<t}$, and outputs a new token, $\hat{y}_t$ • For model $f(\cdot)$ and vocab $V$, we get scores $S = f(\{y\}_{<t}, \theta) \in \mathbb{R}^{|V|}$ and $P(y_t \mid \{y\}_{<t}) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$ 19
Trained one token at a time by maximum likelihood teacher forcing • Trained to maximize the probability of the next token $y_t^*$ given preceding words $\{y^*\}_{<t}$: $\mathcal{L} = -\sum_t \log P(y_t^* \mid \{y^*\}_{<t})$ • This is a classification task at each time step, trying to predict the actual word $y_t^*$ in the training data • Doing this is often called “teacher forcing” (because you reset at each time step to the ground truth) – a minimal sketch of this loss is shown below 20
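A minimal PyTorch sketch (my own illustration, not code from the lecture) of the teacher-forcing objective: at every position the model conditions on the gold prefix and is penalized with cross-entropy against the gold next token. The tiny LSTM language model and random token ids are hypothetical placeholders:

```python
# Hedged sketch of teacher forcing: next-token cross-entropy with gold prefixes as input.
import torch
import torch.nn as nn

vocab_size, d = 100, 32
emb = nn.Embedding(vocab_size, d)
lstm = nn.LSTM(d, d, batch_first=True)
head = nn.Linear(d, vocab_size)

gold = torch.randint(0, vocab_size, (4, 12))     # a batch of gold sequences y*
inputs, targets = gold[:, :-1], gold[:, 1:]      # predict y*_t from y*_{<t}

hidden, _ = lstm(emb(inputs))                    # every position sees only gold history
logits = head(hidden)                            # scores S over the vocab at each step

# L = -sum_t log P(y*_t | y*_{<t}), averaged over positions here
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```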
Basics of natural language generation (review of lecture 6) • At inference time, our decoding algorithm defines a function to select a token from this distribution: $\hat{y}_t = g(P(y_t \mid \{y\}_{<t}))$, where $g(\cdot)$ is your decoding algorithm • The “obvious” decoding algorithm is to greedily choose the highest probability next token according to the model at each time step • While this basic algorithm sort of works, to do better, the two main avenues are to: 1. Improve the decoder 2. Improve the training • Of course, there’s also improving your training data or model architecture 21
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 22
Decoding: what is it all about? • At each time step t, our model computes a vector of scores for each token in our vocabulary, $S \in \mathbb{R}^{|V|}$: $S = f(\{y\}_{<t})$, where $f(\cdot)$ is your model • Then, we compute a probability distribution $P$ over these scores (usually with a softmax function): $P(y_t = w \mid \{y\}_{<t}) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$ • Our decoding algorithm defines a function to select a token from this distribution: $\hat{y}_t = g(P(y_t \mid \{y\}_{<t}))$, where $g(\cdot)$ is your decoding algorithm 23
Greedy methods • Recall: Lecture 7 on Neural Machine Translation… • Argmax Decoding • Selects the highest probability token in $P(y_t \mid \{y\}_{<t})$: $\hat{y}_t = \operatorname{argmax}_{w \in V} P(y_t = w \mid \{y\}_{<t})$ • Beam Search • Discussed in Lecture 7 on Machine Translation • At heart also a greedy algorithm, but with wider exploration of candidates 24
Greedy methods get repetitive • Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. • Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México… 25 (Holtzman et al., ICLR 2020)
Why does repetition happen? 26 (Holtzman et al., ICLR 2020)
And it keeps going… 27 (Holtzman et al., ICLR 2020)
How can we reduce repetition? • Simple option: • Heuristic: Don’t repeat n-grams • More complex: • Maximize embedding distance between consecutive sentences (Celikyilmaz et al., 2018) • Doesn’t help with intra-sentence repetition • Coverage loss (See et al., 2017) • Prevents the attention mechanism from attending to the same words • Unlikelihood objective (Welleck et al., 2020) • Penalize generation of already-seen tokens 28
Are greedy methods reasonable? (figure: per-token probability assigned by the model to beam-search text vs. human-written text) 29 (Holtzman et al., ICLR 2020)
Time to get random: Sampling! • Sample a token from the distribution of tokens: $\hat{y}_t \sim P(y_t = w \mid \{y\}_{<t})$ • It’s random, so you can sample any token! • Example: given the prefix “He wanted to go to the”, the model’s distribution spreads over many continuations (restroom, grocery store, airport, bathroom, beach, doctor, hospital, pub, gym, …) – a sketch contrasting argmax decoding and sampling follows below 30
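A minimal sketch (my own illustration) contrasting argmax decoding with vanilla sampling; the toy scorer standing in for $S = f(\{y\}_{<t})$ is a hypothetical placeholder, not a real language model:

```python
# Hedged sketch: greedy (argmax) decoding vs. vanilla sampling from the softmax.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 50
emb = nn.Embedding(vocab_size, 16)
head = nn.Linear(16, vocab_size)

def scores(prefix):                            # stand-in for S = f(y_<t)
    return head(emb(prefix).mean(dim=0))       # a real LM would be autoregressive

def decode(prefix, steps=10, greedy=True):
    prefix = prefix.clone()
    for _ in range(steps):
        probs = torch.softmax(scores(prefix), dim=-1)         # P(y_t = w | y_<t)
        if greedy:
            next_tok = probs.argmax()                          # argmax decoding
        else:
            next_tok = torch.multinomial(probs, 1).squeeze()   # y_t ~ P
        prefix = torch.cat([prefix, next_tok.view(1)])
    return prefix

start = torch.tensor([1, 2, 3])
print(decode(start, greedy=True))    # deterministic; greedy methods tend to repeat
print(decode(start, greedy=False))   # stochastic; any token can be sampled
```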
Decoding: Top-k sampling • Problem: Vanilla sampling makes every token in the vocabulary an option • Even if most of the probability mass in the distribution is over a limited set of options, the tail of the distribution could be very long and in aggregate have considerable mass (statistics speak: we have “heavy-tailed” distributions) • Many tokens are probably really wrong in the current context • Why are we giving them individually a tiny chance to be selected? • Why are we giving them as a group a high chance to be selected? • Solution: Top-k sampling • Only sample from the top k tokens in the probability distribution 31 (Fan et al., ACL 2018; Holtzman et al., ACL 2018)
Decoding: Top-k sampling • Solution: Top-k sampling • Only sample from the top k tokens in the probability distribution • Common values are k = 5, 10, 20 (but it’s up to you!) • Increase k for more diverse/risky outputs • Decrease k for more generic/safe outputs • (same “He wanted to go to the” example, now restricted to the k most probable continuations) 32 (Fan et al., ACL 2018; Holtzman et al., ACL 2018)
Issues with Top-k sampling • Top-k sampling can cut off too quickly! • Top-k sampling can also cut off too slowly! 33 (Holtzman et al., ICLR 2020)
Decoding: Top-p (nucleus) sampling • Problem: The probability distributions we sample from are dynamic • When the distribution $P_t$ is flatter, a limited k removes many viable options • When the distribution $P_t$ is peakier, a high k allows too many options to have a chance of being selected • Solution: Top-p sampling • Sample from all tokens in the top p cumulative probability mass (i.e., where mass is concentrated) • Varies k depending on the uniformity of $P_t$ 34 (Holtzman et al., ICLR 2020)
Decoding: Top-p (nucleus) sampling • Solution: Top-p sampling • Sample from all tokens in the top p cumulative probability mass (i.e., where mass is concentrated) • Varies k depending on the uniformity of $P_t$ • (figure: the nucleus covers a different number of tokens for each of three example distributions $P_t^{(1)}, P_t^{(2)}, P_t^{(3)}$) 35 (Holtzman et al., ICLR 2020)
Scaling randomness: Softmax temperature • Recall: On timestep t, the model computes a probability distribution $P_t$ by applying the softmax function to a vector of scores $S \in \mathbb{R}^{|V|}$: $P_t(y_t = w) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$ • You can apply a temperature hyperparameter $\tau$ to the softmax to rebalance $P_t$: $P_t(y_t = w) = \frac{\exp(S_w / \tau)}{\sum_{w' \in V} \exp(S_{w'} / \tau)}$ • Raise the temperature $\tau > 1$: $P_t$ becomes more uniform • More diverse output (probability is spread around the vocab) • Lower the temperature $\tau < 1$: $P_t$ becomes more spiky • Less diverse output (probability is concentrated on top words) • Note: softmax temperature is not a decoding algorithm! It’s a technique you can apply at test time, in conjunction with a decoding algorithm (such as beam search or sampling) – a combined top-k / top-p / temperature sketch follows below 36
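A minimal sketch (my own illustration, not code from the referenced papers) combining the three knobs above: temperature rescaling of the scores, top-k truncation, and top-p (nucleus) truncation before sampling:

```python
# Hedged sketch of temperature + top-k + top-p (nucleus) filtering of next-token scores.
import torch

def sample_next(scores: torch.Tensor, temperature=1.0, top_k=0, top_p=1.0) -> int:
    """scores: unnormalized logits S over the vocab at the current time step."""
    probs = torch.softmax(scores / temperature, dim=-1)   # tau > 1 flattens, tau < 1 sharpens

    if top_k > 0:                                         # keep only the k most probable tokens
        kth = torch.topk(probs, top_k)
        mask = torch.zeros_like(probs).scatter_(0, kth.indices, 1.0)
        probs = probs * mask

    if top_p < 1.0:                                       # keep the head covering ~mass p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
        keep[0] = True                                    # always keep the single best token
        mask = torch.zeros_like(probs).scatter_(0, sorted_idx, keep.float())
        probs = probs * mask

    probs = probs / probs.sum()                           # renormalize the truncated distribution
    return torch.multinomial(probs, 1).item()

logits = torch.randn(100)
print(sample_next(logits, temperature=0.8, top_k=20))     # top-k with a lowered temperature
print(sample_next(logits, top_p=0.9))                     # nucleus sampling
```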
Improving decoding: re-balancing distributions • Problem: What if I don’t trust how well my model’s distributions are calibrated? • Don’t rely on ONLY your model’s distribution over tokens • One approach: Re-balance $P_t$ using retrieval from n-gram phrase statistics! • Cache a database of phrases from your training corpus and use it to rebalance $P_t$ • Example (kNN-LM): for the test context “Obama’s birthplace is ___”, retrieve the most similar training contexts (“Obama was senator for Illinois”, “Obama was born in Hawaii”, “Obama is a native of Hawaii”, …) along with their target words, normalize and aggregate the neighbor distances into a nearest-neighbor distribution (e.g., Hawaii 0.8, Illinois 0.2), and interpolate it with the model’s own classification distribution (e.g., Hawaii 0.2, Illinois 0.2) to get the final prediction (e.g., Hawaii 0.6, Illinois 0.2) 37 (Khandelwal et al., ICLR 2020)
Improving Decoding: Re-ranking • Problem: What if I decode a bad sequence from my model? • Decode a bunch of sequences • 10 candidates is a common number, but it’s up to you • Define a score to approximate the quality of sequences and re-rank by this score • Simplest is to use perplexity! • Careful! Remember that repetitive sequences generally get low perplexity, so re-ranking by perplexity alone can favor degenerate text • Re-rankers can score a variety of properties: • style (Holtzman et al., 2018), discourse (Gabriel et al., 2021), entailment/factuality (Goyal et al., 2020), logical consistency (Lu et al., 2020), and many more … • Beware poorly-calibrated re-rankers • Can use multiple re-rankers in parallel 38
Decoding: Takeaways • Decoding is still a challenging problem in NLG – there’s a lot more work to be done! • A major realization of the last couple of years is that many of the problems that we see in neural NLG are not really problems with our learned language model probability distribution, but problems with the decoding algorithm • Human language production is a subtle presentation of information and can’t be modeled by simple properties like probability maximization • Different decoding algorithms can allow us to inject biases that encourage different properties of coherent natural language generation • Some of the most impactful advances in NLG of the last few years have come from simple but effective modifications to decoding algorithms 39
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 40
Are greedy decoders bad because of how they’re trained? • Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. • Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México… 41 (Holtzman et al., ICLR 2020)
Diversity Issues • Maximum Likelihood Estimation discourages diverse text generation 42
Unlikelihood Training • Given a set of undesired tokens $\mathcal{C}$, lower their likelihood in context: $\mathcal{L}_{UL}^t = -\sum_{y_{neg} \in \mathcal{C}} \log\big(1 - P(y_{neg} \mid \{y^*\}_{<t})\big)$ • Keep the teacher forcing objective, $\mathcal{L}_{MLE}^t = -\log P(y_t^* \mid \{y^*\}_{<t})$, and combine them for the final loss function: $\mathcal{L}_{ULE}^t = \mathcal{L}_{MLE}^t + \alpha \mathcal{L}_{UL}^t$ • Set $\mathcal{C} = \{y^*\}_{<t}$ and you’ll train the model to lower the likelihood of previously-seen tokens! • Limits repetition! • Increases the diversity of the text you learn to generate! (a minimal sketch is shown below) 43 (Welleck et al., 2020)
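A minimal PyTorch sketch (my own illustration of the idea in Welleck et al., 2020, not their released code) of token-level unlikelihood training, using the previously seen gold tokens as the negative set $\mathcal{C}$; for simplicity it does not exclude the current gold token from $\mathcal{C}$ as the paper does:

```python
# Hedged sketch: teacher-forcing cross-entropy plus an unlikelihood penalty on repeats.
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, targets, alpha=1.0, eps=1e-6):
    """logits: (batch, seq, vocab); targets: (batch, seq) gold next tokens y*."""
    probs = torch.softmax(logits, dim=-1)
    mle = F.cross_entropy(logits.transpose(1, 2), targets)      # L_MLE (teacher forcing)

    ul, seq_len = 0.0, targets.size(1)
    for t in range(1, seq_len):
        prev = targets[:, :t]                                   # C = previously seen tokens
        p_neg = probs[:, t].gather(1, prev)                     # P(y_neg | y*_<t)
        ul = ul - torch.log(1.0 - p_neg + eps).mean()           # -log(1 - P(y_neg | ...))
    ul = ul / (seq_len - 1)

    return mle + alpha * ul                                     # L_ULE = L_MLE + alpha * L_UL

logits = torch.randn(2, 8, 50, requires_grad=True)              # placeholder model outputs
targets = torch.randint(0, 50, (2, 8))
unlikelihood_loss(logits, targets).backward()
```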
Exposure Bias • Training with teacher forcing leads to exposure bias at generation time • During training, our model’s inputs are gold context tokens from real, human-generated texts: $\mathcal{L}_{MLE} = -\log P(y_t^* \mid \{y^*\}_{<t})$ • At generation time, our model’s inputs are previously decoded tokens: $\mathcal{L}_{dec} = -\log P(\hat{y}_t \mid \{\hat{y}\}_{<t})$ 44
Exposure Bias Solutions • Scheduled sampling (Bengio et al., 2015) • With some probability p, decode a token and feed that as the next input, rather than the gold token • Increase p over the course of training • Leads to improvements in practice, but can lead to strange training objectives • Dataset Aggregation (DAgger; Ross et al., 2011) • At various intervals during training, generate sequences from your current model • Add these sequences to your training set as additional examples • Basically, variants of the same approach; see: https://nlpers.blogspot.com/2016/03/a-dagger-by-any-other-name-scheduled.html 45
Exposure Bias Solutions • Sequence re-writing (Guu*, Hashimoto*, et al., 2018) • Learn to retrieve a sequence from an existing corpus of human-written prototypes (e.g., dialogue responses) • Learn to edit the retrieved sequence by adding, removing, and modifying tokens in the prototype – this will still result in a more “human-like” generation • Reinforcement Learning: cast your text generation model as a Markov decision process • State s is the model’s representation of the preceding context • Actions a are the words that can be generated • Policy $\pi$ is the decoder • Rewards r are provided by an external score • Learn behaviors by rewarding the model when it exhibits them – go study CS 234 • Use REINFORCE or similar; it’s difficult because of the huge branching factor/search space 46
Reward Estimation • How should we define a reward function? Just use your evaluation metric! • BLEU (machine translation; Ranzato et al., ICLR 2016; Wu et al., 2016) • ROUGE (summarization; Paulus et al., ICLR 2018; Celikyilmaz et al., NAACL 2018) • CIDEr (image captioning; Rennie et al., CVPR 2017) • SPIDEr (image captioning; Liu et al., ICCV 2017) • Be careful about optimizing for the task as opposed to “gaming” the reward! • Evaluation metrics are merely proxies for generation quality! • “even though RL refinement can achieve better BLEU scores, it barely improves the human impression of the translation quality” – Wu et al., 2016 47
Reward Estimation • What behaviors can we tie to rewards? • Cross-modality consistency in image captioning (Ren et al., CVPR 2017) • Sentence simplicity (Zhang and Lapata, EMNLP 2017) • Temporal consistency (Bosselut et al., NAACL 2018) • Utterance politeness (Tan et al., TACL 2018) • Paraphrasing (Li et al., EMNLP 2018) • Sentiment (Gong et al., NAACL 2019) • Formality (Gong et al., NAACL 2019) • If you can formalize a behavior as a Python function (or train a neural network to approximate it!), you can train a text generation model to exhibit that behavior! 48
The dark side … • Need to pretrain a model with teacher forcing before doing RL training • Your reward function probably expects coherent language inputs … • Need to make use of an appropriate baseline: $\mathcal{L}_{RL} = -\sum_{t=1}^{T} \big(r(\hat{y}_t) - b\big) \log P(\hat{y}_t \mid \{\hat{y}\}_{<t})$ • Use linear regression to predict it from the state s (Ranzato et al., 2015) • Decode a second sequence and use its reward as the baseline (Rennie et al., 2017) • Your model will learn the easiest way to exploit your reward function • Mitigate these shortcuts or hope that’s aligned with the behavior you want! (a minimal REINFORCE-with-baseline sketch follows below) 49
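A minimal sketch (my own illustration) of the REINFORCE-with-baseline loss above; the per-token log-probabilities, the sequence-level reward (e.g., a BLEU-like score), and the baseline value are hypothetical placeholders for what a real decoder and metric would supply:

```python
# Hedged sketch of REINFORCE with a baseline for one sampled sequence.
import torch

def reinforce_loss(logprobs: torch.Tensor, reward: float, baseline: float) -> torch.Tensor:
    """logprobs: (seq_len,) values of log P(y_hat_t | y_hat_<t) for one sampled sequence."""
    advantage = reward - baseline                 # subtracting b reduces gradient variance
    return -(advantage * logprobs).sum()          # L_RL = -sum_t (r - b) log P(y_hat_t | ...)

# toy usage: pretend we sampled a 6-token sequence and scored it with some metric
logprobs = torch.log(torch.rand(6) * 0.5 + 0.25)  # placeholder token log-probabilities
logprobs.requires_grad_(True)                     # in practice these already carry gradients
reward = 0.42                                     # e.g., sentence-level BLEU of the sample
baseline = 0.30                                   # e.g., reward of a greedy-decoded sequence
loss = reinforce_loss(logprobs, reward, baseline)
loss.backward()
```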
Training: Takeaways • Teacher forcing is still the main algorithm for training text generation models • Diversity is an issue with sequences generated from teacher-forced models • New approaches focus on mitigating the effects of common words • Exposure bias causes text generation models to lose coherence easily • Models must learn to recover from their own bad samples • E.g., scheduled sampling, DAgger • Or not be allowed to generate bad text to begin with (e.g., retrieval + generation) • Training with RL can allow models to learn behaviors that are challenging to formalize • But learning can be very unstable! 50
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 51
Types of evaluation methods for text generation • Ref: They walked to the grocery store . • Gen: The woman went to the hardware store . • Content Overlap Metrics • Model-based Metrics • Human Evaluations (Some slides repurposed from Asli Celikyilmaz’s EMNLP 2020 tutorial) 52
Content overlap metrics • Ref: They walked to the grocery store . • Gen: The woman went to the hardware store . • Compute a score that indicates the similarity between generated and gold-standard (human-written) text • Fast and efficient and widely used • Two broad categories: • N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.) • Semantic overlap metrics (e.g., PYRAMID, SPICE, SPIDEr, etc.) 53
N-gram overlap metrics • Word overlap–based metrics (BLEU, ROUGE, METEOR, CIDEr, etc.) • They’re not ideal even for machine translation • They get progressively much worse for tasks that are more open-ended than machine translation • Worse for summarization, as longer output texts are harder to measure • Much worse for dialogue, which is more open-ended than summarization • Much, much worse for story generation, which is also open-ended, but whose sequence length can make it seem you’re getting decent scores! 54
A simple failure case • n-gram overlap metrics have no concept of semantic relatedness! • Prompt: Are you enjoying the CS224N lectures? • Reference: Heck yes ! • Candidate outputs and their overlap scores: “Yes !” – 0.61; “You know it !” – 0.25; “Yup .” – 0 (false negative); “Heck no !” – 0.67 (false positive) – a small sketch of such an overlap score follows below 55
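A minimal sketch (my own illustration: a BLEU-1-style unigram precision with a brevity penalty, not necessarily the exact metric behind the slide's numbers) showing the failure mode: the semantically fine "Yup ." gets zero overlap, while the wrong "Heck no !" scores highly:

```python
# Hedged sketch of a simple unigram-overlap score (precision x brevity penalty).
import math
from collections import Counter

def overlap_score(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    ref_counts = Counter(ref)
    matches = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = matches / len(cand)
    brevity = math.exp(1 - len(ref) / len(cand)) if len(cand) < len(ref) else 1.0
    return brevity * precision

reference = "Heck yes !"
for cand in ["Yes !", "You know it !", "Yup .", "Heck no !"]:
    print(f"{cand!r}: {overlap_score(cand, reference):.2f}")
# "Yup ." scores 0 despite being a fine answer; "Heck no !" scores high despite being wrong.
```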
Semantic overlap metrics • PYRAMID: Incorporates human content selection variation in summarization evaluation; identifies Summarization Content Units (SCUs) to compare information content in summaries (Nenkova et al., 2007) • SPICE: Semantic propositional image caption evaluation is an image captioning metric that initially parses the reference text to derive an abstract scene graph representation (Anderson et al., 2016) • SPIDEr: A combination of semantic graph similarity (SPICE) and an n-gram similarity measure (CIDEr); the SPIDEr metric yields a more complete quality evaluation metric (Liu et al., 2017) 56
Model-based metrics • Use learned representations of words and sentences to compute semantic similarity between generated and reference texts • No more n-gram bottleneck, because text units are represented as embeddings! • Even though the embeddings are pretrained, the distance metrics used to measure similarity can be fixed 57
Model-based metrics: Word distance functions • Vector Similarity: embedding-based similarity for semantic distance between texts • Embedding Average (Liu et al., 2016) • Vector Extrema (Liu et al., 2016) • MEANT (Lo, 2017) • YISI (Lo, 2019) • Word Mover’s Distance: measures the distance between two sequences (e.g., sentences, paragraphs, etc.) using word embedding similarity matching (Kusner et al., 2015; Zhao et al., 2019) • BERTSCORE: uses pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity (Zhang et al., 2020) 58
Model-based metrics: Beyond word matching • Sentence Movers Similarity: based on Word Movers Distance, evaluates text in a continuous space using sentence embeddings from recurrent neural network representations (Clark et al., 2019) • BLEURT: a regression model based on BERT that returns a score indicating to what extent the candidate text is grammatical and conveys the meaning of the reference text (Sellam et al., 2020) – a minimal embedding-average similarity sketch follows below 59
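A minimal sketch (my own illustration) of the simplest of these ideas, embedding averaging: represent each sentence as the mean of its word vectors and compare with cosine similarity. The small random embedding table is a hypothetical stand-in for real pretrained vectors:

```python
# Hedged sketch of an embedding-average similarity (cf. Liu et al., 2016).
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in
         "they walked to the grocery store woman went hardware .".split()}

def sentence_vector(sentence: str) -> np.ndarray:
    vecs = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return np.mean(vecs, axis=0)                         # average the word embeddings

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = "They walked to the grocery store ."
gen = "The woman went to the hardware store ."
print(f"embedding-average similarity: {cosine(sentence_vector(ref), sentence_vector(gen)):.2f}")
```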
Automatic metrics in general don’t really work ☹ (Liu et al., EMNLP 2016) 60
Human evaluations • Automatic metrics fall short of matching human decisions • Human evaluation is the most important form of evaluation for text generation systems • >75% of generation papers at ACL 2019 included human evaluations • Gold standard in developing new automatic metrics • New automated metrics must correlate well with human evaluations! 61
Human evaluations • Ask humans to evaluate the quality of generated text • Overall or along some specific dimension: • fluency • coherence / consistency • factuality and correctness • commonsense • style / formality • grammaticality • typicality • redundancy • Note: Don’t compare human evaluation scores across differently conducted studies, even if they claim to evaluate the same dimensions! 62
Human evaluation: Issues • Human judgments are regarded as the gold standard • Of course, we know that human eval is slow and expensive • … but are those the only problems? • Supposing you do have access to human evaluation: does human evaluation solve all of your problems? • No! • Conducting human evaluation effectively is very difficult • Humans: • are inconsistent • can be illogical • lose concentration • misinterpret your question • can’t always explain why they feel the way they do 63
Learning from human feedback • ADEM: a learned metric from human judgments for dialog system evaluation in a chatbot setting (Lowe et al., 2017) • HUSE: Human Unified with Statistical Evaluation (HUSE) determines the similarity of the output distribution and a human reference distribution (Hashimoto et al., 2019) 64
Evaluation: Takeaways • Content overlap metrics provide a good starting point for evaluating the quality of generated text. You will need to use one, but they’re not good enough on their own. • Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable • Human judgments are critical • Only thing that can directly evaluate factuality – is the model saying correct things? • But humans are inconsistent! • In many cases, the best judge of output quality is YOU! • Look at your model generations. Don’t just rely on numbers! • Publicly release large samples of the output of systems that you create! 65
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 66
Warning: Some of the content on the next few slides may be disturbing
Ethics of text generation systems • Tay: a chatbot released by Microsoft in 2016 • Within 24 hours, it started making toxic racist and sexist comments • What went wrong? https://en.wikipedia.org/wiki/Tay_(bot) 67
Ethics: Biases in text generation models (Warning: examples contain sensitive content) • Text generation models are often constructed from pretrained language models • Language models learn harmful patterns of bias from large language corpora • When prompted for this information, they repeat negative stereotypes 68 (Sheng et al., EMNLP 2019)
Hidden Biases: Universal adversarial triggers (Warning: examples contain highly sensitive content) • The learned behaviors of text generation models are opaque • Adversarial inputs can trigger VERY toxic content • These models can be exploited in open-world contexts by ill-intentioned users 69 (Wallace et al., EMNLP 2019)
Hidden Biases: Triggered innocuously (Warning: examples contain sensitive content) • Pretrained language models can degenerate into toxic text even from seemingly innocuous prompts • Models should not be deployed without proper safeguards to control for toxic content • Models should not be deployed without careful consideration of how users will interact with them 70 (Gehman et al., EMNLP Findings 2020)
Ethics: Think about what you’re building • Large-scale pretrained language models allow us to build NLG systems for many new applications • Does the content we’re building a system to automatically generate … really need to be generated? 71 (Zellers et al., NeurIPS 2019)
Concluding Thoughts • Interacting with natural language generation systems quickly shows their limitations • Even in tasks with more progress, there are still many improvements ahead • Evaluation remains a huge challenge • We need better ways of automatically evaluating the performance of NLG systems • With the advent of large-scale language models, deep NLG research has been reset • It’s never been easier to jump into the space! • One of the most exciting and fun areas of NLP to work in! 72
Bizarre conversations with my chatbot 73