Natural Language Processing with Deep Learning CS224N/Ling284 Christopher Manning (based on a lecture by Antoine Bosselut) Lecture 12: Neural Language Generation
Today: A bit more on projects and Natural Language Generation • A few more final project thoughts and tips 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 2
a. Care with datasets in model development • Many publicly available datasets are released with a train/dev/test structure • If there is no dev set or you want a separate tune set, then you should create one by splitting the training data • Weigh the usefulness of a bigger held-out set against the reduction in train-set size • Cross-validation (q.v.) is a technique for maximizing data when you don’t have much • You build (estimate or train) a model on a training set • You measure progress and avoid overfitting using an independent dev or validation set • If you do that a lot, you overfit to the dev set; it can help to have a second dev2 set • A fixed test set ensures that all systems are assessed against the same gold data. • This is generally good and advised – even if using CV in model development • But it can be problematic when the test set turns out to have unusual properties that distort progress on the task. 3
The need for independent partitions of the data set • The train, tune, dev, and test sets need to be completely distinct • Be alert even to small overlaps, like repeated material due to email replies, etc. • It is invalid to give results testing on material you have trained on • You will get falsely good performance – we almost always overfit on train • You may need an independent tuning set • Hyperparameters that should be set for independent data won’t be set correctly if the tune set is the same as the train set • If you keep running on the same evaluation set, you begin to overfit to it • Effectively you are “training” on the evaluation set … you are learning things that do and don’t work on that particular eval set and you only keep the things that “work” … on that particular eval set • To get a valid measure of system performance you need another independent test set that you haven’t trained or tuned on … hence dev2 and final test sets • We're all on the honor system to do test-set runs only when development is complete • Use the final test set extremely few times … ideally only once 4
b. Getting your neural network to train • Start with a positive attitude! • Neural networks want to learn! • If the network isn’t learning, you’re doing something to prevent it from learning successfully! • Realize the grim reality: • There are lots of things that can cause neural nets to not learn at all or to not learn very well • Finding and fixing them (“debugging and tuning”) can often take a lot more time than implementing your model 😰 • It’s hard to work out what these things are • But experience, experimental care, examining carefully what’s happening inside the model, and rules of thumb all help! 5
Experimental strategy • Work incrementally! • Start with a very simple model and get it to work! • It’s very hard to fix a complex but broken model • Add bells and whistles one-by-one and get the model working with each (if you can) • E.g., from BiDAF: at first leave out the character CNN and the final prediction LSTM and get that working; indeed, maybe you could also leave out the modeling layer at first • Initially run your model on a tiny amount of data • You will see bugs much more easily on a tiny dataset … and it trains really quickly • Something like 4–10 examples is good • Often synthetic data is useful for this • Make sure you can get 100% on this data (testing on train) – a minimal sketch of this sanity check is shown below • Otherwise, your model is definitely either not powerful enough or it is broken 6
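To make the "overfit a tiny dataset" check concrete, here is a minimal PyTorch sketch (my own illustration, not code from the lecture); the 8 random examples and the small classifier are hypothetical placeholders for your data and model:

```python
# Hedged sketch of the tiny-dataset sanity check: a working model should hit 100%
# train accuracy on a handful of examples; if it can't, something is broken.
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.randn(8, 16)                      # ~4-10 (possibly synthetic) examples
y = torch.randint(0, 2, (8,))

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

for step in range(500):
    opt.zero_grad()
    loss_fn(model(X), y).backward()
    opt.step()

train_acc = (model(X).argmax(-1) == y).float().mean().item()
print(f"tiny-set train accuracy: {train_acc:.2f}")   # should reach 1.00
```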
Experimental strategy • Then, train and run your model on a large dataset • It should still score close to 100% on the training data after optimization • Otherwise, you probably want to consider a more powerful model! • Overfitting to training data is not something to fear when doing deep learning • These models are usually good at generalizing because of the way distributed representations share statistical strength, regardless of overfitting to training data • But, still, you now want good generalization performance: • Regularize your model until it doesn’t overfit on dev data • Strategies like L2 regularization or early stopping of training can be useful • But normally generous dropout is the secret to success (a minimal sketch of this recipe follows below) 7
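A minimal PyTorch sketch (my own illustration, with a made-up model and random stand-in data) of that regularization recipe: dropout in the network, L2 regularization via weight decay, and early stopping on dev loss:

```python
# Hedged sketch: dropout + L2 (weight decay) + early stopping on a dev set.
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

torch.manual_seed(0)
X, y = torch.randn(200, 64), torch.randint(0, 2, (200,))                # placeholder data
train_loader = DataLoader(TensorDataset(X[:160], y[:160]), batch_size=32)
dev_loader = DataLoader(TensorDataset(X[160:], y[160:]), batch_size=32)

model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(),
                      nn.Dropout(p=0.5),                                # generous dropout
                      nn.Linear(128, 2))
opt = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)  # L2 regularization
loss_fn = nn.CrossEntropyLoss()

best_dev, bad_epochs, patience = float("inf"), 0, 3
for epoch in range(50):
    model.train()
    for xb, yb in train_loader:
        opt.zero_grad()
        loss_fn(model(xb), yb).backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        dev_loss = sum(loss_fn(model(xb), yb).item() for xb, yb in dev_loader)
    if dev_loss < best_dev:
        best_dev, bad_epochs = dev_loss, 0               # still improving on dev
    else:
        bad_epochs += 1
        if bad_epochs >= patience:                       # early stopping
            break
```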
Details matter! • Look at your data, collect summary statistics • Look at your model’s outputs, do error analysis • Find ways to examine and visualize internal representations; see if they’re sensible • Attention distributions are often particularly visualizable • Tuning hyperparameters, learning rates, getting initialization right, etc. is often important to the success of neural nets 8
c. Finding data for your projects • Some people collect their own data for a project – we like that! • You may have a project that uses “unsupervised” data • You can annotate a small amount of data • You can find a website that effectively provides annotations, such as likes, stars, ratings, responses, etc. • This lets you learn about real-world challenges of applying ML/NLP! • But be careful about scoping things so that this doesn’t take most of your time!!! • Some people have existing data from a research project or company • Fine to use, provided you can include data samples in your submission, report, etc. • Most people make use of an existing, curated dataset built by previous researchers • You get a fast start and there is obvious prior work and baselines 9
Linguistic Data Consortium • https://catalog.ldc.upenn.edu/ • Stanford licenses this data; you can get access. Sign up/ask questions at: https://linguistics.stanford.edu/resources/resources-corpora • Treebanks, named entities, coreference data, lots of clean newswire text, lots of speech with transcription, parallel MT data, etc. • Look at their catalog • Don’t use it for non-Stanford purposes! 10
Many, many more • There are now many other datasets available online for all sorts of purposes • Look at Kaggle • Look at research papers to see what data they use • Traditional lists of datasets: • https://machinelearningmastery.com/datasets-natural-language-processing/ • https://github.com/niderhoff/nlp-datasets • Lots of particular things: • For machine translation, look at: http://statmt.org – check out the WMT shared tasks • For dependency parsing: Universal Dependencies data: https://universaldependencies.org • https://gluebenchmark.com/tasks – a collection of NLU tasks • https://nlp.stanford.edu/sentiment/ – the Stanford Sentiment Treebank • https://research.fb.com/downloads/babi/ (Facebook bAbI-related controlled NLU/reasoning) • Ask on Ed or talk to course staff 11
🤗 Huggingface Datasets • https://huggingface.co/datasets 12
Paperswithcode Datasets • https://www.paperswithcode.com/datasets?mod=texts&page=1 13
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 14
What is natural language generation? • Natural language generation is one side of natural language processing: NLP = Natural Language Understanding (NLU) + Natural Language Generation (NLG) • Any task involving language production for human consumption requires natural language generation • NLG focuses on systems that produce coherent and useful language output for human consumption • Deep Learning is powering (some) next-gen NLG systems! 15
Uses of natural language generation • Machine Translation systems use NLG for output • Digital assistant (dialogue) systems use NLG • Summarization systems (for research articles, email, meetings, documents) use NLG 16
More interesting NLG uses • Creative stories (Rashkin et al., EMNLP 2020) • Data-to-text (Parikh et al., EMNLP 2020), e.g., “Craig finished his eleven NFL seasons with 8,189 rushing yards and 566 receptions for 4,911 receiving yards.” • Visual description (Krause et al., CVPR 2017) 17
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 18
Basics of natural language generation (review of lecture 6) • In autoregressive text generation models, at each time step t, our model takes in a sequence of tokens of text as input, $\{y\}_{<t}$, and outputs a new token, $\hat{y}_t$ • For model $f(\cdot)$ and vocab $V$, we get scores $S = f(\{y\}_{<t}, \theta) \in \mathbb{R}^{|V|}$ and $P(y_t \mid \{y\}_{<t}) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$ 19
Trained one token at a time by maximum likelihood teacher forcing • Trained to maximize the probability of the next token $y_t^*$ given preceding words $\{y^*\}_{<t}$: $\mathcal{L} = -\sum_t \log P(y_t^* \mid \{y^*\}_{<t})$ • This is a classification task at each time step, trying to predict the actual word $y_t^*$ in the training data • Doing this is often called “teacher forcing” (because you reset at each time step to the ground truth) – a minimal sketch of this loss is shown below 20
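A minimal PyTorch sketch (my own illustration, not code from the lecture) of the teacher-forcing objective: at every position the model conditions on the gold prefix and is penalized with cross-entropy against the gold next token. The tiny LSTM language model and random token ids are hypothetical placeholders:

```python
# Hedged sketch of teacher forcing: next-token cross-entropy with gold prefixes as input.
import torch
import torch.nn as nn

vocab_size, d = 100, 32
emb = nn.Embedding(vocab_size, d)
lstm = nn.LSTM(d, d, batch_first=True)
head = nn.Linear(d, vocab_size)

gold = torch.randint(0, vocab_size, (4, 12))     # a batch of gold sequences y*
inputs, targets = gold[:, :-1], gold[:, 1:]      # predict y*_t from y*_{<t}

hidden, _ = lstm(emb(inputs))                    # every position sees only gold history
logits = head(hidden)                            # scores S over the vocab at each step

# L = -sum_t log P(y*_t | y*_{<t}), averaged over positions here
loss = nn.functional.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()
```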
Basics of natural language generation (review of lecture 6) • At inference time, our decoding algorithm defines a function to select a token from this distribution: $\hat{y}_t = g(P(y_t \mid \{y\}_{<t}))$, where $g(\cdot)$ is your decoding algorithm • The “obvious” decoding algorithm is to greedily choose the highest probability next token according to the model at each time step • While this basic algorithm sort of works, to do better, the two main avenues are to: 1. Improve the decoder 2. Improve the training • Of course, there’s also improving your training data or model architecture 21
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 22
Decoding: what is it all about? • At each time step t, our model computes a vector of scores for each token in our vocabulary, $S \in \mathbb{R}^{|V|}$: $S = f(\{y\}_{<t})$, where $f(\cdot)$ is your model • Then, we compute a probability distribution $P$ over these scores (usually with a softmax function): $P(y_t = w \mid \{y\}_{<t}) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$ • Our decoding algorithm defines a function to select a token from this distribution: $\hat{y}_t = g(P(y_t \mid \{y\}_{<t}))$, where $g(\cdot)$ is your decoding algorithm 23
Greedy methods • Recall: Lecture 7 on Neural Machine Translation… • Argmax Decoding • Selects the highest probability token in $P(y_t \mid \{y\}_{<t})$: $\hat{y}_t = \operatorname{argmax}_{w \in V} P(y_t = w \mid \{y\}_{<t})$ • Beam Search • Discussed in Lecture 7 on Machine Translation • At heart also a greedy algorithm, but with wider exploration of candidates 24
Greedy methods get repetitive • Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. • Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México… 25 (Holtzman et al., ICLR 2020)
Why does repetition happen? 26 (Holtzman et al., ICLR 2020)
And it keeps going… 27 (Holtzman et al., ICLR 2020)
How can we reduce repetition? • Simple option: • Heuristic: Don’t repeat n-grams • More complex: • Maximize embedding distance between consecutive sentences (Celikyilmaz et al., 2018) • Doesn’t help with intra-sentence repetition • Coverage loss (See et al., 2017) • Prevents the attention mechanism from attending to the same words • Unlikelihood objective (Welleck et al., 2020) • Penalize generation of already-seen tokens 28
Are greedy methods reasonable? (figure: per-token probability assigned by the model to beam-search text vs. human-written text) 29 (Holtzman et al., ICLR 2020)
Time to get random: Sampling! • Sample a token from the distribution of tokens: $\hat{y}_t \sim P(y_t = w \mid \{y\}_{<t})$ • It’s random, so you can sample any token! • Example: given the prefix “He wanted to go to the”, the model’s distribution spreads over many continuations (restroom, grocery store, airport, bathroom, beach, doctor, hospital, pub, gym, …) – a sketch contrasting argmax decoding and sampling follows below 30
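A minimal sketch (my own illustration) contrasting argmax decoding with vanilla sampling; the toy scorer standing in for $S = f(\{y\}_{<t})$ is a hypothetical placeholder, not a real language model:

```python
# Hedged sketch: greedy (argmax) decoding vs. vanilla sampling from the softmax.
import torch
import torch.nn as nn

torch.manual_seed(0)
vocab_size = 50
emb = nn.Embedding(vocab_size, 16)
head = nn.Linear(16, vocab_size)

def scores(prefix):                            # stand-in for S = f(y_<t)
    return head(emb(prefix).mean(dim=0))       # a real LM would be autoregressive

def decode(prefix, steps=10, greedy=True):
    prefix = prefix.clone()
    for _ in range(steps):
        probs = torch.softmax(scores(prefix), dim=-1)         # P(y_t = w | y_<t)
        if greedy:
            next_tok = probs.argmax()                          # argmax decoding
        else:
            next_tok = torch.multinomial(probs, 1).squeeze()   # y_t ~ P
        prefix = torch.cat([prefix, next_tok.view(1)])
    return prefix

start = torch.tensor([1, 2, 3])
print(decode(start, greedy=True))    # deterministic; greedy methods tend to repeat
print(decode(start, greedy=False))   # stochastic; any token can be sampled
```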
Decoding: Top-k sampling • Problem: Vanilla sampling makes every token in the vocabulary an option • Even if most of the probability mass in the distribution is over a limited set of options, the tail of the distribution could be very long and in aggregate have considerable mass (statistics speak: we have “heavy-tailed” distributions) • Many tokens are probably really wrong in the current context • Why are we giving them individually a tiny chance to be selected? • Why are we giving them as a group a high chance to be selected? • Solution: Top-k sampling • Only sample from the top k tokens in the probability distribution 31 (Fan et al., ACL 2018; Holtzman et al., ACL 2018)
Decoding: Top-k sampling • Solution: Top-k sampling • Only sample from the top k tokens in the probability distribution • Common values are k = 5, 10, 20 (but it’s up to you!) • Increase k for more diverse/risky outputs • Decrease k for more generic/safe outputs • (same “He wanted to go to the” example, now restricted to the k most probable continuations) 32 (Fan et al., ACL 2018; Holtzman et al., ACL 2018)
Issues with Top-k sampling • Top-k sampling can cut off too quickly! • Top-k sampling can also cut off too slowly! 33 (Holtzman et al., ICLR 2020)
Decoding: Top-p (nucleus) sampling • Problem: The probability distributions we sample from are dynamic • When the distribution $P_t$ is flatter, a limited k removes many viable options • When the distribution $P_t$ is peakier, a high k allows too many options to have a chance of being selected • Solution: Top-p sampling • Sample from all tokens in the top p cumulative probability mass (i.e., where mass is concentrated) • Varies k depending on the uniformity of $P_t$ 34 (Holtzman et al., ICLR 2020)
Decoding: Top-p (nucleus) sampling • Solution: Top-p sampling • Sample from all tokens in the top p cumulative probability mass (i.e., where mass is concentrated) • Varies k depending on the uniformity of $P_t$ • (figure: the nucleus covers a different number of tokens for each of three example distributions $P_t^{(1)}, P_t^{(2)}, P_t^{(3)}$) 35 (Holtzman et al., ICLR 2020)
Scaling randomness: Softmax temperature • Recall: On timestep t, the model computes a probability distribution $P_t$ by applying the softmax function to a vector of scores $S \in \mathbb{R}^{|V|}$: $P_t(y_t = w) = \frac{\exp(S_w)}{\sum_{w' \in V} \exp(S_{w'})}$ • You can apply a temperature hyperparameter $\tau$ to the softmax to rebalance $P_t$: $P_t(y_t = w) = \frac{\exp(S_w / \tau)}{\sum_{w' \in V} \exp(S_{w'} / \tau)}$ • Raise the temperature $\tau > 1$: $P_t$ becomes more uniform • More diverse output (probability is spread around the vocab) • Lower the temperature $\tau < 1$: $P_t$ becomes more spiky • Less diverse output (probability is concentrated on top words) • Note: softmax temperature is not a decoding algorithm! It’s a technique you can apply at test time, in conjunction with a decoding algorithm (such as beam search or sampling) – a combined top-k / top-p / temperature sketch follows below 36
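A minimal sketch (my own illustration, not code from the referenced papers) combining the three knobs above: temperature rescaling of the scores, top-k truncation, and top-p (nucleus) truncation before sampling:

```python
# Hedged sketch of temperature + top-k + top-p (nucleus) filtering of next-token scores.
import torch

def sample_next(scores: torch.Tensor, temperature=1.0, top_k=0, top_p=1.0) -> int:
    """scores: unnormalized logits S over the vocab at the current time step."""
    probs = torch.softmax(scores / temperature, dim=-1)   # tau > 1 flattens, tau < 1 sharpens

    if top_k > 0:                                         # keep only the k most probable tokens
        kth = torch.topk(probs, top_k)
        mask = torch.zeros_like(probs).scatter_(0, kth.indices, 1.0)
        probs = probs * mask

    if top_p < 1.0:                                       # keep the head covering ~mass p
        sorted_probs, sorted_idx = torch.sort(probs, descending=True)
        keep = torch.cumsum(sorted_probs, dim=-1) <= top_p
        keep[0] = True                                    # always keep the single best token
        mask = torch.zeros_like(probs).scatter_(0, sorted_idx, keep.float())
        probs = probs * mask

    probs = probs / probs.sum()                           # renormalize the truncated distribution
    return torch.multinomial(probs, 1).item()

logits = torch.randn(100)
print(sample_next(logits, temperature=0.8, top_k=20))     # top-k with a lowered temperature
print(sample_next(logits, top_p=0.9))                     # nucleus sampling
```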
Improving decoding: re-balancing distributions • Problem: What if I don’t trust how well my model’s distributions are calibrated? • Don’t rely on ONLY your model’s distribution over tokens • One approach: Re-balance $P_t$ using retrieval from n-gram phrase statistics! • Cache a database of phrases from your training corpus and use it to rebalance $P_t$ • Example (kNN-LM): for the test context “Obama’s birthplace is ___”, retrieve the most similar training contexts (“Obama was senator for Illinois”, “Obama was born in Hawaii”, “Obama is a native of Hawaii”, …) along with their target words, normalize and aggregate the neighbor distances into a nearest-neighbor distribution (e.g., Hawaii 0.8, Illinois 0.2), and interpolate it with the model’s own classification distribution (e.g., Hawaii 0.2, Illinois 0.2) to get the final prediction (e.g., Hawaii 0.6, Illinois 0.2) 37 (Khandelwal et al., ICLR 2020)
Improving Decoding: Re-ranking • Problem: What if I decode a bad sequence from my model? • Decode a bunch of sequences • 10 candidates is a common number, but it’s up to you • Define a score to approximate the quality of sequences and re-rank by this score • Simplest is to use perplexity! • Careful! Remember that repetitive sequences generally get low perplexity, so re-ranking by perplexity alone can favor degenerate text • Re-rankers can score a variety of properties: • style (Holtzman et al., 2018), discourse (Gabriel et al., 2021), entailment/factuality (Goyal et al., 2020), logical consistency (Lu et al., 2020), and many more … • Beware poorly-calibrated re-rankers • Can use multiple re-rankers in parallel 38
Decoding: Takeaways • Decoding is still a challenging problem in NLG – there’s a lot more work to be done! • A major realization of the last couple of years is that many of the problems that we see in neural NLG are not really problems with our learned language model probability distribution, but problems with the decoding algorithm • Human language production is a subtle presentation of information and can’t be modeled by simple properties like probability maximization • Different decoding algorithms can allow us to inject biases that encourage different properties of coherent natural language generation • Some of the most impactful advances in NLG of the last few years have come from simple but effective modifications to decoding algorithms 39
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 40
Are greedy decoders bad because of how they’re trained? • Context: In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English. • Continuation: The study, published in the Proceedings of the National Academy of Sciences of the United States of America (PNAS), was conducted by researchers from the Universidad Nacional Autónoma de México (UNAM) and the Universidad Nacional Autónoma de México (UNAM/Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México/ Universidad Nacional Autónoma de México… 41 (Holtzman et al., ICLR 2020)
Diversity Issues • Maximum Likelihood Estimation discourages diverse text generation 42
Unlikelihood Training • Given a set of undesired tokens $\mathcal{C}$, lower their likelihood in context: $\mathcal{L}_{UL}^t = -\sum_{y_{neg} \in \mathcal{C}} \log\big(1 - P(y_{neg} \mid \{y^*\}_{<t})\big)$ • Keep the teacher forcing objective, $\mathcal{L}_{MLE}^t = -\log P(y_t^* \mid \{y^*\}_{<t})$, and combine them for the final loss function: $\mathcal{L}_{ULE}^t = \mathcal{L}_{MLE}^t + \alpha \mathcal{L}_{UL}^t$ • Set $\mathcal{C} = \{y^*\}_{<t}$ and you’ll train the model to lower the likelihood of previously-seen tokens! • Limits repetition! • Increases the diversity of the text you learn to generate! (a minimal sketch is shown below) 43 (Welleck et al., 2020)
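A minimal PyTorch sketch (my own illustration of the idea in Welleck et al., 2020, not their released code) of token-level unlikelihood training, using the previously seen gold tokens as the negative set $\mathcal{C}$; for simplicity it does not exclude the current gold token from $\mathcal{C}$ as the paper does:

```python
# Hedged sketch: teacher-forcing cross-entropy plus an unlikelihood penalty on repeats.
import torch
import torch.nn.functional as F

def unlikelihood_loss(logits, targets, alpha=1.0, eps=1e-6):
    """logits: (batch, seq, vocab); targets: (batch, seq) gold next tokens y*."""
    probs = torch.softmax(logits, dim=-1)
    mle = F.cross_entropy(logits.transpose(1, 2), targets)      # L_MLE (teacher forcing)

    ul, seq_len = 0.0, targets.size(1)
    for t in range(1, seq_len):
        prev = targets[:, :t]                                   # C = previously seen tokens
        p_neg = probs[:, t].gather(1, prev)                     # P(y_neg | y*_<t)
        ul = ul - torch.log(1.0 - p_neg + eps).mean()           # -log(1 - P(y_neg | ...))
    ul = ul / (seq_len - 1)

    return mle + alpha * ul                                     # L_ULE = L_MLE + alpha * L_UL

logits = torch.randn(2, 8, 50, requires_grad=True)              # placeholder model outputs
targets = torch.randint(0, 50, (2, 8))
unlikelihood_loss(logits, targets).backward()
```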
Exposure Bias • Training with teacher forcing leads to exposure bias at generation time • During training, our model’s inputs are gold context tokens from real, human-generated texts: $\mathcal{L}_{MLE} = -\log P(y_t^* \mid \{y^*\}_{<t})$ • At generation time, our model’s inputs are previously decoded tokens: $\mathcal{L}_{dec} = -\log P(\hat{y}_t \mid \{\hat{y}\}_{<t})$ 44
Exposure Bias Solutions • Scheduled sampling (Bengio et al., 2015) • With some probability p, decode a token and feed that as the next input, rather than the gold token • Increase p over the course of training • Leads to improvements in practice, but can lead to strange training objectives • Dataset Aggregation (DAgger; Ross et al., 2011) • At various intervals during training, generate sequences from your current model • Add these sequences to your training set as additional examples • Basically, variants of the same approach; see: https://nlpers.blogspot.com/2016/03/a-dagger-by-any-other-name-scheduled.html 45
Exposure Bias Solutions • Sequence re-writing (Guu*, Hashimoto*, et al., 2018) • Learn to retrieve a sequence from an existing corpus of human-written prototypes (e.g., dialogue responses) • Learn to edit the retrieved sequence by adding, removing, and modifying tokens in the prototype – this will still result in a more “human-like” generation • Reinforcement Learning: cast your text generation model as a Markov decision process • State s is the model’s representation of the preceding context • Actions a are the words that can be generated • Policy $\pi$ is the decoder • Rewards r are provided by an external score • Learn behaviors by rewarding the model when it exhibits them – go study CS 234 • Use REINFORCE or similar; it’s difficult because of the huge branching factor/search space 46
Reward Estimation • How should we define a reward function? Just use your evaluation metric! • BLEU (machine translation; Ranzato et al., ICLR 2016; Wu et al., 2016) • ROUGE (summarization; Paulus et al., ICLR 2018; Celikyilmaz et al., NAACL 2018) • CIDEr (image captioning; Rennie et al., CVPR 2017) • SPIDEr (image captioning; Liu et al., ICCV 2017) • Be careful about optimizing for the task as opposed to “gaming” the reward! • Evaluation metrics are merely proxies for generation quality! • “even though RL refinement can achieve better BLEU scores, it barely improves the human impression of the translation quality” – Wu et al., 2016 47
Reward Estimation • What behaviors can we tie to rewards? • Cross-modality consistency in image captioning (Ren et al., CVPR 2017) • Sentence simplicity (Zhang and Lapata, EMNLP 2017) • Temporal consistency (Bosselut et al., NAACL 2018) • Utterance politeness (Tan et al., TACL 2018) • Paraphrasing (Li et al., EMNLP 2018) • Sentiment (Gong et al., NAACL 2019) • Formality (Gong et al., NAACL 2019) • If you can formalize a behavior as a Python function (or train a neural network to approximate it!), you can train a text generation model to exhibit that behavior! 48
The dark side … • Need to pretrain a model with teacher forcing before doing RL training • Your reward function probably expects coherent language inputs … • Need to make use of an appropriate baseline: $\mathcal{L}_{RL} = -\sum_{t=1}^{T} \big(r(\hat{y}_t) - b\big) \log P(\hat{y}_t \mid \{\hat{y}\}_{<t})$ • Use linear regression to predict it from the state s (Ranzato et al., 2015) • Decode a second sequence and use its reward as the baseline (Rennie et al., 2017) • Your model will learn the easiest way to exploit your reward function • Mitigate these shortcuts or hope that’s aligned with the behavior you want! (a minimal REINFORCE-with-baseline sketch follows below) 49
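A minimal sketch (my own illustration) of the REINFORCE-with-baseline loss above; the per-token log-probabilities, the sequence-level reward (e.g., a BLEU-like score), and the baseline value are hypothetical placeholders for what a real decoder and metric would supply:

```python
# Hedged sketch of REINFORCE with a baseline for one sampled sequence.
import torch

def reinforce_loss(logprobs: torch.Tensor, reward: float, baseline: float) -> torch.Tensor:
    """logprobs: (seq_len,) values of log P(y_hat_t | y_hat_<t) for one sampled sequence."""
    advantage = reward - baseline                 # subtracting b reduces gradient variance
    return -(advantage * logprobs).sum()          # L_RL = -sum_t (r - b) log P(y_hat_t | ...)

# toy usage: pretend we sampled a 6-token sequence and scored it with some metric
logprobs = torch.log(torch.rand(6) * 0.5 + 0.25)  # placeholder token log-probabilities
logprobs.requires_grad_(True)                     # in practice these already carry gradients
reward = 0.42                                     # e.g., sentence-level BLEU of the sample
baseline = 0.30                                   # e.g., reward of a greedy-decoded sequence
loss = reinforce_loss(logprobs, reward, baseline)
loss.backward()
```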
Training: Takeaways • Teacher forcing is still the main algorithm for training text generation models • Diversity is an issue with sequences generated from teacher-forced models • New approaches focus on mitigating the effects of common words • Exposure bias causes text generation models to lose coherence easily • Models must learn to recover from their own bad samples • E.g., scheduled sampling, DAgger • Or not be allowed to generate bad text to begin with (e.g., retrieval + generation) • Training with RL can allow models to learn behaviors that are challenging to formalize • But learning can be very unstable! 50
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 51
Types of evaluation methods for text generation • Ref: They walked to the grocery store . • Gen: The woman went to the hardware store . • Content Overlap Metrics • Model-based Metrics • Human Evaluations (Some slides repurposed from Asli Celikyilmaz’s EMNLP 2020 tutorial) 52
Content overlap metrics • Ref: They walked to the grocery store . • Gen: The woman went to the hardware store . • Compute a score that indicates the similarity between generated and gold-standard (human-written) text • Fast and efficient and widely used • Two broad categories: • N-gram overlap metrics (e.g., BLEU, ROUGE, METEOR, CIDEr, etc.) • Semantic overlap metrics (e.g., PYRAMID, SPICE, SPIDEr, etc.) 53
N-gram overlap metrics • Word overlap–based metrics (BLEU, ROUGE, METEOR, CIDEr, etc.) • They’re not ideal even for machine translation • They get progressively much worse for tasks that are more open-ended than machine translation • Worse for summarization, as longer output texts are harder to measure • Much worse for dialogue, which is more open-ended than summarization • Much, much worse for story generation, which is also open-ended, but whose sequence length can make it seem you’re getting decent scores! 54
A simple failure case • n-gram overlap metrics have no concept of semantic relatedness! • Prompt: Are you enjoying the CS224N lectures? • Reference: Heck yes ! • Candidate outputs and their overlap scores: “Yes !” – 0.61; “You know it !” – 0.25; “Yup .” – 0 (false negative); “Heck no !” – 0.67 (false positive) – a small sketch of such an overlap score follows below 55
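A minimal sketch (my own illustration: a BLEU-1-style unigram precision with a brevity penalty, not necessarily the exact metric behind the slide's numbers) showing the failure mode: the semantically fine "Yup ." gets zero overlap, while the wrong "Heck no !" scores highly:

```python
# Hedged sketch of a simple unigram-overlap score (precision x brevity penalty).
import math
from collections import Counter

def overlap_score(candidate: str, reference: str) -> float:
    cand, ref = candidate.lower().split(), reference.lower().split()
    ref_counts = Counter(ref)
    matches = sum(min(c, ref_counts[w]) for w, c in Counter(cand).items())
    precision = matches / len(cand)
    brevity = math.exp(1 - len(ref) / len(cand)) if len(cand) < len(ref) else 1.0
    return brevity * precision

reference = "Heck yes !"
for cand in ["Yes !", "You know it !", "Yup .", "Heck no !"]:
    print(f"{cand!r}: {overlap_score(cand, reference):.2f}")
# "Yup ." scores 0 despite being a fine answer; "Heck no !" scores high despite being wrong.
```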
Semantic overlap metrics • PYRAMID: Incorporates human content selection variation in summarization evaluation; identifies Summarization Content Units (SCUs) to compare information content in summaries (Nenkova et al., 2007) • SPICE: Semantic propositional image caption evaluation is an image captioning metric that initially parses the reference text to derive an abstract scene graph representation (Anderson et al., 2016) • SPIDEr: A combination of semantic graph similarity (SPICE) and an n-gram similarity measure (CIDEr); the SPIDEr metric yields a more complete quality evaluation metric (Liu et al., 2017) 56
Model-based metrics • Use learned representations of words and sentences to compute semantic similarity between generated and reference texts • No more n-gram bottleneck, because text units are represented as embeddings! • Even though the embeddings are pretrained, the distance metrics used to measure similarity can be fixed 57
Model-based metrics: Word distance functions • Vector Similarity: embedding-based similarity for semantic distance between texts • Embedding Average (Liu et al., 2016) • Vector Extrema (Liu et al., 2016) • MEANT (Lo, 2017) • YISI (Lo, 2019) • Word Mover’s Distance: measures the distance between two sequences (e.g., sentences, paragraphs, etc.) using word embedding similarity matching (Kusner et al., 2015; Zhao et al., 2019) • BERTSCORE: uses pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity (Zhang et al., 2020) 58
Model-based metrics: Beyond word matching • Sentence Movers Similarity: based on Word Movers Distance, evaluates text in a continuous space using sentence embeddings from recurrent neural network representations (Clark et al., 2019) • BLEURT: a regression model based on BERT that returns a score indicating to what extent the candidate text is grammatical and conveys the meaning of the reference text (Sellam et al., 2020) – a minimal embedding-average similarity sketch follows below 59
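A minimal sketch (my own illustration) of the simplest of these ideas, embedding averaging: represent each sentence as the mean of its word vectors and compare with cosine similarity. The small random embedding table is a hypothetical stand-in for real pretrained vectors:

```python
# Hedged sketch of an embedding-average similarity (cf. Liu et al., 2016).
import numpy as np

rng = np.random.default_rng(0)
vocab = {w: rng.normal(size=50) for w in
         "they walked to the grocery store woman went hardware .".split()}

def sentence_vector(sentence: str) -> np.ndarray:
    vecs = [vocab[w] for w in sentence.lower().split() if w in vocab]
    return np.mean(vecs, axis=0)                         # average the word embeddings

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

ref = "They walked to the grocery store ."
gen = "The woman went to the hardware store ."
print(f"embedding-average similarity: {cosine(sentence_vector(ref), sentence_vector(gen)):.2f}")
```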
Automatic metrics in general don’t really work ☹ (Liu et al., EMNLP 2016) 60
Human evaluations • Automatic metrics fall short of matching human decisions • Human evaluation is the most important form of evaluation for text generation systems • >75% of generation papers at ACL 2019 included human evaluations • Gold standard in developing new automatic metrics • New automated metrics must correlate well with human evaluations! 61
Human evaluations • Ask humans to evaluate the quality of generated text • Overall or along some specific dimension: • fluency • coherence / consistency • factuality and correctness • commonsense • style / formality • grammaticality • typicality • redundancy • Note: Don’t compare human evaluation scores across differently conducted studies, even if they claim to evaluate the same dimensions! 62
Human evaluation: Issues • Human judgments are regarded as the gold standard • Of course, we know that human eval is slow and expensive • … but are those the only problems? • Supposing you do have access to human evaluation: does human evaluation solve all of your problems? • No! • Conducting human evaluation effectively is very difficult • Humans: • are inconsistent • can be illogical • lose concentration • misinterpret your question • can’t always explain why they feel the way they do 63
Learning from human feedback • ADEM: a learned metric from human judgments for dialog system evaluation in a chatbot setting (Lowe et al., 2017) • HUSE: Human Unified with Statistical Evaluation (HUSE) determines the similarity of the output distribution and a human reference distribution (Hashimoto et al., 2019) 64
Evaluation: Takeaways • Content overlap metrics provide a good starting point for evaluating the quality of generated text. You will need to use one, but they’re not good enough on their own. • Model-based metrics can be more correlated with human judgment, but their behavior is not interpretable • Human judgments are critical • Only thing that can directly evaluate factuality – is the model saying correct things? • But humans are inconsistent! • In many cases, the best judge of output quality is YOU! • Look at your model generations. Don’t just rely on numbers! • Publicly release large samples of the output of systems that you create! 65
Today: Natural Language Generation 1. What is NLG? 2. The simple neural NLG model and training algorithm that we have already seen 3. Decoding from NLG models 4. Training NLG models 5. Evaluating NLG Systems 6. Ethical Considerations 66
Warning: Some of the content on the next few slides may be disturbing
Ethics of text generation systems • Tay: a chatbot released by Microsoft in 2016 • Within 24 hours, it started making toxic racist and sexist comments • What went wrong? https://en.wikipedia.org/wiki/Tay_(bot) 67
Ethics: Biases in text generation models (Warning: examples contain sensitive content) • Text generation models are often constructed from pretrained language models • Language models learn harmful patterns of bias from large language corpora • When prompted for this information, they repeat negative stereotypes 68 (Sheng et al., EMNLP 2019)
Hidden Biases: Universal adversarial triggers (Warning: examples contain highly sensitive content) • The learned behaviors of text generation models are opaque • Adversarial inputs can trigger VERY toxic content • These models can be exploited in open-world contexts by ill-intentioned users 69 (Wallace et al., EMNLP 2019)
Hidden Biases: Triggered innocuously (Warning: examples contain sensitive content) • Pretrained language models can degenerate into toxic text even from seemingly innocuous prompts • Models should not be deployed without proper safeguards to control for toxic content • Models should not be deployed without careful consideration of how users will interact with them 70 (Gehman et al., EMNLP Findings 2020)
Ethics: Think about what you’re building • Large-scale pretrained language models allow us to build NLG systems for many new applications • Does the content we’re building a system to automatically generate … really need to be generated? 71 (Zellers et al., NeurIPS 2019)
Concluding Thoughts • Interacting with natural language generation systems quickly shows their limitations • Even in tasks with more progress, there are still many improvements ahead • Evaluation remains a huge challenge • We need better ways of automatically evaluating the performance of NLG systems • With the advent of large-scale language models, deep NLG research has been reset • It’s never been easier to jump into the space! • One of the most exciting and fun areas of NLP to work in! 72
Bizarre conversations with my chatbot 73