Looking Outside the Context Window: In-Context Learning with Up to Hundreds of Examples

Stanford CS224N Custom Project

Linden Li
Department of Computer Science, Stanford University
lindenli@stanford.edu

Varun Shenoy
Department of Electrical Engineering, Stanford University
vnshenoy@stanford.edu

Abstract

Many approaches have tried to transfer the impressive capabilities of large language models to novel downstream tasks. Conventional adaptation methods typically involve re-training, where pretrained weights are used as the initialization to finetune a model on task-specific data. These approaches suffer from two drawbacks: the need for compute-intensive optimization and the inefficient storage of a unique set of model weights per task. One promising alternative is in-context learning, where a model learns how to perform a new task given a few examples in the prompt. Transformer models, however, rely on attention, whose cost grows quadratically with sequence length; the resulting finite context window has prevented the study of k-shot performance for large k. The recently released H3 model presents an architecture for language modeling that allows for arbitrary context lengths while achieving evaluation results competitive with transformers. We present the first study of large-scale in-context learning with up to 250 examples in a single prompt. We find that adding examples to the prompt boosts performance up to a critical point, after which we observe steeply declining performance. On some tasks, adding many in-context examples to the prompt yields performance competitive with finetuned counterparts, without the need for expensive re-training.

1 Key Information

• Mentor: Hong Liu
• External Collaborators (if you have any): Dan Fu, who gave us helpful pointers on the H3 model.
• Sharing project: Yes

2 Introduction

Language models have achieved impressive results at scale, exhibiting state-of-the-art results on a variety of natural language benchmarks (Brown et al., 2020; Hoffmann et al., 2022). As a result, a significant amount of work has been done on adapting language models to transfer their strong performance to downstream tasks. Since language models are trained to be task-agnostic, the typical approach is to re-train them with task-specific data.

Many adaptation approaches have been proposed in the literature. One approach, probing, involves freezing the weights of an existing pretrained language model and using it as a feature extractor. The last-layer features are used to retrain a linear layer, which outputs a task-specific result (Devlin et al., 2019; Liu et al., 2021b). Another, more popular, approach that yields strong performance is finetuning. Instead of retraining a single layer, a pretrained language model's weights are used as the starting point for optimization over a task-specific training set (Devlin et al., 2019; Wei et al., 2021; Sanh et al., 2021). Other work has tried to make this process more parameter-efficient by only training new "adapter" layers inserted between frozen pretrained weights on in-domain data (Houlsby et al., 2019). While these methods have led to strong performance on downstream benchmarks compared to training from scratch, they suffer from two drawbacks. First, they involve expensive re-training processes that require large amounts of computation and time.
Second, optimizing a new model for each downstream task produces a new set of domain-specific model weights, so a practitioner may need to store hundreds of gigabytes of model parameters per task.

A promising alternative to these adaptation strategies is in-context learning, an impressive capability exhibited by large language models at scale (Wei et al., 2022a; Brown et al., 2020). A natural language description of a task along with some accompanying examples are included in a prompt, and the model is tasked with providing a prediction on an unseen example. This technique has demonstrated impressive few-shot results, surpassing zero-shot baselines where no examples are included in the prompt, but still lagging behind finetuning approaches. While a natural extension is to include additional examples in-context, transformers suffer from an architectural limitation that makes this impossible. Transformers rely on self-attention, an O(N^2) operation in both memory and runtime relative to the input sequence length N. Due to these limitations, transformers are trained with a fixed context window, typically around 2048 tokens (Brown et al., 2020); a very long input prompt will throw a runtime error, since the transformer uses positional embeddings that are only defined up to the maximum sequence length. With this limitation, most prior work has only been able to fit at most 5 examples in-context (Liang et al., 2022).

Recently, state-space models (SSMs) have shown impressive results on long-range tasks (Gu et al., 2021; Goel et al., 2022). Unlike transformers, they do not have a fixed context window due to their reliance on recurrences, and their computation scales as O(N log N) with sequence length. Dao et al. (2022) apply SSMs to language modeling, introducing the H3 layer, which is designed to allow SSMs to perform well at recall tasks. H3 achieves evaluation metrics competitive with transformer models.

We utilize the long-context properties of H3 to present the first investigation of in-context learning beyond the few-shot regime as an adaptation strategy, performing up to 250-shot in-context evaluations. We observe that increasing the number of examples improves performance on many tasks up to a critical point, after which performance begins to steeply decline. On certain tasks, in-context performance is competitive with models finetuned on the entire training dataset.

3 Related Work

Language models and in-context learning. Since the introduction of the transformer architecture in Vaswani et al. (2017), autoregressive decoder-only variants have shown impressive and intriguing properties at scale (Brown et al., 2020; Hoffmann et al., 2022; Rae et al., 2021). One such property is in-context learning, introduced by Brown et al. (2020) and described by Wei et al. (2022a) as an emergent ability at scale whereby a language model can learn how to perform a task when given a small number of input-output pairs in the prompt. In-context learning demonstrates strong few-shot performance, exceeding zero-shot baselines on a variety of natural language understanding benchmarks. Compared to the finetuning paradigm popularized by Devlin et al. (2019), in-context learning presents an alternative without the need for expensive transfer learning. A substantial amount of follow-up work has proposed strategies for better prompting, such as including explanations or trying to induce chain-of-thought reasoning (Lampinen et al., 2022; Wei et al., 2022b; Arora et al., 2022).
Zhao et al. (2021) and Liu et al. (2021a) show that downstream performance is highly sensitive to the prompt: factors such as the order and choice of examples can have large impacts. Rubin et al. (2021) outline two methods of making predictions with in-context learning: textual generation, where a "gold" answer is taken as the prediction if it is found within the completion, and next-logit prediction, where the prediction is the label that is most likely under the model's distribution over the vocabulary. Recently, Ouyang et al. (2022) instruction-tuned models with human feedback to make prompting more faithful to natural language requests; we note that Dao et al. (2022) did not perform this procedure.

State-space models. State-space models (SSMs) have shown impressive results on long-sequence tasks, including audio generation (Goel et al., 2022) and general long-range sequence modeling (Gu et al., 2021). State-space models are a better candidate for long-sequence modeling than transformers because they scale as O(N log N) with sequence length N, unlike transformers, which scale as O(N^2) because of self-attention. Mehta et al. (2022) propose gated state spaces to apply SSMs to language modeling, and Dao et al. (2022) achieve competitive perplexity by introducing the H3 layer.

4 Approach

We utilize H3 from Dao et al. (2022) as the backbone and evaluate its performance across different tasks included in the SuperGLUE benchmark.

4.1 Prompting

Let D = {(x_i, y_i)}_{i=1}^{n} be a training dataset for a given task. For a k-shot evaluation, we would like to generate a prompt p consisting of k randomly sampled training examples. Since the choice of prompt heavily influences downstream performance, we ensure that the chosen in-context examples are roughly class-balanced: for a given class c with n_c examples, we sample ⌈n_c/k⌉ examples. For a chosen set of examples (x_1, y_1), ..., (x_k, y_k), we randomly shuffle the k examples to avoid the model exploiting spurious patterns in the prompt to make predictions. We show the prompts used for each task in the Appendix.
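To make this procedure concrete, below is a minimal Python sketch of the sampling and shuffling steps described above. The format_example callable and the equal per-class share of the k shots are illustrative assumptions rather than our exact implementation; the actual templates are those shown in the Appendix.

import random
from collections import defaultdict

def build_prompt(train_examples, k, format_example, seed=0):
    """Build a roughly class-balanced, shuffled k-shot prompt.

    train_examples: list of (x, y) pairs from the task's training set.
    format_example: hypothetical callable that renders one (x, y) pair
                    using a task template like those in the Appendix.
    """
    rng = random.Random(seed)

    # Group training examples by label.
    by_class = defaultdict(list)
    for x, y in train_examples:
        by_class[y].append((x, y))

    # Take an (approximately) equal share of the k shots from each class,
    # an illustrative stand-in for the ceil(n_c / k) rule in the text.
    # (Rounding means the prompt may contain slightly fewer than k examples.)
    per_class = max(1, k // len(by_class))
    chosen = []
    for examples in by_class.values():
        chosen.extend(rng.sample(examples, min(per_class, len(examples))))
    chosen = chosen[:k]

    # Shuffle so the model cannot exploit spurious ordering patterns.
    rng.shuffle(chosen)

    return "\n\n".join(format_example(x, y) for x, y in chosen)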
4.2 Parsing predictions

Let p_1, ..., p_T be a sequence of tokens containing training examples for a SuperGLUE task. Given an unseen validation example x, we would like to predict its corresponding label ŷ. We investigate two separate methods of retrieving predictions.

4.2.1 Generation

For datasets that involve choosing a binary answer (either True/False or Yes/No), we use open-ended generation and parse the model's completion for a "gold output." For example, for the BoolQ dataset, where the true label y ∈ {True, False} for all examples, we return "True" if "True" is a substring of the model's generation and "False" if "False" is a substring. If neither is found in the model's output, we automatically report an incorrect prediction. We display an example of this pipeline in Figure 1.

Figure 1: Example generation for the CB dataset. Here, the model's prediction is entailment, since the gold answer "Yes" was a substring of the completion "Answer: Yes".

4.2.2 Next logit prediction

For datasets that involve choosing between two candidate strings x^(0) and x^(1), we utilize next logit prediction, the implementation of which was inspired by Orr (2022). Let f̂(x) be a language model that, given an input x, returns the logits for each token in x. To compute P(x_{1:L}^(j) | p_{1:T}), we first compute the logits f̂(p_{1:T} ∥ x_{1:L}^(j)) ∈ R^{(T+L)×|V|}, where V denotes the vocabulary and ∥ is the concatenation operation. For each token x_1^(j), ..., x_L^(j) in the candidate string, we compute the log-likelihood of the token as

L(x_i^(j) | p) = log [ exp(f̂(p)_{(T+i-1), k}) / Σ_{v=1}^{|V|} exp(f̂(p)_{(T+i-1), v}) ],

where k is the token index in V corresponding to the i-th candidate token and f̂(p) denotes the logits of the concatenated sequence. The log-likelihood of the whole candidate sequence is then

L(x^(j) | p) = Σ_{i=1}^{L} L(x_i^(j) | p).

If L(x^(0) | p) > L(x^(1) | p), we return choice 0; otherwise, we return choice 1.
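To make the two parsing methods concrete, below are minimal Python sketches of both. They assume, purely for illustration, a Hugging Face-style causal language model and tokenizer; the function names and interface are assumptions and do not reproduce our exact implementation (which was inspired by Orr (2022)).

import torch

def parse_generation(completion, labels=("True", "False")):
    # Generation-based parsing (Section 4.2.1): return the first gold label
    # found as a substring of the completion, or None (scored as incorrect).
    for label in labels:
        if label in completion:
            return label
    return None

def candidate_log_likelihood(model, tokenizer, prompt, candidate):
    # Next-logit prediction (Section 4.2.2): sum the log-probabilities of the
    # candidate's tokens when appended to the prompt. Assumes a Hugging
    # Face-style causal LM whose output has a .logits field of shape
    # (1, sequence_length, vocab_size).
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    cand_ids = tokenizer(candidate, add_special_tokens=False,
                         return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, cand_ids], dim=1)

    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)

    T, L = prompt_ids.shape[1], cand_ids.shape[1]
    total = 0.0
    for i in range(L):
        # The logits at position T + i - 1 predict the i-th candidate token.
        total += log_probs[0, T + i - 1, cand_ids[0, i]].item()
    return total

def predict_choice(model, tokenizer, prompt, candidates):
    # Return the index of the candidate with the highest log-likelihood.
    scores = [candidate_log_likelihood(model, tokenizer, prompt, c)
              for c in candidates]
    return max(range(len(candidates)), key=lambda j: scores[j])

For a COPA-style example, candidates would hold the two candidate continuations, and the returned index corresponds to choice 0 or choice 1.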
5 Experiments

5.1 Data

We evaluate on the SuperGLUE benchmark from Wang et al. (2019), a standard set of NLP tasks ranging from question answering to natural language inference. We take inspiration from the prompts of Gao et al. (2021); Arora et al. (2022); Brown et al. (2020) when choosing how to present in-context examples for a given task. We evaluate on a subset of these tasks: BoolQ, CB, COPA, ReCoRD, and RTE. The corresponding evaluation metrics, dataset sizes, and task descriptions for these tasks can be found in Table 1.

Task     Train examples   Val examples   Eval metric   Task description
BoolQ    9427             3270           accuracy      question answering
CB       250              57             accuracy      natural language inference
COPA     400              100            accuracy      question answering
ReCoRD   101k             10k            F1            question answering
RTE      2500             278            accuracy      natural language inference

Table 1: Dataset sizes, evaluation metric, and short task descriptions for each task we evaluated.

5.2 Experimental details

We evaluate four separate model sizes released by Dao et al. (2022): H3-125M, H3-355M, H3-1.3B, and H3-2.7B (with 2 attention layers). For BoolQ, CB, RTE, and ReCoRD, we look for the gold answer in the text completion. We use generation parameters top_p = 1 and top_k = 1 to minimize stochasticity in predictions. For COPA, we use next logit prediction. Since performance is highly sensitive to the choice of in-context examples, we report results aggregated over n trials (where n = 10 for all datasets except BoolQ and ReCoRD, where n = 3, since these have large validation sets). We report the best accuracy, average accuracy, and standard deviation across the n trials. We evaluate for k ∈ {1, 5} ∪ {10, 20, 30, . . .}, and stop only when 1) the memory occupied by the model weights and the text prompt exceeds the GPU memory, or 2) there are insufficient samples per class to fit into the prompt. We display the maximum number of in-context examples for different tasks in Table 2 and report the average token length over all prompts in the validation set. All experiments were run on either a 32 GB NVIDIA V100 or a 24 GB NVIDIA A10G GPU.

We compare to two baselines at similar parameter sizes: OPT and GPT-Neo, two open-source transformer models. The performance benchmark that we compare to is a BERT model finetuned on the entire dataset, from Wang et al. (2019).

Task     Max. examples   Avg. token length
BoolQ    50              7526
CB       50              4863
COPA     250             7280
RTE      50              4447
ReCoRD   20              5050

Table 2: Maximum number of in-context examples used for different tasks in SuperGLUE. We report the average token length of these prompts. Note that GPT-3 has a context length of 2048 tokens, so none of these would have fit within the maximum sequence length of a transformer.
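As a sketch of this evaluation protocol, the loop below sweeps over shot counts and aggregates accuracy over repeated trials. Both make_prompt and evaluate_once are hypothetical helpers, standing in for the prompt construction of Section 4.1 and a single scoring pass over the validation set; the default shot and trial counts merely mirror the settings described above.

import statistics

def sweep_shots(train_examples, val_examples, make_prompt, evaluate_once,
                shot_counts=(1, 5, 10, 20, 30, 40, 50), n_trials=10):
    # For each k, build n_trials independent prompts (different random seeds),
    # score the validation set with each, and report best / mean / std accuracy.
    results = {}
    for k in shot_counts:
        accuracies = []
        for trial in range(n_trials):
            prompt = make_prompt(train_examples, k, seed=trial)
            accuracies.append(evaluate_once(prompt, val_examples))
        results[k] = {
            "best": max(accuracies),
            "mean": statistics.mean(accuracies),
            "std": statistics.stdev(accuracies) if len(accuracies) > 1 else 0.0,
        }
    return results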
5.3 Results

We display our results in Figures 2 and 3. For each graph, we display the baseline results and the finetuned BERT comparison, and we report the best and average accuracies with standard deviations.

6 Analysis

6.1 Use of in-context examples for large k

We find that adding additional examples helps for most tasks. While we expected to find monotonically increasing performance when adding examples, we instead observe a critical k at which performance begins to decrease. This critical point is followed by a period of high variance in performance across different prompts and then a steep decline, which we analyze below. In most cases, with ReCoRD as the exception, adding additional examples led to better performance than the transformer baselines. Most tasks achieved performance that approached the finetuned BERT model, but still lagged behind by a non-negligible number of percentage points. Our most competitive result was on the CB dataset, where H3-1.3B in the 20-shot setting nearly matched the finetuned model. For the vast majority of tasks, we did find slight improvements when scaling beyond 10-shot prompts, as was the case for CB, BoolQ, RTE, and COPA. We were surprised to discover that H3-1.3B achieves competitive performance with a finetuned BERT on the CB task. However, beyond a certain number of examples, the quality of the model's predictions degraded significantly.

6.2 Performance drop-off for generation methods

We observe that generation methods are significantly more brittle than logit scoring. Logit scoring on the COPA dataset exhibits far more consistent performance across different k, as seen in Figure 3. In this setting, however, adding examples seems to have minimal effect on performance; more work needs to be done to see whether this generalizes to other datasets. For the tasks where we were not able to use logit evaluation, we find severe issues with hallucination and generation quality beyond a certain number of tokens. On certain datasets, including BoolQ, CB, and ReCoRD, we observe that accuracy drops to near zero. When analyzing the completions generated in these high-shot settings, we find that the model no longer answers the question (e.g., yes/no or true/false) but instead hallucinates tokens from which no answer can be parsed properly. In Figure 4, we share some errors that are the product of hallucination. We identify this phenomenon to be closely related to the number of tokens in a given prompt, as seen in Table 2. Once we scale past roughly 4000 tokens, results begin to decline; there seems to be a point at which the model loses its capability to process and retain information. The optimal number of examples to include in a prompt therefore depends on the complexity of the task and the length of individual examples: enough examples are needed to help the model learn the task, but too many overload the model's capacity. We hypothesize that this could be because H3 is trained with a sequence length of 2048, causing the model to perform poorly when the context is too large.

Figure 2: Results for four different model sizes on CB, BoolQ, RTE, and ReCoRD. These datasets all used generation to make predictions. In most cases, adding additional examples helps, but performance typically saturates at different k. On CB, H3-1.3B achieves competitive performance with a finetuned BERT model with a 20-shot prompt, using no additional training.

Figure 3: Results for four different model sizes on COPA. Since COPA is a multiple-choice dataset, we utilize logit scoring to parse predictions. Unlike the generation methods, performance remains relatively consistent (adding examples seems to neither help nor cause a sharp performance drop).

Figure 4: Selection of errors for 50-shot CB and 40-shot BoolQ. The performance drop can be explained by the model's tendency to hallucinate text from which no predictions can be parsed.

7 Conclusion and Future Work

In this paper, we explore the use of a novel infinite-context language model architecture, H3, to determine whether additional in-context examples improve task performance. We find that for some tasks, in-context learning beyond the 3-shot regime is a suitable adaptation strategy, leading to competitive performance with finetuned counterparts. Despite this strong performance, we observe a drop-off in performance when the number of tokens in the input prompt grows too large. We hypothesize that this is because H3 was trained with a finite context length, suggesting the need for models trained with larger contexts for this approach to work effectively.

For future work, our first step would be to expand our test suite to the rest of the SuperGLUE benchmark. We picked five of the ten tasks available in SuperGLUE due to time constraints, but it is important to validate whether our trends extrapolate to the rest of the benchmark. Another key limitation is that the H3 weights we used were trained on a maximum sequence length of 2048. With weights from a model trained on longer sequences, we might see further improvements from 40- or 50-shot prompts. In the limit, it might be possible to place an entire training dataset for a language task in the prompt itself. Finally, there are other infinite-context language models that we could have explored in a similar manner, such as RWKV (BlinkDL, 2022). Future work could apply the techniques presented in this paper to those models as well.

References

Simran Arora, Avanika Narayan, Mayee F Chen, Laurel J Orr, Neel Guha, Kush Bhatia, Ines Chami, Frederic Sala, and Christopher Ré. 2022. Ask me anything: A simple strategy for prompting language models. arXiv preprint arXiv:2210.02441.

BlinkDL. 2022. RWKV: RNN with transformer-level LLM performance.

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.

Tri Dao, Daniel Y Fu, Khaled K Saab, Armin W Thomas, Atri Rudra, and Christopher Ré. 2022. Hungry hungry hippos: Towards language modeling with state space models. arXiv preprint arXiv:2212.14052.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.

Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, Jason Phang, Laria Reynolds, Eric Tang, Anish Thite, Ben Wang, Kevin Wang, and Andy Zou. 2021. A framework for few-shot language model evaluation.

Karan Goel, Albert Gu, Chris Donahue, and Christopher Ré. 2022. It's raw! Audio generation with state-space models. In International Conference on Machine Learning, pages 7616–7633. PMLR.

Albert Gu, Karan Goel, and Christopher Ré. 2021. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.

Andrew K Lampinen, Ishita Dasgupta, Stephanie CY Chan, Kory Matthewson, Michael Henry Tessler, Antonia Creswell, James L McClelland, Jane X Wang, and Felix Hill. 2022. Can language models learn from explanations in context? arXiv preprint arXiv:2204.02329.

Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. 2022. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110.

Jiachang Liu, Dinghan Shen, Yizhe Zhang, Bill Dolan, Lawrence Carin, and Weizhu Chen. 2021a. What makes good in-context examples for GPT-3? arXiv preprint arXiv:2101.06804.

Xiao Liu, Yanan Zheng, Zhengxiao Du, Ming Ding, Yujie Qian, Zhilin Yang, and Jie Tang. 2021b. GPT understands, too. arXiv preprint arXiv:2103.10385.

Harsh Mehta, Ankit Gupta, Ashok Cutkosky, and Behnam Neyshabur. 2022. Long range language modeling via gated state spaces. arXiv preprint arXiv:2206.13947.

Laurel Orr. 2022. Manifest. https://github.com/HazyResearch/manifest.

Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. arXiv preprint arXiv:2203.02155.

Jack W. Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. 2021. Scaling language models: Methods, analysis & insights from training Gopher.

Ohad Rubin, Jonathan Herzig, and Jonathan Berant. 2021. Learning to retrieve prompts for in-context learning. arXiv preprint arXiv:2112.08633.

Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. 2021.
Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30. Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2019. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems, 32. Jason Wei, Maarten Bosma, Vincent Y Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned language models are zero-shot learners. arXiv preprint arXiv:2109.01652. Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, Dani Yogatama, Maarten Bosma, Denny Zhou, Donald Metzler, Ed H. Chi, Tatsunori Hashimoto, Oriol Vinyals, Percy Liang, Jeff Dean, and William Fedus. 2022a. Emergent abilities of large language models. 9 Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Ed H. Chi, Quoc Le, and Denny Zhou. 2022b. Chain of thought prompting elicits reasoning in large language models. CoRR, abs/2201.11903. Zihao Zhao, Eric Wallace, Shi Feng, Dan Klein, and Sameer Singh. 2021. Calibrate before use: Improving few-shot performance of language models. In International Conference on Machine Learning, pages 12697–12706. PMLR. A Appendix A.1 Prompts Context : City of Manchester Stadium -- The stadium was built by Laing Construction at a cost of 112 million and was designed and engineered by ArupSport , whose design incorporated a cable - stayed roof structure which is separated from the main stadium bowl and suspended entirely by twelve exterior masts and attached cables . The stadium design has received much praise and many accolades , including an award from the Royal Institute of British Architects in 2004 for its innovative inclusive building design and a special award in 2003 from the Institution of Structural Engineers for its unique structural design . Question : does the etihad stadium manchester have a roof Answer : Yes ---Context : Prison escape -- In Mexico , Belgium , Germany and Austria , the philosophy of the law holds that it is human nature to want to escape . In those countries , escapees who do not break any other laws are not charged for anything and no extra time is added to their sentence . However , in Mexico , officers are allowed to shoot prisoners attempting to escape , and an escape is illegal if violence is used against prison personnel or property , or if prison inmates or officials aid the escape . Question : legal to break out of prison in germany Answer : Yes ---Context : Shutter speed -- In photography , shutter speed or exposure time is the length of time when the film or digital sensor inside the camera is exposed to light , also when a camera ’ s shutter is open when taking a photograph . The amount of light that reaches the film or image sensor is proportional to the exposure time . of a second will let half as much light in as . Question : are exposure and shutter speed the same thing Answer : Yes ---10 Context : The Good Place -- The series focuses on Eleanor Shellstrop ( Kristen Bell ) , a woman who wakes up in the afterlife and is introduced by Michael ( Ted Danson ) to ‘‘ The Good Place ’ ’ , a Heaven - like utopia he designed , in reward for her righteous life . 
She realizes that she was sent there by mistake and must hide her morally imperfect behavior and try to become a better person . William Jackson Harper , Jameela Jamil and Manny Jacinto co - star as other residents of ‘‘ The Good Place ’ ’ , together with D ’ Arcy Carden as Janet , an artificial being helping the inhabitants . Question : is there a good place in the good place Answer : Yes ---Context : George Washington Bridge -- Eastbound vehicles must pay a toll to cross the bridge ; as with all Hudson River crossings along the North River , westbound vehicles cross for free . As of December 6 , 2015 , the cash tolls going from New Jersey to New York are $15 for both cars and motorcycles . E - ZPass users are charged $10 .50 for cars and $9 .50 for motorcycles during off - peak hours , and $12 .50 for cars and $11 .50 for motorcycles during peak hours . Trucks are charged cash tolls of $20 .00 per axle , with discounted peak , off - peak , and overnight E - ZPass tolls . A discounted carpool toll ( $6 .50) is available at all times for cars with three or more passengers using NY or NJ E - ZPass , who proceed through a staffed toll lane ( provided they have registered with the free ‘‘ Carpool Plan ’ ’) . There is an off - peak toll of $7 .00 for qualified low - emission passenger vehicles , which have received a Green E - ZPass based on registering for the Port Authority Green Pass Discount Plan . Question : is there a toll both ways on the george washington bridge Answer : No ---Context : Ethanol fuel -- All biomass goes through at least some of these steps : it needs to be grown , collected , dried , fermented , distilled , and burned . All of these steps require resources and an infrastructure . The total amount of energy input into the process compared to the energy released by burning the resulting ethanol fuel is known as the energy balance ( or ‘‘ energy returned on energy invested ’ ’) . Figures compiled in a 2007 report by National Geographic Magazine point to modest results for corn ethanol produced in the US : one unit of fossil - fuel energy is required to create 1.3 energy units from the resulting ethanol . The energy balance for sugarcane ethanol produced in Brazil is more favorable , with one unit of fossil - fuel energy required to create 8 from the ethanol . Energy 11 balance estimates are not easily produced , thus numerous such reports have been generated that are contradictory . For instance , a separate survey reports that production of ethanol from sugarcane , which requires a tropical climate to grow productively , returns from 8 to 9 units of energy for each unit expended , as compared to corn , which only returns about 1.34 units of fuel energy for each unit of energy expended . A 2006 University of California Berkeley study , after analyzing six separate studies , concluded that producing ethanol from corn uses much less petroleum than producing gasoline . Question : does ethanol take more energy make that produces BoolQ Prompt Context : A : Sometimes you hear things on the radio that , you know , could be true or couldn ’ t be . B : Uh - huh . A : Uh , do you feel like this is , I guess they ’ re spending a billion or so a year on this AIDS research . B : Uh - huh . A : Do you think they should spend more ? Question : they should spend more Answer : Neither Context : At the heart of the universe there is cruelty . We are predators and are preyed upon , every living thing . 
Did you know that wasps lay their eggs in ladybirds piercing the weak spot in their armour ? Question : wasps lay their eggs in ladybirds Answer : Yes Context : B : And the tanks came in and , you know , pretty much took care of that . A : Exactly . B : And , A : Yeah , uh , that , personally I don ’ t see as Gorbachev as being maybe a threat , and I think he ’ s actually , honestly trying to do some change . B : Uh - huh . A : But I don ’ t believe that he , in this first pass around , you know , being the first one to really turn things around or attempt to is going to be allowed to get away with it either . Question : Gorbachev is going to be allowed to get away with doing some change Answer : No Context : A : How did Radio Shack work ? B : If you go in and buy anything they want your phone number . And I don ’ t think they ’ re going to call me and ask me how it ’ s functioning , Question : they ’ re going to call him Answer : No Context : B : No , it was , I didn ’ t like the way it ended . A : I know , well the only reason I know why it ended is on Arsenio Hall one night , Christopher Reeves told , that , you know , B : Uh - huh . A : I can ’ t believe they killed them . Question : they killed them Answer : Yes Context : Valence the void - brain , Valence the virtuous valet . Why couldn ’ t the figger choose his own portion of titanic anatomy to shaft ? Did he think he was helping ? 12 Question : Valence was helping CB Prompt premise : Jill Pilgrim , general counsel of USA Track and Field , brought up the issue during a panel on women ’ s sports at the sports lawyers conference . Pilgrim said the law regarding who is legally considered a woman is changing as sex - change operations become more common . hypothesis : Sex - change operations become more common . label : yes premise : Les Paul , who continues to perform weekly at New York Iridium Jazz Club , has finished recording " Les Paul & Friends ." hypothesis : Iridium Jazz Club is located in New York . label : yes premise : A strong supporter of the " Italian road to socialism " , he was close to Enrico Berlinguer , and gained a position in the party secretariat . In 1969 , he drew up the report proposing the expulsion from the party of the Manifesto group . In 1984 , after Berlinguer ’ s death , Natta was elected as party secretary . hypothesis : Natta supported Italian Socialism . label : yes premise : Bogota , 4 May 88 - The dissemination of a document questioning Colombia ’ s oil policy , is reportedly the aim of the publicity stunt carried out by the pro - Castro Army Of National Liberation , which kidnapped several honorary consuls , newsmen , and political leaders . hypothesis : Several honorary consuls were kidnapped on 4 May 88. label : no premise : PM tried to buy the Belin biscuit company from RJR Nabisco two years ago . hypothesis : American tobacco companies began to diversify production . label : no premise : For lunch I went to Cipriani . The good thing about Cipriani is that it ’ s all Italian . Every single person is Italian . Even the American sommelier is Italian . Everybody speaks Italian . It ’ s a good feeling . I consider Cipriani one of the most refined services that I ’ ve ever had in a restaurant . For lunch I had spaghetti a la chitarra with Amatriciana sauce . I had beef tartar . I had fried seafood , mixed . I had also the fresh pasta with the duckling r a g . It was outstanding . Then I got a plate of Parmesan with green olives and I got the whole roasted branzino . It was me and another person . 
We had several glasses of wine . We didn ’ t get dessert ; we had a glass too much of wine , so we were very full . We stayed there like an hour just finishing the wine because my friend ordered a bottle . hypothesis : Amatriciana is a sauce . label : yes 13 premise : South African President Thabo Mbeki , the main mediator in Cote d ’ Ivoire ’ s peace process , said , on Sunday , that Pretoria is heightening its intervention in the West African nation in order to pave the way for elections later this year . hypothesis : Thabo Mbeki is a citizen of Cote d ’ Ivoire . label : no premise : The chapters voluntarily transferred their right of electing the bishop to Emperor Charles V , and Pope Clement VII gave his consent to these proceedings . hypothesis : Emperor Charles V was elected by Clement VII . label : no premise : Harrington , of Fitchburg , Massachusetts , was taken to an area hospital and is listed in critical condition . No other vehicles were struck during the crash . Authorities said others at the scene also assisted , including a turnpike employee and two motorists who carried Harrington out of the truck as police arrived . Fitzgerald said he had never used the defibrillator before Tuesday . " As a trooper , you see more negative than positive out there ," Fitzgerald said . " It feels good when you can help someone and it feels good knowing that all those people had stopped to help before I got there ." hypothesis : Harrington is a resident of Massachusetts . label : yes premise : Everest Grand Circle Expedition , Nepal and Tibet . First circumambulation of Everest ; trekking , skiing , and mountaineering . First American winter ascent of Pumori ( elev . 23 ,422 ’) . Immortalized in the book Everest Grand Circle . Ned Gillette , Jan Reynolds , Jim Bridwell , Steve McKinney , Craig Calonica and Rick Barker . hypothesis : A woman succeeds in climbing Everest solo . label : no RTE prompt. We replaced "entailment" and "not_entailment" with "yes" and "no" for simplicity. passage : By Ellie Zolfagharifard PUBLISHED : 12:07 EST , 12 August 2013 | UPDATED : 01:37 EST , 14 August 2013 The Perseid meteor shower reached a peak yesterday with up to 60 shooting stars an hour in the UK . Amateur astronomers were able to capture stunning images after they were treated to incredible views of the annual cosmic event . The skies are expected to shimmer with a ’ natural firework display ’ again late last night as a meteor shower crosses into the E a r t h s atmosphere . Scroll down for videos Stonehenge looks even more magical than usual as it sits beneath the annual Perseid meteor shower in Salisbury Plain - Perseid reached a peak early yesterday with up to 60 shooting stars an hour - Annual event lit up the sky last night and in the early hours of yesterday - The shower is a result of material falling from the tail of Comet Swift - Tuttle 14 query : A meteor streaks past stars in the night sky over @placeholder , as the Earth passes through a stream of space debris left by comet Swift - Tuttle @placeholder : Stonehenge passage : By Emily Kent Smith A three - year - old who was given an egg as an Easter present was not allowed to have his name on the chocolate - because he shares his name with footballer Wayne Rooney . Rooney Scholes , from Manchester , was told that having just Rooney on the egg would cause ’ copyright issues ’. Yet UK law states that a person ’ s name can not be subject to copyright . 
Scroll down for video Rooney Scholes , three , from Manchester was not allowed to have his name written on the egg because of ’ copyright issues . His mother Jo - Anne ( R ) called the shop ’ s behaviour ’ barmy ’ - Rooney Scholes , three , told he could not have his first name on the egg - Staff at Thorntons , Bury , said it would create ’ copyright issues ’ - Yet they agreed to let him have his full name inscribed on the chocolate - Mother Jo - Anne branded behaviour of chocolate shop staff ’ madness ’ query : said : ’ @placeholder apologises for the service provided to Ms . Scholes at @placeholder : Thorntons passage : The U . S . Department of Education is legally prohibited from having any control over curriculum or instruction in the nation ’ s public schools , but nonetheless Secretary of Education Arne Duncan is a zealous advocate of the new Common Core standards for students ’ proficiency in English and math . First , he said their critics were members of extremist groups , and he recently assailed the parents who criticize them as " white suburban moms who all of a sudden their child isn ’ t as brilliant as they thought they were , and their school isn ’ t quite as good as they thought they were ." His remarks were prompted by the nearly unanimous outrage expressed by parents -- moms and dads -at public forums in suburban districts in New York , following the release of the abysmal results of the new Common Core tests . - Diane Ravitch : Education department should not push Common Core standards - Ravitch : Just 31% of N . Y . students passed because standards unrealistic - Ravitch : Teachers are not prepared to teach them ; parents don ’ t like them - Field - testing should have been done , she says , not fast implementation query : @placeholder students take more tests than students in any other nation . @placeholder : U . S . passage : By Mike Dawes PUBLISHED : 05:41 EST , 1 January 2014 | UPDATED : 05:16 EST , 3 January 2014 Arsenal ’ s table - topping footballers posted a ’ get well soon ’ message to Michael Schumacher on Instagram after their 2 -0 victory over 15 Cardiff City at the Emirates . Their tribute came hours after Schumacher ’ s manager described his condition as ’ stable ’ by his manager in the wake of his skiing accident . The German F1 ace has spent a third night at the University Hospital of Grenoble , where he was taken after the accident on Sunday . The 44 - year - old seven - time Formula One world champion hit his head at Meribel in the French Alps and there was grave concern for his condition . - Sabine Kehm says there has been no change in Michael Schumacher ’ s condition - More good news after surgeons admit improvement in brain on Tuesday - F1 legend was airlifted off slopes after accident on Sunday query : improvement continued into Tuesday morning , with @placeholder now reporting a @placeholder : Sabine Kehm passage : It is the ’ Jewel of Japan ’ , but Kanazawa , one of the top destinations for Japanese tourists , is barely known outside the country . Tucked between the Sea of Japan and the Japan Alps , peaks etched on the horizon like a backdrop to a stage , Kanazawa is rather off the beaten track . That could all change when the shinkansen , Japan ’ s famous bullet train , arrives next year at an appropriately gleaming station , rebuilt in 2005 under a dome of glassand - steel fretwork and fronted by a wooden gate shaped like a drum , with a digital clock marked out in tiny bubbling fountains . It is a tourist attraction in its own right . 
- Kanazawa is hugely popular with Japanese tourists , but unknown beyond - It is the capital city of the Ishikawa region , on Japan ’ s main island , Honshu - The city is renowned for its historic structures and sense of tradition query : Nearly all gold leaf used in @placeholder comes from Kanazawa . @placeholder : Japan Alps passage : Animals in a Ukrainian zoo have been left to die of starvation in the wake of the c o u n t r y s political turmoil , it has been claimed . The director of Kharkiv Zoo blamed U k r a i n e s warring politicians for failing to provide funds , saying the zoo only have enough food to last until Monday . Alexey Grigoriev is said to be in t e a r s over the plight of the animals , and has pleaded with the prime minister for help . Starving : Staff at Kharkiv Zoo , Ukraine say a pregnant elephant , claimed to be ’ hungry and on the point of expiring from exhaustion ’ Our animals are not fighting for power , they do not share anyone ’ s political views , they just want to live , said a statement by the zoo . - Animals in a Ukraine Zoo are starving after government cuts funds - Kharkiv Zoo will run out of food by the end of the weekend query : A letter sent by the director Grigoriev to Ukraine ’ s prime minister said : The @placeholder zoo animals on the verge of starvation . @placeholder : Kharkiv Zoo 16 passage : ( CNN ) -- Seamus Heaney , the Irish Nobel laureate who died Friday at 74 , will be remembered for his translations , for his literary essays , for his generous international public presence , but principally for the poetry he himself wrote . Though the Heaney of the poems could sound unsettled , or even tormented , he was in person equable , welcoming , generous ; these qualities would enter the poetry too . And he will be remembered not for one kind of poetry , but for several : He amazed even attentive admirers as he became , over his long career , in one way the opposite of his early self . His first great poems were tough , inward , tied to the soil ; his last , just as Irish , were confident , sometimes gleeful , creatures of air . - Stephen Burt : Seamus Heaney , who died Friday , wrote poetry , literary essays , translations - His early works were of earth , and of the Troubles ; he found fame writing about divided land - He says later he went south , wrote of civic , family life , dead friends , embraced the numinous - Burt : He became perhaps the most popular serious poet writing in English anywhere query : Yet he remained connected to the particulars of the @placeholder spaces he knew , to his first friends in poetry ( and in folk music ) , and to his own earlier selves . @placeholder : Irish passage : The most boring calendar for 2015 has hit the shelves - featuring the post boxes of Wales . Self - confessed ’ dull man ’ Kevin Beresford from Redditch , Worcestershire , came up with the idea to celebrate post boxes which stand in the cities , mountains and valleys of Wales . It follows his 2014 calendar which featured the telephone boxes of Wales which became a best seller . The post box calender follows Kevin Beresford ’ s 2014 best seller about the best phone boxes in Wales Self - confessed ’ dull man ’ Kevin Beresford said that the post office boxes ’ things of beauty ’ and of historical importance Mr Beresford , 62 , said : ’ People may think post boxes are a bit dull but I they are things of great beauty and of historical importance . 
- Kevin Beresford said post boxes aren ’ t boring ’ are things of great beauty and of historical importance ’ - He has previously published calendars celebrating the Britain ’ s best roundabouts and prisons . - Mr Beresford has featured as Mr January in a calender showcasing Britain ’ s dullest men query : Kevin said : ’I live in Redditch which must be the most boring town in @placeholder and I ’ ve been married and divorced three times . @placeholder : Britain passage : Chelsea ’ s early season form may have led to comparisons with the Arsenal ’ Invincibles ’ side , but Gary Neville believes they aren ’ t even as good as the Chelsea side from 10 years ago . Jose Mourinho ’ s side are currently four points clear at the top of the Premier League , but after letting leads slip against both Manchester City and United , their killer instinct has been called into question . ’ If a team are going to be playing for a 1 -0 then you better see it out , ’ Neville said on Monday Night Football . 17 ’ When I saw Jose Mourinho two weeks ago he talked about the 2005 ( Chelsea ) team and ( compared ) the team he had then to the team he has now and he said the killer instinct ’ s missing . - Chelsea are four points clear at the top of the Premier League - Jose Mourinho ’ s side have proved themselves to be early title favourites - But Gary Neville believes there is still room for improvement - The former Manchester United defender criticised their lack of killer instinct - Chelsea dropped points against both Manchester clubs query : ’ When ( Manchester ) @placeholder went down to 10 men I thought Chelsea let them off the hook and yesterday at 1 -0 up I think Chelsea let United off the hook . @placeholder : Manchester City passage : By Simon Jones Tottenham have been rebuffed in an initial attempt to offer Gylfi Sigurdsson in return for Swansea ’ s Ben Davies and Michel Vorm . Spurs boss Mauricio Pochettino is looking to introduce some new faces at White Hart Lane following his arrival from Southampton this summer , and he sees Davies and Vorm as ideal additions to his squad . Spurs target : Pochettino wants to sign Davies before the start of the Premier League season Exchange : Daniel Levy has offered Sigurdsson for Davies and Dutch international Vorm Left - back Davies enjoyed a good season for the Welsh side as they finished 12 th in the Premier League under Garry Monk , with Swansea chairman Huw Jenkins valuing the Englishman at 10million . - Swansea chairman Huw Jenkins wants to take Sigurdsson back to the Liberty Stadium after a successful loan spell in 2012 - Spurs , aware of this interest , have offered Sigurdsson in exchange for English left - back Davies and Holland international Vorm - Mauricio Pochettino is keen to revamp the squad at White Hart Lane query : Sigurdsson enjoyed a successful five - month loan spell at @placeholder back in 2012. @placeholder : Swansea ReCoRD prompt. 18