A Machine Learning Approach for Improved BM25 Retrieval

Krysta M. Svore and Christopher J. C. Burges
Microsoft Research
One Microsoft Way
Redmond, WA 98052
{ksvore,cburges}@microsoft.com

Microsoft Research Technical Report MSR-TR-2009-92
July 30, 2009

Abstract

BM25 is one of the most widely used information retrieval functions because of its consistently high retrieval accuracy. Despite its widespread use, there have been few studies examining its effectiveness on a document description over single and multiple field combinations. We determine the effectiveness of BM25 on various document fields. We find that BM25 models relevance on popularity fields such as anchor text and query click information no better than a linear function of the field attributes. We also find query click information to be the single most important field for retrieval. In response, we develop a machine learning approach to BM25-style retrieval that learns, using LambdaRank, from the input attributes of BM25. Our model significantly improves retrieval effectiveness when the document description is over single or multiple fields. Our data-driven approach is fast, effective, avoids the problem of parameter tuning, and can directly optimize for several common information retrieval measures. We demonstrate the advantages of our model on a very large real-world Web data collection.

1 Introduction

BM25 [16] is arguably one of the most important and widely used information retrieval functions. It has served as a strong baseline in the information retrieval community, in particular in the TREC Web track [5, 6]. Originally designed to be computed over the body and title fields of a Web document, BM25 is a nonlinear combination of three key document attributes: term frequency, document frequency, and document length. Recent research suggests that using click information in ranking can significantly improve accuracy [1, 7, 24]. Since popularity fields are not generated by the author of the document, the question has been raised whether the foundations behind BM25 are suitable for popularity fields [21]. We are particularly interested in the retrieval effectiveness of popularity fields. We empirically determine how well BM25 models content fields, such as the title, URL, and body fields, versus popularity fields, such as anchor text and query click. We demonstrate that BM25 works remarkably well for most content fields, but not for popularity fields.

BM25F [17] is an extension of BM25 that prescribes how to combine more than one field in the document description, and correspondingly how to compute BM25 across the expanded document description. We determine the contribution of single and multiple field combinations to retrieval effectiveness and correspondingly evaluate the retrieval capacity of BM25 on these fields.

A challenge to using BM25 and BM25F is the necessity of parameter tuning. Parameters control the contributions of term frequency, field length, and field weight. BM25F requires the tuning of 2K + 1 parameters for a document description containing K fields. Tuning can be accomplished using a grid-search method, or by using gradient descent [21]. Each method has its drawbacks: grid search can be time intensive, and in fact prohibitively slow when the data collection is large, but can find a reasonable set of parameter values for a given target cost function.
Gradient descent is much faster, but the method in [21] does not optimize the parameters directly for a target evaluation measure and finds parameters no better than the grid-search technique. Recently, it has been shown that LambdaRank [3] is empirically optimal [8, 25] for Mean Average Precision, Mean Reciprocal Rank, and NDCG, and likely for other IR measures as well. We could extend the approach in [21] to use LambdaRank and thus optimize the parameters for a chosen IR measure, but the function would still be restricted to the predefined BM25 and BM25F probabilistic models.

The probabilistic model is not the only way to approach information retrieval; we consider a machine learning approach to develop a BM25-style model. A machine learning approach has been difficult previously due to the difficulty of obtaining large amounts of training data. A resulting challenge has been how to prevent a complex model with many parameters from overfitting the training data. However, by training LambdaRank [3] on the input attributes of BM25F and over a large data collection, we are able to build a rich, expressive retrieval model.

Our main interest is in developing an improved ranking model that uses the same input attributes as BM25; we therefore consider how to develop a retrieval function that considers information across fields. Our model learns from the same attributes as BM25, namely term frequency, document frequency, and document length, and avoids parameter tuning. We learn the function directly from the data under consideration by training LambdaRank on the input attributes of BM25, while also optimizing directly for the IR evaluation measure of choice. We call our model LambdaBM25, since it uses the LambdaRank training method and the inputs of BM25. Our model is very easy to retrain periodically as attribute values change (for example, as anchor text and query click fields change over time), and can be used as a framework for learning other functions by substituting their input attributes. We believe our model offers value in the design of future information retrieval systems.

Our primary contributions are threefold:

• We empirically determine the effectiveness of BM25 for different field types. Although BM25 is effective on the title and URL fields, we find that on popularity fields it does not perform as well as a linear model.

• We develop a machine learning model, called LambdaBM25, that is based on the attributes of BM25 [16] and the training method of LambdaRank [3]. Our model is both fast and simple; it does not require any parameter tuning and is an extension of a state-of-the-art neural net ranking approach. It combines the input attributes of BM25 with the principles of machine learning and goes beyond the probabilistic model with a data-driven approach. In addition, LambdaBM25 optimizes for MAP, MRR, or NDCG, as well as potentially other IR measures [8, 25].

• We extend our empirical analysis to a document description over various field combinations. We confirm that BM25F [17] is better than a linear function of BM25 scores. We then extend our model, LambdaBM25, to document descriptions consisting of combinations of fields and find it consistently outperforms BM25F with statistical significance.

2 Related Work

There have been a number of approaches to document retrieval, ranging from simple to complex models.
BM25 [16] is based on a probabilistic information retrieval model [20] which incorporates attributes of documents, such as term frequencies, document frequencies, and document length. A generalized inverse document frequency model was recently developed that can also be incorporated into BM25 [13]. BM25 is one of the most widely used retrieval methods and serves as a standard baseline in the information retrieval community.

More recently, there has been interest in a simple retrieval function that can capture signals over multiple document fields. The development of this work began with Wilkinson [22], who evaluated various ways to weight scores that came from different fields of a document. Ogilvie and Callan [14] overview different field combination methods as well as propose several novel approaches with a complete evaluation. More recently, Robertson et al. proposed a simple method of combining attributes across multiple fields called BM25F [17].

A drawback of BM25 and BM25F is the difficulty of optimizing the function parameters for a given information retrieval measure. There have been extensive studies on how to set term frequency saturation parameters and length normalization parameters [18, 10]. Taylor et al. [21] proposed an approach based on gradient descent that significantly reduces the tuning time over standard grid-search heuristics. Their technique does not, however, directly optimize for the evaluation measure or offer improved accuracy over a grid-search heuristic. Recently, it has been shown that LambdaRank [3] is empirically optimal for several IR measures [8, 25], in particular for NDCG, Mean Average Precision, and Mean Reciprocal Rank. Our work is a combination of LambdaRank and BM25 attributes, and allows us to optimize directly for the IR measure under consideration.

Recent studies demonstrate the effectiveness of query click data for ranking [1, 7, 9, 24]. However, to our knowledge, there is no detailed study of the effectiveness of BM25 on single document fields or on subsets of document fields, including anchor text and query click logs. In addition, we are unaware of efforts to develop a directly analogous retrieval model based on the same attributes as BM25. Our work provides both an extensive study of the contributions of different document fields to information retrieval and a framework for improving BM25-style retrieval.

The remainder of the paper is laid out as follows. In the next section, we review the fields of a document. Section 4 provides an overview of BM25 and BM25F. In Section 5, we describe the evaluation measure NDCG [11] and the neural network ranking algorithm LambdaRank [3]. In Section 5.4, we discuss how to learn a BM25-like retrieval function over a large data collection. In Section 6, we describe our experiments and present our results. Finally, we conclude in Section 7 and discuss future directions of our work.

3 Document Fields

A Web document is composed of several fields of information, in particular the title of the page, the URL, the body text, the anchor text, and queries that lead to a click on the page. A field may either be written by the owner of the page, as in the case of the body text, the URL, and the title of the document, or by other authors, as in the case of anchor text and query click information. We call the former sources of information content fields and the latter sources of information popularity fields. The document description is a concatenation of the available fields of information.
The document description may be restricted to certain fields, for example, if no click information is available or if a retrieval system does not crawl anchor text. To study the effects of various fields, we can restrict the document description to particular subsets of fields, or to a single field. Such studies can be enlightening and may lead to further improvements in information retrieval functions.

All field information is preprocessed by removing punctuation, converting to lowercase, and removing HTML markup. A query q is composed of terms; we consider queries which contain at most 10 terms. A document description is decomposable into smaller units, for example, characters, terms, or phrases; we follow the traditional decomposition into terms. The document frequency for term t is the number of documents in the collection that contain term t in their document descriptions. Note that the document description may be over one or several fields. The term frequency is calculated per term and per field by counting the number of occurrences of term t in field F of the document under consideration. We measure the length of a field by counting the number of terms in the field. In the remainder of this section, we review the fields of a Web document.

3.1 Content Fields

The content fields of a document include the body text, the document's title, and the URL text. The body field consists of the HTML content of the page. It includes outgoing link information, image text, navigation bars, and so on, in addition to the core content. The title field contains the title of the document, indicated by the author through HTML <TITLE> tags. For example, for the site http://webmessenger.msn.com, the title field is msn web messenger. The URL field contains the text of the page's web address, after word breaking. For example, for the site http://webmessenger.msn.com, the URL field is web messenger msn. The body field is typically significantly longer than the URL and title fields.

3.2 Popularity Fields

Popularity fields include anchor text and query click information. Unlike content fields, popularity fields are not written or controlled by the document's owner, but rather are an aggregation over information about the page from many authors. Popularity fields can be highly repetitive for common pages, and can have a short length for lesser-known (tail) pages.

(msn web messenger, 1802)
(webmessenger, 1278)
(web messenger, 526)
(msn web, 176)
(access messenger via the web, 95)
(web msn, 78)
(web msn messenger, 65)
(msn messenger, 40)
(msn, 37)
(webmsn, 26)
(here, 8)
(msn webmessenger, 7)
(this, 5)
...

Figure 1: Extract of the anchor text field for the site http://webmessenger.msn.com.

3.2.1 Anchor Text Field

The anchor text field is composed of the text of all incoming links to the page. Anchor text is supposed to indicate the trustworthiness of the document; if people link to a page, it signals that those people trust its content. Figure 1 lists an extract of the anchor text field for the site http://webmessenger.msn.com. We compress the field by listing an anchor text string followed by the number of incoming links with that string. Note that this representation preserves term ordering. Our representation of the anchor text field equally weights all incoming links, regardless of their parent page. In future research, it may be suitable to weight the text of an incoming link by the importance of the page it comes from, for example by its PageRank score [2].

The anchor text field can be highly repetitive. Most Web pages tend to have large numbers of incoming links containing identical terms. The field may be repetitive if there are incoming links with repetitive text, or the field may be elaborative if the text of incoming links is diverse. For example, in Figure 1, the first two examples of anchor text contribute a large number of repetitive terms to the anchor text field. Ideally, this behavior (repetitive versus elaborative) should be learned automatically from the document collection (see Sections 4 and 5.4).
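As an illustration of how field statistics are computed from this compressed representation, the following Python sketch (our own; the helper name and the convention of counting each anchor string once per incoming link are assumptions, not prescribed by the paper) derives per-term frequencies and the field length for the anchor text field.

    from collections import Counter

    def anchor_field_stats(anchors):
        """Term frequencies and field length for the anchor text field.

        `anchors` is the compressed representation of Figure 1: a list of
        (anchor string, incoming-link count) pairs. Each anchor string is
        counted once per incoming link.
        """
        tf = Counter()
        field_length = 0
        for text, count in anchors:
            terms = text.split()  # field text is already lowercased and stripped of punctuation
            field_length += count * len(terms)
            for term in terms:
                tf[term] += count
        return tf, field_length

    # Using the first three entries of Figure 1:
    tf, length = anchor_field_stats([("msn web messenger", 1802),
                                     ("webmessenger", 1278),
                                     ("web messenger", 526)])
    # tf["messenger"] == 1802 + 526 and length == 3*1802 + 1278 + 2*526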
3.2.2 Query Click Field

Another source of information for document retrieval is query click information. Query click information for a document, aggregated across many users, signals relevance for a given query. It shares with anchor text the property that the queries are not authored by the document's owner. We follow [1, 9] and build the query click field from query session data. For details on the query click field, we refer the reader to [9]. We briefly review the field in this section.

Our query click data consists of query sessions extracted from one year of a commercial search engine's query log files. A query session consists of a user-issued query and a ranked list of 10 documents, each of which may or may not be clicked by the user. A query session can be represented by a triplet (q, r, c) [12], where q is the query, r is the ranking of documents, and c is the set of documents the user clicked on.

In [9], the query click field is represented by a set of query-score pairs (q, Score(d, q)), where q is a unique query string and Score(d, q) is a score assigned to that query. Score(d, q) could be the number of times the document was clicked on for that query, but it is important to also consider the number of times the page has been shown to the user and the position in the ranked list at which the page was shown. The score in [9] represents the importance of the query q in describing the relevance of document d; it does not consider position, but does consider the number of times the document has been shown to users. The score can be derived from raw click data as

    Score(d, q) = [C(d, q, click) + β · C(d, q, last click)] / C(d, q),    (1)

where C(d, q) is the number of times d is shown to the user when q is issued, also called the number of impressions, C(d, q, click) is the number of times d is clicked for q, and C(d, q, last click) is the number of times d is the temporally last click of q. β is a scaling factor and can be tuned. Since the last clicked document for a query is a good indicator of user satisfaction, the score is increased in proportion to β by the last click count. Figure 2 shows an extract of the query click field for the site http://webmessenger.msn.com, reproduced from [9] for completeness.

The term frequency of term t for the query click field is calculated as

    TF_t = Σ_{p : t ∈ p} Score(d, q),    (2)

where the sum is over the query-score pairs p whose query contains term t.

(msn web, 0.6675749)
(webmessenger, 0.6621253)
(msn online, 0.6403270)
(windows web messenger, 0.6321526)
(talking to friends on msn, 0.6130790)
(school msn, 0.5994550)
(msn anywhere, 0.5667575)
(web message msn com, 0.5476839)
(msn messager, 0.5313351)
(hotmail web chat, 0.5231608)
(messenger web version, 0.5013624)
(browser based messenger, 0.3814714)
(im messenger sign in, 0.2997275)
(msn web browser download, 0.0926431)
(install msn toolbar, 0.0027248)
...

Figure 2: Extract of the query click field for the site http://webmessenger.msn.com [9].
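The following Python sketch (our own illustration) computes Score(d, q) from raw click counts per Eq. (1) and the query click field term frequency per Eq. (2); the beta value is a placeholder, since the paper leaves it as a tunable scaling factor.

    def click_score(clicks, last_clicks, impressions, beta=0.5):
        """Score(d, q) from Eq. (1); beta = 0.5 is an arbitrary placeholder."""
        return (clicks + beta * last_clicks) / impressions

    def click_field_tf(term, query_score_pairs):
        """Term frequency of `term` in the query click field, per Eq. (2):
        the sum of Score(d, q) over query-score pairs whose query contains the term."""
        return sum(score for query, score in query_score_pairs if term in query.split())

    # Using the first three entries of Figure 2:
    pairs = [("msn web", 0.6675749), ("webmessenger", 0.6621253), ("msn online", 0.6403270)]
    tf_msn = click_field_tf("msn", pairs)  # 0.6675749 + 0.6403270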
4 BM25

In this section, we briefly review previous work on BM25, but refer the reader to [16, 20] for a complete description.

BM25 [16, 20] stems from the 2-Poisson probabilistic model of information retrieval; the task is to answer "What is the probability that document d is relevant to query q?". The document d is constrained to the document description, which may be over one or several fields, as described in Section 3. The classic retrieval function BM25 is a function of several field attributes: term frequencies, document frequencies, and the field length. Although the document description could be over several fields, BM25 traditionally has considered a document description restricted to a single field, or at most two fields, body and title. However, additional fields can provide different signals of relevance and help improve ranking accuracy. BM25F [17] is an extension of the BM25 function to a document description over multiple fields. A key property of this function is that it is nonlinear. Since BM25F reduces to BM25 when calculated over a single field, we will refer to both functions as BM25F, where F is a specification of the fields contained in the document description.

BM25F is computed as follows for document d, with a document description over fields F, and query q:

    S = Σ_{t ∈ q} TF_t · I_t.    (3)

The sum is over all terms t in query q. I_t is the Robertson-Sparck-Jones form of inverse document frequency of term t and is calculated as

    I_t = log [(N − df + 0.5) / (df + 0.5)],    (4)

where N is the number of documents in the collection and df is the document frequency of term t. Note that the document frequency is calculated across the entire document description. In our experiments, for simplicity, we calculate document frequency over the body field for all document frequency attributes. (We also used the whole document description, but found little difference in accuracy over using only the body field.)

TF_t is a simple term frequency saturation formula that limits the impact of observing a term multiple times in a field. It is defined as

    TF_t = f / (k + f),    (5)

where f is calculated as

    f = Σ_F (w_F · tf_F) / β_F.    (6)

tf_F is the term frequency attribute of term t in field F, k is the saturation parameter that controls the nonlinearity of TF_t, β_F is a function of field length, defined below, and w_F is a tuned field weight parameter. TF_t satisfies three key properties: (1) when tf_F = 0, then TF_t = 0; (2) the function increases monotonically with tf_F; and (3) it has an asymptotic limit. The parameter k is used to tune the saturation of term frequency. If k = 0, the function reduces to 1, and we score the query-document pair according to the presence of the term across the collection only. If k is large, the function is nearly linear in tf_F. Small values of k are typical, say 1 to 2, meaning that TF_t is highly nonlinear in tf_F; after only a few occurrences of the term, the impact of additional occurrences is minimal.

The 2-Poisson model makes sense only when documents are of equal length, so BM25F includes a component to account for varying field lengths. Two documents relevant to the same topic may be different lengths because of wordiness attributable to either repetition or elaboration. The BM25F formula assumes wordiness is only attributable to repetition. The field length component is defined as

    β_F = (1 − b_F) + b_F (ℓ_F / avgℓ_F),    (7)

where b_F is the length tuning parameter, ℓ_F is the length of the field, and avgℓ_F is the average length of the field in the document collection. b_F is a tuning constant between 0 and 1. If b_F = 1, then simple normalization is used, which is meant to correct for verbosity. If b_F is small, it reduces the effect of normalization.
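Putting Eqs. (3)-(7) together, the following Python sketch scores one document; the data layout is our own, and the parameter values (k, w_F, b_F, average field lengths) must come from tuning as described next.

    import math

    def bm25f_score(query_terms, doc, df, N, k, params):
        """BM25F score (Eqs. 3-7) for one document.

        doc:    {field: {"tf": {term: count}, "length": int}}
        df:     {term: document frequency}
        N:      number of documents in the collection
        params: {field: {"w": field weight, "b": length parameter,
                         "avglen": average field length}}
        """
        score = 0.0
        for t in query_terms:
            f = 0.0
            for field, p in params.items():
                beta = (1.0 - p["b"]) + p["b"] * (doc[field]["length"] / p["avglen"])  # Eq. (7)
                f += p["w"] * doc[field]["tf"].get(t, 0) / beta                        # Eq. (6)
            TF = f / (k + f) if f > 0 else 0.0                                         # Eq. (5)
            I = math.log((N - df.get(t, 0) + 0.5) / (df.get(t, 0) + 0.5))              # Eq. (4)
            score += TF * I                                                            # Eq. (3)
        return score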
The instantiation of BM25F requires setting and tuning its parameters. BM25F requires the tuning of 2K + 1 parameters when calculated across K fields, namely k, b_F, and w_F. Tuning can be done using a simple grid-search technique or by using a gradient-descent method [21]. However, since the parameters should be tuned on a large dataset, tuning can be time intensive and potentially prohibitively slow. In our experiments (see Section 6), we tuned the parameters of BM25F using grid search over 10K queries. We note that the grid search for various field combinations with K > 3 took over 2 weeks to complete.

5 Learning a BM25-style Function

In this section, we describe our simple machine learning ranking model that uses the input attributes of BM25F and the training method of LambdaRank. Our approach is general and may be applied to other retrieval functions. It overcomes the obstacle of parameter tuning and is completely data driven. We begin by reviewing previous work: the target evaluation measure NDCG [11] and the training algorithm LambdaRank [3].

5.1 NDCG

We choose to evaluate using NDCG, which has been shown to be a good measure for relevance of web documents to a query. Normalized Discounted Cumulative Gain (NDCG) [11] is a widely used measure for search metrics. It operates on multilevel relevance labels; we assume in our work that relevance is measured on a 5-level scale. NDCG for a given query q is defined as follows:

    NDCG@L_q = (100 / Z) Σ_{r=1}^{L} (2^{l(r)} − 1) / log(1 + r),    (8)

where l(r) ∈ {0, . . . , 4} is the relevance label of the document at rank position r and L is the truncation level to which NDCG is computed. Z is chosen such that the perfect ranking would result in NDCG@L_q = 100. Mean NDCG@L is the normalized sum over all queries: (1/N) Σ_{q=1}^{N} NDCG@L_q. NDCG is particularly well-suited for Web search applications since it accounts for multilevel relevance labels and the truncation level can be set to model user behavior. In our studies, we consider mean NDCG@1, 3, 10; for brevity, we write NDCG@1, 3, 10. DCG is simply NDCG (Eq. 8) without the 1/Z normalization factor.
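The following Python sketch (our own) computes NDCG@L per Eq. (8) from the relevance labels of a ranked list; the normalization Z is obtained from the ideal ordering of the same labels.

    import math

    def ndcg_at_L(labels, L):
        """NDCG@L per Eq. (8). `labels` holds the 5-level relevance labels
        (0-4) of the documents in ranked order."""
        def dcg(ls):
            return sum((2 ** l - 1) / math.log(1 + r)
                       for r, l in enumerate(ls[:L], start=1))
        ideal = dcg(sorted(labels, reverse=True))  # defines Z
        return 100.0 * dcg(labels) / ideal if ideal > 0 else 0.0

    # A perfect ranking scores 100; swapping the top two documents lowers it:
    print(ndcg_at_L([4, 3, 0], L=3))  # 100.0
    print(ndcg_at_L([3, 4, 0], L=3))  # < 100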
5.2 LambdaRank

In the next two subsections, we review a state-of-the-art ranking algorithm called LambdaRank [3] that optimizes for IR measures. For complete details, we refer the reader to [3].

LambdaRank is both a list-based and a pair-based neural network learning algorithm; it is trained on pairs of documents per query, where documents in a pair have different relevance labels. It is an extension of RankNet [4], another pair-based ranking algorithm whose cost function is a sigmoid followed by a pair-based cross-entropy cost. In most machine learning tasks, a target evaluation measure is used to evaluate the accuracy of the model at test time, and an optimization measure, generally a smooth approximation to the target measure, is used to train the system. Ideally, the optimization measure matches the target measure, but typical IR target costs (e.g., MAP, MRR, mean NDCG) are either flat or non-differentiable everywhere and require sorting by model score, which itself is a non-differentiable operation. Hence, direct optimization of the target measure is quite challenging. LambdaRank [3] leverages the fact that neural net training needs only the gradients of the measure with respect to the model scores, and not the function itself, thus avoiding the problem of direct optimization. The gradients are defined by specifying rules about how swapping two documents, after sorting them by score for a given query, changes the measure. LambdaRank provides a significant speed-up over RankNet as well as a new method for directly optimizing a cost function using λ-gradients. In the next section, we describe the λ-gradient for NDCG, although the gradient definition is general and can work with any target evaluation measure.

5.3 λ-Gradient for Mean NDCG

A LambdaRank gradient, λ_j, is defined to be a smooth approximation to the gradient of a target evaluation measure with respect to the score of the document at rank position j. λ-gradients have a physical interpretation: documents are represented by point masses and λ-gradients are forces on those point masses [3]. For the two documents in a pair in a query, the λ-gradients are equal and opposite, where a positive λ-gradient indicates a push toward the top of the list, and a negative λ-gradient indicates a push toward the bottom of the list. With a suitably defined λ-gradient, the gradient of any target evaluation measure can be smoothly approximated for a given document.

In [3], several alternatives for λ-gradients are given, and the best λ-gradient definition is chosen according to accuracy on validation data. The best λ-gradient found in [3] is the derivative of the RankNet cost [4] scaled by the NDCG@L_q gain from swapping two documents i and j with differing labels for a query q. We drop q below for brevity. The RankNet cost is a pairwise cross-entropy cost applied to the logistic function on the difference of the model scores. Assume document i has score s_i and relevance label l_i, document j has score s_j and relevance label l_j, and o_ij ≡ s_i − s_j is the score difference; then the RankNet cost can be written as

    C_ij ≡ C(o_ij) = −S_ij o_ij + log(1 + e^{S_ij o_ij}),    (9)

where

    S_ij = +1 if l_i > l_j, and S_ij = −1 if l_i < l_j.    (10)

The derivative of the RankNet cost with respect to the score difference is

    ∂C_ij/∂o_ij = ∂C_ij/∂s_i = −S_ij / (1 + e^{S_ij o_ij}).    (11)

The λ-gradient can now be expressed as

    λ_ij ≡ S_ij |ΔNDCG · ∂C_ij/∂o_ij|
         = S_ij N (2^{l_i} − 2^{l_j}) [1/log(1 + r_i) − 1/log(1 + r_j)] [1 / (1 + e^{S_ij o_ij})],    (12)

where N is the reciprocal of the maximum DCG for the query and r_i and r_j are the rank positions of documents i and j, respectively. Note that the sign S_ij depends only on the labels of documents i and j and not on their rank positions. In addition, if l_i > l_j, then document i is more relevant than document j and must move up the ranked list to reduce the cost, so S_ij = 1 and the λ-gradient for document i is positive.

The λ-gradient for a single document is computed by marginalizing over the pairwise λ-gradients,

    λ_i = Σ_{j ∈ P} λ_ij,    (13)

where the sum is over all pairs P for query q which contain document i (see [3] for details).
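A minimal Python sketch of Eqs. (9)-(13) for a single query follows (our own illustration); following the text, the NDCG swap term is taken in magnitude so that the sign of each pairwise gradient comes from S_ij.

    import math
    from collections import defaultdict

    def lambda_gradients(scores, labels, max_dcg):
        """Per-document lambda-gradients for one query (Eqs. 9-13).
        Ranks come from sorting by model score; N = 1/max_dcg in Eq. (12)."""
        order = sorted(range(len(scores)), key=lambda i: -scores[i])
        rank = {i: r for r, i in enumerate(order, start=1)}
        lam = defaultdict(float)
        for i in range(len(scores)):
            for j in range(len(scores)):
                if labels[i] <= labels[j]:
                    continue  # take each pair once, oriented so that S_ij = +1
                o_ij = scores[i] - scores[j]
                swap = abs((2 ** labels[i] - 2 ** labels[j]) *
                           (1 / math.log(1 + rank[i]) - 1 / math.log(1 + rank[j])))
                l = (swap / max_dcg) / (1 + math.exp(o_ij))  # Eq. (12) with S_ij = +1
                lam[i] += l  # push the more relevant document up (Eq. 13) ...
                lam[j] -= l  # ... and the less relevant one down
        return lam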
5.4 LambdaBM25

Retrieval can be treated as a ranking process, where the ranking model ranks documents in order of decreasing relevance to the query q. In the probabilistic IR model BM25, documents are ranked by probability of relevance to the query. However, there are several challenges to using BM25, including the requirement of parameter tuning, the inability to directly optimize for an IR measure, and the restrictions of the underlying probabilistic model. In this section we directly address these challenges by introducing a new machine learning approach to BM25-like retrieval. Our model, called LambdaBM25, is trained using LambdaRank due to its flexibility, ease of training, and state-of-the-art ranking accuracy. It employs the NDCG λ-gradient previously described and learns a function of the BM25F attributes directly from the data collection.

As mentioned previously, tuning BM25 can be prohibitively expensive when the document description is over many fields. With the growing use of anchor text and click information, and potentially other metadata, tuning the parameters of BM25F can be costly. LambdaBM25 does not require parameter tuning, since the function is learned directly from the training collection. In addition, LambdaBM25 can be extended to optimize for several IR measures, since LambdaRank has recently been shown to be empirically optimal for NDCG and other IR measures [8, 25].

A basic assumption behind the BM25 formula is that two documents about the same topic may be different lengths because one is more verbose. However, it may not be true that verbosity is the only reason a document is longer. Verbosity could imply elaboration, or the document may cover multiple topics rather than being merely wordy, in which case it may be appropriate to assign a longer document a larger BM25 score, whereas typically it would be assigned a smaller score. The BM25 formula cannot account for such differences, while LambdaBM25 has the flexibility to learn from the data whether documents tend to be verbose or elaborative. In addition, wordiness may be common among some fields and rare among others. For example, a title or URL field is succinct, while anchor text and query click fields are verbose due to repetition. Our method learns these differences through neural net training and can apply different functions to the fields in the document description.

Our model has the additional advantage that it does not require the attributes to be statistically independent, as in [20]. Through LambdaRank training, LambdaBM25 learns relationships among the attributes and the fields of the data collection that may not be apparent otherwise. Our machine learning approach to an improved BM25-style function is trained over a very large data collection so that our model is effective, robust, and avoids overfitting.

We recognize that in learning our model directly from a large data collection, we lose the probabilistic interpretation inherent to BM25. However, our model is in turn very flexible, and can be extended to include other fields in the document description as new fields become available.

We develop our model as follows. We optimize for NDCG and use the λ-gradient as previously described. We train our model using LambdaRank and the same input attributes as BM25, namely term frequency, document frequency, and field length, for each field included in the document description. Although we could include additional attributes, we would like to maintain a fair comparison to the BM25 retrieval function because it is so widely used. We train single- and two-layer LambdaRank neural nets with varying numbers of hidden nodes. Since neural network learning improves when the data is normalized, we apply several transformations to the input attributes to achieve zero mean and unit variance across the feature values. Results are discussed in Section 6.
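As a sketch of the normalization step just described (the log transform is our assumption, a common choice for heavy-tailed count attributes; the paper does not specify the exact transformations):

    import numpy as np

    def normalize_attributes(X):
        """Standardize raw BM25 attributes (term frequencies, document
        frequencies, field lengths) to zero mean and unit variance.
        X is a (num_examples, num_features) array."""
        X = np.log1p(X)                       # compress heavy-tailed counts (assumed)
        mean, std = X.mean(axis=0), X.std(axis=0)
        std[std == 0] = 1.0                   # guard constant features
        return (X - mean) / std, (mean, std)  # keep stats to reuse at test time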
6 Experiments

We perform extensive experiments to determine the effectiveness of BM25 on single fields and multiple-field combinations and to determine the most important fields in a document. We then compare our method, LambdaBM25, to BM25 and evaluate the techniques on a very large train and test collection. Our goal is to have our nonlinear (two-layer) LambdaBM25 model demonstrate improved accuracy over BM25.

6.1 The Data and Evaluation Measure

We evaluate our method on a real-world Web-scale data collection. The data contains queries sampled from query log files of a commercial search engine and corresponding URLs. All queries are English queries and can contain up to 10 query terms. Our data collection includes anchor text, title, URL, body, and query click fields. We perform stopword removal and some stemming on queries. Field information is preprocessed as previously described.

Our train/validation/test data contains 67683/11911/12185 queries, respectively. Each query is associated with on average 150–200 documents (URLs), together with a vector of feature attributes extracted for the query-URL pair. The features consist of the term frequencies for terms in positions 1–10, the document frequencies for terms in positions 1–10, and field lengths for all fields under consideration. Each query-URL pair also has a relevance label. The label is human generated and is on a 5-level relevance scale, 0 to 4, with 4 meaning document d is the most relevant to query q and 0 meaning d is not relevant to q.

We evaluate model performance using mean NDCG. We report NDCG scores at truncation levels 1, 3, and 10. We also perform a significance test, i.e., a t-test with a significance level of 0.05. A significant difference should be read as significant at the 95% level. Statistical significance between pairs of models is indicated in bold.

6.2 Effectiveness of Single Fields

We first seek to determine which single field is the most effective in terms of ranking relevant documents using BM25F. The parameters of BM25F, where here F is the single field in the document description, are tuned to optimize NDCG@1 on our validation set using a 2-D grid search over the saturation parameter k_F and the length normalization parameter b_F, for each field F. We follow the grid search method outlined in [21], except we consider 1000 epochs or convergence of NDCG@1 as the stopping criterion. It was prohibitively slow to tune the parameters on the training set due to its size. Table 1 lists the parameters found for each individual field. We also tried an approach similar to the gradient-based approach in [21] and found results to be almost identical.

Table 1: Parameters learned using grid search on the validation set for single-field BM25F.

    Field F        k_F      b_F    avgℓ_F
    Title (T)    3.000    0.400         7
    URL (U)      1.135    0.331         6
    Body (B)     1.000    0.500      1815
    Anchor (A)   0.910    0.008       167
    Click (C)  101.540    0.504        39
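For reference, the tuning procedure amounts to the following Python sketch (our own; the grids and the evaluate_ndcg1 helper, which is assumed to rank the validation queries under the given parameters and return mean NDCG@1, are illustrative):

    import numpy as np

    def grid_search_single_field(validation_queries, evaluate_ndcg1):
        """2-D grid search over the saturation parameter k_F and the length
        normalization parameter b_F, maximizing NDCG@1 on the validation set."""
        best = (None, None, -1.0)
        for k in np.arange(0.25, 5.01, 0.25):   # illustrative grid, not the paper's
            for b in np.arange(0.0, 1.01, 0.05):
                ndcg1 = evaluate_ndcg1(k, b, validation_queries)
                if ndcg1 > best[2]:
                    best = (k, b, ndcg1)
        return best  # (k_F, b_F, validation NDCG@1)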
In Table 2, we report results for BM25F on a document description restricted to a single content or popularity field. The three content fields, Title (T), URL (U), and Body (B), are roughly equally effective in terms of NDCG ranking accuracy on our test set. At truncation level 10, the body field yields significantly better ranking accuracy. The URL field appears to be the least reliable for retrieval in terms of accuracy across the three truncation levels. For popularity fields, retrieval using only the anchor text field (A) yields improved NDCG scores over retrieval using a single content field. However, BM25C over the query click field yields almost a 7 point NDCG gain at truncation level 1 and a 4 point NDCG gain at truncation level 3 over BM25A. Certainly, if restricted to a single field, the query click field achieves the highest NDCG accuracy.

Table 2: Accuracy results on the test set for BM25F for single fields.

    Model    NDCG@1    NDCG@3    NDCG@10
    BM25T     24.50     27.23      33.32
    BM25U     24.96     27.24      32.77
    BM25B     24.35     27.92      35.07
    BM25A     33.50     32.53      33.37
    BM25C     40.07     36.62      35.89

We next seek to compare BM25F to single-layer LambdaBM25F on single-field document descriptions. Since BM25F is a highly nonlinear function, we expect it to outperform a simple linear combination of input attributes, in particular for the content fields, for which BM25F was originally developed. Our linear model cannot, for example, divide term frequency by document frequency or field length; these two operations have been shown to give improved retrieval accuracy [20]. We train single-layer LambdaBM25F models by choosing the best training epoch and learning rate based on the validation data. We found a learning rate of 10^-5 and 500 epochs to be reasonable settings for all fields.

Table 3: Accuracy results on the test set for 1-layer LambdaBM25F for single fields. Bold indicates statistical significance over the corresponding BM25F model. Italic indicates statistical significance of the corresponding BM25F model over the LambdaBM25F model. Parentheses indicate no statistically significant difference.

    Model          NDCG@1    NDCG@3    NDCG@10
    LambdaBM25T     20.79     24.93      32.51
    LambdaBM25U     22.96     26.38      33.17
    LambdaBM25B     18.03     21.93      30.60
    LambdaBM25A    (33.83)    33.11      34.73
    LambdaBM25C     39.34    (36.50)    (35.96)

Table 3 contains results for single-layer LambdaBM25F. Our results indicate that for each content field, BM25F significantly outperforms our learned linear function at each truncation level, with the exception of LambdaBM25U, which performs similarly to BM25U at truncation level 10. For content fields, we conclude that BM25F is significantly better than a linear combination of input attributes. We anticipated such a result, since BM25F was explicitly designed for improved accuracy over a linear term frequency function when using content fields.

In the case of popularity fields, the results indicate that our single-layer LambdaBM25F model performs similarly to or better than BM25F. For the anchor text field, we find that BM25A performs significantly worse at truncation levels 3 and 10 than our learned linear function LambdaBM25A. Similarly, for the query click field, we find that BM25C performs comparably to our learned linear function LambdaBM25C. Such results were hypothesized in [21], and since popularity fields draw content from authors other than the document's owner, it seems reasonable that the BM25 function, which was built for content fields, may not model the data much better than a linear function of input attributes.
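The significance testing reported in Tables 3, 5, 8, and 9 can be reproduced with a sketch like the following (pairing the two models' NDCG values by query is our assumption about how the t-test was applied):

    from scipy import stats

    def significantly_better(ndcg_a, ndcg_b, alpha=0.05):
        """Paired t-test on per-query NDCG values of models A and B at a
        given truncation level; significance level 0.05 as in Section 6.1."""
        t, p = stats.ttest_rel(ndcg_a, ndcg_b)
        return t > 0 and p < alpha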
Finally, we seek to determine if our nonlinear LambdaBM25F model can outperform BM25F. We train two-layer neural nets with 5, 10, and 15 hidden nodes, for various learning rates, and choose the best net according to the validation set. We found a learning rate of 10^-5 and 500 epochs to consistently perform well. Table 4 lists the number of hidden nodes used for each single-field nonlinear model.

Table 4: Number of hidden nodes found on the validation data for single-field, two-layer LambdaBM25F.

    Field F    Hidden nodes
    T                    10
    U                    15
    B                    15
    A                    15
    C                     5

Table 5: Accuracy results on the test set for 2-layer LambdaBM25F for single fields. Bold indicates statistical significance over the corresponding BM25F model. Italic indicates statistical significance of the corresponding BM25F model over the LambdaBM25F model. Parentheses indicate no statistically significant difference.

    Model          NDCG@1    NDCG@3    NDCG@10
    LambdaBM25T    (24.31)   (27.38)     33.86
    LambdaBM25U     23.69     26.70      33.21
    LambdaBM25B     27.53     30.49      37.03
    LambdaBM25A     36.33     34.68      35.33
    LambdaBM25C     41.61     38.01      37.19

Table 5 reports the results of BM25F versus our learned two-layer LambdaBM25F model. For the Title field, BM25T performs almost identically to LambdaBM25T. For the URL field, BM25U performs slightly better at most truncation levels than LambdaBM25U. We conclude that BM25 models these two content fields very well. However, for the Body field, we find that LambdaBM25B significantly outperforms BM25B across all truncation levels. We hypothesize that BM25F models short, succinct, non-repetitive fields well, but fails to model longer fields with similar accuracy. Both the Title and URL fields are reasonably short, while the Body field on average is around 300 times longer. As the length of the field grows, it becomes beneficial to learn richer relationships between term frequency, document frequency, and field length, which LambdaBM25F is able to do. For popularity fields, we find that two-layer LambdaBM25F consistently outperforms BM25F with statistical significance, which further confirms that the BM25F function was not designed for popularity fields. LambdaBM25F is able to exploit relationships found in the training data that cannot be expressed by the BM25F model.
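To make the training of these nets concrete, the following Python sketch (our own; the tanh activation, initialization, and update layout are assumptions) shows a two-layer scoring net and one LambdaRank step, in which the per-document λ-gradients of Section 5.3 are backpropagated in place of a conventional loss gradient.

    import numpy as np

    def init_net(num_features, hidden=15, seed=0):
        rng = np.random.default_rng(seed)
        return {"W1": rng.normal(0.0, 0.1, (num_features, hidden)),
                "b1": np.zeros(hidden),
                "w2": rng.normal(0.0, 0.1, hidden),
                "b2": 0.0}

    def score(net, X):
        """Forward pass for all documents of a query; X is (num_docs, num_features)."""
        H = np.tanh(X @ net["W1"] + net["b1"])
        return H @ net["w2"] + net["b2"], H

    def lambda_step(net, X, lambdas, lr=1e-5):
        """One update: a positive lambda should raise a document's score, so we
        ascend along the lambda-gradients. lr matches the 10^-5 reported above."""
        _, H = score(net, X)
        dH = np.outer(lambdas, net["w2"]) * (1.0 - H ** 2)  # backprop through tanh
        net["w2"] += lr * (H.T @ lambdas)
        net["b2"] += lr * lambdas.sum()
        net["W1"] += lr * (X.T @ dH)
        net["b1"] += lr * dH.sum(axis=0)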
6.3 Effectiveness of Multiple Fields

For a document description over a single field, BM25F exhibits reasonable accuracy for content fields, while LambdaBM25F exhibits superior accuracy for popularity fields. We have also seen that with query click information alone, we can achieve substantial retrieval accuracy gains. In this section, we perform experiments to examine retrieval effectiveness when the document description contains multiple fields. We find that our learned nonlinear method, LambdaBM25F, outperforms BM25F when F is a document description over multiple fields. We also verify that a nonlinear combination of multiple fields is required for the best retrieval accuracy.

The parameters of BM25F are tuned using a 2K-dimensional grid search, where K is the number of fields in the document description. We consider several combinations of fields; the combinations and their parameters are listed in Table 6. Note that the parameter k can be absorbed into the field weights w_F (see Eqs. 5-6). Thus we assume k = 1 and learn 2K parameters instead of 2K + 1. All field weights, with the exception of the query click field weight w_C, are between 0 and 20. In all field combinations, the body field consistently receives the lowest field weight w_B. When the document description is over all fields (final row of the table), the query click field receives a weight 1000 times larger than that of the anchor text or body fields. We can conclude that the query click field is the most important field in the document description.

Table 6: BM25F parameters learned on the validation set for various field combinations. In all cases, k = 1. A dash indicates the field is not in the combination.

    Fields F           w_T      b_T     w_B     b_B      w_U     b_U     w_A     b_A       w_C     b_C
    T, B            9.8000   0.5406  0.0044   0.420        -       -       -       -         -       -
    T, B, U         0.6454   0.4009  0.0085   0.196   1.6219   0.950       -       -         -       -
    T, B, U, A      2.2415   0.6440  0.0490   0.685   5.4024   0.989  0.3202  0.0083         -       -
    T, B, U, C      1.5811   0.9250  0.0290   0.851   6.7961   0.979       -       -    2.9265   0.390
    T, B, U, A, C   9.5327   0.7990  0.0944   0.430  18.7944   0.890  0.0935  0.1993   93.8318   0.528

We first seek to determine the most effective combination of fields to include in the document description for BM25F. Table 7 lists the results of BM25F on various field combinations. We find that using multiple fields in the document description is superior to using a single field, unless that single field is the query click field; the only combinations of fields to outperform BM25C are those that include the query click field, and such combinations do outperform BM25C alone. Even using the anchor text field in conjunction with all content fields cannot match the accuracy of BM25C. The addition of anchor text to the C, U, T, B combination in fact yields very little improvement in accuracy, without statistical significance. The anchor text field is, however, important when query click information is not available, as we can see from the significant accuracy improvement between the U, T, B and A, U, T, B field combinations.

Table 7: Accuracy results on the test set for BM25F for multiple fields.

    Model    Fields F         NDCG@1    NDCG@3    NDCG@10
    BM25F    T, B              27.84     30.81      36.98
    BM25F    U, T, B           30.81     33.30      39.53
    BM25F    A, U, T, B        38.66     38.83      43.42
    BM25F    C, U, T, B        45.29     43.37      46.83
    BM25F    C, A, U, T, B     45.41     43.53      46.88

We next determine if BM25F is better than a linear function of input attributes. We learn single-layer LambdaBM25F models for each combination of fields listed in Table 6. For each model, we find a learning rate of 10^-5 performs best on our validation data. Table 8 lists the results of our learned linear function LambdaBM25F. In all cases, we find that BM25F performs as well as or better than single-layer LambdaBM25F; our results confirm the motivation for BM25F given in [17], namely that a linear combination of fields is insufficient for good retrieval accuracy.

Table 8: Accuracy results on the test set for 1-layer LambdaBM25F for multiple fields. Bold indicates statistical significance over the corresponding BM25F model. Italic indicates statistical significance of the corresponding BM25F model over the LambdaBM25F model. Parentheses indicate no statistically significant difference.

    Model          Fields F         NDCG@1    NDCG@3    NDCG@10
    LambdaBM25F    T, B              25.42     28.81      35.80
    LambdaBM25F    U, T, B           29.28     32.08      38.75
    LambdaBM25F    A, U, T, B       (38.91)   (38.84)     42.81
    LambdaBM25F    C, U, T, B        43.34     41.70      45.04
    LambdaBM25F    C, A, U, T, B     44.60     42.33      45.44

Finally, we seek to determine if our two-layer LambdaBM25F model learns a better BM25-style retrieval function than BM25F. We train two-layer LambdaBM25F models on the field combinations listed in Table 6. We find that a learning rate of 10^-5 and 15 hidden nodes perform well for all field combinations on our validation data. Table 9 reports results of BM25F versus two-layer LambdaBM25F for various field combinations.

Table 9: Accuracy results on the test set for 2-layer LambdaBM25F for multiple fields. Bold indicates statistical significance over the corresponding BM25F model.

    Model          Fields F         NDCG@1    NDCG@3    NDCG@10
    LambdaBM25F    T, B              29.61     32.49      38.93
    LambdaBM25F    U, T, B           34.26     37.03      43.05
    LambdaBM25F    A, U, T, B        43.70     42.58      46.21
    LambdaBM25F    C, U, T, B        49.70     46.58      49.14
    LambdaBM25F    C, A, U, T, B     50.33     47.14      49.47
For every field combination, LambdaBM25F achieves gains with statistical significance over the corresponding BM25F model. As expected, we see smaller gains between LambdaBM25T,B and BM25T,B, since BM25F models the title and body fields very well. For combinations that include anchor text and query click fields, on the other hand, we see very substantial gains over BM25F of around 5 points NDCG@1 and 3 points NDCG@10. Note that even a 0.5 point NDCG gain is substantial, in particular at truncation level 1.

We would like to highlight that for both BM25F and two-layer LambdaBM25F models, the gains achieved when new fields are added to the document description are consistent. In Tables 7 and 9, the inclusion of the query click field in the document description yields the highest accuracy. In addition, smaller gains are achieved by adding the anchor text field to the document description. Ordered by accuracy, the multiple-field combinations rank in the same order for BM25F and LambdaBM25F.

7 Conclusions and Future Work

We have extensively studied the contributions of various document fields to information retrieval accuracy. We find that query click information is the most effective field, while the URL field is the least effective field. A document description containing all fields yields the best retrieval accuracy. We also study when BM25 outperforms a linear combination of input attributes. BM25 performs remarkably well on single content fields, but on single popularity fields, BM25 achieves a retrieval accuracy comparable to a linear function of input attributes. For document descriptions containing multiple fields, we verify that a nonlinear combination of field attributes attains significantly better accuracy than a linear combination of field attributes.

Our main contribution is a new information retrieval model trained using LambdaRank and the input attributes of BM25. LambdaBM25F significantly improves retrieval effectiveness over BM25F for most single-field document descriptions, in particular popularity fields, and for all multiple-field document descriptions. LambdaBM25F optimizes directly for the chosen target IR evaluation measure and avoids the necessity of parameter tuning, yielding a significantly faster approach. Our model is general and can potentially act as a framework for modeling other retrieval functions.

There are several future directions for this work. First, we would like to perform more extensive studies to determine the importance of attributes in our model. Since LambdaBM25 is a neural network, it is difficult to determine the actual relationships learned between attributes. However, by using a decision tree learner, such as LambdaMART [23], we can decipher the trees to determine the ranked list of important features in our model. Currently, our preliminary results using LambdaMART to learn a BM25-style function indicate that term frequency attributes are significantly more important to the model than document frequency attributes. The most important features are the term frequencies of the first two terms of the query in the query click field and the title field. In addition, the field lengths of the body field and the query click field are the most important field length attributes. Second, we would like to determine the effectiveness of LambdaBM25 as a scoring function, where the scores can be used as inputs to a more complex ranking system. For example, LambdaBM25 could be used as a single feature in recent TREC retrieval systems [6, 5].
Finally, we plan to expand our model to learn proximity relationships. Recent work on incorporating proximity information into BM25 has focused on bigram frequencies [15] or frequencies of terms in spans [19]. In both cases, it has been unclear how to combine n-gram document frequency information with n-gram term frequency information. In addition, a challenge has been how to extend BM25 to account for relationships between a query term appearing as a unigram or with another query term as a bigram. We plan to examine the effect of bigram and trigram frequency attributes on our model and determine if incorporating such features can learn a better function than, for example, the proximity BM25 models given in [15, 19]. In the presence of proximity field attributes, we expect different field combinations to yield the highest retrieval accuracy. LambdaBM25 has the advantage of learning the relationships directly from the training collection and requires no tuning of the function. With our approach, we can learn the dependencies between document and term frequency directly.

8 Acknowledgments

We would like to thank Susan Dumais for providing many useful discussions and insights. We also thank Bhuvan Middha for invaluable discussions regarding the parameter tuning of BM25 and BM25F.

References

[1] E. Agichtein, E. Brill, and S. Dumais. Improving web search ranking by incorporating user behavior information. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 19–26, 2006.

[2] S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Seventh International Conference on the World Wide Web (WWW), pages 107–117, 1998.

[3] C.J.C. Burges, R. Ragno, and Q.V. Le. Learning to rank with nonsmooth cost functions. In Advances in Neural Information Processing Systems (NIPS), 2006. See also MSR Technical Report MSR-TR-2006-60.

[4] C.J.C. Burges, T. Shaked, E. Renshaw, A. Lazier, M. Deeds, N. Hamilton, and G. Hullender. Learning to rank using gradient descent. In International Conference on Machine Learning (ICML), Bonn, Germany, 2005.

[5] N. Craswell and D. Hawking. Overview of the TREC 2004 web track. In Proceedings of TREC 2004, 2004.

[6] N. Craswell, D. Hawking, R. Wilkinson, and M. Wu. Overview of the TREC 2003 web track. In Proceedings of TREC 2003, 2003.

[7] N. Craswell and M. Szummer. Random walk on the click graph. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2007.

[8] P. Donmez, K. Svore, and C. Burges. On the local optimality of LambdaRank. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009.

[9] J. Gao, W. Yuan, X. Li, K. Deng, and J-Y. Nie. Smoothing clickthrough data for web search ranking. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009.

[10] B. He and I. Ounis. On setting the hyper-parameters of term frequency normalization for information retrieval. ACM Transactions on Information Systems (TOIS), 25(3):13, 2007.

[11] K. Jarvelin and J. Kekalainen. IR evaluation methods for retrieving highly relevant documents. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 41–48, 2000.

[12] T. Joachims. Optimizing search engines using clickthrough data. In SIGKDD, pages 133–142, 2002.

[13] D. Metzler. Generalized inverse document frequency. In ACM Conference on Information Knowledge Management (CIKM), 2008.

[14] P. Ogilvie and J. Callan. Combining document representations for known item search. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2003.
[15] Y. Rasolofo and J. Savoy. Term proximity scoring for keyword-based retrieval systems. In Proceedings of the 25th European Conference on IR Research (ECIR), 2003.

[16] S. Robertson and S. Walker. Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 345–354, 1994.

[17] S. Robertson, H. Zaragoza, and M. Taylor. Simple BM25 extension to multiple weighted fields. In ACM Conference on Information Knowledge Management (CIKM), pages 42–49, 2004.

[18] A. Singhal, C. Buckley, and M. Mitra. Pivoted document length normalization. In ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), pages 21–29, 1996.

[19] R. Song, M. Taylor, J-R. Wen, H-W. Hon, and Y. Yu. Viewing term proximity from a different perspective. Advances in Information Retrieval, Lecture Notes in Computer Science, 4956/2008:346–357, 2008.

[20] K. Sparck-Jones, S. Walker, and S. Robertson. A probabilistic model of information retrieval: development and comparative experiments. Information Processing and Management, 36:809–840, 2000.

[21] M. Taylor, H. Zaragoza, N. Craswell, S. Robertson, and C. Burges. Optimisation methods for ranking functions with multiple parameters. In ACM Conference on Information Knowledge Management (CIKM), 2006.

[22] R. Wilkinson. Effective retrieval of structured documents. In Research and Development in Information Retrieval, pages 311–317, 1994.

[23] Q. Wu, C.J.C. Burges, K.M. Svore, and J. Gao. Ranking, boosting and model adaptation. Microsoft Technical Report MSR-TR-2008-109, 2008.

[24] G. Xue, H-J. Zeng, Z. Chen, Y. Yu, W-Y. Ma, W. Xi, and W. Fan. Optimizing web search using web click-through information. In ACM Conference on Information Knowledge Management (CIKM), 2004.

[25] Y. Yue and C.J.C. Burges. On using simultaneous perturbation stochastic approximation for IR measures, and the empirical optimality of LambdaRank. NIPS Machine Learning for Web Search Workshop, 2007.