Fast, Piecewise Training for Discriminative Finite-state and Parsing Models

Charles Sutton and Andrew McCallum
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003 USA
{casutton,mccallum}@cs.umass.edu

Abstract

Discriminative models for sequences and trees, such as linear-chain conditional random fields (CRFs) and max-margin parsing, have shown great promise because they combine the ability to incorporate arbitrary input features with the benefits of principled global inference over their structured outputs. However, since parameter estimation in these models involves repeatedly performing this global inference, training can be very slow. We present piecewise training, a new training method that combines the speed of local training with the accuracy of global training by incorporating a limited amount of global information derived from previous errors of the model. On named-entity and part-of-speech data, we show that our new method not only trains in less than one-fifth the time of a CRF and yields improved accuracy over the MEMM, but surprisingly also provides a statistically-significant gain in accuracy over the CRF. We also present preliminary results showing a potential application to efficient training of discriminative parsers.

1 Introduction

Conditionally-trained models have enjoyed popularity for a wide variety of tasks in NLP, including document classification (Taskar et al., 2002; Nigam et al., 1999), part-of-speech tagging (Ratnaparkhi, 1996; Toutanova et al., 2003), chunking (Sha and Pereira, 2003), named-entity recognition (Florian et al., 2004), and information extraction (McCallum et al., 2000; Pinto et al., 2003). Their popularity stems from the flexibility they afford in designing rich features to best suit particular tasks. Of special recent interest has been discriminative parsing (Taskar et al., 2004; Clark and Curran, 2004; Collins and Roark, 2004). Discriminative parsing models are interesting because the rich feature sets they afford can allow higher accuracy without explicitly needing the intricate smoothing required by state-of-the-art generative parsing models (Charniak, 2000; Collins, 2000).

These benefits come at a cost, however. Although conditional training for unstructured models is relatively efficient (it is simply the maximum-entropy classifier), if the model predicts richly structured outputs, such as sequences or parse trees, then discriminative training can become extremely expensive. This is because learning with structured outputs requires repeated inference over the training set: to compute model expectations in CRF-style training, or to find a high-probability configuration in perceptron training. Discriminative parsers, for instance, can be fantastically expensive to train, because they require repeatedly parsing the many sentences in the training set. For example, Clark and Curran (2004) report training their model on 45 machines in parallel, and a max-margin parsing model can take over 3 months to train on the entire Penn treebank (Ben Taskar, personal communication). Even in simpler models, such as for sequences, CRF training can be slow if there are many states. Training a CRF part-of-speech tagger on sections 2-21 of the Penn treebank, for example, can take weeks because of the expense of running forward-backward with 45 part-of-speech tags.
Conditional training methods like CRF and perceptron training can be expensive because they are global; that is, they require repeatedly performing inference over entire structured training examples. An alternative is to define a model that is locally normalized, for example, in terms of distributions over successor states or over individual expansions of a nonterminal. Locally-normalized models correspond to learning probabilistic classifiers for individual decisions in the model. Training can thus be much faster, because it does not require global inference. Unfortunately, globally-trained models usually perform better. One reason for this is that when an earlier classifier makes a mistake, a later classifier often has local evidence against it. But even if global inference is performed at test time, because training is local, the later classifiers never learn to vote against earlier mistakes.

These phenomena appear even in models as simple as a linear chain. A CRF is a globally-normalized sequence model, in which parameter estimation involves repeatedly running forward-backward over the training set, which can be expensive if the state space is large. Alternatively, maximum-entropy Markov models (MEMMs) are locally-normalized conditional sequence models, in which the parameter estimation for each next-state classifier can be performed separately, and forward-backward is not required. But despite their fast training, MEMMs are known to suffer from several pathologies, such as label bias (Lafferty et al., 2001) and observation bias (Klein and Manning, 2002). For these reasons, across a broad range of NLP tasks, MEMMs consistently perform worse than CRFs.

We propose a new piecewise training procedure for locally-normalized models that allows limited interaction between the local classifiers. This combines the efficiency of MEMM-style training with the accuracy of CRF-style training. The basic idea is to train the next-state classifier to recognize and vote against errors in the previous state. We do this by augmenting the local training sets of an MEMM with noisy instances in which the source state is actually in error; for these new instances, instead of predicting a next state, the classifier is trained to predict a new "none-of-the-above" (NOTA) label, thereby learning to vote against the previous incorrect decision.

In two different NLP tasks, we show that piecewise training not only is faster than CRF training but also, surprisingly, has higher accuracy. We hypothesize that the CRF's ability to trade off weights across sequence positions gives it greater capacity to overfit. Although we explore piecewise training in the case of linear models, it can be applied to general probability distributions, as long as they can be divided into locally-normalized pieces. To demonstrate this, we also present early, preliminary results applying piecewise training to a discriminative parsing model. With piecewise training, we show a small improvement over an MEMM-style discriminative parsing model.1 Unlike other discriminative parsing approaches, our piecewise procedure requires parsing the training set only once.

1 In fact, the MEMM-style parsing model, which we term an MECFG, seems unexplored in the literature, and may be interesting in its own right.

2 Conditional Linear-Chain Models

In this section, we review two standard models for sequence labeling, the maximum-entropy Markov model (MEMM) (McCallum et al., 2000) and the conditional random field (CRF) (Lafferty et al., 2001).
Both CRFs and MEMMs are methods for conditionally training weighted finite-state transducers, so that given a sentence x, both assign a probability p(y|x) to a sequence of labels y. Both models are defined in terms of a set of features {f_k(y_t, y_{t-1}, x, t)} on state transitions, each of which has an associated real-valued weight λ_k.

MEMMs learn a next-state classifier for each state in the FSM. In an MEMM, each next-state classifier is trained by maximum entropy, so that the model is given by

p(y_t \mid y_{t-1}, \mathbf{x}) = \frac{\exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \right)}{Z(x_t, y_{t-1})},   (1)

where each next-state distribution is locally normalized by

Z(x_t, y_{t-1}) = \sum_{y'} \exp\left( \sum_k \lambda_k f_k(y', y_{t-1}, x_t) \right).   (2)

Several pathologies have been observed with MEMMs. One such pathology is label bias (Bottou, 1991; Lafferty et al., 2001), in which the MEMM tends to favor states with low-entropy next-state distributions. In the extreme, states with only one successor ignore their observations completely. Now, if the Markov assumptions of the MEMM were accurate, ignoring the observation would be correct; label bias arises because a later observation influences an earlier transition. In fact, in carefully designed experiments with synthetic data, it has been reported (Lafferty et al., 2001) that CRFs are more robust to violations of their independence assumptions than MEMMs. Also, Klein and Manning (2002) report the phenomenon of observation bias, in which, in some data sets, instead of observations being incorrectly ignored, state transitions are not given enough weight, even when they are highly predictive. This phenomenon has been reported in part-of-speech tagging, where the observations are known to provide much more information than the state transitions, so it is not clear to what extent it occurs across tasks. Because of these pathologies, MEMMs have been shown to perform worse than CRFs across NLP data sets, including noun-phrase chunking (Sha and Pereira, 2003), part-of-speech tagging, and named-entity recognition (Section 4).

The CRF remedies these limitations of the MEMM. A linear-chain CRF defines a global distribution over label sequences y as

p(\mathbf{y} \mid \mathbf{x}) = \frac{\prod_t \exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \right)}{Z(\mathbf{x})},   (3)

where Z(x) is a normalization constant given by

Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_t \exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \right).   (4)

The only difference between the CRF model and the MEMM model, then, is that a CRF is globally normalized over all possible label sequences, while in an MEMM, each transition distribution is locally normalized.

Conditional random fields are usually trained by maximum likelihood. The partial derivative of the likelihood with respect to a weight λ_k is

\frac{\partial L}{\partial \lambda_k} = \sum_t f_k(y_t, y_{t-1}, x_t) - \sum_t \sum_{y, y'} f_k(y, y', x_t)\, p(y, y' \mid \mathbf{x}).   (5)

Computing the gradient thus requires computing the pairwise marginals p(y_t, y_{t-1} | x), which can be obtained by the forward-backward algorithm. This explains the difference in training time between CRFs and MEMMs. Computing the MEMM gradient requires local computations that are linear in the number of FSM states. The CRF gradient, on the other hand, requires running forward-backward, which is quadratic in the number of states. And forward-backward must be called once for each training instance at each iteration of a numerical optimization procedure. This means that CRF training as a whole can require hundreds of thousands of calls to forward-backward for large NLP data sets.
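To make the cost contrast concrete, the following sketch (our own illustration, not the authors' implementation; the function name and array layout are assumptions) computes the pairwise marginals p(y_t, y_{t-1} | x) needed for the CRF gradient above by a single forward-backward pass over a table of log potentials. This O(TS^2) pass must be repeated for every training sentence at every optimization iteration, which is exactly what makes CRF training slow when the number of states S is large.

import numpy as np
from scipy.special import logsumexp

def pairwise_marginals(log_unary0, log_trans):
    """Transition marginals p(y_t, y_{t+1} | x) for a linear-chain CRF.

    log_unary0 : (S,)        log potential of each state at position 0
    log_trans  : (T-1, S, S) log_trans[t, i, j] = sum_k lambda_k f_k(j, i, x_{t+1}),
                             the log potential for state i at position t followed
                             by state j at position t+1
    Returns the (T-1, S, S) array of transition marginals and log Z(x).
    """
    n_trans, n_states, _ = log_trans.shape
    alpha = np.empty((n_trans + 1, n_states))   # forward log-messages
    beta = np.zeros((n_trans + 1, n_states))    # backward log-messages
    alpha[0] = log_unary0
    for t in range(n_trans):
        alpha[t + 1] = logsumexp(alpha[t][:, None] + log_trans[t], axis=0)
    for t in range(n_trans - 1, -1, -1):
        beta[t] = logsumexp(log_trans[t] + beta[t + 1][None, :], axis=1)
    log_Z = logsumexp(alpha[-1])
    # p(y_t = i, y_{t+1} = j | x) for each transition t; each slice sums to 1.
    marginals = np.exp(alpha[:-1, :, None] + log_trans
                       + beta[1:, None, :] - log_Z)
    return marginals, log_Z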
3 Piecewise Training with Limited Interactions

In this section, we introduce piecewise training, a method for local training that avoids the pathologies of MEMMs. For concreteness, we present our method for the case of linear chains, although it is more general.

In MEMMs, training by conditioning only on the true values of the previous state can be problematic. When global inference at test time estimates high probability for an incorrect state at the previous time step, the next-state classifiers are evaluated on inputs they may never have seen at training time, resulting in unpredictable scores. We propose avoiding this problem by augmenting the objective function of the MEMM. For each local classifier we introduce an additional value termed "none of the above," or NOTA (e.g., an additional, imaginary part-of-speech label). Traditional MEMM training would create local training sets for each source state, assigning training data to a particular next-state classifier using the true labels in the training data. By contrast, we also assign training data to a source state even when the true source state from the original training sequence does not match; the correct predicted value in this case is NOTA.

Inference at test time is performed with standard MEMM local inference, with the NOTA state and all parameters associated with NOTA outcomes removed. Accordingly, training is performed such that all parameters associated with NOTA values are constrained to be zero. Thus the only way for NOTA to be correctly predicted is by reducing the strength of the parameters associated with other outcomes, given the current observations. When NOTA receives high probability, the remaining next-state distribution (once NOTA is removed) is nearly uniform, which in an MEMM is the next-state distribution that most equally "votes against" all possible outcomes.

Formally, the objective function used in a piecewise-trained linear-chain model with parameters Λ and training data D = {⟨x^(i), y^(i)⟩} is

O_{PT}(\Lambda, D) = \log \prod_{i}^{|D|} \prod_t \tilde{p}_\Lambda\!\left(y_t^{(i)} \mid y_{t-1}^{(i)}, x_t^{(i)}\right) \prod_{y \neq y_{t-1}^{(i)}} \tilde{p}_\Lambda\!\left(\mathrm{NOTA} \mid y, x_t^{(i)}\right).   (6)

An interesting insight comes from expanding this to show the normalization function and the product of potential functions. If we define a potential function as

\phi(y_t, y_{t-1}, x_t) = \exp\left( \sum_k \lambda_k f_k(y_t, y_{t-1}, x_t) \right),   (7)

then we see

O_{PT}(\Lambda, D) = \log \prod_{i}^{|D|} \prod_t \frac{\phi\!\left(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\right)}{1 + \sum_y \phi\!\left(y, y_{t-1}^{(i)}, x_t^{(i)}\right)} \prod_{y' \neq y_{t-1}^{(i)}} \frac{1}{1 + \sum_y \phi\!\left(y, y', x_t^{(i)}\right)}

                = \log \prod_{i}^{|D|} \prod_t \frac{\phi\!\left(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}\right)}{\prod_{y'} \left( 1 + \sum_y \phi\!\left(y, y', x_t^{(i)}\right) \right)},

where the sums over y do not include NOTA (those terms are captured by the added 1's). This corresponds to approximating the normalization function Z(Λ, x) with

Z_{PT}(\Lambda, \mathbf{x}^{(i)}) = \prod_t \prod_{y'} \left( 1 + \sum_y \phi\!\left(y, y', x_t^{(i)}\right) \right).

Tom Minka (personal communication) has pointed out that this seems to be a novel approximation to the partition function, and that a somewhat similar approximation is produced at the very beginning of belief propagation, when all messages are equal to 1 and the partition function is estimated by

Z_{BP}(\Lambda, \mathbf{x}^{(i)}) \propto \prod_t \sum_{y'} \sum_y \phi\!\left(y, y', x_t^{(i)}\right).

The training data for the NOTA outcome, y ≠ y_{t-1}^{(i)}, may be exhaustive, or randomly sampled, or chosen to include only those cases in which y incorrectly had high marginal probability under joint inference with a previous parameter setting. A method similar to this last option has been previously used to successfully incorporate not all but some of the most important "unsupported features" in linear-chain CRFs (McCallum, personal communication).
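As a concrete illustration of how the local training sets are assembled, here is a small sketch (our own code, not the authors' implementation; the data structures are assumptions). It builds one training set per next-state classifier, using the exhaustive variant in which every incorrect source state at every position contributes a NOTA instance:

NOTA = "<NOTA>"  # the additional, imaginary "none-of-the-above" outcome

def build_piecewise_training_sets(sequences, label_set):
    """Build the local training set for each next-state classifier.

    sequences : list of (x, y) pairs, where x is the list of observations
                and y the list of true labels for one training sentence
    label_set : the set of possible labels (source states)
    Returns a dict mapping each source state to a list of
    (observation, target label) instances.  A plain MEMM would add only
    the first kind of instance; piecewise training also adds a NOTA
    instance for every source state that is not the true previous label.
    """
    local_sets = {source: [] for source in label_set}
    for x, y in sequences:
        for t in range(1, len(y)):
            # Standard MEMM instance: true previous state predicts true next state.
            local_sets[y[t - 1]].append((x[t], y[t]))
            # Piecewise instances: every other (incorrect) source state is
            # trained to predict NOTA here, i.e. to vote against itself.
            for source in label_set:
                if source != y[t - 1]:
                    local_sets[source].append((x[t], NOTA))
    return local_sets

During maximum-entropy training of each local classifier, the parameters tied to NOTA would be constrained to zero, as described above; at test time NOTA is dropped and inference proceeds as in a standard MEMM.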
Furthermore, we can calibrate the magnitudes of the parameters Λ_s across the subsets s by learning a per-subset multiplicative factor α_s, so that the parameters used for subset s are α_s Λ_s. Although this factor is learned via traditional global inference, its impact on training time is limited, because it has such low dimensionality that optimization typically requires only a few gradient steps. Essentially, NOTA outcomes allow limited communication between locally-normalized subsets, by allowing them to assign uninformative distributions to incorrect assignments of the conditioned variables.

4 Experiments on Sequential Tasks

Although piecewise training was motivated by the need to train large structured models such as parse trees, we show here that piecewise training can be beneficial even in graphical models as simple as a linear chain. We present results on two tasks: part-of-speech tagging and named-entity recognition.

                          MEMM           CRF           CRF-PT
Named entity
  Training F1             99.89%         99.95%        99.82%
  Testing F1              88.90          89.87         90.47
  Training Time           1 hr           9 hr          5 hr, 35 min
POS tagging
  Training Accuracy       99.1%          99.8%         99.08%
  Testing Accuracy        88.1%          88.1%         88.8%
  Training Time           2 hr, 8 min    14 hr         2 hr, 30 min

Table 1: Results on named-entity recognition and part-of-speech tagging. CRFs trained in pieces (CRF-PT) statistically-significantly outperform both regular MEMMs and CRFs. A small subset of the treebank was used for the POS results, which explains the low baseline performance.

First, for named-entity recognition, we use the CoNLL 2003 English data set, consisting of 14,987 newswire sentences annotated with names of people, organizations, locations, and miscellaneous entities. We test on the standard development set of 3,466 sentences. Evaluation is done using precision and recall on the extracted chunks, and we report F1 = 2PR/(P + R). Results are shown in Table 1. We compare a CRF, an MEMM, and a piecewise-trained CRF (CRF-PT) with exhaustively-added NOTA instances. Consistent with previous work, the CRF performs better than the MEMM. But with the addition of NOTA instances, the CRF-PT performs better than the standard MEMM and, surprisingly, also better than the CRF. It appears that CRF-PT is overfitting less than the CRF, since CRF-PT has lower training accuracy despite its higher testing accuracy. All of the pairwise differences in Table 1 are significant by McNemar's test on the per-sentence labeling disagreements (p < 0.001).

Second, in previous work (Lafferty et al., 2001), CRFs were shown to outperform MEMMs on part-of-speech tagging. Here we test whether training in pieces addresses the previously-observed problems with local normalization on a POS data set. For these preliminary experiments, we used a small subset comprising 1,154 sentences, randomly sampled from sections 0-18 of the Penn Treebank WSJ corpus. We evaluated on all 5,527 sentences of sections 20 and 21. The Treebank tag set contains 45 tags. In this experiment, we achieved better performance by including only a few NOTA interactions. In particular, after twenty iterations of training, we added a NOTA term of the form p(NOTA | y_t) for every incorrect y_t in the training set to which the model assigned probability greater than 0.2.
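A minimal sketch of this selective variant follows (again our own code and data structures; `marginals` is assumed to hold the per-position label marginals p(y_t | x) under the partially-trained model, and the NOTA marker is the one from the earlier sketch):

NOTA = "<NOTA>"  # same marker as in the earlier sketch

def select_nota_instances(sequences, marginals, threshold=0.2):
    """Generate NOTA instances only for confident mistakes.

    sequences : list of (x, y) pairs with true label sequences y
    marginals : marginals[i][t] is a dict label -> p(y_t = label | x) under
                the partially-trained model for the i-th sentence
    For every position where an incorrect label receives marginal probability
    above `threshold`, the next-state classifier for that incorrect label is
    given a NOTA instance on the following observation.
    Returns a dict mapping source state -> list of (observation, NOTA) pairs.
    """
    extra = {}
    for (x, y), marg in zip(sequences, marginals):
        for t in range(len(y) - 1):
            for label, prob in marg[t].items():
                if label != y[t] and prob > threshold:
                    extra.setdefault(label, []).append((x[t + 1], NOTA))
    return extra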
On this small training set, the MEMM and the CRF had identical performance. The training set is so small that the CRF's greater capacity to overfit negates its advantage in avoiding label bias. When trained on larger subsets of the treebank, the CRF performs better than the MEMM, consistent with previous results. Still, the piecewise-trained CRF-PT achieves significantly better performance than both the CRF and the MEMM. This difference is significant by a paired t-test on the number of incorrect tags per sentence (p < 0.001).

In both data sets, we found that for piecewise training, using local MEMM-style normalization at test time performed better than CRF-style global inference. This is not surprising, because one might expect the weights from each of the separately-trained pieces to have different scales. For the NER results that we report, we globally train a per-state scaling factor, as mentioned in the previous section. For the POS results in Table 1, however, we used locally-normalized MEMM testing.

5 Discriminative Parsing

The real promise of piecewise training techniques is for richly-structured models that are simply too complex to train globally. In this section, we present early results which suggest that piecewise training is also a promising approach for discriminative parsing, which can take weeks or months to train globally. Our aim in these early experiments is not to beat generative models (by performing extensive feature engineering), but to show that piecewise training improves over a plain locally-normalized MEMM analogue for parsing. We indeed find a small improvement, suggestive of good future work in this area.

Although previous locally-normalized parsers (Ratnaparkhi, 1999) have normalized over shift-reduce decisions, a locally-normalized model can be defined from a CFG more directly. Just as an MEMM locally normalizes over all next states from a source state, in a CFG one can normalize over all expansions of a given chart edge. This yields a conditional model from a PCFG exactly as an MEMM does from an HMM.

More formally, let T be a parse tree for a sentence x. We assume that we have a CFG G in Chomsky normal form, and a set of features f_k(A, BC, x, i, j), where A → BC is a rule in the CFG, and the indices i and j are the boundaries in x of the subtree headed by A. This means essentially that all features must be computable from a single traversal in a chart. To define the probability of a parse tree in this model, we first define the probability of a single expansion A → BC occurring when A spans the sequence x_i ... x_j:

p(BC \mid A, x_i \ldots x_j) = \frac{\phi(A, BC, x, i, j)}{Z(A, x, i, j)}

Z(A, x, i, j) = \sum_{B'C' : A \rightarrow B'C'} \phi(A, B'C', x, i, j)

\phi(A, BC, x, i, j) = \exp\left( \sum_k \lambda_k f_k(A, BC, x, i, j) \right),

and then the probability of a tree T is simply the product of all expansions that occur in it. We are not aware of this model in the literature, so we call it a maximum-entropy context-free grammar (MECFG).
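As an illustration of how this model assigns probability to a tree, the sketch below (our own code; the tree representation and the `log_phi` callback standing for sum_k lambda_k f_k(A, BC, x, i, j) are assumptions, and the tagging of preterminals is omitted) scores a binary tree by locally normalizing each expansion over all rules that rewrite its parent:

from scipy.special import logsumexp

def log_prob_tree(node, rules_for, log_phi):
    """Log probability of a binary parse tree under the MECFG sketched above.

    node      : tuple (A, i, j, left, right); left and right are child nodes,
                or None when A is a preterminal spanning a single word
    rules_for : dict mapping a nonterminal A to the list of pairs (B, C)
                such that A -> B C is a rule of the grammar
    log_phi   : function (A, B, C, i, j) -> sum_k lambda_k f_k(A, BC, x, i, j)
    Each expansion A -> B C over span (i, j) is normalized locally over all
    expansions of A, and the tree's log probability is the sum of these
    locally-normalized expansion scores.
    """
    A, i, j, left, right = node
    if left is None:                        # preterminal: tagging model omitted
        return 0.0
    B, C = left[0], right[0]
    # Local normalizer Z(A, x, i, j): sum over every rule rewriting A.
    log_Z = logsumexp([log_phi(A, Bp, Cp, i, j) for Bp, Cp in rules_for[A]])
    local = log_phi(A, B, C, i, j) - log_Z
    return (local
            + log_prob_tree(left, rules_for, log_phi)
            + log_prob_tree(right, rules_for, log_phi))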
Maximum-likelihood training for MECFGs can be accomplished by numerical optimization, as is standard for maximum-entropy models. Thus, although MECFG training is more expensive than PCFG training, it still requires no parsing at training time.

We report early results from training these models on the Penn Wall St. Journal Treebank. In the results here, we restrict the training and test sets to sentences of length ≤ 15 words. As usual, we report labeled precision and recall. We train on sections 2-21 (9,753 sentences), and use section 22 as our test set (421 sentences). We use the CFG structure from the unlexicalized PCFG of Klein and Manning (2003).2 However, in these results, our tagging model is not very sophisticated, and our handling of unknown words is very simple. This explains why our generative baseline has an F1 of 82.4, much lower than the 88 F1 that one can get with this PCFG structure on this subset. Our features are lexical features of the span boundaries, similar to Taskar et al. (2004). Specifically, we used the first and last words of the span, conjoined with the span length if it is less than 3.

2 In fact, our implementation uses transformed trees printed directly from Dan Klein's code, which is available at http://www-nlp.stanford.edu/software/lex-parser.shtml.

To apply piecewise training to this model, recall that a NOTA instance corresponds to an incorrect decision that could potentially occur during global inference at test time. While in an MEMM this corresponds to an incorrect source state, in a CFG it corresponds to an incorrect chart edge (X, i, j). A chart edge could be incorrect for two reasons: the nonterminal X can be incorrect, or the span (i, j) might not correspond to a true bracketing. We selectively generate NOTA instances for both kinds of errors, based on mistakes made by a partially-trained model. Specifically, after 10 iterations of MECFG training, we parse the training set and add NOTA instances that correspond to incorrect chart edges whose scores are sufficiently high. For each incorrect chart edge, we compare its score to the best-scoring chart edge of the same length; if the log ratio of scores is greater than a threshold δ (here, we use δ = 5), then the incorrect chart edge is included as a NOTA instance.

                              GENERATIVE     MECFG     PT-CFG
Number of NOTA Instances      N/A            0         1,564,827
Test P (len <= 15)            81.7           81.4      81.7
Test R (len <= 15)            83.1           82.7      83.2

Table 2: Early results from a piecewise-trained discriminative parsing model. As more NOTA instances are added, the performance of the model improves, eventually equaling the generative baseline.

Results from comparing MECFG training, piecewise training, and a generative baseline are given in Table 2. We expected the MECFG to perform poorly, perhaps suffering even more severely from the same pathologies that affect MEMMs. To our surprise, this was not the case; the MECFG comes within 0.4 F1 of the generative baseline. Even so, piecewise training still provides a small, suggestive improvement, essentially equalling the generative baseline in performance. More complex discriminative features, such as scores from a generative model (Collins and Roark, 2004) and richer lexical features, can help performance further.

We have shown that with fairly basic lexical features, piecewise training can equal the performance of a generative baseline without the vast training time required by a global discriminative model. To get an idea of the cost of training a global model on this subset: our parser takes about 40 minutes to parse the entire training set. Suppose that global training takes 100 iterations to converge (CRF-style training might use many more iterations than this; perceptron training may use fewer). This yields a back-of-the-envelope estimate of 3 days to train. By contrast, piecewise training took only 4.6 hours. In addition, we do not necessarily need to parse the entire training set to generate NOTA instances; parsing a sample of the training set may work well for large data sets.
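For completeness, here is a rough sketch of the chart-edge filtering step described above (our own code and data structures, not the authors' implementation; in particular, we read the comparison as keeping an incorrect edge when its log score is within δ of the best-scoring edge of the same span length, i.e. when its score is "sufficiently high"):

from collections import defaultdict

def select_nota_edges(chart_edges, log_score, gold_edges, delta=5.0):
    """Select incorrect chart edges to serve as NOTA instances.

    chart_edges : list of edges (X, i, j) built while parsing a training
                  sentence with the partially-trained MECFG
    log_score   : dict mapping each edge to its log score under that model
    gold_edges  : set of edges appearing in the gold parse of the sentence
    delta       : log-score threshold (5 in the experiments above)
    """
    best_by_length = defaultdict(lambda: float("-inf"))
    for (X, i, j) in chart_edges:
        best_by_length[j - i] = max(best_by_length[j - i], log_score[(X, i, j)])

    nota_edges = []
    for edge in chart_edges:
        X, i, j = edge
        if edge in gold_edges:
            continue                        # only incorrect edges become NOTA
        if best_by_length[j - i] - log_score[edge] < delta:
            nota_edges.append(edge)         # competitive enough to vote against
    return nota_edges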
6 Related Work

There are several examples in the literature of undirected models trained in locally-normalized pieces. Pseudolikelihood (Besag, 1975) is a well-known method for training a globally-normalized model using local distributions. In pseudolikelihood, parameters are trained to maximize the likelihood of each predicted variable, conditioned on the true values of the neighboring variables. The MEMM training objective is actually very similar to the pseudolikelihood objective, except that in the MEMM objective, the local term for each node is conditioned only on the previous node, not on both neighbors as in pseudolikelihood. It would be interesting to see whether the NOTA technique can be used to improve the performance of pseudolikelihood training as well. The MEMM objective has also been used by others (Punyakanok and Roth, 2001; Klein et al., 2003).

Pseudolikelihood has had some success in applications. For example, Toutanova et al. (2003) achieve state-of-the-art performance on part-of-speech tagging using a cyclic dependency network trained using pseudolikelihood. Also, pseudolikelihood has been used for grid-shaped CRFs in computer vision (Kumar and Hebert, 2003).

Roth (2002) has advocated training disjoint classifiers and then performing joint inference at test time, in an approach he terms "training with classifiers." Kakade, Teh, and Roweis (2002) show that label bias in MEMMs can be somewhat ameliorated by training on the marginal probability of single labels. With this training objective, MEMMs actually perform better on token accuracy than CRFs on an extraction data set. Computing the marginal likelihood, however, requires forward-backward, and is therefore just as computationally intensive as global CRF training.

7 Conclusion

We present a new method for efficient piecewise training of large, richly structured probabilistic models. Using so-called "none-of-the-above" instances, we allow some global interactions across the model without requiring full inference at training time. On two sequence-labeling NLP tasks, we show that NOTA training preserves the efficiency of MEMM training, while surprisingly achieving significantly better accuracy than a CRF. We also present early results showing that piecewise training is a promising approach for discriminative parsing. We introduce the maximum-entropy context-free grammar, a locally-normalized parsing model analogous to the MEMM, which we believe to be novel. On a subset of the Penn treebank, we show that MECFGs perform surprisingly well, only slightly below an (admittedly low) generative baseline. We present early results suggesting that piecewise training can potentially improve performance over an MECFG.

In future work, we will improve the parsing results by using better tagging models in the baseline and by making better use of discriminative features. Also, we are interested in broader applications of piecewise training to large structured models, such as arise in information extraction and data mining.

Acknowledgments

This work was supported in part by the Center for Intelligent Information Retrieval, and in part by the Central Intelligence Agency, the National Security Agency and the National Science Foundation under NSF grant numbers IIS-0326249 and IIS-0427594. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect those of the sponsor.

References

Julian Besag. 1975. Statistical analysis of non-lattice data. The Statistician, 24(3):179-195.

Léon Bottou. 1991. Une Approche théorique de l'Apprentissage Connexionniste: Applications à la Reconnaissance de la Parole. Ph.D. thesis, Université de Paris XI, Orsay, France.
Eugene Charniak. 2000. A maximum-entropy-inspired parser. In Proceedings of the First Conference of the North American Chapter of the Association for Computational Linguistics, pages 132-139. Morgan Kaufmann Publishers Inc.

Stephen Clark and James R. Curran. 2004. Parsing the WSJ using CCG and log-linear models. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 103-110, Barcelona, Spain, July.

Michael Collins and Brian Roark. 2004. Incremental parsing with the perceptron algorithm. In Proceedings of the 42nd Meeting of the Association for Computational Linguistics (ACL'04), Main Volume, pages 111-118, Barcelona, Spain, July.

Michael Collins. 2000. Discriminative reranking for natural language parsing. In Proc. 17th International Conf. on Machine Learning, pages 175-182. Morgan Kaufmann, San Francisco, CA.

Radu Florian, H. Hassan, A. Ittycheriah, Hongyan Jing, Nanda Kambhatla, X. Luo, Nicolas Nicolov, Salim Roukos, and Tong Zhang. 2004. A statistical model for multilingual entity detection and tracking. In HLT/NAACL 2004.

Sham Kakade, Yee Whye Teh, and Sam Roweis. 2002. An alternative objective function for Markovian fields. In Proceedings of the Nineteenth International Conference on Machine Learning.

Dan Klein and Christopher D. Manning. 2002. Conditional structure versus conditional estimation in NLP models. In EMNLP.

Dan Klein and Christopher D. Manning. 2003. Accurate unlexicalized parsing. In Proceedings of the 41st Annual Meeting of the Association for Computational Linguistics.

Dan Klein, Joseph Smarr, Huy Nguyen, and Christopher D. Manning. 2003. Named entity recognition with character-level models. In Proceedings of the Seventh Conference on Natural Language Learning, pages 180-183.

Sanjiv Kumar and Martial Hebert. 2003. Discriminative fields for modeling spatial dependencies in natural images. In Sebastian Thrun, Lawrence Saul, and Bernhard Schölkopf, editors, Advances in Neural Information Processing Systems 16. MIT Press, Cambridge, MA.

J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proc. 18th International Conf. on Machine Learning.

Andrew McCallum, Dayne Freitag, and Fernando Pereira. 2000. Maximum entropy Markov models for information extraction and segmentation. In Proc. 17th International Conf. on Machine Learning, pages 591-598. Morgan Kaufmann, San Francisco, CA.

Kamal Nigam, John Lafferty, and Andrew McCallum. 1999. Using maximum entropy for text classification. In IJCAI'99 Workshop on Information Filtering.

David Pinto, Andrew McCallum, Xing Wei, and W. Bruce Croft. 2003. Table extraction using conditional random fields. In Proceedings of the ACM SIGIR.

V. Punyakanok and D. Roth. 2001. The use of classifiers in sequential inference. In NIPS 13.

Adwait Ratnaparkhi. 1996. A maximum entropy model for part-of-speech tagging. In Proc. of the 1996 Conference on Empirical Methods in Natural Language Processing (EMNLP 1996).

Adwait Ratnaparkhi. 1999. Learning to parse natural language with maximum entropy models. Machine Learning, 34:151-175.

Dan Roth. 2002. Reasoning with classifiers. In Proc. of the European Conference on Machine Learning.

Fei Sha and Fernando Pereira. 2003. Shallow parsing with conditional random fields. In Proceedings of HLT-NAACL 2003. Association for Computational Linguistics.
Ben Taskar, Pieter Abbeel, and Daphne Koller. 2002. Discriminative probabilistic models for relational data. In Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02).

Ben Taskar, Dan Klein, Michael Collins, Daphne Koller, and Chris Manning. 2004. Max-margin parsing. In Empirical Methods in Natural Language Processing (EMNLP04).

Kristina Toutanova, Dan Klein, Christopher D. Manning, and Yoram Singer. 2003. Feature-rich part-of-speech tagging with a cyclic dependency network. In HLT-NAACL 2003.