Center For Intelligent Information Retrieval Technical Report IR-383
Presented at NIPS’04 workshop on Learning with Structured Outputs

Piecewise Training with
Parameter Independence Diagrams:
Comparing Globally- and Locally-trained
Linear-chain CRFs
Andrew McCallum and Charles Sutton
Department of Computer Science
University of Massachusetts Amherst
Amherst, MA 01003 USA
{mccallum,casutton}@cs.umass.edu

Abstract
We present a diagrammatic formalism and practial methods for introducing additional independence assumptions into parameter estimation, enabling efficient training of undirected graphical models in
locally-normalized pieces. On two real-world data sets we demonstrate
our locally-trained linear-chain CRFs outperforming traditional CRFs—
training in less than one-fifth the time, and providing a statisticallysignificant gain in accuracy.

1

Introduction

Graphical models have brought about a revolution in probabilistic modeling because they
provide a formal framework for representing independence assumptions among random
variables. Even when not quite true, these assumptions can be tremendously beneficial
by enabling models to generalize and scale to large data. With sufficient independence
assumptions, calculating the probability of a configuration becomes a product of simple
factors taken from low-dimensionality tables, and in low tree-width graphs, inference can
be performed efficiently via dynamic programming.
However, even when independence assumptions result in a low-treewidth graph, parameter
estimation is often difficult and time-consuming. In models with hidden variables, and
in all undirected models, parameter estimation involves (repeatedly) performing inference
across unobserved model variables in order to obtain marginals necessary to calculating the
gradient of the likelihood. Finding the optimal parameter setting involves balancing subtle
trade-offs in parameter values spread throughout the model, and can require many rounds
of gradient steps and inference.
In many practical situations, even extremely simple graphical models can have severe
training costs. For example, linear-chain conditional random fields (CRFs) are undirected
graphical models in which the predicted variables are connected in a trivial linear chain.
Inference comprises straightforward dynamic programming via forward-backward. However, real-world applications of this model use millions of parameters trained on hundreds
of thousands of words, and parameter estimation can then require half a day or more (Mc-

Callum, 2003). In factorial CRFs (Sutton et al., 2004) (shallow grids), training can literally
take days.
This paper presents methods by which additional independence assumptions among parameters can improve both training speed and regularization. As with traditional independence
assumptions among non-parameter random variables in graphical models, these assumptions may be merely approximations, but they can be drammatically advantageous. We
introduce a formalism and diagrammatic representation we term parameter independence
diagrams that provide a general language for expressing independence assumptions among
parameters that are represented as factors in a factor graph. We also introduce a method for
efficiently capturing limited interactions among sets of parameters.
Rather than using global normalization in undirected models for inference at training time,
the factors and variables in the model’s factor graph are apportioned into subsets we call
“pieces,” and normalization is performed locally among the variables in each subset. The
parameters in each subset’s factors are trained independently from each other1 and thus we
use the term “piecewise training.”
Because normalization is performed locally, over smaller sets of the variables, inference can
be significantly faster. Local normalization also limits the interactions between parameters
in the gradient of the objective function, which additionally speeds learning by providing
a simpler likelihood surface to climb. Furthermore, we have evidence that this reduced
interaction among parameters results in greater regularization, and lower test set error.
There has recently been growing interest in using undirected graphical models to perform
joint inference over a large number of interdependent predicted variables—recently coined
“collective classification.” In some cases exact inference is feasible, but slow (Lafferty
et al., 2001; McCallum, 2003; Sutton et al., 2004). In other cases approximate inference
methods are used, such as sum-product (loopy belief propagation and related alternatives)
(Taskar et al., 2002; Sutton et al., 2004; Sutton & McCallum, 2004), or Monte-Carlo sampling (Li & McCallum, 2004; McCallum & Wellner, 2003; Wellner et al., 2004). Since
inference must be performed repeatedly during parameter estimation (once for each gradient calculation), sometimes a more efficient, dramatic approximation is used at training
time than at test time. Pseudo-likelihood training, followed by Monte-Carlo inference at
test time is just such an example (Kumar & Hebert, 2003; McCallum & Wellner, 2003;
Wellner et al., 2004). This paper describes generalizations of training methods for this
scheme.
As the size and complexity of these graphical models grow—for example, capturing not
only interdependent labels on many objects (such as collective document classification), but
also joint inference among many entire subsystems (such as part-of-speech-tagging, phrase
segmentation, named entity recogition, coreference, relation identification, role discovery
and group detection), we will require new divide-and-conquer training methods for these
models. Parameter independence diagrams and piecewise training provide a framework
and methods for a divide and conquer approach to training undirected graphical models.

2

Parameter Independence Diagrams

An undirected graphical model (also known as a Markov random field, or Markov network) expresses independence assumptions among its random variables with a limited
number of undirected edges between pairs of variables. Without loss of generality, consider a conditional random field in which a subset of the variables, the “target” variables
Y = {y1 , y2 , ...} are predicted conditioned a given set of values for the remaining “observed” variables X = {x1 , x2 , ...}. The joint distribution over the target varibles conditioned on the observed variables can be expressed as a product over non-negative-valued
1

Or, nearly independently, as explained below.

potential functions Φ(·) of the cliques c ∈ C in this graph (with normalization constant
ZX ) (Hammersley & Clifford, 1971), where yc indicates the subset of variables in y that
are members of clique c, and xc is defined analogously:
1 Y
p(y|x) =
Φc (yc , xc ).
Zx
c∈C

A factor graph is a bipartite graph (with variables nodes in one partition and factor nodes
in the other) specifying further independence assumptions by indicating how a per-clique
potential function above can be comprised of multiplicative factors, indexed by f , each
calculated by a potential function φf (·). For example, three variables may be completely
connected, forming a clique in the original graphical model, however, the parameterization
of the potential function for this clique may consist only of functions of pairs; thus in
this case Φ(x1 , x2 , x3 ) = φ(x1 , x2 )φ(x2 , x3 )φ(x1 , x3 ). The joint distribution over target
variables conditioned on observed variables is then a product of factor potential functions
φf (·) of the factors f ∈ F, where yf indicates the subset of variables in y that have edges
to factor f , and xf is defined analogously:
1 Y
φf (yf , xf ).
p(y|x) =
Zx
f ∈F

It is common to parameterize factors by a log-linear function of a set of feature functions
gk each multiplied by a learned weight parameter λk , so that
!
X
φf (yf , xf ) = exp
λk gk (yf , xf ) .
k

The objective function for training a conditional random field given a training set D of
independent pairs hxi , yi i is the log-probability all the y’s given the x’s, given the set of
all parameters Λ,
|D|
X
O(Λ, D) =
log pΛ (y(i) |x(i) ).
i=1

Gradient of the objective function with respect to a single parameter λk is


|D|
X
X
∂O(Λ, D) X X
(i)
(i)
(i)
=
gf (yf , xf ) −
pΛ (y|x(i) )
gf (yf , xf ) .
∂λk
y
i=1
f

f

Parameters in factors not separated by observed variables x are coupled in the gradient,
through the partition function Zx (Λ) that appears in the expansion of pΛ (y|x). These coupled parameters are not estimated independently from each other. Much of the expressive
power of these models is comes from the subtle trade-offs of these parameter values against
each other.
A parameter independence diagram expresses yet further independence statements that allow the gradient to factor. Just as factor graph may make independence statements that are
not actually true in the data being modeled, a parameter independent diagram may express
independence statements about parameters that are not true in the original factor graph.
A parameter independence diagram is a hypergraph consisting of a factor graph, plus a
set of hyperedges, Π = {π, ...}, connecting subsets of variables and factors.2 Such a
2

Or, equivalently a tripartite graph, consisting of a (bipartite) factor graph, plus a third partition
consisting of “piece” nodes, each connected to the variables and factors that are members of that
piece.

MEMM
1

Bi-directional CMM
1

1

1

...

1

1

1

1

1

1

...

Linear-chain CRF

Pseudo-likelihood
1

1

1

1

...

1

1

...

Figure 1: Four examples of parameter independence diagrams, all for sequence data. Observed variables are drawn as black circles, and predicted target variables as gray circles.
The predicted variables are connected in a linear chain, indicating a first-order Markov
model among state in a finite state machine; the left-most state is colored black to indicate
the pre-determined “start state.” Factors appear as rectangles, the shared digit inside them
indicating that their parameters are tied across positions in the sequence. Each “piece” is
indicated by circling (drawn here with a dashed stroke) its variable-node and factor-node
members.
subset is termed a piece, and written π. The objective function specified by the parameter
independence diagram is defined to be the sum of objective functions for each piece,
X
OΠ (Λ, D) =
Oπ (Λ, D);
π∈Π

and the objective function for a piece is the joint distribution of the variables in the piece
conditioned on those variables outside the piece that are connected to factors within the
piece. We write yN (π) and xN (π) for these target and observable variables neighboring
factors within the piece, and the objective function for piece π is
Oπ (Λ, D) =

|D|
X

(i)

(i)

log p̃Λ (yπ(i) |yN (π) , xN (π) ).

i=1

This local joint distribution over yπ is written p̃ (rather than p), indicating that this is not
the true marginal, but a locally-normalized function of the product of factors f within piece
π; thus, (writing this set of factors, f ∈ π),
1 Y
φf (yf , xf ; Λf ).
p̃Λ (yπ |yN (π) , x) =
Zπ,x
f ∈π

Note that some variables outside the piece may share a factor with a variable inside the
piece, but unless the shared factor itself is inside the piece, those outside variables are not
conditioned on, and play no role in the piece’s objective function. Factors and variables
can appear in more than one piece. We may specify that the parameters of some factors are
locked; in this way they would have zero gradient for a piece’s objective function, but may
play a role in inference. Traditional undirected graphical models correspond to a parameter
independence diagram in which all variables and factors appear in a single piece. When
the training data is fully observed, factors never in a common piece are conditionally independent given the training data, and may have their parameters estimated independently (or
partially independently as we shall see in the next section). When the training data leaves
some variables hidden, factors connected through unobserved variables remain dependent.

Several pictorial conventions are possible for pieces, including drawing a new node for
each piece, with edges connecting its participating variables and factors. We have found
it clearer to circle the variables and factors in each piece. Since pieces typically include
locally-connected nodes in the factor graph, this representation has usually been quite natural and clear.
Figure 1 shows the diagrammatic representation of parameter independence diagrams for
several common linear-chain graphical models representing finite state machines (FSMs)
used to model sequence data. Pieces are indicated by cicled regions (in a dashed stroke
here for emphasis). We show parameter independence diagrams of four objective functions
for linear-chain models: MEMMs, pseudolikelihood, bi-directional CMMs, and the linearchain CRF.
Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000) learn a “next-state
classifier” for each source state in the FSM. Its objective function for parameter estimation
P P
(i) (i)
(i)
is log i t p̃(yt |yt−1 , xt ), where t indicates a position in the sequence (and also a
piece).
The objective for pseudo-likelihood (Besag, 1974) involves predicting each target variable
conditioned on all its neighbors,
OP L (Λ, D) =

|D|
X
X
i

(i)

(i)

(i)

(i)

log p̃Λ (yt |yt−1 , yt+1 , xt ).

t

Although not strictly a probabilistic model, bi-directional conditionally-trained Markov
model (bi-directional CMM) have been trained using support vector machines to learn separate functions for both p̃(yt |yt−1 , x) and p̃(yt |yt+1 , x) (Kudo & Matsumoto, 2001). The
two are then combined using a modified forward-backward procedure.
The objective function for the linear-chain CRF (Lafferty et al., 2001) is simply the joint
probability p(y|x) of all target variables.

3

Piecewise Training with Limited Interactions

In pseudo-likelihood and MEMMs, training by conditioning only on the true values of the
neighbors can be problematic. When global inference at test time estimates high probability
for incorrect assignments to these neighbors, potential functions are evaluated on inputs
they may never have seen at training time, resulting in unpredictable potential scores.
Given a set of labels (e.g., 45 part-of-speech tags), a training set
(i)
(i)
(i)
{hx(1) , y(1) i, ...hx(N ) , y(N ) i}, where x(i) is an input sequence hx1 , x2 , ...xT i,
(i) (i)
(i)
and y(i) is its corresponding output label sequence hy1 , y2 , ...yT i, the parameters, Λ,
of an MEMM are trained to maximize the conditional likelihood,
OM EM M (Λ, D) =

|D|
X
X
i

(i)

(i)

(i)

log p̃Λ (yt |yt−1 , xt ).

t

Here the locally-normalized subsets are individual “next states,” yt , conditioned on “source
states” yt−1 , and there is a separate potential function for each source state. Potential
functions are typically restricted to some exponential family.
We present here an approach to training in pieces that provides robust output for incorrect
assignments to neighboring variables outside the subset, and allows for limited interactions
between subsets, while also preserving efficient local normalization.

A piecewise-trained linear-chain CRF,
with weight rescaling

Maximum Entropy Classifier
Traditional, all features trained jointly

Training, Phase 1
2

2

2

1

2

1

1

1

...

Trained with "features bagged"

Training, Phase 2
2

2

2

1

2

1

1

1

...

Testing
2

Independently-trained (ala Naive Bayes)
2

2

1

2

1

1

1

...

Figure 2: Left: An example two-stage method for piecewise-training a linear-chain CRF.
Right: Variants on the traditional maximum entropy classifier, in which groups of features
are trained independently.

For each variable we introduce an additional value termed “none of the above,” or N OTA ,
(e.g. a 46th part-of-speech label). Traditional pseudo-likelihood training would assign
training data to a particular potential function using the true labels in the training data. By
contrast, we also assign training data to a potential function even when the conditioning
values do not match this particular potential function; the correct predicted value in this
case is N OTA .
Inference at test time is performed with standard MEMM inference, with the N OTA state
and all parameters associated with N OTA outcomes removed. Accordingly, training is
performed such that all parameters associated with N OTA values are constrained to be zero.
Thus the only way for N OTA to be correctly predicted is by reducing the strength of the
parameters associated with other outcomes, given the current observations. Once N OTA is
removed, the next-state distribution is nearly uniform, which in an MEMM is the next-state
distribution that most equally “votes against” all possible outcomes.
Formally, the objective function used in a piecewise-trained linear-chain model is
OP T (Λ, D) = log

|D|
Y
Y
i

(i)

(i)

(i)

p̃Λ (yt |yt−1 , xt )

t

(i)

Y

p̃Λ (N OTA |y, xt ).

(i)

y6=yt−1

Expanding this to show the partition function and the product of potential functions, we see
OP T (Λ, D)

=

log

|D|
Y
Y
i

t

(i)

(i)

(i)

φ(yt , yt−1 , xt )
P
(i)
(i)
1 + y φ(y, yt−1 , xt )

1

Y
(i)

y 0 6=yt−1

1+

(i)
0
y φ(y, y , xt )

P

MEMM
CRF
CRF-PT

Training
F1
99.89%
99.95%
99.82%

Named entity
Testing
Training
F1
Time
88.90
1 hr
89.87
9 hr
90.47
5 hr, 35 min

Training
Accuracy
99.1%
99.8%
99.08%

POS tagging
Testing
Training
Accuracy
Time
88.1%
2 hr, 8 min
88.1%
14 hr
88.8%
2 hr, 30 min

Table 1: Results on named-entity recognition and part-of-speech tagging. CRFs trained in
pieces (CRF-PT) significantly outperform both regular MEMMs and CRFs. The training
time of CRF-PT would be substantially further reduced with non-exhaustive y 6= yt−1 ,
(experiments forthcoming).

=

log

|D|
Y
Y
i

t

(i)

(i)

(i)

φ(yt , yt−1 , xt )
,
Q 
P
(i)
0
y0 1 +
y φ(y, y , xt )

where the sum over y does not include N OTA (since they are captured with the included
1’s). This corresponds to approximating the partition function Z(Λ, x) with
!
X
YY
(i)
0
(i)
1+
φ(y, y , xt ) .
ZP T (Λ, x ) =
t

y0

y

Tom Minka (personal communication) has pointed out that this seems to be a novel approximation to the partition function, and that a somewhat similar approximation is produced in
the very beginning of belief propagation, when all messages are equal to 1, and the partition
function is estimated by
YXX
(i)
φ(y, y 0 , xt ).
ZBP (Λ, x(i) ) ∝
t

y0

y
(i)

The training data for the N OTA outcome, y 6= yt−1 , may be exhaustive, or randomly sampled, or chosen to include only those cases in which incorrectly had high marginal probability by joint inference with a previous parameter setting. A method similar to this last option
has been previously used to successfully incorporate not all but some of the most important
“unsupported features” in linear-chain CRFs (McCallum, personal communication).
Furthermore, we can calibrate the magnitude of the parameters Λs across each subset s,
by learning a per-subset multiplicative factor, αs Λs . Although this factor is learned via
traditional global inference, its impact on training time is limited because it has such low
dimensionality that optimization typically requires only a few gradient steps.3
Essentially N OTA outcomes allow limited communication between locally-normalized subsets, by allowing them to assign low potentials to incorrect variable assignments of conditioned variables.

4

Experiments

Although piecewise training was motivated by the need to train large undirected models
representing multiple interdependent sub-tasks, we show here that piecewise training can
be beneficial even in graphical models as simple as a linear-chain.
3

These calibration parameters are not used in the preliminary experiments below, however.

We present results on two natural-language tasks: part-of-speech tagging and named-entity
recognition. First, for named-entity recognition, we use the CoNLL 2003 data set, consisting of 14,987 newswire sentences annotated with names of people, organizations, locations, and miscellaneous entities. We test on the standard development set of 3,466
sentences. Evaluation is done using precision and recall on the extracted chunks, and we
report F1 = 2P R/P + R.
Results are shown in Table 1. We compare a CRF, an MEMM, and a CRF-PT with
exhaustively-added N OTA instances. Consistent with previous work, the CRF performs
better than the MEMM. But with the addition of N OTA instances, the CRF-PT performs
better than both the standard MEMM and the CRF. It appears that CRF-PT is overfitting
less than the CRF, since CRF-PT has lower training accuracy despite its higher testing accuracy. All of the pairwise differences in table 1 are significant by McNemar’s test on the
per-sentence labeling disagreements (p < 0.001).
Second, in previous work (Lafferty et al., 2001), CRFs were shown to outperform MEMMs
on part-of-speech tagging. Here we test whether training in pieces addresses the previouslyobserved problems with local normalization on a POS data set. For these preliminary
experiments, we used a very small subset of 1,154 sentences, randomly sampled from
sections 0–18 of the Penn Treebank WSJ corpus. We evaluated on all 5,527 sentences of
of sections 20 and 21. The Treebank tag set contains 45 tags.
In this experiment, we achieved better performance by including only a few
N OTA interactions. In particular, after twenty iterations of training, we added a N OTA term
of the form p(N OTA |yt ) for all incorrect yt in the training set that the model assigned
probability greater than 0.2.
On this small training set, the MEMM and the CRF had identical performance. The training
set is so small that the CRF’s greater capacity to overfit negates its advantage in avoiding
label bias. Still, the locally trained CRF achieves significantly better performance than the
CRF and the MEMM. This difference is significant by a paired t-test on the number of
incorrect tags per sentence (p < 0.001).
In both data sets, we found that using local normalization at test time performed better than
CRF-style global testing with the parameters learned from CRF-TP training. This is not
surprising, because one might expect that the weights from each of the separately-trained
pieces would have different scale. For the NER results that we report, we globally train
a per-state scaling factor, as mentioned in the previous section. For the POS results in
Table 1, however, we used locally-normalized MEMM testing.

5

Related Work

There are several examples in the literature of undirected models trained in locallynormalized pieces. Pseudolikelihood (Besag, 1975) is a well-known method for training a
globally-normalized model using local distributions. In pseudolikelihood, parameters are
trained to maximize the likelihood of each predicted variable, conditioned on the true values of the neighboring variables. The MEMM training objective is actually very similar to
the pseudolikelihood objective, except that in the MEMM objective, the local term for each
node is conditioned only on the previous node, not on both neighbors as in pseudolikeliood.
It would be interesting to see whether the N OTA technique can be used to improve the performance of pseudolikelihood training as well. The MEMM objective has also been used
by others, including Punyakanok and Roth (2001) and Klein et al. (2003).
Pseudolikelihood has had some success in applications. For example, Toutanova et al.
(2003) achieve state-of-the-art performance on part-of-speech tagging using a cyclic dependency network trained using pseudolikelihood. Also, pseudo-likelihood has been used
for grid-shape CRFs in computer vision (Kumar & Hebert, 2003).

Roth (2002) has advocated training disjoint classifiers, and then performing joint inference
at test time in an approach he terms “training with classifiers.”
Kakade, Teh, and Roweis (2002) show that label bias in MEMMs can be somewhat ameliorated by training on the marginal probability of single labels. With this training objective,
MEMMs actually perform better on token accuracy than CRFs on an extraction data set.
To compute the marginal likelihood, however, requires forward-backward, and therefore is
just as computationally intensive as global CRF training.

6

Conclusions

We have described Training-in-pieces, a learning method for providing both improved
training speed and reduced overfitting. Initial success raises additional interesting questions. How should subset boundaries be best selected? What choices of limited interaction
are best? How can sparse subsets of y 6= yt−1 be most effectively selected? We are also
considering what additional terms may be included to help account for overlapping pieces
“double-counting” certain variables. We are planning to apply these methods to more complex graphical models, including coreference, parsing, and cascades of numerous NLP
processing steps.
Acknowlegments
We thank Tom Minka for pointing out the relation of the approximate partion function
to loopy belief propagation. This work was supported in part by the Center for Intelligent
Information Retrieval and in part by The Central Intelligence Agency, the National Security
Agency and National Science Foundation under NSF grant #IIS-0326249 and grant #IIS0427594. Any opinions, findings and conclusions or recommendations expressed in this
material are the author(s) and do not necessarily reflect those of the sponsor.

References
Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. R. Statist. Soc.
B, 36, 192–236. (with Discussion).
Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24, 179–195.
Hammersley, J., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished
manuscript.
Kakade, S., Teh, Y. W., & Roweis, S. (2002). An alternative objective function for markovian fields.
In Proceedings of the Nineteenth International Conference on Machine Learning.
Klein, D., Smarr, J., Nguyen, H., & Manning, C. D. (2003). Named entity recognition with characterlevel models. Proceedings the Seventh Conference on Natural Language Learning (pp. 180–183).
Kudo, T., & Matsumoto, Y. (2001). Chunking with support vector machines.
Kumar, S., & Hebert, M. (2003). Discriminative fields for modeling spatial dependencies in natural
images. In S. Thrun, L. Saul and B. Schölkopf (Eds.), Advances in neural information processing
systems 16. Cambridge, MA: MIT Press.
Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for
segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning.
Li, W., & McCallum, A. (2004). A note on semi-supervised learning using markov random fields.
Technical Note.
McCallum, A. (2003). Efficiently inducing features of conditional random fields. Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI03).
McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information
extraction and segmentation. Proc. ICML 2000 (pp. 591–598). Stanford, California.

McCallum, A., & Wellner, B. (2003). Toward conditional models of identity uncertainty with application to proper noun coreference. IJCAI Workshop on Information Integration on the Web.
Punyakanok, V., & Roth, D. (2001). The use of classifiers in sequential inference. NIPS 13.
Roth, D. (2002). Reasoning with classifiers. Proc. of European Conference on Machine Learning.
Sutton, C., & McCallum, A. (2004). Collective segmentation and labeling of distant entities in
information extraction. ICML workshop on Statistical Relational Learning.
Sutton, C., Rohanimanesh, K., & McCallum, A. (2004). Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. Proceedings of the TwentyFirst International Conference on Machine Learning (ICML-2004).
Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data.
Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02).
Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging
with a cyclic dependency network. HLT-NAACL 2003.
Wellner, B., McCallum, A., Peng, F., & Hay, M. (2004). An integrated, conditional model of information extraction and coreference with application to citation matching. Conference on Uncertainty
in Artificial Intelligence (UAI).