Center For Intelligent Information Retrieval Technical Report IR-383 Presented at NIPS’04 workshop on Learning with Structured Outputs Piecewise Training with Parameter Independence Diagrams: Comparing Globally- and Locally-trained Linear-chain CRFs Andrew McCallum and Charles Sutton Department of Computer Science University of Massachusetts Amherst Amherst, MA 01003 USA {mccallum,casutton}@cs.umass.edu Abstract We present a diagrammatic formalism and practial methods for introducing additional independence assumptions into parameter estimation, enabling efficient training of undirected graphical models in locally-normalized pieces. On two real-world data sets we demonstrate our locally-trained linear-chain CRFs outperforming traditional CRFs— training in less than one-fifth the time, and providing a statisticallysignificant gain in accuracy. 1 Introduction Graphical models have brought about a revolution in probabilistic modeling because they provide a formal framework for representing independence assumptions among random variables. Even when not quite true, these assumptions can be tremendously beneficial by enabling models to generalize and scale to large data. With sufficient independence assumptions, calculating the probability of a configuration becomes a product of simple factors taken from low-dimensionality tables, and in low tree-width graphs, inference can be performed efficiently via dynamic programming. However, even when independence assumptions result in a low-treewidth graph, parameter estimation is often difficult and time-consuming. In models with hidden variables, and in all undirected models, parameter estimation involves (repeatedly) performing inference across unobserved model variables in order to obtain marginals necessary to calculating the gradient of the likelihood. Finding the optimal parameter setting involves balancing subtle trade-offs in parameter values spread throughout the model, and can require many rounds of gradient steps and inference. In many practical situations, even extremely simple graphical models can have severe training costs. For example, linear-chain conditional random fields (CRFs) are undirected graphical models in which the predicted variables are connected in a trivial linear chain. Inference comprises straightforward dynamic programming via forward-backward. However, real-world applications of this model use millions of parameters trained on hundreds of thousands of words, and parameter estimation can then require half a day or more (Mc- Callum, 2003). In factorial CRFs (Sutton et al., 2004) (shallow grids), training can literally take days. This paper presents methods by which additional independence assumptions among parameters can improve both training speed and regularization. As with traditional independence assumptions among non-parameter random variables in graphical models, these assumptions may be merely approximations, but they can be drammatically advantageous. We introduce a formalism and diagrammatic representation we term parameter independence diagrams that provide a general language for expressing independence assumptions among parameters that are represented as factors in a factor graph. We also introduce a method for efficiently capturing limited interactions among sets of parameters. Rather than using global normalization in undirected models for inference at training time, the factors and variables in the model’s factor graph are apportioned into subsets we call “pieces,” and normalization is performed locally among the variables in each subset. The parameters in each subset’s factors are trained independently from each other1 and thus we use the term “piecewise training.” Because normalization is performed locally, over smaller sets of the variables, inference can be significantly faster. Local normalization also limits the interactions between parameters in the gradient of the objective function, which additionally speeds learning by providing a simpler likelihood surface to climb. Furthermore, we have evidence that this reduced interaction among parameters results in greater regularization, and lower test set error. There has recently been growing interest in using undirected graphical models to perform joint inference over a large number of interdependent predicted variables—recently coined “collective classification.” In some cases exact inference is feasible, but slow (Lafferty et al., 2001; McCallum, 2003; Sutton et al., 2004). In other cases approximate inference methods are used, such as sum-product (loopy belief propagation and related alternatives) (Taskar et al., 2002; Sutton et al., 2004; Sutton & McCallum, 2004), or Monte-Carlo sampling (Li & McCallum, 2004; McCallum & Wellner, 2003; Wellner et al., 2004). Since inference must be performed repeatedly during parameter estimation (once for each gradient calculation), sometimes a more efficient, dramatic approximation is used at training time than at test time. Pseudo-likelihood training, followed by Monte-Carlo inference at test time is just such an example (Kumar & Hebert, 2003; McCallum & Wellner, 2003; Wellner et al., 2004). This paper describes generalizations of training methods for this scheme. As the size and complexity of these graphical models grow—for example, capturing not only interdependent labels on many objects (such as collective document classification), but also joint inference among many entire subsystems (such as part-of-speech-tagging, phrase segmentation, named entity recogition, coreference, relation identification, role discovery and group detection), we will require new divide-and-conquer training methods for these models. Parameter independence diagrams and piecewise training provide a framework and methods for a divide and conquer approach to training undirected graphical models. 2 Parameter Independence Diagrams An undirected graphical model (also known as a Markov random field, or Markov network) expresses independence assumptions among its random variables with a limited number of undirected edges between pairs of variables. Without loss of generality, consider a conditional random field in which a subset of the variables, the “target” variables Y = {y1 , y2 , ...} are predicted conditioned a given set of values for the remaining “observed” variables X = {x1 , x2 , ...}. The joint distribution over the target varibles conditioned on the observed variables can be expressed as a product over non-negative-valued 1 Or, nearly independently, as explained below. potential functions Φ(·) of the cliques c ∈ C in this graph (with normalization constant ZX ) (Hammersley & Clifford, 1971), where yc indicates the subset of variables in y that are members of clique c, and xc is defined analogously: 1 Y p(y|x) = Φc (yc , xc ). Zx c∈C A factor graph is a bipartite graph (with variables nodes in one partition and factor nodes in the other) specifying further independence assumptions by indicating how a per-clique potential function above can be comprised of multiplicative factors, indexed by f , each calculated by a potential function φf (·). For example, three variables may be completely connected, forming a clique in the original graphical model, however, the parameterization of the potential function for this clique may consist only of functions of pairs; thus in this case Φ(x1 , x2 , x3 ) = φ(x1 , x2 )φ(x2 , x3 )φ(x1 , x3 ). The joint distribution over target variables conditioned on observed variables is then a product of factor potential functions φf (·) of the factors f ∈ F, where yf indicates the subset of variables in y that have edges to factor f , and xf is defined analogously: 1 Y φf (yf , xf ). p(y|x) = Zx f ∈F It is common to parameterize factors by a log-linear function of a set of feature functions gk each multiplied by a learned weight parameter λk , so that ! X φf (yf , xf ) = exp λk gk (yf , xf ) . k The objective function for training a conditional random field given a training set D of independent pairs hxi , yi i is the log-probability all the y’s given the x’s, given the set of all parameters Λ, |D| X O(Λ, D) = log pΛ (y(i) |x(i) ). i=1 Gradient of the objective function with respect to a single parameter λk is   |D| X X ∂O(Λ, D) X X (i) (i) (i) = gf (yf , xf ) − pΛ (y|x(i) ) gf (yf , xf ) . ∂λk y i=1 f f Parameters in factors not separated by observed variables x are coupled in the gradient, through the partition function Zx (Λ) that appears in the expansion of pΛ (y|x). These coupled parameters are not estimated independently from each other. Much of the expressive power of these models is comes from the subtle trade-offs of these parameter values against each other. A parameter independence diagram expresses yet further independence statements that allow the gradient to factor. Just as factor graph may make independence statements that are not actually true in the data being modeled, a parameter independent diagram may express independence statements about parameters that are not true in the original factor graph. A parameter independence diagram is a hypergraph consisting of a factor graph, plus a set of hyperedges, Π = {π, ...}, connecting subsets of variables and factors.2 Such a 2 Or, equivalently a tripartite graph, consisting of a (bipartite) factor graph, plus a third partition consisting of “piece” nodes, each connected to the variables and factors that are members of that piece. MEMM 1 Bi-directional CMM 1 1 1 ... 1 1 1 1 1 1 ... Linear-chain CRF Pseudo-likelihood 1 1 1 1 ... 1 1 ... Figure 1: Four examples of parameter independence diagrams, all for sequence data. Observed variables are drawn as black circles, and predicted target variables as gray circles. The predicted variables are connected in a linear chain, indicating a first-order Markov model among state in a finite state machine; the left-most state is colored black to indicate the pre-determined “start state.” Factors appear as rectangles, the shared digit inside them indicating that their parameters are tied across positions in the sequence. Each “piece” is indicated by circling (drawn here with a dashed stroke) its variable-node and factor-node members. subset is termed a piece, and written π. The objective function specified by the parameter independence diagram is defined to be the sum of objective functions for each piece, X OΠ (Λ, D) = Oπ (Λ, D); π∈Π and the objective function for a piece is the joint distribution of the variables in the piece conditioned on those variables outside the piece that are connected to factors within the piece. We write yN (π) and xN (π) for these target and observable variables neighboring factors within the piece, and the objective function for piece π is Oπ (Λ, D) = |D| X (i) (i) log p̃Λ (yπ(i) |yN (π) , xN (π) ). i=1 This local joint distribution over yπ is written p̃ (rather than p), indicating that this is not the true marginal, but a locally-normalized function of the product of factors f within piece π; thus, (writing this set of factors, f ∈ π), 1 Y φf (yf , xf ; Λf ). p̃Λ (yπ |yN (π) , x) = Zπ,x f ∈π Note that some variables outside the piece may share a factor with a variable inside the piece, but unless the shared factor itself is inside the piece, those outside variables are not conditioned on, and play no role in the piece’s objective function. Factors and variables can appear in more than one piece. We may specify that the parameters of some factors are locked; in this way they would have zero gradient for a piece’s objective function, but may play a role in inference. Traditional undirected graphical models correspond to a parameter independence diagram in which all variables and factors appear in a single piece. When the training data is fully observed, factors never in a common piece are conditionally independent given the training data, and may have their parameters estimated independently (or partially independently as we shall see in the next section). When the training data leaves some variables hidden, factors connected through unobserved variables remain dependent. Several pictorial conventions are possible for pieces, including drawing a new node for each piece, with edges connecting its participating variables and factors. We have found it clearer to circle the variables and factors in each piece. Since pieces typically include locally-connected nodes in the factor graph, this representation has usually been quite natural and clear. Figure 1 shows the diagrammatic representation of parameter independence diagrams for several common linear-chain graphical models representing finite state machines (FSMs) used to model sequence data. Pieces are indicated by cicled regions (in a dashed stroke here for emphasis). We show parameter independence diagrams of four objective functions for linear-chain models: MEMMs, pseudolikelihood, bi-directional CMMs, and the linearchain CRF. Maximum Entropy Markov Models (MEMMs) (McCallum et al., 2000) learn a “next-state classifier” for each source state in the FSM. Its objective function for parameter estimation P P (i) (i) (i) is log i t p̃(yt |yt−1 , xt ), where t indicates a position in the sequence (and also a piece). The objective for pseudo-likelihood (Besag, 1974) involves predicting each target variable conditioned on all its neighbors, OP L (Λ, D) = |D| X X i (i) (i) (i) (i) log p̃Λ (yt |yt−1 , yt+1 , xt ). t Although not strictly a probabilistic model, bi-directional conditionally-trained Markov model (bi-directional CMM) have been trained using support vector machines to learn separate functions for both p̃(yt |yt−1 , x) and p̃(yt |yt+1 , x) (Kudo & Matsumoto, 2001). The two are then combined using a modified forward-backward procedure. The objective function for the linear-chain CRF (Lafferty et al., 2001) is simply the joint probability p(y|x) of all target variables. 3 Piecewise Training with Limited Interactions In pseudo-likelihood and MEMMs, training by conditioning only on the true values of the neighbors can be problematic. When global inference at test time estimates high probability for incorrect assignments to these neighbors, potential functions are evaluated on inputs they may never have seen at training time, resulting in unpredictable potential scores. Given a set of labels (e.g., 45 part-of-speech tags), a training set (i) (i) (i) {hx(1) , y(1) i, ...hx(N ) , y(N ) i}, where x(i) is an input sequence hx1 , x2 , ...xT i, (i) (i) (i) and y(i) is its corresponding output label sequence hy1 , y2 , ...yT i, the parameters, Λ, of an MEMM are trained to maximize the conditional likelihood, OM EM M (Λ, D) = |D| X X i (i) (i) (i) log p̃Λ (yt |yt−1 , xt ). t Here the locally-normalized subsets are individual “next states,” yt , conditioned on “source states” yt−1 , and there is a separate potential function for each source state. Potential functions are typically restricted to some exponential family. We present here an approach to training in pieces that provides robust output for incorrect assignments to neighboring variables outside the subset, and allows for limited interactions between subsets, while also preserving efficient local normalization. A piecewise-trained linear-chain CRF, with weight rescaling Maximum Entropy Classifier Traditional, all features trained jointly Training, Phase 1 2 2 2 1 2 1 1 1 ... Trained with "features bagged" Training, Phase 2 2 2 2 1 2 1 1 1 ... Testing 2 Independently-trained (ala Naive Bayes) 2 2 1 2 1 1 1 ... Figure 2: Left: An example two-stage method for piecewise-training a linear-chain CRF. Right: Variants on the traditional maximum entropy classifier, in which groups of features are trained independently. For each variable we introduce an additional value termed “none of the above,” or N OTA , (e.g. a 46th part-of-speech label). Traditional pseudo-likelihood training would assign training data to a particular potential function using the true labels in the training data. By contrast, we also assign training data to a potential function even when the conditioning values do not match this particular potential function; the correct predicted value in this case is N OTA . Inference at test time is performed with standard MEMM inference, with the N OTA state and all parameters associated with N OTA outcomes removed. Accordingly, training is performed such that all parameters associated with N OTA values are constrained to be zero. Thus the only way for N OTA to be correctly predicted is by reducing the strength of the parameters associated with other outcomes, given the current observations. Once N OTA is removed, the next-state distribution is nearly uniform, which in an MEMM is the next-state distribution that most equally “votes against” all possible outcomes. Formally, the objective function used in a piecewise-trained linear-chain model is OP T (Λ, D) = log |D| Y Y i (i) (i) (i) p̃Λ (yt |yt−1 , xt ) t (i) Y p̃Λ (N OTA |y, xt ). (i) y6=yt−1 Expanding this to show the partition function and the product of potential functions, we see OP T (Λ, D) = log |D| Y Y i t (i) (i) (i) φ(yt , yt−1 , xt ) P (i) (i) 1 + y φ(y, yt−1 , xt ) 1 Y (i) y 0 6=yt−1 1+ (i) 0 y φ(y, y , xt ) P MEMM CRF CRF-PT Training F1 99.89% 99.95% 99.82% Named entity Testing Training F1 Time 88.90 1 hr 89.87 9 hr 90.47 5 hr, 35 min Training Accuracy 99.1% 99.8% 99.08% POS tagging Testing Training Accuracy Time 88.1% 2 hr, 8 min 88.1% 14 hr 88.8% 2 hr, 30 min Table 1: Results on named-entity recognition and part-of-speech tagging. CRFs trained in pieces (CRF-PT) significantly outperform both regular MEMMs and CRFs. The training time of CRF-PT would be substantially further reduced with non-exhaustive y 6= yt−1 , (experiments forthcoming). = log |D| Y Y i t (i) (i) (i) φ(yt , yt−1 , xt ) , Q  P (i) 0 y0 1 + y φ(y, y , xt ) where the sum over y does not include N OTA (since they are captured with the included 1’s). This corresponds to approximating the partition function Z(Λ, x) with ! X YY (i) 0 (i) 1+ φ(y, y , xt ) . ZP T (Λ, x ) = t y0 y Tom Minka (personal communication) has pointed out that this seems to be a novel approximation to the partition function, and that a somewhat similar approximation is produced in the very beginning of belief propagation, when all messages are equal to 1, and the partition function is estimated by YXX (i) φ(y, y 0 , xt ). ZBP (Λ, x(i) ) ∝ t y0 y (i) The training data for the N OTA outcome, y 6= yt−1 , may be exhaustive, or randomly sampled, or chosen to include only those cases in which incorrectly had high marginal probability by joint inference with a previous parameter setting. A method similar to this last option has been previously used to successfully incorporate not all but some of the most important “unsupported features” in linear-chain CRFs (McCallum, personal communication). Furthermore, we can calibrate the magnitude of the parameters Λs across each subset s, by learning a per-subset multiplicative factor, αs Λs . Although this factor is learned via traditional global inference, its impact on training time is limited because it has such low dimensionality that optimization typically requires only a few gradient steps.3 Essentially N OTA outcomes allow limited communication between locally-normalized subsets, by allowing them to assign low potentials to incorrect variable assignments of conditioned variables. 4 Experiments Although piecewise training was motivated by the need to train large undirected models representing multiple interdependent sub-tasks, we show here that piecewise training can be beneficial even in graphical models as simple as a linear-chain. 3 These calibration parameters are not used in the preliminary experiments below, however. We present results on two natural-language tasks: part-of-speech tagging and named-entity recognition. First, for named-entity recognition, we use the CoNLL 2003 data set, consisting of 14,987 newswire sentences annotated with names of people, organizations, locations, and miscellaneous entities. We test on the standard development set of 3,466 sentences. Evaluation is done using precision and recall on the extracted chunks, and we report F1 = 2P R/P + R. Results are shown in Table 1. We compare a CRF, an MEMM, and a CRF-PT with exhaustively-added N OTA instances. Consistent with previous work, the CRF performs better than the MEMM. But with the addition of N OTA instances, the CRF-PT performs better than both the standard MEMM and the CRF. It appears that CRF-PT is overfitting less than the CRF, since CRF-PT has lower training accuracy despite its higher testing accuracy. All of the pairwise differences in table 1 are significant by McNemar’s test on the per-sentence labeling disagreements (p < 0.001). Second, in previous work (Lafferty et al., 2001), CRFs were shown to outperform MEMMs on part-of-speech tagging. Here we test whether training in pieces addresses the previouslyobserved problems with local normalization on a POS data set. For these preliminary experiments, we used a very small subset of 1,154 sentences, randomly sampled from sections 0–18 of the Penn Treebank WSJ corpus. We evaluated on all 5,527 sentences of of sections 20 and 21. The Treebank tag set contains 45 tags. In this experiment, we achieved better performance by including only a few N OTA interactions. In particular, after twenty iterations of training, we added a N OTA term of the form p(N OTA |yt ) for all incorrect yt in the training set that the model assigned probability greater than 0.2. On this small training set, the MEMM and the CRF had identical performance. The training set is so small that the CRF’s greater capacity to overfit negates its advantage in avoiding label bias. Still, the locally trained CRF achieves significantly better performance than the CRF and the MEMM. This difference is significant by a paired t-test on the number of incorrect tags per sentence (p < 0.001). In both data sets, we found that using local normalization at test time performed better than CRF-style global testing with the parameters learned from CRF-TP training. This is not surprising, because one might expect that the weights from each of the separately-trained pieces would have different scale. For the NER results that we report, we globally train a per-state scaling factor, as mentioned in the previous section. For the POS results in Table 1, however, we used locally-normalized MEMM testing. 5 Related Work There are several examples in the literature of undirected models trained in locallynormalized pieces. Pseudolikelihood (Besag, 1975) is a well-known method for training a globally-normalized model using local distributions. In pseudolikelihood, parameters are trained to maximize the likelihood of each predicted variable, conditioned on the true values of the neighboring variables. The MEMM training objective is actually very similar to the pseudolikelihood objective, except that in the MEMM objective, the local term for each node is conditioned only on the previous node, not on both neighbors as in pseudolikeliood. It would be interesting to see whether the N OTA technique can be used to improve the performance of pseudolikelihood training as well. The MEMM objective has also been used by others, including Punyakanok and Roth (2001) and Klein et al. (2003). Pseudolikelihood has had some success in applications. For example, Toutanova et al. (2003) achieve state-of-the-art performance on part-of-speech tagging using a cyclic dependency network trained using pseudolikelihood. Also, pseudo-likelihood has been used for grid-shape CRFs in computer vision (Kumar & Hebert, 2003). Roth (2002) has advocated training disjoint classifiers, and then performing joint inference at test time in an approach he terms “training with classifiers.” Kakade, Teh, and Roweis (2002) show that label bias in MEMMs can be somewhat ameliorated by training on the marginal probability of single labels. With this training objective, MEMMs actually perform better on token accuracy than CRFs on an extraction data set. To compute the marginal likelihood, however, requires forward-backward, and therefore is just as computationally intensive as global CRF training. 6 Conclusions We have described Training-in-pieces, a learning method for providing both improved training speed and reduced overfitting. Initial success raises additional interesting questions. How should subset boundaries be best selected? What choices of limited interaction are best? How can sparse subsets of y 6= yt−1 be most effectively selected? We are also considering what additional terms may be included to help account for overlapping pieces “double-counting” certain variables. We are planning to apply these methods to more complex graphical models, including coreference, parsing, and cascades of numerous NLP processing steps. Acknowlegments We thank Tom Minka for pointing out the relation of the approximate partion function to loopy belief propagation. This work was supported in part by the Center for Intelligent Information Retrieval and in part by The Central Intelligence Agency, the National Security Agency and National Science Foundation under NSF grant #IIS-0326249 and grant #IIS0427594. Any opinions, findings and conclusions or recommendations expressed in this material are the author(s) and do not necessarily reflect those of the sponsor. References Besag, J. (1974). Spatial interaction and the statistical analysis of lattice systems. J. R. Statist. Soc. B, 36, 192–236. (with Discussion). Besag, J. (1975). Statistical analysis of non-lattice data. The Statistician, 24, 179–195. Hammersley, J., & Clifford, P. (1971). Markov fields on finite graphs and lattices. Unpublished manuscript. Kakade, S., Teh, Y. W., & Roweis, S. (2002). An alternative objective function for markovian fields. In Proceedings of the Nineteenth International Conference on Machine Learning. Klein, D., Smarr, J., Nguyen, H., & Manning, C. D. (2003). Named entity recognition with characterlevel models. Proceedings the Seventh Conference on Natural Language Learning (pp. 180–183). Kudo, T., & Matsumoto, Y. (2001). Chunking with support vector machines. Kumar, S., & Hebert, M. (2003). Discriminative fields for modeling spatial dependencies in natural images. In S. Thrun, L. Saul and B. Schölkopf (Eds.), Advances in neural information processing systems 16. Cambridge, MA: MIT Press. Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. Proc. 18th International Conf. on Machine Learning. Li, W., & McCallum, A. (2004). A note on semi-supervised learning using markov random fields. Technical Note. McCallum, A. (2003). Efficiently inducing features of conditional random fields. Nineteenth Conference on Uncertainty in Artificial Intelligence (UAI03). McCallum, A., Freitag, D., & Pereira, F. (2000). Maximum entropy Markov models for information extraction and segmentation. Proc. ICML 2000 (pp. 591–598). Stanford, California. McCallum, A., & Wellner, B. (2003). Toward conditional models of identity uncertainty with application to proper noun coreference. IJCAI Workshop on Information Integration on the Web. Punyakanok, V., & Roth, D. (2001). The use of classifiers in sequential inference. NIPS 13. Roth, D. (2002). Reasoning with classifiers. Proc. of European Conference on Machine Learning. Sutton, C., & McCallum, A. (2004). Collective segmentation and labeling of distant entities in information extraction. ICML workshop on Statistical Relational Learning. Sutton, C., Rohanimanesh, K., & McCallum, A. (2004). Dynamic conditional random fields: Factorized probabilistic models for labeling and segmenting sequence data. Proceedings of the TwentyFirst International Conference on Machine Learning (ICML-2004). Taskar, B., Abbeel, P., & Koller, D. (2002). Discriminative probabilistic models for relational data. Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI02). Toutanova, K., Klein, D., Manning, C. D., & Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. HLT-NAACL 2003. Wellner, B., McCallum, A., Peng, F., & Hay, M. (2004). An integrated, conditional model of information extraction and coreference with application to citation matching. Conference on Uncertainty in Artificial Intelligence (UAI).