Does Learning Syntax Help Models Learn Language?

Stanford CS224N Custom Project

Lian Wang
Department of Computer Science
Stanford University
lianwang@stanford.edu

Abstract

Papadimitriou and Jurafsky (2020) showed that LSTMs trained on nonlinguistic structural data performed significantly better than random baselines when evaluated zero-shot on language tasks, which suggests that models can learn generalizable structure independent of specific vocabularies. In this paper, I replicate that finding for transformer models and introduce a new synthetic corpus that captures a different type of structure (i.e., distributional categories). I find that models trained on this corpus outperform models trained on corpora with binary dependency structures, which shows that models are sensitive to finer-grained structural differences and that certain types of structure serve as better inductive biases for language learning than others.

1 Key Information to include

• Mentor: Isabel Papadimitriou
• External Collaborators: N/A
• Sharing project: No

2 Introduction

Language models display a remarkable amount of syntactic “knowledge”, despite never being taught combinatorial rules and only being optimized for next-word prediction. Much analytic work has been dedicated to understanding how much syntax models actually learn and how they represent it (e.g., Linzen et al., 2016; Hewitt and Manning, 2019; Chi et al., 2020). On the engineering side, understanding how these models work and, importantly, diagnosing where they excel and fail is crucial for designing better systems.

Moreover, the question of how syntax is learned, represented, and used is also of theoretical interest. For human language, it is hypothesized that, through a combination of biological bias and later language exposure, we learn a set of syntactic operations and constraints that are independent of specific lexical items or even the features of a specific language (Chomsky, 1965; Hauser et al., 2002). Neural models, on the other hand, rely heavily on semantic features and co-occurrence statistics of specific lexical items (Papadimitriou et al., 2021), yet they are able to display somewhat human-like linguistic behavior. It is therefore interesting for linguists to see how well a purely probabilistic model can learn language and how language is represented in such a system, and important for computer scientists to study such models' limitations and fundamental differences from the human language system.

Beyond simply asking how much syntax models know, there is a more specific question that pertains to the above motivations: how much purely structural knowledge do models learn and utilize, independent of the semantics of a specific system? Using the method proposed in Papadimitriou and Jurafsky (2020), I probe this question by studying the extent to which different types of structural information serve as inductive biases for further training and evaluation on natural language.

3 Related work

Past work has approached the question of how models represent syntax in many different ways. Using a probing approach, Hewitt and Manning (2019) recover syntactic tree distances between two words in a sentence through a linear transformation on their vector representations, showing that models do implicitly encode structural distance. Chi et al. (2020) further show that models also encode dependency labels in a way that holds cross-linguistically.
By training classifiers to detect certain grammatical features, Papadimitriou et al. (2021) show that models are furthermore sensitive to subtle language-specific parameters, like morphosyntactic alignment. Interestingly, Papadimitriou et al. (2021) also show that classifier decisions about subjecthood, usually considered a syntactic notion, are highly dependent on semantic features like animacy and agency, which tend to co-occur with subjects. This highlights that, while models encode a substantial amount of syntax, there is no clear separation between their syntactic and semantic knowledge. I am therefore interested in whether, and to what extent, models encode and utilize syntactic knowledge independent of specific vocabulary semantics.

A recent approach that gets at this question is proposed by Papadimitriou and Jurafsky (2020), who trained models on several non-linguistic corpora (L1s) designed to capture various types of structural information, then finetuned and evaluated the models on a natural language dataset (L2). They found that models trained on structural L1s performed better than random baselines when evaluated zero-shot on the L2 (Spanish), despite no overlap in vocabulary. This suggests that models were able to learn structure that generalized across different systems, independent of specific vocabulary semantics, and to use that knowledge in natural language predictions. They also found performance differences between the various structural L1s, which raises the question of what led to those differences.

The current study adopts the same approach as Papadimitriou and Jurafsky (2020) and further explores how different types of structural biases correspond to differences in performance. Their original L1s do not represent the structure of natural language, but rather that of various other nonlinguistic systems. I therefore introduce a new structural L1 that captures the distribution of syntactic categories in an actual natural language dataset, in order to test to what extent structural closeness to natural language, and which types of structural abstraction, contribute to effectiveness as an inductive bias.

4 Approach

4.1 Structural L1s

In order to capture natural language syntax, I designed and wrote code to create several corpora that abstract different structural aspects of natural language syntax, and ultimately chose to use one comprising the part-of-speech (POS) tags of a language corpus. This POS corpus captures dependencies without losing information about the relationship between types of words and their distribution. Specifically, I used a neural parser to parse English language data, extracted the POS tags from the parsed dataset, and reshaped them into the format required to load them as a Hugging Face Dataset object (Lhoest et al., 2021), with each tag represented as an integer over a vocabulary of all possible POS tags. The vocabulary size of the corpus is 76, corresponding to the number of POS tags used by the parser (a construction sketch is given below).

I also re-implemented the Flat Parentheses and Nested Parentheses corpora from Papadimitriou and Jurafsky (2020) using existing code. The Flat Parentheses corpus consists of pairs of identical integers placed independently, thus allowing crossing dependencies. The Nested Parentheses corpus consists of pairs of identical integers nested hierarchically, thus not allowing crossing dependencies, which is a constraint found in natural language.
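To make the POS corpus construction concrete, the following is a minimal sketch of the kind of code involved, not the project's actual implementation: the function name, the "input_ids" column name, the fixed-length chunking, and the wikitext_lines variable in the usage comment are illustrative assumptions (the parser, spaCy's en_core_web_sm, is the one named in Section 5.1).

```python
# Minimal sketch of the POS-corpus construction (Section 4.1); names and chunking
# details are illustrative assumptions, not the project's actual code.
import spacy
from datasets import Dataset

nlp = spacy.load("en_core_web_sm")  # the parser named in Section 5.1


def build_pos_corpus(texts, line_length=512):
    """Replace each token with an integer ID for its fine-grained POS tag."""
    tag2id, ids = {}, []
    for doc in nlp.pipe(texts):
        for tok in doc:
            # tok.tag_ is the fine-grained (Penn Treebank style) tag
            ids.append(tag2id.setdefault(tok.tag_, len(tag2id)))
    # reshape the flat stream of tag IDs into fixed-length lines, dropping any remainder
    rows = [ids[i:i + line_length]
            for i in range(0, len(ids) - line_length + 1, line_length)]
    return Dataset.from_dict({"input_ids": rows}), tag2id


# e.g., corpus, tag_vocab = build_pos_corpus(wikitext_lines)
# len(tag_vocab) then gives the size of the tag inventory (76 in this project).
```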
I created these two corpora with a vocabulary size of 100 and a corpus size of around 102M tokens, to control for vocabulary and corpus size across the different corpora.

4.2 Baselines

Consistent with Papadimitriou and Jurafsky (2020), I trained models on two random corpora as baselines. These corpora comprise integers sampled randomly, one from a Uniform distribution over the vocabulary (where each word is equally likely to be sampled) and one from a Zipfian distribution (where common words are more likely to be sampled than others; this is taken to resemble the actual distribution of natural language words). Both corpora have a vocabulary size of 76 and a corpus size of around 102M tokens. Across all synthetic corpora, the line length is fixed at 512.

I make use of two additional baselines, neither of which is trained on any synthetic corpus. First, I train a GPT-2 Small model from scratch during the finetuning stage; it is therefore randomly initialized and trained for only a small number of steps. Second, I use a pretrained GPT-2 Small model, which I finetune along with all the other models, forcing it to re-learn its embeddings.

4.3 Model architecture

Unlike the experiments in Papadimitriou and Jurafsky (2020), which used LSTM models, I probe the behavior of a transformer model. Specifically, I use the GPT-2 Small architecture, which consists of 12 decoder transformer blocks and 124M parameters (Radford et al., 2019). Additionally, unlike the original paper, where parameter weights were frozen after the pretraining stage and only the embedding layer was finetuned, I do not freeze the parameter weights, but instead simply let the models train for a small number of steps. I trained all my models using code adapted from the Mistral codebase (Karamcheti et al., 2021; https://github.com/stanford-crfm/mistral), which is built on Hugging Face models. I adapted finetuning code written by the authors of the original paper and shared privately.

5 Experiments

5.1 Data

For both extracting POS tags and finetuning models, I use wikitext-103 as hosted on Hugging Face, which comprises 103M English tokens extracted from Wikipedia articles (Merity et al., 2016). The POS corpus uses the “en_core_web_sm” parser provided by spaCy (Honnibal et al., 2020), whose POS tags are based on the Penn Treebank annotation scheme (Marcus et al., 1993).

5.2 Evaluation method

To evaluate my models, I first use the standard metric of perplexity (PPL), the exponential of the evaluation loss at the finetuning stage. A lower perplexity indicates that the model performed well on the test set, which I take to reflect how well it learned English in the limited number of finetuning steps.

I also use the syntax challenge set provided by SyntaxGym (Gauthier et al., 2020; https://syntaxgym.org/; https://huggingface.co/spaces/cpllab/syntaxgym) as an additional targeted evaluation metric, which the original paper did not make use of. This allows me to compare model performance on syntax-specific tasks and to see whether models rank differently than they do under the broad PPL metric. SyntaxGym consists of 33 test suites, each containing 20-80 minimal pairs designed to test knowledge of a specific grammatical construction. A minimal pair may consist of a grammatical and an ungrammatical sentence, and the model makes the correct prediction if it assigns higher probability to the grammatical sentence. I integrated this as a Hugging Face Metric into my evaluation code.
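As a concrete illustration of these two metrics, here is a minimal sketch; it is not my actual evaluation code or the official SyntaxGym metric, and the whole-sentence probability comparison and helper names are simplifying assumptions.

```python
# Minimal sketch of the two evaluation metrics in Section 5.2; not the actual
# SyntaxGym Metric integration, and the whole-sentence comparison is a
# simplifying assumption.
import math
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()


def sentence_logprob(sentence: str) -> float:
    """Total log-probability the model assigns to a sentence."""
    ids = tok(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)
    # out.loss is the mean per-token negative log-likelihood over the predicted tokens
    return -out.loss.item() * (ids.shape[1] - 1)


def perplexity(eval_loss: float) -> float:
    """Perplexity is the exponential of the mean evaluation loss."""
    return math.exp(eval_loss)


def score_minimal_pair(grammatical: str, ungrammatical: str) -> bool:
    """Correct if the grammatical variant receives higher probability."""
    return sentence_logprob(grammatical) > sentence_logprob(ungrammatical)
```

Accuracy on a test suite is then simply the fraction of its minimal pairs scored correctly.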
I excluded six of the test suites that targeted garden path sentences, as those do not evaluate grammaticality but rather similarity to human processing effects.

5.3 Experimental details

I trained five GPT-2 Small models on the five different synthetic corpora (POS, Nested Parentheses, Flat Parentheses, Random Zipf, Random Uniform) as L1s for 10,000 steps. I then finetuned each of the trained models once with sampled embeddings and once with pretrained embeddings. In the former condition, new embeddings were sampled from the old embeddings learned in the training stage; in the latter condition, the pretrained GPT-2 embeddings were used. Thus I ended up with two finetuned models for each of the synthetic corpora.

At the finetuning stage, I also introduced the two baselines described in Section 4.2: a pretrained GPT-2 Small model and a GPT-2 Small model trained from scratch. Both were finetuned with the same settings as the synthetic-corpora models. For the Pretrained GPT-2 Small model, this meant that the model had to relearn its embeddings. The From Scratch model was trained from scratch for only the number of finetuning steps. All models, including the synthetic-L1 models, were finetuned for 1,000 steps.

For training, I mostly used the default Mistral optimizer configuration, which uses the AdamW optimizer with a starting learning rate of 6e-7. I trained my models on the synthetic corpora with a device batch size of 14 and an effective batch size of 518, and I finetuned with a batch size of 8 (an approximate Hugging Face equivalent of these settings is sketched after Table 1). The synthetic-L1 models each trained for 1-2 days, and finetuning took around 10 hours to complete.

5.4 Results

For the synthetic-L1 models, I found the generally expected gradation of perplexities, corresponding to the amount of structure in the synthetic corpus (see Figure 1a). The ordering of their perplexities (in the sampled-embeddings condition) is as follows: POS < Nested Parentheses < Flat Parentheses < Random Zipf < Random Uniform. The differences between the perplexity scores are substantial. The two baseline models not trained on a synthetic corpus (From Scratch and Pretrained GPT-2) both perform better than the other models.

For the models finetuned with pretrained embeddings (Figure 1b), we see a much smaller difference between the perplexities of the From Scratch, POS, Random Zipf, and Flat Parentheses models. At the two extremes, the Pretrained GPT-2 model performed significantly better, while the Random Uniform model performed significantly worse. The Nested Parentheses model, contrary to the results with sampled embeddings, performed worse than both the Flat Parentheses and Random Zipf models, but still significantly better than the Random Uniform model.

Figure 1: Perplexity scores on the English test set, (a) with sampled embeddings and (b) with pretrained embeddings. Lower perplexity indicates better performance.

All the models performed poorly (below chance) on the SyntaxGym test set, with similar overall accuracies between 0.2-0.3 for sampled embeddings, and a wider range between 0.18-0.68 for pretrained embeddings (Table 1).

Model              Accuracy (Pretrained)   Accuracy (Sampled)
GPT-2 Pretrained   0.680                   0.296
Nested Parens      0.324                   0.209
From Scratch       0.271                   0.254
Flat Parens        0.263                   0.238
POS                0.260                   0.216
Random Zipf        0.258                   0.263
Random Uniform     0.186                   0.220

Table 1: Overall accuracy scores on the SyntaxGym test set. An accuracy score of 1.000 would indicate that the model made all correct predictions.
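As referenced in Section 5.3, the following is an approximate Hugging Face equivalent of the reported training settings, included only for illustration. The actual runs used the Mistral codebase, so the real configuration files and flags differ; the output paths and the single-device gradient-accumulation value are assumptions.

```python
# Approximate Hugging Face equivalent of the hyperparameters in Section 5.3.
# The actual runs used the Mistral codebase; the output paths and the single-GPU
# gradient-accumulation assumption (14 x 37 = 518) are illustrative.
from transformers import TrainingArguments

l1_training_args = TrainingArguments(
    output_dir="gpt2-small-synthetic-l1",       # hypothetical output path
    max_steps=10_000,                           # synthetic-L1 training steps
    per_device_train_batch_size=14,             # device batch size
    gradient_accumulation_steps=37,             # effective batch size of ~518 on one device
    learning_rate=6e-7,                         # AdamW starting learning rate
    optim="adamw_torch",
)

finetune_args = TrainingArguments(
    output_dir="gpt2-small-wikitext-finetune",  # hypothetical output path
    max_steps=1_000,                            # finetuning steps
    per_device_train_batch_size=8,
    learning_rate=6e-7,
    optim="adamw_torch",
)
```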
If we only consider the test suites where mean accuracy across all models was above 0.5 (Figure 2), we do see an ordering of accuracy scores that reflects the degree of structure in each model's L1, albeit with small differences that might not be significant.

Figure 2: Overall accuracy scores on SyntaxGym test suites where mean accuracy across models was above 0.500. Six test suites fit this criterion.

6 Analysis

Replicating the results of the original paper, I found that models trained on any structure mostly outperformed the random baselines. This suggests that models are able to learn and utilize generalized structural knowledge independent of the semantics of an individual system, as there is no overlap in semantics or vocabulary between the nonlinguistic and linguistic corpora. (One note of caution: while I have shown that models can learn generalizable structural knowledge independent of semantics, this does not imply that they actually do so when training on natural language data.)

Moreover, I showed that models are sensitive to finer-grained differences between types of structure. Between the two random models, the Random Zipf model significantly outperformed Random Uniform in PPL and SyntaxGym accuracy in both the pretrained- and sampled-embeddings conditions. In fact, the Random Zipf results grouped closer to those of the structural models than to those of the Random Uniform model. This is contrary to what the original paper found: there, while the performance difference between Random Zipf and Random Uniform was significant, both models still performed much worse than the structural models. This difference can be attributed to many factors: the difference in model architecture (GPT-2 vs. LSTM), the difference in vocabulary size (76 vs. 50,000), training configurations, or the performance of the other models. Importantly, this result shows that models are sensitive to word distribution, even with no additional structural features.

The performance differences between the models trained on structural L1s were less consistent but still present. The POS model outperformed both parentheses models in PPL in both embedding conditions. I speculate that this is because the POS language captures richer structural information than either of the parentheses corpora. The dependencies between “words” in the POS corpus are subject to categorical differences, and each word can be consistently dependent on multiple words. For example, a verb is consistently distributed in a certain pattern (perhaps often between two nouns), and words like prepositions may form dependencies with both the verb and the following noun. This is the type of dependency we find in human language. Additionally, since the POS tags were extracted directly from an actual English dataset, the distribution of categories corresponds directly to the distribution in human language text, so the POS L1 is also closer to human language in this implementational way. One baseline that could be included in future work is to sample the 76 vocabulary items with general consideration for word order, but not from a direct parse of language data, in order to tease apart the roles of the type of abstracted structure and of the distribution that directly reflects human language text.

In the parentheses models, on the other hand, each word is dependent on only one other word in the corpus (namely, its closest identical twin).
This representation is a very reduced abstraction of dependencies that allows one “word” to depend on only a single other word, and it does not allow differentiation between categories of dependencies. Moreover, while the dependency lengths were sampled from a distribution of dependency lengths found in human language, the placement of the “parentheses” (i.e., integer pairs) relative to each other did not follow any language-based distribution; thus these parentheses L1s also do not capture information about how dependency lengths are distributed in a human language sample. There seemed to be no significant difference between the performance of the Nested Parentheses and Flat Parentheses models (which is consistent with the results in the original paper). This result may be due to several factors. Perhaps, while models are sensitive to broad types of structure (distributional/categorical as in POS tags vs. binary dependencies), they simply are not sensitive enough to distinguish between different binary dependency structures. Another possibility is that both types of structure are sufficient for models to learn language to a certain degree. There are also implementational considerations: we cannot definitively claim that nested hierarchy does not matter, as the parentheses corpora may simply not adequately capture that type of structure.

Looking at the SyntaxGym accuracy scores, we also notice that the ranking of models based on perplexity is only roughly reflected in how well the models performed on the challenge set. The accuracies may be too low across the board to make definitive claims about the types of tasks the different models are good at. It is nevertheless illuminating that the models performed so poorly on the challenge set; just learning a small amount of language poorly seems to greatly hinder performance on difficult syntactic subtleties. I discuss a possible reason for the poor SyntaxGym results below.

While this project is mostly focused on how models trained on different structural corpora perform compared to each other and against random baselines, it is worth noting that all of the models pretrained on a synthetic L1 performed significantly worse than both the From Scratch and Pretrained GPT-2 models. This shows that, for substantially different systems (perhaps especially ones with very different vocabulary sizes), training model parameters on one system is detrimental to transferring them to the other; the difficulty of transfer learning offsets any potential gains from general structural knowledge. Wu et al. (2022) showed that the primary difficulty in transfer learning is learning new embeddings. This is illustrated by the performance difference of the Pretrained GPT-2 model between the two embedding conditions. When embeddings were sampled, the Pretrained GPT-2 model did similarly to or worse than the From Scratch model, despite having been trained on much more data. But when Pretrained GPT-2 used the pretrained embeddings, its performance skyrocketed (most notably evidenced by its high accuracy on the SyntaxGym metrics), presumably because of the direct correspondence between its own embeddings and the new ones.

The difficulty of learning new embeddings may also help explain the poor performance of all models (excluding Pretrained GPT-2) on the SyntaxGym challenge set.
Although designed to test syntactic knowledge, most of the test suites center on very subtle grammatical differences that depend on rich knowledge of the relevant lexical items; i.e., good embeddings are a prerequisite. Thus the inability to learn the embedding layer well likely masked any possible difference between the models’ syntactic abilities (though it is still unclear to me why these models performed well below chance).

7 Conclusion

I showed that models can learn and utilize generalized structure across different systems, and that they are sensitive to fine-grained differences in types of structure. Specifically, distributional structures that retain categorical dependencies are more effective inductive biases for language learning than binary dependency structures, possibly because they capture richer structural information and are closer to natural language syntax. However, training on a different system with a completely different vocabulary, in both items and size, is detrimental to model performance and offsets the advantage gained by the structural bias. The difficulty of learning new embeddings is a limitation of the transfer-learning approach to probing such questions, as the poorly learned embeddings mask possible differences in language behavior. While it is analytically interesting that models are sensitive to different types of structure, bolstering language syntax through targeted structure-training may not be practically desirable.

Future work should expand upon this approach by training models on a wider range of structural corpora and exploring other training processes. For all of the current corpora, we should run trials with different vocabulary and corpus sizes to control for these effects. We should also create additional synthetic corpora representing several broader types of structure (e.g., other ways to represent binary dependencies, or distributional categories), to find the level of structural difference that models are sensitive to. In addition, future work should explore alternative training schemes or evaluation methods that can lessen the effect of poor embeddings.

References

Ethan A. Chi, John Hewitt, and Christopher D. Manning. 2020. Finding universal grammatical relations in multilingual BERT. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 5564–5577, Online. Association for Computational Linguistics.

Noam Chomsky. 1965. Aspects of the Theory of Syntax, 50th anniversary edition. The MIT Press.

Jon Gauthier, Jennifer Hu, Ethan Wilcox, Peng Qian, and Roger Levy. 2020. SyntaxGym: An online platform for targeted evaluation of language models. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 70–76, Online. Association for Computational Linguistics.

Marc D. Hauser, Noam Chomsky, and W. Tecumseh Fitch. 2002. The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598):1569–1579.

John Hewitt and Christopher D. Manning. 2019. A structural probe for finding syntax in word representations. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4129–4138, Minneapolis, Minnesota. Association for Computational Linguistics.

Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. 2020. spaCy: Industrial-strength Natural Language Processing in Python.
Siddharth Karamcheti, Laurel Orr, Jason Bolton, Tianyi Zhang, Karan Goel, Avanika Narayan, Rishi Bommasani, Deepak Narayanan, Tatsunori Hashimoto, Dan Jurafsky, Christopher D. Manning, Christopher Potts, Christopher Ré, and Percy Liang. 2021. Mistral - a journey towards reproducible language model training.

Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander Rush, and Thomas Wolf. 2021. Datasets: A community library for natural language processing. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 175–184, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

Tal Linzen, Emmanuel Dupoux, and Yoav Goldberg. 2016. Assessing the ability of LSTMs to learn syntax-sensitive dependencies.

Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2):313–330.

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. 2016. Pointer sentinel mixture models.

Isabel Papadimitriou, Ethan A. Chi, Richard Futrell, and Kyle Mahowald. 2021. Deep subjecthood: Higher-order grammatical features in multilingual BERT. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 2522–2532, Online. Association for Computational Linguistics.

Isabel Papadimitriou and Dan Jurafsky. 2020. Learning Music Helps You Read: Using transfer to study linguistic structure in language models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 6829–6839. ACL.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI.

Zhengxuan Wu, Isabel Papadimitriou, and Alex Tamkin. 2022. Oolong: Investigating what makes crosslingual transfer hard with controlled studies. CoRR, abs/2202.12312.

Appendix

Figure 3: Each model's SyntaxGym accuracy score (x-axis) plotted against its perplexity (y-axis). This graph visualizes how the relationship between model perplexity and SyntaxGym accuracy is not direct; I suspect the differences in performance between models on the SyntaxGym challenge set may be largely random, or at least minimally informative.