CS101: Mathematical and Computational Linguistics
Winter 2015: the class meets Tuesdays and Thursdays, 10:30-11:55, in Room 107
ANB (the Annenberg building)
Instructor:
Matilde Marcolli
There will be three TAs for the class: Vinny Augustine, Shival Dasu,
and Vibhor Kumar
(Darren Waterston, "Linguistics", oil on wood, 2002)
Brief Course Description
The class will cover mathematical and computational models of
acquisition and evolution of natural languages. We will discuss
learnability questions, Markov chain models, population dynamics
models, evolutionary behavior, communicative efficiency and fitness.
We will focus in particular on the Principles and Parameters model
of linguistics and we will discuss the use of mathematical methods,
involving algebraic geometry, topology, and statistical physics, to
describe the evolution of natural languages. Specific examples from
historical linguistics will be revisited from a mathematical and
computational perspective.
Bibliography
- Partha Niyogi, "The
computational nature of language learning and evolution",
MIT Press, 2006.
Additional references and reading materials
- Andras Kornai, "Mathematical Linguistics", Springer, 2010.
- G.E. Revesz, "Introduction to formal languages", McGraw-Hill, 1983.
- Christopher D. Manning, Hinrich Schuetze, "Foundations of statistical
natural language processing", MIT Press, 1999.
- Mark C. Baker, "The Atoms of Language", Basic Books, 2001.
- Peter Forster and Colin Renfrew, "Phylogenetic methods and
the prehistory of language", McDonald Institute Monographs, 2006.
- Prashant Parikh, "Language and equilibrium", MIT Press, 2010.
- Charlotte Galves et al. (Eds.), "Parameter theory and linguistic change", Oxford, 2012.
- G.E. Barton, R.C. Berwick, E.S. Ristad,
"Computational complexity and natural language", MIT Press, 1987.
- Ruslan Mitkov, "The Oxford Handbook of Computational Linguistics",
Oxford 2009.
- Chris Heunen et al. (Eds.), "Quantum physics and linguistics",
Oxford, 2013.
Additional references to papers will be added as the class progresses
Lecture Outlines
A brief outline of the lectures will be added here:
- Tuesday January 6: What is Linguistics? World language families, diachronic/synchronic viewpoints, Levels of structure: phonology, morphology, syntax, semantics; Overview of phonetics: IPA charts, tones and suprasegmental features, autosegmental phonology, feature geometry, optimality theory
- Thursday January 8: Overview of morphology: allomorphy, morphological typology, polysynthetic languages, lexicology; Overview of Syntax: transformational grammars, principles and parameters, government and binding, minimalist program, head-driven phrase structure grammar, lexical functional grammar, tree-adjoining grammar (continued on Jan 13)
- Tuesday January 13: Historical linguistics: sound changes, borrowing,
analogical change, semantic shifts, syntactic changes, grammaticalization,
comparative methods and reconstruction of proto-languages, phylogenetic
linguistics, family branches by shared innovations
- Thursday January 15: Phylogenetic Linguistics: computational methods,
Swadesh lexical lists, cognates and coding of lexical data, neighbor-join
method, Q-test matrices, UPGMA, maximum parsimony, maximum likelihood,
Bayesian inference trees; Wave Theory of language change; origins of
modern linguistics from Panini to de Saussure and Chomsky
- Tuesday January 20: Phylogenetic Inference: Hidden Markov Models,
invariants, Viterbi sequence, polynomial formulation, phylogenetic
algebraic geometry, model parameters, Segre embeddings, Secant variety
of a Segre variety, determinantal varieties, tropical semiring and
tropicalization, Viterbi sequence and tropical polynomials, Newton
polygon, normal fan, and inference functions
- Thursday January 22: Formal languages: grammars, context free and context
sensitive, the Chomsky hierarchy, Types and Machine Recognition, finite
state automata, pushdown stack automata, Turing machines, recursively
enumerable grammars, linear bounded automata
- Tuesday January 27: formal languages from group theory, word
problem, recursive languages, regular languages, Cayley graphs and
context free languages; context-free
grammars and parse trees, ambiguities, parse trees and natural
languages, operations, transformational grammars, tree-adjoining
grammar, non-context-freeness of natural languages
- Thursday January 29: Probabilistic linguistics:
Bernoulli and Markov measures, hidden Markov models,
probabilistic context free grammars, sentence
probabilities, inside and outside
probabilities (Chomsky normal form), training,
probabilistic tree adjoining grammars
- Tuesday February 3: Graph grammars, context free case,
examples from Feynman graphs and quantum field theory, insertion
Lie algebra; Languages and Complexity: physical systems from
formal languages (subshifts of finite type, random walk with barrier,
self-avoiding random walk), Kolmogorov complexity and data compression,
morphological complexity of languages, Kolmogorov complexity and entropy,
Kraft inequality, optimal encoding and Shannon entropy,
universal Levin probability, syntactic parameters and complexity, Zipf's
law and entropy, Zipf's law and complexity
- Thursday February 5: Coding theory and linguistics: error correcting
codes and code languages, code parameters, asymptotic bound, Gilbert-Varshamov
bound and Shannon random code ensemble, asymptotic bound and Kolmogorov
complexity, asymptotic bound as phase transition, syntactic parameters
and the asymptotic bound: a measure of language relatedness (continued
on February 10)
- Tuesday February 10: Natural Language Processing:
tagging, collocations, disambiguation, supervised and
unsupervised learning, machine translation, text alignment,
translation probabilities, information retrieval, vector space model
- Thursday February 12: CLASS CANCELLED due to ongoing
Student-Faculty Conference
- Tuesday February 17: Models of language acquisition:
learning algorithm, inductive inference approach, learnability,
probabilistic learnability
- Thursday February 19: Models of language acquisition:
probably approximately correct model, statistical learning theory,
weak convergence, Vapnik-Chervonenkis dimension and learnability;
Syntactic Parameter setting: 3-parameter model, Gibson-Wexler Trigger
Learning Algorithm, Parameter space as a Markov Chain, closed sets
and learnability
- Tuesday February 24: population dynamics of 2-language model,
trigger learning algorithm, batch error-based learning, cue-based learning
- Thursday February 26: multiple languages, Markov
Chain model, homogeneous and non-homogeneous population, multilingual
learners, bilingualism, communicative fitness, languages as
association matrices (measures)
- Tuesday March 3: Communicative Fitness,
languages as association matrices, encoders/decoders, communicability,
best response approximation algorithm, learning with full and partial
information; linguistic coherence as
emergent property, cue-based social learning
- Thursday March 5: statistical physics models of language
learning and evolution; automatic construction of symbolic parsers
for syntactic parameters; parameter
setting via conditional entropies; algebraic versus
probabilistic methods in Linguistics
- Tuesday March 10: student presentations
Slides
Slides will be posted here as the class progresses
- What is Linguistics? (Part I) general introduction, phonology
- What is Linguistics? (Part II) morphology, syntax
- What is Linguistics? (Part III) historical linguistics,
phylogenetic linguistics, wave theory, roots of modern linguistics
- Geometry of Phylogenetic Inference: hidden Markov models and
polynomial maps, phylogenetic algebraic geometry, tropicalization and
Viterbi sequence
- Formal Languages (Part I) context free, context sensitive, Chomsky
hierarchy, types and machine recognition, finite state automata, pushdown
stack automata, Turing machines
- Formal Languages (Part II) formal languages from group theory, word
problem, recursive languages, regular languages, Cayley graphs and
context free languages
- Parsing Trees: from formal languages to natural languages; context-free
grammars and parse trees, ambiguities, parse trees and natural
languages, operations, transformational grammars, tree-adjoining
grammar, non-context-freeness of natural languages (Swiss German)
- Probabilistic Linguistics: Bernoulli and Markov measures,
hidden Markov models, probabilistic context free grammars,
probabilistic tree adjoining grammars
- Graph Grammars: parallelism and graph grammars, examples based on
Feynman graphs
- Languages and Complexity: Kolmogorov complexity, morphological
and syntactic complexity, Zipf's law
- Coding Theory and Linguistics: error correcting codes and
code language, code parameters, asymptotic bound and Kolmogorov
complexity, syntactic parameters, language families and codes
- Natural Language Processing: tagging, collocations, disambiguation,
supervised and unsupervised learning, machine translation, text
alignment, translation probabilities, information retrieval, vector
space model
- Models of Language Acquisition: learning algorithm, inductive
inference approach, learnability, probabilistic learnability
- Models of Language Acquisition: Part II: probably approximately
correct model, statistical learning theory, weak convergence,
Vapnik-Chervonenkis dimension and learnability
- Language Acquisition: Parameter Setting 3-parameter model, Gibson-Wexler
Trigger Learning Algorithm, Parameter space as a Markov Chain, closed sets
and learnability
- Language Acquisition and Parameters: Part II Learning Algorithms and
(inhomogeneous) Markov Chains
- Models of Language Evolution: 2-language model, population dynamics,
trigger learning algorithm, batch error-based learning, cue-based learning
- Models of Language Evolution, Part II: multiple languages, Markov
Chain model, homogeneous and non-homogeneous population, multilingual
learners, bilingualism
- Models of Language Evolution, Part III: Communicative Fitness,
languages as association matrices, encoders/decoders, communicability,
best response approximation algorithm, learning with full and partial
information
- Models of Language Evolution, Part IV: linguistic coherence as
emergent property, cue-based social learning, language learning
and evolution and statistical physics
- Additional Topics: Syntactic parameters and language
acquisition: automatic construction of symbolic parsers; parameter
setting via conditional entropies; discussion of algebraic versus
probabilistic methods in Linguistics
Additional Reading Material: Papers
Models of Syntax
- pdf
Noam Chomsky, "Three models for the description of Language"
- pdf Seymour Ginsburg, Barbara
Partee, "A mathematical model of Transformational Grammars"
- pdf Stuart M. Shieber, "Evidence against the context-freeness of
natural language"
- pdf Haitao Liu, "Dependency direction as a means of word-order
typology: a method based on dependency treebanks"
Language change and evolution
- pdf
Partha Niyogi, Robert C. Berwick, "A dynamical systems model for
language change"
- pdf
C.F. Cuskley, M. Pugliese, C. Castellano, F. Colaiori, V.Loreto, F.Tria,
"Internal and external dynamics in language: evidence from verb regularity in a historical corpus of English"
Phylogenetic Trees
- pdf
N.Saitou, M.Nei, "The Neighbor-joining Method: a new method for
reconstructing phylogenetic trees"
- pdf
R.Mihaescu, D.Levy, L.Pachter, "Why neighbor-joining works?"
- pdf
A.Delmestri, N.Cristianini, "Linguistic Phylogenetic Inference by PAM-like
Matrices"
- pdf
F.Petroni, M.Serva, "Language distance and tree reconstruction"
- pdf A.Bouchard-Cote, D.Hall, T.L.Griffiths, D.Klein, "Automated
reconstruction of ancient languages using probabilistic models of sound change"
- pdf H.Luqman, "A Phylogenetic approach to comparative linguistics: a
test study using the languages of Borneo"
- pdf
B.Chor, T.Tuller, "Finding the Maximum Likelihood Tree is Hard"
- pdf
L.Pachter, B.Sturmfels, "The Mathematics of Phylogenomics"
- pdf
N.Eriksson, K.Ranestad, B.Sturmfels, S.Sullivant,
"Phylogenetic Algebraic Geometry"
- pdf
L.Pachter, B.Sturmfels, "Tropical geometry of statistical models"
- pdf G. Longobardi, C. Guardiano, G. Silvestri, A. Boattini, A. Ceolin,
"Towards a syntactic phylogeny of modern Indo-European languages"
- pdf G. Longobardi, C. Guardiano, "Evidence for syntax
as a signal of historical relatedness"
Wave Theory of Language Change
- pdf
W. Labov, "Transmission and Diffusion"
- pdf
J. Nerbonne, "Measuring the diffusion of linguistic change"
Phonology
- pdf
Marc van Oostendorp, "Feature Geometry"
- pdf
Andras Kornai, "The generative power of feature geometry"
- pdf
G.N. Clements, "The Geometry of phonological features"
- pdf Alan Prince, Paul Smolensky, "Optimality Theory in Phonology"
Head-Driven Phrase Structure Grammar
- pdf
R.D. Levine, W.D. Meurers, "Head-driven Phrase Structure Grammar"
Lexical Functional Grammar
- pdf Carol Neidle, "Lexical Functional
Grammar"
- pdf
Ronald M. Kaplan, Joan Bresnan, "Lexical-Functional Grammar: A formal system for grammatical representation"
Semantics
- pdf A.Copestake, D.Flickinger, C.Pollard, I.A.Sag, "Minimal Recursion
Semantics: An Introduction"
Entropy and Complexity
- pdf
K.Ehret, B.Szmrecsanyi, "An information-theoretic approach to
assess linguistic complexity"
- pdf
M.Bane, "Quantifying and measuring Morphological Complexity"
- pdf
A.Kaltchenko, "Algorithms for estimating information distance with
applications to bioinformatics and linguistics"
- pdf
R.Clark, "Kolmogorov complexity and the information content of parameters"
- pdf M.Gell-Mann, "What is Complexity?"
- pdf C.E.Shannon, "Prediction and Entropy of Printed English"
- pdf
D.Link, "Traces of the Mouth: Andrei Andreyevich Markov's mathematization
of writing"
- pdf
A.K.Zvonkin, L.A.Levin, "The complexity of finite objects and the
development of the concepts of information and randomness by means of the
theory of algorithms"
Formal Languages
- pdf T.Ceccherini-Silberstein, W.Woess, "Growth and Ergodicity
of Context-free Languages"
- pdf Y.Wang, L.Yang, H.Xie, "Complexity of unimodal maps
with aperiodic kneading sequences"
- pdf
Eibe Frank, "Formal Languages and Automata", Chapter 6
- pdf J.Shallit, "Number Theory and Formal Languages"
Other Topics
- pdf R.Sproat, M.Yarmohammadi, I.Shafran, B.Roark,
"Lexicographic Semirings"
- pdf
S.Giraudo, J.G.Luque, L.Mignot, F.Nicart, "Operads, quasiorders and
regular languages"
Research Projects
Seminar Information
The seminar accompanying the class will take place on
Thursday afternoon in the same room ANB 107. The room is
available from 3:30 to 5:00 pm. The seminars will run
either 3:30-4:30 or 4:00-5:00 (the variable schedule is
meant to accommodate conflicts of timing for some of
the participants). Seminars will be listed below: check
here for speakers and timing.
- January 15, 4-5pm: Nakul Dawra on Chomsky's Three
models for the description of Language
- February 19, 4-5pm: Ella Mathews on Tree Distance and Language
Reconstruction
- February 26, 4-5pm: Shival Dasu on Evidence against the
context-freeness of natural language
- March 5, 4-5pm: Ella Mathews on Evidence for Syntax as a
Signal for Historical Relatedness
- March 10, 10:30-11:15am: Haebin Lim on Operads in Linguistics
(slides
of the talk)
- March 10, 11:15am-12:00pm: Sadaf Amouzegar on Feature Geometry
- March 28, 11:30-12:30 SLN 159: Chris Estrada on Automata and
Formal Languages
Grading policy
A grade for the class will be assigned on the basis
of attendance and participation in class and the
completion of the following tasks according to
the student's preference:
- Oral presentations based on assigned reading material
- A computer project based on some of the material discussed in class