Unsupervised learning of morphology. By H. Hammarström & L. Borin
Computational Linguistics 37 (2): 309-350, 2011.
CR Review No. 139780
This paper surveys work on the induction of morphological information from texts. The overview considers only systems that accept as input raw text--that is, unannotated natural language text--and produce as output a description of the morphological structure of the language using as little supervision as possible.
For the purposes of the paper, the authors define a hierarchy of morphological analysis that has as its base a “justification”--a linguistically informed motivation for the morphological description of the language--and, at the top, a list of the affixes of the language. The actual segmentation of words into stems and affixes sits in the middle of this hierarchy.
Following a general introduction to the subject, the paper proceeds to a historical survey of, and motivation for, unsupervised learning of morphology (ULM), starting with the work of Zellig Harris. Next comes a section titled “Trends and Techniques in ULM,” which contains a table that forms a road map (described as brief, even though it covers more than two dense pages) to many of the early studies of ULM. The authors then survey four principal approaches: border and frequency, where segmentation borders are deduced on the basis of substrings that occur with a variety of adjacent substrings; group and abstract, where words are first grouped according to some metric such as edit distance; features and classes, where a word is viewed as a set of features, for example, n-grams; and phonological categories and separation, where the phonemes of a word may be classed into categories such as vowels and consonants. The authors point out that, regrettably, there has been little cross-fertilization between these approaches.
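To make the border and frequency idea concrete, the following is a minimal sketch of successor counting in the spirit of Harris's work: a border is proposed after any prefix that is followed by many different letters in the word list. It is not taken from the paper or from any system it surveys. The function names, the end-of-word marker "#", the threshold of 2, and the toy word list are all assumptions made for illustration.

```python
from collections import defaultdict


def successor_variety(words):
    """Map each prefix to the set of distinct symbols that follow it.

    Hypothetical helper: "#" is an assumed end-of-word marker, so a prefix
    that is itself a complete word gains one extra successor.
    """
    successors = defaultdict(set)
    for word in words:
        for i in range(1, len(word)):
            successors[word[:i]].add(word[i])
        successors[word].add("#")
    return successors


def segment(word, successors, threshold=2):
    """Split `word` after every proper prefix whose successor variety
    reaches `threshold` (an assumed, untuned cutoff)."""
    pieces, start = [], 0
    for i in range(1, len(word)):
        if len(successors.get(word[:i], ())) >= threshold:
            pieces.append(word[start:i])
            start = i
    pieces.append(word[start:])
    return pieces


# Toy word list, invented for this example only.
corpus = ["walk", "walks", "walked", "walking",
          "talk", "talks", "talked", "talking"]
variety = successor_variety(corpus)
print(segment("walking", variety))
print(segment("talked", variety))
```

On this toy list the only prefixes with more than one successor are "walk" and "talk", so borders fall between stem and suffix. Real text is far noisier, which is presumably why the surveyed systems combine such counts with frequency information rather than relying on a fixed cutoff.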
The penultimate section discusses, among other topics, the language dependence of ULM and ULM’s relation to semantics, and addresses this question: Is ULM of any use? A brief subsection on future directions suggests areas where high-accuracy systems might emerge.
The authors conclude that ULM has made progress, but that there is a long way to go. The paper contains an extensive bibliography with over 250 entries. For anyone interested in finding out more about ULM, this paper is an excellent starting place.
Reviewer: J. P. E. Hodgson
[This is a previously published Editor's Pick.]