The phoneme revolution
Francis Steen, 22-24 July 2002 -- 4,000 words

These notes can be used as background for a joint article on the evolution of language with Per-Aage Brandt. The topic is the origins of symbolic thought and the proposed solution is the prolonged cultural innovation made possible by the migration of communicative cues into a systematic combinatorial space of vocalizations.

First installment

In a lunch conversation on 22 July 2002, Doug McAdam referenced his ongoing book project on the origins of sociality. The first chapter discusses the existential problem facing mankind: there is a demand for meaning, yet in fact there is no meaning. He raised the possibility that the cultural invention of language was the trigger that unleashed this thirst for existential meaning, and perhaps as a consequence the paleolithic cultural revolution.

Doug noted the desire to feel that we are cared for, but in fact we are alone; we invent meanings, but in reality there is no meaning. He noted that conflict tends to help establish identity, which is valorized by sacrifice -- during wartime, suicides drop to near zero. He suggested that Neanderthals may not have been able to speak -- I contemplated the possibility that they were relatively unimaginative. Language may have allowed you to perceive yourself as an object, as if from the outside. Evidence for a new focus on existential issues include ritual burial.

I began by making the following counterargument:

  1. Our ancestors all the way back communicated. What is special about language? Can we break language into a series of innovations, or must all components emerge at once?
  2. Children start picking up phonemes in the womb, which violates the recapitulation hypothesis. Chase play, in contrast, comes online in a manner consistent with recapitulation. This suggests there was at some point massive selection pressure to master phonemes.
  3. Chimpanzees don't get phonemes, but they do get arbitrary symbols.

So here's the idea. If we want to pursue the continuity thesis and build strands for a bridge across the chasm that separates us from the chimpanzees (and presumably from our common ancestor), let's propose that language involves three central innovations before we even get grammar:

  1. Symbolism

By symbolism I mean the ability to learn conventionalized verbal or visual cues as prompts for specific and appropriate mental simulations.

This may be mastered by vervets, but the repertoire is very small and there doesn't seem to be innovation. Are the existing signals shared by different troops of vervets (are they species-specific)? The evidence suggests there is quite a bit of genetic shaping of the signals and that the system has a very limited space. Vervets' capacity to form mental images in response to these cues is also debatable; the evidence is there but must still be considered somewhat tentative. In any event it suggests a very limited capacity of maybe a dozen symbols.

Symbolic capacity in chimpanzees is much more advanced; bonobos have mastered more than two hundred arbitrary symbols, or cue/meaning pairs. However, it's not clear that even among bonobos there is much spontaneous innovation.

  2. Generativity

By generativity I mean the capacity to generate new cue/meaning pairs. There is some very modest evidence that chimps can do this, but it is tentative. To be effective, an innovation must also spread through the population; that is to say, there must be some rudimentary epidemiology of representations. There is some evidence from Susan & Joe's work that cultural innovations spread among capuchins. Still, generativity is not a salient feature of non-human primate communication.

Let us imagine a generative species, say homo erectus. Individuals occasionally make up new cue/meaning pairs. They use gesture, posture, and sound, and these innovations spread among the population because they satisfy real communicational needs. A slow rate of innovation among loosely connected populations would lead to a network of slightly differing dialects, whose differences are in part non-overlapping (new cue/meaning pairs do not duplicate existing cue/meaning pairs somewhere else). A rapid rate of innovation among disconnected populations would lead to sharply differing dialects.
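This dynamic can be made concrete with a toy simulation (the mechanism and all parameters below are invented for the sketch, not drawn from any data): groups occasionally coin globally distinct cue/meaning pairs and sometimes borrow a neighbor's lexicon. Loosely connected groups end up with overlapping dialects; disconnected groups end up with sharply differing ones.

```python
import random

random.seed(1)

def simulate(n_groups, generations, innovation_rate, contact_rate):
    """Each group occasionally coins a globally unique cue/meaning pair;
    with some probability it also absorbs a neighbor's lexicon."""
    lexicons = [set() for _ in range(n_groups)]
    next_cue = 0
    for _ in range(generations):
        for i in range(n_groups):
            if random.random() < innovation_rate:   # coin a new cue/meaning pair
                lexicons[i].add(next_cue)
                next_cue += 1
            if random.random() < contact_rate:      # borrow from a neighbor
                lexicons[i] |= lexicons[(i + 1) % n_groups]
    return lexicons

def mean_overlap(lexicons):
    """Average Jaccard overlap between every pair of group lexicons."""
    pairs = [(a, b) for i, a in enumerate(lexicons) for b in lexicons[i + 1:]]
    return sum(len(a & b) / max(1, len(a | b)) for a, b in pairs) / len(pairs)

connected = simulate(5, 200, innovation_rate=0.1, contact_rate=0.5)
isolated = simulate(5, 200, innovation_rate=0.1, contact_rate=0.0)
print(mean_overlap(connected), mean_overlap(isolated))  # isolated groups share no cues
```

Since each innovation is unique, fully disconnected groups share nothing at all; any degree of contact produces the partially overlapping dialect network described above.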

What are the limits to this type of generativity? Would people run out of cues before they run out of meanings? Would the cost of generating and learning new cues rise relatively rapidly? This may be the crucial question: there is no reason to imagine that people would run out of cues -- there is an infinite number of combinations of gestures, postures, and vocalizations -- but any generative activity of any magnitude would create a pressure to systematize. The benefit from systematization is simplicity and predictability in multiplicity. So the decisive innovation here is a basic set of building blocks out of which all cues must be constructed.

  3. Phonemes

The third vital innovation in the evolution of language, then, is a system for generating cues. Now, these cues are arbitrary, or nearly so; the system simply allows you to generate a new cue very rapidly. It provides a well-defined and suitably large combinatorial space.

Not all aspects of the communicative act are reduced to this digital system -- analog components such as intonation, emphasis, timing, gesture, posture, and so on remain.

What is the advantage of a set of basic building blocks for generating cues? This is really an astounding innovation. Why did it happen?

There are three constraints. One is the problem of innovation. The second is the problem of spread. The third is the problem of performance or execution or use.

A combinatorial system makes innovation easy -- but innovation is easy anyway if all you need is some arbitrary cue.

Then there's the problem of spread -- which in part is a problem of memory. Sure, you can innovate innumerable cues, but can you remember and distinguish them? In part there's a problem of uptake: how do you know which elements in a complex communicative act are really integral parts of the new cue? This seems to be a conclusive argument: what constitutes the cue must be formalized because of the frame problem -- otherwise you won't be able to pick out what belongs to the cue and what is incidental. Take the "Gavagai" example from Quine (the "Two Dogmas of Empiricism" guy; the example itself appears in Word and Object): we all assume that the native and the visitor share the same phoneme space!

So here's a new idea: the reason selective pressure was created in favor of vocal communication over gestural communication is that vocal communication is more easily systematized.

Second installment

In a lunch conversation on 23 July 2002, Per-Aage Brandt pointed out that grammar can coherently be seen as springing from our understanding of events. We may think of grammar as a formalized and lexicalized representation of our pre-existing understanding of event structure. This understanding has very different and likely much older roots than phonemes. Actual grammatization of event structures may have taken place gradually over very long time periods. Phonemes solve the problem of memorability and communicability -- by introducing a fixed set of building blocks for constructing cues in cue/meaning pairs, people were able to pick out these cues from the forest of communicative signals -- a version of the frame problem, which we also find in analytic philosophy.

Quine's "indeterminacy of translation" thesis uses the example of an anthropologist visiting a remote tribe. A tribesman points in the direction of a rabbit vanishing behind a tree and cries, "Gavagai!" Quine argued that the anthropologist has no basis for translating this into his own terms -- one simply cannot know what precisely it is that the tribesman is referring to.

Quine's example assumes that the native and the visitor share the same phoneme space. It is in fact this assumption that solves a huge part of the frame problem for Quine, although it is a part of the problem Quine does not address and may not have noticed. This leaves him only with the problem of reference -- and even this problem is posed as highly constrained (only categorical, external, and material referents are considered). Yet the act of communication itself is infinitely complex: how does the listener know that gesture, eye-direction, facial expression, posture, head movements, and props do not form part of the categorical significance of the communication?

The answer is that he knows this, and Quine knows this, and his readers know this because they all tacitly and unconsciously assume that the categorical meaning has been constrained to the phoneme space, has been extirpated from gesture, glances, facial expression, head movements, leg and foot movements, intonation, and so on, and squeezed into the combinatorial space of a fixed set of phonemes.

Quine's anthropologist has no more (and no less) grounds for assuming that the native adheres to this practice than he has for assuming that the native adheres to the practice of referring to the whole animal in this first act of naming, reminiscent of Adam's first act of naming the animals. These are both real problems, and they are evolutionary as much as they are philosophical problems. That is to say, we can recast Quine's indeterminacy of translation as an adaptive problem: how did our ancestors in fact fashion a set of communicative protocols that worked?

My argument here is that in evolutionary history, in the generative phase of language development that surely preceded grammatization, phonemization of cues for categorical meanings would have solved an otherwise intractable problem: the problem of extracting from a complex communicative signal precisely those invariant features that signified the categorical referent.

Consider, for example, a multi-modal cue for the categorical referent "horse" that consists of a rapid wave of the left hand in an arc from the midline of the body to the left over the shoulder, a vocalization like that of a hamster being run over by a toy truck, a facial gesture of St. Anthony being tempted by Susanne (or some such temptation), and a rapid cast of the eyes upwards and then down while the torso is heaved slightly up and forwards before slumping back. This is in principle a perfectly valid, relatively arbitrary cue for the categorical referent "horse" -- but how does the listener know which details of the act must be reproduced as belonging to the cue proper; which details are situational embellishments that intentionally or unintentionally convey the speaker's mood, nutritional state, age, and muscle tone; and which details are simply accidental, the results of poor execution, an imperfect or simply idiosyncratic performance? The answer of course is that the listener cannot know, or can know only in a limited number of cases, because the phase space for each act of communication is so vast that the possibilities for interpretation are indeed unmanageably numerous.
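The scale of the problem is easy to make concrete. Under the simplifying assumption that one communicative act exposes n independently observable features (the feature list below is hypothetical), any non-empty subset of them could in principle constitute the cue proper, so the listener faces 2^n - 1 candidate hypotheses per act:

```python
# Hypothetical feature inventory for a single multimodal communicative act.
features = ["hand arc", "vocalization", "facial gesture", "gaze",
            "torso heave", "posture", "timing"]

# Any non-empty subset of the features could be "the cue proper".
hypotheses = 2 ** len(features) - 1
print(hypotheses)  # 7 features -> 127 candidate cues
```

Restricting the cue to a single fixed channel (phonemic vocalization) collapses this hypothesis space outright: the listener knows in advance where the cue proper resides.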

The suggestion, then, is that the construction of a phoneme space is an adaptation designed to solve this basic frame problem of human communication.

Such phonemization of cues in cue/meaning pairs also relies on prior adaptations, such as categorization. Per-Aage pointed out that symbolism requires a realization that a communicative act is intentional -- otherwise the act of understanding is iconic, or purely cue-based. He also suggested that phonemization may have enabled a new kind of abstract thought. This isn't self-evident. I can now say and think and understand "dog" -- but already vervets need categorical prototypes to understand alarm calls. The question then is, are there certain types of thought that are made possible by phonemization?

Early language learning today (in modern humans) creates links between cues and meanings that make these seem natural. That is to say, my experience of the word "hest" is in a palpable and phenomenologically unmediated sense a horse, while "horse" in a somewhat less immediate sense is the common term people use for a horse. Now, when I say "hest" is a horse, I mean in part that the phonological qualities are backstaged completely; "hest" is as it were organically melded with the mental images of a horse; it is of course very far from arbitrary, as it is tautologically true that "en hest er en hest" -- a horse is a horse.

My point is that the close connection between cue and meaning we experience in our mother tongue suggests that phonemic strings or cues get tightly integrated with concepts, and that they therefore would play an important part in thinking. Now, this is an interesting question: how did phonemization aid thinking?

There's in part a very elegant and simple answer: it's a lot easier to think in phonemes than in gesture/movement/facial-expression/intonation cues. That is to say, I can generate new cue/meaning pairs using multiple modalities, but just as it is hard to remember and reproduce them, it is hard to think in them. The extreme economy of phonemes renders trivial the task of remembering the cue itself, and the meaning it is paired with gets engraved in childhood. A case has been made for specialized memory systems for cue/meaning pairs; if this is correct, it is another important component in the construction of pregrammatical language.

Phonemes simplify cue-generation and memorability. The secret is to generate a vast combinatorial space based on as few elements as possible. The range of sizes of phoneme sets in human languages is narrow -- typically between around 14 (Hawaiian) and 40, though Maddieson (1984) records a full range of 11 to 141. Most languages appear to have between 20 and 30 -- it would be interesting to see a graph of the distribution. So the secret is to prune the elements radically -- imagine only being able to use 14 cues for communicating categorical referents!
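A back-of-envelope calculation (taking the figures above at face value, and ignoring phonotactic restrictions, which would shrink the numbers without changing the orders of magnitude) shows how even a radically pruned inventory yields a vast cue space once strings of up to seven phonemes are allowed:

```python
def cue_space(inventory, max_len=7):
    """Number of distinct strings of length 1..max_len over an inventory."""
    return sum(inventory ** k for k in range(1, max_len + 1))

for inventory in (14, 25, 40):
    print(inventory, cue_space(inventory))
# Even 14 phonemes yield over 113 million distinct strings of length <= 7.
```

So the pruning costs essentially nothing in expressive capacity while buying simplicity and predictability in multiplicity.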

The alphabet and the revolution in the spread and development of knowledge it facilitated represent merely the rediscovery of the power of phonemes. Ideographic writing systems suggest that analogical modes of communication have solid roots in the psyche; they also highlight the cost of going digital: when the cue-space is narrowed, cues are forced to become more arbitrary. Graphic cues -- multimodal cues -- can retain a larger analogical component.

The Chinese, of course, like the Egyptians and Sumerians before them, retain a phonemic spoken language, and have to learn writing or reading as an additional symbolic system. In their own minds, do they think phonemic strings or ideograms? Or a mixture of both, as in Japanese writing?

Now, this would be a good test of the hypothesis that phonemization, which has happened to all languages, is a significant aid to thinking.

Let's spell out the argument. What would it mean to say that phonemization aids thinking? The first level is thinking about categorical referents or about events. Here it's not clear that phonemization and lexicalization really add that much. I can think about a horse running over rolling hills, his mane flowing in the wind, and a lion sneaking up on him from around a knob, jumping up on his back, digging his teeth into his neck. I can run very complex simulations with no need for language. This is of course not to suggest that phoneme strings would not be very handy for communication; it's just not clear that they're of any use for thinking.

But consider other cases. Let's say you want to think about what John wants. Or you want to track the distinctions between what John wants, what Jane thinks John wants, what Jane wants for John, what would be best for John, and what would be best for me for John to want. What it looks like to me is that phonemic strings are ideal for implementing a formal grammar.

Here is the argument:

  1. Grammar is based on event-structure, which our ancestors have parsed for tens or even hundreds of millions of years. The ability to parse, understand, and predict events is basic to mammalian cognition; its roots go even further back.
  2. The development of the ability to run mental simulations led to a vast increase in the human phase space, and thus a vast increase in meanings to be communicated.
  3. This spurred a generative phase in the evolution of communication -- a phase where new cue/meaning pairs were frequently generated.
  4. This generative phase runs up against the limitations of human memory: as long as cues are generated out of the vast space of multimodal behaviors, the number of cue/meaning pairs will tend to remain small, as new pairs have a hard time spreading.
  5. Phonemization is the adaptive solution to this problem. It introduces dramatic efficiency gains in creating, communicating, remembering, and understanding cue/meaning pairs.
  6. The cost is an increase in arbitrariness, since the combinatorial space is so restricted. Communication between groups could be rendered more difficult -- or would have to rely on pre-phonemic modes of communication.
  7. The dramatic efficiency gains produced by phonemization means that it is useful for thinking. Yet this usefulness may not lie primarily in thinking categorical, external referents, or cue/meaning pairs in the prototypical case of external objects as referents. Rather, phonemization may have been useful for thinking event structures -- that is to say, for grammaticizing.

So this is a scenario where event parsing comes first, along with categorization. These are ancient and prelinguistic, pace Whorf. Then comes mental simulation and episodic memory. This vastly expands the human phase space -- let's say erectus can do this, as a playful mapping of theory onto history. Their communication is multi-modal and analogical. They may have converted pretense into a semblance of a theory of mind. They do not have language. They innovate, but their innovations tend to peter out, like precarious fashions, and have a difficult time gaining a solid foothold and spreading.

Archaic homo then represents experiments in formalizing thought. They begin to develop various communicative invariants that can be combined. Call this the combinatorial phase, intermediate between the generative and the phonemic -- the multimodal strategy is maintained, but we start getting standardized cues and combined cues. The standardized cues start migrating into vocalization, and archaic homo develops the first elementary phonemic communication systems. The phoneme space is narrow and the migration incomplete: words cannot convey the richness of mental simulations. In consequence, thoughts cannot be synchronized among many, across time and space, and large communities are impossible.

So the question is this: does the construction of a phonemic system create adaptive pressure to lower the larynx? This is plausible. At some point during the last 500,000 years, during the archaic homo phase, the larynx drops, and in that subgroup of homo sapiens the combinatorial space expands dramatically. This provides the basis for constraining categorical references to the phonemic space -- to phonemic strings of a manageable length (up to seven phonemes or syllables).

And here's the kicker: this in turn makes it possible to engage in symbolic thinking -- that is to say, to represent event structures symbolically.

The suggestion here is that the enlarged space of phonemes made room for a more complete lexicalization, or conversion of communicational cues into pure phonetic strings. The pressure to lexicalize, however, would not end with what had been communicated or even what had been thought to date. Phonemization made it possible and attractive to lexicalize all aspects of human cognition, with a differing urgency, and on a continuing basis.

In this model, which may of course be entirely mistaken, the major contributions of evolved cognitive mechanisms to the capacity for language are categorization, event-structure parsing, mental simulation, mental-state attribution, and the construction of a phonemic set. Once this last has been accomplished, human ingenuity and innovation began the still ongoing task of lexicalizing and grammaticizing various aspects of cognition.

The new lexicals may have no object-like referents. They include words like "of," "under," and "and." There are different classes of innovations -- prepositions that indicate the relative position or movement of categorical referents, verbs that indicate actions, etc. Tomasello may have a table along these lines:

Age (weeks)   Nouns   Verbs   Adjectives   Adverbs   Prepositions   Conjunctions
    28          1       0         0           0            0              0
    30          2       1         0           0            0              0
    34          4       1         0           0            0              0
    40          7       2         0           0            0              9
    48         10       3         0           0            0              0

and so forth. When does the child begin to use the various parts of speech? What are the precise predictions of this model?

Children's language development may be used as a tentative test of this theory, as this development with some degree of plausibility recapitulates the phylogeny of language. Even though phonemization has been moved into the fetal stage -- which suggests its central significance for human language -- the order of events may retain an informative degree of recapitulation.

The model is open to an evolved capacity for grammar, but the more parsimonious theory would be that grammar is simply the incorporation of modes of construal derived from independently evolved and preexisting cognitive systems, most prominently the understanding of events. More generally, it proposes that language can be understood as a lexicalization of pre-existing modes of cognition. This would explain why grammar appears to be innate, yet no definitive and all-encompassing innate grammar can be located (this would involve a discussion of Chomsky's project).

Now, the point is that grammaticalization of event structures required a sustained development that could have taken thousands of years. This would gradually allow for a greater degree of symbolic thinking, as more and more of experience and subjective phenomenology becomes lexicalized and grammaticalized.

In this view, such changes were largely cultural -- that is to say, the result of a sustained and lengthy cultural development and achievement. Now, if we want to say there were major biological innovations during the last 100,000 years, we would have to say that pockets of contemporary humans did not participate in this revolution. This could in principle of course be true; the world would be a richer place for it. Unfortunately the key cognitive innovations that make a fully-fledged human language possible appear to be universal. There is no such thing as a primitive language among any human group. Recent DNA analyses have suggested a common origin of non-African peoples at around 70,000 years ago; the number for all humans is a bit larger.

Interestingly, Bushmen -- perhaps the best candidate for an archaic branch of homo sapiens -- have a few distinctively different phonemes, namely clicks. Are these phonemes possible for non-Bushmen to learn? Do Bushman babies raised in other mother tongues still use clicks? Do Bushmen have specialized physiological equipment for clicks? Until we know this, we cannot determine whether such clicks are local cultural innovations or part of a genetically determined set of phonemes.

The minimalist argument I'm presenting here is that the baby expects to find, and starts to construct, a set of constitutive elements of language in the form of phonemes already in the womb. This capacity and tendency is clearly evolved. The precise content of the set of phonemes bears some general relation to the physiology of the oral cavity, but there is clearly some play here. There is some evidence for specialized memory systems for learning cue/meaning pairs. Beyond this, we may simply be seeing the lexicalization of pre-existing cognitive inference systems -- a process that is ongoing.

So this is the idea: grammatization of language must be preceded by phonemization, and full phonemization requires a dropped larynx. The larynx drops 130,000 years ago at the latest; the lexicalization of cognition takes tens of thousands of years to develop. There is no innate grammar -- but there is innate phonetics and innate event structure. This means you get somewhat different grammars, but based on universal event structure.

Why can't we simply ascribe grammar to ontogeny, or common experience? This explanation doesn't really explain anything: what we need is for some innovative early modern humans to realize that event structure can be lexicalized, and then this knowledge simply has to spread. Of course, it spreads only because each individual has an understanding of events. The question can then be rephrased: why can't we just say that event structure is picked up during ontogeny? This comes up against Hume's skepticism and critique of the concept of causation: the brain may need to be prepared to pick out certain relations, to be looking for them -- as we see for instance in Spelke's experiment. Still, there is lots of room for debate and detail here; in fact, there is no need to state the claim about grammar very strongly, as the theory is really compatible with several solutions.

The crux of the argument is this: you can't grammaticize before you construct the combinatorial space of phonemes. Equally, you cannot engage in symbolic thought as such before you grammaticize. Having grammaticized, you open up for symbolic thought, which is the origin of the modern mind, of the Paleolithic cultural explosion -- and, to return to Doug's original proposal, with some degree of plausibility the origin of the search for existential meaning.

See also Origins of Grammar.

Characterizing evolution

[2002-07-31 Stray idea -- discusses dropped larynx below.]

Darwin discusses three types of selection that drive evolution: artificial selection by animal and plant breeders, natural selection, and sexual selection. The first category should be extended to include non-human animals -- and possibly even beyond. Consider the blueberry, a common plant in the northern forests of the old and the new world (I'm having a huge breakfast of blueberries before leaving Stanford). A major selection pressure acting on the blueberry bush is exerted by the thrush and other birds that feed on it. It is likely that much like human breeders, the birds attempt to maximize their feeding strategy by selecting the largest berries that easily fit into their mouths. This should have the consequence of standardizing the size of berries -- not too small and not too large. In the early history of this relationship, this is likely to have driven the size of the blueberry fruiting body up, and once the optimal size was reached, the feeding preferences of the thrush would have stabilized the size. The relationship, of course, is mutually beneficial: the blueberry relies entirely on the thrush to spread its seeds, and the seed, dropped at its new location in a high-nutrient package, is set off to a great start.

Now, the point here is that in this case, there really is selection going on -- and it's no more or less artificial than the selection of seeds and animals by human farmers. New varieties of plants are selected because they are optimal for human foraging.

Either the lines between natural and artificial selection are just blurred -- and it's easy to see how sexual selection is just a variant of this process -- or you have different types of processes taking place. One may want to define a distinct category -- at one extreme end of the spectrum -- for the death of animals without mouths (Maupertuis' example in 1745), where there really is no selection taking place at all, just a failure of basic functionality. In the middle we find cases where selection is antagonistic and dependent on something akin to basic function -- for instance, where a predator picks out the slowest gazelle. In such antagonistic selection, evolution moves away from the preferences of the selecting animal -- we should call this natural de-selection! At the other extreme we find natural selection proper, where the preferences of one organism drive the evolution of the other to fit these preferences.

This scheme would allow us to say that natural selection is collaborative, while natural deselection is antagonistic. In addition to these types of natural selection -- which really involve acts of selection -- there is the third category that is hard to name, where an organism comes up against basic constraints of physics and chemistry. We might just call this physical failure or physiological failure -- it has no components that involve the preferences of another organism. So this is the idea. Darwin's conception of natural selection can usefully be parsed into two orthogonal or independent spectra:

  • Physiological success or failure (or benefits and disadvantages)
  • Natural selection or deselection

Any adaptation can then be measured against these two dimensions, which in principle should operate independently of each other. Take the lowering of the larynx. Along the first dimension, will it lead to physiological failure, as in choking to death? That would lead to the adaptation being weeded out of the gene pool. Or will the adaptation be a physiological success, for instance by being accompanied by appropriate breathing and swallowing reflexes? Along the selection dimension, will some agent find the results objectionable and for instance refuse to marry someone with a dropped larynx because his voice sounds weird? This would drive the adaptation to extinction through sexual deselection. Conversely, if some agent finds the increased range of the voice to be an attractive capacity, he may want to marry the girl or include the guy in his coalition; this would drive the adaptation forwards through natural selection.

Finally, this raises the question of whether there are additional dimensions that don't get captured by this scheme. Let's imagine, for instance, that the larynx drops slowly, limited by the threat of physiological failure, but driven forward by a functional advantage that nobody notices and that is therefore not subject to selection. How could this happen? Let's say that the individual is simply better at communicating what he wants as a child. --Implausible, since his parents won't be using the phonemes he can produce. Actually, you could imagine a scenario where the adult larynx can produce more vowels than the infant larynx -- if this is possible in fact, then a child born with a lower larynx might really have an advantage. The advantage would be that he could pick up the full range of adult words faster, and thus communicate his needs more effectively. The adults don't have a special preference for this capacity; they simply are placed in a better position to respond to the child according to their pre-existing preferences to help him. So here's an imaginary example of a situation that doesn't fit in the scheme -- an adaptation that is not preferred for itself, but simply enables others to express their existing preferences more effectively. The point is that there may be physiological advantages to an adaptation that are far less dramatic than failure or success -- there are degrees of failure and success. But if this is all, then it's adequately captured by the scheme -- we just have to specify that failure and success are extremes, and there are lots of in-between cases. Equally for the selection effects: fully antagonistic and fully cooperative are extremes; there are lots of intermediate cases.

Finally, there are interaction effects, where both dimensions contribute. In the case of the larynx, for instance, the drop imposes a physiological cost, but there are also physiological benefits, perhaps including the benefit of being able to communicate with your parents at an earlier age. In the first round, this is a physiological benefit because speaking with your parents is a function required for survival, as much as breathing. In the second round, one might imagine that parents would hasten this adaptation forward by taking particularly good care of children with a wider range in their tiny voices, or children better able to get the full range of phonemes. (On that note, Susan and Joe's child Kate is unable to say several consonants, including g and k. The way she handles this is by systematically and consistently substituting d for g and t for k -- check this with Susan.)

In brief, this is an imperfect analysis, but it does help to clarify some of the potential dynamics at work in evolution in ways that should make it easier to understand and discuss.

Bibliography

Maddieson, Ian (1984). Patterns of Sounds. New York: Cambridge University Press. Series title: Cambridge Studies in Speech Science and Communication. Claims phoneme inventories vary from 11 to 141.

Ladefoged, Peter and Ian Maddieson (1996). The sounds of the world's languages. Cambridge, MA: Blackwell. Series title: Phonological theory.

Levinson, Stephen C. Language and Culture. In MITECS (The MIT Encyclopedia of the Cognitive Sciences), online edition, with references and links.

 

 

 


Maintained by Francis F. Steen, Communication Studies, University of California Los Angeles

