New paper on language macro-families, from Austro-Tai to Indo-European-Chukotko-Kamchatkan

A new paper on language macro-families by Gerhard Jäger was published three days ago in PNAS, adding to Annemarie Verkerk's list of phylogenetic linguistics papers that have appeared this month.  I give here a couple of encouraging results from the paper (mainly statistical support for the Austro-Tai hypothesis) and a couple of criticisms (mainly of the distance-based method of constructing the phylogeny).

One of the greatest achievements of linguistics has been the discovery of large language families.  Some 583 languages of Europe and India from English to Nepali have been demonstrated to be descendants of a single common language, Proto-Indo-European.  A further 1274 languages spread around the Pacific and the Indian Ocean from Madagascar to Hawaii have been shown to be descended from a single language spoken in Taiwan, Proto-Austronesian. 

The 7,000-8,000 languages of the world described so far have been placed into roughly 430 large groups of this type (37 in Eurasia alone), according to Glottolog.  These families are defined by shared innovations in basic vocabulary (such as English brother, Latin frater, Russian brat etc.).  Linguists dream of being able to link these language families into even larger units, and of pushing our knowledge of language history further back in time.  With more distantly related languages, however, it becomes hard to distinguish these patterns from accidental similarity or from recent borrowing of loanwords, which confounds our ability to find more distant connections between languages.  Not that people have not tried.  These attempts range from interesting to silly, and include projects such as the Tower of Babel, of which this is a typical entry:

The above database is claiming that English man, Mandarin Chinese 男 nán 'male', Ancient Egyptian mn 'shepherd', Chechen naχ 'man', Yeniseian hīɣ 'man', and so on are sufficiently similar that they may be inherited from a common ancestor.  There is in fact nothing wrong in principle with a claim like this, except that most of the time this type of work does not demonstrate that these patterns are not simply accidental.  As Steven Pinker put it in The Language Instinct (1994:255),

 I have no problem with Greenberg’s use of many loose correspondences, or even with the fact that some of his data contains random errors.  What bothers me more is his reliance on gut feelings of similarity rather than on actual statistics that control for the number of correspondences that might be expected by chance…Though I am willing to be patient with Nostratic and similar hypotheses pending the work of a good statistician with a free afternoon, I find the Proto-World hypothesis especially suspect.  (Comparative linguists are speechless.)

The new paper by Gerhard Jäger could therefore be viewed as one such work by a good statistician with a free afternoon - although the ASJP database which it uses is a project that has been developing for many years, containing basic vocabulary wordlists for about 4400 languages.  Jäger's paper used these wordlists to compute how similar languages are in their basic vocabulary, and see whether particular language macro-families emerge.  

I had assumed that this had already been tested by the makers of the ASJP database (e.g. these trees) and had yielded negative results.  I was therefore surprised by the main claim of the paper: that this procedure yields strong support for several controversial macro-families proposed in the literature.  These include Austronesian being related to Tai-Kadai (Austro-Tai), Mongolic being related to Tungusic and Turkic (Altaic), and even some form of 'Eurasiatic' subsuming Indo-European, Uralic, Altaic and some language families of Siberia.

The paper takes word lists from 1,161 languages in Eurasia.  It then computes the similarity of words between each pair of languages; for example, it turns out that the Irish Gaelic klox 'stone' has a short edit distance to the word kox 'stone' in Northern Itelmen, a Chukotko-Kamchatkan language in far eastern Siberia.  These distances are computed for each word in the forty-word list, and a total distance between the two languages is then computed.  By assuming that closely related languages are likely to be more similar, a tree can be constructed using a 'greedy minimum evolution' algorithm that puts more similar languages together in clades.  Most known language families are successfully recovered using this data and method.
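To make the word-by-word comparison concrete, here is a minimal sketch in Python.  It assumes plain Levenshtein edit distance, normalized by word length and averaged over the meanings two word lists share; this is a simplification for illustration, not the weighted measure the paper actually uses, and the function names are my own.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word (0 = identical)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def language_distance(wordlist_a: dict, wordlist_b: dict) -> float:
    """Mean normalized distance over the meanings present in both word lists."""
    shared = sorted(set(wordlist_a) & set(wordlist_b))
    if not shared:
        raise ValueError("the two word lists share no meanings")
    total = sum(normalized_distance(wordlist_a[m], wordlist_b[m]) for m in shared)
    return total / len(shared)


# The two 'stone' words quoted above differ by a single deleted segment.
print(levenshtein("klox", "kox"))
print(language_distance({"stone": "klox"}, {"stone": "kox"}))
```

With a full forty-word list per language, `language_distance` would be called once per pair of languages to fill in the distance matrix that the tree-building step starts from.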

An interesting feature of the way that distances are computed is that similarities are weighted by sound class.  For example, hand in English is distant from mano in Spanish because the difference between h and m at the beginning of the word is large, whereas differences in vowels are penalized less.  In addition, the distances between words are corrected by how similar the languages are in the sounds that they use; English and Dutch use similar sound classes for example, and hence the likelihood of English and Dutch words resembling each other by chance is higher.  This is then corrected for when computing distance (interestingly, genuinely related languages are therefore penalized from the outset by virtue of having more similar sound systems).
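The two ideas in this paragraph can be sketched as follows.  Everything here is an assumption made for illustration: the crude vowel/non-vowel split, the substitution cost of 0.5 between vowels, and the correction itself, which simply divides the mean distance between words with the same meaning by the mean distance between words with different meanings.  The paper's actual weights and calibration are more sophisticated than this.

```python
from statistics import mean

VOWELS = set("aeiou")  # assumption: a crude two-class split of sounds


def substitution_cost(x: str, y: str) -> float:
    """Cheap vowel-for-vowel swaps; full cost for anything else (assumed weights)."""
    if x == y:
        return 0.0
    if x in VOWELS and y in VOWELS:
        return 0.5
    return 1.0


def weighted_distance(a: str, b: str) -> float:
    """Edit distance with sound-class-dependent substitution costs."""
    prev = [float(j) for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [float(i)]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1.0,
                            curr[j - 1] + 1.0,
                            prev[j - 1] + substitution_cost(ca, cb)))
        prev = curr
    return prev[-1]


def normalized(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    return weighted_distance(a, b) / longest if longest else 0.0


def chance_corrected_distance(list_a: dict, list_b: dict) -> float:
    """Same-meaning distance relative to different-meaning (chance) distance."""
    meanings = sorted(set(list_a) & set(list_b))
    if len(meanings) < 2:
        raise ValueError("need at least two shared meanings")
    same = mean(normalized(list_a[m], list_b[m]) for m in meanings)
    chance = mean(normalized(list_a[m], list_b[n])
                  for m in meanings for n in meanings if m != n)
    if chance == 0:
        raise ValueError("chance baseline is zero")
    return same / chance


# Toy orthographic word lists, for illustration only.
english = {"hand": "hand", "stone": "stone"}
spanish = {"hand": "mano", "stone": "piedra"}
print(weighted_distance("hand", "mano"))
print(chance_corrected_distance(english, spanish))
```

The point of the ratio is that two languages with similar sound inventories get a low chance baseline, so raw resemblance between their words counts for less.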

These innovations in the method are ostensibly the reason for the new improved results.  Above the level of known language families, it turns out that there are larger units, some with high confidence values. The resulting tree looks like this:

It turns out that the closest relative of Indo-European is Chukotko-Kamchatkan, an answer nobody would have predicted given that Chukotko-Kamchatkan languages are spoken in far eastern Siberia.  Part of the explanation for this geographically implausible clade may be the 'surprisingly high number' of chance resemblances between Celtic and Chukotko-Kamchatkan words (cf. the Irish and Northern Itelmen words for 'stone' above), which Jäger discusses at length.  Indo-European turns out to be part of a large family spanning northern Eurasia, from the Urals to Mongolia, and up to eastern Siberia opposite Japan.

These are interesting results, and I don't know what people will make of them (I haven't seen any coverage or responses so far, in contrast with Mark Pagel et al.'s more publicized paper in 2013, also in PNAS).  For what it's worth, here are in my opinion two good findings from the paper:

i) Austro-Tai.  Several linguists have claimed that the evidence for Austronesian and Tai-Kadai being related is very strong.  As I've written in a previous blogpost, linguists such as Sagart and Ostapirat provide shared vocabulary as evidence but do not back it up with a statistical test comparing it against other randomly selected languages.  It is therefore very interesting that Austronesian and Tai-Kadai come out as a strongly supported clade when such a test is done.  This is also an impressive result given how different the phonological systems of Austronesian and Tai-Kadai languages are.  It could still be partially due to borrowing, a question that could be answered by looking at what types of words are similar, as some words are known to be more likely to be borrowed than others.

ii) Less spectacularly, there are several geographically plausible macro-families, which could also simply be due to recent borrowing.  Examples include Sino-Tibetan/Hmong-Mien, Tungusic/Mongolic, and Ainu/Japanese.  This is still interesting, and shows the validity of the method in detecting genuine similarity in vocabulary between languages.  It only means that our methods for distinguishing inheritance from borrowing need to improve.  

And here are two weaknesses of the paper:

i) He chooses to test only languages of Eurasia, without giving any justification.  A much fairer test would be to pick language families from around the world.  The reason is that by picking language families of Eurasia, he is reducing the chance of finding language macro-families that are even more stretched than Indo-European-Chukotko-Kamchatkan.  If several Eurasian families turned out to be related most closely to language families of New Guinea or South America, for example, then this would cast doubt on how well the method really works.  As it stands, by testing only Eurasian language families, even the oddest macro-families (Ainu and Austro-Asiatic, Indo-European and Chukotko-Kamchatkan) can be argued away as not being completely impossible.

ii) The 'greedy minimum evolution' method of constructing the phylogeny.  For various reasons this method is too crude.  With a lot of language change, languages end up becoming similar by accident, such as the Celtic and Chukotko-Kamchatkan languages mentioned above.  Conversely, with a lot of language change, related languages can end up being very different from each other.  An algorithm that relates languages on the basis of similarity is therefore likely to confuse degree of relatedness with amounts of change (branch-lengths in the tree).

A better method is Bayesian phylogenetic inference, which tries out different trees and allows the branch lengths to vary, so that languages may be related but have been diverging for a long time, causing them to be not very similar.  Languages which are overall quite dissimilar can nevertheless turn out to be related if they have a few resemblances in the most slow-changing words, if these words are not shared by other languages.  
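A toy calculation shows how similarity can mislead.  The sketch below assumes that each language replaces a given word independently at a constant yearly rate, so the chance that two sister languages both still keep the inherited word decays exponentially with time since their split.  The rates and dates are invented for illustration, not taken from the paper.

```python
import math


def expected_shared_retention(rate_a: float, rate_b: float,
                              years_since_split: float) -> float:
    """Probability that neither lineage has replaced an inherited word.

    Assumes independent, constant-rate replacement along each branch.
    """
    return math.exp(-(rate_a + rate_b) * years_since_split)


# Hypothetical pair 1: split only 2,000 years ago, but both change quickly.
recent_fast = expected_shared_retention(0.0005, 0.0005, 2000)

# Hypothetical pair 2: split 6,000 years ago, but both change slowly.
old_slow = expected_shared_retention(0.0001, 0.0001, 6000)

print(round(recent_fast, 3), round(old_slow, 3))
```

Under these assumed numbers the older, slowly changing pair retains more shared vocabulary than the younger, fast-changing pair, so a purely similarity-based tree would group the wrong languages most closely.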

The solution is therefore to recognize that words change at different rates depending on how frequent they are (for example, forms for 'two' such as deux, dva, do, due etc. in Europe are retained from Proto-Indo-European).  Instead of computing the overall distance between languages, an algorithm that allows words to change at different rates is clearly going to capture genuine relatedness more effectively.  Simulating the development of individual words also allows us to work out which ones are more likely to be borrowed and which ones are likely to be inherited.  Jäger points out the need for more work on simulating language contact, and also the interesting prospect of modeling the way that sounds change and using more informative comparison of words that way (as Johann-Mattis List's Python package LingPy does).  Another promising approach is to take semantic change into account and use a larger lexicon (e.g. Tier 'animal' in German is cognate with deer in English), perhaps by computing semantic distance between words using semantic vectors.

All of these are possible improvements to the methodology of the paper, some of which Jäger notes in conclusion.  Despite these weaknesses, this paper is probably the best work so far in showing that this kind of long-distance comparison is possible.  It compares favorably to Mark Pagel et al.'s PNAS paper in 2013, which used putative cognates from the Tower of Babel (the database mentioned above that claims that Ancient Egyptian mn and English man may be related) to suggest that they were evidence for a Eurasiatic family.  They justified the validity of these long-distance etymologies by showing that the meanings of these words ('I', 'you', 'mother' etc.) tended to be highly frequent words in everyday speech, exactly what you would expect of words conserved over thousands of years.  I don't think the finding by itself is necessarily very useful, as it could simply show that the makers of the Tower of Babel have a bias towards looking for shared word forms that are more basic and more frequent, such as pronouns, numerals, demonstratives and everyday nouns, the kind of words which would make plausible ancient cognates.

Pagel et al.'s paper is however interesting for calculating the half-life of words, namely the number of years before a word is typically replaced in a language; for example, words for 'I' seem to have a 50% chance of being replaced after as much as 77,000 years, based on the amount that these words have changed within known families.  This latter fact suggests that some words may indeed be conserved from ancestral languages many thousands of years ago.  In some respects that paper is the opposite in methodology to Jäger's, its strength being in the way that it studied the different rates of change of words rather than calculating distance between whole wordlists, but its weakness being in not evaluating how likely the Tower of Babel project is to find similarities in word forms due to chance.  
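The half-life figure can be turned into a survival probability with a little arithmetic.  This sketch assumes simple exponential decay, i.e. a constant chance of replacement per year, which is what the term 'half-life' implies; the function names are my own and this is not Pagel et al.'s actual estimation procedure.

```python
import math


def replacement_rate(half_life_years: float) -> float:
    """Constant yearly replacement rate implied by a given half-life."""
    return math.log(2) / half_life_years


def retention_probability(half_life_years: float, elapsed_years: float) -> float:
    """Chance that a word has not been replaced after elapsed_years."""
    return 0.5 ** (elapsed_years / half_life_years)


# Using the 77,000-year half-life quoted above for words for 'I':
half_life = 77_000
for years in (5_000, 10_000, 15_000):
    print(years, round(retention_probability(half_life, years), 3))
```

On this assumption a word with so long a half-life would very probably survive over the time depths proposed for macro-families, which is why the slowest-changing words are the interesting ones for long-distance comparison.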

Apart from these improvements to methodology, the main work in the future will be more vocabulary data collection, which besides the ASJP database is now being done by Russell Gray and Quentin Atkinson and others in the 'Glottobank' project, building on work in specific regions such as the Indo-European Cognacy Database, the Austronesian Basic Vocabulary Database and the Trans-New Guinea database.    

Jäger's paper is encouraging in showing that such long-distance comparisons seem to be possible, and that the main challenges are in collecting data and refining these statistical techniques.  I am excited in particular by the prospect that these methods in the short-term will allow us to replace the Austronesian family with the larger Austro-Tai family - a massive family of over 500 million speakers that connects Thai and Lao speakers and their rice-farming origins in southern China with the great seafaring migrations of Austronesians out into the Pacific.

Dai village near Menghai in southwest China

Map of Austronesian migrations in the Pacific (with Hedvig and Damian Blasi in the Auckland Museum)

Phylogenetic happenings in September 2015

Humans who read grammars can use a variety of tool sets to analyse the linguistic diversity they encounter. One of these tool sets that is still relatively new to this particular set of humans is called 'phylogenetic comparative methods'. These methods assess linguistic features as they evolve on the branches of a family tree (aka phylogenetic tree). The combination of information on the history of a language family with data on typological features allows typologists to do cool things like infer what ancestral languages were like and how quickly features change.

For unknown reasons, magical things happened, planets aligned and in the first two weeks of September EIGHT new cool phylogenetic studies appeared! Let's have a look!

First up are several talks presented at the "Historical relationships among languages of the Americas" workshop at the 48th Annual Meeting of the Societas Linguistica Europaea, held in Leiden (NL) in early September. A number of talks dealt with the reconstruction of the genealogy of language families, and most of these used phylogenetic methods: Thiago Chacon and colleagues presented "New perspectives on Tukanoan language history: A combined framework of quantitative and qualitative approaches", Sérgio Meira and colleagues discussed "A character-based internal classification of the Cariban language family", and Elisabeth Norcliffe and colleagues focused on "The reconstruction and classification of the Barbacoan family of languages". How cool is that! Unfortunately I cannot discuss all of these here, but I'll add links to slides when they become available.

Whereas those three talks presented phylogenetic classifications of various types, Natalia Chousou-Polydouri and colleagues presented a phylogenetic comparative study with the title "Phylogenetic analysis of morphosyntactic data: A case study of negation in Tupí-Guaraní". The same team has previously worked on a phylogenetic tree of the Tupí-Guaraní languages, which are spoken in Amazonia and surrounding areas. In their SLE talk, Chousou-Polydouri et al. present an analysis of negation in 29 Tupí-Guaraní languages. They look at particular constructions used for negation, such as reflexes of the reconstructed morpheme *ani for 'no', as well as special constructions for negative imperatives and directives. They use ancestral state estimations to assess which of these morphemes and constructions can be reconstructed for Proto-Tupí-Guaraní, and show that there is evidence for 6 negator morphemes and 5 different negation constructions in Proto-Tupí-Guaraní. So now we know more about linguistic change in negation strategies in Tupí-Guaraní, and there is more to come from this team, as they are working on a big morphosyntax database!
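For readers who have not met ancestral state estimation before, here is a minimal sketch of the idea.  It uses Fitch parsimony, a much simpler technique than the model-based estimation typically used in studies like these, and the tree, the language names and the feature values are all made up for illustration.

```python
def fitch(tree, leaf_states):
    """Fitch parsimony on a binary tree given as nested 2-tuples of leaf names.

    Returns (set of most parsimonious states at this node, number of changes).
    """
    if isinstance(tree, str):          # a leaf: its state is simply observed
        return {leaf_states[tree]}, 0
    left, right = tree
    left_set, left_changes = fitch(left, leaf_states)
    right_set, right_changes = fitch(right, leaf_states)
    common = left_set & right_set
    if common:                         # children agree: no extra change needed
        return common, left_changes + right_changes
    # children disagree: keep both options and count one change
    return left_set | right_set, left_changes + right_changes + 1


# Hypothetical family of four languages and one yes/no feature,
# e.g. "has a reflex of a given negator".
tree = (("Lang A", "Lang B"), ("Lang C", "Lang D"))
has_negator = {"Lang A": "yes", "Lang B": "yes", "Lang C": "yes", "Lang D": "no"}

root_states, changes = fitch(tree, has_negator)
print(root_states, changes)
```

In this toy case the proto-language is reconstructed as having the negator, with a single loss on the branch leading to the one language that lacks it.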

To stay on the topic of Tupí(-Guaraní), last week a special issue of the Boletim do Museu Paraense Emílio Goeldi came out that focuses on the Tupian languages. In this special issue two phylogenetic studies appear, the first is by Ana Vilacy Galucio and colleagues and is entitled "Genealogical relations and lexical distances within the Tupian linguistic family". This paper is a phylogenetic investigation of the Tupian language family (of which Tupí-Guaraní is a subgroup) - the authors use distance-based methods to investigate the relations between 23 Tupian languages. Galucio et al. compare several different lexical datasets: first and foremost a 100-item Swadesh list, but also subsets of this list that only feature the most retentive (more stable) and least retentive (less stable) items, as well as a dataset of 90 plant and animal names. They find quite tree-like networks, suggesting that the Tupian languages diversified through periods of migration or political separation that created the major subgroups. However, their analysis of plant and animal names suggests that there has been undetected borrowing, indicating some form of (subsequent) language contact between some of the Tupian subgroups. More is to come from this team, as they will continue to investigate the phylogenetics of the Tupian family further using character-based methods!

The second phylogenetic study in the special issue is Joshua Birchall's "A comparison of verbal person marking across Tupian languages". Birchall has data on verbal person marking from 16 Tupian languages. He studies how verbal person marking has changed on the branches of two different phylogenetic trees, an expert classification as well as a tree inferred on the basis of lexical data. He finds that Proto-Tupian is most likely to have marked the subject and the object of both transitive and intransitive clauses on its verbs. However, subsequent changes have affected person marking on transitive and intransitive verbs differently. Birchall complements this phylogenetic analysis with an overview of the verbal person markers and their cognacy across languages, showing that quantitative and qualitative methods can go hand in hand to shed light on linguistic diversity :).

Let's move from South America to Africa! On the 14th, a study by Rebecca Grollemund and colleagues came out in the early edition of PNAS. Entitled "Bantu expansion shows that habitat alters the route and pace of human dispersals", this is a phylogenetic study that uses ancestral state estimation to infer the route by which Bantu speakers inhabited most of Sub-Saharan Africa. First, Grollemund et al. reconstruct a dated phylogenetic tree on the basis of lexical data from 409 Bantu and 15 Bantoid languages. Then, they use this tree to reconstruct the location of ancestral languages that must have been spoken on the route from what is now northwest Cameroon all the way to easternmost Kenya and southernmost South Africa. As they know where and when these ancestral languages were spoken, they are able to show that Bantu speakers passed through a savannah corridor in the rainforest that appeared around 4000 years ago. Additional studies on migration rates of the various Bantu speaking peoples into and out of the rainforest indicate that Bantu speakers prefer to live on the savannahs, as moves into the rainforest are typically delayed by around 300 years. This paper is an excellent example of using phylogenetics to make inferences about human history, in this case the impressive spread of the Bantu language family!

For the last study I want to discuss we need to move from Africa to Australia! This one came out on the 16th. Kevin Zhou and Claire Bowern write about "Quantifying uncertainty in the phylogenetics of Australian numeral systems", a phylogenetic comparative study of the number of numeral terms as well as their compositionality in the Pama-Nyungan languages of Australia. What is so amazing about Australian numeral systems (= terms for numbers such as 1, 2, 3, etc.) is that they are typically tiny: Zhou & Bowern write that most of the languages they investigate have a range of 3 to 20 numeral terms, with a median of 4! As we've seen above in some of the other studies, ancestral state estimations are used to infer the Proto-Pama-Nyungan numeral system, which probably had 4 numeral terms. They perform additional tests to determine whether there is a correlation in the compositionality of numeral terms for ‘3’ and ‘4’. Zhou & Bowern demonstrate that such a correlation indeed exists, i.e. if languages form '4' by 2 + 2, they are likely to also form '3' in a non-opaque manner, by adding 2 + 1. This is a great paper with some very cool figures - Figure 1 is very good for seeing how numeral systems have become smaller in some Pama-Nyungan subgroups and bigger in others.

Thank you for joining me in a journey around the world where we find these cool studies of language family histories and typological diversity featuring explanations in terms of historical change! If only every month could be like September 2015...

If I have missed anything else phylogenetic that appeared in the last three weeks, please comment and I shall update this post.

EDIT 1: even though it is not about linguistics, a paper introducing a new database on Austronesian supernatural beliefs and practices that features several phylogenetic comparative analyses deserves mention here! It came out in PLoS One on the 23rd. The database is called Pulotu and is freely accessible here.

EDIT 2: as a special birthday gift for yours truly, a study on language macro-families came out in PNAS on the 24th. Jeremy Collins discusses it fully in this new post.

The World Atlas of Universal Grammar

The World Atlas of Universal Grammar is an ongoing project documenting features of languages, by a team of more than eight authors.  Each feature is a structural property that describes one aspect of cross-linguistic homogeneity.  A feature has between 0 and 87178291200 values, shown by different colours on the maps.  Here is a sample of the features listed by author.

Steven Pinker
Does the language reverse the order of words in a sentence to form a question?
Yes      No
Does the language employ musical melodies for words and major/minor keys for polarity?
Yes      No
Is the language made up only of two-word utterances, with larger compositional meanings deduced from pragmatics?
Yes      No
Is it the ‘language’ of bee dance?
Yes      No
Does the language specify grammatical relations not in terms of agent and patient, but in terms of evolutionarily significant relationships (predator-prey, eater-food, enemy-ally, permissible-impermissible sexual partners, etc.)?
Yes      No
Is the language similar to the banter among New Guinea highlanders in the film of their first contact with the rest of the world, or the prattle of little girls in a Tokyo playground?
Yes      No
Is it a ‘rational’ language spoken in a utopian commune, lacking arbitrary signs larger than a phoneme and allowing the meaning of any word to be deduced from its phonology?
Yes      No

Mark Baker
Does the language have obligatory subject and object-agreement on the verb, free word order, and object incorporation, as well as prohibiting reflexive pronouns and indefinite nominals?
Yes      No

Noam Chomsky
How closely does the language resemble what a superbly competent engineer might have constructed, given certain design specifications?

Richard Kayne
Does the language have Subject-Verb-Object order?

Luigi Rizzi
What type of wh-movement does the language have?
Overt      Covert

Guglielmo Cinque
What is the ordering of noun and adjective, noun and numeral, noun and demonstrative, possessor and possessum, verb and object, and verb and subject?
Head-initial    Head-final

Ian Roberts
Is there V-to-V movement?
Yes      No
Does Force attract Fin?
Yes      No
Which of the following realizes Force?
Merge   Move   Both    Neither
Is it natural to hypothesize that the accusative form of the object is found when the object moves to SpecAgrOP, and therefore that the subject must have raised out of VP because it precedes the object, up to the lowest position available to the subject which is SpecTP, making the V higher than T and therefore in AgrS?
Yes      No

Géraldine Legendre
Rank these constraints in order of importance:
The highest A-specifier in the sentence must be filled   
Lexical items must contribute to the interpretation of a structure
Focused elements are aligned with the left edge of the VP
Focused elements are aligned with the left edge of the clause
Minimal projection
Economy of expression
No morphology
Obligatory heads
No lexical movement
Optional locative specifier
Optional manner specifier
Optional reason specifier

David Pesetsky
Does the language have case marking?
Yes      No
If not, is it a Bantu language that can drop the nominal augment on nouns if they are preverbal and raised, but not if they are unraised?
Yes      No
So it does have case marking after all!
Yes      No
Could this finding get published in Science, be reported as an AP news item, and end up as a theme for a joke by a late-night talk-show host?
Yes      No

