Monday, October 16, 2017

A week of Bantu grammar reading: The good, the bad, and the ugly

In my week of grammar reading, there were some truly good, truly bad, and truly ugly aspects. It was entirely devoted to Bantu, as it has been for some time, because of a project on Bantu gender systems I am doing together with Francesca Di Garbo of Stockholm University. Bantu languages are best known for their complex morphology, involving both the noun (gender) and the verb (Tense-Mood-Aspect). Nouns are typically classified into up to around 12 different genders, signified by prefixes that attach to the noun root and by prefixes that mark agreement with the noun throughout the clause. To start with the good, below is an overview of nominal prefixes in Ding (diz), spoken in the Democratic Republic of Congo. The Roman numerals in the first column signify each gender, the two subsequent columns give the singular and plural prefixes used for nouns, including variant forms, and the final column gives examples of the type of noun found in each gender.

Mertens (1938: 26)

The large majority of Bantu languages have gender systems like that of Ding. What we are looking for in this project is not the presence of gender systems per se, but rather variation regarding the number of genders, the places where gender marking is found aside from on the noun, and occasions when agreement is not based on gender, but rather on animacy - more on the latter below. We look at over 20 different word classes that may carry agreement marking: attributive adjectives, demonstratives, genitives, predicative nominals, possessive pronouns, and so on. Many of these do have gender agreement in many Bantu languages, but unfortunately this is not always very well described.

The Ding grammar by Joseph Mertens is, in one word, AWESOME. It is part of a three-volume description of ethnography, grammar, and lexicon, in total over 1000 pages. The grammar is very extensive, features many examples, and is written clearly and honestly - the author did not find out exactly how relative clauses work in Ding, and just says this. Yay! And, as it turns out, agreement is not always assigned on the basis of gender. Below is a full overview of pages 64 and 65, where demonstratives are described. In Table 1, the demonstratives are given - for each gender (I-VII) and number (SG/PL), there is a separate demonstrative, which bears some similarity to the nominal markers above.

Mertens (1938: 64-65)

The first example on this page shows how gender agreement works: muur wu tɛɛn ntɛɛn 'cet homme bavarde' ('this man chatters') - muur means 'man', and it starts with the mu- prefix for singular gender I. wu means something like 'this', i.e. it is a demonstrative that indicates something is located close to the speaker. It agrees with the noun muur, and is also singular gender I. This is the status quo: different parts of the clause, such as the demonstrative, are marked for the same gender as the (subject) noun. However, many Bantu languages diverge from this by allowing agreement to not reflect gender assignment, but rather to reflect animacy. Mertens describes this just below the first set of examples: "a) Des noms de personnes, appartenant à d’autres classes que la première, en adoptent pourtant le démonstratif." That is, nouns that designate humans, even if they belong to a different gender, take the demonstrative of the singular gender I. The first gender typically contains nouns designating persons, such as kinship terms, and thus by assigning human nouns agreement in the first gender, the animacy status of these nouns is appropriately flagged. As Mertens continues, in Ding, the same is true for animals: "b) Même remarque pour les noms d’animaux" ('the same remark holds for animal names'), which is not unreasonable, as many animals in stories have human-like qualities. In Ding, animacy-based agreement is also found on verbs, but not for any of the other parts of the clause we looked at.

Francesca also had a great find this week on Bangi (bin), spoken in Congo, the Democratic Republic of Congo, and the Central African Republic. It is worth reading in full:

 Whitehead (1899: 8)

This passage contains so many (implicit) claims about language use, second language acquisition, contact-induced change, simplification, prescriptivism, and prestige that it is really something special. The processes of noun class reduction through contact-induced change are one of the things we are trying to study in this project. It also points out the problem that we face in trying to uncover animacy-based agreement, or other types of agreement - some grammar writers will not report on these deviations from 'regular' gender-based agreement.

To continue with the bad: this week I was also looking for information on Tsaangi (tsa), spoken in Congo and Gabon. I went looking for the following two references:

Loubelo, Fidèle. (1987) Description phonologique du itsaangi, parler de Madouma-Mossendjo. Brazzaville: Université Marien Ngouabi MA thesis.

Loubelo, Fidèle. (1990) Le nom en kitsa:ngi, langue bantoue du Congo. Dakar: Université Cheikh Anta Diop doctoral dissertation. 

Since one of these references focuses especially on the noun, I tried to look around on the internet to see if I could find the author and ask them whether they might be able to share their work. I found their name mentioned in at least three places (here and here and here), all stating that this person was murdered in late 1998, during the Republic of Congo Civil War. This could be another person with the same name, but the sources mention this person was a minister, and being a minister and a linguist may very well go hand in hand, as many people who study Bantu languages are ultimately employed by Christian organizations to translate the Bible. This was such a saddening finding. The wars fought in Africa and those still being fought today are horrible.

This leaves us at the ugly. I usually don't like to speak ill of other people's work, and this classification is partly in jest - normally I would classify the following simply as 'bad'. But this week I read a grammar of a Bantu language that did not discuss the gender system at all - despite there in all likelihood being one. This was a grammar of Teke-Eboo (ebo), spoken in Congo and the Democratic Republic of Congo, by Edouard Etsio. The grammar consists of four parts: preliminaries, history, and ethnography, p. 1-108; grammar, p. 109-168; texts with commentary, p. 169-268; and a lexicon, p. 269-312. The emphasis on ethnography suggests that the author was probably not a linguist, so I can forgive him - however, as gender systems are so prevalent in Bantu languages, I don't think I have ever before seen a grammar that does not talk about gender, even if it is mostly absent. On pages 131-132, the author discusses sex-based gender, of the type we find in much of Europe, with a female/male, or female/male/neuter distinction. However, this is not relevant for any Bantu language we have come across so far (when Bantu languages restructure their gender system so far as to end up with a 2-way or 3-way distinction, they do this on the basis of animacy, not sex). Then, on page 137 below, the author discusses the sentence Onké wu ombi o yaya "Une vilaine femme arrive" ('an ugly woman is arriving'), about mid-page.

Etsio (1999: 137)

At the bottom of that paragraph, the author writes that attributive adjectives agree with the noun they modify, ombi being the singular form, and abi the plural. These are in all likelihood prefixed adjectives, the root being -bi, the singular gender I agreement prefix being om-, and the plural gender I agreement prefix being a-. (To go into unnecessary detail, since wu is inserted between the noun onké and the 'adjective' ombi, it is likely that ombi is a noun meaning something like 'ugliness', and wu is a so-called associative marker, also marked for gender. Most Bantu languages do not have a large class of adjectives, and use nouns (in genitive constructions) and verbs (in relative clauses) instead.) However, despite the author noticing that these adjectives agree with the noun, there is no description of the gender system, or further description of other parts of the clause that also show agreement. He seems to have completely missed this feature - something which could only have happened if he didn't know that this is a relevant feature of Bantu languages, and if no proofreader of the grammar had this knowledge either. Weirdly enough, Etsio has also published a grammar of Lingala, also without describing its gender system.

Of course I may be entirely wrong here, but so far further investigations suggest Teke-Eboo has at least something resembling a traditional Bantu gender system that Etsio completely missed. For now, goodnight and hope the next week of grammar reading only brings good.


Etsio, Edouard. 1999. Parlons téké: langue et culture. (Collection Parlons.) Paris: L'Harmattan.

Etsio, Edouard. 2003. Parlons Lingala. Paris: L'Harmattan. 240pp.

Mertens, Joseph. 1935, 1938, 1939. Les Ba Dzing de la Kamtsha. (Mémoires / Institut Royal Colonial Belge, Section des Sciences Morales et politiques). Bruxelles: Campenhout. (3 vols.)

Whitehead, John. 1899. Grammar and Dictionary of the Bobangi Language. London: Baptist Missionary Society.

Wednesday, September 27, 2017

Linguistic map making: Drawing polygons

Hedvig has written on how Ethnologue has become even more restricted than it already was, and what resources are out there that could be used instead. One of the things I miss from Ethnologue is its maps - although at least recently it was still possible to access most of these, by downloading them instead of viewing them in your browser. In her post, Hedvig points out that Langscape can be used instead, and that's all great.

But what if you wanted to draw a map yourself? Especially one which you intend to publish? Some institutions may have access to the World Language Mapping System (WLMS), which lies at the core of Ethnologue's (and Langscape's) maps, and was made by Global Mapping International (which was recently closed, so the WLMS is now formally back with SIL). I'm not sure about the details (and the user agreement parts of the WLMS website are down), but paying a lot of money for the WLMS presumably enables users to draw their own maps and publish them, as long as they cite the source.

Not everyone has access to the World Language Mapping System, and even if you do, it is very likely that your specific needs are not covered by it. For example, have a look at the following map of the border between the Central African Republic, South Sudan, and the Democratic Republic of the Congo. As you can see, where these three countries meet in the center of the map, Zande, one of the biggest Ubangi languages, is spoken on all three sides of the border.


However, there are some Bantu languages spoken in this border area. One of them is Homa (hom), and as you can see by comparing its location on Glottolog with the Langscape map, it simply does not have a polygon in the World Language Mapping System / Ethnologue maps.


Homa is a small, underdescribed Bantu language, which according to Sommer (1992: 352) may be threatened with extinction. The only description of it, and an extremely sketchy one, is Santandrea (1963), who describes animacy-based agreement on adjectives, and suggests a heavily attrited gender system - something rather unusual for a Bantu language, with their generally extensive and healthy gender systems. This is the immediate reason for this post - together with Francesca Di Garbo I am looking at gender systems in Bantu, and I would very much like to be able to plot Homa on a map, not just with a point as in Glottolog, but with a polygon that I can color to indicate its special characteristics. A polygon rather than a point also makes it far clearer that this language is spoken in Zande country, far away from most of the rest of the Bantu languages.

Turns out, there is an extremely easy way to do this. One can use Google Earth to draw polygons anywhere on the world's surface, save them, and load them into R to make nice maps. The link to Google Earth is here (use within browser, wants Chrome), but you can download it here.

Once you open the Google Earth application, you can draw a polygon with the 'draw polygon' tool in the toolbar above the map. While the window is open, you can make the polygon by clicking on the map. Then you name it and save it as a kml file - described in much more detail here. This is the polygon I drew for Homa, see explanation below:

source: Google Earth

The location of Homa speakers according to Glottolog is close to Nagasi. Santandrea (1948: 81) states speakers of the language can be found in Tombora, and Sommer (1992: 352) puts their location "around towns of Mopoi and Tambura". As you can see on the Glottolog map, this is an area just north of where Glottolog puts the centroid for Homa. So, using Google Earth I drew a kind of oblong shape around these towns, the northwestern one being Tambura, and saved the polygon as Homa.kml. I do not know why there is a discrepancy between these sources and the Glottolog point; that is a story for another time.

Next, we can read the .kml file into R, and place it on a map. Please see code below.

# a library you need to read in .kml files (readOGR comes from rgdal)
library(rgdal)
Homa <- readOGR(dsn="Homa.kml")

# libraries you need to make the map (the "world2Hires" database comes from mapdata)
library(maps)
library(mapdata)

# plotting the map
map("world2Hires", xlim=c(23, 31), ylim=c(1, 8), boundary=TRUE)

# putting in country names so we can situate them
text(x = 30, y = 7.5, "South Sudan")
text(x = 26, y = 4.5, "Democratic Republic of Congo")
text(x = 24.5, y = 7, "Central African Republic")

# plot the Homa polygon on top of the existing map
plot(Homa, col ="magenta", add=TRUE)

The resulting plot looks like this:

This is only a very partial solution. If you wanted to draw a big map with many languages on it, it would be an enormous amount of work to go through the literature and surveys on where different languages are spoken. This work was already done, at least in part, by Ethnologue / the World Language Mapping System, and it is rather sad to do such work twice, or thrice, etc. However, as the Homa case points out, data on where languages are spoken may be missing in Ethnologue, or may be incomplete, or no longer correct. Especially when you know a particular area in detail, it may be worth drawing your own map, and Google Earth + R makes this very easy. Of course, it would be even better to use actual data to draw ethno-linguistic maps, and not a 70-year-old description, but for some areas of the world, that is something only for the very far future.
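
If readOGR is not available on your machine, the corner points of a simple polygon can also be pulled out of the .kml text with base R alone. The following is only a minimal sketch: it assumes the file holds a single polygon with one <coordinates> element (a run of "longitude,latitude,altitude" tuples separated by whitespace), and the coordinates below are invented for illustration, not the actual Homa polygon.

```r
# Hypothetical contents of a small .kml file. These coordinates are invented
# for illustration and are NOT the actual Homa polygon.
kml_text <- paste(
  "<Placemark><name>Example</name><Polygon><outerBoundaryIs><LinearRing>",
  "<coordinates>",
  "27.0,5.5,0 27.6,5.5,0 27.6,6.0,0 27.0,6.0,0 27.0,5.5,0",
  "</coordinates>",
  "</LinearRing></outerBoundaryIs></Polygon></Placemark>",
  sep = "\n"
)

# Pull out the text between the <coordinates> tags
coord_block <- regmatches(
  kml_text,
  regexpr("<coordinates>[^<]*</coordinates>", kml_text)
)
coord_text <- trimws(gsub("</?coordinates>", "", coord_block))

# Each whitespace-separated tuple is "longitude,latitude,altitude"
tuples <- strsplit(coord_text, "[[:space:]]+")[[1]]
parts <- strsplit(tuples, ",")
lon <- as.numeric(sapply(parts, `[`, 1))
lat <- as.numeric(sapply(parts, `[`, 2))

# Draw the polygon on an empty plot
plot(lon, lat, type = "n", xlab = "Longitude", ylab = "Latitude", asp = 1)
polygon(lon, lat, col = "magenta")
```

For a real file, kml_text could be filled with paste(readLines("Homa.kml"), collapse = "\n"), and the polygon() call on its own should add the shape to a map that has already been drawn.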


Santandrea, Stefano. 1948. Little Known Tribes of the Bahr El Ghazal. Sudan Notes and Records XXIX. 78-106.

Santandrea, Stefano. 1963. Short Notes on the Bɔdɔ, Huma and Kare Languages. Sudan Notes and Records 44. 82-99.

Sommer, Gabriele. 1992. A Survey on Language Death in Africa. In Brenzinger, Matthias (ed.), Language Death: Factual and Theoretical Explorations with Special Reference to East Africa, 301-413. Berlin/New York: Mouton de Gruyter.

Tuesday, September 19, 2017

Public service announcement: list of databases and more

Public service announcement: there are websites that keep a well-curated list of things that are useful to linguistics researchers and students, including the following:
It would appear that some don't know about these lists, so now you know/are reminded :).

Lists are good, and instead of reinventing them you can look through these and add to them. For more hopefully useful stuff like this, go here.

Monday, August 28, 2017

Ethnologue more restricted

In April this year, Ethnologue changed access restrictions to their website again. Non-paying users from high-income countries can now only access 1 page per month before they are banned; previously it was 7. In light of this change, we will go through some basics regarding the paywall again (old post here) and where you can go instead. Finally, I list some questions, should any SIL International/Ethnologue staff see this post.

Basics on the pay-wall
We haven't received much detailed information on this change, but if it's the same as last time, it means that users with IP addresses in countries that are classified by the World Bank as "high-income" will be restricted. Cloudflare would appear to be the service provider managing this for Ethnologue. Previously, we've learned that only 5% of users look at more than 7 pages per month. We don't know how many go to more than 1 page (probably a lot more though!).

SIL International also maintains the ISO 639-3 codes for language names (one of 6 language ISO-codes). Those pages are NOT affected by this restriction. Ethnologue and SIL International are not the same thing: SIL International produces more things than just Ethnologue.

Old editions of Ethnologue have different restrictions than the current edition.

Ethnologue is mainly funded by Wycliffe Global Alliance (an explicitly Christian organisation), and not by any state or academic institution. This information is based on what I understand from financial statements, so I may be mistaken. Clarification from Ethnologue/SIL International staff is highly appreciated here. Please note that there are many other ISO industry standards that are pay-access only, so the fact that SIL International provides 639-3 openly is fortunate.

It appears to us, the users, that SIL International have made these decisions to remedy a financial situation. It is not clear at this time if SIL International is seeking other ways of bringing in funds, like more traditional grants from research councils.

Where to go instead?
Much of the information that Ethnologue provides is actually available elsewhere. Here is a table displaying some of the places you could go to instead of Ethnologue.

Family trees: MultiTree, Glottolog
Codes: Glottolog, ISO 639-3 repository
Alternative names: Glottolog
Endangerment level: UNESCO Atlas of the World's Languages in Danger
Maps of language areas (polygon data): Langscape
Population stats: (Old Ethnologue editions), CIA World Factbook, Wikipedia

The information that Ethnologue provides that is the hardest to replace is population stats. The pages I most regret not being able to access are the overall summary stat pages; they're nice for showing the size of language families and the power law of speaker populations.

Here are some more details on some of the resources listed above.


MultiTree is by Linguist List and is a catalogue of lots of different language trees. You can search through the database for lots of different trees and compare them, very cool!

Glottolog is provided by the Max Planck Society, and edited by Harald Hammarström, Robert Forkel and Martin Haspelmath. Most of the detailed curation of the data is managed by Harald. Glottolog provides a lot of information, mainly language codes, trees, location (dots, not polygons) and references. Each tree in Glottolog has a clear reference to a published source, which is very handy. There is also clear information on how the classification is handled.

If you disagree with information you find on Glottolog, or want to add information, you can file a GitHub-issue or click the little alarm bell symbol on the relevant page.

UNESCO Atlas of Languages in Danger
This atlas is the complimentary online version of the 2010 print edition and is edited by Christopher Moseley. It contains information on 2,464 languages. This is the scale, and the number of languages at each level:
  • Vulnerable (592)
  • Definitely endangered (640)
  • Severely endangered (537)
  • Critically endangered (577)
  • Extinct (228)

Langscape is a website by the Maryland Language Science Center. They provide games, lesson material for teachers and - most interestingly to us - maps. These interactive maps are actually based on the polygon set of SIL International, and they're not freely available for download. They are, however, accessible in the interactive web browser interface. One way you can see that these are Ethnologue polygons is that the genealogical classification is the same. For example: Mande languages are marked as the same family as Bantu languages (not the case in Glottolog).

One of the games that Langscape has is an identification game, not that different from the Great Language Game that we wrote a paper about! We also made a new game, LingQuest, that you can play.

Alright, now you know. Best of luck with whatever research you have that is dependent on this kind of information.

Questions for Ethnologue/SIL International staff (should they be reading)

  1. will the ISO 639-3 codes ever be behind a pay-wall?
  2. are there other products of SIL International that may become like Ethnologue?
  3. is it still only in effect in high-income countries (according to the World Bank)?
  4. is it intentional that the Ethnoblog and the summary statistics pages are included under the pay-wall?
  5. how are Ethnologue and SIL International funded?
  6. have you considered other funding options?
  7. how do Ethnologue and SIL International see their own roles in modern academia and some academics' dependence on the data, despite these resources not being traditionally funded by academic institutions?
  8. why was the change made?
  9. was the change announced anywhere publicly?
  10. how many users access more than 1 page per month?
  11. how many users access more than 7 pages per month?
  12. how have the user stats changed over the past 2 years?
  13. how many of your users are commercial and how many are academic, by estimation?
Note that Ethnologue is not only used by the academic research community, but also by commercial and governmental institutions (for example in this scandal). In fact, considering the new restrictions on access and problems with the basic data (opaque decisions and sources), perhaps academics shouldn't really use Ethnologue much at all.

Tuesday, August 1, 2017

ELAN: making tier(s) out of search results

Hedvig in her office in Canberra figuring this out
and writing this guide.
Here is another guide for how to do something practical in ELAN. Previously, we relayed Eri Kashima's guide for sensible auto-segmentation with PRAAT and ELAN (time saver!). (For all posts about fieldwork on this blog, see this tag.)

This time: how to take your search results and make the matching annotations into new separate tier(s). This is useful if you for example want to cycle through only the annotations that match a certain search query in transcription mode. This post has a longer guide, and a short guide at the end.

You can also use this guide if you want to compare several different transcriptions with each other, for example older and newer versions or if you are collaborating with different people. In that case, start from step (4).

For those who don't do a lot of transcription: ELAN (EUDICO Linguistic Annotator) is a program from TLA at MPI-Nijmegen. This program allows us to easily annotate audio and/or video files with lots of relevant data. We can use ELAN to count things, but we can also export as CSV-files for analysis later (Excel, R, Libreoffice etc). ELAN is free and great. If you ever need to do transcription, do it in ELAN. Do not create long text-documents with no linking to the audio, it is just ridiculous. Download ELAN here.

Version of ELAN: 4.8.1 (to my knowledge though this should work the same for other versions)

We're going to:
  • search in a clever way
  • export those results
  • import them as new tier(s) into the .eaf-file you're working on
  • thus creating a tier with a defined subset of other existing tiers, making work speedier on targeted parts of your corpus
You can click the images for larger versions.

Example case
I've got a transcribed file where I've noticed some different pronunciation of a certain word. I'd like to pick out only the annotations containing that word, make a new tier with only them, and write down some clever things about this word in that tier. I don't want to have to scroll through all annotations to get to only these.

I work on Samoan, and the word I'm looking at means "to tell/explain": fa'amatala. "Fa'amatala" is the dictionary entry for this word, but it varies in pronunciation in actual speech. I've asked my transcription assistant to mark down vowel length and presence and absence of glottal stops (as opposed to more orthographic transcription). She has done this pretty consistently (as far as I can tell, it's hard to hear glottal stops sometimes), and since I know what kind of variations to expect I can easily find the instances for this word. Due to t and k-style (lects in Samoan) and speed these are the variations we can expect:
  • fa'amatala
  • fa:matala
  • famatala
  • fa'amakala
  • fa:makala
  • famakala
Besides the obvious difference in pronunciation, I've noticed something unusual going on in the realisation of t/k, sort of like an affricate. So, I'd like to listen to all instances of this word with all these spellings and make notes of that.

Here are the steps. At the end is a short guide for when you've started to get the hang of this but need basic guidance.

Step 1) clever searching
In ELAN we can search for simple words, but we can also do something a bit more clever: we can search using regular expressions. Now, you don't need to have a complicated query or know all regex magic to make use of this. In this case, we're simply going to use the 'OR'-function. 'OR' in regular expressions is expressed by the vertical line/pipe character: "|" .

So, I'm searching for "fa'amakala|fa:makala|famakala|fa'amatala|fa:matala|famatala" in the tier marked "transcription". No need for bracketing, asterisks or anything like that in this case. If you want to do more complicated things with regular expressions, I highly recommend this guide and cheat sheet for regular expressions in ELAN by Ulrike Mosel*.
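
As a minimal sketch of what the pipe does, the same query can be tried out in R on a handful of annotations. This uses R's own regex engine rather than ELAN's, so treat it as an illustration only. The last annotation and the compact pattern are additions for the example and do not come from the corpus.

```r
# Three annotations from the corpus plus one hypothetical non-matching line
annotations <- c(
  "uma fa'amatala i a'u i le tala o le video",
  "fa:makala loa le!",
  "le kusi la ga ae kago famakala aka",
  "talofa lava"  # hypothetical annotation, added only to show a non-match
)

# The query as typed into the search window: six spellings joined by "|" (OR)
pattern <- "fa'amakala|fa:makala|famakala|fa'amatala|fa:matala|famatala"

# TRUE for every annotation that contains at least one of the spellings
hits <- grepl(pattern, annotations)
print(annotations[hits])

# An assumed shorthand for the same six spellings:
# "fa", then optionally "'a" or ":", then "ma", then "t" or "k", then "ala"
compact <- "fa('a|:)?ma[tk]ala"
print(identical(grepl(compact, annotations), hits))
```

The long form is easier to read and to edit by hand, which is probably why it is the better choice for a one-off search; the compact form only pays off once the list of spellings gets long.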

Search query results

Here are our search results:
  • uma fa'amatala i a'u i le tala o le video 
  • fa:makala loa le!
  • fa:makala?
  • fa:makala ka:maloa lale e 
  • ma: e mafai ona e fa:matala mai fapefea le vaitaimi na'e tuputupu 'ae i: falealupo
  • mafai ona e fa'amatala i a'u 
  • fa'amatala?
  • i e mafai ona e fa:matala i le ese'esega o gagana sa:moa 
  • e mafai ona e fa'amatala i le tala le lenei 
  • i fasa:moa, fa'amolemole fa'amatala i le a
  • le kusi la ga ae kago famakala aka 
  • o: mai o le se famakala aku le mea 
  • fa:makala uma ?
  • e ke kago famakala le aka 
That looks good! Not all variations we thought might exist occurred (we didn't get "famatala"), but that's normal. (In fact, specifically not getting that form is expected. Shortening of vowel + the t-lect should not co-occur often, if we believe what Mayer, Ochs and others have said about Samoan variation.)
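
If you later want to tally which spellings actually turned up, a few lines of R are enough. This is only a sketch: the annotations are the search results listed above, and the compact pattern is an assumed shorthand for the six spellings in the query, not something ELAN produces.

```r
# The search results listed above
annotations <- c(
  "uma fa'amatala i a'u i le tala o le video",
  "fa:makala loa le!",
  "fa:makala?",
  "fa:makala ka:maloa lale e",
  "ma: e mafai ona e fa:matala mai fapefea le vaitaimi na'e tuputupu 'ae i: falealupo",
  "mafai ona e fa'amatala i a'u",
  "fa'amatala?",
  "i e mafai ona e fa:matala i le ese'esega o gagana sa:moa",
  "e mafai ona e fa'amatala i le tala le lenei",
  "i fasa:moa, fa'amolemole fa'amatala i le a",
  "le kusi la ga ae kago famakala aka",
  "o: mai o le se famakala aku le mea",
  "fa:makala uma ?",
  "e ke kago famakala le aka"
)

# Assumed shorthand for the six spellings in the search query
pattern <- "fa('a|:)?ma[tk]ala"

# Extract the first matching spelling from each annotation and count them
variants <- regmatches(annotations, regexpr(pattern, annotations))
counts <- table(variants)
print(counts)

# Cross-tabulate the shape of the first syllable against t versus k
vowel <- ifelse(grepl("'a", variants), "fa'a",
                ifelse(grepl(":", variants), "fa:", "fa"))
lect <- ifelse(grepl("k", variants), "k", "t")
print(table(vowel, lect))
```

The cross-tabulation makes it easy to see at a glance which combinations of vowel shape and t/k do not occur in the file.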

If you want to edit your search query, you don't need to start all over. Just click the search window again right there over your results, it'll be editable again. (This took me a while to realize.)

Step 2) exporting the search results
This is very easy: in the search window you have up, go to "Query>Export" and choose to export as tab-delimited text.
Export search query results
Exporting search results dialogue window
Name your file something sensible, and put it in a good place. Now let's have a look at said file outside of ELAN, shall we? The file will have the file-extension ".txt", but it is a tab-separated file (".tsv"). Open it in some spreadsheet program (excel, numbers, libreoffice, google sheets, whathaveyou) and it should look a little something like this:

Search results file opened in Excel, specifying tab as delimiter.
That looks kinda alright, doesn't it? There's no headings, but we can figure this out. There's some things in there that we didn't ask to have, for example the first column is the file location. That's not needed for what we're doing, and I'll show you how to handle that in the next step. Don't worry.

Step 3) creating tier(s) out of the search results
Now we go back to ELAN and we import this file as a tier. What will happen here is that an entirely new .eaf-file will be created; the tier will actually not be imported directly into whichever file you currently have open. This means that it doesn't matter which .eaf-file you currently have open when you import (or indeed if any is open). Counterintuitive, I know, but don't worry - I've figured it out. It's not that complicated, just stay with me.

File>Import> CSV/Tab-delimited Text file

Importing CSV/Tab-delimited Text file
Next up you will get a window asking you questions about the file you're trying to import. Remember how the file didn't have headings for the columns? How will we figure out what is what? Not to worry, it's like this:

1 col: ignore (uncheck)
2 col: Tier
3 col: Begin time
4 col: ignore (uncheck)
5 col: End time
6 col: ignore (uncheck)
7 col: Duration (not sure why this is needed but oh well)
8 col: ignore (uncheck)
9 col: Annotation

Import CSV/Tab-delimited Text file dialogue window.
I wish that ELAN had a way of automatically recognizing its own search output, but it doesn't and we know how to do this anyway so it's all good. No need to specify the other options, just leave them unchecked.
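
The same column layout can be used if you want to look at the exported file in R rather than in a spreadsheet. The following is a minimal sketch under stated assumptions: the two rows are invented stand-ins for an export file (the path, the time values and the "-" placeholders are all hypothetical), and the column names are labels chosen here, not names that ELAN provides.

```r
# Two hypothetical rows in the 9-column, header-less layout described above.
# The path, times and "-" placeholders are invented for illustration.
export_lines <- c(
  paste("/path/to/file.eaf", "transcription", "1.0", "-", "3.5", "-", "2.5",
        "-", "fa:makala loa le!", sep = "\t"),
  paste("/path/to/file.eaf", "transcription", "7.0", "-", "9.2", "-", "2.2",
        "-", "mafai ona e fa'amatala i a'u", sep = "\t")
)

# quote = "" keeps apostrophes (glottal stops) from being read as quote marks
results <- read.delim(text = export_lines, header = FALSE, sep = "\t",
                      quote = "", stringsAsFactors = FALSE)

# Column labels following the list above; the names themselves are made up
names(results) <- c("file", "tier", "begin_time", "ignored_1", "end_time",
                    "ignored_2", "duration", "ignored_3", "annotation")

# Keep only the columns that the import dialogue asks for
useful <- results[, c("tier", "begin_time", "end_time", "duration", "annotation")]
print(useful)
```

For a real export, the text argument would be replaced by the path to the .txt file that ELAN wrote.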
An actual ghost

Now you will have a new .eaf-file with the same name as the file with the search results. This file will contain only the tier(s) you had searched within and only the annotations matching the search query. There's no audio file and no other tiers. It's like a ghost tier, haunting the void of empty silence of this lonely .eaf-file.
A lonely ghost tier in an otherwise empty .eaf-file
Save this file and other files currently open in some clever place(s), quit ELAN and then restart ELAN. Sometimes there seems to be a problem for ELAN to accurately see files later on in this process unless you do this. I don't know why this is, but saving, closing and restarting seems to help, so let's just do that :)!
Chris O'Dowd as Roy Trenneman in IT-crowd
Step 4) importing the search results tier into the original file
Now here's where I slightly lied to you: we're not going to import the tier into your file. We're going to merge the search-results-tier-only-file with the other .eaf -file that has all the audio and other tiers and the result is going to be a new .eaf-file. So you'll have three files by the end of this:
  • a) your original .eaf-file with audio and lotsa tiers
  • b) your .eaf-file with only the search results-tier and no audio etc (ghost-tier)
  • c) a new merged file consisting of the two above listed
Don't worry, I've got this.  I'm henceforth going to call these files (a), (b) and (c) as indicated above.

Open file (a). Select "Merge Transcriptions..."

File>Merge >Transcriptions...

Select Merge transcriptions
Now, select file (a) as the current transcription (this is default anyway), file (b) as the second source and choose a name and location for the new file, file (c), in the "Destination" window. You can think of "Destination" as "Save as.." for file (c) - our new file.

Specifying what should be merged and how
Do not, I repeat, do not append. And no need to worry about linked media, because (b) doesn't have any audio or anything (remember, it's a ghost). Just leave all those boxes unchecked.

Let ELAN chug away with the merging, and then you're done!

Step 5) finished!
Tadaaa! We're done! That wasn't so bad, was it? And look at what we've created!

Here's my merged file - file (c). I've taken the search-results tier and renamed it ("famakala"). I also copied it and renamed that one ("famakala - comments"). That way, I have a tier for making comments about the transcription annotation that has the exact same annotation distributions, but different values.
Final merged file in annotation mode, with the search results tier renamed and copied.
Here's the same file in the transcription mode, configured to only show the two tiers targeting the search query:
Final merged file in transcription mode, showing only the search results tiers.
Now, some final notes:
  • You might want to rename file (c) and delete files (a) and (b), for your own sanity later when managing the files, if for nothing else
  • Don't know how to get to transcription mode? Go to "Options>Transcription Mode".
  • Your tiers aren't showing up properly in transcription mode? Check that the "linguistic types" of the tiers are what you think they are and that those are the types you've configured transcription mode to show. Transcription mode can only show you tiers of one linguistic type at once (unless you use columns, but that's complex). I don't really get it either, but then again I barely get "linguistic types" at all
  • Transcription mode getting clogged up with lots of irrelevant tiers? Go to "Configure..." on the left in the transcription mode window, select the right linguistic type and click "Select tiers.." in the bottom left. Tick only the tiers you want to see at that moment
  • You can import several tiers at once with this method; you don't have to merge one search result at a time, see below
  • You might want to do something complicated related to speakers, see below
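If you'd rather check the linguistic types outside of ELAN, here's a rough Python sketch that lists which tiers belong to which linguistic type in an .eaf-file. It assumes the .eaf-file is plain XML where each tier is a TIER element carrying TIER_ID and LINGUISTIC_TYPE_REF attributes; the function name is my own invention, not anything ELAN provides.

```python
import sys
import xml.etree.ElementTree as ET
from collections import defaultdict


def tiers_by_linguistic_type(eaf_source):
    """Group tier names by linguistic type.

    eaf_source is a path to an .eaf-file or an open file object.
    Assumes TIER elements with TIER_ID and LINGUISTIC_TYPE_REF attributes.
    """
    root = ET.parse(eaf_source).getroot()
    grouped = defaultdict(list)
    for tier in root.iter("TIER"):
        grouped[tier.get("LINGUISTIC_TYPE_REF")].append(tier.get("TIER_ID"))
    return dict(grouped)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python list_tiers.py merged_file.eaf
    for ling_type, tiers in tiers_by_linguistic_type(sys.argv[1]).items():
        print(ling_type, "->", ", ".join(tiers))
```

Tiers that end up under different linguistic types in this listing are the ones that won't show up together in transcription mode.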
Several tiers at once
You can either search several tiers at once in the search mode and hence have several tiers in the search query output, or you could do several searches separately and then append the resulting tsv-files together afterwards in your spreadsheet-program. If there is a different value in the "Tier" column, ELAN will make several tiers when importing back as an .eaf-file. So, you can do several tiers at once.
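If you don't feel like copy-pasting in a spreadsheet program, appending the exports can also be done with a few lines of Python. This is only a sketch: the function names are made up, and it assumes every export contains nothing but data rows in the same column order, with the tier name in the second column.

```python
def append_exports(paths, out_path):
    """Concatenate tab-delimited exports into one file, skipping blank lines.

    Returns the number of rows written.
    """
    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    if line.strip():
                        out.write(line.rstrip("\r\n") + "\n")
                        kept += 1
    return kept


def tier_names(path, tier_col=1):
    """Distinct values in the Tier column (second column, index 1 counting from 0)."""
    names = []
    with open(path, encoding="utf-8") as src:
        for line in src:
            cells = line.rstrip("\r\n").split("\t")
            if len(cells) > tier_col and cells[tier_col] not in names:
                names.append(cells[tier_col])
    return names
```

Running tier_names on the combined file tells you how many tiers to expect when you import it back.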

Speaker tiers
Everyone organises their ELAN-files differently. I have a separate tier where I indicate who the speaker is in the annotation (see above screenshots). This is in contrast to how a lot of other people do it, with different tiers for different speakers. This means that I can search many speakers at the same time, or condition the search for "when X is indicated in speaker-ID-tier". 

If you're doing different tiers for different speakers, you might have to figure out something a bit different from me in order to search many speakers at the same time. It's not that difficult though, you just have to meddle a bit with the search query (or just search one speaker at a time). Contact me if you want help.

On a related note, if someone were ever to ask me to put separate speakers on different tiers, I could use the above process to separate out only the annotations with a certain value in the speaker-tier and then import them back as one tier per speaker. I'd rather not, I like it this way. But I like making sure that the way I set things up can be reconfigured to please others as well. Flexibility is good: don't lock yourself into a set-up so narrow that you can't change it without losing data.

That granted, I need to do manual fidgety things for overlapping speech given this model. That's inconvenient, but I'm ok with it.
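For the curious, the "one tier per speaker" conversion mentioned above could look roughly like this in Python, once the search results are exported. Big assumptions here: each row has already been reduced to (tier, begin, end, annotation), and a speaker annotation has exactly the same begin and end time as the transcription annotation it belongs to. The tier names and the "@" naming are hypothetical, and overlapping speech would still need manual work.

```python
def split_by_speaker(rows, speaker_tier, text_tier):
    """Re-label transcription rows with one tier name per speaker.

    rows: list of (tier, begin, end, annotation) tuples.
    Spans are matched on identical begin and end values (an assumption).
    Rows without a matching speaker label are kept under "unknown".
    """
    speaker_at = {
        (begin, end): annotation
        for tier, begin, end, annotation in rows
        if tier == speaker_tier
    }
    result = []
    for tier, begin, end, annotation in rows:
        if tier != text_tier:
            continue
        speaker = speaker_at.get((begin, end), "unknown")
        result.append((f"{text_tier}@{speaker}", begin, end, annotation))
    return result
```

Written back out as a tab-delimited file, each distinct value in the first position should then come back into ELAN as its own tier.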

Short guide
Step 1) Clever searching
Step 2) export search results
  • Query>Export (Save as tab-delimited text file)
Step 3) create new tier
  • File>Import> CSV/Tab-delimited Text file
  • Specify columns (col 1: ignore, col 2: Tier, col 3: Begin time, col 4: ignore, col 5: End time, col 6: ignore, col 7: Duration, col 8: ignore, col 9: Annotation)
  • Save new .eaf-file. 
  • Quit and restart ELAN
Step 4) Creating merged file
  • Open original file with audio and other tiers
  • File>Merge transcriptions...
  • Select .eaf-file with search results as second source (do not append)
  • Save new merged file
  • Delete superfluous files
Step 5) done
  • rename and copy tiers if necessary
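
If you want to peek at the exported file before importing it, this small Python sketch pulls out the columns that matter, following the column layout given in Step 3 of the short guide (tier in column 2, begin time in column 3, end time in column 5, annotation in column 9). The function name is made up, and your export may be laid out differently, so treat the positions as assumptions to check against your own file.

```python
# Positions counted from 0, so "column 2" in the guide is index 1, and so on.
TIER, BEGIN, END, ANNOTATION = 1, 2, 4, 8


def parse_export_line(line):
    """Return (tier, begin, end, annotation) from one tab-delimited export line."""
    cells = line.rstrip("\r\n").split("\t")
    if len(cells) <= ANNOTATION:
        raise ValueError(f"expected at least 9 columns, got {len(cells)}")
    return cells[TIER], cells[BEGIN], cells[END], cells[ANNOTATION]


def read_export(path):
    """Parse every non-blank line of an exported search result file."""
    with open(path, encoding="utf-8") as src:
        return [parse_export_line(line) for line in src if line.strip()]
```

If parse_export_line raises an error on your file, the column positions probably don't match, and that's worth knowing before you tell the import dialogue which column is which.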

I'm sure there's other ways of doing this, but this is what has worked well for me. I'd like this to be easier in ELAN, but in the meantime this works so I'm gonna do it like this.

I find, in general, that I learn more about ELAN and other similar tools by just trying lots of different things and probing the system. Sure, there are manuals, but they often envisage a different usage than the one I'm after. For example, I'm not clear on what I actually gain from "linguistic types" for what I want to do. Never mind: probing, searching and sharing seem to be the best way to go for tailored functions. Usually, what you can conceptually imagine as a useful thing exists somewhere (it's like rule 34 but for software). I didn't know how this worked until I thought to myself: "there must be a way of importing search results". And lo and behold, there is. Now here's something I've learned and that you now can do too! Good luck!
Good bye!
Richard Ayoade as Maurice Moss in IT-crowd
* No, I don't know why it is that two linguists who are working/worked on specifically Samoan are trying to teach other linguists to use regular expressions in ELAN. Must be something in the water.
Ulrike Mosel and Hedvig Skirgård (yours truly) in Canberra
Samoan water, Neiafu-Tai village
