Tuesday, August 1, 2017

ELAN: making tier(s) out of search results

Hedvig in her office in Canberra figuring this out
and writing this guide.
Here is another guide for how to do something practical in ELAN. Previously, we relayed Eri Kashima's guide for sensible auto-segmentation with PRAAT and ELAN (time saver!). (For all posts about fieldwork on this blog, see this tag.)

This time: how to take your search results and make the matching annotations into new separate tier(s). This is useful if, for example, you want to cycle through only the annotations that match a certain search query in transcription mode. This post has a longer guide, and a short guide at the end.

For those who don't do a lot of transcription: ELAN (EUDICO Linguistic Annotator) is a program from TLA at MPI-Nijmegen. This program allows us to easily annotate audio and/or video files with lots of relevant data. We can use ELAN to count things, but we can also export as CSV-files for analysis later (Excel, R, LibreOffice etc). ELAN is free and great. If you ever need to do transcription, do it in ELAN. Do not create long text-documents with no linking to the audio; it is just ridiculous. Download ELAN here.

Version of ELAN: 4.8.1 (though to my knowledge this should work the same in other versions)

We're going to:
  • search in a clever way
  • export those results
  • import them as new tier(s) into the .eaf-file you're working on
  • thus creating a tier with a defined subset of other existing tiers, making work speedier on targeted parts of your corpus

Example case
I've got a transcribed file where I've noticed some different pronunciation of a certain word. I'd like to pick out only the annotations containing that word, make a new tier with only them, and write down some clever things about this word in that tier. I don't want to have to scroll through all annotations to get to only these.

I work on Samoan, and the word I'm looking at means "to tell/explain": fa'amatala. "Fa'amatala" is the dictionary entry for this word, but it varies in pronunciation in actual speech. I've asked my transcription assistant to mark down vowel length and the presence or absence of glottal stops (as opposed to a more orthographic transcription). She has done this pretty consistently (as far as I can tell, it's hard to hear glottal stops sometimes), and since I know what kind of variations to expect I can easily find the instances of this word. Due to t- and k-style (lects in Samoan) and speed of speech, these are the variations we can expect:
  • fa'amatala
  • fa:matala
  • famatala
  • fa'amakala
  • fa:makala
  • famakala
Besides the obvious difference in pronunciation, I've noticed something unusual going on in the realisation of t/k, sort of like an affricate. So, I'd like to listen to all instances of this word with all these spellings and make notes of that.

Here are the steps. At the end is a short guide for when you've started to get the hang of this but need basic guidance.

Step 1) clever searching
In ELAN we can search for simple words, but we can also do something a bit more clever: we can search using regular expressions. Now, you don't need to have a complicated query or know all regex magic to make use of this. In this case, we're simply going to use the 'OR'-function. 'OR' in regular expressions is expressed by the vertical line/pipe character: "|" .

So, I'm searching for "fa'amakala|fa:makala|famakala|fa'amatala|fa:matala|famatala" in the tier marked "transcription". No need for bracketing, asterisks or anything like that in this case. If you want to do more complicated things with regular expressions, I highly recommend this guide and cheat sheet for regular expressions in ELAN by Ulrike Mosel*.
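If you'd like to try the query outside ELAN first, here is a minimal Python sketch of the same OR search. Two assumptions to flag: ELAN is a Java program, so its regex flavour may differ from Python's in the details (a plain "|" alternation like this one should behave the same in both, as far as I know), and the compact pattern is my own shorthand for the six spellings, not something taken from the ELAN guide.

```python
import re

# The six spellings from this post, joined with "|" (regex OR).
# This is the same string that goes into ELAN's search box.
QUERY = "fa'amakala|fa:makala|famakala|fa'amatala|fa:matala|famatala"

# A more compact equivalent (my own shorthand, an assumption):
#   fa          start of the word
#   (?:'a|:)?   optionally glottal stop + a, or a length mark
#   ma[tk]ala   rest of the word in either t- or k-style
COMPACT = r"fa(?:'a|:)?ma[tk]ala"

variants = QUERY.split("|")


def matches(pattern, text):
    """Return True if the pattern is found anywhere in the text."""
    return re.search(pattern, text) is not None


# Show, for each spelling, whether each pattern finds it.
for word in variants:
    print(word, matches(QUERY, word), matches(COMPACT, word))
```

The long version is easier to read and to edit by hand, so that is what I'd stick with unless the list of spellings gets unwieldy.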

Search query results

Here are our search results:
  • uma fa'amatala i a'u i le tala o le video 
  • fa:makala loa le!
  • fa:makala?
  • fa:makala ka:maloa lale e 
  • ma: e mafai ona e fa:matala mai fapefea le vaitaimi na'e tuputupu 'ae i: falealupo
  • mafai ona e fa'amatala i a'u 
  • fa'amatala?
  • i e mafai ona e fa:matala i le ese'esega o gagana sa:moa 
  • e mafai ona e fa'amatala i le tala le lenei 
  • i fasa:moa, fa'amolemole fa'amatala i le a
  • le kusi la ga ae kago famakala aka 
  • o: mai o le se famakala aku le mea 
  • fa:makala uma ?
  • e ke kago famakala le aka 
That looks good! Not all variations we thought might exist occurred (we didn't get "famatala"), but that's normal. (In fact, specifically not getting that form is expected. Shortening of vowel + the t-lect should not co-occur often, if we believe what Mayer, Ochs and others have said about Samoan variation.)
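To see at a glance which spellings actually turned up, the results can be tallied. This is a small Python sketch run on the annotations listed above, pasted in by hand; the function name is my own, and nothing here is part of ELAN.

```python
import re
from collections import Counter

# The search results quoted above, one annotation per string.
results = [
    "uma fa'amatala i a'u i le tala o le video",
    "fa:makala loa le!",
    "fa:makala?",
    "fa:makala ka:maloa lale e",
    "ma: e mafai ona e fa:matala mai fapefea le vaitaimi na'e tuputupu 'ae i: falealupo",
    "mafai ona e fa'amatala i a'u",
    "fa'amatala?",
    "i e mafai ona e fa:matala i le ese'esega o gagana sa:moa",
    "e mafai ona e fa'amatala i le tala le lenei",
    "i fasa:moa, fa'amolemole fa'amatala i le a",
    "le kusi la ga ae kago famakala aka",
    "o: mai o le se famakala aku le mea",
    "fa:makala uma ?",
    "e ke kago famakala le aka",
]

# Same OR query as used in the ELAN search.
VARIANT = re.compile("fa'amakala|fa:makala|famakala|fa'amatala|fa:matala|famatala")


def count_variants(annotations):
    """Count how often each spelling occurs across the annotations."""
    counts = Counter()
    for text in annotations:
        counts.update(VARIANT.findall(text))
    return counts


counts = count_variants(results)
for spelling, n in counts.most_common():
    print(spelling, n)
```

Spellings that never occur (here "famatala" and "fa'amakala") simply don't show up in the tally.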

If you want to edit your search query, you don't need to start all over. Just click the search field again, right there above your results, and it'll be editable again. (This took me a while to realize.)

Step 2) exporting the search results
This is very easy: in the search window you have open, go to "Query>Export" and choose to export as tab-delimited text.
Export search query results
Exporting search results dialogue window
Name your file something sensible, and put it in a good place. Now let's have a look at said file outside of ELAN, shall we? The file will have the file-extension ".txt", but its content is tab-separated (what is often saved as ".tsv"). Open it in some spreadsheet program (Excel, Numbers, LibreOffice, Google Sheets, whathaveyou) and it should look a little something like this:

Search results file opened in Excel, specifying tab as delimiter.
That looks kinda alright, doesn't it? There are no headings, but we can figure this out. There are some things in there that we didn't ask for; for example, the first column is the file location. That's not needed for what we're doing, and I'll show you how to handle that in the next step. Don't worry.

Step 3) creating tier(s) out of the search results
Now we go back to ELAN and we import this file as a tier. What will happen here is that an entirely new .eaf-file will be created; the tier will not actually be imported directly into whichever file you currently have open. This means that it doesn't matter which .eaf-file you currently have open when you import (or indeed whether any is open). Counterintuitive, I know, but don't worry - I've figured it out. It's not that complicated, just stay with me.

File>Import> CSV/Tab-delimited Text file

Importing CSV/Tab-delimited Text file
Next up you will get a window asking you questions about the file you're trying to import. Remember how the file didn't have headings for the columns? How will we figure out what is what? Not to worry, it's like this:

1 col: ignore (uncheck)
2 col: Tier
3 col: Begin time
4 col: ignore (uncheck)
5 col: End time
6 col: ignore (uncheck)
7 col: Duration (not sure why this is needed but oh well)
8 col: ignore (uncheck)
9 col: Annotation

Import CSV/Tab-delimited Text file dialogue window.
I wish that ELAN had a way of automatically recognizing its own search output, but it doesn't and we know how to do this anyway so it's all good. No need to specify the other options, just leave them unchecked.
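If you'd rather peek at the exported file with a script than with a spreadsheet, here is a rough Python sketch that reads the tab-delimited export and keeps only the columns ticked above. The column positions come from this post; the sample row (path, times, placeholder values) is made up for illustration, and the function name is my own.

```python
import csv
import io

# Column positions as described in this post
# (1-based in the text, 0-based here).
COLUMNS = {"tier": 1, "begin": 2, "end": 4, "duration": 6, "annotation": 8}


def read_export(handle):
    """Yield one dict per row of a tab-delimited ELAN search export."""
    # QUOTE_NONE: treat apostrophes and quotes in annotations as plain text.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 9:
            continue  # skip blank or malformed lines
        yield {name: row[index] for name, index in COLUMNS.items()}


# Hypothetical one-row sample; every value here is invented.
sample = "\t".join([
    "/path/to/recording.eaf",   # col 1: file location (ignored)
    "transcription",            # col 2: tier
    "00:00:01.000",             # col 3: begin time
    "(ignored)",                # col 4
    "00:00:02.500",             # col 5: end time
    "(ignored)",                # col 6
    "00:00:01.500",             # col 7: duration
    "(ignored)",                # col 8
    "fa:makala?",               # col 9: annotation
]) + "\n"

rows = list(read_export(io.StringIO(sample)))
print(rows[0])
```

For a real file, something like open("my_search_results.txt", encoding="utf-8", newline="") in place of the io.StringIO should work, assuming the export was saved as UTF-8.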
An actual ghost

Now you will have a new .eaf-file with the same name as the file with the search results. This file will contain only the tier(s) you had searched within and only the annotations matching the search query. There's no audio file and no other tiers. It's like a ghost tier, haunting the void of empty silence of this lonely .eaf-file.
A lonely ghost tier in an otherwise empty .eaf-file
Save this file and any other files currently open in some clever place(s), quit ELAN and then restart ELAN. Sometimes ELAN seems to have trouble seeing the files later on in this process unless you do this. I don't know why this is, but saving, closing and restarting seems to help, so let's just do that :)!
Chris O'Dowd as Roy Trenneman in IT-crowd
Step 4) importing the search results tier into the original file
Now here's where I slightly lied to you: we're not going to import the tier into your file. We're going to merge the search-results-tier-only file with the other .eaf-file that has all the audio and other tiers, and the result is going to be a new .eaf-file. So you'll have three files by the end of this:
  • a) your original .eaf-file with audio and lotsa tiers
  • b) your .eaf-file with only the search results-tier and no audio etc (ghost-tier)
  • c) a new merged file consisting of the two above listed
Don't worry, I've got this.  I'm henceforth going to call these files (a), (b) and (c) as indicated above.

Open file (a). Select "Merge Transcriptions..."

File>Merge Transcriptions...

Select Merge transcriptions
Now, select file (a) as the current transcription (this is default anyway), file (b) as the second source and choose a name and location for the new file, file (c), in the "Destination" window. You can think of "Destination" as "Save as.." for file (c) - our new file.

Specifying what should be merged and how
Do not, I repeat, do not append. And no need to worry about linked media, because (b) doesn't have any audio or anything (remember, it's a ghost). Just leave all those boxes unchecked.

Let ELAN chug away with the merging, and then you're done!

Step 5) finished!
Tadaaa! We're done! That wasn't so bad, was it? And look at what we've created!

Here's my merged file - file (c). I've taken the search-results tier and renamed it ("famakala"). I also copied it and renamed that one ("famakala - comments"). That way, I have a tier for making comments about the transcription annotation that has the exact same annotation distributions, but different values.
Final merged file in annotation mode, with the search results tier renamed and copied.
Here's the same file in the transcription mode, configured to only show the two tiers targeting the search query:
Final merged file in transcription mode, showing only the search results tiers.
Now, some final notes:
  • You might want to rename file (c) and delete files (a) and (b), for your own sanity later when managing the files, if for nothing else
  • Don't know how to get to transcription mode? Go to "Options>Transcription Mode".
  • Your tiers aren't showing up properly in transcription mode? Check that the "linguistic types" of the tiers are what you think they are and that that's what you've configured to see in transcription mode. Transcription mode can only show you tiers of one linguistic type at once (unless you use several columns, but that is more complex). I don't really get it either, but then again I barely get "linguistic types" at all
  • Transcription mode getting clogged up with lots of irrelevant tiers? Go to "Configure..." on the left in the transcription mode window, select the right linguistic type and click "Select tiers.." in the bottom left. Tick only the tiers you want to see at that moment
  • You can import several tiers at once by this method, you don't have to merge one search result at a time, see below
  • You might want to do something complicated related to speakers, see below
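For the notes above about checking the merged file and the linguistic types of its tiers: an .eaf-file is XML, so a few lines of Python can list what is in it. This is a sketch under assumptions: the element and attribute names (TIER, TIER_ID, LINGUISTIC_TYPE_REF, ANNOTATION) are how I understand the EAF format, so check them against one of your own files, and the sample document is a tiny hand-written stand-in, not a complete .eaf.

```python
import io
import xml.etree.ElementTree as ET


def list_tiers(source):
    """Return (tier id, linguistic type, number of annotations) per tier.

    `source` can be a file path or an open file object.
    """
    root = ET.parse(source).getroot()
    return [
        (
            tier.get("TIER_ID"),
            tier.get("LINGUISTIC_TYPE_REF"),
            len(tier.findall("ANNOTATION")),
        )
        for tier in root.iter("TIER")
    ]


# Hand-written stand-in for a merged file; a real .eaf has much more in it.
SAMPLE = """<ANNOTATION_DOCUMENT>
  <TIER TIER_ID="transcription" LINGUISTIC_TYPE_REF="default-lt">
    <ANNOTATION>
      <ALIGNABLE_ANNOTATION ANNOTATION_ID="a1" TIME_SLOT_REF1="ts1" TIME_SLOT_REF2="ts2">
        <ANNOTATION_VALUE>fa:makala?</ANNOTATION_VALUE>
      </ALIGNABLE_ANNOTATION>
    </ANNOTATION>
  </TIER>
  <TIER TIER_ID="famakala" LINGUISTIC_TYPE_REF="default-lt"/>
</ANNOTATION_DOCUMENT>"""

tiers = list_tiers(io.StringIO(SAMPLE))
for tier_id, linguistic_type, n_annotations in tiers:
    print(tier_id, linguistic_type, n_annotations)
```

If two tiers you expect to see side by side in transcription mode turn out to have different linguistic types, that is probably why one of them is missing.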
Several tiers at once
You can either search several tiers at once in the search mode and hence have several tiers in the search query output, or you could do several searches separately and then append the resulting tsv-files together afterwards in your spreadsheet-program. If there is a different value in the "Tier" column, ELAN will make several tiers when importing back as an .eaf-file. So, you can do several tiers at once.
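If you'd rather not do the appending by hand in a spreadsheet, here is a minimal Python sketch that stitches several exported files into one. The file names and tier names are made up for the demo, and I'm assuming the exports were saved as UTF-8.

```python
import os
import tempfile


def concat_exports(paths, out_path):
    """Append several tab-delimited export files into one file."""
    with open(out_path, "w", encoding="utf-8", newline="") as out:
        for path in paths:
            with open(path, encoding="utf-8", newline="") as src:
                for line in src:
                    if line.strip():  # drop blank lines
                        out.write(line.rstrip("\r\n") + "\n")


# Demo with two invented one-row exports (nine tab-separated columns each).
workdir = tempfile.mkdtemp()
first = os.path.join(workdir, "search_a.txt")
second = os.path.join(workdir, "search_b.txt")
merged = os.path.join(workdir, "combined.txt")

with open(first, "w", encoding="utf-8") as f:
    f.write("file\ttranscription\tb\tx\te\tx\td\tx\tfa:makala?\n")
with open(second, "w", encoding="utf-8") as f:
    f.write("file\ttranslation\tb\tx\te\tx\td\tx\tto explain?\n\n")

concat_exports([first, second], merged)

with open(merged, encoding="utf-8") as f:
    combined_lines = f.read().splitlines()
print(len(combined_lines), "rows")
```

The combined file can then be imported in the same way as a single export; the different values in the "Tier" column are what should give you separate tiers.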

Speaker tiers
Everyone organises their ELAN-files differently. I have a separate tier where I indicate who the speaker is in the annotation (see above screenshots). This is in contrast to how a lot of other people do it, with different tiers for different speakers. This means that I can search many speakers at the same time, or condition the search for "when X is indicated in speaker-ID-tier". 

If you're doing different tiers for different speakers, you might have to figure out something a bit different from me in order to search many speakers at the same time. It's not that difficult though, you just have to meddle a bit with the search query (or just search one speaker at a time). Contact me if you want help.

On a related note, if someone ever were to ask me to put separate speakers in different tiers, I could use the above process to pick out only the annotations with a certain value in the speaker-tier and then import them back as one tier per speaker. I'd rather not, I like it this way. But I like making sure that the way I set things up can be reconfigured to please others as well. Flexibility is good: don't lock yourself into a set-up so narrow that you can't change it without losing data.

That granted, I need to do manual fidgety things for overlapping speech given this model. That's inconvenient, but I'm ok with it.

Short guide
Step 1) Clever searching
Step 2) export search results
  • Query>Export (Save as tab-delimited text file)
Step 3) create new tier
  • File>Import> CSV/Tab-delimited Text file
  • Specify columns (1 col: ignore, 2 col: Tier, 3 col: Begin time, 4 col: ignore, 5 col: end time, 6 col: ignore, 7 col: Duration, 8 col: ignore , 9 col: Annotation)
  • Save new .eaf-file. 
  • Quit and restart ELAN
Step 4) Creating merged file
  • Open original file with audio and other tiers
  • File>Merge transcriptions...
  • Select .eaf-file with search results as second source (do not append)
  • Save new merged file
  • Delete superfluous files
Step 5) done
  • rename and copy tiers if necessary

I'm sure there's other ways of doing this, but this is what has worked well for me. I'd like this to be easier in ELAN, but in the meantime this works so I'm gonna do it like this.

I find, in general, that I learn more about ELAN and other similar tools by just trying lots of different things and probing the system. Sure, there are manuals, but they often envisage a different usage than the one I'm after. For example, I'm not clear on what I actually gain from "linguistic types" for what I want to do. Never mind; probing, searching and sharing seem to be the best way to go for tailored functions. Usually, what you can conceptually imagine as a useful thing exists somewhere (it's like rule 34 but for software). I didn't know how this worked until I thought to myself: "there must be a way of importing search results". And lo and behold, there is. Now here's something I've learned and that you now can do too! Good luck!
Good bye!
Richard Ayoade as Maurice Moss in IT-crowd
* No, I don't know why it is that two linguists who are working/worked on specifically Samoan are trying to teach other linguists to use regular expressions in ELAN. Must be something in the water.
Ulrike Mosel and Hedvig Skirgård (yours truly) in Canberra
Samoan water, Neiafu-Tai village
