Guidelines Annotations


The content of this page was prepared by Dr. Katarzyna Klessa. See also: Lee, A., Bessell, N., van den Heuvel, H., Klessa, K., & Saalasti, S. (2023). The DELAD initiative for sharing language resources on speech disorders. Language Resources and Evaluation. https://doi.org/10.1007/s10579-023-09655-2

Annotation tools and techniques for disordered speech

Knowledge of speech annotation tools, data formats and procedures can be useful at various stages of designing, developing, processing, and analysing speech resources. When creating corpora of disordered speech, certain aspects of annotation tools and techniques become particularly important.

Freely available tools supporting multilayer speech annotation and annotation mining

Most annotation tools used in phonetic research can be successfully applied to annotating audio and/or video recordings of both typical and disordered speech. The following freely available tools can be used for these purposes:

  • Praat;
  • ELAN;
  • Annotation Pro;
  • SPPAS.

Praat, ELAN, Annotation Pro and SPPAS include functionalities enabling multilayer annotations based on synchronized inputs representing different components of spoken communication. The labelling schemes may involve both linguistic features (e.g., phrases, words, syllables, individual sounds) and para/extralinguistic features (e.g., hesitation markers, physiological sounds produced by speakers, or vocal correlates of emotion). Apart from offering various annotation mining and speech analysis options, these desktop tools also provide visual representations of the multilayer annotations, time-aligned with the speech signal display, for example in the form of oscillograms or spectrograms.

In cases where a visual component is necessary at the annotation stage, ELAN can be used because it supports the display of video files and multilayer annotation of video recordings (e.g., adding annotation layers with gesture or mimicry labels).

Annotations created with the above software tools can be enriched, searched and processed both manually and by means of automatic procedures. For example, SPPAS and the web-based CLARIN tools support automated grapheme-to-phoneme conversion of transcriptions, speech-to-text conversion, and speech segmentation into units at different levels such as phones, syllables or words. All of the tools listed above can be extended with plugins that customize their functionality to the user's needs.
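The tier-based, time-aligned annotation model shared by these tools can be sketched in code. The following is a minimal illustration only, not the data model of any particular tool; the tier names and labels are invented for the example:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    start: float   # seconds from the beginning of the recording
    end: float
    label: str

@dataclass
class Tier:
    name: str
    segments: list = field(default_factory=list)

    def labels_between(self, t0, t1):
        """Return labels of segments overlapping the interval [t0, t1]."""
        return [s.label for s in self.segments if s.end > t0 and s.start < t1]

# Two layers over the same recording: a word tier (linguistic)
# and a paralinguistic tier (here, a hesitation marker).
words = Tier("words", [Segment(0.00, 0.42, "hello"), Segment(0.55, 1.10, "there")])
para = Tier("para", [Segment(0.42, 0.55, "hesitation")])

print(words.labels_between(0.0, 0.5))   # word labels overlapping 0.0-0.5 s
print(para.labels_between(0.4, 0.6))    # paralinguistic labels in 0.4-0.6 s
```

Because all tiers share the same time axis, queries can combine linguistic and paralinguistic layers, which is the basis of the annotation mining options mentioned above.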

Interoperability of annotation file formats

The annotation formats of ELAN, Annotation Pro and SPPAS are XML-based, while Praat has its own file format. An important common feature of these file formats is the time-stamp information. The native formats can be converted to one another by means of either built-in functions or external converters (e.g., Annotation Pro enables quick data import and export between its native format and those of Praat, ELAN and SPPAS). Format conversion makes it possible to use different tools depending on the researcher's needs: one can develop a corpus with one tool and analyse the data with another. This is especially important for corpora of disordered speech, due to the stricter limitations on access rights and data sharing that apply to clinical data.
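The conversion principle rests on the shared time-stamp information: the same (start, end, label) intervals can be serialized in different containers. The sketch below writes one interval tier both in a Praat-style TextGrid layout and as simple XML; it illustrates the idea only and does not reproduce the full native formats of any of the tools, which carry additional metadata:

```python
import xml.etree.ElementTree as ET

# One interval tier as tool-neutral (start, end, label) tuples.
intervals = [(0.00, 0.42, "hello"), (0.42, 0.55, ""), (0.55, 1.10, "there")]

def to_textgrid_tier(name, intervals):
    """Render the tier in a Praat-TextGrid-like textual layout (simplified)."""
    lines = [f'name = "{name}"', f"intervals: size = {len(intervals)}"]
    for i, (start, end, label) in enumerate(intervals, 1):
        lines += [f"intervals [{i}]:",
                  f"    xmin = {start}",
                  f"    xmax = {end}",
                  f'    text = "{label}"']
    return "\n".join(lines)

def to_xml_layer(name, intervals):
    """Render the same tier as a simple XML layer (illustrative schema)."""
    layer = ET.Element("layer", name=name)
    for start, end, label in intervals:
        seg = ET.SubElement(layer, "segment", start=str(start), end=str(end))
        seg.text = label
    return ET.tostring(layer, encoding="unicode")

print(to_textgrid_tier("words", intervals))
print(to_xml_layer("words", intervals))
```

Since both outputs preserve the boundary time stamps, either can be regenerated from the other, which is what makes round-tripping between tools feasible.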

Annotation specifications and rating scales

At some stage of developing experimental procedures, it might be necessary to adjust certain elements so that they better reflect disordered speech. One level of adjustment is the use of an extended version of the International Phonetic Alphabet such as extIPA (Extensions to the International Phonetic Alphabet); see:

  • extIPA Symbols for Disordered Speech (PDF file) © ICPLA;
  • VOQS: Voice Quality Symbols (PDF file) © Ball, Esling & Dickson.

The extIPA charts enable transcribers to mark, alongside the standard transcription labels, additional information characteristic of atypical phenomena in speech. Another kind of adjustment is to allow signalling uncertainty in the transcribed material. Uncertainty is frequently reported by annotators of spontaneous speech recordings, as well as by annotators of disordered speech. The reasons for uncertainty include the overlapping of neighbouring speech sounds (e.g., due to coarticulation), the overlapping of speech and noise events, and the occurrence of unexpected phenomena in the speech signal, all of which result in ambiguity about the position of certain segment boundaries (some of which are in fact continuous transitions between sounds and could be interpreted as transition areas rather than boundary points). Uncertainty markers may be included in the extended version of the phonetic alphabet or added by means of software tools that support the annotation of uncertainty. For example, SPPAS annotations can include uncertainty markers, and such markers can be searched for automatically, either to list the “uncertain” boundary positions so that their appropriate location can be determined, or to confirm the status of certain boundary positions as inherently indeterminate (ambiguous), at least for a particular purpose.
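Mining annotations for uncertainty markers amounts to filtering labels for an agreed-upon convention. The sketch below assumes, purely for illustration, that a trailing "?" marks an uncertain segment; the actual marker convention depends on the tool and transcription scheme in use:

```python
# Segments as (start, end, label); a trailing "?" is the assumed
# uncertainty convention for this example.
segments = [
    (0.00, 0.42, "a"),
    (0.42, 0.55, "b?"),   # transcriber unsure about this segment
    (0.55, 1.10, "c"),
]

def uncertain_segments(segments, marker="?"):
    """Return the segments whose label carries the uncertainty marker."""
    return [(s, e, lab) for (s, e, lab) in segments if lab.endswith(marker)]

# List flagged segments so their boundaries can be manually reviewed,
# or confirmed as inherently indeterminate.
print(uncertain_segments(segments))
```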

Discrete, continuous, and mixed rating scales in speech annotation and perception tests

Many features of spoken utterances can be annotated using labels assigned to predefined categories or classes. When recordings are annotated with the use of software tools, individual labels representing those categories or classes are attached to subsequent segments (or boundary-delimited intervals). An example of such category-based labelling would be the transcription labels denoting syllables, sounds, words, or whole phrases time-aligned with the speech sound signal. However, when we consider the features of spontaneous or disordered speech, it sometimes turns out that continuous dimensions or transitions are more suitable than distinct categories, particularly for some of the paralinguistic or extralinguistic features, for example, those related to voice quality, individual voice characteristics or speaker’s states and attitudes.
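The contrast between category-based labels and continuous dimensions can be made concrete in code. The category set and value range below are invented for the example:

```python
# Discrete labelling: a label must come from a predefined, closed set.
CATEGORIES = {"word", "syllable", "pause"}

def categorical_label(label):
    """Validate a label against the predefined category set."""
    if label not in CATEGORIES:
        raise ValueError(f"unknown category: {label}")
    return label

# Continuous labelling: any value on a fixed scale is valid;
# out-of-range input is clamped to the scale limits.
def continuous_rating(value, lo=-1.0, hi=1.0):
    """Clamp a continuous rating (e.g., valence) to the scale [lo, hi]."""
    return max(lo, min(hi, value))

print(categorical_label("word"))
print(continuous_rating(0.3))
print(continuous_rating(1.7))   # clamped to the upper limit of the scale
```

The practical difference is that a discrete scheme can be validated and counted per class, while a continuous one supports gradient phenomena such as voice quality or emotion intensity.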

Examples of graphical representations of continuous and mixed rating scales

The figure depicts selected examples of graphical representations of continuous and mixed rating scales available in the Annotation Pro software tool:

  • the left-most example shows a feature space for labelling emotions using two dimensions of activation and valence (see e.g., Smith & Ellsworth, 1985); 
  • the image in the middle represents a mixed rating scale: discrete (10 distinct categories for emotion perception labelling are represented in the pie chart) and continuous (the distance from the centre of the circle refers to the emotion intensity) (cf. Ekman, 1992; Plutchik, 1982);
  • the right-most image represents the continuum of phonation types based on Ladefoged (1971), referring to the degree of glottal opening.

Technically, Annotation Pro makes it possible to use one of the built-in images, such as those shown above, or to design and upload a custom-made image (as a JPG or PNG file) tailored to the needs of a given study. The annotation procedure consists of clicking on the picture and thus marking one or more points in it. It is important, however, that the results of annotation based on various rating scales are stored in file formats that allow information from different sources to be combined within a common workspace. In this way, potential correlations and dependencies between them can be inspected and expressed using both qualitative approaches (e.g., visual inspection of multilayer annotations) and quantitative measures.
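Conceptually, a click on such a rating-scale image maps pixel coordinates to values on the underlying dimensions. The sketch below shows this mapping for a two-dimensional activation-valence space; the image size, axis orientation, and value range are assumptions for the example, not Annotation Pro's internal representation:

```python
def click_to_dimensions(x, y, width, height):
    """Map pixel coordinates of a click to (valence, activation) in [-1, 1].

    Assumes valence runs left-to-right and activation bottom-to-top,
    with pixel y increasing downwards as in typical image coordinates.
    """
    valence = 2.0 * x / width - 1.0       # left..right  -> -1..1
    activation = 1.0 - 2.0 * y / height   # top..bottom  ->  1..-1
    return valence, activation

# A click in the exact centre of a 400x400 image is neutral on both axes.
print(click_to_dimensions(200, 200, 400, 400))   # (0.0, 0.0)
```

Stored together with a time stamp, such dimensional values can sit alongside interval tiers in a common workspace, which is what enables the qualitative and quantitative comparisons described above.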

Selected references

Ball, M. J. (2021). Transcribing disordered speech. In M. J. Ball (Ed.), Manual of clinical phonetics (pp. 163-174). Routledge. 

Ball, M. J., Howard, S. J., & Miller, K. (2018). Revisions to the extIPA chart. Journal of the International Phonetic Association, 48(2), 155-164. 

Bigi, B. (2015). SPPAS – Multi-lingual approaches to the automatic annotation of speech. The Phonetician, 111, 54-69. https://zenodo.org/record/5749242#.Yv-WInbMLqU

Bigi, B. (2021). SPPAS annotation format: XRA Schema (version 1.5). HAL open science. https://hal.archives-ouvertes.fr/hal-03468454/document 

Boersma, P., & Weenink, D. (1992-2021). Praat: Doing phonetics by computer (Version 6.2) [Computer program]. https://www.praat.org 

Duckworth, M., Allen, G., Hardcastle, W., & Ball, M. (1990). Extensions to the International Phonetic Alphabet for the transcription of atypical speech. Clinical Linguistics & Phonetics, 4(4), 273-280. https://doi.org/10.3109/02699209008985489

Ekman, P. (1992). An argument for basic emotions. Cognition and Emotion, 6(3-4), 169-200. https://doi.org/10.1080/02699939208411068

Ferré, G. (2012). Functions of three open-palm hand gestures. Multimodal Communication, 1(1), 5-20. https://hal.archives-ouvertes.fr/hal-00666025/document  

Jarmołowicz-Nowikow, E., & Karpiński, M. (2011). Communicative intentions behind pointing gestures in task-oriented dialogues. Proceedings of the 2nd Conference on Gesture and Speech in Interaction (GESPIN).

Kisler, T., Reichel, U., & Schiel, F. (2017). Multilingual processing of speech via web services. Computer Speech & Language, 45, 326-347.

Klessa, K., Karpiński, M., & Wagner, A. (2013). Annotation Pro – A new software tool for annotation of linguistic and paralinguistic features. In D. Hirst & B. Bigi (Eds.), Proceedings of Tools and Resources for the Analysis of Speech Prosody (TRASP) (pp. 51-54). http://www2.lpl-aix.fr/~trasp/Proceedings/19897-trasp2013.pdf

Ladefoged, P. (1971). Preliminaries to linguistic phonetics. University of Chicago Press. 

Lee, A., Bessell, N., van den Heuvel, H., Klessa, K., & Saalasti, S. (2023). The DELAD initiative for sharing language resources on speech disorders. Language Resources and Evaluation. https://doi.org/10.1007/s10579-023-09655-2

Lorenc, A. (2016). Transkrypcja wymowy w normie i w przypadkach jej zaburzeń. Próba ujednolicenia i obiektywizacji [Transcription of pronunciation in speech norm and in cases of speech disorders. An approach to standardize and objectify] In B. Kamińska & S. Milewski (Eds.), Logopedia artystyczna (pp. 107-143). Harmonia Universalis. 

Plutchik, R. (1982). A psychoevolutionary theory of emotions. Social Science Information, 21(4-5), 529-553. 

Smith, C. A., & Ellsworth, P. C. (1985). Patterns of cognitive appraisal in emotion. Journal of Personality and Social Psychology, 48(4), 813-838.

Wittenburg, P., Brugman, H., Russel, A., Klassmann, A., & Sloetjes, H. (2006). ELAN: A professional framework for multimodality research. Proceedings of the 5th International Conference on Language Resources and Evaluation (LREC 2006), 1556-1559. http://www.lrec-conf.org/proceedings/lrec2006/pdf/153_pdf.pdf