Research


ALTA Institute

Funded by Cambridge English Language Assessment, the Automated Language Teaching and Assessment Institute (ALTA) was launched in October 2013. The institute is administered by the Computer Laboratory and includes principal investigators from the Departments of Engineering and Theoretical and Applied Linguistics. The initial period of funding is for 5 years.

The broad remit of the institute is to conduct research in corpus linguistics, computational linguistics, speech processing, machine learning and computer systems and platforms, relevant to the activities of the sponsor.

In DTAL, our focus is on corpus linguistics and the principal investigator is Dr Paula Buttery. Corpus research involves the analysis of written text and/or transcribed speech databases. In our case we will be working with transcribed recordings of learners of English taking oral tests. We aim to find the key indicators of proficiency in spoken language, and will use these findings to inform systems of automated language teaching and assessment.




Research interests

Second language learning
My current projects relate to second language learning and form part of the ALTA Institute research programme. We are working to better understand learner proficiency levels in spoken English and provide individualised, automated teaching feedback.

First language acquisition
The way that children learn language is the focus of an ongoing book project for Cambridge University Press, co-authored with Paula Buttery. We present empirical methods to investigate first language acquisition.

Innovation in spoken English
This was the topic of my PhD, which focused on the ‘zero auxiliary’ in British English: omission of the tensed verb in utterances such as ‘where you been’, ‘how you doing’ and ‘we going to town’. I investigated zero auxiliary frequencies in the spoken section of the British National Corpus and found evidence for social, discourse and grammatical factors underlying its use. I subsequently used these findings to inform a repair algorithm for spoken language processing, in a 2010 ACL paper with my supervisor Paula Buttery (see below).



Publications

In press: Andrew Caines, Paula Buttery & Michael McCarthy. ‘You Still Talking to Me?’ The Zero Auxiliary Progressive in Spoken British English, Twenty Years On. In: Vaclav Brezina, Robbie Love and Karin Aijmer (eds.), Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014. London: Routledge.

In press: Andrew Caines & Paula Buttery. The effect of topic on language use in the Cambridge Learner Corpus. In: Lynne Flowerdew and Vaclav Brezina (eds.), Written and spoken learner corpora and their use in different contexts. London: Bloomsbury.

In press: Fridah Katushemererwe, Andrew Caines & Paula Buttery. Building natural language processing tools for Runyakitara. Applied Linguistics Review.

2017: Andrew Caines, Michael McCarthy and Paula Buttery. Parsing transcripts of speech. Proceedings of SCNLP.

2017: Andrew Caines, Emma Flint and Paula Buttery. Collecting fluency corrections for spoken learner English. Proceedings of BEA.

2017: Emma Flint, Elliot Ford, Olivia Thomas, Andrew Caines and Paula Buttery. A text normalisation system for non-standard English words. Proceedings of WNUT.

2017: Andrew Caines. Spoken CALL Shared Task system description. Proceedings of SLaTE.

2016: Russell Moore, Andrew Caines, Calbert Graham & Paula Buttery. Automated speech-unit delimitation in spoken learner English. Proceedings of COLING.

2016: Andrew Caines, Michael McCarthy & Anne O’Keeffe. Spoken language corpora and pedagogic applications. In: Fiona Farr and Liam Murray (eds.), The Routledge Handbook of Language Learning and Technology. London: Routledge.

2016: Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl & Paula Buttery. Crowdsourcing a multilingual speech corpus: recording, transcription and annotation of the CrowdED Corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16).

2016: Wanru Zhang, Andrew Caines, Dimitrios Alikaniotis & Paula Buttery. Predicting author age from Weibo microblog posts. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16).

2016: Andrew Caines, Christian Bentz, Dimitrios Alikaniotis, Fridah Katushemererwe & Paula Buttery. The Glottolog Data Explorer: Mapping the world’s languages. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Workshop proceedings

2015: Russell Moore, Andrew Caines, Calbert Graham & Paula Buttery. Incremental Dependency Parsing and Disfluency Detection in Spoken Learner English. Proceedings of Text, Speech and Dialogue (TSD 2015). Berlin: Springer.

2014: Andrew Caines & Paula J. Buttery. The effect of disfluencies and learner errors on the parsing of spoken learner language. Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages (SPMRL-SANCL 2014), Co-located with COLING 2014, Dublin.

2012: Andrew Caines. ‘You talking to me?’ Testing corpus data with a shadowing experiment. In: Stefan Th. Gries and Dagmar Divjak (eds.), Frequency effects in language learning and processing. Berlin: Mouton de Gruyter.

2012: Paula Buttery & Andrew Caines. Normalising frequency counts to account for ‘opportunity of use’ in learner corpora. In: Yukio Tono, Yuji Kawaguchi, and Makoto Minegishi (eds.), Developmental and Crosslinguistic Perspectives in Learner Corpus Research. Amsterdam: John Benjamins.

2012: Paula Buttery & Andrew Caines. Reclassifying subcategorization frames for experimental analysis and stimulus generation. In: Proceedings of the Language Resources and Evaluation Conference (LREC) 2012.

2012: Andrew Caines & Paula Buttery. Annotating progressive aspect constructions in the spoken section of the British National Corpus. In: Proceedings of the Language Resources and Evaluation Conference (LREC) 2012.

2010: Andrew Caines & Paula Buttery. ‘You talking to me?’ A predictive model for zero auxiliary constructions. In: Proceedings of the Workshop on Natural Language Processing and Linguistics, Finding the Common Ground, Annual Meeting of the Association for Computational Linguistics (ACL) 2010.



Presentations

2017

September — ‘Parsing transcripts of speech’; SCNLP, Copenhagen.

August — ‘Spoken CALL Shared Task system description’; SLaTE, Stockholm.

June — ‘Computational Linguistics & Ugandan Languages’; Makerere University, Kampala.

March — ‘What is Linguistics?’; Sidney Sussex College HE+ academic taster sessions, Cambridge.

2016

December — ‘Automated speech-unit delimitation in spoken learner English’; COLING’16, Osaka.

May — ‘Crowdsourcing a multilingual speech corpus: recording, transcription and annotation of the CrowdED Corpus’; LREC’16, Slovenia.

May — ‘Predicting author age from Weibo microblog posts’; LREC’16, Slovenia.

May — ‘The Glottolog Data Explorer: Mapping the world’s languages’; VisLR Workshop at LREC’16, Slovenia.

April — ‘Automatic identification of speech-unit boundaries’, with Russell Moore and Paula Buttery; ALTA Institute Seminar, Cambridge.

January — ‘Darker shades of grey: experiments in multi-dimensional error analysis’, with Paula Buttery and Michael McCarthy; IVACS Symposium, Barcelona

January — ‘Crowdsourcing a bilingual speech corpus’, with Christian Bentz, Calbert Graham and Paula Buttery; IVACS Symposium, Barcelona

2015

November — ‘R Markdown for Reproducible Research’; CRUK Reproducible Research Workshop, Cambridge

September — ‘Incremental Dependency Parsing and Disfluency Detection in Spoken Learner English’; TSD Conference, Pilsen

August — ‘Crowdsourcing error annotations in a corpus of learner spoken English’; EUROCALL Conference, Padua

July — ‘Automated processing, grading and correction of spontaneous spoken learner data’; Corpus Linguistics Conference, Lancaster University

July — ‘Crowdsourcing a multi-lingual speech corpus: recording, transcription and annotation of the CrowdIS corpora’; Corpus Linguistics Conference, Lancaster University

May — ‘Mapping the Endangerment Status of the World’s Languages’; CRASSH Graphical Display Workshop, University of Cambridge

May — ‘Building Natural Language Processing tools for Runyakitara’; BAAL Language in Africa Special Interest Group Annual Meeting, Aston University

May — ‘Working in shades of grey: error analysis in spoken learner language’; BAAL Testing, Evaluation and Assessment Special Interest Group one-day conference, University of Cambridge

May — ‘Disfluency detection in spoken learner English’; NLIP Seminar, University of Cambridge

Feb — ‘The automatic assessment of spoken language’; English Profile Seminar, Cambridge University Press

Feb — ‘Reproducible Research’; PhD Training Workshops, DTAL, University of Cambridge [see GitHub repo for further info]

Jan — ‘How to make your own work reproducible: R Markdown’; Replication Workshop, Social Sciences Research Methods Centre, University of Cambridge [see GitHub repo for further info]

2014

Dec — ‘The ALTA Institute’; Language Technology Group seminar, University of Copenhagen.

Oct — ‘Creating presentations with reveal.js and Shiny’; R User Group, Cambridge [see Outreach page for further info]

Oct — ‘Spoken corpus annotation’; ALTA Institute Seminar, Cambridge.

Aug — ‘The effect of disfluencies and learner errors on the parsing of spoken learner language’; SPMRL-SANCL Workshop, co-located with COLING, Dublin.

Jul — ‘The effect of topic on opportunity of use in the Cambridge Learner Corpus’; Teaching and Language Corpora (TaLC) Conference, Lancaster.

Jun — ‘Infinite shades of grey: what constitutes an error?’; Inter-Varietal Applied Corpus Studies (IVACS) Conference, Newcastle.

Feb — ‘The effect of topic on errors in a corpus of learner English’; DTAL Computational Linguistics Cluster workshop, Cambridge.
