Funded by Cambridge Assessment English, the Automated Language Teaching and Assessment Institute (ALTA) was launched in October 2013. The broad remit of the institute is to conduct research in corpus linguistics, computational linguistics, speech processing, machine learning and computer systems and platforms, relevant to the activities of the sponsor.
My current projects relate to second language learning and form part of the ALTA Institute research programme. We are working to better understand learner proficiency levels in spoken English and provide individualised, automated teaching feedback.
I have worked on NLP for the analysis of online hacking forums. This involves domain adaptation, text classification and transfer to languages other than English. I was involved in a 6-month project funded by the Alan Turing Institute’s Defence & Security Programme.
I am interested in the processing of non-standard and low-resource natural languages, where ‘non-standard’ includes speech and online discourse, and ‘low-resource’ refers to any text type which is not well represented by current NLP models and resources. Specifically, I’ve been involved in projects to normalise transcriptions of speech, web vocabulary, and online forum posts, as well as the development of educational technology for the Runyakitara languages of western Uganda.
Innovation in spoken English
This was the topic of my PhD, focused on the ‘zero auxiliary’ (omission of the tensed verb in utterances such as ‘where you been’, ‘how you doing’, ‘we going to town’) in British English. I investigated zero auxiliary frequencies in the spoken section of the British National Corpus and found evidence for social, discourse and grammatical factors underlying its use. I subsequently used these findings to inform a repair algorithm for spoken language processing, in a 2010 ACL paper with my supervisor Paula Buttery (see below).
In press: Fridah Katushemererwe, Andrew Caines & Paula Buttery. Building natural language processing tools for Runyakitara. Applied Linguistics Review.
2018: Andrew Caines, Sergio Pastrana, Alice Hutchings and Paula Buttery. Aggressive language in an online hacking forum. Proceedings of the 2nd Abusive Language Workshop (ALW 2018).
2018: Kate Knill, Mark Gales, Konstantinos Kyriakopoulos, Andrey Malinin, Anton Ragni, Yu Wang & Andrew Caines. Impact of ASR Performance on Free Speaking Language Assessment. Proceedings of INTERSPEECH.
2018: Claudia Baur, Andrew Caines, Cathy Chua, Johanna Gerlach, Mengjie Qian, Manny Rayner, Martin Russell, Helmer Strik & Xizi Wei. Overview of the 2018 Spoken CALL Shared Task. Proceedings of INTERSPEECH.
2018: Andrew Caines, Paula Buttery & Michael McCarthy. ‘You Still Talking to Me?’ The Zero Auxiliary Progressive in Spoken British English, Twenty Years On. In: Vaclav Brezina, Robbie Love and Karin Aijmer (eds.), Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014. London: Routledge.
2018: Sergio Pastrana, Alice Hutchings, Andrew Caines & Paula Buttery. Characterizing Eve: Analysing Cybercrime Actors in a Large Underground Forum. Proceedings of the 21st International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2018).
2017: Andrew Caines, Diane Nicholls & Paula Buttery. Annotating errors and disfluencies in transcriptions of speech. University of Cambridge Technical Report Number 915.
2017: Andrew Caines & Paula Buttery. The effect of topic on language use in the Cambridge Learner Corpus. In: Lynne Flowerdew and Vaclav Brezina (eds.), Written and spoken learner corpora and their use in different contexts. London: Bloomsbury.
2016: Andrew Caines, Michael McCarthy & Anne O’Keeffe. Spoken language corpora and pedagogic applications. In: Fiona Farr and Liam Murray (eds.), The Routledge Handbook of Language Learning and Technology. London: Routledge.
2016: Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl & Paula Buttery. Crowdsourcing a multilingual speech corpus: recording, transcription and annotation of the CrowdED Corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16).
2016: Wanru Zhang, Andrew Caines, Dimitrios Alikaniotis & Paula Buttery. Predicting author age from Weibo microblog posts. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16).
2016: Andrew Caines, Christian Bentz, Dimitrios Alikaniotis, Fridah Katushemererwe & Paula Buttery. The Glottolog Data Explorer: Mapping the world’s languages. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Workshop proceedings
2015: Russell Moore, Andrew Caines, Calbert Graham & Paula Buttery. Incremental Dependency Parsing and Disfluency Detection in Spoken Learner English. Proceedings of Text, Speech and Dialogue (TSD 2015). Berlin: Springer.
2014: Andrew Caines & Paula J. Buttery. The effect of disfluencies and learner errors on the parsing of spoken learner language. Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages (SPMRL-SANCL 2014), co-located with COLING 2014, Dublin.
2012: Andrew Caines. ‘You talking to me?’ Testing corpus data with a shadowing experiment. In: Stefan Th. Gries and Dagmar Divjak (eds.), Frequency effects in language learning and processing. Berlin: Mouton de Gruyter.
2012: Paula Buttery & Andrew Caines. Normalising frequency counts to account for ‘opportunity of use’ in learner corpora. In: Yukio Tono, Yuji Kawaguchi, and Makoto Minegishi (eds.), Developmental and Crosslinguistic Perspectives in Learner Corpus Research. Amsterdam: John Benjamins.
2012: Paula Buttery & Andrew Caines. Reclassifying subcategorization frames for experimental analysis and stimulus generation. In: Proceedings of the Language Resources and Evaluation Conference (LREC) 2012.
2012: Andrew Caines & Paula Buttery. Annotating progressive aspect constructions in the spoken section of the British National Corpus. In: Proceedings of the Language Resources and Evaluation Conference (LREC) 2012.
2010: Andrew Caines & Paula Buttery. ‘You talking to me?’ A predictive model for zero auxiliary constructions. In: Proceedings of the Workshop on Natural Language Processing and Linguistics, Finding the Common Ground, Annual Meeting of the Association for Computational Linguistics (ACL) 2010.
October — ‘Aggressive language in an online hacking forum’; 2nd Abusive Language Workshop, Brussels.
July — ‘Data science approaches to understanding key actors on online hacking forums’; Cambridge Cybercrime Centre Annual Conference, Cambridge.
June — ‘Annotating transcriptions of learner speech’; ALTA Institute Seminar, Cambridge.
May — ‘Automatic analysis of online hacking forums’; Collaborating with the Machine, Cambridge Digital Humanities Workshop.
May — ‘Data science approaches to understanding key actors in online hacking forums’; with Sergio Pastrana, Security Group Seminar, Computer Science & Technology, Cambridge.
September — ‘Parsing transcripts of speech’; SCNLP, Copenhagen.
August — ‘Spoken CALL Shared Task system description’; SLaTE, Stockholm.
June — ‘Computational Linguistics & Ugandan Languages’; Makerere University, Kampala.
March — ‘What is Linguistics?’; Sidney Sussex College HE+ academic taster sessions, Cambridge.
December — ‘Automated speech-unit delimitation in spoken learner English’; COLING’16, Osaka.
May — ‘Crowdsourcing a multilingual speech corpus: recording, transcription and annotation of the CrowdED Corpus’; LREC’16, Slovenia.
May — ‘Predicting author age from Weibo microblog posts’; LREC’16, Slovenia.
May — ‘The Glottolog Data Explorer: Mapping the world’s languages’; VisLR Workshop at LREC’16, Slovenia.
April — ‘Automatic identification of speech-unit boundaries’, with Russell Moore and Paula Buttery; ALTA Institute Seminar, Cambridge.
January — ‘Darker shades of grey: experiments in multi-dimensional error analysis’, with Paula Buttery and Michael McCarthy; IVACS Symposium, Barcelona.
January — ‘Crowdsourcing a bilingual speech corpus’, with Christian Bentz, Calbert Graham and Paula Buttery; IVACS Symposium, Barcelona.
November — ‘R Markdown for Reproducible Research’; CRUK Reproducible Research Workshop, Cambridge.
September — ‘Incremental Dependency Parsing and Disfluency Detection in Spoken Learner English’; TSD Conference, Pilsen.
August — ‘Crowdsourcing error annotations in a corpus of learner spoken English’; EUROCALL Conference, Padua.
July — ‘Automated processing, grading and correction of spontaneous spoken learner data’; Corpus Linguistics Conference, Lancaster University.
July — ‘Crowdsourcing a multi-lingual speech corpus: recording, transcription and annotation of the CrowdIS corpora’; Corpus Linguistics Conference, Lancaster University.
May — ‘Mapping the Endangerment Status of the World’s Languages’; CRASSH Graphical Display Workshop, University of Cambridge.
May — ‘Building Natural Language Processing tools for Runyakitara’; BAAL Language in Africa Special Interest Group Annual Meeting, Aston University.
May — ‘Working in shades of grey: error analysis in spoken learner language’; BAAL Testing, Evaluation and Assessment Special Interest Group one-day conference, University of Cambridge.
May — ‘Disfluency detection in spoken learner English’; NLIP Seminar, University of Cambridge.
Feb — ‘The automatic assessment of spoken language’; English Profile Seminar, Cambridge University Press.
Feb — ‘Reproducible Research’; PhD Training Workshops, DTAL, University of Cambridge [see GitHub repo for further info]
Dec — ‘The ALTA Institute’; Language Technology Group seminar, University of Copenhagen.
Oct — ‘Creating presentations with reveal.js and Shiny’; R User Group, Cambridge [see Outreach page for further info]
Oct — ‘Spoken corpus annotation’; ALTA Institute Seminar, Cambridge.
Aug — ‘The effect of disfluencies and learner errors on the parsing of spoken learner language’; SPMRL-SANCL Workshop, co-located with COLING, Dublin.
Jul — ‘The effect of topic on opportunity of use in the Cambridge Learner Corpus’; Teaching and Language Corpora (TaLC) Conference, Lancaster.
Jun — ‘Infinite shades of grey: what constitutes an error?’; Inter-Varietal Applied Corpus Studies (IVACS) Conference, Newcastle.
Feb — ‘The effect of topic on errors in a corpus of learner English’; DTAL Computational Linguistics Cluster workshop, Cambridge.