Research


ALTA Institute

Funded by Cambridge Assessment English, the Automated Language Teaching and Assessment Institute (ALTA) was launched in October 2013. The broad remit of the institute is to conduct research in corpus linguistics, computational linguistics, speech processing, machine learning and computer systems and platforms, relevant to the activities of the sponsor.


Research interests

Language learning
My current projects relate to second language learning and form part of the ALTA Institute research programme. We are working to better understand learner proficiency levels in spoken English and provide individualised, automated teaching feedback.

Security NLP
I have worked on NLP for the analysis of online hacking forums. This involves domain adaptation, text classification and transfer to languages other than English. I was involved in a 6-month project funded by the Alan Turing Institute’s Defence & Security Programme.

Low-resource NLP
I am interested in the processing of non-standard and low-resource natural languages, where ‘non-standard’ includes speech and online discourse, and ‘low-resource’ refers to any text type which is not well represented by current NLP models and resources. Specifically, I’ve been involved in projects to normalise transcriptions of speech, web vocabulary, and online forum posts, as well as the development of educational technology for the Runyakitara languages of western Uganda.

Innovation in spoken English
This was the topic of my PhD, focused on the ‘zero auxiliary’ (omission of the tensed verb in questions such as ‘where you been’, ‘how you doing’, ‘we going to town’) in British English. I investigated zero auxiliary frequencies in the spoken section of the British National Corpus and found evidence for social, discourse and grammatical factors underlying its use. I subsequently used these findings to inform a repair algorithm for spoken language processing, in a 2010 ACL workshop paper with my supervisor Paula Buttery (see below).


Publications

In press: Fridah Katushemererwe, Andrew Caines & Paula Buttery. Building natural language processing tools for Runyakitara. Applied Linguistics Review.

2018: Andrew Caines, Sergio Pastrana, Alice Hutchings and Paula Buttery. Automatically identifying the function and intent of posts in underground forums. Crime Science 7:19.

2018: Anna Samuel, Andrew Caines & Paula Buttery. Cambridge Language Sciences Language, Brains & Machines Research Strategy Forum: an initial literature survey.

2018: Andrew Caines, Sergio Pastrana, Alice Hutchings and Paula Buttery. Aggressive language in an online hacking forum. Proceedings of the 2nd Abusive Language Workshop (ALW 2018).

2018: Kate Knill, Mark Gales, Konstantinos Kyriakopoulos, Andrey Malinin, Anton Ragni, Yu Wang & Andrew Caines. Impact of ASR Performance on Free Speaking Language Assessment. Proceedings of INTERSPEECH.

2018: Claudia Baur, Andrew Caines, Cathy Chua, Johanna Gerlach, Mengjie Qian, Manny Rayner, Martin Russell, Helmer Strik & Xizi Wei. Overview of the 2018 Spoken CALL Shared Task. Proceedings of INTERSPEECH.

2018: Andrew Caines, Paula Buttery & Michael McCarthy. ‘You Still Talking to Me?’ The Zero Auxiliary Progressive in Spoken British English, Twenty Years On. In: Vaclav Brezina, Robbie Love and Karin Aijmer (eds.), Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014. London: Routledge.

2018: Sergio Pastrana, Alice Hutchings, Andrew Caines & Paula Buttery. Characterizing Eve: Analysing Cybercrime Actors in a Large Underground Forum. Proceedings of the 21st International Symposium on Research in Attacks, Intrusions and Defenses (RAID 2018).

2017: Andrew Caines, Diane Nicholls & Paula Buttery. Annotating errors and disfluencies in transcriptions of speech. University of Cambridge Technical Report Number 915.

2017: Andrew Caines & Paula Buttery. The effect of topic on language use in the Cambridge Learner Corpus. In: Lynne Flowerdew and Vaclav Brezina (eds.), Written and spoken learner corpora and their use in different contexts. London: Bloomsbury.

2017: Andrew Caines, Michael McCarthy and Paula Buttery. Parsing transcripts of speech. Proceedings of SCNLP.

2017: Andrew Caines, Emma Flint and Paula Buttery. Collecting fluency corrections for spoken learner English. Proceedings of BEA.

2017: Emma Flint, Elliot Ford, Olivia Thomas, Andrew Caines and Paula Buttery. A text normalisation system for non-standard English words. Proceedings of WNUT.

2017: Andrew Caines. Spoken CALL Shared Task system description. Proceedings of SLaTE.

2016: Russell Moore, Andrew Caines, Calbert Graham & Paula Buttery. Automated speech-unit delimitation in spoken learner English. Proceedings of COLING.

2016: Andrew Caines, Michael McCarthy & Anne O’Keeffe. Spoken language corpora and pedagogic applications. In: Fiona Farr and Liam Murray (eds.), The Routledge Handbook of Language Learning and Technology. London: Routledge.

2016: Andrew Caines, Christian Bentz, Calbert Graham, Tim Polzehl & Paula Buttery. Crowdsourcing a multilingual speech corpus: recording, transcription and annotation of the CrowdED Corpus. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16).

2016: Wanru Zhang, Andrew Caines, Dimitrios Alikaniotis & Paula Buttery. Predicting author age from Weibo microblog posts. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16).

2016: Andrew Caines, Christian Bentz, Dimitrios Alikaniotis, Fridah Katushemererwe & Paula Buttery. The Glottolog Data Explorer: Mapping the world’s languages. Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16). Workshop proceedings

2015: Russell Moore, Andrew Caines, Calbert Graham & Paula Buttery. Incremental Dependency Parsing and Disfluency Detection in Spoken Learner English. Proceedings of Text, Speech and Dialogue (TSD 2015). Berlin: Springer.

2014: Andrew Caines & Paula J. Buttery. The effect of disfluencies and learner errors on the parsing of spoken learner language. Proceedings of the First Joint Workshop on Statistical Parsing of Morphologically Rich Languages and Syntactic Analysis of Non-Canonical Languages (SPMRL-SANCL 2014), co-located with COLING 2014, Dublin.

2012: Andrew Caines. ‘You talking to me?’ Testing corpus data with a shadowing experiment. In: Stefan Th. Gries and Dagmar Divjak (eds.), Frequency effects in language learning and processing. Berlin: Mouton de Gruyter.

2012: Paula Buttery & Andrew Caines. Normalising frequency counts to account for ‘opportunity of use’ in learner corpora. In: Yukio Tono, Yuji Kawaguchi, and Makoto Minegishi (eds.), Developmental and Crosslinguistic Perspectives in Learner Corpus Research. Amsterdam: John Benjamins.

2012: Paula Buttery & Andrew Caines. Reclassifying subcategorization frames for experimental analysis and stimulus generation. In: Proceedings of the Language Resources and Evaluation Conference (LREC) 2012.

2012: Andrew Caines & Paula Buttery. Annotating progressive aspect constructions in the spoken section of the British National Corpus. In: Proceedings of the Language Resources and Evaluation Conference (LREC) 2012.

2010: Andrew Caines & Paula Buttery. ‘You talking to me?’ A predictive model for zero auxiliary constructions. In: Proceedings of the Workshop on Natural Language Processing and Linguistics, Finding the Common Ground, Annual Meeting of the Association for Computational Linguistics (ACL) 2010.


Presentations

2018

November — ‘Automatic assessment and teaching of numeracy skills to Indian schoolchildren’; Cambridge Global Challenges Conference (poster)

November — ‘Language, Brains & Machines: an initial literature review’; Cambridge Language Sciences Annual Symposium (poster)

November — ‘A collaborative game-based approach to documenting linguistic variation in Brazil’; Cambridge Language Sciences Annual Symposium (poster)

November — ‘Developing a prototype web-app for numeracy assessment and teaching’; Cambridge Language Sciences Annual Symposium (poster)

October — ‘Aggressive language in an online hacking forum’; 2nd Abusive Language Workshop, Brussels (poster)

July — ‘Data science approaches to understanding key actors on online hacking forums’; Cambridge Cybercrime Centre Annual Conference, Cambridge

June — ‘Annotating transcriptions of learner speech’; ALTA Institute Seminar, Cambridge

May — ‘Automatic analysis of online hacking forums’; Collaborating with the Machine, Cambridge Digital Humanities Workshop

May — ‘Data science approaches to understanding key actors in online hacking forums’; with Sergio Pastrana, Security Group Seminar, Computer Science & Technology, Cambridge

2017

November — ‘Crowdsourcing an error-annotated speech corpus’; Cambridge Language Sciences Annual Symposium (poster)

November — ‘Identifying the native language of learners of spoken English’; Cambridge Language Sciences Annual Symposium (poster)

November — ‘Towards a Virtual Cambridge: learning English in a virtual world’; Cambridge Language Sciences Annual Symposium (poster)

September — ‘Parsing transcripts of speech’; SCNLP, Copenhagen

September — ‘Collecting fluency corrections for spoken learner English’; BEA, Copenhagen (poster)

September — ‘A text normalisation system for non-standard English words’; W-NUT, Copenhagen (poster)

August — ‘Spoken CALL Shared Task system description’; SLaTE, Stockholm

June — ‘Computational Linguistics & Ugandan Languages’; Makerere University, Kampala

March — ‘What is Linguistics?’; Sidney Sussex College HE+ academic taster sessions, Cambridge

2016

December — ‘Automated speech-unit delimitation in spoken learner English’; COLING’16, Osaka (poster)

November — ‘Automated speech-unit delimitation in spoken learner English’; Cambridge Language Sciences Annual Symposium (poster)

May — ‘Crowdsourcing a multilingual speech corpus: recording, transcription and annotation of the CrowdED Corpus’; LREC’16, Slovenia (poster)

May — ‘Predicting author age from Weibo microblog posts’; LREC’16, Slovenia (poster)

May — ‘The Glottolog Data Explorer: Mapping the world’s languages’; VisLR Workshop at LREC’16, Slovenia

April — ‘Automatic identification of speech-unit boundaries’; ALTA Institute Seminar, Cambridge

January — ‘Darker shades of grey: experiments in multi-dimensional error analysis’; IVACS Symposium, Barcelona

January — ‘Crowdsourcing a bilingual speech corpus’; IVACS Symposium, Barcelona

2015

November — ‘R Markdown for Reproducible Research’; CRUK Reproducible Research Workshop, Cambridge

September — ‘Incremental Dependency Parsing and Disfluency Detection in Spoken Learner English’; TSD Conference, Pilsen

August — ‘Crowdsourcing error annotations in a corpus of learner spoken English’; EUROCALL Conference, Padua

July — ‘Automated processing, grading and correction of spontaneous spoken learner data’; Corpus Linguistics Conference, Lancaster University

July — ‘Crowdsourcing a multi-lingual speech corpus: recording, transcription and annotation of the CrowdIS corpora’; Corpus Linguistics Conference, Lancaster University

May — ‘Mapping the Endangerment Status of the World’s Languages’; CRASSH Graphical Display Workshop, University of Cambridge

May — ‘Building Natural Language Processing tools for Runyakitara’; BAAL Language in Africa Special Interest Group Annual Meeting, Aston University

May — ‘Working in shades of grey: error analysis in spoken learner language’; BAAL Testing, Evaluation and Assessment Special Interest Group one-day conference, University of Cambridge

May — ‘Disfluency detection in spoken learner English’; NLIP Seminar, University of Cambridge

Feb — ‘The automatic assessment of spoken language’; English Profile Seminar, Cambridge University Press

Feb — ‘Reproducible Research’; PhD Training Workshops, DTAL, University of Cambridge [see GitHub repo for further info]

Jan — ‘How to make your own work reproducible: R Markdown’; Replication Workshop, Social Sciences Research Methods Centre, University of Cambridge [see GitHub repo for further info]

2014

Dec — ‘The ALTA Institute’; Language Technology Group seminar, University of Copenhagen.

Oct — ‘Creating presentations with reveal.js and Shiny’; R User Group, Cambridge [see Outreach page for further info]

Oct — ‘Spoken corpus annotation’; ALTA Institute Seminar, Cambridge.

Aug — ‘The effect of disfluencies and learner errors on the parsing of spoken learner language’; SPMRL-SANCL Workshop, co-located with COLING, Dublin.

Jul — ‘The effect of topic on opportunity of use in the Cambridge Learner Corpus’; Teaching and Language Corpora (TaLC) Conference, Lancaster.

Jun — ‘Infinite shades of grey: what constitutes an error?’; Inter-Varietal Applied Corpus Studies (IVACS) Conference, Newcastle.

Feb — ‘The effect of topic on errors in a corpus of learner English’; DTAL Computational Linguistics Cluster workshop, Cambridge.
