Resources


The CrowdED Corpus
(Image: James Cridland, 2007, https://flic.kr/p/Wd54U)

The CrowdED Corpus is a database of speech recordings and transcriptions collected entirely via crowdsourcing. The corpus has two parts: (1) CrowdED_bilingual, featuring bilingual individuals speaking in both German and English, and (2) CrowdED_english, featuring English speech by native speakers of English. Release 1 contains 1,000 recordings from 80 individuals, amounting to approximately 33,400 words (divided approximately 55:45 between the bilingual and English subcorpora). It is made freely available for research purposes under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International licence (CC BY-NC-SA 4.0).

We presented a poster about the construction of the CrowdED Corpus at the Language Resources & Evaluation Conference (LREC) 2016. You can find our full paper in the conference proceedings. The CrowdED Corpus was funded by Crowdee and CrowdFlower.

The CrowdED Corpus is available for download from Ortolang. As ever, we welcome your feedback!



Text Normalisation

Text normalisation is an important prerequisite for text-to-speech (TTS), which is in turn fundamental to dialogue systems and accessibility technology. Following Emma Flint and Elliot Ford’s 2016 summer internships, sponsored by Cambridge English Language Assessment, we’ve made a text normalisation system available under a GNU General Public Licence, as described in a paper presented at WNUT 2017. You can clone or download the GitHub repository, and we welcome feedback!



Languages of the World

We’ve mapped the world’s seven thousand ‘languoids’ using a Shiny App, colour-coding their endangerment status from ‘extinct’ through shades of ‘endangered’ to ‘vulnerable’ or ‘living’. Our data come from the Glottolog language catalogue, so we refer to the app as the ‘Glottolog Data Explorer’ in a paper with Christian Bentz, Dimitrios Alikaniotis, Fridah Katushemererwe and Paula Buttery for the VisLR Workshop at the Language Resources & Evaluation Conference (LREC) 2016. You can find our full paper in the conference proceedings.

Feedback welcome!



WaCky wordlist

The WaCky wordlist is a list of English words with their part-of-speech tags and phonemic forms. It’s based on the 2-billion-word UKWaC web corpus, tagged with TreeTagger and phonemicized with phonemizer. Words occurring 100 times or more are included, and the wordlist is available for download.
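
To illustrate the frequency cut-off described above, here is a minimal Python sketch of how a thresholded wordlist could be built from tagger output. It is not the script used to produce the WaCky wordlist: the function name `build_wordlist`, the `min_count` parameter, the lowercasing step and the toy input are all assumptions made for illustration, and the phonemic-form column is left out.

```python
from collections import Counter


def build_wordlist(tagged_tokens, min_count=100):
    """Count (word, tag) pairs and keep those seen at least min_count times.

    tagged_tokens: iterable of (word, pos_tag) pairs, e.g. from a tagger.
    Lowercasing is an assumption here, not a documented step.
    """
    counts = Counter((word.lower(), tag) for word, tag in tagged_tokens)
    kept = [(word, tag, n) for (word, tag), n in counts.items() if n >= min_count]
    # Sort by descending frequency, then alphabetically for stable output.
    kept.sort(key=lambda row: (-row[2], row[0], row[1]))
    return kept


# Toy input standing in for tagged corpus tokens (hypothetical data).
toy = [("the", "DT")] * 3 + [("cat", "NN")] * 2 + [("sat", "VVD")]
wordlist = build_wordlist(toy, min_count=2)
```

With the toy input and a threshold of 2, only the pairs for "the" and "cat" survive; on a full corpus the threshold would be 100, matching the description above.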



NLP for Chinese

We provide R scripts and dictionaries to normalise non-standard orthography in Chinese social media texts and to extract relevant features from them, as described in a paper with Wanru Zhang, Dimitrios Alikaniotis and Paula Buttery for the Language Resources & Evaluation Conference (LREC) 2016. You can find our full paper in the conference proceedings, and the code in the GitHub repo.

Feedback welcome!
