Datasets


CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading

Eye movement recordings of 69 native English speakers and 296 English learners reading Wall Street Journal (WSJ) newswire sentences. Each participant reads 156 sentences: 78 sentences shared across participants and 78 unique to each participant.
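As a rough check on the corpus design described above, the following sketch derives the totals implied by the stated figures. It assumes that the 78 individual sentences of any two participants never overlap, which is how "unique to each participant" is read here; the variable names are illustrative only.

```python
# Figures taken from the dataset description above.
N_L1 = 69          # native English speakers
N_L2 = 296         # English learners
SHARED = 78        # sentences read by every participant
INDIVIDUAL = 78    # sentences read by one participant only

participants = N_L1 + N_L2
per_participant = SHARED + INDIVIDUAL

# Assumption: individual sentence sets are disjoint across participants.
distinct_sentences = SHARED + participants * INDIVIDUAL

# Total number of (participant, sentence) reading records.
sentence_readings = participants * per_participant

print(participants, per_participant, distinct_sentences, sentence_readings)
```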

  1. OPMI
    CELER: A 365-Participant Corpus of Eye Movements in L1 and L2 English Reading
    Yevgeni Berzak, Chie Nakamura, Amelia Smith, Emily Weng, Boris Katz, Suzanne Flynn, and Roger Levy
    Open Mind 2022

OneStopQA

OneStopQA is a multiple-choice reading comprehension dataset annotated according to the STARC (Structured Annotations for Reading Comprehension) scheme. The reading materials are Guardian articles taken from the OneStopEnglish corpus. Each article comes in three difficulty levels: Elementary, Intermediate and Advanced. Each paragraph is annotated with three multiple-choice reading comprehension questions, which can be answered based on any of the three paragraph levels.
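The structure described above can be pictured with a minimal sketch. The class and field names below are hypothetical and do not reflect the released file format; the texts are placeholders, not dataset content. The sketch only illustrates that each paragraph exists at three levels and that each of its three questions can be paired with any of them.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

LEVELS = ("Elementary", "Intermediate", "Advanced")


@dataclass
class Question:
    # Hypothetical fields; the real annotation scheme (STARC) is richer.
    prompt: str
    answers: List[str]
    correct_index: int


@dataclass
class Paragraph:
    # One version of the paragraph text per difficulty level.
    text_by_level: Dict[str, str]
    questions: List[Question] = field(default_factory=list)


def reading_items(paragraph: Paragraph) -> List[Tuple[str, str, Question]]:
    """Pair every question with every level of the same paragraph."""
    return [
        (level, paragraph.text_by_level[level], question)
        for level in LEVELS
        for question in paragraph.questions
    ]


# Placeholder content for illustration only.
paragraph = Paragraph(
    text_by_level={level: f"<{level} text>" for level in LEVELS},
    questions=[
        Question(prompt=f"<question {i}>", answers=["<a>", "<b>"], correct_index=0)
        for i in range(1, 4)
    ],
)

items = reading_items(paragraph)
print(len(items))
```

With three levels and three questions, one paragraph yields nine (level, text, question) combinations.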

  1. ACL
    STARC: Structured Annotations for Reading Comprehension
    Yevgeni Berzak, Jonathan Malmaud, and Roger Levy
    In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics 2020

Treebank of Learner English

The Treebank of Learner English (TLE) is a collection of 5,124 English as a Second Language (ESL) sentences (97,681 words), manually annotated with POS tags and dependency trees in the Universal Dependencies formalism. The dataset represents upper-intermediate level adult English learners from 10 native language backgrounds, with over 500 sentences for each native language. The sentences were drawn from the Cambridge Learner Corpus First Certificate in English (FCE) corpus.

  1. ACL
    Universal Dependencies for Learner English
    Yevgeni Berzak, Jessica Kenney, Carolyn Spadine, Jing Xian Wang, Lucia Lam, Keiko Sophie Mori, Sebastian Garza, and Boris Katz
    In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics 2016

Anchoring and Agreement in Syntactic Annotations

  • POS and Universal Dependencies annotations and reviews for 300 sentences from the PTB-WSJ section 23 (4 annotators).
  • POS and Universal Dependencies annotations, reviews and disagreement rankings for 360 sentences from the CLC FCE (5 annotators).
  1. EMNLP
    Anchoring and agreement in syntactic annotations
    Yevgeni Berzak, Yan Huang, Andrei Barbu, Anna Korhonen, and Boris Katz
    In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing 2016

Language and Vision Ambiguities (LAVA)

Language and Vision Ambiguities (LAVA) is a multimodal corpus that supports the study of ambiguous language grounded in vision. The corpus contains ambiguous sentences coupled with visual scenes that depict the different interpretations of each sentence. LAVA sentences cover a wide range of linguistic ambiguities, including PP and VP attachment, conjunctions, logical form, anaphora and ellipsis.

  1. EMNLP
    Do You See What I Mean? Visual Resolution of Linguistic Ambiguities
    Yevgeni Berzak, Andrei Barbu, Daniel Harari, Boris Katz, and Shimon Ullman
    In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing 2015