OneStop: A 360-Participant English Eye Tracking Dataset with Different Reading Regimes#
📄 Paper | 📚 Documentation | 💾 Data | 🔬 More from LaCC Lab
Example#
Overview#
OneStop Eye Movements (in short OneStop) is a large-scale English corpus of eye movements in reading with 360 L1 participants, 2.6 million word tokens and 152 hours of recorded eye tracking data. The dataset was collected using an EyeLink 1000 Plus eyetracker (SR Research).
OneStop comprises four sub-corpora, one for each of the following reading regimes:
Ordinary reading for comprehension
Information seeking
Repeated reading
Information seeking in repeated reading
We provide the entire corpus, as well as each of the sub-corpora separately. If you are looking for a general purpose eye tracking corpus (like Dundee, GECO, MECO and others), we recommend downloading the ordinary reading sub-corpus.
Key Features#
Texts and Reading Comprehension Materials#
30 articles with 162 paragraphs in English from the Guardian.
Annotations of part-of-speech tags, syntactic dependency trees, word frequency and word surprisal.
Each paragraph has two versions: an Advanced version (original Guardian text) and a simplified Elementary version.
Extensively piloted reading comprehension questions based on the STARC (Structured Annotations for Reading Comprehension) annotation framework.
3 multiple-choice reading comprehension questions per paragraph.
486 reading comprehension questions in total.
Auxiliary text annotations for answer choices.
Statistics#
Statistics of OneStop and other public broad-coverage eyetracking datasets for English L1.
Category |
Dataset |
Subjects |
Age |
Words |
Words Recorded |
Qestions |
Subjects per Question |
Questions per Subject |
---|---|---|---|---|---|---|---|---|
Reading Comprehension |
OneStop |
360 |
22.8±5.6 |
19,425 (Advanced) |
2,632,159 (Paragraphs) |
486 |
20 |
54 |
66 |
NA |
2,539 |
167,574 |
20 |
95 |
20 |
||
Passages |
Dundee |
10 |
NA |
51,502 |
307,214 |
NA |
10 |
NA |
14 |
21.8±5.6 |
56,410 |
774,015 |
NA |
14 |
NA |
||
84 |
NA |
2,689 |
225,624 |
0 |
0 |
0 |
||
46 |
21.0±2.2 |
2,109 |
83,246 |
48 |
46 |
48 |
||
Sentences |
69 |
26.3±6.7 |
61,233 |
122,423 |
78 |
69 |
78 |
|
18 |
34.3±8.0 |
15,138 |
272,484 |
42 |
18 |
42 |
||
43 |
25.8±7.5 |
1,932 |
81,144 |
110 |
43 |
110 |
‘Reading Comprehension’ are datasets with a substantial reading comprehension component over piloted reading comprehension materials. The remaining datasets are general purpose datasets over passages or individual sentences. ‘Words’ is the number of words in the textual corpus. ‘Words Recorded’ is the number of word tokens for which tracking data was collected. ‘NA’: data not available.
Controlled experimental manipulations#
Reading goal: ordinary reading for comprehension or information seeking.
Paragraph difficulty level: original Guardian article (Advanced) or simplified (Elementary).
Question identity: one of three possible questions for each paragraph.
Prior exposure to the text: first reading or repeated reading.
Experiment Structure#
Each participant reads 10 Guardian articles paragraph by paragraph, and answers a reading comprehension question after each paragraph. After reading a 10-article batch, participants read two of the previously presented articles for a second time. Half of the participants are in an information seeking regime where they are presented with the question prior to reading the paragraph.
Trial Structure#
A single experimental trial consists of reading a paragraph and answering one reading comprehension question about it. Trials have the following Interest Periods, corresponding to pages in the experiment:
Question Preview (only in the information seeking regime) - the participant reads the question (self-paced).
Paragraph - the participant reads the paragraph (self-paced).
Question - the participant reads the question again (self-paced).
QA - retains the question, and also displays the four possible answers in a cross arrangement. Participants are then required to choose one of the answers and confirm their choice (self-paced).
Feedback - informs the participants on whether they answered the question correctly (presented for one second).
Pages presented only in the information seeking regime are depicted in green.
Overall, the dataset includes 19,438 regular trials, 9,720 in the information seeking regime and 9,718 in the ordinary reading regime. The dataset also includes 3,888 repeated reading trials, split equally between the two reading regimes.
Obtaining the Data#
Direct Download from OSF#
The data is hosted on OSF. We provide the possibility to download four sub-corpora with eye movement recordings from paragraph reading, one for each of the following reading regimes:
OneStop Ordinary Reading - download this data if you are interested in a general purpose eye tracking dataset
and two versions of the full dataset:
OneStop - paragraph reading for the entire dataset.
OneStop - Full - complete trials for the entire dataset, divided into title, question preview, paragraph, question, answers, QA (question + answers) and feedback.
In addition, we release metadata files which include the participant questionnaire and experiment summary statistics.
Python Script#
The data can also be downloaded using the provided Python script. The script will download and extract the data files.
Basic usage to download the entire dataset:
Make sure you have Python installed.
If you don’t have Python installed, you can download it from here.
Get the Code
Open your terminal/command prompt:
Windows: Press
Win + R
, typecmd
and press EnterMac: Press
Cmd + Space
, typeterminal
and press Enter
Run this command to download the code:
git clone https://github.com/lacclab/OneStop-Eye-Movements.git
Move into the downloaded folder:
cd OneStop-Movements
Run the Download Script
Run this command to download the full dataset:
python onestop/download_data_files.py
The data will be downloaded to a folder called “OneStop”
Available options:
--extract
: Extract downloaded zip files (default: True)--asc
: Download ASC files (default: False)--edf
: Download EDF files (default: False)-o, --output-folder
: Specify output folder (default: “OneStop”)--mode
: Choose dataset subcorpora to download (default: “full”)Options: “full”, “all-regimes”, “ordinary”, “information-seeking”, “repeated”,”information-seeking-in-repeated”
Download Specific Parts (Optional)
Example usage to download only the ordinary reading subset:
python onestop/download_data_files.py --mode ordinary
pymovements integration#
OneStop is partially integrated into the pymovements package. The package allows for easy download of the full paragraph corpora. The following code snippet shows how to download the data:
First, install the package in the terminal:
pip install pymovements
Then, use the following python code to download the data:
import pymovements as pm
dataset = pm.Dataset('OneStop', path='data/OneStop')
dataset.download()
This will download the data to the data/OneStop
folder. You can also specify a different path by changing the path
argument.
Documentation Structure#
The documentation is organized into the following sections:
Data Files and Variables: Provides detailed information about the data files and variables used in the project.
Known Issues: Documents any known issues.
Scripts: Contains scripts for data preprocessing and analysis reproduction.
Citation#
Paper: OneStop: A 360-Participant English Eye Tracking Dataset with Different Reading Regimes
@article{berzak2025onestop,
title={OneStop: A 360-Participant English Eye Tracking Dataset with Different Reading Regimes},
author={Berzak, Yevgeni and Malmaud, Jonathan and Shubi, Omer and Meiri, Yoav and Lion, Ella and Levy, Roger},
journal={PsyArXiv preprint},
year={2025}
}
License#
The data and code are licensed under a Creative Commons Attribution 4.0 International License.