Word-timestamped transcripts of two spoken narrative recall functional neuroimaging datasets

After watching audiovisual movies, human participants produced spoken narrative recollections during functional magnetic resonance imaging (fMRI); presented here are word-level timestamps of their speech, temporally aligned to the publicly shared fMRI data. For the “FilmFestival” dataset, twenty participants watched ten short audiovisual movies, approximately 2-8 minutes each. For the “Sherlock” dataset, seventeen participants watched the first half of the first episode of BBC's Sherlock (48 minutes). After viewing, participants verbally described what they remembered about the movies in their own words. Participants’ speech was recorded using an MR-compatible microphone. The audio recordings were transcribed, then timestamped by a forced aligner; missing timestamps were filled in manually by human transcriptionists referencing the audio recording. Each file contains the participant's recall word by word, the onset of each word in seconds with 1/10th-second precision, and the corresponding fMRI volume number (TR). This dataset can be used to investigate topics such as naturalistic memory and language production.
Value of the Data
• The word-level timestamps are matched to the volume numbers of the fMRI data collected as the participants spoke, thereby enabling time-resolved analyses of brain activity during speech.
• Scientists interested in questions about language can use these data to study brain activity during extended natural speech.
• Scientists interested in questions about memory can use these data to study brain activity during spontaneous spoken recollection.
• The speech was recorded as participants recalled an array of movies with widely varied content, providing opportunities for analyses of memories of stimulus features such as emotionality, social situations, and music.

Objective
This word-timestamps dataset supplements the "Sherlock" and "FilmFestival" fMRI datasets (results published in [1-4]), facilitating future analyses. Participants' speech has been analyzed to record the onset of each word and aligned to the fMRI volume numbers (TRs). In prior publications, timestamps were provided only at the "scene" level, not at the word level.

Data Description
The "FilmFestival: Word Timestamps" dataset includes transcribed and timestamped verbal recollections of twenty human participants, collected during a free spoken memory task following naturalistic movie-watching in an MRI scanner. The "Sherlock: Word Timestamps" dataset includes transcribed and timestamped verbal recollections of seventeen human participants. See Tables 1-2. Each Excel spreadsheet (XLSX) contains a participant's verbal recollections timestamped at the word level and matched with the corresponding MRI volume number (TR). Each row of the XLSX file includes one transcribed word of the participant's utterance, the onset time of that word relative to the onset of the scanning run in seconds, the corresponding TR number, and the start time of the corresponding TR. See Tables 3-4.

Table 2
A list of the files from the "FilmFestival: Word Timestamps" dataset. While there are 20 participants, there are 25 files because some participants' spoken recall was split into two recordings due to its length. These subjects are JNB_194233, JNO_194133, ONH_224233, SBS_213433, and SKG_214133.

Experimental Design, Materials and Methods
While in an MRI scanner, participants watched audiovisual movies, then verbally recounted what they remembered from the movies in their own words. Participants' speech was recorded with an Optoacoustics FOMRI III MR-compatible microphone and the recording software Audacity. The fMRI data from two such experiments, as well as annotations of the movies, are posted publicly elsewhere (see Data Accessibility).
Human workers manually transcribed the audio recording of each subject's spoken recall and segmented each transcript into discrete sentences based on pauses in the speech and changes in topic. Timestamps for each word were first generated using the forced aligner Gentle [5]. Any missing timestamps were filled in manually by the same workers referencing the audio file using the software Audacity. The onset of the fMRI scan for each subject was then determined from the audio recording, and timestamps were adjusted to align with the fMRI volumes. The original audio recordings are not publicly shared due to privacy concerns.
The word timestamps are not adjusted to account for hemodynamic lag. In prior studies that used scene-level timestamps with the same primary fMRI data referenced here [1,2], timestamps were shifted by 3 TRs relative to the brain data to account for hemodynamic lag. For most analyses, we recommend applying a 3-TR shift to these word timestamps; for example, if the onset time given in this dataset for word X is t = 55 TRs, the corresponding BOLD response is considered to be at t = 58 TRs. Nonetheless, we supply the unshifted word timestamps so that users can apply the parameters of their own choosing.
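As a minimal sketch, the recommended 3-TR adjustment could be applied as follows (the function and variable names here are illustrative, not part of the dataset; the word-onset TRs shown are example values):

```python
# Shift word-onset TRs by a fixed lag to align them with the expected
# BOLD response, as recommended above for most analyses.
HEMODYNAMIC_LAG_TRS = 3  # 3 TRs, following prior studies on these data

def shift_for_lag(word_onset_trs, lag=HEMODYNAMIC_LAG_TRS):
    """Return the TR numbers at which the BOLD response to each word
    is expected, given the unshifted word-onset TRs."""
    return [tr + lag for tr in word_onset_trs]

word_onset_trs = [55, 56, 58]        # example values, not from the dataset
print(shift_for_lag(word_onset_trs))  # [58, 59, 61]
```

Because the shared spreadsheets are unshifted, users who prefer a different hemodynamic model can simply substitute their own lag value.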
In the word timestamp spreadsheets, Column A lists the transcribed speech, one word per row. Column B lists the onset time of each word in seconds (accurate to a tenth of a second), as identified by the forced aligner or human judgment. Column C lists the onset time of each word in TRs, i.e., the TR (brain volume number) in which the word began (words sometimes extend into the next TR). Column D lists the onset time (in seconds) of the TRs in Column C. If the word onset occurred exactly at the start boundary of a TR, the word was counted as part of that TR. See Tables 3-4.
For Sherlock: Columns B-D give timing information for the words, aligned to the Princeton Dataspace version of the fMRI data. Columns E-G give timing information aligned to the OpenNeuro version of the fMRI data. Note that subject numbers do not match between the two versions of the fMRI data: subject 5 is omitted in the OpenNeuro version; subjects 1-4 are the same between the two versions; subjects 6-17 in the Princeton Dataspace version correspond to subjects 5-16 in the OpenNeuro version. See Table 3.
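The subject-number correspondence described above can be expressed as a small helper function (a sketch; the function name is ours, not part of the dataset):

```python
def princeton_to_openneuro(subject):
    """Map a Sherlock subject number from the Princeton Dataspace
    numbering (1-17) to the OpenNeuro numbering (1-16).

    Subjects 1-4 are identical in both versions; subject 5 exists only
    in the Princeton Dataspace version; Princeton subjects 6-17
    correspond to OpenNeuro subjects 5-16.
    """
    if not 1 <= subject <= 17:
        raise ValueError(f"Princeton subject numbers run 1-17, got {subject}")
    if subject <= 4:
        return subject
    if subject == 5:
        return None  # omitted from the OpenNeuro version
    return subject - 1

print(princeton_to_openneuro(4))   # 4
print(princeton_to_openneuro(5))   # None
print(princeton_to_openneuro(17))  # 16
```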
For FilmFestival: Columns B-D give timing information for the words, aligned to the fMRI data posted on OpenNeuro. No other versions of the fMRI data are publicly available at the time of this writing. See Table 4.
The word onset time is derived from the audio file. The TR number was calculated using the following Excel formula: =CEILING(([word onset time] + 0.1) / 1.5, 1). The CEILING function rounds a number up, away from zero, to the nearest multiple of the specified significance. Since each TR is 1.5 seconds, dividing the word onset time by 1.5 and rounding up to the nearest integer returns the correct aligned TR (rounding down would return the aligned TR minus one). The 0.1 was added to the word onset time before division to handle instances where the word onset time equals a TR onset time: we considered a word beginning at the same time as a TR to belong to that TR, but without the offset the formula would return the previous TR. Because word onset times are only precise to a tenth of a second, adding 0.1 before dividing by 1.5 ensures that CEILING rounds up to the correct TR. See Table 5.
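The spreadsheet formulas above can be sketched in Python (a minimal reimplementation, assuming a 1.5 s TR and 0.1 s onset precision as described; the function names are ours):

```python
import math

TR_SECONDS = 1.5  # repetition time of the fMRI acquisition
PRECISION = 0.1   # word onsets are reported to a tenth of a second

def onset_to_tr(word_onset_s):
    """TR (1-indexed volume number) containing a word onset, matching
    Excel's =CEILING((onset + 0.1) / 1.5, 1). The 0.1 s offset makes a
    word starting exactly at a TR boundary count toward that TR."""
    return math.ceil((word_onset_s + PRECISION) / TR_SECONDS)

def tr_onset(tr_number):
    """Onset time in seconds of a TR, matching Excel's
    =([TR number] * 1.5) - 1.5. TR 1 begins at time zero."""
    return (tr_number * TR_SECONDS) - TR_SECONDS

print(onset_to_tr(1.5))  # 2: a word starting exactly at the TR 2 boundary
print(onset_to_tr(1.4))  # 1: still within TR 1
print(tr_onset(2))       # 1.5
```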

Table 5
Example of the formula used to calculate the TR number from the word onset time. The formula we used is shown in Column B. We considered a word beginning at the same time as a TR to belong to that TR. TR 2 begins at 1.5 seconds, so a word beginning at 1.5 seconds should be listed as TR 2. However, simply dividing the word onset time by 1.5 does not always return this, as shown in Column C: the division can return a whole number (in this example, 1) that the CEILING function does not round up. By adding 0.1 to 1.5 before dividing, the division instead yields approximately 1.07, which CEILING rounds up to correctly return 2 as the TR number, as shown in Column B. The TR onset time was calculated in Excel using the formula =([TR number] * 1.5) - 1.5. Since the first TR, TR 1, begins at time zero, it is necessary to subtract 1.5 seconds when converting from TR number to TR onset time.

Table 1
A list of the files from the "Sherlock: Word Timestamps" dataset.

Table 3
An example of the format and columns included in each Sherlock timestamp file (example is from NN1_JEX_181931). Columns B-D give timing information for the words, aligned to the Princeton Dataspace version of the fMRI data. Columns E-G give timing information aligned to the OpenNeuro version of the fMRI data.

Table 4
An example of the format and columns included in each FilmFestival timestamp file (example is from ANE_164234_recallA). Columns B-D give timing information for the words, aligned to the fMRI data posted on OpenNeuro. No other versions of the fMRI data are publicly available at the time of this writing.