Discussing research – day 4: language for discussing data collection

When discussing your research methods and results, it’s essential to explain how you gathered your data.

To make your text less repetitive and more engaging for your reader, use synonyms where possible.

Verbs for discussing data collection

Gather – meaning: to look for and bring together information from different sources

We gathered data from three villages in the area.

Collect – meaning: to put together things from different sources

After collecting samples from the site, we began our analysis.

Capture – meaning: to collect information or data, often using a computer

Pronunciation: The letters ‘t’ and ‘u’ together create a /tʃ/ sound; the ‘p’ is still pronounced – /ˈkæptʃə/

We captured and analysed the data using our newly designed software.

Extract – meaning: to remove one substance from another OR to select information to use for a specific purpose

The process of extracting metal from ore can be labour-intensive.

We extracted data from local archives.

Pronunciation: The word stress changes in the noun form. The verb is stressed on the second syllable (to exTRACT), while the noun is stressed on the first (an EXtract).

Survey – meaning: to ask people a series of questions in order to get information from them

Pronunciation: The first syllable in this word uses the long vowel sound /ɜː/ (as in her, third, work) – /ˈsɜːveɪ/ (not /ˈsɔːveɪ/)

The UK pronunciation of this word has the stress on the first syllable, while the North American pronunciation of this word has the stress on the second syllable.

We surveyed 250 people aged 18-40 from the local area.

Curate – meaning: to collect, select, organise and present information

Pronunciation: There is an intrusive /j/ sound after the /k/ sound at the start of this word – /kjʊˈreɪt/ (not /kuːˈreɪt/)

We curated our findings ahead of the presentation.

Nouns and useful collocations

A dataset – meaning: a collection of different sets of connected data that can be processed by a computer as a single unit

Pronunciation: The word ‘data’ has two possible pronunciations in UK English – /ˈdeɪtə/ and /ˈdɑːtə/

These adjectives can be used with this noun: initial, existing, available, reliable

We used an existing dataset from a previous study.

A survey – meaning: a series of questions that you ask a number of people

  • These adjectives can be used with this noun: brief, small-scale, comprehensive, detailed, large-scale
  • These verbs are often used with this noun: design, compile, carry out, conduct, publish, launch, complete, participate in
  • These verbs are often used after this noun: suggest, claim, reveal, show, highlight, indicate, discover, report, conclude

We designed and conducted a comprehensive survey of 300 primary school teachers in the region.

A report – meaning: an official document giving information about a specific subject

  • These adjectives can be used with this noun: initial, original, unconfirmed, anecdotal, accurate
  • These verbs are often used with this noun: compile, prepare, produce, issue, launch, deliver, submit, present
  • These verbs are often used after this noun: note, mention, demonstrate, indicate, address, outline, investigate, examine, discuss, conclude, confirm, propose, recommend

Although their initial report indicated that there was strong evidence of malpractice, the subsequent investigation concluded that no regulations had been broken.

The extracts below from different parts of the same article show how some of this language can be used.

Epidemic dreams: dreaming about health during the COVID-19 pandemic

Extract part 1 – from section on materials and methods

2.1.1. Tweets (pre-pandemic and pandemic waking discussions)

As a pre-pandemic baseline, we used an existing dataset of tweets from the period between 1 January and 24 February of 2014 to capture pre-pandemic waking discussions. This dataset consisted of 974 482 English tweets posted by 240 959 unique users. To then study pandemic waking discussions, we used an existing dataset of 129 911 732 tweets collected through the Twitter Streaming API [30]. The dataset collected by Chen et al. includes tweets containing words from a manually curated list of terms related to COVID-19 and is openly shared via GitHub [31]. From this initial dataset, we extracted 57 287 490 English tweets posted in the year 2020 from 1 February to 30 April by 11 318 634 unique users.

Extract part 2 – from the conclusion

The second limitation has to do with our method for extracting mentions of medical conditions from text. Although MedDL is a state-of-the-art tool with top-class accuracy, its output is not error-free. Because MedDL was trained on social media data only, misclassifications could be more frequent when applied to the dream reports dataset. Our qualitative analysis did not produce evidence for any systematic error that would compromise our results. However, future work could collect additional training data specific to dream reports.

The third limitation has to do with the quality and scope of our two datasets. Our Twitter dataset, albeit large, is not fully representative of the general population. Studies on Twitter are exposed to issues of data noise [43,44], representativeness [45] and self-presentation biases [46]. In the USA, the country in which Twitter has highest penetration rate, socio-demographic characteristics deviate from the general population: Twitter users are much younger, with a higher level of formal education, and are more likely to support the Democratic Party [32]. Our collection of dream reports has limitations too. We gathered the reports through a web survey without imposing limits or constraints on the input text, which could introduce some noise in the data.

This extract is taken from: Šćepanović Sanja, Aiello Luca Maria, Barrett Deirdre, Quercia Daniele, 2022. Epidemic dreams: dreaming about health during the COVID-19 pandemic. R. Soc. Open Sci. 9: 211080. http://doi.org/10.1098/rsos.211080

