3. Data
Swedish tweets were downloaded from Feb. 10, 2022 (two weeks before the Lund study was published) to Nov. 10, 2022. The tweets were collected with the keyword pattern m-?RNA.* (‘?’: the preceding character is optional; ‘.*’: zero or more characters) or the hashtag #mRNA, together with the filter lang:sv (Swedish content). The final tweet data set consisted of 1,700 unique tweets from 730 different users. In addition, we collected ca. 7,600 unique posts from the popular Swedish forum Flashback (https://www.flashback.org/), from 18 different discussion threads, all related to COVID-19 and mRNA vaccination.
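To illustrate what the keyword pattern accepts, the following minimal sketch applies it to a few example strings in Python. It is not the collection code used in the study. The word boundary, the restriction of ‘.*’ to the current token (`\S*`), the case-insensitive matching and the function name `matches_query` are all assumptions made for this illustration. The language filter lang:sv is a property of the search query and is not modeled here.

```python
import re

# Assumed Python rendering of the pattern m-?RNA.* :
#   \b      word boundary, so the match starts at the beginning of a word (assumption)
#   m-?RNA  "mRNA" or "m-RNA" (the hyphen is optional)
#   \S*     zero or more further characters, limited to the same token (assumption)
# Case-insensitive matching is also an assumption.
KEYWORD = re.compile(r"\bm-?RNA\S*", re.IGNORECASE)


def matches_query(text: str) -> bool:
    """Return True if the text contains a keyword variant or the hashtag #mRNA."""
    # "#mRNA" is covered by the same pattern, because "#" is a non-word
    # character and therefore leaves a word boundary before the "m".
    return KEYWORD.search(text) is not None


examples = [
    "Nytt mRNA-vaccin godkänt",
    "m-RNA studie publicerad",
    "Läs mer #mrna",
    "RNA och DNA",
]
hits = [text for text in examples if matches_query(text)]
```

In this sketch the first three strings match, while the last one does not, because ‘RNA’ is not preceded by ‘m’.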
3.1. Preprocessing and data cleansing
We preprocessed the data using R 4.2.1. For each tweet and each post, we stored the text and some relevant metadata, such as the date of publication. Duplicates, Swedish stop words and numbers were deleted, and the text was converted to lowercase. During normalization, identified token variants such as ‘mrna vaccin’ and ‘mrnavaccin’ were converted to a single uniform format, here ‘mrna-vaccin’. The data set was then tokenized (essentially separating punctuation and metadata from words). Multiword expressions, statistically significant collocations and phrasal verbs were also recognized, and their contiguous components were joined with an underscore prior to further processing (e.g., big_pharma, in_vitro or spruta_in ‘to inject’). Posts with fewer than 3 tokens were removed due to a small search volume. For the structural topic modeling we used the R package stm (version 1.3.6) [9].
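The preprocessing steps described above can be sketched as follows. The study used R; this Python version is only an illustration under stated assumptions. The stop word set is a tiny hypothetical subset and the multiword list contains only the three examples given in the text. The order of the steps, the regular expressions and the function names `preprocess` and `clean_corpus` are also assumptions. In particular, multiword expressions are joined before stop words are removed, so that no component of an expression is lost.

```python
import re

# Tiny illustrative subset of Swedish stop words (assumption, not the list used in the study).
STOP_WORDS = {"och", "att", "det", "som", "en", "är", "i", "på", "av"}

# Multiword expressions taken from the examples in the text.
MULTIWORD = {("big", "pharma"), ("in", "vitro"), ("spruta", "in")}

# Variants "mrna vaccin", "mrnavaccin" and "mrna-vaccin" (assumed pattern).
VARIANT = re.compile(r"\bmrna[ -]?vaccin")

# A token is a run of letters, optionally with internal hyphens;
# digits and punctuation are dropped (assumed tokenization rule).
TOKEN = re.compile(r"[^\W\d]+(?:-[^\W\d]+)*")


def preprocess(text):
    """Lowercase, normalize variants, tokenize, join multiword units, drop stop words."""
    text = VARIANT.sub("mrna-vaccin", text.lower())
    tokens = TOKEN.findall(text)
    joined = []
    i = 0
    while i < len(tokens):
        # Join two adjacent tokens with an underscore if they form a known expression.
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) in MULTIWORD:
            joined.append(tokens[i] + "_" + tokens[i + 1])
            i += 2
        else:
            joined.append(tokens[i])
            i += 1
    return [token for token in joined if token not in STOP_WORDS]


def clean_corpus(posts, min_tokens=3):
    """Remove exact duplicates and posts with fewer than min_tokens tokens."""
    seen = set()
    cleaned = []
    for post in posts:
        if post in seen:
            continue
        seen.add(post)
        tokens = preprocess(post)
        if len(tokens) >= min_tokens:
            cleaned.append(tokens)
    return cleaned


corpus = clean_corpus([
    "mRNAvaccin är bra",
    "mRNAvaccin är bra",
    "Big Pharma och mRNA vaccin testas in vitro 2022!",
])
```

Here the first post is reduced to two tokens and removed, the second is a duplicate, and the third is kept as `big_pharma`, `mrna-vaccin`, `testas`, `in_vitro`.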
4. Structural Topic Modeling
Latent Dirichlet Allocation (LDA) [10] is a popular topic modeling method that uses statistical analysis of textual data to identify themes or topics that occur in a document collection. The structural topic model (STM) has emerged as an extension of LDA that allows covariates to be integrated into the prior distributions for document-topic compositions and topic-word proportions. STM can thereby be used to model how the content of a collection of documents changes as a function of document-level covariates, such as day and time, and to gain insight into how topics evolve. Since there is no “correct” solution for determining the optimal number of topics, k, several diagnostic aspects of the candidate models were evaluated during model selection to decide which k to use. The stm package implements several evaluation metrics, such as the spread of semantic coherence [11] and exclusivity, which both capture what humans qualitatively perceive as good topics [9]. After preprocessing, a document-term matrix with 7,600 documents and 18,900 terms (i.e., unique words) was created and used for modeling; the best model yielded 14 topics. Figure 1a shows semantic coherence vs. exclusivity for the models, while Figure 1b shows the temporal evolution of two of the identified topics.
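To make the semantic coherence diagnostic more concrete, the sketch below computes it for a toy corpus, following the commonly cited definition in [11]: for the top words of a topic, sum log((D(v_m, v_l) + 1) / D(v_l)) over all word pairs, where D counts the documents containing the given word or words. This is a plain Python illustration, not the stm implementation. The function name `semantic_coherence` and the toy documents are hypothetical. Exclusivity is not covered here.

```python
import math


def semantic_coherence(top_words, documents):
    """Semantic coherence of one topic, given its top words ordered by probability.

    Assumes every top word occurs in at least one document, which holds
    when the top words are taken from a model fitted on the same corpus.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(word in doc for word in words) for doc in doc_sets)

    score = 0.0
    for m in range(1, len(top_words)):
        for l in range(m):
            co_occurrence = doc_freq(top_words[m], top_words[l])
            score += math.log((co_occurrence + 1) / doc_freq(top_words[l]))
    return score


# Hypothetical toy corpus of already preprocessed documents.
docs = [
    ["mrna-vaccin", "biverkning"],
    ["mrna-vaccin", "biverkning"],
    ["mrna-vaccin", "big_pharma"],
    ["big_pharma", "in_vitro"],
]

# Words that often share a document give a higher (less negative) score.
coherent = semantic_coherence(["mrna-vaccin", "biverkning"], docs)
incoherent = semantic_coherence(["mrna-vaccin", "in_vitro"], docs)
```

In this toy example the first word pair co-occurs in two documents and the second pair in none, so the first topic receives the higher score. When comparing models with different k, such scores are typically averaged over topics and weighed against exclusivity.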