8 Miscellaneous

8.1 Pre-processing Behavioral Data

oNovitas offers several variables that reveal information about a participant’s browsing behavior. First, each news item comes with a variable called time_spent_on[news item id]. These variables are floats that describe how much time a participant spent reading a news item. Second, scroll_sequence describes how the participant navigated through the feed: it contains all the news items the mouse hovered over (on desktop devices) or the thumb touched (on mobile devices). Third, viewport_data describes how long a news item was visible to a participant.

These variables may reveal psychological drivers of consumer choices (Fisher and Woolley 2023). However, the latter two variables are stored in a somewhat complex data structure, which is why this section explains how to process them with R.

First, we load some packages:

if (!requireNamespace("groundhog", quietly = TRUE)) {
    install.packages("groundhog")
    library("groundhog")
}

pkgs <- c("magrittr", "data.table", "stringr", "jsonlite")

groundhog::groundhog.library(pkg = pkgs,
                             date = "2023-09-25")

This chunk checks whether the {groundhog} package is installed and installs it if it is not. groundhog was developed by the Penn Wharton Credibility Lab and is designed for package management and reproducible science (see, e.g., Trisovic et al. 2022; Lindsay 2023). It loads (and, if necessary, installs) packages and their dependencies in the versions that were available on CRAN on a chosen date. In doing so, it keeps any other installed versions of a required package rather than replacing them. This ensures that the same package versions are installed and loaded across operating systems and R versions.

Using groundhog::groundhog.library(), the chunk loads the following packages: {magrittr}, {data.table}, {stringr}, and {jsonlite}.

The columns that need to be processed look as follows:

participant.code	scroll_sequence	viewport_data
mcs4n5kj	i0-i20-ibr-i20-i19-i3-i7-i5-i5-i9-i10-i16	[{""doc_id"":20,""duration"":1.2},{""doc_id"":19,""duration"":2.332},{""doc_id"":3,""duration"":3.209},{""doc_id"":7,""duration"":3.392},{""doc_id"":27,""duration"":5.113},{""doc_id"":5,""duration"":4.946},{""doc_id"":9,""duration"":9.885},{""doc_id"":10,""duration"":8.958},{""doc_id"":16,""duration"":47.007},{""doc_id"":8,""duration"":45.589},{""doc_id"":22,""duration"":0.068},{""doc_id"":15,""duration"":0.15},{""doc_id"":null,""duration"":0.166},{""doc_id"":null,""duration"":0.114},{""doc_id"":null,""duration"":0.114}]

8.1.1 scroll_sequence

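# Split the first participant's sequence into one row per item
DT[1, .(participant.code,
        item = news.1.player.scroll_sequence %>%
          strsplit(split = '-') %>%
          unlist() %>%
          str_replace_all(pattern = 'i', replacement = ''))
][!(item == shift(item, type = "lead")), # drop items equal to the next row's item
][, sequence := 1:.N]                    # number the remaining rows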

This chunk operates on a data table named DT. It focuses on the news.1.player.scroll_sequence column. The steps involved are as follows:

It selects the first row of the data table (DT[1, ...]) and extracts two columns: participant.code and a new column called item.
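The news.1.player.scroll_sequence text is split at hyphens (-), the resulting list is unlisted into a single vector of items, and every occurrence of the letter i is removed. The items remain character strings, and non-numeric tokens such as br are kept (without their leading i).
Rows where the item is the same as the next row’s item are filtered out, so consecutive repeats collapse into a single entry. Because the last row has no successor, its comparison yields NA, which data.table treats as FALSE; the final item (16 in this example) is therefore dropped as well. A new column called sequence is added, numbering the remaining rows from 1 to the total number of rows.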

participant.code	item	sequence
mcs4n5kj	0	1
mcs4n5kj	20	2
mcs4n5kj	br	3
mcs4n5kj	20	4
mcs4n5kj	19	5
mcs4n5kj	3	6
mcs4n5kj	7	7
mcs4n5kj	5	8
mcs4n5kj	9	9
mcs4n5kj	10	10
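
The same idea can be extended to all participants at once. The following is a minimal sketch, not part of oNovitas itself: the toy data (DT_toy and its values) are made up for illustration, only the column names mirror the ones used above. It also swaps in data.table's rleid() for the lead comparison, so that consecutive repeats collapse without losing the last item, and it removes only the leading i of each token instead of every i.

```r
library(data.table)

# Hypothetical toy data; only the column names follow the export used above
DT_toy <- data.table(
  participant.code = c("aaa", "bbb"),
  news.1.player.scroll_sequence = c("i0-i20-i20-i19-i3", "i1-i1-i2-i1")
)

# One row per hovered or touched item, per participant
scroll_long <- DT_toy[
  , .(item = sub("^i", "", unlist(strsplit(news.1.player.scroll_sequence,
                                           split = "-", fixed = TRUE)))),
  by = participant.code
]

# rleid() assigns the same id to consecutive runs of identical items
scroll_long[, run := rleid(item), by = participant.code]

# Keep the first row of each run; the run id doubles as the sequence number
scroll_long <- unique(scroll_long, by = c("participant.code", "run"))
setnames(scroll_long, "run", "sequence")

scroll_long
```

Because each run keeps its first row, an item that is revisited later (such as item 1 for participant bbb) still appears twice, which preserves the order of navigation.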

8.1.2 viewport_data

DT[1, 
   .(participant.code, 
     news.1.player.viewport_data %>% 
       str_replace_all(pattern = '""', replacement = '"') %>% 
       fromJSON)][!is.na(doc_id)]

In this chunk, the focus is on the news.1.player.viewport_data column within the data table DT. Here’s what happens:

The first row of the data table is selected, and two columns are extracted: participant.code and a modified version of news.1.player.viewport_data.

The news.1.player.viewport_data text is processed to replace instances of "" with a single double quote (").

The modified text, which is now valid JSON, is parsed into an R data frame using the fromJSON() function.

Rows with missing doc_id values (null in the raw data, which fromJSON() reads as NA) are filtered out.

participant.code	doc_id	duration
mcs4n5kj	20	1.200
mcs4n5kj	19	2.332
mcs4n5kj	3	3.209
mcs4n5kj	7	3.392
mcs4n5kj	27	5.113
mcs4n5kj	5	4.946
mcs4n5kj	9	9.885
mcs4n5kj	10	8.958
mcs4n5kj	16	47.007
mcs4n5kj	8	45.589
mcs4n5kj	22	0.068
mcs4n5kj	15	0.150
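
To process every participant rather than only the first row, one option is to parse each cell separately and stack the results. The sketch below assumes made-up toy data (DT_toy) whose column names follow the ones used above; the object names viewport_long and viewport_total are hypothetical. It stacks the parsed pieces with rbindlist() and then sums the viewing time per participant and news item, which is one possible summary and not necessarily the one intended by oNovitas.

```r
library(data.table)
library(jsonlite)

# Hypothetical toy data with doubled quotes, as in the raw export shown above
DT_toy <- data.table(
  participant.code = c("aaa", "bbb"),
  news.1.player.viewport_data = c(
    '[{""doc_id"":20,""duration"":1.2},{""doc_id"":20,""duration"":0.8},{""doc_id"":null,""duration"":0.1}]',
    '[{""doc_id"":3,""duration"":2.5}]'
  )
)

# Parse each participant's JSON string and stack the resulting tables
viewport_long <- rbindlist(
  lapply(seq_len(nrow(DT_toy)), function(k) {
    json <- gsub('""', '"', DT_toy$news.1.player.viewport_data[k], fixed = TRUE)
    parsed <- as.data.table(fromJSON(json))
    parsed[, participant.code := DT_toy$participant.code[k]]
    parsed
  })
)[!is.na(doc_id)]                      # drop entries without a news item
setcolorder(viewport_long, "participant.code")

# Total time each news item was visible, per participant
viewport_total <- viewport_long[
  , .(total_duration = sum(duration)),
  by = .(participant.code, doc_id)
]

viewport_total
```

Stacking with rbindlist() rather than grouping with by avoids type conflicts when fromJSON() returns integers for one participant and doubles for another.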