8  Miscellaneous

8.1 Pre-processing Behavioral Data

oNovitas offers several variables that reveal information about a participant’s browsing behavior. First, each news item comes with a variable called time_spent_on[news item id]. These variables are floats that describe how much time a participant spent reading a news item. Second, scroll_sequence describes how the participant navigated through the feed: it contains all the news items the mouse hovered over (on desktop devices) or the thumb touched (on mobile devices). Third, viewport_data describes how long a news item was visible to a participant.

These variables may reveal psychological drivers of consumer choices (Fisher and Woolley 2023). However, the latter two variables are stored in a more complex data structure, so this section explains how to process them with R.


First, we load some packages:

if (!requireNamespace("groundhog", quietly = TRUE)) {
    install.packages("groundhog")
    library("groundhog")
}

pkgs <- c("magrittr", "data.table", "stringr", "jsonlite")

groundhog::groundhog.library(pkg = pkgs,
                             date = "2023-09-25")

This chunk checks whether the {groundhog} package is installed and installs it if it is not. {groundhog} was developed by the Penn Wharton Credibility Lab and is designed for package management and reproducible science (see, e.g., Trisovic et al. 2022; Lindsay 2023). It loads (and, if necessary, installs) packages and their dependencies as they were available on CRAN on a chosen date. In doing so, it keeps any other installed versions of a required package rather than replacing them. This ensures that the same package versions are installed and loaded across operating systems and R versions.

Using groundhog::groundhog.library(), the chunk loads the following packages: {magrittr}, {data.table}, {stringr}, and {jsonlite}.


The columns that need to be processed look as follows:

participant.code scroll_sequence viewport_data
mcs4n5kj i0-i20-ibr-i20-i19-i3-i7-i5-i5-i9-i10-i16 [{""doc_id"":20,""duration"":1.2},{""doc_id"":19,""duration"":2.332},{""doc_id"":3,""duration"":3.209},{""doc_id"":7,""duration"":3.392},{""doc_id"":27,""duration"":5.113},{""doc_id"":5,""duration"":4.946},{""doc_id"":9,""duration"":9.885},{""doc_id"":10,""duration"":8.958},{""doc_id"":16,""duration"":47.007},{""doc_id"":8,""duration"":45.589},{""doc_id"":22,""duration"":0.068},{""doc_id"":15,""duration"":0.15},{""doc_id"":null,""duration"":0.166},{""doc_id"":null,""duration"":0.114},{""doc_id"":null,""duration"":0.114}]
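To follow along without the original export, the example row can be rebuilt by hand. The following is only a sketch under stated assumptions: it creates a one-row data table named DT, uses the prefixed column names (news.1.player.scroll_sequence and news.1.player.viewport_data) that the chunks in this section refer to, and shortens the viewport_data string to three entries. In a real project, DT would be read from the exported data file instead.

```r
library(data.table)

# Hypothetical one-row stand-in for the exported data.
# The values are copied from the example row; viewport_data is shortened
# to three entries and keeps the doubled double quotes of the raw text.
DT <- data.table(
  participant.code = "mcs4n5kj",
  news.1.player.scroll_sequence = "i0-i20-ibr-i20-i19-i3-i7-i5-i5-i9-i10-i16",
  news.1.player.viewport_data =
    '[{""doc_id"":20,""duration"":1.2},{""doc_id"":19,""duration"":2.332},{""doc_id"":null,""duration"":0.166}]'
)

# Inspect the column names and types
str(DT)
```

Both behavioral columns are plain character strings at this point, which is why they have to be split or parsed before they can be analyzed.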

8.1.1 scroll_sequence

DT[1, .(participant.code,
        item = news.1.player.scroll_sequence %>%
          strsplit(split = '-') %>%
          unlist() %>%
          str_replace_all(pattern = 'i', replacement = ''))][
            # fill = "" keeps the last row, whose lead value would otherwise be NA
            !(item == shift(item, type = "lead", fill = "")), ][, sequence := 1:.N]

This chunk operates on a data table named DT. It focuses on the news.1.player.scroll_sequence column. The steps involved are as follows:

  • It selects the first row of the data table (DT[1, ...]) and extracts two columns: participant.code and a new column called item.
  • The news.1.player.scroll_sequence text is split at the hyphens (-), the resulting list is unlisted to create a single vector of items, and every occurrence of the letter i is removed (for example, i20 becomes 20 and ibr becomes br). The item column remains a character vector because entries such as br are not numeric.
  • Rows where the item is the same as the next row’s item are filtered out, so that consecutive repetitions are counted only once. The last row has no next row; without fill = "", the comparison would return NA for it and data.table would drop that row as well.
  • A new column called sequence is added, numbering the rows from 1 to the total number of rows.

participant.code item sequence
mcs4n5kj 0 1
mcs4n5kj 20 2
mcs4n5kj br 3
mcs4n5kj 20 4
mcs4n5kj 19 5
mcs4n5kj 3 6
mcs4n5kj 7 7
mcs4n5kj 5 8
mcs4n5kj 9 9
mcs4n5kj 10 10
mcs4n5kj 16 11
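
The chunk above processes only the first row of DT. A possible way to handle all participants at once is sketched below. It is not part of the original workflow and rests on several assumptions: the second participant (example2) is made up for illustration, scroll_long and the helper column run are hypothetical names, and base R's strsplit() and sub() are used instead of {stringr}. Note that sub("^i", ...) removes only a leading i. Runs of identical consecutive items are identified with rleid() from {data.table}.

```r
library(data.table)

# Hypothetical two-participant example; the second row is invented
DT <- data.table(
  participant.code = c("mcs4n5kj", "example2"),
  news.1.player.scroll_sequence = c("i0-i20-ibr-i20-i19-i3-i7-i5-i5-i9-i10-i16",
                                    "i1-i1-i2-i1")
)

# One row per touched or hovered item, per participant
scroll_long <- DT[, .(item = sub("^i", "",
                                 unlist(strsplit(news.1.player.scroll_sequence,
                                                 split = "-", fixed = TRUE)))),
                  by = participant.code]

# rleid() assigns the same id to consecutive identical values;
# keeping the first row of each run removes immediate repetitions
scroll_long[, run := rleid(participant.code, item)]
scroll_long <- scroll_long[!duplicated(run)][, run := NULL]

# Number the remaining items within each participant
scroll_long[, sequence := seq_len(.N), by = participant.code]

print(scroll_long)
```

Working with run ids avoids comparing each row with its successor, so the last item of each participant needs no special treatment.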

8.1.2 viewport_data

DT[1, 
   .(participant.code, 
     news.1.player.viewport_data %>% 
       str_replace_all(pattern = '""', replacement = '"') %>% 
       fromJSON)][!is.na(doc_id)]

In this chunk, the focus is on the news.1.player.viewport_data column of the data table DT. The steps involved are as follows:

  • The first row of the data table is selected, and two elements are extracted: participant.code and a parsed version of news.1.player.viewport_data.
  • Within the news.1.player.viewport_data text, every doubled double quote ("") is replaced with a single double quote ("), which turns the text into valid JSON.
  • The resulting JSON text is parsed with the fromJSON() function into a data frame with the columns doc_id and duration. Entries whose doc_id is null receive a missing value (NA).
  • Rows with missing doc_id values are filtered out.

participant.code doc_id duration
mcs4n5kj 20 1.200
mcs4n5kj 19 2.332
mcs4n5kj 3 3.209
mcs4n5kj 7 3.392
mcs4n5kj 27 5.113
mcs4n5kj 5 4.946
mcs4n5kj 9 9.885
mcs4n5kj 10 8.958
mcs4n5kj 16 47.007
mcs4n5kj 8 45.589
mcs4n5kj 22 0.068
mcs4n5kj 15 0.150
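
As with scroll_sequence, the chunk above handles a single row. The sketch below shows one way to parse viewport_data for all participants and, if an item can enter the viewport more than once, to sum its durations. It is an illustration under assumptions rather than the original workflow: the second participant and the repeated doc_id 20 are invented, parse_viewport(), viewport_long, and viewport_total are hypothetical names, each participant is assumed to occupy exactly one row, and doc_id is assumed to be an integer or null.

```r
library(data.table)
library(jsonlite)

# Hypothetical two-participant example; the second row is invented
DT <- data.table(
  participant.code = c("mcs4n5kj", "example2"),
  news.1.player.viewport_data = c(
    '[{""doc_id"":20,""duration"":1.2},{""doc_id"":19,""duration"":2.332},{""doc_id"":20,""duration"":0.5},{""doc_id"":null,""duration"":0.166}]',
    '[{""doc_id"":3,""duration"":4.25}]'
  )
)

# Parse one raw viewport_data string into a data.table.
# Explicit type conversion keeps the column types identical across participants.
parse_viewport <- function(x) {
  json <- gsub('""', '"', x, fixed = TRUE)  # undo the doubled double quotes
  parsed <- fromJSON(json)
  data.table(doc_id = as.integer(parsed$doc_id),
             duration = as.numeric(parsed$duration))
}

# One row per viewport entry, per participant; drop entries without a doc_id
viewport_long <- DT[, parse_viewport(news.1.player.viewport_data),
                    by = participant.code][!is.na(doc_id)]

# Total visible time and number of entries per participant and news item
viewport_total <- viewport_long[, .(total_duration = sum(duration),
                                    n_views = .N),
                                by = .(participant.code, doc_id)]

print(viewport_total)
```

Converting doc_id and duration explicitly is a defensive choice: grouped operations in {data.table} expect every group to return columns of the same type.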