8 Miscellaneous
8.1 Pre-processing Behavioral Data
oNovitas offers several variables that reveal information about a participant’s browsing behavior. First, each news item comes with a variable called `time_spent_on[news item id]`. These variables are floats that describe how much time a participant spent reading a news item. Second, `scroll_sequence` describes how the participant navigated through the feed: it contains all the news items the mouse hovered over (on desktop devices) or the thumb touched (on mobile devices). Third, `viewport_data` describes how long a news item was visible to a participant.
These variables may reveal psychological drivers of consumer choices (Fisher and Woolley 2023). However, the latter two variables are stored as encoded strings (a hyphen-separated list and a JSON array, respectively), which is why this section explains how to process them with R.
First, we load some packages:
if (!requireNamespace("groundhog", quietly = TRUE)) {
  install.packages("groundhog")
  library("groundhog")
}
pkgs <- c("magrittr", "data.table", "stringr", "jsonlite")
groundhog::groundhog.library(pkg = pkgs,
                             date = "2023-09-25")
This chunk checks whether the {groundhog} package is installed and installs it if it is not. groundhog was developed by the Penn Wharton Credibility Lab and is designed for package management and reproducible science (see, e.g., Trisovic et al. 2022; Lindsay 2023). It loads (and, if necessary, installs) packages and their dependencies as they were available on CRAN on a chosen date. In doing so, it keeps any other installed versions of a required package rather than replacing them. This ensures that the same package versions are installed and loaded across operating systems and R versions.
Using `groundhog::groundhog.library()`, the chunk loads the following packages: {magrittr}, {data.table}, {stringr}, and {jsonlite}.
The columns that need to be processed look as follows:
participant.code | scroll_sequence | viewport_data |
---|---|---|
mcs4n5kj | i0-i20-ibr-i20-i19-i3-i7-i5-i5-i9-i10-i16 | [{""doc_id"":20,""duration"":1.2},{""doc_id"":19,""duration"":2.332},{""doc_id"":3,""duration"":3.209},{""doc_id"":7,""duration"":3.392},{""doc_id"":27,""duration"":5.113},{""doc_id"":5,""duration"":4.946},{""doc_id"":9,""duration"":9.885},{""doc_id"":10,""duration"":8.958},{""doc_id"":16,""duration"":47.007},{""doc_id"":8,""duration"":45.589},{""doc_id"":22,""duration"":0.068},{""doc_id"":15,""duration"":0.15},{""doc_id"":null,""duration"":0.166},{""doc_id"":null,""duration"":0.114},{""doc_id"":null,""duration"":0.114}] |
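To try the processing steps without access to real data, a small stand-in for `DT` can be built by hand. The sketch below is only an illustration: it assumes that the raw export carries the `news.1.player.` prefix on both columns (as the chunks in this section do) and it shortens the `viewport_data` value to three entries.

```r
library(data.table)

# Toy stand-in for the exported data: a single participant.
# The viewport string is a shortened version of the example row and keeps
# the doubled double quotes ("") found in the raw export.
DT <- data.table(
  participant.code = "mcs4n5kj",
  news.1.player.scroll_sequence = "i0-i20-ibr-i20-i19-i3-i7-i5-i5-i9-i10-i16",
  news.1.player.viewport_data = paste0(
    '[{""doc_id"":20,""duration"":1.2},',
    '{""doc_id"":19,""duration"":2.332},',
    '{""doc_id"":null,""duration"":0.166}]'
  )
)

str(DT)
```

Both columns are plain character strings at this point, which is why they have to be split or parsed before they can be analyzed.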
8.1.1 scroll_sequence
DT[1, .(participant.code,
        item = news.1.player.scroll_sequence %>%
          strsplit(split = '-') %>%
          unlist() %>%
          str_replace_all(pattern = 'i', replacement = ''))][!(item == shift(item, type = "lead")), ][, sequence := 1:.N]
This chunk operates on a data table named `DT` and focuses on the `news.1.player.scroll_sequence` column. The steps involved are as follows:
- It selects the first row of the data table (`DT[1, ...]`) and returns two columns: `participant.code` and a new column called `item`.
- The `news.1.player.scroll_sequence` text is split at the hyphens (`-`), the resulting list is unlisted into a single vector of items, and every occurrence of the letter `i` is removed. The items remain character strings, so non-numeric entries such as `br` are kept.
- Rows whose `item` equals the next row’s `item` are filtered out, which collapses consecutive repeats. Because the last row has no successor, its comparison evaluates to `NA` and it is dropped as well, which is why the final item (`16`) is missing from the result. A new column called `sequence` is then added, numbering the remaining rows from 1 to the total number of rows.
participant.code | item | sequence |
---|---|---|
mcs4n5kj | 0 | 1 |
mcs4n5kj | 20 | 2 |
mcs4n5kj | br | 3 |
mcs4n5kj | 20 | 4 |
mcs4n5kj | 19 | 5 |
mcs4n5kj | 3 | 6 |
mcs4n5kj | 7 | 7 |
mcs4n5kj | 5 | 8 |
mcs4n5kj | 9 | 9 |
mcs4n5kj | 10 | 10 |
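The same idea can be extended to all participants at once. The following is a minimal sketch, not the original workflow: it assumes one row per participant, uses a made-up second participant (`zz000000`), and swaps the `shift()` comparison for `rleid()`, which keeps the last item of each sequence. It also strips only the leading `i` of each entry rather than every `i`.

```r
library(data.table)

# Hypothetical sample: the second participant and its sequence are invented
DT <- data.table(
  participant.code = c("mcs4n5kj", "zz000000"),
  news.1.player.scroll_sequence = c(
    "i0-i20-ibr-i20-i19-i3-i7-i5-i5-i9-i10-i16",
    "i1-i1-i2"
  )
)

# One row per visited item, computed separately for each participant
long <- DT[,
  .(item = sub("^i", "", unlist(strsplit(news.1.player.scroll_sequence,
                                         split = "-", fixed = TRUE)))),
  by = participant.code
]

# rleid() assigns the same id to consecutive repeats of an item;
# keeping the first row per id collapses each run into a single row
long[, sequence := rleid(item), by = participant.code]
long <- unique(long, by = c("participant.code", "sequence"))

long
```

Because `rleid()` numbers runs rather than comparing each row with its successor, no `NA` arises for the final row, and `sequence` already restarts at 1 for every participant.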
8.1.2 viewport_data
DT[1,
   .(participant.code,
     news.1.player.viewport_data %>%
       str_replace_all(pattern = '""', replacement = '"') %>%
       fromJSON)][!is.na(doc_id)]
In this chunk, the focus is on the `news.1.player.viewport_data` column of the data table `DT`. Here’s what happens:
- The first row of the data table is selected, and two things are extracted: `participant.code` and a parsed version of `news.1.player.viewport_data`.
- In the `news.1.player.viewport_data` text, every doubled double quote (`""`) is replaced with a single double quote (`"`), which turns the text into valid JSON.
- The resulting JSON text is parsed into an R object (a data frame with the columns `doc_id` and `duration`) using the `fromJSON()` function.
- Rows with missing `doc_id` values are filtered out.
participant.code | doc_id | duration |
---|---|---|
mcs4n5kj | 20 | 1.200 |
mcs4n5kj | 19 | 2.332 |
mcs4n5kj | 3 | 3.209 |
mcs4n5kj | 7 | 3.392 |
mcs4n5kj | 27 | 5.113 |
mcs4n5kj | 5 | 4.946 |
mcs4n5kj | 9 | 9.885 |
mcs4n5kj | 10 | 8.958 |
mcs4n5kj | 16 | 47.007 |
mcs4n5kj | 8 | 45.589 |
mcs4n5kj | 22 | 0.068 |
mcs4n5kj | 15 | 0.150 |
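To process every participant rather than only the first row, the parsing step can be wrapped in a small helper and applied by group. This is a sketch under assumptions: `parse_viewport()` is a hypothetical helper, the second participant and all durations are invented, and each participant is assumed to occupy exactly one row. The final aggregation is just one possible follow-up.

```r
library(data.table)
library(jsonlite)

# Hypothetical sample data with doubled double quotes, as in the raw export
DT <- data.table(
  participant.code = c("mcs4n5kj", "zz000000"),
  news.1.player.viewport_data = c(
    '[{""doc_id"":20,""duration"":1.2},{""doc_id"":20,""duration"":0.8},{""doc_id"":null,""duration"":0.166}]',
    '[{""doc_id"":3,""duration"":2.5}]'
  )
)

# Hypothetical helper: restore valid JSON, parse it, and fix the column types
# so that the results of all participants can be stacked safely
parse_viewport <- function(raw) {
  parsed <- fromJSON(gsub('""', '"', raw, fixed = TRUE))
  list(doc_id   = as.integer(parsed$doc_id),
       duration = as.numeric(parsed$duration))
}

# Parse each participant's string, then drop entries without a doc_id
viewport <- DT[, parse_viewport(news.1.player.viewport_data),
               by = participant.code][!is.na(doc_id)]

# Total time each news item was visible, per participant
totals <- viewport[, .(total_duration = sum(duration)),
                   by = .(participant.code, doc_id)]

totals
```

Coercing the types inside the helper guards against the case where one participant's durations happen to parse as integers while another's parse as doubles.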