Section 4 Empirical Strategy

It is my opinion that an emphasis on the effects of causes rather than on the causes of effects is, in itself, an important consequence of bringing statistical reasoning to bear on the analysis of causation and directly opposes more traditional analyses of causation. — (Holland 1986)

While the previous chapter derived predictions about the causes of effects, this chapter deals with the statistical measurement of the effects of these causes. In particular, it describes how one can screen the data generated in our experiment to identify and describe reciprocal behavior, if there is any. The corresponding Appendix XY introduces the empirical workhorse model and derives the identifying assumptions formally; it shows how one can make causal inferences using the experimental data. This chapter mainly argues why these assumptions are reasonable in our experimental setting and reports the strategy that results from the formal derivations.

The first strategy I present is called the scientific solution[1] (see Appendix XY). It is based on the idea that one can treat the experiment’s first stage as a control condition and compare it with the second stage as a treatment condition. In this sense, the treatment can be understood as the exposure to reciprocity. This within-subject design seems reasonable because the game played in the first stage is similar to the second stage’s subgame in which the agent faces the performance-based mechanism: in both (sub)games, the participants engage in the same real-effort task and are paid in proportion to their effort provision. The two (sub)games only differ with respect to (1) the workload decision and (2) the social component: the agents’ payoffs from the second stage depend on the principals’ decisions (and the principals’ earnings depend on the agents’ decisions). While the latter difference (2) is exactly what should differ between the two treatment conditions, the former (1) should not be much of a problem if the agents chose their optimal level of effort provision in the sequence “find the optimal level, then choose the workload and perform accordingly”. I believe this is a reasonable assumption, especially because we designed the control questions such that participants must have understood this particular decision in order to answer them. Hence, they were aware of the decision’s consequences and had to choose their effort provision at that point in time. Under this assumption, the workload decision is irrelevant for self-interested agents and merely served as a commitment device for reciprocal agents who intended to punish the principal. Consequently, the workload decision, if anything, is expected to strengthen the effect of reciprocity. In conclusion, I argue that the comparison between the first stage and the respective subgame of the second stage is as good as ceteris paribus.

We intend to interpret the observed difference between the productivity in Stage 1 and the performance in Stage 2 as the causal effect of reciprocity. To do so, we have to rule out all other factors that could cause such a difference. Because the experiment was designed such that each within-subject comparison is based on measurements conducted in the same sequence (“first control, then treatment”), we have to rule out that time confounded the observed difference. After all, subjects might have been tired after running through the box-clicking task for the first time. Alternatively, they could have improved their ability to click on boxes as fast as possible during the first stage. One could then argue that the observed difference is driven not by reciprocity but by learning or fatigue effects. The analysis therefore depends on two postulates called causal transience and temporal stability.[2] In broad terms, they mean that any effect the control condition might have on the effort provision in a later stage is reversible and that the immediate effect of the control condition is stable (that is, the same at every point in time). These two identifying assumptions are powerful because, as long as they are reasonable, they allow me to interpret the observed differences as causal. (Note that there is no omitted variable bias because one and the same observation is exposed to both the control and the treatment condition.) Whether they are reasonable is hard to say. The task itself was designed such that no knowledge is needed to complete it; as a consequence, there was no knowledge to gain during the first stage. Also, each participant went through a short trial round, which allowed them to acquire the necessary skills even before the first stage began. In addition, there was an extensive break between the two stages due to the control questions, which gave the subjects time to recover. The problem is that we cannot test whether there were learning or fatigue effects, which is what qualifies these assumptions as postulates. I thus assume them to hold and leave it to future work to design an experimental environment in which they can be tested.

The important question then is whether the effect, if we observe one, matches my predictions. The following equation describes the observed differences (which we intend to interpret as causal) for the specific subset of agents who were exposed to the performance-based mechanism in Stage 2:

\[ \Delta Y_u = \alpha + \beta Y_{u1} + \upsilon_u \]

I use the subscript \(u\) to denote agents of this subset and define the left-hand side as the observed difference: \(\Delta Y_u \equiv Y_{u2} - Y_{u1}\). Importantly, \(Y_{uT}\) describes what I earlier referred to as the productivity (in \(T=1\)) or the performance (in \(T=2\)) and which I denoted by \(l\) in Chapter XY. \(Y_{uT}\) is thus not to be confused with the reciprocity parameter \(Y_{ij}\) from the previous section.

To understand the regression equation, revisit Figure XY. \(\Delta Y_u\) describes the difference between the red curve and the red dashed (\(45^{\circ}\)) line, and \(Y_{u1}\) constitutes the horizontal axis. The predictions imply a negative difference to the left of \(Y_{u1} = 0.5\) and a positive difference to the right of this threshold. Consider now Figure XY, which illustrates the same elements as Figure XY but plots \(\Delta Y_u\) on the vertical axis.

[Figure XY: OLS strategy (images/11_OLS_Strategy.pdf)]

Here, the red line, which I intend to estimate, is predicted to cross the horizontal axis at the threshold (\(Y_{u1} = 0.5\)), such that \(\Delta Y_u = 0\) at this particular point. In terms of the regression, this translates into a negative constant (\(\alpha < 0\)) as well as a positive slope of \(\beta = |2 \cdot \alpha|\) (if one expects the causal effect to be linear).
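
To make the strategy concrete, the following is a minimal sketch in R of how one could construct \(\Delta Y_u\) and estimate the regression. It assumes a hypothetical data frame `agents_pb` containing one row per agent of the relevant subset (performance-based mechanism in Stage 2) with columns `Y1` (Stage-1 productivity) and `Y2` (Stage-2 performance); these names are illustrative, not the actual variable names in our data.

```r
# Construct the observed difference Delta Y_u = Y_u2 - Y_u1 and estimate
# Delta Y_u = alpha + beta * Y_u1 + upsilon_u by OLS.
agents_pb$dY <- agents_pb$Y2 - agents_pb$Y1

ols <- lm(dY ~ Y1, data = agents_pb)
summary(ols)

# The predictions correspond to a negative intercept and a slope of roughly
# twice its absolute value:
coef(ols)["(Intercept)"]  # expected: alpha < 0
coef(ols)["Y1"]           # expected: beta approximately equal to |2 * alpha|
```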

While the theory of Chapter XY would be supported by data that are best described by parameters corresponding to these predictions (\(\alpha<0\) and \(\beta \approx |2 \cdot \alpha|\)), there are, of course, other scenarios one can think of: If, for instance, \(\Delta Y_u\) equals zero for every productivity level (and thus \(\alpha = \beta = 0\)), one would conclude that reciprocity did not affect the working morale at all and reject my predictions. The second case, where \(\Delta Y_u\) is non-zero, is a little more complex to evaluate. If \(\alpha \neq 0\) and \(\beta = 0\), one could reject the predictions as well. (In addition, one might be tempted to reject the assumptions of causal transience and temporal stability: \(\alpha < 0\) together with \(\beta = 0\), for instance, implies that, no matter the productivity, the agents are expected to perform worse in Stage 2 than in Stage 1 – and this could be explained by fatigue. It could, however, also mean that participants dislike being monitored by a real person.) As a negative \(\beta\) stands in stark contrast to my predictions, one could conclude that my predictions turn out to be wrong if \(\beta \leqslant 0\), no matter the constant.
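
Rather than eyeballing the estimates, the sharper prediction \(\beta = |2 \cdot \alpha|\) (equivalently \(2\alpha + \beta = 0\)) can be assessed with a Wald test of this single linear restriction. The sketch below uses the car package and the hypothetical model object `ols` from above; this is one possible implementation, not necessarily the one used for the final analysis.

```r
# Wald test of the linear restriction 2 * alpha + beta = 0, i.e., the estimated
# regression line crosses zero exactly at Y_u1 = 0.5.
library(car)

# Hypothesis matrix (one row): 2 * (Intercept) + 1 * Y1 = 0
linearHypothesis(ols, hypothesis.matrix = c(2, 1), rhs = 0)
```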

If, however, \(\alpha < 0\) and either \(0 < \beta < |2 \cdot \alpha|\) or \(0 < |2 \cdot \alpha| < \beta\), the intersection between the regression line and the horizontal axis would lie to the right or to the left of \(Y_{u1} = 0.5\), respectively. Would that mean that my predictions were wrong? Not necessarily, as this could be explained by the non-linearity that I described at the end of Section XY and in Figures XY and XY. In some cases, it might therefore be hard to judge whether the data actually support my predictions or prove them wrong.
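
For this judgement call it helps to compute the productivity level at which the estimated line crosses the horizontal axis and compare it with 0.5; a small sketch based on the model object `ols` from above:

```r
# Implied zero-crossing of the estimated line: -alpha / beta (= |alpha| / beta
# when alpha < 0 and beta > 0). Under the linear prediction it should lie close to 0.5.
a_hat <- unname(coef(ols)["(Intercept)"])
b_hat <- unname(coef(ols)["Y1"])
crossing <- -a_hat / b_hat
crossing
```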

To sum up, I have argued that the first stage and the second stage (under the performance-based mechanism) only differ in a social dimension that I interpret as the exposure to reciprocity. Given the postulates, I suggest running a simple OLS regression to estimate the average treatment effect of reciprocity on the agents’ working morale at any observed productivity level. I furthermore indicated which realizations of \(\alpha\) and, more importantly, of \(\beta\) would prove my predictions wrong, and concluded that for some realizations of these parameters it is less clear-cut whether the data actually support the predictions.

If the mentioned identifying assumptions or postulates are unreasonable, however, one has to apply the so-called statistical solution and compare different subsets of the population of agents with each other. Broadly speaking, the remainder of this section argues that one can compare agents that share similar, yet not identical, productivities with each other to make inferences about the average effect reciprocity has on their working morale. As before, I use an agent’s productivity as the main explanatory variable. The treatment variable, however, is defined differently since it indicates whether agents are expected to feel treated kindly or unkindly. The agents’ performance in Stage 2 serves as the response variable in what follows.

With that stated, a regression discontinuity design (RDD) intends to find a discontinuous jump around the defined threshold. Focusing on the subset of agents who faced the performance-based mechanism, one expects the performance of agents with productivities marginally higher than one half to be discontinuously higher than that of agents with productivities marginally lower. One can therefore say that the agents’ measured productivity assigns them to one of two treatment conditions, which one can call perceived kindness and perceived unkindness. The identifying assumption then is that agents, even while having some influence, are unable to precisely manipulate the variable that assigns them to one of the treatment groups.
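
As an illustration of how this could be implemented, the sketch below uses the rdrobust package (one common choice, not necessarily the package used for the final analysis) on the hypothetical data frame `agents_pb` introduced above, with the threshold at \(q = 0.5\).

```r
# Sketch of an RDD around the threshold q = 0.5: Stage-2 performance (Y2) as the
# outcome, Stage-1 productivity (Y1) as the running variable. A positive estimate
# would indicate that marginally "kind" agents perform discontinuously better
# than marginally "unkind" ones.
library(rdrobust)

rd <- rdrobust(y = agents_pb$Y2, x = agents_pb$Y1, c = 0.5)
summary(rd)

# Binned scatter plot of the data around the threshold for visual inspection.
rdplot(y = agents_pb$Y2, x = agents_pb$Y1, c = 0.5)
```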

I claim that agents do not have perfect control over their productivity. I therefore argue that it is reasonable to make inferences using an RDD strategy. As this is the most important claim concerning this strategy, it deserves some support: First, note that the threshold is arbitrary. Besides assigning agents to their treatment condition, it has no more meaning than any of the other values in the neighborhood of \(q\). Also, agents did not know about the importance of this particular value. Consequently, they had no incentive to deliberately manipulate their productivity accordingly. Second, each participant worked on 25 screens with 35 boxes per screen, so that \(q\) corresponds to \(437.5\) boxes clicked away in either \(275\) or \(175\) seconds. Due to the large number of boxes and the fact that they were arranged randomly[3], it seems highly doubtful that participants were aware of their score during the task. Hence, even if participants had intended to manipulate their productivity so as to end up just above the threshold, it would have been extremely difficult for them. One might then argue that participants who clicked away \(438\) boxes differed in some latent or omitted characteristic from those who clicked away \(437\) boxes. But as clean as a lab environment might be, I believe that there are still environmental factors that affect such a minuscule difference. Take the computer mice, the tables’ textures, the sunlight or the air quality during the sessions as examples. On an individual level, the smallest lag of the computer, a sneeze or sunlight interfering with the graphics on the screen in some corners of the laboratory can make the difference between productive and unproductive agents. All these factors can tip the outcome to either side and cannot be controlled by the participants. So even if some agents are especially likely to have productivity values near one half, each of these agents would have approximately the same probability of being productive (slightly above \(q\)) or unproductive (slightly below \(q\)) – similar to a coin-flip experiment. As such, assignment into treatment is as good as random (around the threshold). Consequently, agents with a productivity of \(q \pm \varepsilon\) with \(\varepsilon \to 0\) are, on average, expected to be comparable – they should not differ systematically in any characteristic that could confound my analysis. Note that this assumption is not a postulate, that is, one can test whether it is reasonable. We can, for instance, look at the covariates of those who are just below and just above the threshold. One can also plot the distribution of \(Y_1\) to spot whether the values are distributed unevenly around \(q\). If all these variables are distributed smoothly, the identifying assumption is likely to be met.
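
Because the assumption is testable, the checks just mentioned can be sketched as follows. The covariate names (`age`, `female`) are placeholders for whatever characteristics were actually collected, and the window width of 0.05 is an arbitrary choice for illustration.

```r
# (1) Covariate balance just below vs. just above the threshold q = 0.5.
eps  <- 0.05                                    # arbitrary width of the discontinuity sample
near <- subset(agents_pb, abs(Y1 - 0.5) <= eps)
near$above <- near$Y1 >= 0.5

t.test(age    ~ above, data = near)             # placeholder covariate
t.test(female ~ above, data = near)             # placeholder covariate

# (2) Distribution of Y1 around the threshold: bunching just above q would be
# suspicious. (A formal density test is available in rddensity::rddensity.)
hist(agents_pb$Y1, breaks = 40, main = "Distribution of Stage-1 productivity")
abline(v = 0.5, lty = 2)
```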

Depending on what the data will eventually look like, a concern might be that a possible discontinuity around the threshold reflects unaccounted-for non-linearity: the jump in the upper panel of Figure XY is likely to disappear if one takes into account that the data are censored at zero. To address such a concern, one can run different specifications (including polynomials) or focus on the “discontinuity sample” – that is, on observations close to the threshold (Angrist and Lavy 1999), as explained above. Since the latter approach requires no polynomials or otherwise complex specifications, it can also be described as non-parametric. Another robustness check I suggest is to run “placebo RDDs”. The idea here is to choose some arbitrary productivity values (ideally before seeing the data) and to treat these values as if they were the threshold. If the resulting RDDs detect discontinuities at these arbitrary values, one might doubt that the original discontinuity is caused by reciprocity.[4] To run these checks, I programmed a Shiny app.
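
A simple way to run such placebo checks is sketched below. Instead of a data-driven bandwidth it uses a fixed window around each placebo cutoff, mirroring the “discontinuity sample” idea; the cutoff values and the window width are arbitrary illustrative choices, not taken from the data.

```r
# Estimate the "jump" at arbitrary placebo cutoffs where no discontinuity should
# occur, using a local linear regression within a window of +/- h around each cutoff.
placebo_jump <- function(cut, data, h = 0.05) {
  local <- subset(data, abs(Y1 - cut) <= h)
  fit <- lm(Y2 ~ I(Y1 - cut) * I(Y1 >= cut), data = local)
  unname(coef(fit)["I(Y1 >= cut)TRUE"])   # estimated discontinuity at the cutoff
}

placebo_cutoffs <- c(0.35, 0.40, 0.45, 0.55, 0.60, 0.65)   # arbitrary illustrative values
sapply(placebo_cutoffs, placebo_jump, data = agents_pb)

# Sizeable "jumps" at these placebo cutoffs would cast doubt on interpreting the
# discontinuity at q = 0.5 as the effect of reciprocity.
```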

In summary, this strategy deviates from the theoretical predictions as it assumes the causal effect of reciprocity not to be a smooth function. If this is true, and if there is a causal effect in the first place, it should be identified using a non-parametric RDD as described above. Compared to the linear OLS specification, this approach has the advantage that it allows us not only to focus on the agents who were exposed to the performance-based mechanism, but also to narrow in on the agents who faced the random mechanism. After all, the theory from Chapter XY also applies to those subjects: they should perceive the choice of the random mechanism as kind (unkind) if they were unproductive (productive). A second advantage is that it can, in principle, be applied even if causal transience and temporal stability are unreasonable to assume. This comes, however, at the cost that the analysis might be labeled as exploratory, since the discontinuity clashes with the predictions I derived earlier. In addition, there have to be enough data points in the neighborhood of \(q\), which is not yet the case with our data.

References

Holland, Paul W. 1986. “Statistics and Causal Inference.” Journal of the American Statistical Association 81 (396): 945–60.

Angrist, Joshua D., and Victor Lavy. 1999. “Using Maimonides’ Rule to Estimate the Effect of Class Size on Scholastic Achievement.” The Quarterly Journal of Economics 114 (2): 533–75.


  1. These postulates are the reason this strategy was coined the “scientific solution”: the natural sciences have made great progress by making such assumptions. If you, for instance, throw a stone within an absolute vacuum to make inferences about the effect of the vacuum on some variable such as the distance travelled, and then compare it to a comparable throw under “normal” conditions, you have to make these two assumptions. You assume that the stone would land at the same distance no matter at which point in time you throw it (moon phases do not affect anything here) and that throwing the stone in the vacuum does not change its flying characteristics for a later throw.

  2. These assumptions reflect what I called “separability of effort costs” before.

  3. The arrangement of boxes differed between the screens but was identical for all participants, given any specific screen.

  4. The placebo approach is often used in difference-in-differences designs.