# Social Learning

These exercises were inspired by and adapted from Models of Learning by Jill O’Reilly and Hanneke den Ouden, NSCS 344 - Modeling the Mind by Robert C. Wilson, the Neuromatch Academy tutorials [CC BY 4.0], Lockwood, et al. (2016), and Lengersdorff, et al. (2020).

In this module, we will explore how to apply reinforcement learning and decision-making models to behavior in social contexts. In this framework, individuals rely on **prediction errors** (the difference between the actual outcome and the expected outcome) from sampling different options or actions that yield either positive or negative outcomes. In this way, individuals learn **value associations** between actions/options and their outcomes.

While most reinforcement learning studies examine how individuals learn the association between actions/options and outcomes that affect themselves alone, recent studies have begun to test how this framework could be applied in social contexts, i.e., when people learn various associations **from others** (Lindström, et al., 2019) and **for others** (Lockwood, et al., 2016; Lengersdorff, et al., 2020), or when people learn and update their beliefs **about others** (Chang, et al., 2010; Siegel, et al., 2018).

We have read about the first two types of learning in social contexts so far (and will begin learning about the third soon). Now, we will examine a specific type and unpack how this learning might change across different contexts/situations.

## Learning to win for others

We will first review the model that best explained human behavior during a prosocial learning study, where individuals learned how their actions affected outcomes for **themselves**, a **stranger**, and **no one** (Lockwood, et al., 2016).

They tested the following three models:

### Rescorla-Wagner Model w/ six free parameters (3α3θ)

Expectations of future reward for action \(a\), (\(Q_{t+1}^{a}\)), should be a function of current expectations \(Q_{t}^{a}\) and a prediction error, which is the discrepancy between the actual reward that has just been experienced on this trial \(r_t\) (coded as 1 or 0 for reward or no reward, respectively) and the expected reward for this trial \(t\), (\(Q_{t}^{a}\)):

\[Q_{t+1}^{a} = Q_{t}^{a} + \alpha \cdot (r_{t} - Q_{t}^{a})\]

where the learning rate \(\alpha\) controls the extent to which the current expected value is updated by new information. A low \(\alpha\) minimizes the influence of the prediction error and the amount that the value is updated. The probability that a subject chooses action \(a\) on trial \(t\), given the expected values of the available actions \(Q_{t}^{a}\), is given by the softmax function:

\[p_{t}(a | Q_{t}^{a}) = \frac{\exp(\theta \cdot Q_{t}^{a})}{\sum_{a} \exp(\theta \cdot Q_{t}^{a})}\]

The inverse temperature parameter \(\theta\) controls the extent to which the subject chooses the option with the highest expected value versus exploring other, potentially more rewarding actions. The softmax function estimates the trial-by-trial probability of each action by scaling the expected values by \(\theta\) before they are compared. A low \(\theta\) would lead to similar action probabilities irrespective of the expected value of each action (resulting in random behavior). A high \(\theta\) would lead to consistent behavior, where the action with the higher expected value is selected on nearly every trial.
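
The learning rule and decision rule above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the original study: the function names `rw_update` and `softmax`, the starting value of 0.5, and the example rewards are all assumptions made for the demonstration.

```python
import math

def rw_update(q, reward, alpha):
    """Rescorla-Wagner update: move q toward the reward by a fraction alpha."""
    prediction_error = reward - q
    return q + alpha * prediction_error

def softmax(q_values, theta):
    """Choice probability for each action, given values and inverse temperature."""
    # Subtract the largest scaled value before exponentiating (numerical stability).
    scaled = [theta * q for q in q_values]
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Track the value of one action over three hypothetical outcomes.
q = 0.5  # assumed starting value
for reward in [1, 1, 0]:
    q = rw_update(q, reward, alpha=0.3)

# Compare a low and a high inverse temperature for the same pair of values.
p_low = softmax([0.8, 0.2], theta=0.0)    # equal probabilities
p_high = softmax([0.8, 0.2], theta=20.0)  # strongly favors the first action
```

With \(\theta = 0\) both actions are equally likely whatever their values, while a large \(\theta\) puts nearly all of the probability on the higher-valued action.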

In this model, separate \(\alpha\) and \(\theta\) parameters were estimated for each of the self, prosocial, and no one conditions. This resulted in six free parameters: \(\alpha_{Self}\), \(\alpha_{Social}\), \(\alpha_{NoOne}\), \(\theta_{Self}\), \(\theta_{Social}\), \(\theta_{NoOne}\).
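
One way to picture the six-parameter model is as a likelihood function that looks up a different \(\alpha\) and \(\theta\) depending on the recipient of each trial. The sketch below is an assumption-laden illustration rather than the authors' implementation: it assumes two options per condition, initial values of 0.5, and hypothetical condition labels (`"self"`, `"social"`, `"no_one"`).

```python
import math

def neg_log_likelihood(params, conditions, choices, outcomes, q_init=0.5):
    """Negative log-likelihood with condition-specific alpha and theta.

    params: dict mapping condition name -> (alpha, theta)
    conditions, choices, outcomes: one entry per trial;
    choices are 0 or 1, outcomes are 0 (no reward) or 1 (reward).
    """
    # Separate action values for each condition (assumes two options each).
    q = {cond: [q_init, q_init] for cond in params}
    neg_ll = 0.0
    for cond, choice, reward in zip(conditions, choices, outcomes):
        alpha, theta = params[cond]
        values = q[cond]
        # Two-option softmax written as a logistic function of the value difference.
        diff = values[choice] - values[1 - choice]
        p_choice = 1.0 / (1.0 + math.exp(-theta * diff))
        neg_ll -= math.log(p_choice)
        # Update only the chosen option's value.
        values[choice] += alpha * (reward - values[choice])
    return neg_ll

# Hypothetical parameter values and a tiny made-up trial sequence.
params = {"self": (0.3, 5.0), "social": (0.2, 3.0), "no_one": (0.1, 2.0)}
conditions = ["self", "social", "no_one", "self"]
choices = [0, 1, 0, 0]
outcomes = [1, 0, 1, 1]
nll = neg_log_likelihood(params, conditions, choices, outcomes)
```

Passing the same `(alpha, theta)` pair for all three conditions reduces this sketch to a model with a single learning rate and a single inverse temperature.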

### (Simple) Rescorla-Wagner Model w/ two free parameters (1α1θ)

Expectations of future rewards followed the same criteria:

\[Q_{t+1}^{a} = Q_{t}^{a} + \alpha \cdot (r_{t} - Q_{t}^{a})\]

The probability that a subject chooses action \(a\) on trial \(t\), given the expected values of the available actions \(Q_{t}^{a}\), is given by the softmax function:

\[p_{t}(a | Q_{t}^{a}) = \frac{\exp(\theta \cdot Q_{t}^{a})}{\sum_{a} \exp(\theta \cdot Q_{t}^{a})}\]

However, this model collapsed the \(\alpha\) and \(\theta\) parameters across the self, other, and no one conditions. This resulted in two free parameters: \(\alpha\), \(\theta\).

### (Null) Rescorla-Wagner Model (1θ)

This model assumed random responding across trials, and thus did not assume learning. The value for \(\alpha\) was set to 0 (equivalent to no learning), and the value for \(\theta\) was varied between \(0\) and \(\infty\).
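
A short worked step may help show what setting \(\alpha = 0\) implies. The sketch below assumes two options whose values start out equal, \(Q_{0}^{a} = Q_{0}^{b} = Q_{0}\); the initial values are an assumption here, not something stated above. With \(\alpha = 0\), the learning rule leaves the values unchanged:

\[Q_{t+1}^{a} = Q_{t}^{a} + 0 \cdot (r_{t} - Q_{t}^{a}) = Q_{t}^{a} = Q_{0}\]

so the softmax gives

\[p_{t}(a | Q_{t}^{a}) = \frac{\exp(\theta \cdot Q_{0})}{\exp(\theta \cdot Q_{0}) + \exp(\theta \cdot Q_{0})} = \frac{1}{2}\]

Under that assumption, choices are at chance on every trial regardless of \(\theta\). If the initial values differ, \(\theta\) instead sets a fixed preference that never changes with experience.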

**What other models are plausible and could have been tested?**

Let’s think of three additional models.

(0α1θ): random responding governed by 1 \(\theta\)

(1α1θ): 1 \(\alpha\) for all three recipients, 1 \(\theta\) for all three recipients

(**α**): ?

(**α**θ): ?

(**α**θ): ?

(3α3θ): 1 \(\alpha_{Self}\), 1 \(\alpha_{Social}\), 1 \(\alpha_{NoOne}\), 1 \(\theta_{Self}\), 1 \(\theta_{Social}\), 1 \(\theta_{NoOne}\)

**Why might some of these other models perform better than the 3α3θ model?**
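
One ingredient worth keeping in mind when answering this question is that model comparison usually penalizes extra free parameters. The sketch below uses the Bayesian Information Criterion (BIC) as an example; choosing BIC is an assumption made for illustration and may differ from the comparison method used in the original study. All numbers are hypothetical.

```python
import math

def bic(neg_ll, n_params, n_trials):
    """Bayesian Information Criterion: lower values indicate a better trade-off."""
    return n_params * math.log(n_trials) + 2.0 * neg_ll

# Hypothetical fits to the same 144 trials: the six-parameter model fits
# slightly better (lower negative log-likelihood) but pays a larger penalty.
bic_3a3t = bic(neg_ll=60.0, n_params=6, n_trials=144)
bic_1a1t = bic(neg_ll=62.0, n_params=2, n_trials=144)
simpler_model_preferred = bic_1a1t < bic_3a3t
```

In this made-up case the small gain in fit does not justify four additional parameters, so the simpler model has the lower BIC.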

## Time for an exercise!

Now, you will complete a new prosocial learning task (<10 minutes). This task is similar to the task above. However, it varies along a few different dimensions. Your instructor will provide every student with a custom link to try it out. It will consist of one practice round and then four blocks of 24 trials each.

During the task, you will play to maximize points for **yourself** and **another student** (randomly assigned) in the class. The points you earn will be converted to **extra credit**, so be sure to try your best to maximize your earnings.

Once everyone completes the task, you will be split up into breakout groups to discuss the following. Then, each group will report their conclusions to the entire class.

## Reverse engineering the task

Please save your answers to the following questions. They will be important for your final Jupyter Notebook exercise.

What was different about this task? How did this difference affect learning and decision-making?

What were the strategies you used when making decisions during this task? Were these strategies different across members of your group? How so? If not, brainstorm some other strategies that people could use. Do any common themes emerge?

Choose one (or more) of these common themes. How can we manipulate the following model to accommodate this behavior? Write down your new model. Does it make sense? Recall previous papers, tutorials, and lectures to come up with your **new model(s)**.

**Learning rule:**

\[Q_{t+1}^{a} = Q_{t}^{a} + \alpha \cdot (r_{t} - Q_{t}^{a})\]

**Decision rule:**

\[p_{t}(a | Q_{t}^{a}) = \frac{\exp(\theta \cdot Q_{t}^{a})}{\sum_{a} \exp(\theta \cdot Q_{t}^{a})}\]

**Hint:** Think about what influenced your behavior. Did you care about anything other than the prediction error or the value associated with each option? Did the valence of the outcomes matter (remember Lengersdorff, et al. (2020))? Can additional parameters in your model help explain behavior?
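
As one illustration of how an extra parameter can enter the learning rule, the sketch below lets the learning rate depend on the sign of the prediction error. This is only one possible modification, offered as an assumption for discussion; it is not claimed to be the model used in any of the cited papers, and the names `alpha_pos` and `alpha_neg` are hypothetical.

```python
def rw_update_valenced(q, reward, alpha_pos, alpha_neg):
    """Rescorla-Wagner update with separate learning rates by outcome valence."""
    prediction_error = reward - q
    # Use alpha_pos when the outcome was better than expected, alpha_neg otherwise.
    alpha = alpha_pos if prediction_error >= 0 else alpha_neg
    return q + alpha * prediction_error

# A learner that updates strongly after good news and weakly after bad news.
q_after_win = rw_update_valenced(0.5, reward=1, alpha_pos=0.4, alpha_neg=0.1)
q_after_loss = rw_update_valenced(q_after_win, reward=0, alpha_pos=0.4, alpha_neg=0.1)
```

Setting `alpha_pos` equal to `alpha_neg` recovers the standard learning rule, which makes the two versions easy to compare.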

## Implementation of your model

Now that you’ve discussed (1) the task, (2) possible explanations for behavior during the task, and (3) a mathematical explanation for your model(s), let’s now write a function in Python for your model.

**DELETE THIS TEXT AND DESCRIBE YOUR MODEL IN THIS MARKDOWN CELL**

    def New_Model(params, choices, outcomes):
        raise NotImplementedError("Student exercise: write a function to implement your new learning model based on discussions in your group. This function should assume three inputs: parameters to be estimated, choices across trials during the task, and outcomes related to those choices. It should return the negative loglikelihood determined by the choice probabilities across trials. You may use code from previous tutorials. Once you have completed this function, you may delete this line. Please email your instructor your code.")
        return negLL
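
To check that a likelihood function with this `(params, choices, outcomes)` signature behaves sensibly, it can help to fit it to a small data set. The sketch below does this for the unmodified two-parameter model using a coarse grid search. Everything here is an assumption for illustration: the helper names `baseline_neg_ll` and `grid_search`, the initial values of 0.5, the two-option setup, and the made-up choices and outcomes.

```python
import math

def baseline_neg_ll(params, choices, outcomes, q_init=0.5):
    """Negative log-likelihood of the standard model with one alpha and one theta."""
    alpha, theta = params
    q = [q_init, q_init]  # assumes two options with equal starting values
    neg_ll = 0.0
    for choice, reward in zip(choices, outcomes):
        diff = q[choice] - q[1 - choice]
        # -log p(choice) for a two-option softmax is log(1 + exp(-theta * diff)).
        neg_ll += math.log1p(math.exp(-theta * diff))
        q[choice] += alpha * (reward - q[choice])
    return neg_ll

def grid_search(neg_ll_fn, choices, outcomes, alphas, thetas):
    """Return (best negative log-likelihood, alpha, theta) over a parameter grid."""
    best = None
    for alpha in alphas:
        for theta in thetas:
            value = neg_ll_fn((alpha, theta), choices, outcomes)
            if best is None or value < best[0]:
                best = (value, alpha, theta)
    return best

# Made-up data: option 0 is chosen often and is usually rewarded.
choices = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]
outcomes = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1]
alphas = [i / 10 for i in range(11)]
thetas = [0.0, 1.0, 2.0, 5.0, 10.0]
best_nll, best_alpha, best_theta = grid_search(
    baseline_neg_ll, choices, outcomes, alphas, thetas
)
chance_nll = baseline_neg_ll((0.0, 0.0), choices, outcomes)
```

A grid search is slow but transparent; an optimizer such as `scipy.optimize.minimize` could be swapped in once the function returns sensible values. Comparing `best_nll` against `chance_nll` gives a quick sanity check that the fitted model explains the choices at least as well as random responding.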
