{ "cells": [ { "cell_type": "markdown", "id": "automated-white", "metadata": {}, "source": [ "![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)\n", "\n", "# Models of Learning\n", "\n", "This tutorial was inspired by and adapted from [Wilson & Collins (2019)](https://elifesciences.org/articles/49547). Ten simple rules for the computational modeling of behavioral data. eLife.\n", "\n", "\n", "## Table of Contents\n", "1. [Model 1: Random Responding](#random-responding)\n", "2. [Model 2: Noisy Win-Stay Lose-Shift](#noisy-win-stay-lose-shift)\n", "3. [Model 3: Rescorla-Wagner](#rescorla-wagner)\n", "4. [Gain a better understanding of learning parameters](#model-parameters)\n", "5. [Fitting an RL model with MLE](#model-fitting)" ] }, { "cell_type": "code", "execution_count": null, "id": "requested-standing", "metadata": {}, "outputs": [], "source": [ "from scipy.optimize import minimize # finding optimal params in models\n", "from scipy import stats # statistical tools\n", "import numpy as np # matrix/array functions\n", "import pandas as pd # loading and manipulating data\n", "import ipywidgets as widgets # interactive display\n", "import matplotlib.pyplot as plt # plotting\n", "%matplotlib inline\n", "\n", "np.random.seed(2021) # set seed for reproducibility" ] }, { "cell_type": "markdown", "id": "earned-affairs", "metadata": {}, "source": [ "## Simulating Learning\n", "\n", "Recall the example from our reading [Wilson & Collins (2019)](https://elifesciences.org/articles/49547): participants make a series of $T$ choices between $K$ slot machines (multi-armed bandits) with the goal of maximizing their earnings. If played on trial $t$, each slot machine, $k$, pays out a reward, $r_t$, which is `1` with reward probability, $\\mu^k_t$, and otherwise `0`. The reward probabilities are different for each slot machine and are initially unknown to the participant. In the simplest version of the task, the reward probabilities are fixed over time.\n", "\n", "We have three experimental parameters for this task: \n", "1. the number of trials, $T$\n", "2. the number of slot machines, $K$\n", "3. the reward probabilities of the different options, $\\mu^k_t$, which may or may not change over time\n", "\n", "Let's try out the different models proposed in the paper and simulate some behavior based on these models. We will set `T = 100` (100 trials), `K = 2` (two slot machines), and `mu = [.2, .8]` (machine 1 pays out with 20% probability, machine 2 pays out with 80% probability)." ] }, { "cell_type": "code", "execution_count": null, "id": "chinese-surveillance", "metadata": {}, "outputs": [], "source": [ "T = 100\n", "K = 2\n", "mu = [0.2, 0.8]" ] }, { "cell_type": "markdown", "id": "noted-integer", "metadata": {}, "source": [ "## Random Responding\n", "\n", "In the first model, participants are assumed to not engage with the task at all. They simply select machines at random with a possible bias for one machine over the other. This bias is captured with a parameter $b$ (which is between 0 and 1) such that the probability of choosing the two options is\n", "\n", "$$\n", "p^1_t = b \\quad \\mbox{and}\\quad p^2_t = 1-b\n", "$$\n", "\n", "Thus, for two bandits, the random responding model has just one free parameter, controlling the overall bias for option 1 over option 2, $b$.\n", "\n", "***Note***: This kind of random behavior is not uncommon in behavioral experiments, especially when participants have no incentives for performing well. 
Modeling such behavior can be important for identifying these \"checked-out\" participants (either to exclude them or to study the \"checked-out\" behavior itself)." ] },
{ "cell_type": "code", "execution_count": null, "id": "reported-board", "metadata": {}, "outputs": [], "source": [
"# Specify the random responding model as a function\n",
"def simulate_RandomModel(b, T, mu):\n",
"    c = np.zeros((T), dtype = int)\n",
"    r = np.zeros((T), dtype = int)\n",
"\n",
"    for t in range(T):\n",
"\n",
"        # compute choice probabilities\n",
"        p = [b, 1-b]\n",
"\n",
"        # make choice according to choice probabilities\n",
"        c[t] = np.random.choice(range(len(mu)), p=p)\n",
"\n",
"        # generate reward based on choice\n",
"        r[t] = np.random.rand() < mu[c[t]]\n",
"\n",
"    return c, r" ] },
{ "cell_type": "code", "execution_count": null, "id": "modern-participation", "metadata": {}, "outputs": [], "source": [
"# simulate the random responding model\n",
"c1, r1 = simulate_RandomModel(b=.5, T=T, mu=mu)" ] },
{ "cell_type": "code", "execution_count": null, "id": "stretch-tracker", "metadata": {}, "outputs": [], "source": [
"# plot the simulation\n",
"plt.plot(range(T), r1, 'r--', alpha=.6, label='reward')\n",
"plt.plot(range(T), c1, '+', label='choice')\n",
"plt.xlabel('trials')\n",
"plt.ylabel('outcome (1=reward, 0=no reward)')\n",
"plt.title('Random Responding Behavior')\n",
"plt.legend()\n",
"plt.show()" ] },
{ "cell_type": "markdown", "id": "charged-sandwich", "metadata": {}, "source": [ "As we can see, the behavior here is fairly random." ] },
{ "cell_type": "markdown", "id": "fatal-postcard", "metadata": {}, "source": [
"## Noisy Win-Stay-Lose-Shift\n",
"\n",
"The win-stay-lose-shift model assumes that participants adapt their behavior based on the feedback from the previous trial: the model repeats rewarded actions and switches away from unrewarded actions. In the noisy version of the model, the rule is applied probabilistically, such that the model follows win-stay-lose-shift with probability $1-\\epsilon$ and chooses randomly with probability $\\epsilon$.\n",
"\n",
"In the two-bandit case, the probability of choosing option $k$ is\n",
"\n",
"$$\n",
"p^k_t = \\left\\{\n",
"    \\begin{array}{cc}\n",
"    1 - \\epsilon / 2 \n",
"    & \\mbox{if } (c_{t-1} = k \\mbox{ and } r_{t-1} = 1) \\mbox{ OR } (c_{t-1} \\ne k \\mbox{ and } r_{t-1} = 0)\\\\\n",
"    \\epsilon / 2 \n",
"    & \\mbox{if } (c_{t-1} \\ne k \\mbox{ and } r_{t-1} = 1) \\mbox{ OR } (c_{t-1} = k \\mbox{ and } r_{t-1} = 0)\n",
"    \\end{array}\n",
"    \\right.\n",
"$$\n",
"\n",
"where $c_t=1,2$ is the choice at trial $t$, and $r_t=0,1$ is the reward at trial $t$.\n",
"This model also only has one free parameter, the overall level of randomness, $\\epsilon$.\n" ] },
{ "cell_type": "code", "execution_count": null, "id": "optical-antique", "metadata": {}, "outputs": [], "source": [
"def simulate_WSLS(epsilon, T, mu):\n",
"\n",
"    c = np.zeros((T), dtype = int)\n",
"    r = np.zeros((T), dtype = int)\n",
"\n",
"    # last reward/action (initialize as nan)\n",
"    last_reward = np.nan\n",
"    last_choice = np.nan\n",
"\n",
"    for t in range(T):\n",
"\n",
"        # compute choice probabilities\n",
"        if np.isnan(last_reward):\n",
"\n",
"            # first trial: choose randomly\n",
"            p = [0.5, 0.5]\n",
"\n",
"        else:\n",
"\n",
"            # choice depends on last reward\n",
"            if last_reward == 1:\n",
"\n",
"                # win-stay (with probability 1-epsilon)\n",
"                p = [epsilon / 2, epsilon / 2]\n",
"                p[last_choice] = 1 - epsilon / 2\n",
"            else:\n",
"\n",
"                # lose-shift (with probability 1-epsilon)\n",
"                p = [1 - epsilon / 2, 1 - epsilon / 2]\n",
"                p[last_choice] = epsilon / 2\n",
"\n",
"        # make choice according to choice probabilities\n",
"        c[t] = np.random.choice(range(len(mu)), p=p)\n",
"\n",
"        # generate reward based on choice\n",
"        r[t] = np.random.rand() < mu[c[t]]\n",
"\n",
"        last_choice = c[t]\n",
"        last_reward = r[t]\n",
"\n",
"    return c, r" ] },
{ "cell_type": "code", "execution_count": null, "id": "confirmed-atlas", "metadata": {}, "outputs": [], "source": [
"# simulate the noisy win-stay-lose-shift model\n",
"c2, r2 = simulate_WSLS(epsilon=.1, T=T, mu=mu)" ] },
{ "cell_type": "code", "execution_count": null, "id": "marine-atmosphere", "metadata": {}, "outputs": [], "source": [
"# plot the simulation\n",
"plt.plot(range(T), r2, 'r--', alpha=.6, label='reward')\n",
"plt.plot(range(T), c2, '+', label='choice')\n",
"plt.xlabel('trials')\n",
"plt.ylabel('outcome (1=reward, 0=no reward)')\n",
"plt.title('Noisy Win-Stay-Lose-Shift Behavior')\n",
"plt.legend()\n",
"plt.show()" ] },
{ "cell_type": "markdown", "id": "lightweight-guinea", "metadata": {}, "source": [ "Here, we can see that behavior switches (with some noise) whenever there was no reward on the previous trial." ] },
{ "cell_type": "markdown", "id": "demanding-fields", "metadata": {}, "source": [
"## Taking the Average?\n",
"\n",
"As discussed previously in class, another very simple thing we could do is to compute the average reward (then choose probabilistically according to the updated value of the machines). This model can be expressed as:\n",
"\n",
"$$\n",
"V_t = \\frac{1}{t} \\sum_{i=1}^{t} r_i\n",
"$$\n",
"\n",
"However, we crucially learned that there's a big conceptual problem with the model as it's currently written. Even if humans or animals are computing the mean of the rewards, it is unlikely that they are doing it using this equation. Even with `T=50` trials, how easy do you think it would be to compute a sum like that directly? How about when `T=100`?\n",
"\n",
"The problem is that we would have to keep track of all the rewards we've seen so far to compute it. For example, when `T=100` we have to keep track of 100 rewards. If `T=10000`, then we need to keep track of 10,000 rewards! This doesn't seem plausible. No one can remember a random string of ten thousand 0s and 1s (the world record for [remembering binary digits in 30 minutes is 7,485](https://www.guinnessworldrecords.com/world-records/360573-most-binary-digits-memorised-in-30-minutes#:~:text=The%20most%20binary%20digits%20memorized,4%20to%208%20December%202019.)).\n",
"\n",
"Maybe we can \"take the average\" in a different kind of way that doesn't involve keeping track of so many outcomes!"
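] },
{ "cell_type": "markdown", "id": "running-average-note", "metadata": {}, "source": [ "One way to do this (a quick sketch, not part of the original tutorial code; the variables `rewards`, `V`, and `n` below are just for illustration) is to keep only a running estimate and nudge it toward each new reward as it arrives. The incremental update gives exactly the same answer as summing everything at the end, but it only ever stores the current estimate and a trial counter. Notice that the update has the form 'old estimate plus step size times (reward minus old estimate)'; the model in the next section uses exactly this form with a fixed step size." ] },
{ "cell_type": "code", "execution_count": null, "id": "running-average-code", "metadata": {}, "outputs": [], "source": [
"# a minimal sketch of an incremental (running) average; illustrative only\n",
"rewards = np.random.rand(T) < mu[1]      # simulated rewards from the 80% machine\n",
"\n",
"V = 0.0                                  # current estimate of the average reward\n",
"for n, r_i in enumerate(rewards, start=1):\n",
"    V = V + (1 / n) * (r_i - V)          # nudge the estimate toward the new reward\n",
"\n",
"print(f'incremental average: {V:.3f}')\n",
"print(f'direct average: {np.mean(rewards):.3f}')   # same value, but requires storing all rewards"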
] }, { "cell_type": "markdown", "id": "bearing-biology", "metadata": {}, "source": [
"## Rescorla-Wagner\n",
"\n",
"Enter **Rescorla-Wagner**. As discussed in class, Rescorla and Wagner (in one of the great papers of the 20th century) proposed that learning occurs when there is a prediction error. In their model, participants first learn the expected value of each slot machine based on the history of previous outcomes and then use these values to make a decision about what to do next. A simple model of learning is the Rescorla-Wagner learning rule (Rescorla & Wagner, 1972), whereby the value of option $k$, $Q^k_t$, is updated in response to reward $r_t$ according to:\n",
"\n",
"$$\n",
"Q^k_{t+1} = Q^k_t + \\alpha (r_t - Q^k_t)\n",
"$$\n",
"\n",
"where $\\alpha$ is the learning rate, which is bounded between 0 and 1 and captures the extent to which the prediction error, $(r_t - Q^k_t)$, updates the value (i.e., a higher $\\alpha$ puts greater weight on the prediction error). For simplicity, we initialize the values at $Q^k_0=0.5$ (as in the simulation code below), although it is also possible to treat $Q^k_0$ as a free parameter of the model (it acts as an intercept).\n",
"\n",
"A simple model of decision making is to assume that participants use the options' values to guide their decisions, choosing the most valuable option most frequently, but occasionally making 'mistakes' (or exploring) by choosing a low-value option. One choice rule with these properties is known as the **'softmax' choice rule**, which chooses option $k$ with probability:\n",
"\n",
"$$\n",
"p^k_t = \\frac{\\exp(\\theta Q^k_t)}{\\sum_{i=1}^K \\exp(\\theta Q^i_t)}\n",
"$$\n",
"\n",
"where $\\theta$ is the 'inverse temperature' parameter that controls the level of stochasticity in the choice, ranging from $\\theta = 0$, for completely random responding, to $\\theta = \\infty$, for deterministically choosing the highest-value option. In other words, someone with a large $\\theta$ would almost always choose the option with the highest value ($Q^k_t$), while someone with a small $\\theta$ would explore other options more frequently.\n",
"\n",
"Combining the learning and decision rules gives a simple model of decision making in this task with two free parameters: the learning rate, $\\alpha$, and the inverse temperature, $\\theta$."
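] },
{ "cell_type": "markdown", "id": "worked-update-note", "metadata": {}, "source": [ "Before simulating the full task, here is a single update step worked through with illustrative numbers (chosen for the example, not estimated from data): starting from $Q = [0.5, 0.5]$, choosing machine 2 (index 1 in the code) and receiving a reward of 1 with $\\alpha = 0.1$ gives a prediction error of $0.5$ and an updated value of $0.5 + 0.1 \\times 0.5 = 0.55$; the softmax with $\\theta = 1.5$ then slightly favors machine 2 on the next trial." ] },
{ "cell_type": "code", "execution_count": null, "id": "worked-update-code", "metadata": {}, "outputs": [], "source": [
"# one hand-worked Rescorla-Wagner update followed by a softmax step (illustrative numbers)\n",
"alpha_ex, theta_ex = 0.1, 1.5              # learning rate and inverse temperature\n",
"Q_ex = np.array([0.5, 0.5])                # initial values for the two machines\n",
"\n",
"choice, reward = 1, 1                      # suppose machine 2 (index 1) was chosen and paid out\n",
"delta = reward - Q_ex[choice]              # prediction error: 1 - 0.5 = 0.5\n",
"Q_ex[choice] = Q_ex[choice] + alpha_ex * delta   # 0.5 + 0.1 * 0.5 = 0.55\n",
"\n",
"# softmax choice probabilities for the next trial\n",
"p_ex = np.exp(theta_ex * Q_ex) / np.sum(np.exp(theta_ex * Q_ex))\n",
"print(f'updated values: {Q_ex}')                 # [0.5  0.55]\n",
"print(f'choice probabilities: {p_ex.round(3)}')  # slightly favors machine 2"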
] }, { "cell_type": "code", "execution_count": null, "id": "plastic-producer", "metadata": {}, "outputs": [], "source": [
"def simulate_RescorlaWagner(params, T, mu, noisy_choice=True):\n",
"\n",
"    alpha, theta = params\n",
"\n",
"    c = np.zeros((T), dtype = int)\n",
"    r = np.zeros((T), dtype = int)\n",
"\n",
"    Q_stored = np.zeros((2, T), dtype = float)\n",
"    Q = [0.5, 0.5]\n",
"\n",
"    for t in range(T):\n",
"\n",
"        # store the current values before updating\n",
"        Q_stored[:,t] = Q\n",
"\n",
"        # compute choice probabilities (softmax)\n",
"        p0 = np.exp(theta*Q[0]) / (np.exp(theta*Q[0]) + np.exp(theta*Q[1]))\n",
"        p1 = 1 - p0\n",
"\n",
"        # make choice according to choice probabilities:\n",
"        # a weighted coin flip that chooses machine 0 if a random\n",
"        # number falls in the [0, p0] interval, and machine 1 otherwise\n",
"        if noisy_choice:\n",
"            if np.random.random_sample() < p0:\n",
"                c[t] = 0\n",
"            else:\n",
"                c[t] = 1\n",
"        else:  # make choice without noise\n",
"            c[t] = np.argmax([p0, p1])\n",
"\n",
"        # generate reward based on reward probability\n",
"        r[t] = np.random.rand() < mu[c[t]]\n",
"\n",
"        # update values\n",
"        delta = r[t] - Q[c[t]]\n",
"        Q[c[t]] = Q[c[t]] + alpha * delta\n",
"\n",
"    return c, r, Q_stored" ] },
{ "cell_type": "code", "execution_count": null, "id": "painful-virtue", "metadata": {}, "outputs": [], "source": [
"# simulate Rescorla-Wagner learning with alpha=0.1 and theta=1.5\n",
"c3, r3, Q = simulate_RescorlaWagner([.1, 1.5], T=T, mu=mu)" ] },
{ "cell_type": "code", "execution_count": null, "id": "velvet-partnership", "metadata": {}, "outputs": [], "source": [
"# plot the simulation\n",
"plt.plot(range(T), r3, 'r--', alpha=.6, label='reward')\n",
"plt.plot(range(T), c3, '+', label='choice')\n",
"plt.xlabel('trials')\n",
"plt.ylabel('outcome (1=reward, 0=no reward)')\n",
"plt.title('Rescorla-Wagner Learning')\n",
"plt.legend()\n",
"plt.show()" ] },
{ "cell_type": "markdown", "id": "settled-worry", "metadata": {}, "source": [ "As we can observe, choices of the machine that yields fewer rewards become less frequent over trials. \n", "\n", "We can also plot the values of the machines over trials. Let's see what that looks like." ] },
{ "cell_type": "code", "execution_count": null, "id": "confirmed-cleveland", "metadata": {}, "outputs": [], "source": [
"# plot the learned values\n",
"plt.plot(range(T), Q[1,:], 'r--', alpha=.6, label='80% machine')\n",
"plt.plot(range(T), Q[0,:], 'm-', alpha=.6, label='20% machine')\n",
"plt.plot(range(T), c3, 'b+', label='choice')\n",
"plt.xlabel('trials')\n",
"plt.ylabel('value')\n",
"plt.title('Rescorla-Wagner Learning')\n",
"plt.legend()\n",
"plt.show()" ] },
{ "cell_type": "markdown", "id": "stuffed-invalid", "metadata": {}, "source": [ "It's pretty cool that the values of the machines slowly converge over trials towards their reward probabilities (20% and 80%)!" ] },
{ "cell_type": "markdown", "id": "crucial-definition", "metadata": {}, "source": [
"## Gain a better understanding of learning parameters\n",
"\n",
"Use the sliders in the interactive plot below to explore how the number of trials, the reward probability, the learning rate ($\\alpha$), and the inverse temperature ($\\theta$) change the simulated Rescorla-Wagner behavior." ] },
{ "cell_type": "code", "execution_count": null, "id": "digital-radar", "metadata": { "tags": [ "hide-input", "hide-output" ] }, "outputs": [], "source": [
"def plot_values(alpha_hat, theta_hat, T, mu):\n",
"    c, r, Q = simulate_RescorlaWagner([alpha_hat, theta_hat], T, mu)\n",
"\n",
"    plt.plot(range(T), Q[1,:], 'r--', alpha=.6, label=f'{mu[1]*100:.0f}% machine')\n",
"    plt.plot(range(T), Q[0,:], 'm-', alpha=.6, label=f'{mu[0]*100:.0f}% machine')\n",
"    plt.plot(range(T), c, 'b+', label='choice')\n",
"    plt.xlabel('trials')\n",
"    plt.ylabel('value')\n",
"    plt.title('Rescorla-Wagner Learning')\n",
"    plt.legend()\n",
"    plt.show()\n",
"\n",
"# interactive display\n",
"%config InlineBackend.figure_format = 'retina'\n",
"\n",
"@widgets.interact(trials=widgets.IntSlider(200, min=20, max=200),\n",
"                  probability=widgets.FloatSlider(.8, min=0, max=1),\n",
"                  alpha_hat=widgets.FloatSlider(.1, min=0, max=1),\n",
"                  theta_hat=widgets.FloatSlider(1.5, min=0, max=10))\n",
"def plot_interactive(trials, probability, alpha_hat, theta_hat):\n",
"    plot_values(alpha_hat, theta_hat, int(trials), [1-probability, probability])" ] },
{ "cell_type": "markdown", "id": "frozen-civilian", "metadata": {}, "source": [ "