Multi-Armed Bandit Task¶

HED Task ID: hedtsk_multi_armed_bandit

Also known as: MAB, Bandit Task, K-Armed Bandit

Repeated choice among options with unknown or changing reward distributions; choice sequences dissociate exploration from exploitation.

Description¶

Participants choose among multiple options (the “arms” of a slot machine), each yielding rewards with unknown and often changing probabilities. The central challenge is the explore-exploit dilemma: whether to exploit the currently best-known option or explore alternatives that might yield higher returns. Reward probabilities may be stationary or volatile (drifting over time), with the volatile version (restless bandit) requiring continuous updating of value estimates. This paradigm is the dominant tool in computational psychiatry for studying reinforcement learning, and its computational tractability — fitting with Bayesian, Kalman-filter, or upper-confidence-bound models — has made it central to understanding decision-making deficits in addiction, schizophrenia, depression, and anxiety.

Inclusion test¶

Procedure	Participants choose repeatedly among multiple options (arms) that deliver stochastic rewards drawn from different distributions. They must balance exploring unknown options with exploiting known good ones.
Manipulation	Number of arms; reward distributions (stationary vs. drifting); horizon length; information asymmetry.
Measurement	Total reward earned; exploration-exploitation ratio; fit to reinforcement learning models (learning rate, inverse temperature); regret.

Variations¶

Variation	Description	Justification
Two-Armed Stationary Bandit	Simplest version; two options with fixed reward probabilities.	Canonical two-option stationary reward structure; stable payoff distributions
Restless (Volatile) Bandit	Reward probabilities drift over time via Gaussian random walk; requires continuous updating.	Reward probabilities drift over time; tests tracking of non-stationary environments
Contextual Bandit	Reward probabilities depend on observable context features; tests generalization across contexts.	Context features predict optimal choice; adds feature-based learning
Horizon Task (Wilson et al.)	Short vs. long decision horizons to separately measure directed and random exploration.	Fixed horizon with exploration vs. exploitation trade-off manipulation
Four-Armed Bandit with Reversal	Multiple arms with occasional reward-probability reversals; combines bandit with reversal-learning demands.	Four arms with explicit reversal phase; tests reversal learning in bandit
Informative vs. Non-Informative Exploration	Designs that separate information-seeking exploration from random exploration (e.g., observed vs. chosen options).	Exploration choices yield differential information; changes exploration value
Social Bandit	Observing another agent’s choices and outcomes before making own decisions; adds social learning dimension.	Observe another’s choices; social learning component
Bandit with Effort Cost	Incorporating physical or cognitive effort cost into exploration decisions.	Effort cost added to choices; combines effort discounting with learning
Bandit with Partial Observability	Only the chosen arm’s outcome is observed (standard) vs. all arms’ outcomes are observed; separates learning from exploration.	Outcomes sometimes hidden; different information structure

Cognitive processes¶

This task engages the following cognitive processes:

Key references¶

{‘authors’: “Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J.”, ‘year’: 2006, ‘title’: ‘Cortical substrates for exploratory decisions in humans’, ‘venue’: ‘Nature’, ‘venue_type’: ‘journal’, ‘journal’: ‘Nature’, ‘volume’: ‘441’, ‘issue’: ‘7095’, ‘pages’: ‘876-879’, ‘doi’: ‘10.1038/nature04766’, ‘openalex_id’: None, ‘pmid’: None, ‘citation_string’: “Daw, N. D., O’Doherty, J. P., Dayan, P., Seymour, B., & Dolan, R. J. (2006). Cortical substrates for exploratory decisions in humans. Nature, 441, 876–879.”, ‘url’: ‘https://doi.org/10.1038/nature04766’, ‘source’: ‘crossref’, ‘confidence’: ‘high’, ‘verified_on’: ‘2026-04-20’}

Recent references¶

{‘authors’: ‘Gershman, S. J.’, ‘year’: 2018, ‘title’: ‘Deconstructing the human algorithms for exploration’, ‘venue’: ‘Cognition’, ‘venue_type’: ‘journal’, ‘journal’: ‘Cognition’, ‘volume’: ‘173’, ‘issue’: None, ‘pages’: ‘34-42’, ‘doi’: ‘10.1016/j.cognition.2017.12.014’, ‘openalex_id’: None, ‘pmid’: None, ‘citation_string’: ‘Gershman, S. J. (2018). Deconstructing the human algorithms for exploration. Cognition, 173, 34–42.’, ‘url’: ‘https://doi.org/10.1016/j.cognition.2017.12.014’, ‘source’: ‘crossref’, ‘confidence’: ‘high’, ‘verified_on’: ‘2026-04-20’}
{‘authors’: ‘Schulz, E., & Gershman, S. J.’, ‘year’: 2019, ‘title’: ‘The algorithmic architecture of exploration in the human brain’, ‘venue’: ‘Current Opinion in Neurobiology’, ‘venue_type’: ‘journal’, ‘journal’: ‘Current Opinion in Neurobiology’, ‘volume’: ‘55’, ‘issue’: None, ‘pages’: ‘7-14’, ‘doi’: ‘10.1016/j.conb.2018.11.003’, ‘openalex_id’: None, ‘pmid’: None, ‘citation_string’: ‘Schulz, E., & Gershman, S. J. (2019). The algorithmic architecture of exploration in the human brain. Current Opinion in Neurobiology, 55, 7–14.’, ‘url’: ‘https://doi.org/10.1016/j.conb.2018.11.003’, ‘source’: ‘crossref’, ‘confidence’: ‘high’, ‘verified_on’: ‘2026-04-20’}
{‘authors’: ‘Cogliati Dezza, I., Yu, A. J., Cleeremans, A., & Alexander, W.’, ‘year’: 2017, ‘title’: ‘Learning the value of information and reward over time when solving exploration-exploitation problems’, ‘venue’: ‘Scientific Reports’, ‘venue_type’: ‘journal’, ‘journal’: ‘Scientific Reports’, ‘volume’: ‘7’, ‘issue’: ‘1’, ‘pages’: None, ‘doi’: ‘10.1038/s41598-017-17237-w’, ‘openalex_id’: None, ‘pmid’: None, ‘citation_string’: ‘Cogliati Dezza, I., Yu, A. J., Cleeremans, A., & Alexander, W. (2017). Learning the value of information and reward over time when solving exploration–exploitation problems. Scientific Reports, 7, 16919.’, ‘url’: ‘https://doi.org/10.1038/s41598-017-17237-w’, ‘source’: ‘crossref’, ‘confidence’: ‘high’, ‘verified_on’: ‘2026-04-20’}
{‘authors’: ‘Chakroun, K., Mathar, D., Wiehler, A., Ganzer, F., & Peters, J.’, ‘year’: 2020, ‘title’: ‘Dopaminergic modulation of the exploration/exploitation trade-off in human decision-making’, ‘venue’: ‘eLife’, ‘venue_type’: ‘journal’, ‘journal’: ‘eLife’, ‘volume’: ‘9’, ‘issue’: None, ‘pages’: None, ‘doi’: ‘10.7554/elife.51260’, ‘openalex_id’: None, ‘pmid’: None, ‘citation_string’: ‘Chakroun, K., Mathar, D., Wiehler, A., Ganzer, F., & Peters, J. (2020). Dopaminergic modulation of the exploration/exploitation trade-off in human decision-making. eLife, 9, e51260.’, ‘url’: ‘https://doi.org/10.7554/elife.51260’, ‘source’: ‘crossref’, ‘confidence’: ‘high’, ‘verified_on’: ‘2026-04-20’}