Introduction

People make many decisions that have important ramifications for their lives and the lives of others. Previous research has found that people’s choices in important real-life domains, such as personal finance, health, and education, are sometimes bad for them in the long run (O’Donoghue & Rabin, 2015). Making good choices in those domains often requires long-term planning. Long-term planning is a complex problem because the number of possible action sequences grows exponentially with the number of steps and people’s cognitive resources are limited. Therefore, one of the reasons why people make decisions that are bad for them in the long run might be that they lack adequate cognitive strategies for long-term planning.

Previous research has developed different approaches to improving human decision-making. One approach is to outsource difficult decisions to computers or provide computer-powered decision support for a specific problem (Aronson et al., 2005). Another solution is to nudge people towards choices that society deems to be good, for instance, by exploiting people’s bias towards sticking with the default option (Johnson & Goldstein, 2003). Although these approaches are prominent, they leave people unsupported in the vast majority of situations for which there are no decision support systems or nudges. Here, we therefore pursue a third alternative: enabling people to make more far-sighted decisions for themselves by teaching them effective planning strategies that they can use in novel real-world situations where decision support systems and nudges are unavailable.

Our approach is an instance of improving people’s decision competencies, which is known as boosting (Hertwig & Grüne-Yanoff, 2017). In contrast to the first two approaches, boosting protects and enhances people’s freedom to make their own decisions based on their own goals, views, and preferences. This is especially important for personal choices that people do not want to give up. Recent work has shown that teaching people clever decision strategies is a promising way to improve their decision-making in the real world (Hafenbrädl et al., 2016). This suggests that it might be possible to enable people to make their own far-sighted decisions in a wide range of complex real-life situations by teaching them one or a few effective strategies for long-term planning.

Early work on boosting taught people the normative principles of probability theory, logic, and expected utility theory (Larrick, 2004). Although these educational interventions succeeded in improving people’s performance on simple textbook problems, they failed to improve people’s performance in the real world (Larrick, 2004). The likely reason is that the amount of mental computation that would be required to apply those normative principles to complex real-world problems far exceeds people’s cognitive capacities (Lieder & Griffiths, 2020b). For instance, applying the principle of expected utility theory to the game of chess would require people to evaluate more sequences of moves than there are atoms in the universe, and real life is much more complex than a game of chess. Thus, if a person seriously applied the normative principles of standard decision theory to planning their life, then they would spend all of their life planning without ever arriving at a decision about what to do first. This illustrates that normative principles that fail to consider people’s limited time and bounded cognitive resources are ill-suited as practical prescriptions for which decision procedure a person should use to do well in the real world. Instead, it is necessary for people to rely on heuristics that allow them to arrive at reasonable decisions in a reasonable amount of time.

For a simple heuristic to lead to good decisions it has to be well-adapted to the structure of a particular decision environment in which it is used (Simon, 1956; Gigerenzer & Selten, 2002; Todd & Gigerenzer, 2012). This means that to help people make better decisions in a particular class of situations we first have to discover adaptive decision strategies that efficiently exploit the structure of those kinds of situations. The main bottleneck of boosting is that we still do not know highly effective cognitive strategies for the vast majority of the important real-life decisions that people have to make for themselves. We know some reasonably effective fast-and-frugal heuristics for a few simple kinds of decisions (Hafenbrädl et al., 2016). But we still do not know any excellent heuristics for solving complex sequential decision problems. That is, we do not know which heuristics people should use to solve complex sequential decision problems as well as possible given their finite time and limited cognitive resources. Discovering and articulating clever decision strategies that work well in the real world is very challenging for people. That might be why boosting has yet to be applied to the domain of long-term planning.

To make boosting people’s long-term planning skills possible, we leverage machine learning to discover optimal heuristics that are superior to people’s intuitive strategies. We develop an intelligent tutor that teaches people to use those strategies and demonstrate that people’s performance improves as they adopt the automatically discovered heuristics. In doing so, we extend our previous work on improving human decision-making in small toy problems (Lieder et al., 2019, 2020) to substantially larger, more complex, and more naturalistic sequential decision problems. In addition to laying the scientific and technological foundations for improving human decision-making, this line of work also elucidates the algorithmic principles of optimal—rather than typical—human planning.

Because our goal is to improve the mental strategies that people use to make their own decisions, our strategy discovery method strives to find cognitive strategies that are optimal for the human mind rather than optimal plans or planning algorithms that only computers can execute. To achieve this, we apply the normative principle of resource-rationality (Lieder & Griffiths, 2020b) to mathematically define planning strategies that are optimally adapted to the problems people have to solve and the cognitive resources people can use to solve those problems (Callaway et al., 2018b, 2020). The principle of resource-rationality defines the sweet spot between using an astronomical amount of computation to guarantee arriving at the best possible decision (expected utility theory) and making bad decisions without giving them any thought. Unlike previous work that used the theory of resource-rationality to develop descriptive models of how people normally make decisions, this article applies it to derive prescriptions for how people should make decisions to perform better than they normally do. Deriving such prescriptions serves to improve human decision-making. Here, we pursue this goal by teaching people automatically discovered resource-rational heuristics using intelligent tutors.

Previous work has established that it is possible to compute resource-rational planning strategies (Callaway et al., 2018b) by modeling the problem of deciding how to plan as a metalevel Markov Decision Process (metalevel MDP; Hay et al., 2014). The metalevel MDP defines a resource-rational model of human planning by modeling not only the problem to be solved but also the cognitive operations people have available to solve that problem and how costly those operations are (Callaway et al., 2018b, 2020; Lieder et al., 2017; Griffiths, 2020). We refer to this approach as automatic strategy discovery. This approach frames planning strategies as policies for selecting planning operations. Its methods use algorithms from dynamic programming and reinforcement learning (Sutton and Barto, 2018) to compute the policy that maximizes the expected reward of executing the resulting plan minus the cost of the computations that the policy would perform to arrive at that plan (Callaway et al., 2018a; Lieder & Griffiths, 2020b; Griffiths et al., 2019).

Recent work used dynamic programming to discover optimal planning strategies for planning up to three steps ahead in small instances of the Mouselab-MDP paradigm (Callaway et al., 2017), a process-tracing paradigm that externalizes planning as information gathering. This work found that it is possible to improve human planning on those problems by teaching people the automatically discovered strategies (Lieder et al., 2019, 2020). Subsequent work applied reinforcement learning to approximate strategies for planning up to six steps ahead in a task where each step entailed choosing between two options and there were only two possible rewards (Callaway et al., 2018a). Additionally, multiple experiments have shown that people are able to transfer the learned strategies to different environments and reward structures, demonstrating that people generalize the learned strategies and benefit from the tutoring beyond the specific task they are taught in (Lieder et al., 2020). But none of the existing strategy discovery methods (Callaway et al., 2018a; Kemtur et al., 2020) is scalable enough to discover good planning strategies for more complex environments. This is because the run time of these methods grows exponentially with the size of the planning problem. This has confined automatic strategy discovery methods to very small and very simple planning tasks. Discovering planning strategies that achieve—let alone exceed—the computational efficiency of human planning is still out of reach for virtually all practically relevant sequential decision problems.

To overcome this computational bottleneck, we developed a scalable method for discovering planning strategies that achieve a (super-)human level of computational efficiency on some of the planning problems that are too large for existing strategy discovery methods. Our approach draws inspiration from the hierarchical structure of human behavior (Botvinick, 2008; Miller et al., 1960; Carver and Scheier, 2001; Tomov et al., 2020). Research in cognitive science and neuroscience suggests that the brain decomposes long-term planning into goal-setting and planning at multiple hierarchically nested timescales (Carver & Scheier, 2001; Botvinick, 2008). Furthermore, Solway et al. (2014) found that human learners spontaneously discover optimal action hierarchies. Inspired by these findings, we extend the near-optimal strategy discovery method proposed in Callaway et al. (2018a) by incorporating hierarchical structure into the space of possible planning strategies. Concretely, the planning task is decomposed into first selecting one of the possible final destinations as a goal solely based on its own value and then planning the path to this selected goal.

We find that imposing hierarchical structure makes automatic strategy discovery methods significantly less computationally expensive without compromising the resource-rationality score of the discovered strategies. Our hierarchical decomposition leads to a substantial reduction in the computational complexity of the strategy discovery problem, which makes it possible to scale up automatic strategy discovery to many planning problems that were prohibitively large for previous strategy discovery methods. This allowed our method to discover planning strategies that achieve a superhuman level of computational efficiency on non-trivial planning problems in the sense that the discovered strategies had a higher resource-rationality score than the strategies used by people. The resource-rationality score is the sum of the rewards along the chosen path minus the cost of the information-gathering operations that were performed to select that path. That is, we use the term “computational efficiency” to refer to how well a planning strategy trades off the quality of its decisions against the cost of the time and computational resources it consumes to reach its decisions.

We demonstrate that this advance enables us to improve human decision-making in larger and more complex planning tasks that were previously intractable to solve by conducting three training experiments. Our findings suggest that people internalize the taught strategies and then spontaneously use them in subsequent decisions without any guidance. In addition, we have previously shown that people transfer the strategies taught by our intelligent tutors to new tasks and new domains (Lieder et al., 2019, 2020). We are therefore optimistic that it might be possible to leverage machine learning to help people learn general cognitive strategies that enable them to independently make their own far-sighted decisions in novel situations where decision support is unavailable.

The plan for this article is as follows: We start by introducing the frameworks and methods that our approach builds on. We then present our new reinforcement learning method for discovering hierarchical planning strategies. Next, we demonstrate that our method is able to discover near-optimal planning strategies for larger environments than previous methods were able to handle and characterize these optimal strategies. We then test whether the resulting advances are sufficient to improve human decision-making in complex planning problems and close by discussing the implications of our findings for the development of more intelligent agents, understanding human planning (Lieder and Griffiths, 2020b), and improving human decision-making (Lieder et al., 2019).

Background and Related Work

Before we introduce, evaluate, and apply our new method for discovering hierarchical planning strategies, we briefly review the concepts and methods that it builds on. We start with the theoretical framework we use to define what constitutes a good planning strategy.

Resource-Rationality

Previous work has shown that people’s planning strategies are jointly shaped by the structure of the environment and the cost of planning (Callaway et al., 2018b, 2020). This idea has been formalized within the framework of resource-rational analysis (Lieder & Griffiths, 2020b). Resource-rational analysis is a cognitive modeling paradigm that derives process models of people’s cognitive strategies from the assumption that the brain makes optimal use of its finite computational resources. These computational resources are modeled as a set of elementary information processing operations. Each of these operations has a cost that reflects how much computational resources it requires. Those operations are assumed to be the building blocks of people’s cognitive strategies. To be resource-rational, a planning strategy has to achieve the optimal tradeoff between the expected return of the resulting decision and the expected cost of the planning operations it will perform to reach that decision. Both depend on the structure of the environment. The degree to which a planning strategy (h) is resource-rational in a given environment (e) can be quantified by the sum of expected rewards achieved by executing the plan it generates (\(R_{\text{total}}\)) minus the expected computational cost it incurs to make those choices, that is

$$ \text{RR}(h, e) = \mathbb{E}\left[R_{\text{total}}\right] - \lambda \cdot \mathbb{E}\left[N\right], $$
(1)

where λ is the cost of performing one planning operation and N is the number of planning operations that the strategy performs. Throughout this article, we refer to this measure as resource-rationality score and use it as our primary criterion for the performance of planning algorithms, automatically discovered strategies, and people.

Discovering Resource-Rational Planning Strategies by Solving Metalevel MDPs

Equation 1 specifies a criterion that optimal heuristics must meet. But it does not directly tell us what those optimal strategies are. Finding out what those optimal strategies are is known as strategy discovery.

Callaway et al. (2018b) developed a method to automatically derive resource-rational planning strategies by modeling the optimal planning strategy as a solution to a metalevel Markov Decision Process (metalevel MDP). In general, a metalevel MDP \(M = ({\mathscr{B}}, \mathcal {C}, T, r)\) is defined as an undiscounted MDP where \(b \in {\mathscr{B}}\) represents the belief state, \(T(b,c,b^{\prime })\) is the probability of transitioning from belief state b to belief state \(b^{\prime }\) by performing computation \(c \in \mathcal {C}\), and r(b,c) is a reward function that describes the costs and benefits of computation (Hay et al., 2014). It is important to note that the actions in a metalevel MDP are computations that are different from object-level actions—the former are planning operations and the latter are physical actions that move the agent through the environment. Previous methods for discovering near-optimal decision strategies (Lieder et al., 2017; Callaway et al., 2018b; Callaway et al., 2018a) have been developed for and evaluated in a planning task known as the Mouselab-MDP paradigm (Callaway et al., 2017, 2020).
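To make this definition concrete, the following sketch outlines how the components \(({\mathscr{B}}, \mathcal {C}, T, r)\) of a metalevel MDP might be represented in code. This is an illustration rather than the implementation used in the cited work; the names `MetalevelMDP`, `TERMINATE`, and `q_value` are ours.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Hashable, List, Tuple

# The special token below stands for the termination operation (denoted ⊥ in the text).
TERMINATE = "terminate"

@dataclass
class MetalevelMDP:
    """Minimal container for the components (B, C, T, r) of a metalevel MDP."""
    computations: List[Hashable]                                    # the set C
    # transition(b, c) -> list of (next_belief, probability) pairs, i.e. T(b, c, b')
    transition: Callable[[Hashable, Hashable], List[Tuple[Hashable, float]]]
    reward: Callable[[Hashable, Hashable], float]                   # r(b, c)

    def q_value(self, belief, computation, value_fn: Dict[Hashable, float]) -> float:
        """One-step lookahead: immediate metalevel reward plus the expected value
        of the successor belief states under a given value function."""
        if computation == TERMINATE:
            return self.reward(belief, TERMINATE)
        return self.reward(belief, computation) + sum(
            p * value_fn.get(b_next, 0.0)
            for b_next, p in self.transition(belief, computation)
        )
```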

The Mouselab-MDP Paradigm

The Mouselab-MDP paradigm was developed to make people’s elementary planning operations observable. This is achieved by externalizing the process of planning as information seeking. Concretely, the Mouselab-MDP paradigm illustrated in Fig. 1 shows the participant a map of an environment where each location harbors an occluded positive or negative reward. To find out which path to take the participant has to click on the locations they consider visiting to uncover their rewards. Each of these clicks is recorded and interpreted as the reflection of one elementary planning operation. The cost of planning is externalized by the fee that people have to pay for each click. People can stop planning and start navigating through the environment at any time. But once they have started to move through the environment they cannot resume planning. The participant has to follow one of the paths along the arrows to one of the outermost nodes.

Fig. 1

Illustration of the Mouselab-MDP paradigm. Rewards are revealed by clicking with the mouse, prior to selecting a path using the keyboard. This figure shows one concrete task that can be created using this paradigm. Many other tasks can be created by varying the size and layout of the environment, the distributions that the rewards are drawn from, and the cost of clicking

To evaluate the resource-rational performance metric specified in Eq. 1 in the Mouselab-MDP paradigm, we measure \(R_{\text{total}}\) by the sum of rewards along the chosen path, set λ to the cost of clicking, and measure N by the number of clicks that a strategy made on a given trial.
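For illustration, the per-trial resource-rationality score can be computed from a trial log as in the following sketch; the function name and input format are ours.

```python
def resource_rationality_score(path_rewards, num_clicks, click_cost):
    """R_total minus the cost of planning for one trial (Eq. 1 applied to a single trial).

    path_rewards: rewards of the nodes on the chosen path (R_total is their sum)
    num_clicks:   number of planning operations N performed on this trial
    click_cost:   cost lambda of a single click
    """
    return sum(path_rewards) - click_cost * num_clicks

# Example: a path worth 4 - 8 + 48 = 44 points, reached after 6 clicks at 1 point each.
print(resource_rationality_score([4, -8, 48], num_clicks=6, click_cost=1))  # 38
```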

The structure of a Mouselab-MDP environment can be modeled as a directed acyclic graph (DAG), where each node is associated with a reward that is sampled from a probability distribution, and each edge represents a transition from one node to another. In this article, we refer to the agent’s initial position as the root node, the most distant nodes as goal nodes, and all other nodes as intermediate nodes.

Mouselab-MDP environments were selected in which the variance of each node’s reward distribution increases with the node’s depth (Footnote 1). This models the fact that the values of distant states are more variable than the values of proximal states. Therefore, the goal nodes have a higher variance than the intermediate nodes.
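The sketch below illustrates one way such an environment could be represented and generated. It assumes a tree-structured layout and zero-mean Gaussian rewards whose standard deviation grows with depth; the function names and parameter values are ours and do not correspond to the specific benchmark environments described later.

```python
import random

def make_mouselab_dag(edges, root=0):
    """Represent a (tree-structured) Mouselab-MDP environment as a directed acyclic graph.

    edges: dict mapping each node to the list of its children.
    Returns node depths together with the node classification used in the text
    (goal nodes = leaves, all other non-root nodes are intermediate).
    """
    depth = {root: 0}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in edges.get(node, []):
            depth[child] = depth[node] + 1
            frontier.append(child)
    goals = [n for n in depth if not edges.get(n)]
    intermediate = [n for n in depth if n != root and n not in goals]
    return depth, goals, intermediate

def sample_rewards(depth, base_sd=1.0, growth=2.0, root=0):
    """Sample one reward per node from a zero-mean Gaussian whose standard
    deviation grows with the node's depth (increasing-variance structure)."""
    return {n: 0.0 if n == root else random.gauss(0.0, base_sd * growth ** (d - 1))
            for n, d in depth.items()}

# A toy environment: two 2-step paths from the root.
edges = {0: [1, 3], 1: [2], 3: [4]}
depth, goals, intermediate = make_mouselab_dag(edges)
rewards = sample_rewards(depth)
```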

Resource-Rational Planning in the Mouselab-MDP Paradigm

Using the Mouselab-MDP as a model of human planning makes it possible to measure the effectiveness of resource-rational planning strategies. As an alternative to observing human planning, it is also possible to algorithmically find novel planning strategies that solve the Mouselab-MDP in a near-optimal way (Callaway et al., 2018a), which we term strategy discovery. Since the Mouselab-MDP models human planning operations, we believe that devising algorithms that solve it is a worthwhile endeavor. Recent work by Callaway et al. (2018a) has shown that it is possible to discover near-optimal planning strategies for small decision problems, and Lieder et al. (2019, 2020) demonstrated that teaching these strategies to humans can improve their own planning strategies in the Mouselab-MDP paradigm.

To discover optimal planning strategies, we can draw on previous work that formalized resource-rational planning in the Mouselab-MDP paradigm as the solution to a metalevel MDP (Callaway et al., 2018b, 2020). In the corresponding metalevel MDP each belief state \(b \in {\mathscr{B}}\) encodes probability distributions over the rewards that the nodes might harbor. The possible computations are \(\mathcal {C} = \{ \xi _{1}, ..., \xi _{M}, c_{1,1}, ..., c_{M,N}, \perp \}\), where cg,n reveals the reward at intermediate node n on the path to goal g, and ξg reveals the value of the goal node g. For simplicity, we set the cost of each computation to 1. When the value of a node is revealed, the new belief about the value of the inspected node assigns a probability of one to the observed value. The metalevel operation ⊥ terminates planning and triggers the execution of the plan. The agent selects one of the paths to a goal state that has the highest expected sum of rewards according to the current belief state.
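The following sketch shows how such a belief state might be represented: revealing a node replaces its prior with a point mass on the observed value, and terminating selects the path with the highest expected sum of rewards. The class and method names are ours, and the priors are simplified to lists of equally likely values.

```python
class Belief:
    """Belief state for the Mouselab metalevel MDP: one distribution per node.

    An unobserved node is represented by its prior (a list of equally likely
    values, for simplicity); an observed node is a point mass on the revealed value.
    """
    def __init__(self, priors):
        self.priors = dict(priors)      # node -> list of possible reward values
        self.observed = {}              # node -> revealed reward

    def expected_value(self, node):
        if node in self.observed:
            return self.observed[node]
        values = self.priors[node]
        return sum(values) / len(values)

    def reveal(self, node, true_rewards):
        """Computation c_node: reveal the reward of `node` (its cost is handled elsewhere)."""
        self.observed[node] = true_rewards[node]

    def best_path_value(self, paths):
        """Value of terminating (⊥): expected return of the best path under the current belief."""
        return max(sum(self.expected_value(n) for n in path) for path in paths)
```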

Methods for Solving Metalevel MDPs

In their seminal paper, Russell and Wefald (1991) introduced the theory of rational metareasoning. In Russell and Wefald (1992), they define the value of computation VOC(c,b) as the expected improvement in decision quality achieved by performing computation c in belief state b and continuing optimally, minus the cost of computation c. Using this formalization, the optimal planning strategy \(\pi _{\text {meta}}^{*}\) selects, in each belief state, the computation that maximizes the value of computation (VOC), that is

$$ \pi_{\text{meta}}^{*}(b) = \arg\max_{c} \text{VOC}(c,b). $$
(2)

When the VOC is non-positive for all available computations, the policy terminates (c =⊥) and executes the best object-level action according to the current belief state. Hence, VOC(⊥,b) = 0. In general, the VOC is computationally intractable but it can be approximated (Callaway et al., 2018a). Lin et al. (2015) estimated VOC by the myopic value of computation (VOI1), which is the expected improvement in decision quality that would be attained by terminating deliberation immediately after performing the computation. Hay et al. (2014) approximated rational metareasoning by solving multiple smaller metalevel MDPs that each defines the problem of deciding between one object-level action and its best alternative.
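In code, this control rule is a simple greedy selection over candidate computations. The sketch below assumes that some estimate `voc(c, b)` of the value of computation (such as the approximations discussed next) is available; the function name is ours.

```python
def metalevel_policy_step(belief, computations, voc):
    """One step of the VOC-greedy metalevel policy (Eq. 2): perform the computation
    with the highest estimated value of computation, or terminate (return None,
    corresponding to ⊥) when no computation has a positive VOC."""
    best = max(computations, key=lambda c: voc(c, belief))
    return best if voc(best, belief) > 0 else None
```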

Bayesian Metalevel Policy Search

Inspired by research on how people learn how to plan (Krueger et al., 2017), Callaway et al. (2018a) developed a reinforcement learning method for learning when to select which computation. This method uses Bayesian optimization to find a policy that maximizes the expected return of a metalevel MDP. The policy space is parameterized by weights that determine to which extent computations are selected based on the myopic VOC versus less short-sighted approximations of the value of computation. It thereby improves upon approximating the value of computation by the myopic VOC alone by considering the possibility that the optimal metalevel policy might perform additional computations afterward. Concretely, Bayesian Metalevel Policy Search (BMPS) approximates the value of computation by interpolating between the myopic VOI and the value of perfect information, that is

$$ \begin{array}{@{}rcl@{}} \mathrm{\widehat{VOC}}(c,b; \mathbf{w}) &=& w_{1} \cdot \text{VOI}_{1}(c,b) + w_{2} \cdot \text{VPI}(b) + w_{3} \cdot \\ &&\mathrm{VPI_{sub}}(c,b) - w_{4} \cdot \text{cost}(c), \end{array} $$
(3)

where VPI(b) denotes the value of perfect information; it assumes that all computations possible at the given belief state would take place. Furthermore, VPIsub(c,b) measures the benefit of having full information about the subset of parameters that the computation reasons about (e.g., the values of all paths that pass through the node evaluated by the computation), cost(c) is the cost of the computation c, and w = (w1,w2,w3,w4) is a vector of weights. Since the VOC and VPIsub are bounded from below by the VOI1 and from above by the VPI, the approximation of the VOC (i.e., \(\mathrm {\widehat {VOC}}\)) is a convex combination of these features, and the weights w1, w2, and w3 are constrained to the probability simplex. Finally, the weight associated with the cost feature satisfies w4 ∈ [1,h], where h is the maximum number of computations that can still be performed. The values of these weights are computed using Bayesian Optimization (Mockus, 2012); finding the optimized weights corresponds to discovering the optimal metalevel policy for the environment.
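A direct transcription of Eq. 3 might look as follows. The feature functions `voi1`, `vpi`, `vpi_sub`, and `cost` are assumed to be supplied by the caller; this is a sketch, not the original BMPS implementation.

```python
def bmps_voc_estimate(c, b, w, voi1, vpi, vpi_sub, cost):
    """BMPS approximation of the value of computation (Eq. 3).

    w = (w1, w2, w3, w4): w1..w3 lie on the probability simplex and mix the three
    value-of-information features; w4 scales the cost term and lies in [1, h].
    """
    w1, w2, w3, w4 = w
    # Sanity check: the VOI weights form a convex combination.
    assert abs(w1 + w2 + w3 - 1.0) < 1e-9 and min(w1, w2, w3) >= 0.0
    return w1 * voi1(c, b) + w2 * vpi(b) + w3 * vpi_sub(c, b) - w4 * cost(c)
```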

Alternative Approaches

Alternative methods for solving metalevel MDPs include the works of Sezener and Dayan (2020) and Svegliato and Zilberstein (2018). Sezener and Dayan (2020) solve a multi-armed bandit problem using Monte Carlo Tree Search guided by the static and dynamic value of computations. In a bandit problem, unlike in most models of planning, transitions depend purely on the chosen action and not on the current state. Svegliato and Zilberstein (2018) devised an approximate metareasoning algorithm that uses temporal difference (TD) learning to decide when to terminate the planning process.

Intelligent Cognitive Tutors

One of the reasons for the ineffectiveness of teaching people to make decisions according to normative principles that ignore practical constraints (i.e., limited time and bounded cognitive resources) is that the exhaustive planning that would be required to apply those principles to the real world would waste virtually all of a person’s time on the very first decision (e.g., “Should I get up or go back to sleep?”), thereby depriving them of the opportunities afforded by later decisions. By contrast, resource-rational heuristics allocate people’s limited time and bounded cognitive resources in such a way that they earn the highest possible sum of rewards across the very long series of decisions that constitutes life. This is why teaching resource-rational heuristics might succeed in improving people’s decisions in large and complex real-world problems where exhaustive planning would either be impossible or take a disproportionately large amount of time that could be better spent on other things.

Utilizing the resource-rational planning strategies discovered by solving metalevel MDPs, Lieder et al. (2019, 2020) have developed intelligent tutors that teach people the optimal planning strategies for a given environment. Most of the tutors let people practice planning in the Mouselab-MDP paradigm and provide them with immediate feedback on each chosen planning operation. The feedback has two components: (1) information about what the optimal planning strategy would have done; and (2) an affective component, given as positive feedback (e.g., “Good job!”) or negative feedback. The negative feedback included a slightly frustrating time-out penalty during which participants were forced to wait idly for a duration that was proportional to how suboptimal their planning operation had been.

Lieder et al. (2020) found that participants were able to learn to use the automatically discovered strategies, remember them, and use them in novel environments with a similar structure. These findings suggest that automatic strategy discovery can be used to improve human decision-making if the discovered strategies are well-adapted to the situations where people might use them. Additionally, Lieder et al. (2020) found that video demonstrations of the click sequences performed by the optimal strategy are as effective a teaching method as providing immediate feedback. Here, we build on these findings to develop cognitive tutors that teach automatically discovered strategies by demonstrating them to people.

Discovering Hierarchical Planning Strategies

All previous strategy discovery methods evaluate and compare the utilities of all possible computations in each step. As such, these algorithms have to explore the entire state space of the metalevel MDP, which grows exponentially with the number of nodes (Footnote 2). As a consequence, these methods do not scale well to problems with large state spaces and long planning horizons. This is especially true of the Bayesian Metalevel Policy Search algorithm (BMPS) (Callaway et al., 2018a), whose run time is exponential in the number of nodes of the planning problem. In contrast to the exhaustive enumeration of all possible planning operations performed by those methods, people would not even consider making detailed low-level motor plans for navigating to a specific distant location (e.g., Terminal C of San Juan Airport) until they arrive at a high-level plan that leads them to or through that location (Tomov et al., 2020). Here, we build on insights about human planning to develop a more scalable method for discovering efficient planning strategies.

Hierarchical Problem Decomposition

To efficiently plan over long horizons, people (Botvinick, 2008; Carver and Scheier, 2001; Tomov et al., 2020) and hierarchical planning algorithms (Kaelbling & Lozano-Pérez, 2010; Sacerdoti, 1974; Marthi et al., 2007; Wolfe et al., 2010) decompose the problem into first setting goals and then planning how to achieve them. This two-stage process breaks large planning problems down into smaller problems that are easier to solve. To discover hierarchical planning strategies automatically, our proposed strategy discovery algorithm decomposes the problem of discovering planning strategies into the sub-problems of discovering a strategy for selecting a goal and discovering a strategy for planning the path to the chosen goal. A pictorial representation is given in Fig. 2.

Fig. 2

Flow diagram of the hierarchically discovered planning strategy. The high-level controller decides on goal computations. The low-level controller decides the computations to be performed within a goal

Formally, this is accomplished by decomposing the metalevel MDP that defines the strategy discovery problem into two metalevel MDPs with smaller state and action spaces. Constructing metalevel MDPs for goal-setting and path planning is easy when there is a small set of candidate goals. Such candidate goals can often be identified based on prior knowledge or the structure of the domain (Schapiro et al., 2013; Solway et al., 2014). A high-level controller solves the goal-setting MDP, whereas a low-level controller solves the goal-achievement MDP. Whichever controller is currently in control selects a computation from its corresponding metalevel MDP and performs it. The meta-controller compares the expected reward of the current goal with the expected reward of the next-best goal and decides when control should be passed between the high-level and the low-level controller. Hence, when the low-level controller discovers that the current goal is not as valuable as expected, the meta-controller allows for goal-switching.
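The following sketch illustrates the interplay between the three controllers described above. All names are ours; the two policies and the switching criterion stand in for the automatically discovered strategies discussed in the text, and goal-switching is simplified to jumping directly to the currently best-looking goal rather than re-entering a full high-level planning phase.

```python
def hierarchical_planning_episode(high_policy, low_policy, should_switch_goal, belief):
    """One planning episode under the hierarchical decomposition.

    high_policy(belief)              -> a goal-evaluation computation, or None (⊥^H)
    low_policy(belief, goal)         -> a path-planning computation, or None (⊥^L)
    should_switch_goal(belief, goal) -> True when the meta-controller hands control
                                        back to goal selection
    The belief object is assumed to expose reveal() and best_goal() methods.
    """
    clicks = []

    # High-level phase: evaluate candidate goals until the high-level policy terminates.
    while (c := high_policy(belief)) is not None:
        belief.reveal(c)
        clicks.append(c)
    goal = belief.best_goal()

    # Low-level phase: plan a path to the chosen goal. If the goal turns out to be
    # less valuable than expected, the meta-controller triggers goal-switching.
    while (c := low_policy(belief, goal)) is not None:
        belief.reveal(c)
        clicks.append(c)
        if should_switch_goal(belief, goal):
            goal = belief.best_goal()

    return goal, clicks
```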

The metalevel MDP model of the sub-problem of goal selection (“Goal-Setting Metalevel MDP”) only includes computations for estimating the values of a small set of candidate goal states (V (g1),⋯ ,V (gM)). This means that goals are chosen without considering how costly it would be to achieve them. This makes sense when all goals are known to be achievable and the differences between the values of alternative goal states are substantially larger than the differences between the costs of reaching them. This is arguably true for many challenges people face in real life. For instance, when a high school student plans their career, the difference between the long-term values of studying computer science versus becoming a janitor is likely much larger than the difference between the costs of achieving either goal. This is to be expected when the time it will take to achieve the goals is short relative to a person’s lifetime.

The goal-achievement MDP (“Goal-Achievement Metalevel MDP”) only includes computations that update the estimated costs of alternative paths to the chosen goal by determining the costs or rewards of state-action pairs r(s,a) that lie on those paths. Restricting the computations to the selected goal raises a potential issue: some computations that appear worthless within the goal-achievement MDP can be highly valuable for the complete problem. One example is computations that reveal the value of a node that lies on every path to the selected goal. Within the goal-achievement MDP, inspecting such a node cannot change which path is best, but if the node might harbor a highly positive or negative reward, knowing its value is crucial for deciding whether to keep or abandon the goal. To rectify this problem, a meta-controller has been introduced to facilitate goal-switching. A real-world example of the necessity to switch goals after discovering an unlikely, highly negative event would be switching from investing in the stock market to investing in real estate after discovering that a stock market crash is likely.

Decomposing the strategy discovery problem into these two components reduces the number of possible computations that the metareasoning method has to choose between from M ⋅ N to M + N, where M is the number of possible final destinations (goals) and N is the number of steps to the chosen goal (see Appendix 4). Perhaps the most promising metareasoning method for automatic strategy discovery is the Bayesian Metalevel Policy Search algorithm (BMPS; Callaway et al., 2018a; Kemtur et al., 2020). To solve the two types of metalevel MDPs introduced below more effectively, we also introduce an improvement of the BMPS algorithm in “Hierarchical Bayesian Metalevel Policy Search.” An additional computational optimization is described in Appendix 8.

Goal-Setting Metalevel MDP

The optimal strategy for setting the goal can be formalized as the solution to the metalevel MDP \(M^{\mathrm {H}} = ({\mathscr{B}}^{\mathrm {H}}, \mathcal {C}^{\mathrm {H}}, T^{\mathrm {H}}, R^{\mathrm {H}})\), where the belief state \(b^{\mathrm {H}}(g) \in {\mathscr{B}}^{\mathrm {H}}\) denotes the expected cumulative reward that the agent can attain starting from the goal state \(g \in \mathcal {G}\). The high-level computations are \(\mathcal {C}^{\mathrm {H}} = \{\xi _{1}, ..., \xi _{M}, \perp ^{\mathrm {H}} \}\), where ξg reveals the value V (g) of the goal node g. ⊥H terminates the high-level planning, leading the agent to select the goal with the highest value according to its current belief state. The reward function is RH(bH,cH) = −λH for cH ∈ {ξ1,...,ξM} and \( R^{\mathrm {H}}(b^{\mathrm {H}}, \perp ^{\mathrm {H}}) = \max \limits _{k \in \mathcal {G}}\mathbb {E}[b^{\mathrm {H}}(k)] \).
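As a concrete illustration, the reward function of this metalevel MDP could be written as follows. This is a sketch in which, for simplicity, the belief is summarized by the expected value of each goal; the names are ours.

```python
def goal_setting_reward(expected_goal_values, computation, goal_cost):
    """Reward function R^H of the goal-setting metalevel MDP.

    expected_goal_values: dict mapping each candidate goal g to E[b^H(g)]
    computation:          a goal identifier (one of the goal evaluations ξ_g), or None for ⊥^H
    goal_cost:            the cost λ^H of evaluating one goal
    """
    if computation is None:                  # ⊥^H: commit to the best goal
        return max(expected_goal_values.values())
    return -goal_cost                        # every goal evaluation costs λ^H
```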

Goal-Achievement Metalevel MDP

Having set a goal to pursue, the agent has to find the optimal planning strategy to achieve the goal. This planning strategy is formalized as the solution to the metalevel MDP \(M^{\mathrm {L}} = ({\mathscr{B}}^{\mathrm {L}}, \mathcal {C}^{\mathrm {L}}, T^{\mathrm {L}}, R^{\mathrm {L}})\), where the belief state \(b \in {\mathscr{B}}^{\mathrm {L}}\) denotes the expected reward for each node. The agent can only perform a subset of meta-actions \(\mathcal {C}_{g,\mathrm {L}} = \{ c_{g,1}, ..., c_{g,N}, \perp ^{\mathrm {L}}\}\), where cg,n reveals the reward at node n in the goal set \(h_{g} \in {\mathscr{H}}\). A goal set \(h_{g} \in {\mathscr{H}}\) refers to all nodes, including the goal node, which lie on all paths leading to goal \(g \in \mathcal {G}\). Furthermore, ⊥L terminates planning and leads the agent to select the path with the highest expected sum of rewards according to the current belief state. The reward function is RL(b,cg) = −λL for cg ∈{cg,1,...,cg,N} and \( R^{\mathrm {L}}(b, \perp ^{\mathrm {L}}) = \max \limits _{p \in \mathcal {P}} {\sum }_{n \in p} \mathbb {E}[b_{n}] \), where \(\mathcal {P}\) is the set of all paths, and bn is the belief of the reward for node n.

Hierarchical Bayesian Metalevel Policy Search

Having introduced the hierarchical problem decomposition, we now present how this decomposition can be leveraged to make BMPS and other automatic strategy discovery methods more scalable. BMPS approximates the value of computation (VOC) according to Eq. 3. We propose to use BMPS to solve the goal-setting metalevel MDP and the goal-achievement metalevel MDP separately. The meta-controller then decides which of the two discovered policies is in control at any given time. A detailed analysis of the computation time is presented in Appendix 4.

High-level Policy Search

The VOC for the high-level policy is approximated using three features: (1) the myopic utility for performing a goal state evaluation (\(\mathrm {VO{I_{1}^{H}}}\)), (2) the value of perfect information about all goals (VPIH), and (3) the cost of the respective computation (costH).

$$ \begin{array}{@{}rcl@{}} \mathrm{\widehat{VOC}}^{\mathrm{H}}\left( c^{\mathrm{H}}, b^{\mathrm{H}}; \mathbf{w}^{\mathrm{H}}\right) &=& w_{1}^{\mathrm{H}} \cdot \mathrm{VO{I_{1}^{H}}}\left( c^{\mathrm{H}},b^{\mathrm{H}}\right) + w_{2}^{\mathrm{H}} \cdot \mathrm{VPI^{H}}\left( b^{\mathrm{H}}\right)\\ && - w_{3}^{\mathrm{H}} \cdot \mathrm{cost^{H}}\left( c^{\mathrm{H}}\right), \end{array} $$
(4)

where \(w_{1}^{\mathrm {H}}, w_{2}^{\mathrm {H}}\) are constrained to a probability simplex set, \(w_{3}^{\mathrm {H}} \in \mathbb {R}_{[1, M]}\), and M is the number of goals. Additionally, the cost costH(cH) is defined as

$$ \mathrm{cost^{H}}\left( c^{\mathrm{H}}\right)=\begin{cases} \lambda^{\mathrm{H}}, & \text{if}~c^{\mathrm{H}} \in \{\xi_{1},..., \xi_{M} \}.\\ 0, & \text{if}~c^{\mathrm{H}} = \perp^{\mathrm{H}}. \end{cases} $$
(5)

Low-level Policy Search

As for the high-level policy, the value of computation for the low-level policy is approximated by a mixture of VOI features and the anticipated cost of the current computation and future computations, that is:

$$ \begin{array}{@{}rcl@{}} \mathrm{\widehat{VOC}}^{\mathrm{L}}\left( c,b, g; \mathbf{w}^{\mathrm{L}}\right) & =& w_{1}^{\mathrm{L}} \cdot \mathrm{VO{I_{1}^{L}}}\left( c,b,g\right) + w_{2}^{\mathrm{L}} \cdot \mathrm{VPI^{L}}\left( b,g\right) \\ && + w_{3}^{\mathrm{L}} \cdot \mathrm{VPI_{sub}^{L}}(c,b,g) - w_{4}^{\mathrm{L}} \\ &&\cdot \mathrm{cost^{L}}\left( c,g, \mathbf{w}^{\mathrm{L}}\right) \end{array} $$
(6)

where the weights \(w_{1}^{\mathrm {L}}\), \(w_{2}^{\mathrm {L}}\), \(w_{3}^{\mathrm {L}}\) are constrained to a probability simplex set, \(w_{4}^{\mathrm {L}} \in \mathbb {R}_{[1, |h_{g}|]}\), and |hg| is the number of nodes in goal set hg. The weight values for both levels are optimized in 100 iterations with Bayesian Optimization (Mockus, 2012) using the GPyOpt library (The GPyOpt Authors, 2016).

The cost feature of the original BMPS algorithm introduced by Callaway et al. (2018a) only considered the cost of a single computation whereas its VOI features consider the benefits of performing a sequence of computations. As a consequence, policies learned with the original version of BMPS are biased towards inspecting nodes that many paths converge on, even when the values of those nodes are irrelevant. To rectify this problem, we redefine the cost feature so that it considers the costs of all computations assumed by the VOI features (for an explanation of the features, see Appendix 1). Concretely, to compute the low-level policy, we define the cost feature of BMPS as the weighted average of the costs of generating the information assumed by the VOI features \(\mathcal {F}=\lbrace \mathrm {VO{I_{1}^{L}}}, \mathrm {VPI^{L}}, \mathrm {VPI_{sub}^{L}}\rbrace \), that is

$$ \mathrm{cost^{L}}\left( c,g, \mathbf{w}^{\mathrm{L}}\right) = \sum\limits_{f \in \mathcal{F}} w_{f}^{\mathrm{L}} \cdot \sum\limits_{n \in h_{g}} \mathbb{I}(c,f,n) \cdot \text{cost}(c) $$
(7)

where \(\mathbb {I}(c,f,n)\) returns 1 if node n is relevant when computing feature f for computation c and 0 otherwise.
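A direct transcription of Eq. 7 might look as follows; the argument names and the representation of the indicator function are ours.

```python
def low_level_cost_feature(c, goal_set_nodes, voi_weights, relevant, click_cost):
    """Redefined BMPS cost feature (Eq. 7): a weighted average of the cost of
    generating all the information assumed by each VOI feature.

    voi_weights: dict mapping feature name ('VOI_1', 'VPI', 'VPI_sub') to its weight w_f^L
    relevant:    callable (computation, feature_name, node) -> bool, the indicator I(c, f, n)
    click_cost:  cost of a single computation, cost(c)
    """
    return sum(
        w_f * sum(click_cost for n in goal_set_nodes if relevant(c, f, n))
        for f, w_f in voi_weights.items()
    )
```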

In the remainder of this article, we refer to the resulting strategy discovery algorithm as hierarchical BMPS and refer to the original version of BMPS as non-hierarchical BMPS.

Evaluating the Performance, Scalability, and Robustness of Our Method for Discovering Hierarchical Planning Strategies

We evaluated our new method for discovering resource-rational strategies for human planning in large and complex sequential decision problems in terms of the resource-rationality of the discovered strategies and the scope of its applicability. Concretely, we measure the resource-rationality of the discovered strategies by the resource-rationality score defined in Eq. 1 (see “Resource-Rationality” and “The Mouselab-MDP Paradigm”). In brief, we found that our new method makes it possible to discover resource-rational strategies for substantially larger, and hence more naturalistic, sequential decision problems than previous strategy discovery methods could handle. Despite the approximations that went into making our method so scalable, the discovered strategies are no worse than the strategies discovered by state-of-the-art strategy discovery methods and substantially more resource-rational than both the strategies that people use intuitively and standard planning algorithms that people could be thought to use instead. These findings hold even when the structure of the sequential decision problem violates the assumptions of our method. This section presents the details of this evaluation. Readers who are primarily interested in the effects of teaching the automatically discovered strategies to people may skip ahead to the next section.

To evaluate our method for discovering resource-rational strategies for human planning in large sequential decision problems, we benchmarked its performance in the two types of environments illustrated in Figs. 3 and 6. The first type of environment conforms to the structure that motivated our hierarchical problem decomposition (i.e., the variability of rewards increases from each step to the next) and the second type does not. In the second type of environment, we introduced a high-risk node on the path to each goal (see Fig. 6). This violates the assumption that motivated the hierarchical decomposition and makes goal-switching essential for a high resource-rationality score.

Fig. 3

Mouselab-MDP environment with 2 goals. Nodes associated with each goal are denoted in green (circles) and red (triangles), respectively. The goal nodes have darker shades of green and red (diamonds), and the root node’s color is blue (square)

For each environment, we assess the degree to which the automatically discovered strategies are resource-rational against the resource-rationality of human planning, existing planning algorithms, and the strategies discovered by state-of-the-art strategy discovery methods. The Mouselab-MDP planning task and the corresponding metalevel MDP are set up in such a way that the scores of people and automatically discovered strategies measure their level of resource-rationality rather than just their performance. This is because the score in this task is the sum of the external rewards along the chosen path minus the cost of planning that was invested to select the path. We therefore refer to it as the resource-rationality score. A strategy will achieve the highest resource-rationality score if it selects the most informative planning operations (computations/clicks) to maximally improve its plan with as few computations as necessary. Each planning operation has an associated cost which is used to model the cost of thinking. A strategy is said to be the most computationally efficient if it achieves the highest possible resource-rationality score.

In addition to evaluating the resource-rationality of the discovered strategies, we also evaluated the scalability of our method to larger planning problems with longer horizons in terms of the largest environments for which it can discover resource-rational heuristics (see Appendix 2). In brief, our method can solve metalevel MDPs with 52,484 times as many possible belief states as previous methods were able to handle. This improvement increases the size of the largest problem for which resource-rational planning strategies can be discovered to planning 5 steps ahead when there are five possible actions in each state that each lead to a different outcome.

Evaluation in Environments That Conform to the Assumed Structure

We first evaluate the performance of our method in environments whose structure conforms to the assumptions that motivated the hierarchical problem decomposition. To do so, we compare the resource-rationality of the discovered strategies against the resource-rationality of existing planning algorithms, the strategies discovered by previous strategy discovery methods, and human planning in four increasingly challenging environments of this type with 2–5 candidate goals. The reward of each node is sampled from a normal distribution with mean 0. The variance of rewards available at non-goal nodes is 5 for nodes reachable within a single step (level 1) and doubles from each level to the next. The variance of the distribution from which the reward of a goal node is sampled starts at 100 and increases by 20 for every additional goal node, up to a maximum of 180, after which it resets to 100. The environment is partitioned into one sub-graph per goal. Each of those sub-graphs contains 17 intermediate nodes, forming 10 possible paths that reach the goal state in at most 5 steps (see Fig. 3). The cost of planning is 1 point per click (λ = 1).
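The following sketch reproduces these reward distributions in code. The function names are ours, and details not fully specified above (such as how goal indices map onto variances) follow our reading of the description.

```python
import random

def intermediate_variance(level):
    """Variance of an intermediate node's reward: 5 at level 1, doubling with each further level."""
    return 5.0 * 2 ** (level - 1)

def goal_node_variance(goal_index):
    """Goal-node reward variance: 100 for the first goal, +20 for each additional goal
    up to 180, then cycling back to 100 (goal_index is 0-based)."""
    return 100.0 + 20.0 * (goal_index % 5)

def sample_reward(level=None, goal_index=None):
    """Sample a node reward from a zero-mean Gaussian with the variance defined above."""
    var = goal_node_variance(goal_index) if goal_index is not None else intermediate_variance(level)
    return random.gauss(0.0, var ** 0.5)
```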

Table 1 Resource-rationality scores of various strategy discovery methods (S) and existing planning algorithms (P). The highest resource-rationality score for each environment setting (column) is formatted in bold. The four best methods performed significantly better than the other methods but the differences between them are not statistically significant

To estimate an upper bound on the resource-rationality score of existing planning algorithms on our benchmark problems, we selected Backward Search and Bidirectional Search (Russell and Norvig, 2002) because—unlike most planning algorithms—they start by considering potential final destinations, which is optimal for planning in our benchmark problems. These search algorithms terminate when they find a path whose expected return exceeds a threshold, called its aspiration value. The aspiration value was selected using Bayesian Optimization (Mockus, 2012) to obtain the best possible performance from each planning algorithm. We also evaluated the resource-rationality score of a random-search algorithm, which chooses computations uniformly at random from the set of metalevel operations that have not been performed yet.

In addition to those planning algorithms, our baselines also include three state-of-the-art methods for automatic strategy discovery: the greedy myopic VOC strategy discovery algorithm (Lin et al., 2015), which approximates the VOC by its myopic utility (VOI1); BMPS (Callaway et al., 2018a); and the Adaptive Metareasoning Policy Search algorithm (AMPS) (Svegliato & Zilberstein, 2018), which uses approximate metareasoning to decide when to terminate planning. Our implementation of the AMPS algorithm uses a deep Q-network (Mnih et al., 2013) to estimate the difference between the values of stopping planning and continuing planning, respectively. It learns this estimate based on the expected termination reward of the best path. The planning operations are selected by maximizing the myopic value of information (VOI1). When applying hierarchical BMPS to this environment, goal-switching is unnecessary because the cumulative variance of the intermediate nodes is smaller than the variance of the goal nodes. Therefore, goal-switching does not occur in the automatically discovered strategies for solving this environment. To illustrate the versatility of our hierarchical problem decomposition, we also applied it to the greedy myopic VOC strategy discovery algorithm (Fig. 4).

Fig. 4

Sequence of nodes revealed in a particular environment. The numbers above the nodes indicate the order in which the nodes were revealed. The number in each revealed node indicates its reward

How Resource-Rational Are the Automatically Discovered Strategies?

Table 1 and Fig. 5a compare the resource-rationality scores of the strategies discovered by hierarchical BMPS and the hierarchical greedy myopic VOC method against the resource-rationality scores of the strategies discovered by the two state-of-the-art methods, of two standard planning algorithms, and of human participants (see “How Does the Performance of the Automatically Discovered Strategies Compare to Human Planning?”) on the benchmark problems described above (“Evaluation in Environments That Conform to the Assumed Structure”). These results show that the strategies discovered by our new hierarchical strategy discovery methods (Footnote 3) outperform extant planning algorithms and the strategies discovered by the AMPS algorithm across all of our benchmark problems (p < .01 for all pairwise Wilcoxon rank-sum tests). Critically, imposing hierarchical constraints on the strategy search of BMPS and the greedy myopic VOC method had no negative effect on the resource-rationality score of the resulting strategies (p > .770 for all pairwise Wilcoxon rank-sum tests).

Fig. 5

(a) Resource-rationality scores of existing planning algorithms (striped bars) versus planning strategies discovered by various strategy discovery methods (bars without stripes). (b) Comparison of the mean time for various strategy discovery methods (in seconds)

As illustrated in Fig. 4, the planning strategy our hierarchical BMPS algorithm discovered for this type of environment is qualitatively different from all existing planning algorithms. In general, the strategy is as follows: it first evaluates the goal nodes until it finds a goal node with a sufficiently high reward. Then, it plans backward from the chosen goal to the current state. In evaluating candidate paths from the goal to the current state, it discards each path from further exploration as soon as it encounters a high negative reward on that path. This phenomenon is known as pruning and has previously been observed in human planning (Huys et al., 2012; see Footnote 4). The non-hierarchical version of BMPS also discovered this type of planning strategy. This suggests that goal-setting with backward planning is the resource-rational strategy for this environment rather than an artifact of our hierarchical problem decomposition. Unlike this type of planning, most extant planning algorithms plan forward, and the few planning algorithms that plan backward (e.g., Bidirectional Search and Backward Search) do not preemptively terminate the exploration of a path.

Evaluation in Risky Environments

To accommodate environments whose structure violates the assumption that more distant rewards are more variable than more proximal ones, the hierarchical strategies discovered by our method can alternate between goal selection and goal planning (Footnote 5). We now demonstrate the benefits of this goal-switching functionality by comparing the resource-rationality score of our method with versus without goal-switching. In particular, we demonstrate that switching goals leads to better performance if the assumption of increasing variance is violated and does not harm performance when that assumption is met. First, we compare the resource-rationality score of the two algorithms in an environment where switching goals should lead to an improvement in performance. This environment has a total of 60 nodes split into four different goals, each consisting of 15 nodes in the low-level MDP. The difference from the previously used environments is that one of the unavoidable intermediate nodes has a 10% probability of harboring a large loss of − 1500 (see Fig. 6). The cost of computation in this environment is 10 points per click (λ = 10). The optimal strategy for this environment selects a goal, checks the high-risk node on the path leading to the selected goal, and switches to a different goal if it uncovers the large loss. We compare the performance of hierarchical BMPS with goal-switching to the performance of hierarchical BMPS without goal-switching, the non-hierarchical BMPS method with tree contraction (Footnote 6), and human performance. The three strategy discovery algorithms were all trained in the same environment following the same training steps. Their resource-rationality scores are reported in Table 2.

Fig. 6

Environment that demonstrates the utility of goal-switching. High-risk nodes (8, 23, 38 and 53) follow a categorical reward distribution of − 1500 with a probability of 0.1 and 0 with a probability of 0.9. Goal nodes (15, 30, 45 and 60) have a categorical uniform distribution over the possible reward values 0, 25, 75, and 100. The first node in each sub-tree (1, 16, 31 and 46) as well as the root node (0) have a fixed reward of 0. All other intermediate nodes follow a categorical uniform distribution over the possible values of − 10, − 5, 5, and 10
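For reference, the reward distributions described in the caption of Fig. 6 can be written down directly; the function and the node-type labels are ours.

```python
import random

def sample_risky_environment_reward(node_type):
    """Reward distributions of the high-risk environment (Fig. 6).

    node_type is one of 'high_risk', 'goal', 'fixed', or 'intermediate'.
    """
    if node_type == "high_risk":            # nodes 8, 23, 38, and 53
        return random.choices([-1500, 0], weights=[0.1, 0.9])[0]
    if node_type == "goal":                 # nodes 15, 30, 45, and 60
        return random.choice([0, 25, 75, 100])
    if node_type == "fixed":                # root node and first node of each sub-tree
        return 0
    return random.choice([-10, -5, 5, 10])  # all other intermediate nodes
```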

Table 2 Resource-rationality score (R-R Score) and standard deviation (Std) of the automatically discovered hierarchical planning strategy with and without goal-switching, as well as the strategy discovered by the original BMPS algorithm averaged over 5000 random instances of the high-risk environment with four goal states. A human baseline is gathered in an online experiment with a lower number of samples (see “How Does the Performance of the Automatically Discovered Strategies Compare to Human Planning?”)

None of the resource-rationality scores follows a normal distribution, as indicated by Shapiro-Wilk tests (p < .001 for each). The performance of the individual algorithms was therefore compared with Wilcoxon rank-sum tests, adjusting the critical alpha value via Bonferroni correction. Comparing the score of the goal-switching method to both our method without goal-switching (W = 14.07, p < .001) and the original BMPS algorithm (W = 11.38, p < .001) shows a significant benefit of goal-switching. Comparing the original BMPS method to the algorithm without goal-switching, the original BMPS version performs significantly better (W = 18.7, p < .001).

To show that enabling our algorithm’s capacity for goal-switching has no negative effect on its performance even when the assumption of the hierarchical decomposition is met, we perform a second comparison on the two-goal environment with increasing variance as in “How Resource-Rational Are the Automatically Discovered Strategies?” Since in this environment the rewards are most variable at the goal nodes, switching goals should usually be unnecessary. Therefore, due to the environment structure, we do not expect the goal-switching strategy to perform better than the purely hierarchical strategy. By comparing the resource-rationality scores in this environment, we observe that both versions of the algorithm perform similarly well (see Table 3). A Wilcoxon rank-sum test (W = 0.03, p = .98) shows no significant difference between the two. This demonstrates that the addition of goal-switching to the algorithm does not impair performance, even when goal-switching is not beneficial.

Table 3 Average resource-rationality scores (R-R Score) of the planning strategies discovered with the meta-controller method and the hierarchical problem decomposition and their standard deviation (Std) based on their performance on 5000 random instances of the increasing variance environment with two goal states

How Does the Performance of the Automatically Discovered Strategies Compare to Human Planning?

To be able to compare the performance of the automatically discovered planning strategies to the performance of people, we conduct experiments on Amazon Mechanical Turk (Litman et al., 2017).

Methods

We measured human resource-rationality scores in Flight Planning tasks that are analogous to the environments we use to evaluate our method (see Fig. 7).

Fig. 7

Screenshot of the Flight Planning task used to assess people’s planning skills in Experiment 1. Participants can drag and zoom into the environment to show different portions

For the environments that conform to the increasing variance structure (e.g., Fig. 3), we recruited 78 participants for each of the four environments (average age 34.71 years, range: 19–70 years; 46 female). Participants were paid $2.00 plus a performance-dependent bonus (average bonus $1.52). The average duration of the experiment was 25.1 min. For the risky environment (see Fig. 6), we recruited 48 participants (average age 36.98 years, range: 19–70 years; 25 female). Participants were paid $1.75 plus a performance-dependent bonus (average bonus $0.34). The average duration of the experiment was 14.86 min. Following instructions that informed the participants about the range of possible reward values, participants were given the opportunity to familiarize themselves with the task in 5 practice trials of the Flight Planning task. After this, participants were evaluated on 15 test trials of the Flight Planning task for the first type of environments and 5 test trials for the second type of environments. To ensure high data quality, we applied the same pre-determined exclusion criterion throughout all presented experiments. We excluded participants who did not make a single click on more than half of the test trials because not clicking is highly indicative of a participant not engaging and speeding through the task. In the first environment type, we excluded 3 participants and in the second environment type, we excluded 22 participants.

Results

The results of this experiment are shown in the last row of Table 1. Surprisingly, we found that contrary to what has been observed in small sequential decision problems (Callaway et al., 2018b, 2020), human performance is far from resource-rational in large sequential decision problems.

Concretely, in the 15 increasing variance environments, human participants performed much worse than the strategy discovered by our hierarchical method, regardless of the number of goals (p < .02 for all pairwise Wilcoxon rank-sum tests). Compared to the human baseline, the hierarchical strategies discovered by our method achieve a higher level of computational efficiency on multiple instances of environments with an increasing variance reward structure. The computational efficiency of a strategy is defined by its resource-rationality score, which combines the quality of the solution (the reward earned by moving through the environment) with the computational cost of planning.

In the high-risk environment, the average human resource-rationality score was only − 79.92 (see Table 2), whereas our method achieved a resource-rationality score of 41 on the same environment instances. A Shapiro-Wilk test detected no significant violation of the assumption that participants’ average scores are normally distributed (p = .33). We therefore compared the average human performance to our method in a one-sample t-test. We found that human participants performed significantly worse than the strategy discovered by our method (t(25) = − 13.06, p < .001). This suggests that the strategy discovered by our method achieves a superhuman trade-off between gathering information and the associated time cost, resulting in a higher overall resource-rationality score.

Improving Human Decision-making by Teaching Automatically Discovered Planning Strategies

Having shown that our method discovers planning strategies that achieve a superhuman resource-rationality score (i.e., the strategy discovered by our method performs fewer clicks than the strategies that people use intuitively), we now evaluate whether we can improve human decision-making by teaching people the automatically discovered strategies. Building on the Mouselab-MDP paradigm introduced in “The Mouselab-MDP Paradigm,” we investigate this question in the context of the Flight Planning task illustrated in Fig. 7. Participants are tasked with planning the route of an airplane across a network of airports. Each flight yields a profit or a loss. Participants can find out how much profit or loss an individual flight would generate by clicking on its destination for a fee of $1. The participant’s goal is to maximize the sum of the flights’ profits minus the cost of planning. Participants can make as few or as many clicks as they like before selecting a route using their keyboard.

To teach people the automatically discovered strategies, we developed cognitive tutors (see “Intelligent Cognitive Tutors”) that show people step-by-step demonstrations of what the optimal strategies for different environments would do to reach a decision (see Fig. 8). In each step, the strategy selects one click based on which information has already been revealed. At some point, the tutor stops clicking and moves the airplane along the best route indicated by the revealed information. Moving forward, we will refer to cognitive tutors teaching the hierarchical planning strategies discovered by hierarchical BMPS as hierarchical tutors, and to tutors teaching the strategies discovered by non-hierarchical BMPS as non-hierarchical tutors.

Fig. 8

Screenshot of the cognitive tutor demonstrating the non-hierarchical strategy evaluated in Experiment 1. The numbers beside the nodes indicate the sequence in which the clicks were performed

To evaluate the effectiveness of these demonstration-based cognitive tutors, we conduct two training experiments in which participants are taught the optimal strategies for flight planning problems equivalent to the two types of environments in which we evaluated our strategy discovery method in “Evaluating the Performance, Scalability, and Robustness of Our Method for Discovering Hierarchical Planning Strategies.” To assess the potential benefits of the hierarchical tutors enabled by our new scalable strategy discovery method, these experiments compare the performance of people who were taught by hierarchical tutors against the performance of people who were taught by non-hierarchical tutors, the performance of people who were taught by the original feedback-based tutor for small environments (Lieder et al., 2019, 2020), and the performance of people who practiced the task on their own. We developed the best version of each tutor possible given the limited scalability of the underlying strategy discovery method. The increased scalability of our new method enabled the hierarchical tutor to demonstrate the optimal strategy for the task participants faced, whereas the other tutors could only show demonstrations on smaller versions of the task. We found that showing people a small number of demonstrations of the optimal planning strategy significantly improved their decision-making not only when the assumption underlying our method’s hierarchical problem decomposition is met (Experiment 1) but also when it is violated (Experiment 2).

In “Control Experiment,” we report an additional experiment showing that teaching the strategy discovered by our method through demonstrations leads to better decision-making and closer similarity to the optimal strategy than teaching people with random demonstrations that reveal similar information about the environment’s structure.

Training Experiment 1: Teaching People the Optimal Strategy for an Environment with Increasing Variance

In Experiment 1, participants were taught the optimal planning strategy for an environment in which 10 final destinations can be reached through 9 different paths comprising between 4 and 6 steps each (see Fig. 7). The most important property of this environment is that the variance of the available rewards doubles from one step to the next, starting from 5 in the first step. Therefore, in this environment, the optimal planning strategy is to first inspect the values of alternative goals, then commit to the best goal one could find, and then plan how to achieve it without ever reconsidering other goals.

Methods

We recruited 168 participants on Amazon Mechanical Turk (average age 34.9 years, range: 19–70 years; 98 female) (Litman et al., 2017). Participants were paid $2.50 plus a performance-dependent bonus (average bonus $2.86). The average duration of the experiment was 46.9 min. Participants could earn a performance-dependent bonus of 1 cent for every 10 points they won in the test trials.

All participants first had to agree to a consent form stating that they were at least 18 years old, US citizens residing in the USA, and fluent in English. After this, instructions about the range of possible rewards (− 250 to 250), the cost of clicking ($1), and the movement keys were presented. Then, participants went through 5 trials to familiarize themselves with the experiment. Following this, participants either received an additional 10 practice trials or completed 10 trials with a cognitive tutor, depending on their experimental condition. Finally, participants completed 15 test trials of the Flight Planning task with 10 possible final destinations illustrated in Fig. 7. Participants started the test block with 50 points.

To evaluate the efficacy of the cognitive tutors, participants were assigned to one of 4 groups. In the experimental group, participants were taught by the hierarchical tutor. The first control group was taught by the non-hierarchical tutor. The second control group was taught by the feedback-based cognitive tutor (Lieder et al., 2019, 2020) illustrated in Fig. 9. The third control group practiced the Flight Planning task 10 times without feedback. The hierarchical tutor taught the strategy that the hierarchical BMPS algorithm discovered for the task participants had to perform in the test block. It first demonstrated 3 trials of the goal selection strategy, then showed 3 demonstrations of the goal planning strategy, and finally presented 4 demonstrations of the complete strategy combining both parts. The non-hierarchical tutor showed 10 demonstrations of the strategy that non-hierarchical BMPS discovered for the largest version of the Flight Planning task it could handle (i.e., 2 goals instead of 10 goals). Computational bottlenecks confined the feedback-based tutor to a three-step planning task with six possible final destinations shown in Fig. 9 (Lieder et al., 2019, 2020). Participants received feedback on each of their clicks and on their decision when to stop clicking, as illustrated in Fig. 9. When participants chose a suboptimal planning operation, they were shown a message stating which planning operation the optimal strategy would have performed instead. In addition, they received a time-out penalty whose duration was proportional to how suboptimal their planning operation had been.

Fig. 9

Screenshot of the feedback-based tutor against which we evaluated our scalable demonstration-based tutor. This tutor gives people feedback on the planning operations they perform in a smaller version of the environment. This small environment is the largest one for which optimal feedback can be computed with the currently available methods

Counterbalanced assignment ensured that participants were equally distributed across four experimental conditions (i.e., 42 participants per condition). To ensure high data quality, we applied a pre-determined exclusion criterion. We excluded 7 participants who did not make a single click on more than half of the test trials because not clicking is highly indicative of speeding through the experiment without engaging with the task.

On each trial, the participant’s expected resource-rationality score was calculated as the expected reward of the path they chose minus the cost of the clicks they had made (Footnote 7). We analyzed the data using the robust f1.ld.f1 function of the nparLD R package (Noguchi et al., 2012). The function computes a non-parametric ANOVA-type statistic with Box approximation (Box, 1954) to determine the main effect, and additional ANOVA-type statistics (ATS) with the denominator degrees of freedom set to infinity for post hoc comparisons (Noguchi et al., 2012).
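To make this score concrete, the following is a minimal sketch of how the expected resource-rationality score of a single trial could be computed. The function, node identifiers, and reward values are illustrative assumptions, not our actual analysis code.

```python
# Minimal sketch: expected reward of the chosen path (known values for inspected
# nodes, the distribution mean for uninspected ones) minus the total click cost.
CLICK_COST = 1.0  # $1 per click, as in the Flight Planning task

def expected_rr_score(chosen_path, revealed, node_means, n_clicks):
    """chosen_path: node ids on the selected route; revealed: {node: observed value};
    node_means: {node: mean of that node's reward distribution}; n_clicks: clicks made."""
    expected_reward = sum(revealed.get(n, node_means[n]) for n in chosen_path)
    return expected_reward - CLICK_COST * n_clicks

# Example: a 3-node path where only the goal node was inspected (value 48),
# after the participant made 4 clicks in total.
path = ["a", "b", "goal"]
print(expected_rr_score(path, {"goal": 48}, {"a": 0.0, "b": 0.0, "goal": 0.0}, 4))  # 44.0
```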

Results

Table 4 shows the average resource-rationality scores of the four groups on the test trials. According to Shapiro-Wilk tests, participants’ scores on the test trials were not normally distributed in any of the four groups (all p < .001). Therefore, we tested our hypothesis using the non-parametric ANOVA-type statistic with Box approximation (Box, 1954) to test whether there were any significant differences between the groups in our repeated-measures design. We found that people’s performance differed significantly across the four experimental conditions (F(2.85, 144.17) = 3.92, p = .0113). Pairwise post hoc ATS comparisons confirmed that teaching people the strategies discovered by the hierarchical method significantly improved their performance (204.48 points/trial) compared to the no-feedback control condition (177.36 points/trial, F(1) = 7.57, p = .006), the feedback-based cognitive tutor (167.02 points/trial, F(1) = 11.89, p < .001), and the non-hierarchical demonstration (181.58 points/trial, F(1) = 5.02, p = .025). By contrast, neither the feedback-based cognitive tutor (F(1) = 0.57, p = .45) nor the non-hierarchical demonstration (F(1) = 0.19, p = .67) was more effective than letting people practice the task on their own.

Table 4 Experimental results for Training Experiment 1. For each condition we report the number of participants, the mean and standard deviation of the resource-rationality score (R-R Score), and the mean and standard deviation of the click agreement measure

We investigated what percentage of participants followed the general strategy of first planning which goal to pursue and then planning a path to the chosen goal. We measured this by checking whether the participant first clicked any number of goal nodes and then clicked one of the nodes leading to the best discovered goal. A participant is considered to have learned this aspect of the strategy if they showed the described behavior in more than half of the test trials. The results in Table 5 showed that participants in the non-hierarchical demonstration (95%) and hierarchical demonstration (97.56%) conditions learned goal planning to a high degree. Participants in the no-feedback (78.05%) and feedback-based tutor (76.92%) conditions also mastered this general aspect of the optimal strategy fairly well. We compared the proportion of participants who learned this behavior in the hierarchical demonstration condition to the other conditions using z-tests and corrected the p-values for multiple comparisons using the Benjamini-Hochberg method (Benjamini & Hochberg, 1995). This showed a significant difference between the hierarchical demonstration condition and the no-feedback condition (z = 2.7, p = .01) as well as the feedback-based tutor condition (z = 2.79, p = .01). We observed no significant difference between the hierarchical demonstration condition and the non-hierarchical demonstration condition (z = 0.61, p = .542).

To test how well participants learned to follow the optimal strategy, we further calculated to what extent each participant’s planning operations matched the near-optimal strategy discovered by our method. These agreement measures were calculated per planning operation the participant performed. For each planning operation, we reconstructed the current belief state, used our method to calculate the set of optimal actions, and compared this set to the action taken by the participant. We found that participants in the hierarchical demonstration condition matched the automatically discovered strategy better (across all test trials performed by participants of this condition, participants chose one of the optimal planning operations in 35.84% ± 15% of their actions) than participants in the no-feedback control condition (22.08% ± 12%), the feedback condition (27.44% ± 13%), and the non-hierarchical demonstration condition (28.51% ± 13%). The ANOVA-type statistic with Box approximation (Box, 1954) showed a significant effect of condition on click agreement (F(3.00, 156.70) = 21.73, p < .001). Post hoc ATS comparisons showed significant differences between the hierarchical demonstration condition and the control conditions: the no-feedback condition (F(1) = 64.00, p < .001), the feedback-based cognitive tutor (F(1) = 20.90, p < .001), and the non-hierarchical demonstration (F(1) = 16.59, p < .001). Additional significant effects were present between the no-feedback condition and the feedback-based cognitive tutor (F(1) = 11.44, p < .001) as well as the non-hierarchical demonstration (F(1) = 16.36, p < .001). No significant difference was found between participants taught by the feedback-based cognitive tutor and the non-hierarchical demonstration (F(1) = 0.35, p = .56). We further analyzed which aspects of the optimal strategy participants learned. Results of the extended analysis, consisting of 6 additional measures, can be found in Table 5.
In this analysis, we split the click agreement measure into three sub-categories: goal agreement, path agreement, and termination agreement. Goal agreement and path agreement measure how well participants’ clicks matched the optimal strategy when planning which goal to pursue (goal agreement) and how to achieve the selected goal (path agreement), respectively; higher agreement with the optimal strategy corresponds to better performance. We further split the termination measure into overall termination agreement, goal termination agreement, and path termination agreement. Overall termination agreement measures how well participants stopped planning when it was no longer beneficial. Goal and path termination agreement measure how well participants switched between planning which goal to pursue and planning how to achieve the selected goal. When calculating path agreement and path termination agreement, we allowed for mistakes in goal selection (i.e., planning an optimal path to a suboptimally chosen goal still results in a high path agreement score). The termination agreement measures were calculated using balanced accuracy to give equal weight to not terminating too early and not terminating too late.

Participants of all conditions matched the optimal strategy for selecting goals roughly equally well, with agreement scores ranging from 32% to 38%. Participants trained by the feedback-based tutor selected goals matching the optimal strategy slightly more often (38.08%) than participants of the hierarchical demonstration condition (34.86%), but these differences were not statistically significant (F(2.94, 151.95) = 2.6, p = .056). The agreement measure for planning how to achieve the selected goal showed significant differences (F(2.95, 152.54) = 28.45, p < .001). Participants trained by the hierarchical demonstration achieved a higher agreement score (44.45%) than participants of the other conditions: no-feedback (17.82%; F(1) = 74.57, p < .001), feedback-based tutor (23.00%; F(1) = 42.39, p < .001), and non-hierarchical demonstration (30.37%; F(1) = 15.88, p < .001).

Overall, path and goal agreement were relatively low across conditions, with under 50% of participants’ clicks matching the resource-rational strategy demonstrated by our intelligent tutor. We identified the complicated reward structure of the goal nodes as a potential cause for the low overall agreement. The increase in the variance of goal rewards from 100 to 180 (see “Evaluation in Environments That Conform to the Assumed Structure”) results in an optimal strategy that first inspects goals 5 and 10, whereas inspecting other goal nodes first is slightly suboptimal. This detail of the optimal strategy was not well understood by participants: while participants learned to click a goal node as their first planning action in the vast majority of trials (93.66% in the no-feedback condition, > 99% in all other conditions), they learned to click either goal 5 or goal 10 first in only slightly more than half of the trials (50.73% in the no-feedback condition, 56.58% in the feedback-based tutor condition, 57.67% in the non-hierarchical demonstration condition, and 62.11% in the hierarchical demonstration condition). Lastly, we compared the termination agreement for overall termination, goal termination, and path planning termination.
Across all termination agreement measures and conditions, participants learned when to terminate quite well (> 60% agreement). Participants trained by the hierarchical demonstration matched the optimal strategy slightly more closely than participants of the other conditions on all three measures. For overall termination agreement, the conditions differed significantly (F(2.93, 151.19) = 3.29, p = .023): the post hoc ATS showed that participants of the hierarchical demonstration condition achieved significantly higher agreement than participants of the non-hierarchical demonstration condition (F(1) = 11.22, p < .001), but not than participants of the no-feedback or the feedback-based tutor condition (p > .1 for both). We did not detect a significant difference between the conditions for either goal termination agreement (F(2.90, 148.66) = 2.14, p = .099) or path termination agreement (F(2.84, 144.14) = 2.62, p = .057). Overall, these results show that the hierarchical method can discover and teach resource-rational strategies for environments in which previous methods failed.
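The agreement measures used in this analysis can be summarized in a short sketch. The belief-state reconstruction and the optimal-action oracle of our method are placeholders here; the functions below only illustrate how per-operation click agreement and the balanced-accuracy termination measures aggregate the individual decisions.

```python
# Minimal sketch of the agreement measures; inputs are hypothetical.

def click_agreement(participant_ops, optimal_ops_per_step):
    """participant_ops: the planning operations a participant performed, in order;
    optimal_ops_per_step: for each step, the set of operations that are optimal
    given the reconstructed belief state (placeholder for our method's output)."""
    if not participant_ops:
        return float("nan")
    hits = sum(op in optimal for op, optimal in zip(participant_ops, optimal_ops_per_step))
    return hits / len(participant_ops)

def balanced_accuracy(stopped, should_stop):
    """Termination agreement: average of the rate of stopping when the optimal
    strategy stops (sensitivity) and continuing when it continues (specificity)."""
    tp = sum(s and o for s, o in zip(stopped, should_stop))
    tn = sum((not s) and (not o) for s, o in zip(stopped, should_stop))
    pos = sum(should_stop) or 1
    neg = (len(should_stop) - sum(should_stop)) or 1
    return 0.5 * (tp / pos + tn / neg)

# Example: 3 of 4 operations were optimal; termination matched at 3 of 4 decision points.
print(click_agreement(["g5", "g10", "n3", "stop"],
                      [{"g5", "g10"}, {"g10"}, {"n3", "n4"}, {"n7"}]))            # 0.75
print(balanced_accuracy([True, False, False, True], [True, False, True, True]))  # ≈ 0.83
```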

Table 5 Extended strategy analysis for Training Experiment 1. For each condition we report the percentage of participants following a goal planning strategy and the mean and standard deviation of the goal, path, and termination agreement measures

Training Experiment 2: Teaching People the Optimal Strategy for a Risky Environment

In Experiment 2, participants were taught the strategy our method discovered for the 8-step decision problem illustrated in Fig. 6. Critically, in this environment, each path contains one risky node that harbors an extreme loss with a probability of 10%. Therefore, the optimal strategy for this environment inspects the risky node while planning how to achieve the selected goal and then switches to another goal when it encounters a large negative reward on the path to the initially selected goal.

Method

To test whether our approach can also improve people’s performance in environments with this more complex structure, we created two demonstration-based cognitive tutors that teach the strategies discovered by hierarchical BMPS with goal-switching and hierarchical BMPS without goal-switching, respectively, and a feedback-based tutor that teaches the optimal strategy for a 3-step version of the risky environment.

This experiment used a Flight Planning task that is analogous to the environment described in “Evaluation in Risky Environments” (see Fig. 6). Specifically, the environment comprises 4 goal nodes and 60 intermediate nodes (i.e., 15 per goal). Although each goal can be reached through multiple paths, all of those paths lead through an unavoidable node that has a 10% risk of harboring a large loss of − 1500. The aim of this experiment is to verify that we are still able to improve human planning even when the environment requires a more complex strategy that occasionally switches goals during planning. To test this hypothesis, we showed participants in the experimental condition demonstrations of the strategy discovered by our method and compared their performance to the resource-rationality scores of three control groups. The first control group was shown demonstrations of the strategy discovered by the version of our method without goal-switching; the second control group discovered their own strategy in five training trials; the third control group practiced planning on a three-step task with a feedback-based tutor (see “Intelligent Cognitive Tutors”) (Lieder et al., 2019, 2020). The environment used by the feedback-based tutor mimicked the high-risk environment. To achieve this, we changed the reward distribution of the intermediate nodes so that there is a 10% chance of a negative reward of − 96, a 30% chance of − 4, a 30% chance of + 4, and a 30% chance of + 8. We then recomputed the optimal feedback using dynamic programming (Lieder et al., 2019, 2020).
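As an illustration of this modified reward distribution, the snippet below samples intermediate-node rewards with the probabilities stated above. It is a hypothetical sketch rather than the experiment's implementation.

```python
# Minimal sketch of the intermediate-node reward distribution used to make the
# small feedback-tutor environment mimic the high-risk task.
import numpy as np

rng = np.random.default_rng(1)
values = np.array([-96, -4, 4, 8])
probs  = np.array([0.10, 0.30, 0.30, 0.30])

def sample_intermediate_reward(n=1):
    return rng.choice(values, size=n, p=probs)

print(sample_intermediate_reward(5))
print("expected value:", float(values @ probs))  # ≈ -7.2, i.e., risky on average
```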

We recruited 201 participants (average age 34.01 years, range: 19–70 years; 101 female) on Amazon’s Mechanical Turk (Litman et al., 2017) over three consecutive days. All but two of them completed the assignment. Applying the same pre-determined exclusion criterion as in Experiment 1 (i.e., excluding participants who did not engage with the environment in more than half of the test trials) led to the exclusion of 30 participants (15%), leaving us with 169 participants. Participants were paid $1.30 plus a performance-dependent bonus of up to $1.00. The average bonus was $0.56 and the average duration of the experiment was 16.28 min. Participants were randomly assigned to one of four experimental conditions that determined their training in the planning task. All groups were tested on five identical test trials. The data were analyzed using the robust f1.ld.f1 function of the nparLD R package (Noguchi et al., 2012). As in Training Experiment 1, the participant’s resource-rationality score for each trial was calculated as the expected reward of the path they chose minus the cost of the clicks they had made (Footnote 8).

Results

The results of the experiment are summarized in Table 6. Since Shapiro-Wilk tests showed that the scores in none of the four conditions were normally distributed (p < .001 for all), we again used the non-parametric ANOVA-type statistic with Box approximation (Box, 1954) to analyze the data. The test showed significant differences between the four groups (F(2.79, 150.28) = 20.19, p < .001). Pairwise robust ATS post hoc comparisons showed that participants trained with demonstrations of the strategy discovered by hierarchical BMPS with goal-switching significantly outperformed all other conditions: participants trained by purely hierarchical demonstrations without goal-switching (F(1) = 58.72, p < .001), participants who did not receive demonstrations (F(1) = 31.13, p < .001), and participants who had practiced planning with optimal feedback on a smaller analogous environment (F(1) = 32.07, p < .001). The performance of the three control groups was statistically indistinguishable (all p ≥ .18). Participants trained by purely hierarchical demonstrations did not perform significantly better than participants trained with optimal feedback (F(1) = 1.82, p = .18) or participants who did not receive demonstrations (F(1) = 0.34, p = .56). Additionally, there was no significant difference between the optimal feedback condition and the no-feedback condition (F(1) = 0.28, p = .6).

Table 6 Experimental results for Training Experiment 2 using the high-risk environment shown in Fig. 6. For each condition we report the number of participants, the mean and standard deviation of the resource-rationality score (R-R score), and the mean and standard deviation of the click agreement measure

Participants in the goal-switching demonstration condition learned goal planning to a high degree (i.e., 60%). We again used z-tests and applied the Benjamini-Hochberg method (Benjamini & Hochberg, 1995) to compare goal planning performance between conditions. This showed that participants of the goal-switching demonstration condition learned goal planning better than those of the other conditions: no-feedback (35.71%; z = 2.265, p = .023), feedback-based tutor (20.93%; z = 3.725, p < .001), and hierarchical demonstration (25.64%; z = 3.164, p = .002).

We repeated our statistical analysis for the click agreement measure, finding a significant effect of the experimental condition (F(2.58, 132.19) = 47.45, p < .001). Comparing participants’ strategies to the strategy discovered by our method using the post hoc ATS showed that the planning operations of participants trained with our goal-switching strategy matched those of the optimal strategy more often (59.94% ± 29%) than those of participants in the no-feedback condition (24.33% ± 24%; F(1) = 56.62, p < .001), the feedback tutor condition (17.22% ± 14%; F(1) = 164.80, p < .001), and the hierarchical demonstration condition without goal-switching (42.98% ± 23%; F(1) = 10.82, p = .001). We found additional significant differences between the hierarchical demonstration condition without goal-switching and the other two control conditions: optimal feedback (F(1) = 82.61, p < .001) and no-feedback (F(1) = 22.80, p < .001). Comparing the optimal feedback and no-feedback conditions revealed no significant difference (F(1) = 3.01, p = .082).

We again analyzed which aspects of the optimal strategy participants learned using the additional click agreement measures reported in “Results.” The results for these additional measures can be found in Table 7. We found significant differences in how well participants planned which goal to pursue (F(2.41, 120.22) = 32.33, p < .001). Participants in the goal-switching demonstration condition (78.21%) and the hierarchical demonstration condition (77.73%) performed best at selecting goals according to the optimal strategy and did not differ significantly from each other (F(1) = 0.05, p = .83). Participants in the goal-switching demonstration condition had significantly higher goal agreement than participants in the no-feedback condition (57.76%; F(1) = 11.36, p < .001) and participants of the feedback tutor condition (36.01%; F(1) = 110.5, p < .001). Path planning agreement also showed significant differences between conditions (F(2.45, 117.43) = 19.12, p < .001). Participants trained by the goal-switching demonstration achieved an agreement score of 72.71%, whereas the other conditions achieved significantly lower agreement: no-feedback (24.47%; F(1) = 42.53, p < .001), hierarchical demonstration (21.50%; F(1) = 36.96, p < .001), and feedback tutor (30.12%; F(1) = 37.89, p < .001). Similarly, participants trained by the goal-switching demonstration achieved higher termination agreement scores, matching the optimal strategy’s termination rule more closely than participants of the no-feedback, feedback-based tutor, and hierarchical demonstration conditions across all three termination agreement measures. For all three termination measures, the differences between conditions were statistically significant: termination agreement (F(2.96, 161.39) = 10.69, p < .001), goal termination agreement (F(2.77, 145.41) = 17.62, p < .001), and path termination agreement (F(2.86, 152.32) = 10.54, p < .001).
When deciding when to stop planning (termination agreement), participants who received goal-switching demonstrations matched the optimal strategy to a higher percentage (76.35%) than participants of the no-feedback condition (69.19%; F(1) = 5.65, p = .017), the hierarchical demonstration condition (68.35%; F(1) = 7.41, p = .006), and the feedback tutor condition (63.05%; F(1) = 34.31, p < .001). When deciding when to stop planning which goal to pursue (goal termination agreement), participants of the goal-switching demonstration condition (84%) performed better than participants who received no feedback (71.75%; F(1) = 8.62, p = .003) and participants in the feedback-tutor condition (57.31%; F(1) = 61.80, p < .001), but there was no significant difference to participants who received hierarchical demonstrations (78.19%; F(1) = 2.65, p = .103). Lastly, participants who received the goal-switching demonstration (79.66%) also performed significantly better than all other conditions when deciding when to stop planning how to achieve the pursued goal (path termination agreement): no-feedback (61.85%; F(1) = 12.26, p < .001), hierarchical demonstration (54.91%; F(1) = 21.35, p < .001), and feedback-tutor (58.21%; F(1) = 26.83, p < .001).

Table 7 Extended strategy analysis for Training Experiment 2. For each condition we report the percentage of participants following a goal planning strategy and the mean and standard deviation of the goal, path, and termination agreement measures

The results of this experiment show that we can significantly improve human decision-making by showing people demonstrations of the automatically discovered hierarchical planning strategy with goal-switching. This is a unique advantage of our new method because none of the other approaches was able to improve people’s decision-making in this large and risky environment. Comparing human performance to the optimal performance of our algorithm in the same environment (see Table 2) shows that, even though we were able to improve human performance, participants still did not fully master the strategy based on the demonstrations alone. This reveals the limitations of teaching planning strategies purely with demonstrations, especially for more complex strategies. Improving upon the pedagogy of our purely demonstration-based hierarchical tutor is an important direction for future work. The different click agreement measures further showed that, while participants who received demonstrations of our method did not entirely master the optimal strategy, they chose the optimal planning operation in a significantly higher proportion of trials.

Control Experiment

To investigate a potential confound we ran an additional experiment using the 8-step decision problem shown in Fig. 6 (i.e., the environment used in Experiment 2). The goal of this experiment was to investigate whether participants truly benefit from demonstrations of the near-optimal strategy or if they merely learn about the environment’s structure and then use this knowledge to inform their own strategy.

Methods

The hypothesis we want to verify in this experiment is that participants trained by our tutor would learn to follow the optimal strategy more closely than participants who learn the same information about the environment structure (e.g., potential rewards of different nodes) but do not see any demonstrations of the strategy discovered by our method. To test this hypothesis, we compared the performance of participants who were shown demonstrations of our method (analogous to the goal-switching demonstration condition in Experiment 2) to a new control condition. Participants in the control condition were shown random demonstrations in which the demonstrator clicks random nodes and then moves along a random path through the environment. The actions of the random demonstrations were matched to the types of nodes clicked in the demonstration of our method (e.g., if the demonstration of our method revealed two goal nodes and two risky nodes, the random demonstration in that environment would also click two random goal nodes and two random risky nodes). Additionally, the random demonstrations contained 4–6 randomly selected additional clicks that were not included in our method’s demonstration, arguably conveying even more information about the structure of the environment than our method’s demonstration did.
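The following sketch illustrates how such yoked random demonstrations could be generated. The function and its bookkeeping of node types are hypothetical; it only mirrors the matching procedure described above.

```python
# Minimal sketch of the yoking procedure: match the demonstrated clicks
# type-for-type with random nodes, then add 4-6 extra random clicks.
import random

def yoked_random_clicks(method_clicks, node_types, nodes_by_type, seed=0):
    """method_clicks: nodes clicked by the demonstrated strategy;
    node_types: {node: 'goal' | 'risky' | 'intermediate'} for every clickable node;
    nodes_by_type: {type: list of nodes of that type}."""
    rng = random.Random(seed)
    clicks = []
    # Match each demonstrated click with a random node of the same type.
    for node in method_clicks:
        candidates = [n for n in nodes_by_type[node_types[node]] if n not in clicks]
        clicks.append(rng.choice(candidates))
    # Add 4-6 additional random clicks not tied to the demonstrated strategy.
    remaining = [n for n in node_types if n not in clicks]
    clicks += rng.sample(remaining, k=min(rng.randint(4, 6), len(remaining)))
    return clicks
```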

We recruited 100 participants (average age 25.06 years, range: 18–48; 53 female) on Prolific (http://www.prolific.co, 2021). Applying the exclusion criterion used in Experiment 1 (i.e., excluding participants who did not engage with the task) led to the exclusion of 20 participants (20%). Two additional participants were removed because they indicated that they did not try their best to achieve a high score. Participants were paid £2.00 plus a performance-dependent bonus of up to £1.00. The average bonus was £0.50 and the average completion time was 18.9 min. Participants were randomly assigned to one of the two conditions. The data were analyzed analogously to Experiment 1 and Experiment 2 using the nparLD R package (Noguchi et al., 2012).

Results

The results for this experiment can be found in Table 8 and Table 9. We used the non-parametric ANOVA-type statistic with Box approximation (Box, 1954) to analyze the data. Our analysis showed a significant difference between the resource-rationality scores of the two conditions (F(1, 74.86) = 5.53, p = .021). Participants of the goal-switching demonstration condition achieved a higher resource-rationality score than participants who observed random demonstrations. With 59.52%, a significantly higher percentage of participants of the goal-switching demonstration condition learned to follow the general goal planning strategy of first selecting a goal and then planning how to achieve it than participants of the control condition (22.22%; z = 3.324, p < .001). We again compared how well participants matched the optimal strategy (click agreement) and observed a higher match for the goal-switching demonstration condition (48.71% ± 28%) than for the random demonstration condition (25.28% ± 23%). Analogous to the analysis of the previous experiments, we used the ANOVA-type statistic to confirm the statistical significance of this difference (F(1, 74.92) = 29.43, p < .001). Comparing the detailed click agreement measures reported in Table 9, participants who received the goal-switching demonstration of the optimal strategy matched the optimal strategy significantly more closely on the following measures: goal agreement (F(1, 64.17) = 8.09, p = .006), path agreement (F(1, 75.59) = 5.38, p = .023), and termination agreement (F(1, 72.97) = 4.73, p = .033). We did not detect a significant difference between the two conditions for goal termination agreement (F(1, 71.91) = 1.77, p = .188) or path termination agreement (F(1, 74.26) = 0.39, p = .53).

Table 8 Experimental results for the control experiment using the high-risk environment shown in Fig. 6. For each condition we report the number of participants, the mean and standard deviation of the resource-rationality score (R-R score), and the mean and standard deviation of the click agreement measure
Table 9 Extended strategy analysis for the control experiment. For each condition we report the percentage of participants following a goal planning strategy and the mean and standard deviation of the goal, path, and termination agreement measures

The results of this experiment show that teaching participants the discovered strategy leads to significantly better resource-rationality scores than purely teaching them about the reward structure of the environment. Although participants of the random demonstration condition received even more information about the environment’s rewards, they were unable to achieve the same level of resource-rationality. The click agreement measures further showed that participants of the goal-switching demonstration condition learned the optimal strategy to a greater extent. We observed that participants in the goal-switching demonstration condition of this experiment performed worse than participants in the same condition in Training Experiment 2 (see Table 6). Apart from some randomness due to sampling from a different online crowd-sourcing platform, we believe this discrepancy was mostly caused by a small number of outliers. Comparing the median expected scores shows a much smaller discrepancy between the experiments: − 57.5 (interquartile range: 145) for this experiment and − 20 (interquartile range: 140) for Training Experiment 2. Importantly, in both experiments we observed a statistically significant improvement in the expected resource-rationality score and click agreement.

General Discussion

To make good decisions in complex situations people and machines have to use efficient planning strategies because planning is costly. Efficient planning strategies can be discovered automatically. But computational challenges confined previous strategy discovery methods to tiny problems. To overcome this problem, we devised a more scalable machine learning approach to automatic strategy discovery. The central idea of our method is to decompose the strategy discovery problem into discovering goal-setting strategies and discovering goal-achievement strategies. In addition, we made a substantial algorithmic improvement to the state-of-the-art method for automatic strategy discovery (Callaway et al., 2018a) by introducing the tree contraction method. We found that this hierarchical decomposition of the planning problem, together with our tree contraction method, drastically reduces the time complexity of automatic strategy discovery without compromising on the quality of the discovered strategies in many cases. Furthermore, by introducing the tree contraction method we have extended the set of environment structures that automatic strategy discovery can be applied to from trees to directed acyclic graphs. These advances significantly extend the range of strategy discovery problems that can be solved by making the algorithm faster, more scalable, and applicable to environments with more complex structures. This is an important step towards discovering efficient planning strategies for real-world problems.

Recent findings suggest that teaching people automatically discovered efficient planning strategies is a promising way to improve their decisions (Lieder et al., 2019, 2020). Due to computational limitations, this approach was previously confined to sequential decision problems with at most three steps (Lieder et al., 2019, 2020). The strategy discovery methods developed in this article make it possible to scale up this approach to larger and more realistic planning tasks. As a proof-of-concept, we showed that our method makes it possible to improve people’s decisions in planning tasks with up to 7 steps and up to 10 final goals. We evaluated the effectiveness of showing people demonstrations of the strategies discovered by our method in two separate experiments in which the environments were so large that previous methods were unable to discover planning strategies within a time budget of 8 h. Thus, the best one could do to train people with previous methods was to construct cognitive tutors that taught the optimal strategy for a smaller environment with a similar optimal strategy, or to have people practice without feedback. Evaluating our method against these alternative approaches, we found that our approach was the only one that was significantly more beneficial than having people practice the task on their own. To the best of our knowledge, this makes our algorithm the only strategy discovery method that can improve human performance on sequential decision problems of this size. This suggests that our approach makes it possible to leverage reinforcement learning to improve human decision-making in problems that were out of reach for previous intelligent tutors.

Our empirical findings have several important implications for the debate about human rationality. First, we found that across many instances of two different classes of large sequential decision-making problems, the resource-rationality score of human planning was substantially lower than that of the approximately resource-rational planning strategies discovered by our model. Second, we found that teaching people the automatically discovered planning strategies significantly improved their resource-rationality score in both environments. These findings suggest that, unlike in small problems (Callaway et al., 2018b, 2020), the planning strategies people use in large sequential decision problems might be far from resource-rational. This interpretation should be taken with a grain of salt since some of the discrepancies could be due to unaccounted-for cognitive costs. But suboptimal choices of planning operations likely play an important role as well. Our findings add three important nuances to the resource-rational perspective on human rationality (Lieder & Griffiths 2020a, b). First, they suggest that human cognition might systematically deviate from the principles of resource-rationality. Second, they suggest that the magnitude of these deviations depends on the complexity of the task. Third, they suggest that those suboptimalities can be ameliorated by teaching people resource-rational strategies.

Our method’s hierarchical decomposition of the planning problem exploits that people can typically identify potential mid- or long-term goals that might be much more valuable than any of the rewards they could attain in the short run. This corresponds to the assumption that the rewards available in more distant states are more variable than the rewards available in more proximal states. When this assumption is satisfied, our method discovers planning strategies much more rapidly than previous methods, and the discovered strategies are as good as or better than those discovered with the best previous methods. When this assumption is violated, the goal-switching mechanism of our method can compensate for that mismatch. This allows the discovered strategies to perform almost as well as the strategies discovered by BMPS. The more strongly the assumption is violated, the more our method relies on this mechanism; in doing so, it automatically trades off its computational speed-up against the quality of the resulting strategy. This shows that our method is robust to violations of its assumptions about the structure of the environment; it exploits simplifying structure only when it exists.

Some aspects of our work share similarities with recent work on goal-conditioned planning (Nasiriany et al., 2019; Pertsch et al., 2020), although the problem we solved is conceptually different. For comparison, both aforementioned methods optimize the route to a given final location, whereas our method learns a strategy for solving sequential decision problems where the strategy chooses the final state itself. Furthermore, while Nasiriany et al. (2019) specified a fixed strategy for selecting the sequence of goals, our method learns such a strategy itself. Critically, while policies learned by Nasiriany et al. (2019) select physical actions (e.g., move left), the metalevel policies learned by our method select planning operations (i.e., simulate the outcome of taking action a in state s and update the plan accordingly). Finally, our method explicitly considers the cost of planning to find algorithms that achieve the optimal trade-off between the cost of planning and the quality of the resulting decisions.

Our method’s scalability comes at a price. Since our approach decomposes the full sequential decision problem into two sub-problems (goal selection and goal planning), its accuracy can be limited by the fact that it never considers the whole problem space at once. This is unproblematic when the environment’s structure matches our method’s assumption that the rewards of potential goals are more variable than more proximal rewards, but it could be problematic when this assumption is violated too strongly. We mitigated this potential problem by allowing the strategy discovery algorithm to switch goals. Even with this adaptation, the discovered strategy is not optimal in all cases: since the representation of the alternative goals’ rewards is their average expected reward, the algorithm will only switch goals if the current goal’s reward is below average. However, if the current goal’s expected return is above average, the discovered strategy will not explore other goals, even when doing so would lead to a higher reward. On balance, we think that the scalability of our method to large environments outweighs this minor loss in resource-rationality score.
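The following minimal sketch illustrates this limitation of the goal-switching rule; the function and its inputs are illustrative rather than our implementation.

```python
# Minimal sketch: alternative goals are summarized by their average expected
# reward, so the strategy only reconsiders them when the current goal's expected
# reward drops below that average. Names and values are illustrative.
def should_switch_goal(current_goal_expected_reward, alternative_goal_expected_rewards):
    if not alternative_goal_expected_rewards:
        return False
    average_alternative = (sum(alternative_goal_expected_rewards)
                           / len(alternative_goal_expected_rewards))
    return current_goal_expected_reward < average_alternative

# Switches after discovering a large loss on the path to the current goal ...
print(should_switch_goal(-1500 + 60, [55.0, 40.0, 70.0]))   # True
# ... but never explores alternatives while the current goal looks above average,
# even though one specific alternative (70.0) would be better.
print(should_switch_goal(60.0, [55.0, 40.0, 70.0]))          # False
```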

The advances presented in this article open up many exciting avenues for future work. For instance, our approach could be extended to plans with potentially many levels of hierarchically nested subgoals. Future work might also extend our method so that any state can be selected as a goal.

In its current form, our algorithm always selects only the environment’s most distant states (leaf nodes) as candidate goals. Future versions might allow the set of candidate goals to be chosen more flexibly such that some leaf nodes can be ignored and some especially important intermediate nodes in the tree can be considered as potential sub-goals. A more flexible definition and potentially a dynamic selection of goal nodes could increase the strategy discovery algorithm’s performance and possibly allow us to solve a wider range of more complex problems. This would mitigate the limitations of the increasing variance assumption by considering all potentially valuable states as (sub)goals regardless of where they are located. Another important direction for future work is to define more realistic decision problems within the meta-MDP framework. Since we show in this work that our method is able to improve human planning in a number of abstract example environments, a logical next step is to apply our method to a wider range of more realistic scenarios and validate that people are still able to learn and benefit from the discovered strategies in these environments.

The advances reported in this article have potential applications in artificial intelligence, cognitive science, and human-computer interaction. First, since the hierarchical structure exploited by our method exists in many real-world problems, it may be worthwhile to apply our approach to discovering planning algorithms for other real-world applications of artificial intelligence where information is costly. This could be a promising step towards AI systems with a (super)human level of computational efficiency. Second, our method also enables cognitive scientists to scale up the resource-rational analysis methodology for understanding the cognitive mechanisms of decision-making (Lieder and Griffiths, 2020b) to increasingly more naturalistic models of the decision problems people face in real life. Third, future work will apply the methods developed in this article to train and support people in making real-world decisions they frequently face.

Our approach is especially relevant when acquiring information that might improve a decision is costly or time-consuming, as is the case in many real-world decisions. For instance, when a medical doctor plans how to treat a patient’s symptoms, acquiring an additional piece of information might mean ordering an MRI scan that costs $1000. Similarly, a holiday planning app would have to be mindful of the user’s time when deciding which series of places and activities the user should evaluate to efficiently plan their road trip or vacation. Similar trade-offs exist in project planning, financial planning, and time management. Furthermore, our approach can also be applied to support the information collection process behind hiring decisions, purchasing decisions, and investment decisions. Such complicated tasks call for a hierarchical approach to discovering the optimal strategy. Our approach could be used to help train people to be better decision-makers with intelligent tutors (Lieder et al., 2019, 2020). Alternatively, the discovered strategies could be conveyed by decision support systems that guide people through real-life decisions by asking a series of questions; each question the system asks would correspond to an adaptively chosen information-gathering operation. For example, more powerful versions of our approach could be used to help people plan what they want to do to achieve a high quality of life until the age of 80. Such problems require multi-step planning, and their outcomes are shaped by decisions made early on. Even if people do not engage in multi-step decision-making, the problem persists, and our approach could improve human performance wherever the outcome depends on a long series of actions.

In summary, the reinforcement learning method developed in this article is an important step towards intelligent systems with (super)human-level computational efficiency, towards understanding how people make decisions, and towards leveraging artificial intelligence to improve human decision-making in the real world. At a high level, our findings support the conclusion that incorporating cognitively informed hierarchical structure into reinforcement learning methods can make them more useful for real-world applications.