Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (2024)

11institutetext: Shanghai Key Laboratory of Trustworthy Computing, ECNU, Shanghai, China11email: dhdu@sei.ecnu.edu.cn

Jihui Nie11  Dehui Du(πŸ–‚)11  Jiangnan Zhao11

Abstract

Intelligent Cyber-Physical Systems (ICPS) represent a specialized form of Cyber-Physical System (CPS) that incorporates intelligent components, notably Convolutional Neural Networks (CNNs) and Deep Reinforcement Learning (DRL), to undertake multifaceted tasks encompassing perception, decision-making, and control. The utilization of DRL for decision-making facilitates dynamic interaction with the environment, generating control actions aimed at maximizing cumulative rewards. Nevertheless, the inherent uncertainty of the operational environment and the intricate nature of ICPS necessitate exploration within complex and dynamic state spaces during the learning phase. DRL confronts challenges in terms of efficiency, generalization capabilities, and data scarcity during decision-making process. In response to these challenges, we propose an innovative abstract modeling approach grounded in spatial-temporal value semantics, capturing the evolution in the distribution of semantic value across time and space. A semantics-based abstraction is introduced to construct an abstract Markov Decision Process (MDP) for the DRL learning process. Furthermore, optimization techniques for abstraction are delineated, aiming to refine the abstract model and mitigate semantic gaps between abstract and concrete states. The efficacy of the abstract modeling is assessed through the evaluation and analysis of the abstract MDP model using PRISM. A series of experiments are conducted, involving diverse scenarios such as lane-keeping, adaptive cruise control, and intersection crossroad assistance, to demonstrate the effectiveness of our abstracting approach.

Keywords:

Markov Decision Process Spatio-temporal Value Semantics Deep Reinforcement Learning Abstract Modeling PRISM.

1 Introduction

The Cyber-Physical System (CPS) integrates computing, networking, and physical environments, expertly coordinated by computer and communication components with a joint monitoring mechanism[42]. The evolution of the Intelligent Cyber-Physical System (ICPS) as a mainstream paradigm is marked by the integration of AI-enabled components such as controllers or sensors[35]. Notably, the utilization of Deep Reinforcement Learning (DRL) in decision-making, as emphasized by Brunke et al.[8], is particularly promising due to its innate ability for dynamic interaction within the environment.

However, the efficacy of DRL encounters a formidable challenge in the form of an expansive state space, leading to prolonged algorithmic convergence and heightened complexities in the formal verification of the learning process. Furthermore, the inherent black-box nature of DRL poses challenges when applied to safety-critical ICPS scenarios, such as Autonomous Driving Systems (ADSs) and robots. To surmount the challenge posed by the vast state space in ICPS tasks, strategic compression becomes imperative. Dense DRL is apromising field to address these issues[14].An effective abstraction modeling in this context involves leveraging prior knowledge for generalization from concrete to abstract states[25].

The existing abstraction modeling methods can be roughly divided into three categories. In the first category, the focus lies on abstracting similar states, thereby addressing the challenge of sparse rewards in DRL[13, 29, 26]. Strategic games, such as Go[37], utilize hierarchical organization based on the significance of empty points on the board’s corners and edges. This approach, inspired by consistent patterns in Go, allows the amalgamation of game elements into a unified abstract state, offering a potential solution for the slow convergence issue. Real-time strategy games[28] similarly represent states through collections of game elements and positions. The second category involves temporal abstraction, extending decision-making over time[21, 5, 17]. This concept extends to natural language processing, where units of state are often identified as words or phrases through corpus analysis[7, 33]. This offers a promising avenue for effectively compressing the state space, thereby facilitating improved algorithmic convergence in DRL. In the realm of ICPS, the third category focuses on state-action abstraction, a method extensively applied in addressing the intricate challenges unique to these systems. Notably, empirical evidence establishes a polynomial relationship between the number of ICPS samples and the state space’s size[1], emphasizing the nuanced nature of the challenge. For instance, the operation of ADS unfolds within open and dynamic environments characterized by inherent complexity and unpredictability. This complexity manifests in two key aspects. Firstly, the expansive state space explored by DRL incurs substantial exploration costs due to its high dimensions, contributing to discernible scalability limitations, as extensively discussed in existing literature[41]. Secondly, the state-action abstraction approach, while effective in enhancing abstract efficiency and addressing gaps in the abstract model, tends to overlook the consistency of the constructed model in the formal method, leading to potential distortions in the model’s representation.

In the delineated classification, it becomes evident that traditional approaches to abstraction typically concentrate on minimizing model size while concurrently preserving model accuracy. However, when confronted with the complexities of ICPS, characterized by abundant randomness and uncertainty in the state space, these models frequently fall short of fulfilling the requisites for formal verification. Establishing a verifiable abstract model for ICPS is an imperative undertaking, underscoring the necessity to meticulously tailor the granularity of abstraction to the specific requirements and intricacies of the application context. Striking a delicate balance between mitigating state explosion and sustaining optimal model performance is crucial, as expounded in the work by Schmidt et al.[36].

Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (1)

The main contributions in our work include:

  1. 1.

    We propose the spatio-temporal value metric to measure the similarity between states. It helps to make semantic-preserving abstraction effectively.

  2. 2.

    We propose a novel abstraction modeling approach to construct the abstract MDP model based on spatio-temporal value metric. Moreover, we optimize the abstract model to make the abstract state space closer to the original state space.

  3. 3.

    We use compression ratio and mean absolute error to measure the accuracy of the model. We innovatively use PRISM to check for semantic gaps between abstract and real models. We use abstract MDP to build a dense DRL framework to generate joint policies in online environments.

Paper organization. The remainder of the paper is organized as follows. In Section2, we provide a review of pertinent background and preliminary information. In Section3, we present the overarching framework of our approach. In Section4, we conduct an assessment of our approach through several case studies related to ADS. Finally, in Section5, we review related work, and our conclusions are summarized in Section6.

2 Preliminaries

2.1 Deep Reinforcement Learning

DRL is an amalgamation of reinforcement learning (RL) and deep learning, training agents to achieve goals in simulated or real environments through a sequence of decision-making actions. By learning through continuous trial and error, DRL enables agents to explore environments and discover optimal strategies. At each time step, a DRL agent receives state information from the environment, which is used as input to a deep neural network (DNN). Subsequently, the agent selects an action from a set of possible actions and receives real-time rewards, aiming to maximize cumulative rewards. The process of DRL can be formalized using MDPs.

2.2 Markov Decision Process

Definition 1 (Markov Decision Process)

A Markov Decision Process is denoted by a tuple M=(S,s0,A,R,P,Ξ³)𝑀𝑆subscript𝑠0𝐴𝑅𝑃𝛾M=(S,s_{0},A,R,P,\gamma)italic_M = ( italic_S , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_A , italic_R , italic_P , italic_Ξ³ ), where S𝑆Sitalic_S represents a finite non-empty state set, s0∈Ssubscript𝑠0𝑆s_{0}\in Sitalic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT ∈ italic_S denotes the initial system state, A𝐴Aitalic_A stands for a finite non-empty action set, P:SΓ—AΓ—Sβ†’[0,1]:𝑃→𝑆𝐴𝑆01P:S\times A\times S\rightarrow[0,1]italic_P : italic_S Γ— italic_A Γ— italic_S β†’ [ 0 , 1 ] is the transition probability function such that for s∈S𝑠𝑆s\in Sitalic_s ∈ italic_S and a∈Aπ‘Žπ΄a\in Aitalic_a ∈ italic_A, βˆ‘sβ€²βˆˆSP⁒(s,a,sβ€²)=1subscriptsuperscriptπ‘ β€²π‘†π‘ƒπ‘ π‘Žsuperscript𝑠′1\sum_{s^{\prime}\in S}P(s,a,s^{\prime})=1βˆ‘ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ∈ italic_S end_POSTSUBSCRIPT italic_P ( italic_s , italic_a , italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ) = 1. R:SΓ—A→ℝ:𝑅→𝑆𝐴ℝR:S\times A\rightarrow\mathbb{R}italic_R : italic_S Γ— italic_A β†’ blackboard_R represents the reward assigned to the current state-action pair and γ∈(0,1)𝛾01\gamma\in(0,1)italic_Ξ³ ∈ ( 0 , 1 ) is the discount factor. The discount factor γ𝛾\gammaitalic_Ξ³ determines the importance of immediate rewards relative to future rewards. A policy for the MDP denoted as Ο€:Sβ†’A:πœ‹β†’π‘†π΄\pi:S\rightarrow Aitalic_Ο€ : italic_S β†’ italic_A, maps states to actions.

The MDP M𝑀Mitalic_M describes the evolution of the initial system state over discrete time steps. In DRL, the interaction between the policy Ο€πœ‹\piitalic_Ο€ and the environment, resulting in state transitions and immediate rewards forms the foundation of the learning process. The evaluation and optimization of the policy involve state value function and action value function.

Definition 2 (State Value Function V⁒(s)𝑉𝑠V(s)italic_V ( italic_s ))

The state value function V⁒(s)𝑉𝑠V(s)italic_V ( italic_s ) represents the expected return achievable under a policy Ο€πœ‹\piitalic_Ο€ from state s𝑠sitalic_s. The Bellman expectation equation for V⁒(s)𝑉𝑠V(s)italic_V ( italic_s ) in recursive form is:

V⁒(s)=βˆ‘a∈Aπ⁒(a|s)⁒[R⁒(s,a)+Ξ³β’βˆ‘sβ€²βˆˆSP⁒(sβ€²|s,a)⁒V⁒(sβ€²)].𝑉𝑠subscriptπ‘Žπ΄πœ‹conditionalπ‘Žπ‘ delimited-[]π‘…π‘ π‘Žπ›Ύsubscriptsuperscript𝑠′𝑆𝑃conditionalsuperscriptπ‘ β€²π‘ π‘Žπ‘‰superscript𝑠′V(s)=\sum_{a\in A}\pi(a|s)[R(s,a)+\gamma\sum_{s^{\prime}\in S}P(s^{\prime}|s,a%)V(s^{\prime})].italic_V ( italic_s ) = βˆ‘ start_POSTSUBSCRIPT italic_a ∈ italic_A end_POSTSUBSCRIPT italic_Ο€ ( italic_a | italic_s ) [ italic_R ( italic_s , italic_a ) + italic_Ξ³ βˆ‘ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ∈ italic_S end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT | italic_s , italic_a ) italic_V ( italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ) ] .(1)
Definition 3 (Action Value Function Q⁒(s,a)π‘„π‘ π‘ŽQ(s,a)italic_Q ( italic_s , italic_a ))

The action value function Q⁒(s,a)π‘„π‘ π‘ŽQ(s,a)italic_Q ( italic_s , italic_a ) represents the expected return achievable from taking action aπ‘Žaitalic_a in state s𝑠sitalic_s and following policy Ο€πœ‹\piitalic_Ο€. The Bellman expectation equation for Q⁒(s,a)π‘„π‘ π‘ŽQ(s,a)italic_Q ( italic_s , italic_a ) in recursive form is:

Q⁒(s,a)=R⁒(s,a)+Ξ³β’βˆ‘sβ€²βˆˆSP⁒(sβ€²|s,a)β’βˆ‘aβ€²βˆˆAπ⁒(aβ€²|sβ€²)⁒Q⁒(sβ€²,aβ€²).π‘„π‘ π‘Žπ‘…π‘ π‘Žπ›Ύsubscriptsuperscript𝑠′𝑆𝑃conditionalsuperscriptπ‘ β€²π‘ π‘Žsubscriptsuperscriptπ‘Žβ€²π΄πœ‹conditionalsuperscriptπ‘Žβ€²superscript𝑠′𝑄superscript𝑠′superscriptπ‘Žβ€²Q(s,a)=R(s,a)+\gamma\sum_{s^{\prime}\in S}P(s^{\prime}|s,a)\sum_{a^{\prime}\inA%}\pi(a^{\prime}|s^{\prime})Q(s^{\prime},a^{\prime}).italic_Q ( italic_s , italic_a ) = italic_R ( italic_s , italic_a ) + italic_Ξ³ βˆ‘ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ∈ italic_S end_POSTSUBSCRIPT italic_P ( italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT | italic_s , italic_a ) βˆ‘ start_POSTSUBSCRIPT italic_a start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ∈ italic_A end_POSTSUBSCRIPT italic_Ο€ ( italic_a start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT | italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ) italic_Q ( italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT , italic_a start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ) .(2)

With the rapid development in the field of DRL, many online, model-free learning algorithms have been proposed to fulfill various requirements[8], such as Deep Deterministic Policy Gradient (DDPG), Twin Delayed Deep Deterministic Policy Gradient (TD3), Actor-Critic (A2C), Proximal Policy Optimization (PPO), and more. The use of DRL controllers in place of traditional controllers holds great promise in large-scale systems with complex dynamics.

2.3 Abstract Markov Decision Process

Definition 4 (Abstract Markov Decision Process)

Let M=(S,s0,A,R,P,Ξ³)𝑀𝑆subscript𝑠0𝐴𝑅𝑃𝛾M=(S,s_{0},A,R,\\P,\gamma)italic_M = ( italic_S , italic_s start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_A , italic_R , italic_P , italic_Ξ³ ) denotes the true MDP, and M^=(S^,s^0,A^,P^,R^,Ξ³,Ο€^)^𝑀^𝑆subscript^𝑠0^𝐴^𝑃^𝑅𝛾^πœ‹\hat{M}=(\hat{S},\hat{s}_{0},\hat{A},\hat{P},\hat{R},\gamma,\hat{\pi})over^ start_ARG italic_M end_ARG = ( over^ start_ARG italic_S end_ARG , over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , over^ start_ARG italic_A end_ARG , over^ start_ARG italic_P end_ARG , over^ start_ARG italic_R end_ARG , italic_Ξ³ , over^ start_ARG italic_Ο€ end_ARG ) denotes the abstract MDP. Ξ¦:Sβ†’S^:Φ→𝑆^𝑆\Phi:S\rightarrow\hat{S}roman_Ξ¦ : italic_S β†’ over^ start_ARG italic_S end_ARG is the state abstraction function, where Φ⁒(s)∈S^Φ𝑠^𝑆\Phi(s)\in\hat{S}roman_Ξ¦ ( italic_s ) ∈ over^ start_ARG italic_S end_ARG represents the abstract state, and its inverse image is denoted as Ξ¦βˆ’1⁒(s^)superscriptΞ¦1^𝑠\Phi^{-1}(\hat{s})roman_Ξ¦ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_s end_ARG ), where s^∈S^^𝑠^𝑆\hat{s}\in\hat{S}over^ start_ARG italic_s end_ARG ∈ over^ start_ARG italic_S end_ARG. S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG is the basic state set corresponding to the abstraction function ΦΦ\Phiroman_Ξ¦. Ξ¨:Aβ†’A^:Ψ→𝐴^𝐴\Psi:A\rightarrow\hat{A}roman_Ξ¨ : italic_A β†’ over^ start_ARG italic_A end_ARG is the action abstraction function, where Ψ⁒(a)∈A^Ξ¨π‘Ž^𝐴\Psi(a)\in\hat{A}roman_Ξ¨ ( italic_a ) ∈ over^ start_ARG italic_A end_ARG represents the abstract action, and its inverse image is denoted as Ξ¨βˆ’1⁒(a^)superscriptΞ¨1^π‘Ž\Psi^{-1}(\hat{a})roman_Ξ¨ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_a end_ARG ), where a^∈A^^π‘Ž^𝐴\hat{a}\in\hat{A}over^ start_ARG italic_a end_ARG ∈ over^ start_ARG italic_A end_ARG. A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG is the basic action set corresponding to the abstraction function ΨΨ\Psiroman_Ξ¨.

State transition and reward functions are defined as follows:

R^⁒(s^,a^)=βˆ‘sβˆˆΞ¦βˆ’1⁒(s^),aβˆˆΞ¨βˆ’1⁒(a^)w⁒(s)⁒v⁒(a)⁒R⁒(s,a),^𝑅^𝑠^π‘Žsubscriptformulae-sequence𝑠superscriptΞ¦1^π‘ π‘ŽsuperscriptΞ¨1^π‘Žπ‘€π‘ π‘£π‘Žπ‘…π‘ π‘Ž\hat{R}(\hat{s},\hat{a})=\sum_{s\in\Phi^{-1}(\hat{s}),\,a\in\Psi^{-1}(\hat{a})%}w(s)v(a)R(s,a),over^ start_ARG italic_R end_ARG ( over^ start_ARG italic_s end_ARG , over^ start_ARG italic_a end_ARG ) = βˆ‘ start_POSTSUBSCRIPT italic_s ∈ roman_Ξ¦ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_s end_ARG ) , italic_a ∈ roman_Ξ¨ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_a end_ARG ) end_POSTSUBSCRIPT italic_w ( italic_s ) italic_v ( italic_a ) italic_R ( italic_s , italic_a ) ,(3)
P^s^⁒s^β€²a^=βˆ‘sβˆˆΞ¦βˆ’1⁒(s^),aβˆˆΞ¨βˆ’1⁒(a^)βˆ‘sβ€²βˆˆΞ¦βˆ’1⁒(sβ€²^)w⁒(s)⁒v⁒(a)⁒Ps⁒sβ€²a,superscriptsubscript^𝑃^𝑠superscript^𝑠′^π‘Žsubscriptformulae-sequence𝑠superscriptΞ¦1^π‘ π‘ŽsuperscriptΞ¨1^π‘Žsubscriptsuperscript𝑠′superscriptΞ¦1^superscriptπ‘ β€²π‘€π‘ π‘£π‘Žsuperscriptsubscript𝑃𝑠superscriptπ‘ β€²π‘Ž\hat{P}_{\hat{s}\hat{s}^{\prime}}^{\hat{a}}=\sum_{s\in\Phi^{-1}(\hat{s}),\,a%\in\Psi^{-1}(\hat{a})}\sum_{s^{\prime}\in\Phi^{-1}(\hat{s^{\prime}})}w(s)v(a)P%_{ss^{\prime}}^{a},over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT over^ start_ARG italic_s end_ARG over^ start_ARG italic_s end_ARG start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT over^ start_ARG italic_a end_ARG end_POSTSUPERSCRIPT = βˆ‘ start_POSTSUBSCRIPT italic_s ∈ roman_Ξ¦ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_s end_ARG ) , italic_a ∈ roman_Ξ¨ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_a end_ARG ) end_POSTSUBSCRIPT βˆ‘ start_POSTSUBSCRIPT italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ∈ roman_Ξ¦ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT end_ARG ) end_POSTSUBSCRIPT italic_w ( italic_s ) italic_v ( italic_a ) italic_P start_POSTSUBSCRIPT italic_s italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_a end_POSTSUPERSCRIPT ,(4)

where w:Sβ†’[0,1]:𝑀→𝑆01w:S\rightarrow[0,1]italic_w : italic_S β†’ [ 0 , 1 ], βˆ‘sβˆˆΞ¦βˆ’1⁒(s^)w⁒(s)=1subscript𝑠superscriptΞ¦1^𝑠𝑀𝑠1\sum_{s\in\Phi^{-1}(\hat{s})}w(s)=1βˆ‘ start_POSTSUBSCRIPT italic_s ∈ roman_Ξ¦ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_s end_ARG ) end_POSTSUBSCRIPT italic_w ( italic_s ) = 1 represents the weigh function for state, and v:Aβ†’[0,1]:𝑣→𝐴01v:A\rightarrow[0,1]italic_v : italic_A β†’ [ 0 , 1 ], βˆ‘aβˆˆΞ¨βˆ’1⁒(a^)v⁒(a)=1subscriptπ‘ŽsuperscriptΞ¨1^π‘Žπ‘£π‘Ž1\sum_{a\in\Psi^{-1}(\hat{a})}v(a)=1βˆ‘ start_POSTSUBSCRIPT italic_a ∈ roman_Ξ¨ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_a end_ARG ) end_POSTSUBSCRIPT italic_v ( italic_a ) = 1 represents the weight function for action. R^⁒(s^,a^)^𝑅^𝑠^π‘Ž\hat{R}(\hat{s},\hat{a})over^ start_ARG italic_R end_ARG ( over^ start_ARG italic_s end_ARG , over^ start_ARG italic_a end_ARG ) represents the immediate reward of transitioning from abstract state s^^𝑠\hat{s}over^ start_ARG italic_s end_ARG to s^β€²superscript^𝑠′\hat{s}^{\prime}over^ start_ARG italic_s end_ARG start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT after taking the abstract action a^^π‘Ž\hat{a}over^ start_ARG italic_a end_ARG. P^s^⁒s^β€²a^superscriptsubscript^𝑃^𝑠superscript^𝑠′^π‘Ž\hat{P}_{\hat{s}\hat{s}^{\prime}}^{\hat{a}}over^ start_ARG italic_P end_ARG start_POSTSUBSCRIPT over^ start_ARG italic_s end_ARG over^ start_ARG italic_s end_ARG start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT end_POSTSUBSCRIPT start_POSTSUPERSCRIPT over^ start_ARG italic_a end_ARG end_POSTSUPERSCRIPT represents the probability of transitioning from abstract state s^^𝑠\hat{s}over^ start_ARG italic_s end_ARG to s^β€²superscript^𝑠′\hat{s}^{\prime}over^ start_ARG italic_s end_ARG start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT after taking the abstract action a^^π‘Ž\hat{a}over^ start_ARG italic_a end_ARG. The abstract policy Ο€^^πœ‹\hat{\pi}over^ start_ARG italic_Ο€ end_ARG is generated based on the abstract MDP.

2.4 PRISM

PRISM serves as an open-source probabilistic model checker for the formal modeling and analyzing of probabilistic systems[22, 32]. Widely applied in diverse application domains, PRISM has been instrumental in analyzing systems ranging from communication and multimedia protocols to randomized distributed algorithms, security protocols, biological systems, and beyond. PRISM is proficient in constructing and scrutinizing a variety of probabilistic models, encompassing: Discrete-time and continuous-time Markov chains (DTMCs and CTMCs), MDPs and probabilistic automata (PA). Additionally, PRISM supports extensions of these models that incorporate cost and reward considerations. It facilitates automated analysis of a broad spectrum of quantitative properties inherent in these models. For instance, users can inquire about the probability of a system shutdown within 4 hours due to a failure, the worst-case probability of a protocol terminating in error across all potential initial configurations, the anticipated size of a message queue after 30 minutes, or the worst-case expected time for an algorithm to conclude. The property specification language integrates temporal logic such as PCTL, CSL, LTL, and PCTL*, alongside extensions for quantitative specifications, costs, and rewards.

3 Spatio-temporal Value Semantics based Abstraction for DRL

For ensuring the safety and reliability of ICPS, the design and optimization of controllers play a pivotal role. However, the hybrid behavior of the system and the uncertainties in the environment contribute to a vast state space for ICPS, rendering the design and optimization of controllers using DRL a complex task.To address this complexity, we propose an abstraction modeling approach based on spatio-temporal value semantics, aiming to efficiently abstract the state space and actions of ICPS, thereby facilitating the optimization of controller design through DRL. The key aspect of state abstraction revolves around ensuring semantic consistency, which involves measuring the similarity between different states and determining their belongingness to the same abstract state. To achieve this, we introduce a novel measurement approach termed spatio-temporal value semantics. Fig.2 illustrates the process of abstracting DRL based on spatio-temporal value semantics.

Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (2)

3.1 Action Abstraction

The controller in ICPS usually selects a specific value in a continuous range as the specific value of the action. Taking ADS as an example, the acceleration range is usually in [βˆ’8⁒m/s2,3⁒m/s2]8π‘šsuperscript𝑠23π‘šsuperscript𝑠2[-8m/s^{2},3m/s^{2}][ - 8 italic_m / italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT , 3 italic_m / italic_s start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT ], and the accelerator and brake of the vehicle make the vehicle achieve the expected acceleration. However, the triggering action of MDP is usually a concrete value in this range. It is obviously unrealistic to realize the construction of an MDP for this infinite number of concrete actions in ICPS.

To address this issue, we propose a technique for abstracting continuous action spaces. The fundamental idea is to finely segment the continuous action space so that the abstract action is analogous to the action on the true MDP transition, with analogous successor states and rewards gained.Inspired by[18], we introduce the interval action box as follows:

Definition 5 (Interval Box)

For a continuous action space A𝐴Aitalic_A of dimension d𝑑ditalic_d, each variable in each dimension has its own effective range, i.e., the variable aisubscriptπ‘Žπ‘–a_{i}italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT in the iβˆ’t⁒hπ‘–π‘‘β„Ži-thitalic_i - italic_t italic_h dimension (i∈[0⁒…⁒d]𝑖delimited-[]0…𝑑i\in[0…d]italic_i ∈ [ 0 … italic_d ]) is in the range [li,ui]subscript𝑙𝑖subscript𝑒𝑖[l_{i},u_{i}][ italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ]. The interval box method divides this range uniformly into unit intervals Ii=[li,ui]/gisubscript𝐼𝑖subscript𝑙𝑖subscript𝑒𝑖subscript𝑔𝑖I_{i}=[l_{i},u_{i}]/g_{i}italic_I start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = [ italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ] / italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, which is a partition of the continuous action space A𝐴Aitalic_A. For an action aπ‘Žaitalic_a, based on the interval box abstraction, the action a^=[k1,k2,…,kd]^π‘Žsubscriptπ‘˜1subscriptπ‘˜2…subscriptπ‘˜π‘‘\hat{a}=[k_{1},k_{2},…,k_{d}]over^ start_ARG italic_a end_ARG = [ italic_k start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_k start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_k start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT ] of the abstract action space A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG is obtained, where ki=ai/gisubscriptπ‘˜π‘–subscriptπ‘Žπ‘–subscript𝑔𝑖k_{i}=a_{i}/g_{i}italic_k start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT / italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT, gisubscript𝑔𝑖g_{i}italic_g start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT is the abstraction granularity of the iβˆ’t⁒hπ‘–π‘‘β„Ži-thitalic_i - italic_t italic_h dimension.

According to Definition5, the abstraction granularity must be adjusted in the abstract MDP to maintain the successor state and reward gained by performing the abstract action close to the successor state and the reward obtained by executing the actual action in the true MDP. Each dimension’s level of granularity needs to be specified by the specific environment.Different cases call for varying levels of granularity since fine granularity is closer to the original action, while too small granularity would amplify data jitters inaccuracy.

3.2 Semantics-based StateAbstraction

Existing state abstraction approaches struggle to handle the high dimensionality and continuous state space of ICPS. While model-irrelevance abstraction is an ideal solution in theory, it is challenging to implement in practice. To address these issues, we propose a semantic-based abstraction approach to introduce value information, spatio-temporal information, and probability information into the abstraction process, which enables the evaluation of the similarity between states. We first introduce the semantic-based abstraction model.

Definition 6 (Semantic-based Abstraction MDP)

An abstraction model is denoted by a tuple (S^,A^,s^0,Ξ·,Θ)^𝑆^𝐴subscript^𝑠0πœ‚Ξ˜(\hat{S},\hat{A},\hat{s}_{0},\eta,\Theta)( over^ start_ARG italic_S end_ARG , over^ start_ARG italic_A end_ARG , over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT , italic_Ξ· , roman_Θ ), where S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG is the set of abstract states, A^^𝐴\hat{A}over^ start_ARG italic_A end_ARG is the set of abstract actions, Ξ·:S^Γ—A^Γ—S^β†’[0,1]:πœ‚β†’^𝑆^𝐴^𝑆01\eta:\hat{S}\times\hat{A}\times\hat{S}\rightarrow[0,1]italic_Ξ· : over^ start_ARG italic_S end_ARG Γ— over^ start_ARG italic_A end_ARG Γ— over^ start_ARG italic_S end_ARG β†’ [ 0 , 1 ] represents the transition distribution, s^0subscript^𝑠0\hat{s}_{0}over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT 0 end_POSTSUBSCRIPT is the initial state set, and ΘΘ\Thetaroman_Θ represents the mapping to the semantic space.

It is worth noting that the core of semantic abstraction lies in measuring the similarity of states in abstract MDP. The similar states must satisfy two conditions: (1) the available action sets of similar states should be similar, and (2) the multi-step state transition models and multi-step rewards of similar states should be similar.

3.2.1 Semantic Interval Abstraction

To satisfy the specified conditions, we implement Semantic Interval Abstraction. Taking Adaptive Cruise Control (ACC) as an example, where the goal is to maintain a safe distance from the lead vehicle. The concrete state s𝑠sitalic_s of the ego vehicle, represented as a multidimensional vector (v,a⁒c⁒c,x,y,d⁒i⁒s⁒t⁒a⁒n⁒c⁒e,…)π‘£π‘Žπ‘π‘π‘₯π‘¦π‘‘π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’β€¦(v,acc,x,y,distance,\ldots)( italic_v , italic_a italic_c italic_c , italic_x , italic_y , italic_d italic_i italic_s italic_t italic_a italic_n italic_c italic_e , … ), includes parameters like vehicle speed v𝑣vitalic_v, acceleration a⁒c⁒cπ‘Žπ‘π‘accitalic_a italic_c italic_c, spatial coordinates (xπ‘₯xitalic_x, y𝑦yitalic_y), and relative distance.

Based on existing causal discovery algorithms, we employ PC (Peter-Clark) algorithm[39] and FCI (Fast Causal Inference) algorithm[40] to construct causal graphs on the autonomous driving dataset. The union of these two causal graphs is considered as the final causal graph. By examining the causal relationships within the graph, we identify the relationships among different dimensions of the state. Through abstraction based on these causal relationships, we determine the semantic values.

Due to the continuous nature of each dimension, the combination of specific dimensions results in a vast state space. In the context of ICPS, information from the state vector exhibits approximate similarity within short time intervals. Thus, the concrete state s𝑠sitalic_s undergoes semantic abstraction, yielding a condensed representation d=(r⁒e⁒lvelocity,r⁒e⁒langle,r⁒e⁒ldistance,…)π‘‘π‘Ÿπ‘’subscript𝑙velocityπ‘Ÿπ‘’subscript𝑙angleπ‘Ÿπ‘’subscript𝑙distance…d=(\text{$rel_{\text{velocity}}$},\text{$rel_{\text{angle}}$},\text{$rel_{%\text{distance}}$},\ldots)italic_d = ( italic_r italic_e italic_l start_POSTSUBSCRIPT velocity end_POSTSUBSCRIPT , italic_r italic_e italic_l start_POSTSUBSCRIPT angle end_POSTSUBSCRIPT , italic_r italic_e italic_l start_POSTSUBSCRIPT distance end_POSTSUBSCRIPT , … ), where r⁒e⁒lvelocityπ‘Ÿπ‘’subscript𝑙velocityrel_{\text{velocity}}italic_r italic_e italic_l start_POSTSUBSCRIPT velocity end_POSTSUBSCRIPT, r⁒e⁒la⁒n⁒g⁒l⁒eπ‘Ÿπ‘’subscriptπ‘™π‘Žπ‘›π‘”π‘™π‘’rel_{angle}italic_r italic_e italic_l start_POSTSUBSCRIPT italic_a italic_n italic_g italic_l italic_e end_POSTSUBSCRIPT, and r⁒e⁒ld⁒i⁒s⁒t⁒a⁒n⁒c⁒eπ‘Ÿπ‘’subscriptπ‘™π‘‘π‘–π‘ π‘‘π‘Žπ‘›π‘π‘’rel_{distance}italic_r italic_e italic_l start_POSTSUBSCRIPT italic_d italic_i italic_s italic_t italic_a italic_n italic_c italic_e end_POSTSUBSCRIPT represent relative velocity, relative angle, and relative distance, respectively. In summary, semantic interval abstraction maps multiple dimensions of a concrete state into several semantic values, achieving abstraction of state dimensions.

Since the dimensions of the semantic value ΞΈπœƒ\thetaitalic_ΞΈ may have different scales, the data needs to be normalized to a uniform scale. Therefore, we divide the J𝐽Jitalic_J-dimensional space into ∏j=1JKjsuperscriptsubscriptproduct𝑗1𝐽subscript𝐾𝑗\prod_{j=1}^{J}K_{j}∏ start_POSTSUBSCRIPT italic_j = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_J end_POSTSUPERSCRIPT italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT segments, with each dimension having Kjsubscript𝐾𝑗K_{j}italic_K start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT intervals i.e.dij=[lij,uij]superscriptsubscript𝑑𝑖𝑗superscriptsubscript𝑙𝑖𝑗superscriptsubscript𝑒𝑖𝑗d_{i}^{j}=[l_{i}^{j},u_{i}^{j}]italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = [ italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT , italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ]. In this context, dijsuperscriptsubscript𝑑𝑖𝑗d_{i}^{j}italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT represents the i𝑖iitalic_i-th interval on the j𝑗jitalic_j-th dimension, lijsuperscriptsubscript𝑙𝑖𝑗l_{i}^{j}italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT and uijsuperscriptsubscript𝑒𝑖𝑗u_{i}^{j}italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT are the lower and upper bounds of this interval. Consequently, the spatial partitioning problem is transformed into an optimization problem, specifically formulated as follows:

{max⁑(uijβˆ’lij)s.t.⁒dMINj≀uijβˆ’lij≀dMAXj|sij^|β‰₯nMINjMEAN⁒{ΞΈsj^βˆ’E⁒[ΞΈsj^]}<eMEANjMAX⁒(ΞΈsj^βˆ’E⁒[ΞΈsj^])<eMAXjcasessuperscriptsubscript𝑒𝑖𝑗superscriptsubscript𝑙𝑖𝑗otherwises.t.missing-subexpressionsuperscriptsubscript𝑑MIN𝑗superscriptsubscript𝑒𝑖𝑗superscriptsubscript𝑙𝑖𝑗superscriptsubscript𝑑MAX𝑗missing-subexpression^superscriptsubscript𝑠𝑖𝑗superscriptsubscript𝑛MIN𝑗missing-subexpressionMEAN^superscriptsubscriptπœƒπ‘ π‘—πΈdelimited-[]^superscriptsubscriptπœƒπ‘ π‘—superscriptsubscript𝑒MEAN𝑗missing-subexpressionMAX^superscriptsubscriptπœƒπ‘ π‘—πΈdelimited-[]^superscriptsubscriptπœƒπ‘ π‘—superscriptsubscript𝑒MAX𝑗otherwise\displaystyle\begin{cases}\max\left(u_{i}^{j}-l_{i}^{j}\right)\\\text{s.t.}\begin{aligned} &d_{\text{MIN}}^{j}\leq u_{i}^{j}-l_{i}^{j}\leq d_{%\text{MAX}}^{j}\\&\left|\hat{s_{i}^{j}}\right|\geq n_{\text{MIN}}^{j}\\&\text{MEAN}\left\{\hat{\theta_{s}^{j}}-E\left[\hat{\theta_{s}^{j}}\right]%\right\}<e_{\text{MEAN}}^{j}\\&\text{MAX}\left(\hat{\theta_{s}^{j}}-E\left[\hat{\theta_{s}^{j}}\right]\right%)<e_{\text{MAX}}^{j}\end{aligned}\end{cases}{ start_ROW start_CELL roman_max ( italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT - italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ) end_CELL start_CELL end_CELL end_ROW start_ROW start_CELL s.t. start_ROW start_CELL end_CELL start_CELL italic_d start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ≀ italic_u start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT - italic_l start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ≀ italic_d start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL | over^ start_ARG italic_s start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_ARG | β‰₯ italic_n start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL MEAN { over^ start_ARG italic_ΞΈ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_ARG - italic_E [ over^ start_ARG italic_ΞΈ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_ARG ] } < italic_e start_POSTSUBSCRIPT MEAN end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_CELL end_ROW start_ROW start_CELL end_CELL start_CELL MAX ( over^ start_ARG italic_ΞΈ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_ARG - italic_E [ over^ start_ARG italic_ΞΈ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_ARG ] ) < italic_e start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT end_CELL end_ROW end_CELL start_CELL end_CELL end_ROW(5)

where dMINjsuperscriptsubscript𝑑MIN𝑗d_{\text{MIN}}^{j}italic_d start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT and dMAXjsuperscriptsubscript𝑑MAX𝑗d_{\text{MAX}}^{j}italic_d start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT are the minimum and maximum lengths of intervals on the j𝑗jitalic_j-th semantic dimension. s^ij={s|ΞΈsj∈dij}superscriptsubscript^𝑠𝑖𝑗conditional-set𝑠superscriptsubscriptπœƒπ‘ π‘—superscriptsubscript𝑑𝑖𝑗\hat{s}_{i}^{j}=\{s|\theta_{s}^{j}\in d_{i}^{j}\}over^ start_ARG italic_s end_ARG start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT = { italic_s | italic_ΞΈ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT ∈ italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT } represents the set of concrete states with semantic values ΞΈsjsuperscriptsubscriptπœƒπ‘ π‘—\theta_{s}^{j}italic_ΞΈ start_POSTSUBSCRIPT italic_s end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT falling within the interval dijsuperscriptsubscript𝑑𝑖𝑗d_{i}^{j}italic_d start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT. nMINjsuperscriptsubscript𝑛MIN𝑗n_{\text{MIN}}^{j}italic_n start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT is the minimum number of concrete states in the j𝑗jitalic_j-th dimension interval, and eMEANjsuperscriptsubscript𝑒MEAN𝑗e_{\text{MEAN}}^{j}italic_e start_POSTSUBSCRIPT MEAN end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT and eMAXjsuperscriptsubscript𝑒MAX𝑗e_{\text{MAX}}^{j}italic_e start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j end_POSTSUPERSCRIPT are predefined expected error and maximum error for abstract on the j𝑗jitalic_j-th dimension. Eq.5 ensures that each interval contains enough concrete states while maintaining low abstract errors.

1:Concrete state set S𝑆Sitalic_S, Semantic value set ΞΈπœƒ\thetaitalic_ΞΈ, Maximum interval lengths dMAXsubscript𝑑MAXd_{\text{MAX}}italic_d start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT, Minimum interval lengths dMINsubscript𝑑MINd_{\text{MIN}}italic_d start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT, Minimum number of concrete states in intervals nMINsubscript𝑛MINn_{\text{MIN}}italic_n start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT, Expected error eMEANsubscript𝑒MEANe_{\text{MEAN}}italic_e start_POSTSUBSCRIPT MEAN end_POSTSUBSCRIPT, Maximum error eMAXsubscript𝑒MAXe_{\text{MAX}}italic_e start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT, R⁒E⁒D⁒U⁒C⁒T⁒I⁒O⁒N⁒_⁒L⁒E⁒V⁒E⁒Lπ‘…πΈπ·π‘ˆπΆπ‘‡πΌπ‘‚π‘_𝐿𝐸𝑉𝐸𝐿REDUCTION\_LEVELitalic_R italic_E italic_D italic_U italic_C italic_T italic_I italic_O italic_N _ italic_L italic_E italic_V italic_E italic_L rdsubscriptπ‘Ÿπ‘‘r_{d}italic_r start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT

2:Intervalized abstract space SI^^subscript𝑆𝐼\hat{S_{I}}over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG

3:refined ←←\leftarrow← False, SI^β†βˆ…β†^subscript𝑆𝐼\hat{S_{I}}\leftarrow\emptysetover^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG ← βˆ… // Initialize abstraction process

4:whilerefined = Falsedo // Iterative refinement

5:forj∈{1,…,J}𝑗1…𝐽j\in\{1,...,J\}italic_j ∈ { 1 , … , italic_J }do // Partition state set based on Eq.5

6:dj←P⁒A⁒R⁒T⁒I⁒T⁒I⁒O⁒N⁒(S,ΞΈj,dMAX,dMIN,nMIN)←subscript𝑑𝑗𝑃𝐴𝑅𝑇𝐼𝑇𝐼𝑂𝑁𝑆subscriptπœƒπ‘—subscript𝑑MAXsubscript𝑑MINsubscript𝑛MINd_{j}\leftarrow PARTITION(S,\theta_{j},d_{\text{MAX}},d_{\text{MIN}},n_{\text{%MIN}})italic_d start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT ← italic_P italic_A italic_R italic_T italic_I italic_T italic_I italic_O italic_N ( italic_S , italic_ΞΈ start_POSTSUBSCRIPT italic_j end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT , italic_n start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT )

7:endfor

8:D←d1,…,dJ←𝐷subscript𝑑1…subscript𝑑𝐽D\leftarrow d_{1},...,d_{J}italic_D ← italic_d start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , … , italic_d start_POSTSUBSCRIPT italic_J end_POSTSUBSCRIPT // Form set of intervalized dimensions

9:SI^←S⁒T⁒A⁒T⁒E⁒_⁒M⁒A⁒P⁒P⁒I⁒N⁒G⁒(D)←^subscript𝑆𝐼𝑆𝑇𝐴𝑇𝐸_𝑀𝐴𝑃𝑃𝐼𝑁𝐺𝐷\hat{S_{I}}\leftarrow STATE\_MAPPING(D)over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG ← italic_S italic_T italic_A italic_T italic_E _ italic_M italic_A italic_P italic_P italic_I italic_N italic_G ( italic_D ) // Map intervals to abstract space

10:em⁒e⁒a⁒n,em⁒a⁒x←C⁒O⁒M⁒P⁒U⁒T⁒E⁒_⁒E⁒R⁒R⁒O⁒R⁒(SI^,S)←subscriptπ‘’π‘šπ‘’π‘Žπ‘›subscriptπ‘’π‘šπ‘Žπ‘₯πΆπ‘‚π‘€π‘ƒπ‘ˆπ‘‡πΈ_𝐸𝑅𝑅𝑂𝑅^subscript𝑆𝐼𝑆e_{mean},e_{max}\leftarrow COMPUTE\_ERROR(\hat{S_{I}},S)italic_e start_POSTSUBSCRIPT italic_m italic_e italic_a italic_n end_POSTSUBSCRIPT , italic_e start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT ← italic_C italic_O italic_M italic_P italic_U italic_T italic_E _ italic_E italic_R italic_R italic_O italic_R ( over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG , italic_S ) // Compute current error

11:rcur←C⁒O⁒M⁒P⁒U⁒T⁒E⁒_⁒R⁒E⁒D⁒U⁒C⁒T⁒I⁒O⁒N⁒_⁒L⁒E⁒V⁒E⁒L⁒(SI^,S)←subscriptπ‘ŸcurπΆπ‘‚π‘€π‘ƒπ‘ˆπ‘‡πΈ_π‘…πΈπ·π‘ˆπΆπ‘‡πΌπ‘‚π‘_𝐿𝐸𝑉𝐸𝐿^subscript𝑆𝐼𝑆r_{\text{cur}}\leftarrow COMPUTE\_REDUCTION\_LEVEL(\hat{S_{I}},S)italic_r start_POSTSUBSCRIPT cur end_POSTSUBSCRIPT ← italic_C italic_O italic_M italic_P italic_U italic_T italic_E _ italic_R italic_E italic_D italic_U italic_C italic_T italic_I italic_O italic_N _ italic_L italic_E italic_V italic_E italic_L ( over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG , italic_S ) // Compute reduction level

12:ifemean>eMEANo⁒remax>eMAXo⁒rrcur>rdformulae-sequencesubscript𝑒meansubscript𝑒MEANπ‘œπ‘Ÿformulae-sequencesubscript𝑒maxsubscript𝑒MAXπ‘œπ‘Ÿsubscriptπ‘Ÿcursubscriptπ‘Ÿπ‘‘e_{\text{mean}}>e_{\text{MEAN}}\ \ or\ \ e_{\text{max}}>e_{\text{MAX}}\ \ or\ %\ r_{\text{cur}}>r_{d}italic_e start_POSTSUBSCRIPT mean end_POSTSUBSCRIPT > italic_e start_POSTSUBSCRIPT MEAN end_POSTSUBSCRIPT italic_o italic_r italic_e start_POSTSUBSCRIPT max end_POSTSUBSCRIPT > italic_e start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT italic_o italic_r italic_r start_POSTSUBSCRIPT cur end_POSTSUBSCRIPT > italic_r start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT then

13:update dMAX,dMIN,nMINsubscript𝑑MAXsubscript𝑑MINsubscript𝑛MINd_{\text{MAX}},d_{\text{MIN}},n_{\text{MIN}}italic_d start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT , italic_d start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT , italic_n start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT // Update interval parameters if needed

14:else

15:refined ←←\leftarrow← True // End refinement if conditions are met

16:endif

17:endwhile

18:Return SI^^subscript𝑆𝐼\hat{S_{I}}over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG // Return intervalized abstract space

Alg.1 has been devised to orchestrate the transformation of a given concrete state set S𝑆Sitalic_S into an intervalized abstract space denoted as SI^^subscript𝑆𝐼\hat{S_{I}}over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG. This transformation takes into account semantic values and adheres to specific constraints. The essential parameters guiding this process encompass the semantic value set ΞΈπœƒ\thetaitalic_ΞΈ, the maximum and minimum interval lengths dMAXsubscript𝑑MAXd_{\text{MAX}}italic_d start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT and dMINsubscript𝑑MINd_{\text{MIN}}italic_d start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT, the minimum number of concrete states in intervals nMINsubscript𝑛MINn_{\text{MIN}}italic_n start_POSTSUBSCRIPT MIN end_POSTSUBSCRIPT, the expected error threshold eMEANsubscript𝑒MEANe_{\text{MEAN}}italic_e start_POSTSUBSCRIPT MEAN end_POSTSUBSCRIPT, and the limit for reduction level rdsubscriptπ‘Ÿπ‘‘r_{d}italic_r start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT.

The term REDUCTION_LEVEL[38] serves as an indicator of the state compression ratio. The optimization iterations are designed to terminate when the abstract effect meets the predefined requirements of rdsubscriptπ‘Ÿdr_{\text{d}}italic_r start_POSTSUBSCRIPT d end_POSTSUBSCRIPT. A reasonable range is to make the number of compressed states 10% to 30% of the original number of states.

3.2.2 Semantic-based (Ξ΅πœ€\varepsilonitalic_Ξ΅,d)-Abstraction

While the semantic interval abstraction can ensure model accuracy, abstraction simplicity is related to the granularity of abstraction. To address this, we propose (Ξ΅,d)βˆ’a⁒b⁒s⁒t⁒r⁒a⁒c⁒t⁒i⁒o⁒nπœ€π‘‘π‘Žπ‘π‘ π‘‘π‘Ÿπ‘Žπ‘π‘‘π‘–π‘œπ‘›(\varepsilon,d)-abstraction( italic_Ξ΅ , italic_d ) - italic_a italic_b italic_s italic_t italic_r italic_a italic_c italic_t italic_i italic_o italic_n based on spatio-temporal value semantics.

Definition 7 (Spatio-temporal Value Semantics)

For a concrete state s∈S𝑠𝑆s\in Sitalic_s ∈ italic_S and spatio-temporal value semantics ΞΈβˆˆβ„nπœƒsuperscriptℝ𝑛\theta\in\mathbb{R}^{n}italic_ΞΈ ∈ blackboard_R start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT, ΞΈ=Θ⁒{V⁒(s),Q⁒(s,a),R⁒(s,a),P⁒(s,sβ€²),…}πœƒΞ˜π‘‰π‘ π‘„π‘ π‘Žπ‘…π‘ π‘Žπ‘ƒπ‘ superscript𝑠′…\theta=\Theta\{V(s),Q(s,a),R(s,a),\\P(s,s^{\prime}),\ldots\}italic_ΞΈ = roman_Θ { italic_V ( italic_s ) , italic_Q ( italic_s , italic_a ) , italic_R ( italic_s , italic_a ) , italic_P ( italic_s , italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ) , … }. Here, ΞΈπœƒ\thetaitalic_ΞΈ represents the semantic value, and the function Θ:Sβ†’ΞΈ:Ξ˜β†’π‘†πœƒ\Theta:S\rightarrow\thetaroman_Θ : italic_S β†’ italic_ΞΈ maps states to their corresponding semantic values. The multidimensional mapping Θ⁒{V⁒(s),Q⁒(s,a),R⁒(s,a),P⁒(s,sβ€²),…}Ξ˜π‘‰π‘ π‘„π‘ π‘Žπ‘…π‘ π‘Žπ‘ƒπ‘ superscript𝑠′…\Theta\{V(s),Q(s,a),R(s,a),P(s,s^{\prime}),\ldots\}roman_Θ { italic_V ( italic_s ) , italic_Q ( italic_s , italic_a ) , italic_R ( italic_s , italic_a ) , italic_P ( italic_s , italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ) , … } captures various aspects of the state s𝑠sitalic_s, including the state value function V⁒(s)𝑉𝑠V(s)italic_V ( italic_s ), action value function Q⁒(s,a)π‘„π‘ π‘ŽQ(s,a)italic_Q ( italic_s , italic_a ), reward function R⁒(s,a)π‘…π‘ π‘ŽR(s,a)italic_R ( italic_s , italic_a ), and transition function P⁒(s,sβ€²)𝑃𝑠superscript𝑠′P(s,s^{\prime})italic_P ( italic_s , italic_s start_POSTSUPERSCRIPT β€² end_POSTSUPERSCRIPT ), among others. The semantic mapping ΘΘ\Thetaroman_Θ translates state and environmental information into the semantic space, thus depicting the spatio-temporal value of the state.

Spatio-temporal value semantics encapsulate state evolution information across time and space, reflecting the state’s distribution and evolution. It considers factors like state value function and transition function, providing insights into both current and future states. Using spatio-temporal value semantics, we assess semantic equivalence between abstract states. When the semantic distance is below a specified threshold, states are considered equivalent, simplifying model abstraction with enhanced precision.

Definition 8 ((Ξ΅πœ€\varepsilonitalic_Ξ΅,d)-Abstraction)

(Ξ΅πœ€\varepsilonitalic_Ξ΅,d)-Abstraction is denoted by a mapping ΦΡ,d:Sβ†’S^:subscriptΞ¦πœ€π‘‘β†’π‘†^𝑆\Phi_{\varepsilon,d}:S\rightarrow\hat{S}roman_Ξ¦ start_POSTSUBSCRIPT italic_Ξ΅ , italic_d end_POSTSUBSCRIPT : italic_S β†’ over^ start_ARG italic_S end_ARG that satisfies the following condition:

d⁒(s1,s2)β‰€Ξ΅β’βˆ€s^∈S^,s1,s2∈ΦΡ,dβˆ’1⁒(s^).formulae-sequence𝑑subscript𝑠1subscript𝑠2πœ€for-all^𝑠^𝑆subscript𝑠1subscript𝑠2superscriptsubscriptΞ¦πœ€π‘‘1^𝑠d(s_{1},s_{2})\leq\varepsilon\quad\forall\hat{s}\in\hat{S},\quad s_{1},s_{2}%\in\Phi_{\varepsilon,d}^{-1}(\hat{s}).italic_d ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ≀ italic_Ξ΅ βˆ€ over^ start_ARG italic_s end_ARG ∈ over^ start_ARG italic_S end_ARG , italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ roman_Ξ¦ start_POSTSUBSCRIPT italic_Ξ΅ , italic_d end_POSTSUBSCRIPT start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ( over^ start_ARG italic_s end_ARG ) .(6)

Ξ¦:Sβ†’S^:Φ→𝑆^𝑆\Phi:S\rightarrow\hat{S}roman_Ξ¦ : italic_S β†’ over^ start_ARG italic_S end_ARG represents the abstract mapping function that maps the original state space S𝑆Sitalic_S into an abstract state space S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG. The mapping function ΦΦ\Phiroman_Ξ¦ transforms a true MDP into an abstract MDP. Let P⁒o⁒w⁒(S)π‘ƒπ‘œπ‘€π‘†Pow(S)italic_P italic_o italic_w ( italic_S ) denote the power set of S𝑆Sitalic_S, and Ξ¦βˆ’1:S^β†’P⁒o⁒w⁒(S):superscriptΞ¦1β†’^π‘†π‘ƒπ‘œπ‘€π‘†\Phi^{-1}:\hat{S}\rightarrow Pow(S)roman_Ξ¦ start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT : over^ start_ARG italic_S end_ARG β†’ italic_P italic_o italic_w ( italic_S ) represents the inverse mapping function. The core of state abstraction is measuring the similarity of state abstractions and performing nearest-neighbor abstractions based on state similarity. d𝑑ditalic_d represents the state distance metric, and Ξ΅πœ€\varepsilonitalic_Ξ΅ is the abstraction threshold.

Grounded in the state value functions (Eq.2) and action value functions (Eq.3) in MDP, when two states possess akin transition models and rewards, their expected cumulative rewards will be similar. This observation provides a simplification approach for Semantic-based (Ξ΅πœ€\varepsilonitalic_Ξ΅,d)-Abstraction: the reward function and transition probabilities can compose the Spatio-temporal Value Metric of that state. Thus, in the process of abstraction based on value semantics, we aim to maintain the optimal value function of the abstract MDP as close as possible to the true MDP, ensuring semantic equivalence.

Definition 9 (Spatio-temporal Value Metric)

For βˆ€s1,s2∈Sfor-allsubscript𝑠1subscript𝑠2𝑆\forall s_{1},s_{2}\in Sβˆ€ italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ italic_S,

d⁒(s1,s2)𝑑subscript𝑠1subscript𝑠2\displaystyle d(s_{1},s_{2})italic_d ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )=d(ΞΈs1,ΞΈs2)β‰œmaxa^∈A^⁒(s1)∩A^⁒(s2){cR|R(s1,a^)βˆ’R(s2,a^)|\displaystyle=d(\theta_{s_{1}},\theta_{s_{2}})\triangleq\max_{\hat{a}\in\hat{A%}(s_{1})\cap\hat{A}(s_{2})}\big{\{}c_{R}\lvert R(s_{1},\hat{a})-R(s_{2},\hat{a%})\rvert= italic_d ( italic_ΞΈ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT end_POSTSUBSCRIPT , italic_ΞΈ start_POSTSUBSCRIPT italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT end_POSTSUBSCRIPT ) β‰œ roman_max start_POSTSUBSCRIPT over^ start_ARG italic_a end_ARG ∈ over^ start_ARG italic_A end_ARG ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∩ over^ start_ARG italic_A end_ARG ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT { italic_c start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT | italic_R ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over^ start_ARG italic_a end_ARG ) - italic_R ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , over^ start_ARG italic_a end_ARG ) |(7)
+cPdP(P(β‹…|s1,a^),P(β‹…|s2,a^))}+cDD[A^(s1),A^(s2)]\displaystyle\quad+c_{P}d_{P}(P(\cdot|s_{1},\hat{a}),P(\cdot|s_{2},\hat{a}))%\big{\}}+c_{D}D[\hat{A}(s_{1}),\hat{A}(s_{2})]+ italic_c start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT ( italic_P ( β‹… | italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , over^ start_ARG italic_a end_ARG ) , italic_P ( β‹… | italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , over^ start_ARG italic_a end_ARG ) ) } + italic_c start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT italic_D [ over^ start_ARG italic_A end_ARG ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , over^ start_ARG italic_A end_ARG ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ]
+cT⁒D⁒[s1,s2],subscript𝑐𝑇𝐷subscript𝑠1subscript𝑠2\displaystyle\quad+c_{T}D[s_{1},s_{2}],+ italic_c start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT italic_D [ italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ] ,

where cRsubscript𝑐𝑅c_{R}italic_c start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT and cPsubscript𝑐𝑃c_{P}italic_c start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT are positive constants, dPsubscript𝑑𝑃d_{P}italic_d start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT represents a measure function of the similarity between two probability distributions. cRsubscript𝑐𝑅c_{R}italic_c start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT and cPsubscript𝑐𝑃c_{P}italic_c start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT are weights for measuring rewards and transition probability density functions, respectively. cDsubscript𝑐𝐷c_{D}italic_c start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT and cTsubscript𝑐𝑇c_{T}italic_c start_POSTSUBSCRIPT italic_T end_POSTSUBSCRIPT are sufficiently large positive constants, and A^⁒(s1)^𝐴subscript𝑠1\hat{A}(s_{1})over^ start_ARG italic_A end_ARG ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) is considered equivalent to A^⁒(s2)^𝐴subscript𝑠2\hat{A}(s_{2})over^ start_ARG italic_A end_ARG ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) if cD⁒D⁒[A^⁒(s1),A^⁒(s2)]≀ϡsubscript𝑐𝐷𝐷^𝐴subscript𝑠1^𝐴subscript𝑠2italic-Ο΅c_{D}D[\hat{A}(s_{1}),\hat{A}(s_{2})]\leq\epsilonitalic_c start_POSTSUBSCRIPT italic_D end_POSTSUBSCRIPT italic_D [ over^ start_ARG italic_A end_ARG ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , over^ start_ARG italic_A end_ARG ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ] ≀ italic_Ο΅. The spatio-temporal value metric satisfies mutual simulatability and uniqueness, i.e., d⁒(s1,s2)=0𝑑subscript𝑠1subscript𝑠20d(s_{1},s_{2})=0italic_d ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = 0 implies s1=s2subscript𝑠1subscript𝑠2s_{1}=s_{2}italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT.

1:Intervalized abstract space SI^={D1,D2,…,Dn}^subscript𝑆𝐼subscript𝐷1subscript𝐷2…subscript𝐷𝑛\hat{S_{I}}=\{D_{1},D_{2},\ldots,D_{n}\}over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG = { italic_D start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_D start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_D start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT }, Optimal state number determination function K𝐾Kitalic_K, Spatiotemporal Value Metric d𝑑ditalic_d

2:Semantic-based (Ξ΅πœ€\varepsilonitalic_Ξ΅,d)-Abstraction space S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG, Abstraction Model ΦΦ\Phiroman_Ξ¦

3:Number of clusters kβ†β†π‘˜absentk\leftarrowitalic_k ← K⁒(SI^)𝐾^subscript𝑆𝐼K(\hat{S_{I}})italic_K ( over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG )

4:Initialize cluster centroids randomly: c1,c2,…,ck←←subscript𝑐1subscript𝑐2…subscriptπ‘π‘˜absentc_{1},c_{2},\ldots,c_{k}\leftarrowitalic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT ← random points from SI^^subscript𝑆𝐼\hat{S_{I}}over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG

5:repeat

6:fori=1𝑖1i=1italic_i = 1 to n𝑛nitalic_ndo

7:Assign each data point Disubscript𝐷𝑖D_{i}italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT to the nearest centroid:

8: cj⁒(i)=d⁒(Di,ck)subscript𝑐𝑗𝑖𝑑subscript𝐷𝑖subscriptπ‘π‘˜c_{j(i)}=d(D_{i},c_{k})italic_c start_POSTSUBSCRIPT italic_j ( italic_i ) end_POSTSUBSCRIPT = italic_d ( italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT )

9:endfor

10:fork=1π‘˜1k=1italic_k = 1 to kπ‘˜kitalic_kdo

11:Update each centroid as the mean of the assigned data points:

12: ck=1|{i:j⁒(i)=k}|β’βˆ‘i:j⁒(i)=kDisubscriptπ‘π‘˜1conditional-setπ‘–π‘—π‘–π‘˜subscript:π‘–π‘—π‘–π‘˜subscript𝐷𝑖c_{k}=\frac{1}{\lvert\{i:j(i)=k\}\rvert}\sum_{i:j(i)=k}D_{i}italic_c start_POSTSUBSCRIPT italic_k end_POSTSUBSCRIPT = divide start_ARG 1 end_ARG start_ARG | { italic_i : italic_j ( italic_i ) = italic_k } | end_ARG βˆ‘ start_POSTSUBSCRIPT italic_i : italic_j ( italic_i ) = italic_k end_POSTSUBSCRIPT italic_D start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT

13:endfor

14:untilConvergence

15:Return S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG, ΦΦ\Phiroman_Ξ¦c1,c2,…,cKsubscript𝑐1subscript𝑐2…subscript𝑐𝐾c_{1},c_{2},\ldots,c_{K}italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT

The spatio-temporal value metric is employed in the Semantic-based (Ξ΅πœ€\varepsilonitalic_Ξ΅,d)-Abstraction algorithm, as delineated in Alg.2. The procedure outlined in Alg.2 takes the intervalized abstract space SI^^subscript𝑆𝐼\hat{S_{I}}over^ start_ARG italic_S start_POSTSUBSCRIPT italic_I end_POSTSUBSCRIPT end_ARG as its input. Key steps in the algorithmic execution involve the determination of the number of clusters based on the optimal state number determination function K𝐾Kitalic_K (Line 3), initializing cluster centroids randomly (Line 4), and iteratively assigning data points to the nearest centroids while updating centroids until convergence (Lines 5-14). The ultimate outcome comprises the Semantic-based (Ξ΅πœ€\varepsilonitalic_Ξ΅,d)-Abstraction space S^^𝑆\hat{S}over^ start_ARG italic_S end_ARG and the Abstraction Model ΦΦ\Phiroman_Ξ¦, where the centroids c1,c2,…,cKsubscript𝑐1subscript𝑐2…subscript𝑐𝐾c_{1},c_{2},\ldots,c_{K}italic_c start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_c start_POSTSUBSCRIPT italic_K end_POSTSUBSCRIPT encapsulate the final representation of the abstract space (Line 15).

The time complexity of Alg.2 is determined by its iterative clustering, which involves data point assignments to centroids and centroid updates until convergence. The assignment step has a complexity of O⁒(nβ‹…kβ‹…m)π‘‚β‹…π‘›π‘˜π‘šO(n\cdot k\cdot m)italic_O ( italic_n β‹… italic_k β‹… italic_m ), where n𝑛nitalic_n is the number of data points, kπ‘˜kitalic_k the number of clusters, and mπ‘šmitalic_m the dimensionality. Centroid updates have a complexity of O⁒(kβ‹…m)π‘‚β‹…π‘˜π‘šO(k\cdot m)italic_O ( italic_k β‹… italic_m ). The algorithm’s effectiveness is influenced by initial centroid positions and the fixed number of clusters, requiring adjustments based on dataset characteristics for optimal semantic-based (Ξ΅πœ€\varepsilonitalic_Ξ΅,d)-abstraction.

3.3 Construction of Abstract MDP

In this subsection, we introduce a comprehensive methodology for constructing an MDP, essential for formulating decision-making frameworks in stochastic environments. Fig.2 illustrates the procedural details of the approach. Initiated by gathering detailed trajectory data comprising observed states and actions over time, this approach tackles the inherent complexity and high dimensionality of the raw data. Central to our approach is the systematic abstraction of this data, necessary for effective MDP modeling. The methodology encompasses two primary abstraction processes: action abstraction through interval box and state abstraction through spatio-temporal value semantics. Action abstraction involves discretizing the continuous and diverse real-world actions into distinct intervals, each representing a group of similar actions. This process simplifies the action space, enhancing tractability and computational feasibility. Simultaneously, state abstraction condenses the state space by Alg.1 and Alg.2, thereby capturing their essential characteristics. These abstractions are pivotal in reducing complexity, allowing for more efficient computation and analysis.

The next critical step in our methodology is the statistical computation of transition probabilities, derived from the frequency of state transitions in the trajectory data. These probabilities reflect the likelihood of moving from one state to another given a specific action, mirroring the dynamics of real-world scenarios. When calculating transition probabilities, we employ Hoeffding’s inequality to reduce errors, enhancing the accuracy of our probability estimates. After abstracting states and actions and incorporating these probabilities, we formulate the initial MDP model. This model may include a reward system based on state transitions. However, recognizing that the initial model may not fully align with real-world contexts, we engage in iterative refinement. This involves adjusting abstractions, recalculating probabilities, and redefining states and actions to enhance the model’s empirical alignment with observed data.

4 Implementation and Evaluation

4.1 Case Study

We conduct experiments in three representative ADS scenarios, encompassing diverse driving environments with varying control specifications.

Lane Keeping Assist (LKA) is an advanced driving assistance module[24] crucial for automated driving. LKA evaluates the lateral offset dtsubscript𝑑td_{\text{t}}italic_d start_POSTSUBSCRIPT t end_POSTSUBSCRIPT and relative yaw angle ΞΈtsubscriptπœƒt\theta_{\text{t}}italic_ΞΈ start_POSTSUBSCRIPT t end_POSTSUBSCRIPT to adjust the front wheel steering angle ΞΈsteersubscriptπœƒsteer\theta_{\text{steer}}italic_ΞΈ start_POSTSUBSCRIPT steer end_POSTSUBSCRIPT. Its objective is to minimize lateral deviation and yaw angle, aligning them close to zero, and ensuring the vehicle stays within the lane.

Adaptive Cruise Control (ACC) is an intelligent module[24] adjusting the car’s speed based on the distance from the preceding vehicle. It manages the vehicle’s acceleration aegosubscriptπ‘Žegoa_{\text{ego}}italic_a start_POSTSUBSCRIPT ego end_POSTSUBSCRIPT to maintain a safe relative distance drelsubscript𝑑reld_{\text{rel}}italic_d start_POSTSUBSCRIPT rel end_POSTSUBSCRIPT greater than dsafesubscript𝑑safed_{\text{safe}}italic_d start_POSTSUBSCRIPT safe end_POSTSUBSCRIPT. ACC targets the user-set cruise speed vsetsubscript𝑣setv_{\text{set}}italic_v start_POSTSUBSCRIPT set end_POSTSUBSCRIPT, adapting to the preceding vehicle’s movement controlled by aleadsubscriptπ‘Žleada_{\text{lead}}italic_a start_POSTSUBSCRIPT lead end_POSTSUBSCRIPT. The safe distance dynamically adjusts according to the relative velocity.

Intersection Crossroad Assistance (ICA) enhances safety in complex intersections[24], integrating LKA and ACC features. ICA determines optimal speed and direction, demonstrating randomness for adaptive control and versatility for flexible adaptation. It navigates intersecting roads, one intelligent vehicle, and multiple environmental vehicles, aiming to traverse the intersection successfully with left, straight, or right turns while avoiding deviations or collisions.

4.2 Research Questions

To assess the effectiveness of the abstract model based on spatio-temporal value semantics, we investigate the following research questions:

Research Question 1 (RQ1): How does the performance of the abstract model, grounded in spatio-temporal value semantics, fare in terms of both simplicity and accuracy?

Research Question 2 (RQ2): Does the abstract MDP model exhibit decision-making performance that approximates that of the true model? Moreover, is there a semantic equivalence between the abstract and true models?

Research Question 3 (RQ3): Can abstract models effectively guide the learning process in DRL, specifically addressing issues of low data utilization and poor generalization, consequently leading to accelerated training

4.3 Experiment Setup

4.3.1 Metrics for Comparison

Euclidean Metricis used to measure the straight-line distance between two states in Euclidean space.For states s1⁒(p1,p2,…,pn)subscript𝑠1subscript𝑝1subscript𝑝2…subscript𝑝𝑛s_{1}(p_{1},p_{2},...,p_{n})italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ( italic_p start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_p start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_p start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) and s2⁒(q1,q2,…,qn)subscript𝑠2subscriptπ‘ž1subscriptπ‘ž2…subscriptπ‘žπ‘›s_{2}(q_{1},q_{2},...,q_{n})italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ( italic_q start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_q start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , … , italic_q start_POSTSUBSCRIPT italic_n end_POSTSUBSCRIPT ) in S𝑆Sitalic_S, the Euclidean metric is defined as:

d⁒(s1,s2)=βˆ‘i=1n(qiβˆ’pi)2.𝑑subscript𝑠1subscript𝑠2superscriptsubscript𝑖1𝑛superscriptsubscriptπ‘žπ‘–subscript𝑝𝑖2d(s_{1},s_{2})=\sqrt{\sum_{i=1}^{n}(q_{i}-p_{i})^{2}}.italic_d ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) = square-root start_ARG βˆ‘ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT ( italic_q start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_p start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT ) start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT end_ARG .(8)

Multi-Step Metric:For any s1,s2∈Ssubscript𝑠1subscript𝑠2𝑆s_{1},s_{2}\in Sitalic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ∈ italic_S, the multi-step metric is defined as:

dM(s1,s2)β‰œmaxo∈O⁒(s1)∩O⁒(s2){cR|R(s1,a)βˆ’R(s2,o)|\displaystyle d_{\mathrm{M}}\left(s_{1},s_{2}\right)\triangleq\max_{o\in O%\left(s_{1}\right)\cap O\left(s_{2}\right)}\left\{c_{R}\left|R\left(s_{1},a%\right)-R\left(s_{2},o\right)\right|\right.italic_d start_POSTSUBSCRIPT roman_M end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) β‰œ roman_max start_POSTSUBSCRIPT italic_o ∈ italic_O ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) ∩ italic_O ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) end_POSTSUBSCRIPT { italic_c start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT | italic_R ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_a ) - italic_R ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_o ) |(9)
+cPdp(P(β‹…βˆ£s1,o),P(β‹…βˆ£s2,o))}\displaystyle\left.\quad+c_{P}d_{p}\left(P\left(\cdot\mid s_{1},o\right),P%\left(\cdot\mid s_{2},o\right)\right)\right\}+ italic_c start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT italic_d start_POSTSUBSCRIPT italic_p end_POSTSUBSCRIPT ( italic_P ( β‹… ∣ italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_o ) , italic_P ( β‹… ∣ italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT , italic_o ) ) }
+c𝔻⁒𝔻⁒[A⁒(s1),A⁒(s2)],subscript𝑐𝔻𝔻𝐴subscript𝑠1𝐴subscript𝑠2\displaystyle\quad+c_{\mathbb{D}}\mathbb{D}[A(s_{1}),A(s_{2})],+ italic_c start_POSTSUBSCRIPT blackboard_D end_POSTSUBSCRIPT blackboard_D [ italic_A ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) , italic_A ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ] ,

where cR,cPsubscript𝑐𝑅subscript𝑐𝑃c_{R},c_{P}italic_c start_POSTSUBSCRIPT italic_R end_POSTSUBSCRIPT , italic_c start_POSTSUBSCRIPT italic_P end_POSTSUBSCRIPT, and cℍsubscript𝑐ℍc_{\mathbb{H}}italic_c start_POSTSUBSCRIPT blackboard_H end_POSTSUBSCRIPT are positive constants. A⁒(s)𝐴𝑠A(s)italic_A ( italic_s ) is the set of available actions for each state s𝑠sitalic_s.The function 𝔻⁒[x,y]=0𝔻π‘₯𝑦0\mathbb{D}[x,y]=0blackboard_D [ italic_x , italic_y ] = 0 if x=yπ‘₯𝑦x=yitalic_x = italic_y, and 1 otherwise.c𝔻subscript𝑐𝔻c_{\mathbb{D}}italic_c start_POSTSUBSCRIPT blackboard_D end_POSTSUBSCRIPT is a sufficiently large constant such that dM⁒(s1,s2)≀ϡsubscript𝑑Msubscript𝑠1subscript𝑠2italic-Ο΅d_{\mathrm{M}}\left(s_{1},s_{2}\right)\leq\epsilonitalic_d start_POSTSUBSCRIPT roman_M end_POSTSUBSCRIPT ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT , italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT ) ≀ italic_Ο΅ implies that A⁒(s1)𝐴subscript𝑠1A\left(s_{1}\right)italic_A ( italic_s start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT ) equals A⁒(s2)𝐴subscript𝑠2A\left(s_{2}\right)italic_A ( italic_s start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT )[15].

4.3.2 DRL Setup

During the data collection phase, we employ curiosity-driven TD3 with Random Network Distillation (RND)[9] to derive control strategies for LKA and ACC. Additionally, curiosity-driven DQN is utilized to generate control strategies for ICA, exploring the case environment and gathering system trajectories.

In each scenario, we simulate the curiosity-driven RL controller 1000 times to accumulate experience. This experience is then partitioned into a modeling set and a validation set in an 8:2:828:28 : 2 ratio. The former is utilized for constructing the abstract MDP, while the latter is employed to scrutinize the semantic gaps between the abstract MDP and the concrete MDP.The hyperparameter configurations for the deep learning networks in the three cases are detailed in Table1.

Case StudyAlgorithmActivate functionSizeLearning rateγ𝛾\gammaitalic_Ξ³Ο΅italic-Ο΅\epsilonitalic_Ο΅Soft tau
CriticActor
LKATD3ReLU2Γ—12821282\times 1282 Γ— 1281.00E-031.00E-030.95/1.00E-02
ACCTD3ReLU2Γ—12821282\times 1282 Γ— 1281.00E-041.00E-040.95/1.00E-02
ICADQNReLU2Γ—12821282\times 1282 Γ— 1282.00E-030.980.01/

The rewards for LKA, ACC, and ICA are specified as follows:

LKA Reward:rt=1βˆ’dt2βˆ’cos2⁑θt,superscriptπ‘Ÿπ‘‘1superscriptsubscript𝑑𝑑2superscript2subscriptπœƒπ‘‘\displaystyle\quad r^{t}=1-d_{t}^{2}-\cos^{2}\theta_{t},italic_r start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = 1 - italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT - roman_cos start_POSTSUPERSCRIPT 2 end_POSTSUPERSCRIPT italic_ΞΈ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ,
ACC Reward:rt=0.05β‹…vt,superscriptπ‘Ÿπ‘‘β‹…0.05subscript𝑣𝑑\displaystyle\quad r^{t}=0.05\cdot v_{t},italic_r start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = 0.05 β‹… italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT ,
ICA Reward:rt=0.05β‹…vtβˆ’0.0005β‹…dgt.superscriptπ‘Ÿπ‘‘β‹…0.05subscript𝑣𝑑⋅0.0005subscriptsuperscript𝑑𝑑𝑔\displaystyle\quad r^{t}=0.05\cdot v_{t}-0.0005\cdot d^{t}_{g}.italic_r start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT = 0.05 β‹… italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT - 0.0005 β‹… italic_d start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT .
  • β€’

    rtsuperscriptπ‘Ÿπ‘‘r^{t}italic_r start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT is the reward at time step t𝑑titalic_t.

  • β€’

    dtsubscript𝑑𝑑d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represents the lateral offset of the vehicle’s current position from the lane center. A smaller dtsubscript𝑑𝑑d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT leads to a higher reward.

  • β€’

    ΞΈtsubscriptπœƒπ‘‘\theta_{t}italic_ΞΈ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represents the angle between the current direction of the vehicle and the lane center direction. A smaller ΞΈtsubscriptπœƒπ‘‘\theta_{t}italic_ΞΈ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT leads to a higher reward.

  • β€’

    vtsubscript𝑣𝑑v_{t}italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT represents the velocity of the vehicle at time step t𝑑titalic_t.

  • β€’

    dgtsubscriptsuperscript𝑑𝑑𝑔d^{t}_{g}italic_d start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT represents the distance from the target g𝑔gitalic_g at time t𝑑titalic_t.

4.3.3 Abstraction Setup

The hyper-parameters for the abstraction process are shown in Table2. We define the average semantic error, denoted as eMEANsubscript𝑒MEANe_{\text{MEAN}}italic_e start_POSTSUBSCRIPT MEAN end_POSTSUBSCRIPT, to be 0.005, and the maximum semantic error, denoted as eMAXsubscript𝑒MAXe_{\text{MAX}}italic_e start_POSTSUBSCRIPT MAX end_POSTSUBSCRIPT, to be 0.01. These values represent 0.25% of the overall value range. Simultaneously, we set the REDUCTION_LEVEL (rdsubscriptπ‘Ÿπ‘‘r_{d}italic_r start_POSTSUBSCRIPT italic_d end_POSTSUBSCRIPT) to 0.5%. The parameter kπ‘˜kitalic_k is determined through the elbow method, average silhouette method, and gap statistic method[20]. We conduct a comparative analysis of abstractions using the traditional Euclidean distance method, the Multi-Step distance abstraction based on the state-of-the-art method proposed by Guo et al.[15], and the abstraction employing the spatio-temporal value metric.

Case studySemanticsdminsubscript𝑑mind_{\text{min}}italic_d start_POSTSUBSCRIPT min end_POSTSUBSCRIPTdmaxsubscript𝑑maxd_{\text{max}}italic_d start_POSTSUBSCRIPT max end_POSTSUBSCRIPTnminsubscript𝑛minn_{\text{min}}italic_n start_POSTSUBSCRIPT min end_POSTSUBSCRIPT
LKAdtsubscript𝑑𝑑d_{t}italic_d start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT0.0010.0051%
ΞΈtsubscriptπœƒπ‘‘\theta_{t}italic_ΞΈ start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT0.0010.0050.1%
ACCvtsubscript𝑣𝑑v_{t}italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT0.0100.0500.5%
ICAvtsubscript𝑣𝑑v_{t}italic_v start_POSTSUBSCRIPT italic_t end_POSTSUBSCRIPT0.0100.0500.5%
dgtsuperscriptsubscript𝑑𝑔𝑑d_{g}^{t}italic_d start_POSTSUBSCRIPT italic_g end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_t end_POSTSUPERSCRIPT0.0010.0051%

4.3.4 Two Indices

To address RQ1, we evaluated the abstraction models with a focus on simplicity and accuracy using two indices: Compression Ratio (CR) and Mean Absolute Error (MAE). The formulas for CR and MAE are as follows:

CR=|S^||S|,CR^𝑆𝑆\displaystyle\textit{CR}=\frac{{\left\lvert\hat{S}\right\rvert}}{{\left\lvert S%\right\rvert}},CR = divide start_ARG | over^ start_ARG italic_S end_ARG | end_ARG start_ARG | italic_S | end_ARG ,(10)
MAE=1nβ’βˆ‘i=1nMEAN⁒|yβˆ’y^|,MAE1𝑛superscriptsubscript𝑖1𝑛MEAN𝑦^𝑦\displaystyle\textit{MAE}=\frac{1}{n}\sum_{i=1}^{n}\text{MEAN}\left\lvert y-%\hat{y}\right\rvert,MAE = divide start_ARG 1 end_ARG start_ARG italic_n end_ARG βˆ‘ start_POSTSUBSCRIPT italic_i = 1 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_n end_POSTSUPERSCRIPT MEAN | italic_y - over^ start_ARG italic_y end_ARG | ,(11)

where |S^|^𝑆\left\lvert\hat{S}\right\rvert| over^ start_ARG italic_S end_ARG | represents the number of abstract states, |S|𝑆\left\lvert S\right\rvert| italic_S | represents the number of original concrete states, y𝑦yitalic_y is the prediction output of the abstract model, y^^𝑦\hat{y}over^ start_ARG italic_y end_ARG is the output of the true model, and MEAN⁒|yβˆ’y^|MEAN𝑦^𝑦\text{MEAN}\left\lvert y-\hat{y}\right\rvertMEAN | italic_y - over^ start_ARG italic_y end_ARG | measures the average deviation from the reference value. n𝑛nitalic_n denotes the number of abstract states generated in a single experiment. CR assesses the simplicity of the abstraction model, indicating the quality of the abstraction effect, while MAE reveals the accuracy of the abstraction model in preserving the original semantic information.

4.3.5 Abstract MDP-guided Training for DRL

To investigate the potential guiding impact of an abstract MDP on DRL, we systematically modify the environmental variables within three ICPS. We aim to foster a more secure DRL learning process under the guidance of the abstract MDP. We introduce the influence of the abstract MDP on action selection within the DRL framework, with the overarching objective of expediting the convergence of DRL and establishing a joint strategy that is both safe and generalized.

The influence of the abstract MDP on action selection is incorporated into the DRL by leveraging the abstract MDP’s action output as a variable influencing the output of the neural network (NN). This integration is expressed mathematically as:

a=Ξ±β‹…aN⁒N+Ξ²β‹…aM⁒D⁒P,a⋅𝛼subscriptπ‘Žπ‘π‘β‹…π›½subscriptπ‘Žπ‘€π·π‘ƒ\displaystyle\textit{a}=\alpha\cdot a_{NN}+\beta\cdot a_{MDP},a = italic_Ξ± β‹… italic_a start_POSTSUBSCRIPT italic_N italic_N end_POSTSUBSCRIPT + italic_Ξ² β‹… italic_a start_POSTSUBSCRIPT italic_M italic_D italic_P end_POSTSUBSCRIPT ,(12)

where a represents the joint action output, aN⁒Nsubscriptπ‘Žπ‘π‘a_{NN}italic_a start_POSTSUBSCRIPT italic_N italic_N end_POSTSUBSCRIPT denotes the action output from the NN, and aM⁒D⁒Psubscriptπ‘Žπ‘€π·π‘ƒa_{MDP}italic_a start_POSTSUBSCRIPT italic_M italic_D italic_P end_POSTSUBSCRIPT signifies the action output from the abstract MDP. The coefficients α𝛼\alphaitalic_Ξ± and β𝛽\betaitalic_Ξ² are set to 0.5 based on existing empirical values, reflecting the joint contributions of the NN and the abstract MDP in the synthesized action.

4.4 Experimental Results and Analysis

RQ1: The Simplicity and Accuracy of Abstract MDP Our investigation into the performance of abstract models, guided by spatio-temporal value semantics, is revealed through a combination of visual and quantitative data analyses. Fig.3 offers a striking visual narrative of this examination. Fig.3 shows we transform detailed raw data into a coherent, abstract grid. Within this grid, distinct hues represent individual abstract states, effectively simplifying the complex state space without compromising its comprehensive nature. This approach exemplifies how abstract modeling can achieve a harmonious blend of simplicity in design with precision in data representation.

The spatio-temporal value metric’s efficacy in capturing the essence of the state space without overcomplication is quantitatively reinforced by Tab.3. Here, the CR and MAE serve as the principal indices for evaluation. The CR index, reflecting the proportionate reduction in state-space size, alongside the MAE index, indicating the average error magnitude, collectively substantiates the spatio-temporal value metric’s superior performance across various ICPS. Notably, in the ACC scenario, the spatio-temporal approach achieves CRs of 10.12% and 13.02%, underscoring a substantial simplification of the model. Concurrently, the approach’s consistently lower MAE values across all scenarios, compared to Euclidean and Multi-Step approaches, highlight its enhanced accuracy.

Response to RQ1: Spatio-temporal value metric simplifies the state space without sacrificing semantic accuracy, ensuring the abstract model’s effectiveness in algorithmic analysis and decision-making. The model’s simplicity is achieved without a concomitant increase in error, demonstrating an optimal balance between the two desired attributes of computational efficiency and semantic precision. These findings articulate the spatio-temporal value metric’s pivotal role in generating abstract models that are not only operationally feasible but also semantically representative of the real-world systems they aim to emulate.

RQ2: Semantic Equivalence in Abstract MDPTo address RQ2, we leveraged the PRISM model checker for encoding the abstraction model, ensuring the retention of critical state information such as rewards, transition probabilities, and metrics pertaining to lane adherence and collision incidents. We crafted specific properties that resonate with rewards and safety considerations, thereby facilitating a comprehensive evaluation of the extent to which the abstract model parallels the true MDP in semantic terms. This approach enabled a thorough exploration of the decision-making implications inherent in the abstract model, shedding light on its strengths and constraints.

In Tab4, we presented the encoded abstraction models for diverse scenarios using PRISM, emphasizing properties that encapsulate semantic information. Through PRISM, the semantic equivalence between the abstract and the true models was quantitatively assessed. For instance, in scenarios such as a four-way intersection, various properties like R⁒m⁒i⁒n=?⁒[C<=60]π‘…π‘šπ‘–π‘›?delimited-[]𝐢60Rmin=?[C<=60]italic_R italic_m italic_i italic_n = ? [ italic_C < = 60 ] (minimum expected cumulative reward within 60 time steps), P⁒m⁒a⁒x=?⁒[F<=60;i⁒s⁒O⁒u⁒t⁒O⁒f⁒L⁒a⁒n⁒e=1]π‘ƒπ‘šπ‘Žπ‘₯?delimited-[]formulae-sequence𝐹60π‘–π‘ π‘‚π‘’π‘‘π‘‚π‘“πΏπ‘Žπ‘›π‘’1Pmax=?[F<=60;isOutOfLane=1]italic_P italic_m italic_a italic_x = ? [ italic_F < = 60 ; italic_i italic_s italic_O italic_u italic_t italic_O italic_f italic_L italic_a italic_n italic_e = 1 ] (maximum probability of lane departure within 60 steps), and P⁒m⁒a⁒x=?⁒[F<=60;i⁒s⁒C⁒r⁒a⁒s⁒h⁒e⁒d=1]π‘ƒπ‘šπ‘Žπ‘₯?delimited-[]formulae-sequence𝐹60π‘–π‘ πΆπ‘Ÿπ‘Žπ‘ β„Žπ‘’π‘‘1Pmax=?[F<=60;isCrashed=1]italic_P italic_m italic_a italic_x = ? [ italic_F < = 60 ; italic_i italic_s italic_C italic_r italic_a italic_s italic_h italic_e italic_d = 1 ] (maximum probability of collision involvement within 60 time steps) were analyzed to measure decision-making effects.

Our analysis in the sphere of MDP modeling, particularly through semantic gap analysis via PRISM, highlights the pronounced superiority of the spatio-temporal value approach over the Euclidean and Multi-Step approaches. This conclusion is drawn from a systematic juxtaposition across varied scenarios like LKA, ACC, and ICA. The spatio-temporal value approach consistently exhibited lower discrepancies in predicting minimum expected cumulative rewards and maximum probabilities of specific events, thereby indicating a higher fidelity in mimicking real-world dynamics. This aspect is especially pronounced in intricate scenarios like ACC and ICA, where the approach’s accuracy in capturing semantic nuances suggests its enhanced capability in semantic property representation. The proficiency of the spatio-temporal value approach in bridging semantic gaps accentuates its robustness and reliability as a tool in stochastic decision-making frameworks, where fidelity to real-world conditions is paramount.

Response to RQ2: The findings corroborate that abstraction based on spatio-temporal values not only ensures semantic alignment with the true model but also offers substantial insights for refining training strategies. This aligns with the overarching goal of achieving a harmonious balance between model abstraction and real-world decision-making accuracy.

Case StudyNumber of Stateskπ‘˜kitalic_kMetricAbstract States1s⁒tsuperscript1𝑠𝑑{}^{1^{st}}start_FLOATSUPERSCRIPT 1 start_POSTSUPERSCRIPT italic_s italic_t end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPTAbstract States2n⁒dsuperscript2𝑛𝑑{}^{2^{nd}}start_FLOATSUPERSCRIPT 2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPTCRMAE
CR1s⁒tsuperscript1𝑠𝑑{}^{1^{st}}start_FLOATSUPERSCRIPT 1 start_POSTSUPERSCRIPT italic_s italic_t end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPTCR2n⁒dsuperscript2𝑛𝑑{}^{2^{nd}}start_FLOATSUPERSCRIPT 2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPTMAE1s⁒tsuperscript1𝑠𝑑{}^{1^{st}}start_FLOATSUPERSCRIPT 1 start_POSTSUPERSCRIPT italic_s italic_t end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPTMAE2n⁒dsuperscript2𝑛𝑑{}^{2^{nd}}start_FLOATSUPERSCRIPT 2 start_POSTSUPERSCRIPT italic_n italic_d end_POSTSUPERSCRIPT end_FLOATSUPERSCRIPT
Euclidean2931200024.38%16.87%17.08191.696
ElbowMulti-Step193016.28%85.008
\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 1930\cellcolorgray!25 16.28%\cellcolorgray!25 43.43
Euclidean217018.30%169.578
SilhouetteMulti-Step120010.12%107.043
ACC11858\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 1200\cellcolorgray!25 10.12%\cellcolorgray!25 46.796
Euclidean221018.64%168.003
GapMulti-Step126010.63%109.632
\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 1260\cellcolorgray!25 10.63%\cellcolorgray!25 48.189
Euclidean176514.88%199.788
CanopyMulti-Step154413.02%84.018
\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 1544\cellcolorgray!25 13.02%\cellcolorgray!25 46.419
Euclidean3804240026.93%16.99%6.89251.344
ElbowMulti-Step187013.24%28.151
\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 1870\cellcolorgray!25 13.24%\cellcolorgray!25 11.698
Euclidean276019.54%28.229
SilhouetteMulti-Step153010.83%37.016
LKA14124\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 1530\cellcolorgray!25 10.83%\cellcolorgray!25 16.347
Euclidean288020.39%19.134
GapMulti-Step206014.59%37.2
\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 2060\cellcolorgray!25 14.59%\cellcolorgray!25 16.533
Euclidean299521.21%51.327
CanopyMulti-Step257218.21%19.469
\cellcolorgray!25 Spatio-temporal\cellcolorgray!252572\cellcolorgray!25 18.21%\cellcolorgray!25 16.346
Euclidean4622320022.98%15.91%21.883109.249
ElbowMulti-Step276013.72%103.912
\cellcolorgray!25 Spatio-temporal\cellcolorgray!252760\cellcolorgray!25 13.72%\cellcolorgray!25 77.111
Euclidean357017.75%112.909
SilhouetteMulti-Step210010.44%106.304
ICA20110\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 2100\cellcolorgray!25 10.44%\cellcolorgray!25 76.271
Euclidean423021.03%113.167
GapMulti-Step230011.44%107.561
\cellcolorgray!25 Spatio-temporal\cellcolorgray!25 2300\cellcolorgray!25 11.44%\cellcolorgray!25 76.991
Euclidean453422.55%113.183
CanopyMulti-Step349217.36%108/197
\cellcolorgray!25 Spatio-temporal\cellcolorgray!253492\cellcolorgray!25 17.36%\cellcolorgray!25 78.431

RQ3: Abstract MDP-Guided Learning Process Fig.4 features three line graphs, each corresponding to distinct scenarios in RL: (a) ACC, (b) LKA, and (c) ICA. Each graph portrays the performance of two models across numerous episodes: the conventional model (denoted as TD3 or DDPG) and the abstract MDP-Guided model (referred to as MDP-guided TD3 or MDP-guided DQN). The vertical axis represents the obtained reward, while the horizontal axis indicates the episode count.

An examination of Fig.4 underscores that, across all considered scenarios, the abstract MDP-Guided models (depicted in orange) consistently surpass the performance of the traditional models (depicted in blue). Of particular significance is the discernibly accelerated escalation in reward exhibited by the abstract MDP-Guided models, indicative of more efficient learning dynamics. This acceleration is particularly pronounced in the initial episodes, where the abstract MDP-Guided models attain higher rewards at a faster pace in comparison to their traditional counterparts. Moreover, our proposed approach demonstrates superior guidance, especially in intricate scenarios.

The discerned trends across all three scenarios affirm the pivotal role of abstract models in significantly enhancing the efficiency of the RL process. By adeptly simplifying the inherent complexity of the environment and channeling learning efforts towards critical aspects, abstract models expedite the policy learning process. This, in turn, results in a more expeditious and effective training regimen, thereby effectively addressing the posed research question (RQ3).

Case StudyMetricPropertiesVerification ResultReal ValueError
LKAEuclideanRmin=?[C<=51]44.4348.504.07
Pmax=?[F<=51;isOutOfLane=1]0.13%0.00%0.13
Multi-StepRmin=?[C<=51]44.7248.503.78
Pmax=?[F<=51;isOutOfLane=1]0.13%0.0%0.13
\cellcolorgray!25\cellcolorgray!25 Rmin=?[C<=51]\cellcolorgray!25 46.92\cellcolorgray!25 48.50\cellcolorgray!25 1.58
\cellcolorgray!25Spatio-temporal Value\cellcolorgray!25Pmax=?[F<=51;isOutOfLane=1]\cellcolorgray!250.10%\cellcolorgray!250.00%\cellcolorgray!25 0.10
ACCEuclideanRmin=?[C<=51]57.5359.942.41
Pmax=?[F<=51;isOutOfLane=1]0.7%0.00%0.7
Pmax=?[F<=51;isCrashed=1]0.06%0.00%0.06
Multi-StepRmin=?[C<=51]55.9859.943.96
Pmax=?[F<=51;isOutOfLane=1]1.00%0.00%1.00
Pmax=?[F<=51;isCrashed=1]0.06%0.00%0.06
\cellcolorgray!25\cellcolorgray!25 Rmin=?[C<=51]\cellcolorgray!25 60.33\cellcolorgray!25 59.94\cellcolorgray!25 -0.39
\cellcolorgray!25\cellcolorgray!25 Pmax=?[F<=51;isOutOfLane=1]\cellcolorgray!25 0.01%\cellcolorgray!25 0.00%\cellcolorgray!25 0.01
\cellcolorgray!25 Spatio-temporal Value\cellcolorgray!25 Pmax=?[F<=51;isCrashed=1]\cellcolorgray!25 0.19%\cellcolorgray!25\cellcolorgray!25 0.00%\cellcolorgray!25 0.19
ICAEuclideanRmin=?[C<=60]8.389.360.98
Pmax=?[F<=60;isCrashed=1]18.73%20.80%2.07
Pmax=?[F<=60;reachDest=1]0.17%4.60%4.43
Multi-StepRmin=?[C<=60]8.999.360.37
Pmax=?[F<=60;isCrashed=1]17.33%20.80%3.47
Pmax=?[F<=60;reachDest=1]3.25%4.60%1.35
\cellcolorgray!25\cellcolorgray!25 Rmin=?[C<=60]\cellcolorgray!25 9.38\cellcolorgray!25 9.36\cellcolorgray!25 -0.02
\cellcolorgray!25\cellcolorgray!25 Pmax=?[F<=60;isCrashed=1]\cellcolorgray!25 19.35%\cellcolorgray!25 20.80%\cellcolorgray!25 1.45
\cellcolorgray!25 Spatio-temporal Value\cellcolorgray!25 Pmax=?[F<=60;reachDest=1]\cellcolorgray!25 4.50%\cellcolorgray!25 4.60%\cellcolorgray!25 0.10

Response to RQ3: Fig.4 strongly supports the pivotal role of abstract models in guiding the DRL process. Offering a structured approach to the state space and incorporating domain knowledge through abstraction, these models enhance data utilization and generalization. This leads to accelerated convergence towards higher rewards, signifying a more rapid training process. Furthermore, the improved performance in initial episodes suggests that abstract models effectively address challenges related to data scarcity, leveraging abstracted information to guide the learning algorithm toward profitable strategies early in the training process.

Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (3)
Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (4)

Using the ACC scenario as an illustration, the horizontal axes in both Fig.3 and Fig.3 designate relative speed, whereas the vertical axes signify relative distance, both subjected to normalization. Fig.3 delineates the unprocessed traces of the ADS. Conversely, Fig.3 elucidates abstract traces, encapsulating data that transcends the boundaries of the exploration scope. Distinct hues correspond to disparate abstract states in the representation.

Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (5)
Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (6)
Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (7)

5 Related Work

Action and State Abstraction in MDPs.The innovative concept of MDP action abstraction proves to be a strategic solution for alleviating computational burdens and enhancing problem-solving efficiency by compressing the action space while maintaining solution quality. Chen and Xu[12] pioneered a method grounded in discrete Fourier transform for action abstraction in MDPs with deterministic uncertainty. Extending this paradigm to MDPs with continuous action spaces, Omidshafiei[31] applied the Fourier transform. Additionally, Bita[6] contributes a comprehensive framework that leverages the nondeterministic situation calculus and ConGolog programming language to abstract agent actions in nondeterministic domains, facilitating strategic reasoning and synthesis.

Simultaneously, when employing function approximation to abstract states, involving the intricate process of mapping the concrete space to a lower-dimensional counterpart optimized through RL objectives. Studies integrating NNs and Hierarchical Reinforcement Learning (HRL), such as feudal HRL[4, 30] and option-critic with NNs[16, 15], actively address the nuanced realms of state and temporal abstraction. Abel[2] introduces transitive and PAC state abstractions, triumphing in sample complexity reduction despite potential performance drawbacks. Misra[27] introduces HOMER, a pioneering sample-efficient exploration and DRL algorithm tailored for rich observation environments, guaranteeing provable efficiency and computational effectiveness in specific scenarios. However, these methods, while effective in specific scenarios, may lack semantic preservation in the process of abstraction.

Abstract MDPs for DRL. The realm of MDP abstraction, meticulously preserving transition and reward structures[23], emerges as a linchpin for augmenting RL efficiency and generalization. Existing options, such as option-bisimulation[11, 10], grapple with computational intricacies. Abel et al.[3] pioneer state-abstraction-option classes with insightful suboptimality bounds. Vans[34] introduces MDP hom*omorphic networks, harnessing symmetries for enhanced convergence. Guo[15] with a Multi-Step metric for state-temporal abstraction.Junges[19] takes advantage of the inherent hierarchy of the Markov decision process and divides the state space into macro-level and sub-level. It regards the unresolved sub-level as an uncertainty for constraint and progressive analysis, to reduce the state space explosion problem.Feng[14] unfolds the potential of edited MDPs, efficiently learning safety-critical states from naturalistic driving data, thus showcasing accelerated testing and training of safety-critical autonomous systems. The existing work on abstract MDPs demonstrates notable contributions but faces challenges in terms of verification, safety, and generalization across diverse environments. Strengthening these aspects is essential for advancing the reliability and applicability of abstraction techniques in real-world scenarios.

6 Conclusion and Future Work

Our approach to abstract modeling using spatio-temporal value semantics represents a substantial leap in ensuring the dependability of machine learning systems, particularly in decision-making processes for ICPS employing DRL. This method is characterized by its universal application across a variety of ICPS scenarios, demonstrating its robustness and versatility. However, it’s crucial to recognize certain limitations, such as the learning process efficiency for the abstract model, which we aim to enhance in our future work.

Expanding upon this, the next phase of our research will focus on refining the learning algorithms to accelerate the training phase without compromising the model’s integrity. Additionally, moving beyond experimental validations, we plan to incorporate formal theorem-based evaluations to establish the equivalence between our abstract models and their respective true MDPs. This shift towards a more rigorous theoretical framework, such as bisimulation, will allow for a deeper understanding and validation of the models’ accuracy and reliability.

Moreover, our future endeavors aim to broaden the application of our abstraction technique to more complex ICPS domains. This extension includes optimizing the efficiency of state exploration in DRL training, a crucial aspect for effective navigation in intricate and dynamic environments. Additionally, we intend to explore reachability and robustness aspects within these systems, ensuring that our models not only make predictions but also respond adaptively to real-world scenarios. Through these comprehensive research efforts, our overarching objective is to make significant contributions to the field of abstract modeling, enhancing its effectiveness and reliability in safety-critical and dynamic environments. Addressing these challenges, ensuring model safety, and establishing a cohesive framework remain pivotal for the successful and robust application of abstraction techniques in practical scenarios.

References

  • [1]Abel, D.: A theory of abstraction in reinforcement learning. arXiv preprint arXiv:2203.00397 (2022)
  • [2]Abel, D., Arumugam, D., Lehnert, L., Littman, M.: State abstractions for lifelong reinforcement learning. In: International Conference on Machine Learning. pp. 10–19. PMLR (2018)
  • [3]Abel, D., Umbanhowar, N., Khetarpal, K., Arumugam, D., Precup, D., Littman, M.: Value preserving state-action abstractions. In: International Conference on Artificial Intelligence and Statistics. pp. 1639–1650. PMLR (2020)
  • [4]Andreas, J., Klein, D., Levine, S.: Modular multitask reinforcement learning with policy sketches. In: International conference on machine learning. pp. 166–175. PMLR (2017)
  • [5]Bacon, P.L., Harb, J., Precup, D.: The option-critic architecture. In: Proceedings of the AAAI conference on artificial intelligence. vol.31 (2017)
  • [6]Banihashemi, B., DeGiacomo, G., Lesperance, Y.: Abstraction of nondeterministic situation calculus action theories. In: Elkind, E. (ed.) Proceedings of the Thirty-Second International Joint Conference on Artificial Intelligence, IJCAI-23. pp. 3112–3122. International Joint Conferences on Artificial Intelligence Organization (8 2023). https://doi.org/10.24963/ijcai.2023/347, https://doi.org/10.24963/ijcai.2023/347, main Track
  • [7]Bojanowski, P., Grave, E., Joulin, A., Mikolov, T.: Enriching word vectors with subword information. Trans. Assoc. Comput. Linguistics 5, 135–146 (2017). https://doi.org/10.1162/tacl_a_00051, https://doi.org/10.1162/tacl_a_00051
  • [8]Brunke, L., Greeff, M., Hall, A.W., Yuan, Z., Zhou, S., Panerati, J., Schoellig, A.P.: Safe learning in robotics: From learning-based control to safe reinforcement learning. Annual Review of Control, Robotics, and Autonomous Systems 5, 411–444 (2022)
  • [9]Burda, Y., Edwards, H., Storkey, A.J., Klimov, O.: Exploration by random network distillation. In: 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net (2019), https://openreview.net/forum?id=H1lJJnR5Ym
  • [10]Castro, P., Precup, D.: Using bisimulation for policy transfer in mdps. In: Proceedings of the AAAI conference on artificial intelligence. vol.24, pp. 1065–1070 (2010)
  • [11]Castro, P.S.: On planning, prediction and knowledge transfer in Fully and Partially Observable Markov Decision Processes. McGill University (Canada) (2011)
  • [12]Chen, Y., Xu, J.: Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. The Journal of Machine Learning Research 17(1), 882–938 (2016)
  • [13]Dockhorn, A., Kruse, R.: State and action abstraction for search and reinforcement learning algorithms. In: Artificial Intelligence in Control and Decision-making Systems: Dedicated to Professor Janusz Kacprzyk, pp. 181–198. Springer (2023)
  • [14]Feng, S., Sun, H., Yan, X., Zhu, H., Zou, Z., Shen, S., Liu, H.X.: Dense reinforcement learning for safety validation of autonomous vehicles. Nature 615(7953), 620–627 (2023)
  • [15]Guo, S., Yan, Q., Su, X., Hu, X., Chen, F.: State-temporal compression in reinforcement learning with the reward-restricted geodesic metric. IEEE Transactions on Pattern Analysis and Machine Intelligence 44(9), 5572–5589 (2021)
  • [16]Jiang, N., Kulesza, A., Singh, S.: Abstraction selection in model-based reinforcement learning. In: International Conference on Machine Learning. pp. 179–188. PMLR (2015)
  • [17]Jiang, Y., Gu, S.S., Murphy, K.P., Finn, C.: Language as an abstraction for hierarchical deep reinforcement learning. Advances in Neural Information Processing Systems 32 (2019)
  • [18]Jin, P., Tian, J., Zhi, D., Wen, X., Zhang, M.: Trainify: A cegar-driven training and verification framework for safe deep reinforcement learning. In: Shoham, S., Vizel, Y. (eds.) Computer Aided Verification - 34th International Conference, CAV 2022, Haifa, Israel, August 7-10, 2022, Proceedings, Part I. Lecture Notes in Computer Science, vol. 13371, pp. 193–218. Springer (2022). https://doi.org/10.1007/978-3-031-13185-1_10, https://doi.org/10.1007/978-3-031-13185-1_10
  • [19]Junges, S., Spaan, M.T.: Abstraction-refinement for hierarchical probabilistic models. In: International Conference on Computer Aided Verification. pp. 102–123. Springer (2022)
  • [20]Kassambara, A.: Practical guide to cluster analysis in R: Unsupervised machine learning, vol.1. Sthda (2017)
  • [21]Kulkarni, T.D., Narasimhan, K., Saeedi, A., Tenenbaum, J.: Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation. Advances in neural information processing systems 29 (2016)
  • [22]Kwiatkowska, M., Norman, G., Parker, D.: Probabilistic symbolic model checking with prism: A hybrid approach. International journal on software tools for technology transfer 6, 128–142 (2004)
  • [23]Lavaei, A., Soudjani, S., Frazzoli, E., Zamani, M.: Constructing mdp abstractions using data with formal guarantees. IEEE Control Systems Letters 7, 460–465 (2022)
  • [24]Leurent, E., etal.: An environment for autonomous driving decision-making (2018)
  • [25]Li, L., Walsh, T.J., Littman, M.L.: Towards a unified theory of state abstraction for mdps. In: AI&M (2006)
  • [26]Li, Y., Gao, T., Yang, J., Xu, H., Wu, Y.: Phasic self-imitative reduction for sparse-reward goal-conditioned reinforcement learning. In: International Conference on Machine Learning. pp. 12765–12781. PMLR (2022)
  • [27]Misra, D., Henaff, M., Krishnamurthy, A., Langford, J.: Kinematic state abstraction and provably efficient rich-observation reinforcement learning. In: International conference on machine learning. pp. 6961–6971. PMLR (2020)
  • [28]Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A.A., Veness, J., Bellemare, M.G., Graves, A., Riedmiller, M.A., Fidjeland, A., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., Hassabis, D.: Human-level control through deep reinforcement learning. Nat. 518(7540), 529–533 (2015). https://doi.org/10.1038/nature14236, https://doi.org/10.1038/nature14236
  • [29]Nashed, S.B., Svegliato, J., Bhatia, A., Russell, S., Zilberstein, S.: Selecting the partial state abstractions of mdps: A metareasoning approach with deep reinforcement learning. In: 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). pp. 11665–11670. IEEE (2022)
  • [30]Oh, J., Singh, S., Lee, H., Kohli, P.: Zero-shot task generalization with multi-task deep reinforcement learning. In: International Conference on Machine Learning. pp. 2661–2670. PMLR (2017)
  • [31]Omidshafiei, S., Agha-Mohammadi, A.A., Amato, C., How, J.P.: Decentralized control of partially observable markov decision processes using belief space macro-actions. In: 2015 IEEE international conference on robotics and automation (ICRA). pp. 5962–5969. IEEE (2015)
  • [32]Parker, D.A.: Implementation of symbolic model checking for probabilistic systems. Ph.D. thesis, University of Birmingham (2003)
  • [33]Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Moschitti, A., Pang, B., Daelemans, W. (eds.) Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, EMNLP 2014, October 25-29, 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest Group of the ACL. pp. 1532–1543. ACL (2014). https://doi.org/10.3115/v1/d14-1162, https://doi.org/10.3115/v1/d14-1162
  • [34]Vander Pol, E., Worrall, D., van Hoof, H., Oliehoek, F., Welling, M.: Mdp hom*omorphic networks: Group symmetries in reinforcement learning. Advances in Neural Information Processing Systems 33, 4199–4210 (2020)
  • [35]Radanliev, P., Roure, D.D., Kleek, M.V., Santos, O., Ani, U.: Artificial intelligence in cyber physical systems. AI Soc. 36(3), 783–796 (2021). https://doi.org/10.1007/s00146-020-01049-0, https://doi.org/10.1007/s00146-020-01049-0
  • [36]Schmidt, L.M., Brosig, J., Plinge, A., Eskofier, B.M., Mutschler, C.: An introduction to multi-agent reinforcement learning and review of its application to autonomous mobility. In: 25th IEEE International Conference on Intelligent Transportation Systems, ITSC 2022, Macau, China, October 8-12, 2022. pp. 1342–1349. IEEE (2022). https://doi.org/10.1109/ITSC55140.2022.9922205, https://doi.org/10.1109/ITSC55140.2022.9922205
  • [37]Silver, D., Huang, A., Maddison, C.J., Guez, A., Sifre, L., vanden Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T.P., Leach, M., Kavukcuoglu, K., Graepel, T., Hassabis, D.: Mastering the game of go with deep neural networks and tree search. Nat. 529(7587), 484–489 (2016). https://doi.org/10.1038/nature16961, https://doi.org/10.1038/nature16961
  • [38]Song, J., Xie, X., Ma, L.: Siege: A semantics-guided safety enhancement framework for ai-enabled cyber-physical systems. IEEE Transactions on Software Engineering (2023)
  • [39]Spirtes, P., Glymour, C., Scheines, R.: Reply to humphreys and freedman’s review of causation, prediction, and search. The British journal for the philosophy of science 48(4), 555–568 (1997)
  • [40]Spirtes, P.L., Meek, C., Richardson, T.S.: Causal inference in the presence of latent variables and selection bias. arXiv preprint arXiv:1302.4983 (2013)
  • [41]Yang, W., Xu, C., Pan, M., Ma, X., Lu, J.: Improving verification accuracy of CPS by modeling and calibrating interaction uncertainty. ACM Trans. Internet Techn. 18(2), 20:1–20:37 (2018). https://doi.org/10.1145/3093894, https://doi.org/10.1145/3093894
  • [42]Zhang, M., Du, D., Zhang, M., Zhang, L., Wang, Y., Zhou, W.: A meta-modeling approach for autonomous driving scenario based on STTD. Int. J. Softw. Informatics 11(3), 315–333 (2021). https://doi.org/10.21655/IJSI.1673-7288.00262, https://doi.org/10.21655/ijsi.1673-7288.00262
Spatio-temporal Value Semantics-based Abstraction for Dense Deep Reinforcement Learning (2024)
Top Articles
Latest Posts
Article information

Author: Pres. Lawanda Wiegand

Last Updated:

Views: 6004

Rating: 4 / 5 (51 voted)

Reviews: 90% of readers found this page helpful

Author information

Name: Pres. Lawanda Wiegand

Birthday: 1993-01-10

Address: Suite 391 6963 Ullrich Shore, Bellefort, WI 01350-7893

Phone: +6806610432415

Job: Dynamic Manufacturing Assistant

Hobby: amateur radio, Taekwondo, Wood carving, Parkour, Skateboarding, Running, Rafting

Introduction: My name is Pres. Lawanda Wiegand, I am a inquisitive, helpful, glamorous, cheerful, open, clever, innocent person who loves writing and wants to share my knowledge and understanding with you.