
Nearly Optimal Control Scheme for Discrete-Time Nonlinear Systems With Finite Approximation Errors Using Generalized Value Iteration Algorithm ⋆

Qinglai Wei, Derong Liu

The State Key Laboratory of Management and Control for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China (Tel: +86-10-82544761; Fax: +86-10-82544799; emails: [email protected], [email protected]).

Abstract: In this paper, a new generalized value iteration algorithm is developed to solve infinite horizon optimal control problems for discrete-time nonlinear systems. The idea is to use iterative adaptive dynamic programming (ADP) to obtain the iterative control law which makes the iterative performance index function reach the optimum. The generalized value iteration algorithm permits an arbitrary positive semi-definite function to initialize it, which overcomes the disadvantage of traditional value iteration algorithms. When the iterative control law and iterative performance index function in each iteration cannot be accurately obtained, a new design method of the convergence criterion for the generalized value iteration algorithm with finite approximation errors is established to make the iterative performance index functions converge to a finite neighborhood of the lowest bound of all performance index functions. Simulation results are given to illustrate the performance of the developed algorithm.

Keywords: Adaptive dynamic programming, approximate dynamic programming, nonlinear system, optimal control, reinforcement learning.

1. INTRODUCTION

Dynamic programming is an important technique in handling optimal control problems. However, due to the "curse of dimensionality", the optimal solutions cannot be obtained directly by dynamic programming (Bellman [1957]). Adaptive dynamic programming (ADP), proposed by Werbos [1977] and [1991], has demonstrated the capability to find the optimal control policy and solve the HJB equation in a principled way.
⋆ This work was supported in part by the National Natural Science Foundation of China under Grants 61034002, 61233001, 61273140, 61304086, and 61374105, in part by Beijing Natural Science Foundation under Grant 4132078, and in part by the Early Career Development Award of SKLMCCS.

Copyright © 2014 IFAC

Iterative methods are primary tools in ADP to obtain the solution of the HJB equation indirectly and have attracted increasing attention (Heydari and Balakrishnan [2013]; Liu et al. [2013]; Liu and Wei [2014]; Zhang et al. [2011]). Value iteration algorithms are among the most fundamental and important iterative ADP algorithms (Wei et al. [2009]; Wei and Liu [2012]; Yang and Jagannathan [2012]). Value iteration algorithms of ADP are given in Bertsekas and Tsitsiklis [1996]. In 2008, Al-Tamimi et al. studied a value iteration algorithm for discrete-time affine nonlinear systems (Al-Tamimi et al. [2008]). Starting from a zero initial performance index function, it is proven that the iterative performance index function is a non-decreasing and bounded sequence, which makes the iterative performance index function converge to the optimum as the iteration index increases to infinity. In recent years, value iteration algorithms have attracted more and more researchers (Liu et al. [2012]; Wei and Liu [2013a]; Wei and Liu [2013b]; Zhang et al. [2008]). However, the previous value iteration algorithms (traditional value iteration algorithms, in brief) are required to start from a zero initial condition, and other initial conditions are seldom discussed. On the other hand, most previous discussions on ADP required that the approximation structure approximate the iterative performance index function accurately. For most real-world control systems, however, the accurate performance index function cannot be achieved. Hence, ADP algorithms with approximation errors are important to discuss. Although the convergence properties of ADP algorithms with approximation errors were discussed in several papers (Liu and Wei [2013a]; Wei and Liu [2014]), a uniform approximation error was required to build the convergence criteria in these papers. However, the uniform approximation error is generally difficult to obtain. To the best of our knowledge, the convergence criteria in the previous papers were difficult to verify, and there has been no discussion of how to design a convergence criterion that makes the iterative ADP algorithms converge. This motivates our research.

In this paper, a new discrete-time generalized value iteration algorithm with finite approximation errors will be constructed. First, the detailed generalized value iteration algorithm is described. It permits an arbitrary positive


19th IFAC World Congress Cape Town, South Africa. August 24-29, 2014

semi-definite function to initialize the developed algorithm, which overcomes the disadvantage of traditional value iteration algorithms. Second, the convergence properties of the finite-approximation-error-based generalized value iteration algorithm are analyzed. We emphasize that, for the first time, a new "design method of the convergence criterion" for the generalized value iteration algorithm with finite approximation errors is established. It permits the developed generalized value iteration algorithm to adaptively design a suitable approximation error so as to make the iterative performance index function converge to a finite neighborhood of the optimal performance index function. Finally, simulation results are given to show the effectiveness of the developed iterative ADP algorithm.

2. PROBLEM FORMULATION

In this paper, the following discrete-time nonlinear system is considered:
xk+1 = F(xk, uk), k = 0, 1, 2, ...,  (1)
where xk ∈ Rⁿ, uk ∈ Rᵐ, and x0 is the initial state. Let u_k = (uk, uk+1, ...) be a sequence of controls from k to ∞. The performance index function is defined as
J(x0, u_0) = ∑_{k=0}^{∞} U(xk, uk),  (2)
where U(xk, uk) > 0, for ∀ xk, uk ≠ 0, is the utility function. In this paper, we aim to find an optimal control scheme that minimizes the performance index function (2). The following assumption is necessary for the analysis of the developed ADP algorithm.

Assumption 1. The system (1) is controllable; xk = 0 is a unique equilibrium state, i.e., F(0, 0) = 0; u(xk) = 0 for xk = 0; U(xk, uk) is positive definite.

Define the control sequence set as U_k = { u_k : u_k = (uk, uk+1, ...), ∀ uk+i ∈ Rᵐ, i = 0, 1, ... }. Then, for an arbitrary control sequence u_k ∈ U_k, the optimal performance index function can be defined as
J∗(xk) = inf_{u_k} { J(xk, u_k) : u_k ∈ U_k }.  (3)
According to Bellman's principle of optimality, J∗(xk) satisfies the discrete-time Hamilton–Jacobi–Bellman (HJB) equation
J∗(xk) = inf_{uk} { U(xk, uk) + J∗(F(xk, uk)) }.  (4)
Then, the optimal single control law can be expressed as
u∗(xk) = arg inf_{uk} { U(xk, uk) + J∗(F(xk, uk)) }.  (5)
Hence, the HJB equation (4) can be written as
J∗(xk) = U(xk, u∗(xk)) + J∗(F(xk, u∗(xk))).  (6)

3. GENERALIZED VALUE ITERATION ALGORITHM WITH FINITE APPROXIMATION ERRORS

In this section, a new generalized value iteration algorithm is developed to obtain the optimal control law for the nonlinear system (1). Approximation errors of the iterative performance index functions and iterative control laws are considered. New convergence property analysis methods will be established. The new design method of the convergence criterion will be developed.

3.1 Derivation of the Generalized Value Iteration Algorithm With Finite Approximation Errors

The developed generalized value iteration algorithm is updated by iterations, with the iteration index i increasing from 0 to ∞. For ∀ xk, let the initial function Vˆ0(xk) = Ψ(xk), where Ψ(xk) ≥ 0 is a positive semi-definite function. The iterative control law vˆ0(xk) can be computed as
vˆ0(xk) = arg min_{uk} { U(xk, uk) + Vˆ0(xk+1) + ρ0(xk) },  (7)
where Vˆ0(xk+1) = Ψ(xk+1), and the performance index function can be updated as
Vˆ1(xk) = U(xk, vˆ0(xk)) + Vˆ0(F(xk, vˆ0(xk))) + π0(xk),  (8)
where ρ0(xk) and π0(xk) are finite approximation error functions. For i = 1, 2, ..., the iterative ADP algorithm will iterate between
vˆi(xk) = arg min_{uk} { U(xk, uk) + Vˆi(xk+1) + ρi(xk) } = arg min_{uk} { U(xk, uk) + Vˆi(F(xk, uk)) + ρi(xk) }  (9)
and
Vˆi+1(xk) = U(xk, vˆi(xk)) + Vˆi(F(xk, vˆi(xk))) + πi(xk),  (10)
where ρi(xk) and πi(xk) are finite approximation error functions of the iterative control law and the iterative performance index function, respectively. In the next subsection, it will be proven that as i → ∞, the iterative performance index function Vi(xk) and the iterative control law vi(xk) converge to the optimal ones.

3.2 Properties of the Generalized Value Iteration Algorithm With Finite Approximation Errors

For the generalized value iteration algorithm (7)–(10), if for ∀ i = 0, 1, ..., the iterative performance index function and the iterative control law can be obtained accurately, then the algorithm reduces to the following equations:
vi(xk) = arg min_{uk} { U(xk, uk) + Vi(F(xk, uk)) },
Vi+1(xk) = min_{uk} { U(xk, uk) + Vi(F(xk, uk)) } = U(xk, vi(xk)) + Vi(F(xk, vi(xk))),  (11)
where V0(xk) = Ψ(xk) is an arbitrary positive semi-definite function. In Liu and Wei [2013b], it is shown that the iterative performance index function converges to the optimum. Owing to the existence of the approximation errors, this convergence may not hold. The following lemma shows this property.

Lemma 1. For i = 1, 2, ..., let Υi(xk) be the target iterative performance index function, which is expressed as
Υi(xk) = min_{uk} { U(xk, uk) + Vˆi−1(xk+1) },  (12)
where Vˆi(xk) is defined in (10). If the initial iterative performance index function Vˆ0(xk) = Υ0(xk) = Ψ(xk) and for


∀ i = 1, 2, ..., there exists a uniform finite approximation error ζ that satisfies
Vˆi(xk) − Υi(xk) ≤ ζ,  (13)
then we have
Vˆi(xk) − Vi(xk) ≤ iζ.  (14)

Proof. The details of the proof can be found in Liu and Wei [2013a] and are omitted here.

Thus, a new analysis method will be developed. To facilitate the analysis, the expressions of the approximation error are transformed. For ∀ i = 1, 2, ..., there exists a finite constant ϑi > 0 that makes
Vˆi(xk) ≤ ϑi Υi(xk)  (15)
hold. From (15), it can be seen that the iterative performance index function Vˆi(xk) is upper bounded by ϑi Υi(xk). If the convergence properties of Υi(xk) are analyzed for different ϑi, then the convergence of Vˆi(xk) can be justified. Thus, in the following, the convergence properties of the upper bound will be discussed.

Theorem 1. For ∀ i = 1, 2, ..., let Υi(xk) be expressed as in (12) and Vˆi(xk) be expressed as in (10). If for ∀ i = 1, 2, ..., there exists 0 < ϑi < 1 that makes (15) hold, then the iterative performance index function is convergent.

Proof. If 0 < ϑi < 1, then according to (15), we have 0 ≤ Vˆi(xk) < Υi(xk). Using mathematical induction, we can prove that for ∀ i = 1, 2, ..., the inequality
0 < Vˆi(xk) < Vi(xk)  (16)
holds. According to Liu and Wei [2013b], we have Vi(xk) → J∗(xk). Then, for ∀ i = 0, 1, ..., Vˆi(xk) is upper bounded and
0 < lim_{i→∞} Vˆi(xk) < lim_{i→∞} Vi(xk) = J∗(xk).  (17)
The proof is completed.

Next, we will analyze the situation 1 ≤ ϑi < ∞.

Theorem 2. For ∀ i = 1, 2, ..., let Υi(xk) be expressed as in (12) and Vˆi(xk) be expressed as in (10). Let 0 < φi < ∞ be a constant that makes
Vi(F(xk, uk)) ≤ φi U(xk, uk)  (18)
hold. If Assumption 1 holds and for ∀ i = 1, 2, ..., there exists 1 ≤ ϑi < ∞ that makes (15) hold, then we have
Vˆi(xk) ≤ ϑi ( 1 + ∑_{j=1}^{i−1} ϑi−1 ϑi−2 ⋯ ϑi−j+1 (ϑi−j − 1) (φi−1 φi−2 ⋯ φi−j) / ((φi−1 + 1)(φi−2 + 1) ⋯ (φi−j + 1)) ) Vi(xk),  (19)
where we define ϑi−1 ϑi−2 ⋯ ϑi−j+1 (ϑi−j − 1) = ϑi−1 − 1 for j = 1, and (·) = 0 for ∀ j > i, i, j = 0, 1, ....

Proof. The theorem can be proven by mathematical induction. First, let i = 1; then (12) becomes
Υ1(xk) = min_{uk} { U(xk, uk) + Vˆ0(xk+1) } = V1(xk).
According to (15), we have Vˆ1(xk) ≤ ϑ1 V1(xk). Thus, the conclusion holds for i = 1. Assume that (19) holds for i = l − 1, where l = 2, 3, .... Then, for i = l, we can obtain
Υl(xk) ≤ ( 1 + ∑_{j=1}^{l−1} ϑl−1 ϑl−2 ⋯ ϑl−j+1 (ϑl−j − 1) (φl−1 φl−2 ⋯ φl−j) / ((φl−1 + 1)(φl−2 + 1) ⋯ (φl−j + 1)) ) Vl(xk).
Then, according to (15), we can obtain (19). The mathematical induction is completed.

From (19), we can see that for ∀ i = 0, 1, ..., there exists an error between Vˆi(xk) and Vi(xk). As i → ∞, the bound of the approximation errors may increase to infinity. Thus, in the following, we will give the convergence properties of the iterative ADP algorithm (7)–(10) using an error bound method. Before presenting the next theorem, the following lemma is necessary.

Lemma 2. Let {bi}, i = 1, 2, ..., be a sequence of positive numbers. Let 0 < λi < ∞ be a bounded positive constant for ∀ i = 1, 2, ..., and let ai = λi bi. If ∑_{i=1}^{∞} bi is finite, then ∑_{i=1}^{∞} ai is finite.

Proof. As λi is finite for ∀ i = 1, 2, ..., if we let λ̄ = sup{λ1, λ2, ...}, then
∑_{i=1}^{∞} ai = ∑_{i=1}^{∞} λi bi ≤ λ̄ ∑_{i=1}^{∞} bi  (20)
is finite.

Theorem 3. Let Vˆi(xk) be expressed as in (19). If for ∀ i = 1, 2, ..., the inequality
1 ≤ ϑi+1 ≤ qi (φi + 1)/φi  (21)
holds, where qi is an arbitrary constant that satisfies φi/(φi + 1) < qi < 1, then as i → ∞, the iterative performance index function Vˆi(xk) of the generalized value iteration algorithm converges to a finite neighborhood of J∗(xk).

Proof. For (19) in Theorem 2, if we let
∆i = ∑_{j=1}^{i−1} ϑi−1 ϑi−2 ⋯ ϑi−j+1 (ϑi−j − 1) (φi−1 φi−2 ⋯ φi−j) / ((φi−1 + 1)(φi−2 + 1) ⋯ (φi−j + 1)),  (22)
aij = ϑi−1 ϑi−2 ⋯ ϑi−j (φi−1 φi−2 ⋯ φi−j) / ((φi−1 + 1)(φi−2 + 1) ⋯ (φi−j + 1)),  (23)
and
bij = ϑi−1 ϑi−2 ⋯ ϑi−j+1 (φi−1 φi−2 ⋯ φi−j) / ((φi−1 + 1)(φi−2 + 1) ⋯ (φi−j + 1)),  (24)
where i = 1, 2, ... and j = 1, 2, ..., i − 1, then we have ∆i = ∑_{j=1}^{i−1} aij − ∑_{j=1}^{i−1} bij. We know that if ∑_{j=1}^{i−1} aij and ∑_{j=1}^{i−1} bij are both finite as i → ∞, then lim_{i→∞} ∆i is finite. According to (24), we have bij / bi(j−1) = ϑi−j+1 φi−j / (φi−j + 1). If bij / bi(j−1) ≤ qi−j < 1, then we can get ϑi−j+1 ≤ qi−j (φi−j + 1)/φi−j. Let ℓ = i − j and then we can obtain
ϑℓ+1 ≤ qℓ (φℓ + 1)/φℓ,  (25)


where ℓ = 1, 2, ..., i − 1. Let i → ∞ and we can obtain (21). Let q = sup{q1, q2, ...} and we have 0 < q < 1. We can obtain
∑_{j=1}^{i−1} bij ≤ ((φi−1 + 1)/φi−1) ∑_{j=1}^{i−1} q^{j−1}.  (26)
As φi/(φi + 1) < q < 1 and φi−1 is finite for ∀ i = 1, 2, ..., let i → ∞ and we have that lim_{i→∞} ∑_{j=1}^{i−1} bij is finite. On the other hand, for ∀ i = 1, 2, ... and ∀ j = 1, 2, ..., i − 1, we have aij = ϑi−j bij. As 1 ≤ ϑi−j < ∞ is finite for ∀ i and ∀ j, according to Lemma 2, lim_{i→∞} ∑_{j=1}^{i−1} aij must be finite. Therefore, lim_{i→∞} ∆i is finite. According to Liu and Wei [2013b], we have lim_{i→∞} Vi(xk) = J∗(xk). Hence, the iterative performance index function Vˆi(xk) converges to a bounded neighborhood of the optimal performance index function J∗(xk). The proof is completed.

For completeness, the key step for i = l in the proof of Theorem 2 is the derivation
Υl(xk) = min_{uk} { U(xk, uk) + Vˆl−1(F(xk, uk)) }
 ≤ min_{uk} { U(xk, uk) + ϑl−1 ( 1 + ∑_{j=1}^{l−2} ϑl−2 ϑl−3 ⋯ ϑl−j (ϑl−j−1 − 1) (φl−2 φl−3 ⋯ φl−j−1) / ((φl−2 + 1)(φl−3 + 1) ⋯ (φl−j−1 + 1)) ) Vl−1(F(xk, uk)) }
 ≤ ( 1 + ∑_{j=1}^{l−1} ϑl−1 ϑl−2 ⋯ ϑl−j+1 (ϑl−j − 1) (φl−1 φl−2 ⋯ φl−j) / ((φl−1 + 1)(φl−2 + 1) ⋯ (φl−j + 1)) ) min_{uk} { U(xk, uk) + Vl−1(F(xk, uk)) }
 = ( 1 + ∑_{j=1}^{l−1} ϑl−1 ϑl−2 ⋯ ϑl−j+1 (ϑl−j − 1) (φl−1 φl−2 ⋯ φl−j) / ((φl−1 + 1)(φl−2 + 1) ⋯ (φl−j + 1)) ) Vl(xk),
where the second inequality uses Vl−1(F(xk, uk)) ≤ φl−1 U(xk, uk) from (18), i.e., Vl−1(F(xk, uk)) ≤ (φl−1/(φl−1 + 1)) (U(xk, uk) + Vl−1(F(xk, uk))).

Combining Theorems 1 and 3, the convergence criterion of the generalized value iteration algorithm with finite approximation errors can be established.

Theorem 4. If Assumption 1 holds and for ∀ i = 0, 1, ..., the inequality
0 < ϑi+1 ≤ qi (φi + 1)/φi  (27)
holds, where 0 < qi < 1 is an arbitrary constant, then the iterative performance index function Vˆi(xk) in the generalized value iteration algorithm converges to a finite neighborhood of the optimal performance index function J∗(xk) as i → ∞.

We can see that if we can obtain φi, then we can design the approximation error to make Vˆi(xk) converge. The following theorem will give an effective way to obtain φi. Define Ωφi as
Ωφi = { φi | φi U(xk, uk) ≥ Vi(F(xk, uk)) }.  (28)

Theorem 5. Let µ(xk) be an arbitrary admissible control law of the nonlinear system (1), i.e., let
Pi+1(xk) = U(xk, µ(xk)) + Pi(xk+1),  (29)
where P0(xk) = V0(xk) = Ψ(xk). If there exists a constant φ̃i that satisfies
φ̃i U(xk, uk) ≥ Pi(F(xk, uk)),  (30)
then we have φ̃i ∈ Ωφi.

Proof. As µ(xk) is an arbitrary admissible control law, we have Pi(xk) ≥ Vi(xk). If φ̃i satisfies (30), then we can get
φ̃i U(xk, uk) ≥ Pi(F(xk, uk)) ≥ Vi(F(xk, uk)).  (31)
The proof is completed.

From Theorem 5, we know that if we obtain an admissible control law µ(xk), then φi can be estimated. The method to obtain the admissible control law can be found in Liu and Wei [2014] and is omitted here.

Remark 1. One property should be pointed out. First, the developed value iteration algorithm of ADP in this paper is different from the traditional value iteration algorithms (Al-Tamimi et al. [2008] and Wei et al. [2009]). For the traditional value iteration algorithms, the initial performance index function is required to be zero. In this paper, the initial performance index function can be an arbitrary positive semi-definite function. On the other hand, the developed value iteration algorithm in this paper is also different from Liu and Wei [2013a] and Wei and Liu [2014]. In Liu and Wei [2013a] and Wei and Liu [2014], a uniform approximation error is required to


construct the convergence criterion. In this paper, the approximation error ϑi can be diﬀerent for diﬀerent i. This makes the convergence analysis in this paper diﬀerent from our previous papers.
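Theorem 5 suggests a computable route to φi: roll out any admissible control law µ(xk), accumulate Pi(xk) by (29), and take φ̃i as the largest observed ratio Pi(F(xk, uk))/U(xk, uk), which satisfies (30). The sketch below does this on sampled state–control pairs; the scalar dynamics F, utility U, initial function Ψ, sampling ranges, and the linear law µ(x) = −0.5x are illustrative assumptions, not the paper's neural-network setup.

```python
import numpy as np

# Sketch of Theorem 5: estimate phi_i from an arbitrary admissible control
# law mu via P_{i+1}(x) = U(x, mu(x)) + P_i(F(x, mu(x))), P_0 = Psi.
# Any phi_tilde with phi_tilde * U(x,u) >= P_i(F(x,u)) satisfies (30),
# hence belongs to the set Omega_phi_i in (28).
# F, U, Psi, and mu below are illustrative assumptions.
rng = np.random.default_rng(0)

def F(x, u):   return 0.8 * x + 0.5 * u     # assumed system dynamics
def U(x, u):   return x**2 + u**2           # positive definite utility
def mu(x):     return -0.5 * x              # assumed admissible control law
def Psi(x):    return x**2                  # positive semi-definite P_0 = V_0

def P(i, x):
    """P_i(x): cost of running mu for i steps from x, plus Psi at the end."""
    total = 0.0
    for _ in range(i):
        total += U(x, mu(x))
        x = F(x, mu(x))
    return total + Psi(x)

# Estimate the smallest phi satisfying (30) over sampled state-control pairs.
samples = rng.uniform(-2.0, 2.0, size=(500, 2))
i = 5
phi_tilde = max(P(i, F(x, u)) / U(x, u) for x, u in samples if U(x, u) > 1e-12)

# Criterion (27) then bounds how loose the per-iteration error factor may be:
# 0 < theta_{i+1} <= q_i * (phi_i + 1) / phi_i for some 0 < q_i < 1.
q_i = 0.9999                                # same q_i as in the simulation study
theta_max = q_i * (phi_tilde + 1.0) / phi_tilde
print(phi_tilde, theta_max)
```

Because the closed loop under µ is stable here, φ̃i stays small and the admissible bound on ϑi+1 exceeds 1, i.e., a strictly positive approximation error is tolerated at every iteration.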

3.3 Summary of the Generalized Value Iteration Algorithm With Finite Approximation Errors

Now, we summarize the generalized value iteration algorithm with finite approximation errors in Algorithm 1.

Algorithm 1 Generalized value iteration algorithm with finite approximation errors
Initialization:
Choose randomly an array of initial states x0;
Choose a positive semi-definite function Ψ(xk) ≥ 0;
Choose a convergence precision ζ;
Choose an admissible control law µ(xk);
Give a sequence {qi}, i = 0, 1, ..., where 0 < qi < 1;
Give two constants 0 < ς < 1, 0 < ϱ < 1.
Iteration:
1: Let the iteration index i = 0;
2: Let V0(xk) = Ψ(xk) and obtain φ0 by (30);
3: Compute vˆi(xk) by (9) and obtain Vˆi+1(xk) by (10);
4: Obtain ϑi+1 by (15). If ϑi+1 satisfies (27), then estimate φi+1 by (30) and go to the next step. Otherwise, decrease ρi(xk) and πi(xk), i.e., let ρi(xk) = ς ρi(xk) and πi(xk) = ϱ πi(xk), respectively, and go to Step 3;
5: If |Vˆi+1(xk) − Vˆi(xk)| ≤ ζ, then the optimal performance index function is obtained; go to Step 6. Otherwise, let i = i + 1 and go to Step 3;
6: Return vˆi(xk) and Vˆi(xk).

Remark 2. Generally, in iterative ADP algorithms, the difference between Vˆi(xk) and Υi(xk) is obtained, i.e.,
Vˆi(xk) − Υi(xk) = ζi(xk),  (32)
where ζi(xk) is the approximation error function. According to the definition of ϑi in (15) and the convergence criterion (27), we can easily obtain the following convergence criterion:
ζi(xk) ≤ (1/(φi−1 + 1)) Vˆi(xk).  (33)

4. SIMULATION STUDIES

We now examine the performance of the developed algorithm on the torsional pendulum system in Liu and Wei [2014]. The dynamics of the pendulum is as follows:
[x1(k+1); x2(k+1)] = [0.1 x2k + x1k; −0.49 sin(x1k) + 0.98 x2k] + [0; 0.1] uk,  (34)
where x1k = θk and x2k = ωk. Let the initial state be x0 = [1, −1]^T. We choose p = 10000 states. Let the structures of the critic and action networks both be 2–12–1. The neural network training method can be found in Liu and Wei [2014] and is omitted here. To illustrate the effectiveness of the algorithm, we choose four different initial performance index functions, which are expressed by Ψj(xk) = xk^T Pj xk, j = 1, ..., 4. Let P1 = 0. Let P2–P4 be initialized by arbitrary positive definite matrices of the forms P2 = [2.35, 3.31; 3.31, 9.28], P3 = [5.13, −5.72; −5.72, 15.13], and P4 = [100.78, 5.96; 5.96, 20.51], respectively. Let qi = 0.9999 for ∀ i = 0, 1, ..., and let ς = ϱ = 0.5.

Initialized by Ψj(xk), j = 1, ..., 4, the developed algorithm with finite approximation errors is implemented. The trajectories of the φ's with Ψ1(xk)–Ψ4(xk) are presented in Figs. 1(a)–(d), respectively. According to the φ's, the curved surfaces of the admissible errors with Ψ1(xk)–Ψ4(xk) are shown in Figs. 2(a)–(d), and the iterative performance index functions are shown in Figs. 2(e)–(h), where "In" denotes the initial iteration and "Lm" denotes the limiting iteration. From Figs. 1–2, it can be seen that for the different initial performance index functions Ψ1(xk)–Ψ4(xk), the iterative performance index functions obtained by the generalized value iteration algorithm converge to a finite neighborhood of the optimal one.

Fig. 1. The trajectories of φ's with Ψ1(xk)–Ψ4(xk). (a) Ψ1(xk). (b) Ψ2(xk). (c) Ψ3(xk). (d) Ψ4(xk).

Fig. 2. The curves of the admissible errors and the iterative performance index functions with Ψ1(xk)–Ψ4(xk). (a) Admissible errors with Ψ1(xk). (b) Admissible errors with Ψ2(xk). (c) Admissible errors with Ψ3(xk). (d) Admissible errors with Ψ4(xk). (e) Performance index function with Ψ1(xk). (f) Performance index function with Ψ2(xk). (g) Performance index function with Ψ3(xk). (h) Performance index function with Ψ4(xk).
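The torsional pendulum model (34) is straightforward to reproduce. The sketch below implements the discrete dynamics, checks the equilibrium required by Assumption 1, and rolls out a trajectory from x0 = [1, −1]^T; the stabilizing linear gain is an illustrative assumption, not the trained action network.

```python
import numpy as np

# Discrete-time torsional pendulum dynamics (34):
#   x1_{k+1} = 0.1 * x2_k + x1_k
#   x2_{k+1} = -0.49 * sin(x1_k) + 0.98 * x2_k + 0.1 * u_k
def step(x, u):
    x1, x2 = x
    return np.array([0.1 * x2 + x1,
                     -0.49 * np.sin(x1) + 0.98 * x2 + 0.1 * u])

# The origin is the equilibrium: F(0, 0) = 0, as required by Assumption 1.
assert np.allclose(step(np.array([0.0, 0.0]), 0.0), 0.0)

# Roll out from x0 = [1, -1]^T under a stabilizing linear feedback u = -K x;
# the gain K is an illustrative assumption, not the trained action network.
K = np.array([8.0, 12.0])
x = np.array([1.0, -1.0])
traj = [x]
for k in range(100):
    u = -K @ x
    x = step(x, u)
    traj.append(x)

print(np.linalg.norm(traj[-1]))   # the state is driven toward the origin
```

Under this assumed gain, the linearized closed loop has both eigenvalues inside the unit circle, so the rollout decays toward the equilibrium, mirroring the convergent state trajectories in Fig. 3.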



The corresponding iterative controls and iterative states are shown in Fig. 3, which are also convergent. Therefore, the effectiveness of the developed generalized value iteration algorithm with finite approximation errors is demonstrated.

Fig. 3. Iterative trajectories of states and controls with Ψ1(xk)–Ψ4(xk). (a) States with Ψ1(xk). (b) States with Ψ2(xk). (c) States with Ψ3(xk). (d) States with Ψ4(xk). (e) Controls with Ψ1(xk). (f) Controls with Ψ2(xk). (g) Controls with Ψ3(xk). (h) Controls with Ψ4(xk).

5. CONCLUSION

In this paper, a new generalized value iteration algorithm is developed to solve infinite horizon optimal control problems for discrete-time nonlinear systems. The developed generalized value iteration algorithm of ADP permits an arbitrary positive semi-definite function to initialize the algorithm, which overcomes the disadvantage of traditional value iteration algorithms. Considering the approximation errors, for the first time a new "design method of the convergence criterion" for the generalized value iteration algorithm with finite approximation errors is established to make the iterative performance index function converge to a finite neighborhood of the optimal performance index function. Finally, simulation results are given to illustrate the performance of the developed algorithm.

REFERENCES

A. Al-Tamimi, F.L. Lewis, and M. Abu-Khalaf. Discrete-time nonlinear HJB solution using approximate dynamic programming: convergence proof. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 38(4): 943–949, 2008.
D.P. Bertsekas and J.N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, Belmont, MA, 1996.
R.E. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.
A. Heydari and S.N. Balakrishnan. Finite-horizon control-constrained nonlinear optimal control using single network adaptive critics. IEEE Transactions on Neural Networks and Learning Systems, 24(1): 145–157, 2013.
D. Liu, D. Wang, D. Zhao, Q. Wei, and N. Jin. Neural-network-based optimal control for a class of unknown discrete-time nonlinear systems using globalized dual

heuristic programming. IEEE Transactions on Automation Science and Engineering, 9(3): 628–634, 2012.
D. Liu, Y. Huang, D. Wang, and Q. Wei. Neural network observer-based optimal control for unknown nonlinear systems using adaptive dynamic programming. International Journal of Control, 86(9): 1554–1566, 2013.
D. Liu and Q. Wei. Finite-approximation-error-based optimal control approach for discrete-time nonlinear systems. IEEE Transactions on Cybernetics, 43(2): 779–789, 2013a.
D. Liu and Q. Wei. Generalized adaptive dynamic programming algorithm for discrete-time nonlinear systems: convergence and stability analysis. In Proceedings of the Third IEEE International Conference on Information Science and Technology, Yangzhou, China, 134–141, 2013b.
D. Liu and Q. Wei. Policy iteration adaptive dynamic programming algorithm for discrete-time nonlinear systems. IEEE Transactions on Neural Networks and Learning Systems, 25(3): 621–634, 2014.
Q. Wei, H. Zhang, and J. Dai. Model-free multiobjective approximate dynamic programming for discrete-time nonlinear systems with general performance index functions. Neurocomputing, 72(7–9): 1839–1848, 2009.
Q. Wei and D. Liu. An iterative ϵ-optimal control scheme for a class of discrete-time nonlinear systems with unfixed initial state. Neural Networks, 32: 236–244, 2012.
Q. Wei and D. Liu. Numerical adaptive learning control scheme for discrete-time nonlinear systems. IET Control Theory & Applications, 7(11): 1472–1486, 2013a.
Q. Wei and D. Liu. A novel iterative θ-adaptive dynamic programming for discrete-time nonlinear systems. IEEE Transactions on Automation Science and Engineering, 2013b. Article in press. DOI: 10.1109/TASE.2013.2280974.
Q. Wei and D. Liu. Data-driven neuro-optimal temperature control of water gas shift reaction using stable iterative adaptive dynamic programming. IEEE Transactions on Industrial Electronics, 2014. Article in press. DOI: 10.1109/TIE.2014.2301770.
P.J. Werbos.
Advanced forecasting methods for global crisis warning and models of intelligence. General Systems Yearbook, 22: 25–38, 1977.
P.J. Werbos. A menu of designs for reinforcement learning over time. In W.T. Miller, R.S. Sutton, and P.J. Werbos, editors, Neural Networks for Control. MIT Press, Cambridge, 1991.
Q. Yang and S. Jagannathan. Reinforcement learning controller design for affine nonlinear discrete-time systems using online approximators. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 42(2): 377–390, 2012.
H. Zhang, Q. Wei, and D. Liu. An iterative adaptive dynamic programming method for solving a class of nonlinear zero-sum differential games. Automatica, 47(1): 207–214, 2011.
H. Zhang, Q. Wei, and Y. Luo. A novel infinite-time optimal tracking control scheme for a class of discrete-time nonlinear systems via the greedy HDP iteration algorithm. IEEE Transactions on Systems, Man, and Cybernetics-Part B: Cybernetics, 38(4): 937–942, 2008.
