
UILU-ENG-2006-1748

Learning with Tensor Representation

by Deng Cai, Xiaofei He, and Jiawei Han

April 2006

Learning with Tensor Representation∗

Deng Cai†, Xiaofei He‡, Jiawei Han†

† Department of Computer Science, University of Illinois at Urbana-Champaign
‡ Yahoo! Research Labs

Abstract

Most existing learning algorithms take vectors as their input data. A function is then learned in such a vector space for classification, clustering, or dimensionality reduction. However, in some situations, there is reason to consider data as tensors. For example, an image can be considered as a second order tensor and a video can be considered as a third order tensor. In this paper, we propose two novel algorithms called Support Tensor Machines (STM) and Tensor Least Square (TLS). These two algorithms operate in the tensor space. Specifically, we represent data as second order tensors (or matrices) in $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$, where $\mathbb{R}^{n_1}$ and $\mathbb{R}^{n_2}$ are two vector spaces. STM aims at finding a maximum margin classifier in the tensor space, while TLS aims at finding a minimum residual sum-of-squares classifier. With tensor representation, the number of parameters estimated by STM (TLS) can be greatly reduced. Therefore, our algorithms are especially suitable for small sample cases. We compare our proposed algorithms with SVM and the ordinary Least Square method on six databases. Experimental results show the effectiveness of our algorithms.

1 Introduction

The problem of data representation has been at the core of machine learning. Most traditional learning algorithms are based on the Vector Space Model. That is, the data are represented as vectors $x \in \mathbb{R}^n$. The learning algorithms aim at finding a linear (or nonlinear) function $f(x) = w^T x$ according to some pre-defined criteria, where $w = (w_1, \cdots, w_n)^T$ are the parameters to estimate.

However, in some situations, there might be reason to consider data as tensors. For example, an image is essentially a second order tensor, or matrix. It is reasonable to consider that pixels close to each other are correlated to some extent. Similarly, a video is essentially a third order tensor with the third dimension being time. Also, two consecutive frames in a video are probably correlated.

∗ The work was supported in part by the U.S. National Science Foundation NSF IIS-03-08215/IIS-05-13678. Any opinions, findings, and conclusions or recommendations expressed in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

In supervised learning settings with many input features, overfitting is usually a potential problem unless there is ample training data. For example, it is well known that for unregularized discriminative models fit via training-error minimization, sample complexity (i.e., the number of training examples needed to learn "well") grows linearly with the Vapnik-Chervonenkis (VC) dimension. Further, the VC dimension for most models grows about linearly in the number of parameters (Vapnik, 1982), which typically grows at least linearly in the number of input features. All these reasons lead us to consider new representations and corresponding learning algorithms with fewer parameters.

In this paper, we propose two novel supervised learning algorithms for data classification, which are called Support Tensor Machines (STM) and Tensor Least Square (TLS). Unlike most previous classification algorithms, which take vectors in $\mathbb{R}^n$ as inputs, STM and TLS take second order tensors in $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$ as inputs, where $n_1 \times n_2 \approx n$. For example, a vector $x \in \mathbb{R}^n$ can be transformed by some means to a second order tensor $X \in \mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$. A linear classifier in $\mathbb{R}^n$ can be represented as $w^T x + b$, in which there are $n + 1$ ($\approx n_1 \times n_2 + 1$) parameters ($b$, $w_i$, $i = 1, \cdots, n$). Similarly, a linear classifier in the tensor space $\mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$ can be represented as $u^T X v + b$ where $u \in \mathbb{R}^{n_1}$ and $v \in \mathbb{R}^{n_2}$. Thus, there are only $n_1 + n_2 + 1$ parameters. This property makes STM and TLS especially suitable for small sample cases.
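As a rough illustration of the two parameter counts, the sketch below compares $n + 1$ with $n_1 + n_2 + 1$. The function names and the 12 × 11 shape used for a 123-dimensional vector are assumptions chosen for illustration; the paper does not state which shape it used for any particular dataset.

```python
def vector_param_count(n):
    """Parameters of w^T x + b for x in R^n: n weights plus one bias."""
    return n + 1


def tensor_param_count(n1, n2):
    """Parameters of u^T X v + b for an n1 x n2 matrix X: n1 + n2 weights plus one bias."""
    return n1 + n2 + 1


if __name__ == "__main__":
    # Hypothetical example: a 123-dimensional vector stored as a 12 x 11 matrix.
    n, n1, n2 = 123, 12, 11
    print("vector classifier:", vector_param_count(n))
    print("tensor classifier:", tensor_param_count(n1, n2))
```

The gap widens as n grows, since the vector count is linear in n while the tensor count is on the order of the square root of n for near-square shapes.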

Recently there has been a lot of interest in tensor based approaches to data analysis in high dimensional spaces. Vasilescu and Terzopoulos (2003) proposed a novel face representation algorithm called Tensorface. Tensorface represents the set of face images by a higher-order tensor and extends Singular Value Decomposition (SVD) to higher-order tensor data. Other researchers have also shown how to extend Principal Component Analysis, Linear Discriminant Analysis, Locality Preserving Projection, and Non-negative Matrix Factorization to higher order tensor data (Cai et al., 2005; He et al., 2005; Shashua & Hazan, 2005; Ye et al., 2004). Most previous tensor based learning algorithms focus on dimensionality reduction. In this paper, we extend SVM and Least Square based ideas to tensor data for classification.

The rest of this paper is organized as follows: Section 2 describes the tensor space model for data representation. The Tensor Least Square (TLS) and Support Tensor Machines (STM) approaches for classification in tensor space are described in Section 3. The experimental results on six datasets from the UCI Repository are presented in Section 4. In Section 5, we provide a general algorithm for learning functions in tensor space. Finally, we provide some concluding remarks and suggestions for future work in Section 6.

2 Tensor based Data Representation

Traditionally, a data sample is represented by a vector in a high dimensional space. Some learning algorithms are then applied in such a vector space for classification. In this section, we introduce a new data representation model: the Tensor Space Model (TSM)¹. In the Tensor Space Model, a data sample is represented as a tensor. Each element in the tensor corresponds to a feature. For a data sample $x \in \mathbb{R}^n$, we can convert it to a second order tensor (or matrix) $X \in \mathbb{R}^{n_1 \times n_2}$, where $n_1 \times n_2 \approx n$. Figure 1 shows an example of converting a vector to a tensor. There are two issues in converting a vector to a tensor.

Figure 1: Vector to tensor conversion. 1∼9 denote the positions in the vector and tensor formats. (a) and (b) are two possible tensors: (a) is a 3 × 3 tensor with rows (1, 4, 7), (2, 5, 8), (3, 6, 9); (b) is a 5 × 2 tensor with rows (1, 6), (2, 7), (3, 8), (4, 9), (5, x). The 'x' in tensor (b) is a padding constant.

The first one is how to choose the size of the tensor, i.e., how to select $n_1$ and $n_2$. In Figure 1, we present two possible tensors for a 9-dimensional vector. Suppose $n_1 \ge n_2$. In order to have at least $n$ entries in the tensor while minimizing the size of the tensor, we require $(n_1 - 1) \times n_2 < n \le n_1 \times n_2$. With such a requirement, there are still many choices of $n_1$ and $n_2$, especially when $n$ is large. Generally all these $(n_1, n_2)$ combinations can be used. However, it is worth noticing that the number of parameters of a linear function in the tensor space is $n_1 + n_2$ (plus the bias). Therefore, one may try to minimize $n_1 + n_2$. In other words, $n_1$ and $n_2$ should be as close as possible.

The second issue is how to sort the features in the tensor. In the vector space model, we implicitly assume that the features are independent. A linear function in the vector space can be written as $g(x) = w^T x + b$. Clearly, changing the order of the features has no impact on the function learning. In the tensor space model, a linear function can be written as $f(X) = u^T X v + b$. Thus, the independence assumption of the features no longer holds for the learning algorithms in the tensor space model. Different feature orderings will lead to different learning results in the tensor space model.

A possible approach for sorting the features is to use the $w$ learned by a vector classifier. Each $w_i$ in $w$, where $i = 1, \cdots, n$, corresponds to a feature. Thus, the problem of sorting features in the tensor can be converted to filling these $n$ elements $w_i$ into the $n_1 \times n_2$ tensor. We divide the $w_i$ into three groups: positive, negative and zero. The corresponding features in the same group tend to be correlated. Suppose group $k$ has $l_k$ elements, where $k = 1, 2, 3$. For each group, we sort the $l_k$ elements by their absolute value and get $|w_1^k| \ge \cdots \ge |w_{l_k}^k|$. We take the first $n_1$ elements to form the first column of the tensor, the next $n_1$ elements to form the second column of the tensor, and so on. For each group, we will fill $\lfloor l_k / n_1 \rfloor$ columns. The remainders in each group will be put together to fill the remaining entries in the tensor.

In this paper, we take $n_1 \approx n_2 \approx \sqrt{n}$, which we call square tensors. The features are filled into the tensor entries in the way described above. Better ways of converting a data vector to a data tensor with theoretical guarantees are left for future work.

¹ Note that, in this paper our primary interest is focused on second order tensors. However, the TSM presented here and the algorithms presented in the next section can also be applied to higher order tensors.
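The conversion described in this section can be sketched in a few lines. The code below is a minimal illustration, not the authors' implementation: the function names, the tie-breaking rule toward the more square shape, and the default padding value 0.0 are assumptions. `choose_shape` enforces $(n_1 - 1) \times n_2 < n \le n_1 \times n_2$ with $n_1 \ge n_2$ while minimizing $n_1 + n_2$; `order_features` follows the sign-group heuristic; `to_tensor` fills the matrix column by column.

```python
def choose_shape(n):
    """Return (n1, n2) with n1 >= n2 and (n1 - 1) * n2 < n <= n1 * n2,
    minimizing n1 + n2 (ties go to the more square shape)."""
    best = None
    for n2 in range(1, n + 1):
        n1 = -(-n // n2)  # ceil(n / n2): smallest n1 with n1 * n2 >= n
        if n1 < n2:
            break  # n1 only shrinks as n2 grows, so no later shape has n1 >= n2
        if best is None or n1 + n2 <= sum(best):
            best = (n1, n2)
    return best


def order_features(w, n1):
    """Order feature indices by sign group (positive, negative, zero) of w.

    Within a group, indices are sorted by decreasing |w_i|. Each group fills
    floor(l_k / n1) whole columns; leftovers from all groups go last.
    """
    groups = [
        [i for i, wi in enumerate(w) if wi > 0],
        [i for i, wi in enumerate(w) if wi < 0],
        [i for i, wi in enumerate(w) if wi == 0],
    ]
    full_columns, leftovers = [], []
    for group in groups:
        group.sort(key=lambda i: -abs(w[i]))
        cut = (len(group) // n1) * n1
        full_columns.extend(group[:cut])
        leftovers.extend(group[cut:])
    return full_columns + leftovers


def to_tensor(x, n1, n2, order=None, pad=0.0):
    """Fill an n1 x n2 matrix column by column; unused cells get the padding constant."""
    if order is None:
        order = range(len(x))
    values = [x[i] for i in order] + [pad] * (n1 * n2 - len(x))
    return [[values[c * n1 + r] for c in range(n2)] for r in range(n1)]
```

With the positions 1 to 9 as input, `to_tensor` with shapes (3, 3) and (5, 2) gives the two layouts of Figure 1.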

3 Classification with Tensor Representation

We are given a set of training samples $\{X_i, y_i\}$, $i = 1, \cdots, m$, where $X_i \in \mathbb{R}^{n_1} \otimes \mathbb{R}^{n_2}$ is a data point in the order-2 tensor space and $y_i \in \{-1, 1\}$ is the label associated with $X_i$.

Let $(u_1, \cdots, u_{n_1})$ be an orthonormal basis of $\mathbb{R}^{n_1}$. Let $(v_1, \cdots, v_{n_2})$ be an orthonormal basis of $\mathbb{R}^{n_2}$. Thus, $X$ can be uniquely written as:

$$X = \sum_{ij} (u_i^T X v_j)\, u_i v_j^T$$

A linear classifier in the tensor space can be written as $f(X) = u^T X v + b$, where $u \in \mathbb{R}^{n_1}$ and $v \in \mathbb{R}^{n_2}$ are two vectors. The problem of linear classification in the tensor space model is to find $u$ and $v$ based on a specific objective function.

In the following two subsections, we introduce two novel classifiers in tensor space model based on different objective functions, which are Tensor Least Square classifier and Support Tensor Machines. One is the tensor generalization of least square classifier and the other is the tensor generalization of support vector machines.

3.1 Tensor Least Square Analysis

The least square classifier might be one of the most well known classifiers (Duda et al., 2000; Hastie et al., 2001). In this subsection, we extend the least square idea to the tensor space model and develop a Tensor Least Square classifier (TLS). The objective function in tensor least square analysis is as follows²:

$$\min_{u,v} \sum_{i=1}^{m} \left( u^T X_i v - y_i \right)^2 \tag{1}$$

² Note that we need to add a constant column (or row) to each data tensor to fit the intercept.

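By simple algebra, we see that:

$$\begin{aligned}
\sum_{i=1}^{m} \left( u^T X_i v - y_i \right)^2
&= \sum_{i=1}^{m} \left( u^T X_i v v^T X_i^T u - 2 y_i u^T X_i v + y_i^2 \right) \\
&= u^T \left( \sum_{i=1}^{m} X_i v v^T X_i^T \right) u - u^T \left( 2 \sum_{i=1}^{m} y_i X_i \right) v + \sum_{i=1}^{m} y_i^2
\end{aligned} \tag{2}$$

Similarly, we also have:

$$\begin{aligned}
\sum_{i=1}^{m} \left( u^T X_i v - y_i \right)^2
&= \sum_{i=1}^{m} \left( v^T X_i^T u u^T X_i v - 2 y_i v^T X_i^T u + y_i^2 \right) \\
&= v^T \left( \sum_{i=1}^{m} X_i^T u u^T X_i \right) v - v^T \left( 2 \sum_{i=1}^{m} y_i X_i^T \right) u + \sum_{i=1}^{m} y_i^2
\end{aligned} \tag{3}$$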

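Requiring the derivative of Eqn (2) with respect to $u$ to be zero, we get:

$$2 \left( \sum_i X_i v v^T X_i^T \right) u = 2 \left( \sum_i y_i X_i \right) v$$

Thus, we get:

$$u = \left( \sum_i X_i v v^T X_i^T \right)^{-1} \left( \sum_i y_i X_i \right) v \tag{4}$$

Similarly, requiring the derivative of Eqn (3) with respect to $v$ to be zero, we can get:

$$v = \left( \sum_i X_i^T u u^T X_i \right)^{-1} \left( \sum_i y_i X_i^T \right) u \tag{5}$$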

Notice that $u$ and $v$ are dependent on each other and cannot be solved for independently. We can use an iterative approach to compute the optimal $u$ and $v$: we first fix $u$ and compute $v$ by Equation (5); then we fix $v$ and compute $u$ by Equation (4). The convergence proof of such an iterative approach is given in Section 3.3.
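The alternating procedure can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' Matlab code: each half-step solves the normal equations of an ordinary least squares problem by Gaussian elimination, $u$ starts as the all-ones vector (an initialization the paper states only for STM), the intercept is omitted, and all function names are hypothetical. With $x_i = X_i^T u$, solving for $v$ is the least squares problem behind Equation (5); with $\tilde{x}_i = X_i v$, solving for $u$ is the one behind Equation (4).

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    return [dot(row, x) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * z[k] for k in range(r + 1, n))
        z[r] = (M[r][n] - s) / M[r][r]
    return z


def least_squares(rows, ys):
    """Minimize sum_i (w . rows[i] - ys[i])^2 through the normal equations."""
    d = len(rows[0])
    G = [[sum(r[a] * r[b] for r in rows) for b in range(d)] for a in range(d)]
    h = [sum(r[a] * y for r, y in zip(rows, ys)) for a in range(d)]
    return solve(G, h)


def objective(Xs, ys, u, v):
    """Objective (1): sum of squared residuals of u^T X_i v against y_i."""
    return sum((dot(u, matvec(X, v)) - y) ** 2 for X, y in zip(Xs, ys))


def tls_fit(Xs, ys, n_iter=10):
    u = [1.0] * len(Xs[0])  # assumed initialization
    v, history = None, []
    for _ in range(n_iter):
        # Fix u: x_i = X_i^T u, then solve for v.
        v = least_squares([matvec(transpose(X), u) for X in Xs], ys)
        # Fix v: x~_i = X_i v, then solve for u.
        u = least_squares([matvec(X, v) for X in Xs], ys)
        history.append(objective(Xs, ys, u, v))
    return u, v, history
```

The returned `history` holds the objective value after each full pass; since each half-step is an exact least squares solve, these values should not increase, up to rounding. The sketch assumes the Gram matrices stay invertible, which requires at least as many samples as max($n_1$, $n_2$).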

3.2 Support Tensor Machines

Support Vector Machines are a family of pattern classification algorithms developed by Vapnik (1995) and collaborators. SVM training algorithms are based on the idea of structural risk minimization rather than empirical risk minimization, and give rise to new ways of training polynomial, neural network, and radial basis function (RBF) classifiers. SVM has proven to be effective for many classification tasks (Joachims, 1998; Ronfard et al., 2002). Specifically, SVM tries to find a decision surface that maximizes the margin between the data points in a training set. The objective function of linear SVM can be stated as:

$$\begin{aligned}
\min_{w,b,\xi} \quad & \frac{1}{2} w^T w + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i (w^T x_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \cdots, m.
\end{aligned} \tag{6}$$

Our Support Tensor Machines (STM) is fundamentally based on the same idea. As we discussed before, a linear classifier in the tensor space can be naturally represented as follows:

$$f(X) = u^T X v + b, \quad u \in \mathbb{R}^{n_1},\ v \in \mathbb{R}^{n_2} \tag{7}$$

Equation (7) can be rewritten through the matrix inner product as follows:

$$f(X) = \langle X, u v^T \rangle + b, \quad u \in \mathbb{R}^{n_1},\ v \in \mathbb{R}^{n_2} \tag{8}$$

Thus, the large margin optimization problem in the tensor space is reduced to the following:

$$\begin{aligned}
\min_{u,v,b,\xi} \quad & \frac{1}{2} \| u v^T \|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i (u^T X_i v + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \cdots, m.
\end{aligned} \tag{9}$$
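Problems (7) to (9) rest on two identities: $u^T X v = \langle X, u v^T \rangle$, and $\| u v^T \|^2 = (u^T u)(v^T v)$ for the Frobenius norm. The short check below verifies both numerically on one small example; the helper names and the sample values are illustrative assumptions.

```python
def bilinear(u, X, v):
    """u^T X v for a matrix X given as a list of rows."""
    return sum(u[i] * X[i][j] * v[j] for i in range(len(u)) for j in range(len(v)))


def outer(u, v):
    """Rank-one matrix u v^T."""
    return [[ui * vj for vj in v] for ui in u]


def frobenius_inner(A, B):
    """Matrix inner product <A, B> = sum of elementwise products."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


if __name__ == "__main__":
    u, v = [1, 2], [3, -1]
    X = [[1, 2], [3, 4]]
    W = outer(u, v)
    # Both expressions evaluate the same classifier score (without the bias).
    print(bilinear(u, X, v), frobenius_inner(X, W))
    # Squared Frobenius norm of u v^T versus the product of squared vector norms.
    print(frobenius_inner(W, W), sum(a * a for a in u) * sum(b * b for b in v))
```

In other words, STM is a linear SVM whose weight matrix is constrained to have rank one.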

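We will now switch to a Lagrangian formulation of the problem. We introduce positive Lagrange multipliers $\alpha_i$, $\mu_i$, $i = 1, \cdots, m$, one for each of the inequality constraints in (9). This gives the Lagrangian:

$$L_P = \frac{1}{2} \| u v^T \|^2 + C \sum_i \xi_i - \sum_i \alpha_i y_i \left( u^T X_i v + b \right) + \sum_i \alpha_i - \sum_i \alpha_i \xi_i - \sum_i \mu_i \xi_i$$

Note that

$$\frac{1}{2} \| u v^T \|^2 = \frac{1}{2} \operatorname{trace}\left( u v^T v u^T \right) = \frac{1}{2} \left( v^T v \right) \operatorname{trace}\left( u u^T \right) = \frac{1}{2} \left( v^T v \right) \left( u^T u \right)$$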

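Thus, we have:

$$L_P = \frac{1}{2} \left( v^T v \right) \left( u^T u \right) + C \sum_i \xi_i - \sum_i \alpha_i y_i \left( u^T X_i v + b \right) + \sum_i \alpha_i - \sum_i \alpha_i \xi_i - \sum_i \mu_i \xi_i$$

Requiring that the gradient of $L_P$ with respect to $u$, $v$, $b$ and $\xi_i$ vanish gives the conditions:

$$u = \frac{\sum_i \alpha_i y_i X_i v}{v^T v} \tag{10}$$

$$v = \frac{\sum_i \alpha_i y_i X_i^T u}{u^T u} \tag{11}$$

$$\sum_i \alpha_i y_i = 0 \tag{12}$$

$$C - \alpha_i - \mu_i = 0, \quad i = 1, \cdots, m \tag{13}$$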

From Equations (10) and (11), we see that $u$ and $v$ are dependent on each other and cannot be solved for independently. In the following, we describe a simple yet effective computational method to solve this optimization problem.

We first fix $u$. Let $\beta_1 = \|u\|^2$ and $x_i = X_i^T u$. Thus, the optimization problem (9) can be rewritten as follows:

$$\begin{aligned}
\min_{v,b,\xi} \quad & \frac{1}{2} \beta_1 \|v\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i (v^T x_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \cdots, m.
\end{aligned} \tag{14}$$

It is clear that the new optimization problem (14) is identical to the standard SVM optimization problem, up to the constant factor $\beta_1$ on the regularization term, which is equivalent to rescaling the trade-off constant $C$ to $C / \beta_1$. Thus, we can use the same computational methods as for SVM to solve (14), such as those of Graf et al. (2004) and Platt (1998; 1999).

Once $v$ is obtained, let $\beta_2 = \|v\|^2$ and $\tilde{x}_i = X_i v$. Thus, $u$ can be obtained by solving the following optimization problem:

$$\begin{aligned}
\min_{u,b,\xi} \quad & \frac{1}{2} \beta_2 \|u\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i (u^T \tilde{x}_i + b) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \cdots, m.
\end{aligned} \tag{15}$$

Again, we can use the standard SVM computational methods to solve this optimization problem. Thus, $v$ and $u$ can be obtained by iteratively solving the optimization problems (14) and (15). In our experiments, $u$ is initially set to the vector of all ones.
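A minimal sketch of this alternating scheme is given below. It is not the authors' implementation. The inner solver is plain full-batch subgradient descent on the hinge-loss form of (14) and (15), used here only as a self-contained stand-in for the SVM solvers cited above (it is not SMO or LIBSVM, and it only approximates the global optimum that the convergence argument relies on). The function names, step size, and epoch count are assumptions.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def svm_subgradient(xs, ys, beta, C, epochs=200, lr=0.01):
    """Stand-in solver for min 0.5 * beta * ||w||^2 + C * sum_i hinge_i,
    where hinge_i = max(0, 1 - y_i * (w . x_i + b))."""
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(epochs):
        grad_w = [beta * wj for wj in w]
        grad_b = 0.0
        for x, y in zip(xs, ys):
            if y * (dot(w, x) + b) < 1:  # margin violated: hinge term is active
                grad_w = [g - C * y * xj for g, xj in zip(grad_w, x)]
                grad_b -= C * y
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def stm_fit(Xs, ys, C=1.0, n_iter=5, solver=svm_subgradient):
    """Alternate between problems (14) and (15); `solver` is any routine with
    the signature solver(xs, ys, beta, C) -> (weights, bias)."""
    u = [1.0] * len(Xs[0])  # all-ones initialization, as stated in the text
    v, b = None, 0.0
    for _ in range(n_iter):
        # Fix u: x_i = X_i^T u and beta_1 = ||u||^2 give problem (14) in v.
        xs = [[dot(u, col) for col in zip(*X)] for X in Xs]
        v, b = solver(xs, ys, dot(u, u), C)
        # Fix v: x~_i = X_i v and beta_2 = ||v||^2 give problem (15) in u.
        xts = [[dot(row, v) for row in X] for X in Xs]
        u, b = solver(xts, ys, dot(v, v), C)
    return u, v, b


def stm_decision(u, v, b, X):
    """f(X) = u^T X v + b."""
    return sum(u[i] * dot(X[i], v) for i in range(len(u))) + b
```

Passing a different `solver` swaps in any other linear SVM routine. A solver that accepts only $C$ can be used by calling it with $C / \beta$, since the factor $\beta$ merely rescales the objective.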

3.3 Convergence Proof

In this section, we provide a convergence proof of the iterative computational method in the STM and TLS algorithms. We have the following theorem:

Theorem 1. The iterative procedure to solve the optimization problems (14) and (15) monotonically decreases the objective function value in (9), and hence the STM algorithm converges.

Proof. Define:

$$f(u, v) = \frac{1}{2} \| u v^T \|^2 + C \sum_{i=1}^{m} \xi_i$$

Let $u_0$ be the initial value. Fixing $u_0$, we get $v_0$ by solving the optimization problem (14). Likewise, fixing $v_0$, we get $u_1$ by solving the optimization problem (15). Notice that the optimization problem of SVM is convex, so the solution of SVM is globally optimal (Burges, 1998; Fletcher, 1987). Specifically, the solutions of (14) and (15) are globally optimal. Thus, we have:

$$f(u_0, v_0) \ge f(u_1, v_0)$$

Finally, we get:

$$f(u_0, v_0) \ge f(u_1, v_0) \ge f(u_1, v_1) \ge f(u_2, v_1) \ge \cdots$$

Since $f$ is bounded from below by 0, it converges.

Similarly, for the Tensor Least Square algorithm, since the solution for least square is globally optimal, the iterative procedure (4) and (5) monotonically decreases the objective function value in (1), and hence the TLS algorithm converges.

3.4 From Matrix to High Order Tensor

The TLS and STM algorithms described above take order-2 tensors, i.e., matrices, as input data. However, these algorithms can also be extended to high order tensors. In this section, we briefly describe the extension of these algorithms to high order (> 2) tensors. We take STM as an example.

Let $(T_i, y_i)$, $i = 1, \cdots, m$ denote the training samples, where $T_i \in \mathbb{R}^{n_1} \otimes \cdots \otimes \mathbb{R}^{n_k}$. The decision function of STM is:

$$f(T) = T(a^1, a^2, \cdots, a^k) + b, \quad a^1 \in \mathbb{R}^{n_1},\ a^2 \in \mathbb{R}^{n_2},\ \cdots,\ a^k \in \mathbb{R}^{n_k}$$

where

$$T(a^1, a^2, \cdots, a^k) = \sum_{\substack{1 \le i_1 \le n_1 \\ \cdots \\ 1 \le i_k \le n_k}} T_{i_1, \cdots, i_k}\, a^1_{i_1} \times \cdots \times a^k_{i_k}$$

As before, $a^1, \cdots, a^k$ can also be computed iteratively.

We first introduce the $l$-mode product of a tensor $T$ and a vector $a$, which we denote as $T \times_l a$. The result of the $l$-mode product of a tensor $T \in \mathbb{R}^{n_1} \otimes \cdots \otimes \mathbb{R}^{n_k}$ and a vector $a \in \mathbb{R}^{n_l}$, $1 \le l \le k$, is a new tensor $B \in \mathbb{R}^{n_1} \otimes \cdots \otimes \mathbb{R}^{n_{l-1}} \otimes \mathbb{R}^{n_{l+1}} \otimes \cdots \otimes \mathbb{R}^{n_k}$, where

$$B_{i_1, \cdots, i_{l-1}, i_{l+1}, \cdots, i_k} = \sum_{i_l = 1}^{n_l} T_{i_1, \cdots, i_{l-1}, i_l, i_{l+1}, \cdots, i_k} \cdot a_{i_l}$$

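Thus, the decision function in higher order tensor space can also be written as:

$$f(T) = T \times_1 a^1 \times_2 a^2 \cdots \times_k a^k + b$$

The optimization problem of STM in high order tensors is:

$$\begin{aligned}
\min_{a^1, \cdots, a^k, b, \xi} \quad & \frac{1}{2} \| a^1 \otimes \cdots \otimes a^k \|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i \left( T_i(a^1, a^2, \cdots, a^k) + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \cdots, m.
\end{aligned} \tag{16}$$

Here $\| a^1 \otimes \cdots \otimes a^k \|$ denotes the tensor norm of $a^1 \otimes \cdots \otimes a^k$ (Lee, 2002).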

First, to compute $a^1$, we fix $a^2, \cdots, a^k$. Let $\beta_2 = \|a^2\|^2, \cdots, \beta_k = \|a^k\|^2$. We then define $t_i = T_i \times_2 a^2 \cdots \times_k a^k$. Thus, the optimization problem (16) can be reduced as follows:

$$\begin{aligned}
\min_{a^1, b, \xi} \quad & \frac{1}{2} \beta_2 \cdots \beta_k \|a^1\|^2 + C \sum_{i=1}^{m} \xi_i \\
\text{subject to} \quad & y_i \left( (a^1)^T t_i + b \right) \ge 1 - \xi_i, \\
& \xi_i \ge 0, \quad i = 1, \cdots, m.
\end{aligned} \tag{17}$$

Again, we can use the standard SVM computational methods to solve this optimization problem. Once $a^1$ is computed, we can fix $a^1, a^3, \cdots, a^k$ to compute $a^2$. Continuing in this way, all the $a^i$ can be computed in such an iterative manner.
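The $l$-mode product and the reduction to the vectors $t_i$ can be sketched for tensors stored as nested Python lists. The representation and the function names are assumptions made for illustration; mode indices are 1-based to match the notation above.

```python
def mode_product(T, a, l):
    """l-mode product T x_l a: contract mode l (1-based) of T with the vector a."""
    if l == 1:
        return _weighted_sum(T, a)
    return [mode_product(Ti, a, l - 1) for Ti in T]


def _weighted_sum(slices, weights):
    """sum_i weights[i] * slices[i], where each slice is a scalar or a nested list."""
    if isinstance(slices[0], list):
        return [_weighted_sum([s[j] for s in slices], weights)
                for j in range(len(slices[0]))]
    return sum(w * s for w, s in zip(weights, slices))


def contract_all(T, vectors):
    """T(a^1, ..., a^k): contract every mode of T with the matching vector."""
    B = T
    for a in vectors:
        B = mode_product(B, a, 1)  # after each contraction the next mode becomes mode 1
    return B


def reduce_to_vector(T, vectors, keep):
    """Contract all modes except `keep` (1-based); vectors[keep - 1] is ignored.

    For keep = 1 this returns t = T x_2 a^2 ... x_k a^k from the text.
    """
    B = T
    # Contract from the last mode down so the remaining mode indices stay valid.
    for l in range(len(vectors), 0, -1):
        if l != keep:
            B = mode_product(B, vectors[l - 1], l)
    return B
```

The inner product of $a^1$ with `reduce_to_vector(T, vectors, keep=1)` equals `contract_all(T, vectors)`, which is why the subproblem in $a^1$ takes the vector SVM form (17). For an order-2 tensor, `contract_all(X, [u, v])` is $u^T X v$.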

4 Experiments

To evaluate the performance of tensor based classifiers, we performed experiments on six datasets from the UCI Repository and report results for all of them. These six datasets include BREAST-CANCER, DIABETES, IONOSPHERE, MUSHROOMS, SONAR, as well as the ADULT data in a representation produced by Platt (1999). All of these datasets are binary classification problems and the features were scaled to [−1, 1]. The preprocessed datasets can also be downloaded from the LIBSVM dataset webpage³.

³ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

Table 1: Classification accuracies (%) on various data sets. n is the number of features and m is the total number of examples.

Data set        n     m       Train   TLS    LS     TLS vs LS   STM    SVM    STM vs SVM
Adult           123   32561   0.1%    63.6   60.7   ∼           76.2   76.5   ∼
Breast-cancer   10    683     1%      87.6   85.5   ≫           94.3   92.7   >
Diabetes        8     768     1%      60.3   59.0   ≫           66.2   66.5   ∼
Ionosphere      34    351     5%      75.4   72.4   ≫           77.4   76.4   ≫
Mushrooms       112   8124    0.1%    72.2   72.4   ∼           85.7   83.1   ≫
Sonar           60    208     5%      60.7   60.3   ∼           65.1   62.9   ≫

"≫" or "≪" means P-value ≤ 0.01 in the paired T-test on the 50 random splits
">" or "<" means 0.01 < P-value ≤ 0.05
"∼" means P-value > 0.05

All the datasets were randomly split into training and testing sets. Notice that the training sets are quite small since we are particularly interested in the performance on small training sets. We averaged the results over 50 random splits and report the average performance.

4.1 Experimental Results on TLS and LS

We used the regress function in the Statistics Toolbox of Matlab 7.04 to solve the least square problems in both the Tensor Least Square classifier (TLS) and the Least Square classifier (LS). Table 1 summarizes the results. Besides the average accuracy, a paired T-test on the 50 random splits is also reported, which indicates whether the difference between two systems is significant. We can see that in 3 out of 6 datasets, TLS is better than LS. In the remaining 3 datasets, both algorithms are comparable.

To get a more detailed picture of the performance of TLS with respect to the training set size, we tested TLS and LS over various training sizes (from 1% to 10%) on the BREAST-CANCER dataset. The results are shown in Figure 2. As can be seen, when the training set is small (1%, 2% and 3%), TLS outperforms LS. As the number of training samples increases, TLS tends to get the same results as LS.
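For readers who want to reproduce the significance symbols, the paired T-test statistic over per-split accuracies can be computed as sketched below. This is an illustration, not the authors' evaluation script. The function names are hypothetical, and the default thresholds are the approximate two-sided critical values of Student's t distribution with 49 degrees of freedom (50 splits), about 2.010 for P = 0.05 and 2.680 for P = 0.01, taken from standard t tables rather than from the paper.

```python
import math


def paired_t_statistic(acc_a, acc_b):
    """t statistic of the paired differences acc_a[i] - acc_b[i]."""
    diffs = [a - b for a, b in zip(acc_a, acc_b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)


def significance_symbol(t, crit_05=2.010, crit_01=2.680):
    """Map a t statistic to the symbols of Table 1.

    The default critical values assume 49 degrees of freedom (50 paired splits).
    """
    if abs(t) >= crit_01:
        return "≫" if t > 0 else "≪"
    if abs(t) >= crit_05:
        return ">" if t > 0 else "<"
    return "∼"
```

Here `acc_a` and `acc_b` would hold the test accuracies of the two classifiers on the same 50 random splits, in the same order.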

Figure 2: Classification accuracy with respect to the training sample size of TLS and LS on the BREAST-CANCER dataset (horizontal axis: training sample ratio (%); vertical axis: accuracy (%); curves: Tensor Least Square and Least Square). As can be seen, when the training set is small (1%, 2% and 3%), TLS outperforms LS. As the number of training samples increases, TLS tends to get the same results as LS.

4.2 Experimental Results on STM and SVM

In this experiment, we compared the classification performance of Support Tensor Machines (STM) and traditional Support Vector Machines (SVM). We used the LIBSVM system (Chang & Lin, 2001) and tested it with the linear model. For both STM and SVM, there is a parameter C that needs to be set. We use cross-validation on the training set to find the best parameter C. Table 1 summarizes the results. We can see that in 4 out of 6 datasets, STM is better than SVM. In the remaining 2 datasets, both algorithms are comparable.

We also tested STM and SVM over various training sizes (from 1% to 10%) on the BREAST-CANCER dataset. The results are shown in Figure 3. As can be seen, when the training set is small (1%, 2% and 3%), STM outperforms SVM. As the number of training samples increases, STM tends to get the same results as SVM. In all the cases, large margin classifiers (STM and SVM) are better than least square classifiers (TLS and LS).

Both of these experiments suggest that classifiers based on tensor representation are particularly suitable for small sample problems. This might be due to the fact that the number of parameters that need to be estimated in tensor classifiers is $n_1 + n_2$, which can be much smaller than $n_1 \times n_2$ in vector classifiers.

Figure 3: Classification accuracy with respect to the training sample size of STM and SVM on the BREAST-CANCER dataset (horizontal axis: training sample ratio (%); vertical axis: accuracy (%); curves: STM and SVM). As can be seen, when the training set is small (1%, 2% and 3%), STM outperforms SVM. As the number of training samples increases, STM tends to get the same results as SVM.

5 The General Algorithm for Learning Functions in Tensor Space

In the above sections, we have developed an iterative algorithm for solving the optimization problems of STM and TLS. It can be noticed that these two algorithms share a similar iterative procedure. Actually, such an iterative algorithm can be generalized to solve a broad family of tensor versions of traditional vector based learning algorithms.

Suppose the linear classifier in vector space is $g(x) = w^T x + b$ and its objective function is

$$\min V(w, x_i, y_i, i = 1, \cdots, m) \tag{18}$$

The corresponding linear classifier in tensor space is $f(X) = u^T X v + b$ with objective function

$$\min T(u, v, X_i, y_i, i = 1, \cdots, m) \tag{19}$$

The algorithmic procedure for solving the optimization problem (19) is formally stated below:

1. Initialization: Let $u = (1, \cdots, 1)^T$.

2. Computing $v$: Let $x_i = X_i^T u$. The tensor classifier can be rewritten as

   $$f(X) = u^T X v + b = x^T v + b = g'(x)$$

   Thus, the optimization problem (19) can be rewritten as an optimization problem in vector space:

   $$\min V'(v, x_i, y_i, i = 1, \cdots, m) \tag{20}$$

   Note: Any computational method for solving (18) can also be used here.

3. Computing $u$: Once $v$ is obtained, let $\tilde{x}_i = X_i v$. Similarly, the tensor classifier can be rewritten as

   $$f(X) = u^T X v + b = u^T \tilde{x} + b = g''(\tilde{x})$$

   Thus, $u$ can be computed by solving the same vector space optimization problem:

   $$\min V''(u, \tilde{x}_i, y_i, i = 1, \cdots, m) \tag{21}$$

   Note: As above, any computational method for solving (18) can also be used to solve (21).

4. Iteratively computing $u$ and $v$: By steps 2 and 3, we can iteratively compute $u$ and $v$ until they converge.

   Note: As long as the solution of optimization problem (18) is globally optimal, the iterative procedure described here will converge. The convergence proof is similar to the proof given in Section 3.3.

6 Conclusions

In this paper we have introduced a tensor framework for data representation and classification. In particular, we have proposed two new classification algorithms called Support Tensor Machines (STM) and Tensor Least Square (TLS) for learning a linear classifier in tensor space. Our experimental results on 6 databases from the UCI Repository demonstrate that tensor based classifiers are especially suitable for small sample cases. This is due to the fact that the number of parameters estimated by a tensor classifier is much smaller than that estimated by a traditional vector classifier.

There are several interesting problems that we are going to explore in future work:

1. In this paper, we empirically construct the tensor. Better ways of converting a data vector to a data tensor with theoretical guarantees need to be studied.

2. Both STM and TLS are linear methods. Thus, they fail to discover the nonlinear structure of the data space. It remains unclear how to generalize our algorithms to the nonlinear case. A possible way of nonlinear generalization is to use kernel techniques.

3. In this paper, we use an iterative computational method for solving the optimization problems of STM and TLS. We expect that there exist more efficient computational methods.

References

Burges, C. J. C. (1998). A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2, 121–167.

Cai, D., He, X., & Han, J. (2005). Subspace learning based on tensor analysis (Technical Report). Computer Science Department, UIUC, UIUCDCS-R-2005-2572.

Chang, C.-C., & Lin, C.-J. (2001). LIBSVM: a library for support vector machines. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.

Duda, R. O., Hart, P. E., & Stork, D. G. (2000). Pattern classification. Hoboken, NJ: Wiley-Interscience. 2nd edition.

Fletcher, R. (1987). Practical methods of optimization. John Wiley and Sons. 2nd edition.

Graf, H. P., Cosatto, E., Bottou, L., Dourdanovic, I., & Vapnik, V. (2004). Parallel support vector machines: The cascade SVM. Advances in Neural Information Processing Systems 17.

Hastie, T., Tibshirani, R., & Friedman, J. H. (2001). The elements of statistical learning. Springer-Verlag.

He, X., Cai, D., & Niyogi, P. (2005). Tensor subspace analysis. Advances in Neural Information Processing Systems 18.

Joachims, T. (1998). Text categorization with support vector machines: Learning with many relevant features. ECML'98.

Lee, J. M. (2002). Introduction to smooth manifolds. Springer-Verlag New York.

Platt, J. C. (1998). Using sparseness and analytic QP to speed training of support vector machines. Advances in Neural Information Processing Systems 11.

Platt, J. C. (1999). Fast training of support vector machines using sequential minimal optimization, 185–208. Cambridge, MA, USA: MIT Press.

Ronfard, R., Schmid, C., & Triggs, B. (2002). Learning to parse pictures of people. ECCV'02.

Shashua, A., & Hazan, T. (2005). Non-negative tensor factorization with applications to statistics and computer vision. Proc. 2005 Int. Conf. Machine Learning (ICML'05).

Vapnik, V. N. (1982). Estimation of dependences based on empirical data. Springer-Verlag.

Vapnik, V. N. (1995). The nature of statistical learning theory. Springer-Verlag.

Vasilescu, M. A. O., & Terzopoulos, D. (2003). Multilinear subspace analysis for image ensembles. IEEE Conference on Computer Vision and Pattern Recognition.

Ye, J., Janardan, R., & Li, Q. (2004). Two-dimensional linear discriminant analysis. Advances in Neural Information Processing Systems 17.