Lecture 1. What is Optimization
...


1.1 Introductory examples
...

What is optimization? Roughly speaking, optimization means minimizing or maximizing a function (called the objective function) subject to some constraints.

For example, we have several ways to return to campus from Hongqiao Station: by taxi (Didi / Gaode), by metro, or by bus (Hongqiao 4 Line / Min-Hong 2 Line), etc. We would like to minimize the travel time, but our money is limited. This is an optimization problem.

Formally, an optimization problem can be defined by
$$\min_{x \in \mathcal{X}} f(x),$$
where $f$ is called the objective function and $\mathcal{X} \subseteq \mathbb{R}^n$ is called the feasible set, usually specified by constraint functions, e.g.,
$$\mathcal{X} = \{x \in \mathbb{R}^n : c_i(x) \le 0,\ i = 1, \dots, m\}.$$
The optimal solution is usually denoted by $x^* \in \operatorname*{argmin}_{x \in \mathcal{X}} f(x)$. In this course, we consider continuous optimization problems, where the objective function and the constraints are continuous functions. We now give some more examples.
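As a small numerical illustration of this abstract form, the sketch below hands a toy objective and a single inequality constraint to SciPy's general-purpose solver; the objective, the constraint and all numbers are arbitrary placeholders chosen only for this example.

```python
import numpy as np
from scipy.optimize import minimize

def f(x):                      # objective: f(x) = (x1 - 1)^2 + (x2 - 2)^2
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

def c(x):                      # constraint: c(x) = x1 + x2 - 2 <= 0
    return x[0] + x[1] - 2.0

# SciPy encodes inequality constraints as g(x) >= 0, so we pass -c.
res = minimize(f, x0=np.zeros(2),
               constraints=[{"type": "ineq", "fun": lambda x: -c(x)}])
print(res.x, res.fun)          # approximate minimizer x* and optimal value f(x*)
```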

Knapsack problem
...

Example

Suppose there are $n$ types of items. The $i$-th type has volume $v_i$, weight $w_i$ and value $c_i$. We have a knapsack to carry some items. However, the capacity of this knapsack is $V$ and the load-bearing limit is $W$. That is, the total volume of the items in the knapsack cannot exceed $V$ and the total weight of the items in the knapsack cannot exceed $W$. What is the maximum value we can bring?

For each $i = 1, \dots, n$, define a variable $x_i$ to denote the number of carried items of the $i$-th type. Then we can formalize the problem as
$$\begin{aligned} \max \quad & \sum_{i=1}^n c_i x_i \\ \text{s.t.} \quad & \sum_{i=1}^n v_i x_i \le V, \\ & \sum_{i=1}^n w_i x_i \le W, \\ & x_i \ge 0, \quad i = 1, \dots, n. \end{aligned}$$
How can we solve this problem? For simplicity, we assume that there are only two types: cola and potato chips. Each bottle of cola has volume $v_1$, weight $w_1$ and value $c_1$; each bag of potato chips has volume $v_2$, weight $w_2$ and value $c_2$. Our knapsack has capacity $V$ and load-bearing limit $W$. Now the problem is
$$\begin{aligned} \max \quad & c_1 x_1 + c_2 x_2 \\ \text{s.t.} \quad & v_1 x_1 + v_2 x_2 \le V, \\ & w_1 x_1 + w_2 x_2 \le W, \\ & x_1, x_2 \ge 0. \end{aligned}$$
Actually, we can solve this problem by drawing a graph:
![[Pasted image 20230822011151.png]]
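Besides the graphical solution, the two-variable problem can also be handed to a linear programming solver. The sketch below uses SciPy's `linprog`; the volumes, weights, values and knapsack limits are made-up placeholders, not the concrete numbers used in this example.

```python
import numpy as np
from scipy.optimize import linprog

c = np.array([6.0, 5.0])          # values  (c1, c2)   -- placeholder numbers
v = np.array([2.0, 3.0])          # volumes (v1, v2)
w = np.array([3.0, 1.0])          # weights (w1, w2)
V, W = 12.0, 9.0                  # capacity and load-bearing limit

# linprog minimizes, so negate the objective to maximize the total value.
res = linprog(-c,
              A_ub=np.vstack([v, w]),        # v1*x1 + v2*x2 <= V,  w1*x1 + w2*x2 <= W
              b_ub=np.array([V, W]),
              bounds=[(0, None), (0, None)]) # x1, x2 >= 0
print(res.x, -res.fun)            # optimal (x1, x2) and the maximum value
```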

Question
  1. What if we require the $x_i$ to be integers?
  2. What if there are more types of items?

Data fitting
...

Example

Consider free-fall motion. The height $h$ and the time $t$ of a free fall follow the law $h = \frac{1}{2} g t^2$. However, the practical data may not exhibit the perfect law.
Suppose we have the following data and we would like to use $h = \frac{1}{2} \hat{g} t^2$ to fit the data. Which value of the coefficient $\hat{g}$ should we choose?

10 20 30 40
1.011 2.019 3.032 4.041

However, before solving this problem, we should first ask the following question: if we choose a certain value of $\hat{g}$, how can we measure the difference between the theoretical values of $h$ and the practical data?
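Anticipating one possible answer developed below (measuring the difference by a sum of squared errors), here is a small sketch of fitting $\hat{g}$ in the model $h = \frac{1}{2}\hat{g}t^2$; the times and heights used are synthetic placeholders, not the table above.

```python
import numpy as np

t = np.array([1.0, 2.0, 3.0, 4.0])                          # measured times (assumed data)
h = 0.5 * 9.8 * t**2 + np.array([0.05, -0.1, 0.2, -0.15])   # noisy heights (assumed data)

# Minimizing sum_i (h_i - 0.5 * g * t_i^2)^2 over g has the closed form below.
g_hat = 2 * np.sum(h * t**2) / np.sum(t**4)
print(g_hat)   # close to 9.8 for this synthetic data
```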

Question

Generally, we have the following question. Let $f(x) = a^\top x + b$, $x \in \mathbb{R}^n$, where $a \in \mathbb{R}^n$ and $b \in \mathbb{R}$. Given a set of data $(x_1, y_1), \dots, (x_m, y_m)$, and supposing we guess the values of $a$ and $b$, how can we measure the difference between $\hat{y} = (\hat{y}_1, \dots, \hat{y}_m)$ and $y = (y_1, \dots, y_m)$, where $\hat{y}_i = a^\top x_i + b$?

If we only have two numbers $u$ and $v$, it is natural to use the absolute value $|u - v|$ to measure the difference, and it is then clear when one number is closer to a given number than another. However, if we have two vectors, how can we measure the difference? When is one vector closer to a given vector than another?

We need to extend the concept of the absolute value to measure the distances between vectors in $\mathbb{R}^n$.

Definition (Norm)

Given a vector space $V$ over a field $F$ (usually $F = \mathbb{R}$), a norm is a function $\|\cdot\| : V \to \mathbb{R}$ having the following properties:
1. (Nonnegativity) $\|x\| \ge 0$ for all $x \in V$.
2. (Positive definiteness) $\|x\| = 0$ iff $x = 0$.
3. (Absolute homogeneity) For all $\lambda \in F$ and $x \in V$, $\|\lambda x\| = |\lambda| \, \|x\|$.
4. (Triangle inequality) For all $x, y \in V$, $\|x + y\| \le \|x\| + \|y\|$.

This definition is not constructive. That means any function satisfying the above properties can reasonably measure the distance between two vectors. We now look at some specific examples.

Example ($\ell_p$ norm)
  • The $\ell_p$ norm defined on $\mathbb{R}^n$: $\|x\|_p = \left( \sum_{i=1}^n |x_i|^p \right)^{1/p}$, where $p \ge 1$. In particular,
    • $\ell_1$ norm: $\|x\|_1 = \sum_{i=1}^n |x_i|$.
    • $\ell_2$ norm: $\|x\|_2 = \sqrt{\sum_{i=1}^n x_i^2}$, which is the most common norm in $\mathbb{R}^n$.
    • $\ell_\infty$ norm: $\|x\|_\infty = \max_{1 \le i \le n} |x_i|$.
  • Sometimes we will see the so-called $\ell_0$ norm, given by $\|x\|_0 = |\{i : x_i \ne 0\}|$, that is, the number of nonzero entries. Note that it is not a norm indeed (it fails absolute homogeneity). We call it a norm just for convenience.
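As a quick numerical illustration of these norms (the vector is chosen arbitrarily), NumPy computes them directly:

```python
import numpy as np

x = np.array([3.0, -4.0, 0.0, 1.0])

print(np.linalg.norm(x, 1))       # l1 norm: 3 + 4 + 0 + 1 = 8
print(np.linalg.norm(x, 2))       # l2 norm: sqrt(9 + 16 + 0 + 1) = sqrt(26)
print(np.linalg.norm(x, np.inf))  # l-infinity norm: max |x_i| = 4
print(np.count_nonzero(x))        # the "l0 norm": number of nonzero entries = 3
```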
Question

Why do $\ell_p$ norms satisfy the triangle inequality?

Tip

The triangle inequality for $\ell_p$ norms follows from the so-called Minkowski inequality, which we will prove several weeks later.

Another example is the so-called canonical norm, which is induced by an inner product. Usually, the inner product of two vectors $x, y \in \mathbb{R}^n$ is defined by their dot product
$$\langle x, y \rangle = x^\top y = \sum_{i=1}^n x_i y_i.$$
However, in fact, the inner product can be defined more generally.

Definition (Inner product)

An inner product for a vector space $V$ over a field $F$ is a function $\langle \cdot, \cdot \rangle : V \times V \to F$ (we assume $F = \mathbb{R}$ in our course) satisfying

  1. (Nonnegativity) $\langle x, x \rangle \ge 0$ for all $x \in V$.
  2. (Positive definiteness) $\langle x, x \rangle = 0$ iff $x = 0$.
  3. (Symmetry) $\langle x, y \rangle = \langle y, x \rangle$ for all $x, y \in V$.
  4. (Linearity) $\langle \alpha x + \beta y, z \rangle = \alpha \langle x, z \rangle + \beta \langle y, z \rangle$ for all $\alpha, \beta \in \mathbb{R}$ and $x, y, z \in V$.

Given a vector space with an inner product, the canonical norm is given by $\|x\| = \sqrt{\langle x, x \rangle}$.

Example (Euclidean space $\mathbb{R}^n$)

The inner product is given by $\langle x, y \rangle = x^\top y = \sum_{i=1}^n x_i y_i$, and thus the canonical norm is the $\ell_2$ norm.

Theorem (Cauchy-Schwarz inequality)

For any vector space $V$ with any inner product $\langle \cdot, \cdot \rangle$ and the canonical norm $\|\cdot\|$, it holds that
$$|\langle x, y \rangle| \le \|x\| \, \|y\| \quad \text{for all } x, y \in V.$$
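A quick numerical sanity check of the inequality on random vectors in $\mathbb{R}^n$ with the dot product (an illustration only, not a proof):

```python
import numpy as np

rng = np.random.default_rng(0)
x, y = rng.standard_normal(5), rng.standard_normal(5)

lhs = abs(np.dot(x, y))                       # |<x, y>|
rhs = np.linalg.norm(x) * np.linalg.norm(y)   # ||x||_2 * ||y||_2
print(lhs, rhs, lhs <= rhs)                   # the last value is always True
```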

Least squares method
...

We now return to the linear regression problem. A famous and widely applied method is the least squares method, which uses the $\ell_2$ norm of the residual as the objective function.

Given data $x_1, \dots, x_m \in \mathbb{R}^n$ and $y_1, \dots, y_m \in \mathbb{R}$, assume the values of the coefficients are $(a, b)$; then the predicted value of $y_i$ should be $\hat{y}_i = a^\top x_i + b$ (using $x_i$ and our $(a, b)$), where $\hat{y} = (\hat{y}_1, \dots, \hat{y}_m)^\top$, and our goal is to minimize $\|\hat{y} - y\|_2$. Using the form of matrix multiplication, this is to solve
$$\min_{a,\, b}\ \left\| \begin{pmatrix} x_1^\top & 1 \\ \vdots & \vdots \\ x_m^\top & 1 \end{pmatrix} \begin{pmatrix} a \\ b \end{pmatrix} - \begin{pmatrix} y_1 \\ \vdots \\ y_m \end{pmatrix} \right\|_2.$$
Note that the intercept term $b$ is not necessary: it can be absorbed into $a$ by appending a constant entry $1$ to each $x_i$. Let $A$ denote the resulting data matrix, $x$ the unknown coefficient vector and $b$ (reusing the symbol) the vector of observations; the above problem is converted into the following form:
$$\min_{x \in \mathbb{R}^n}\ \|Ax - b\|_2.$$
We now consider how to solve the above problem. The set $\{Ax : x \in \mathbb{R}^n\}$ is the column space of the matrix $A$, denoted by $\operatorname{col}(A)$ or $\mathcal{R}(A)$. The above optimization problem asks for the minimum distance from $b$ to the subspace $\operatorname{col}(A)$. The answer is the distance from $b$ to the orthogonal projection of $b$ onto the subspace.

Assume $b_0$ is the orthogonal projection of $b$ onto $\operatorname{col}(A)$, and let $e = b - b_0$. Then we have that $e$ is orthogonal to the subspace $\operatorname{col}(A)$, i.e., for any $x \in \mathbb{R}^n$, $\langle Ax, e \rangle = 0$. It yields that $A^\top e = 0$. Also there exists $\hat{x}$ such that $A\hat{x} = b_0$ (actually $\hat{x}$ is the desired $x$ in the above optimization problem). So we have
$$A^\top (b - A\hat{x}) = A^\top e = 0.$$
Thus, we have
$$A^\top A \hat{x} = A^\top b,$$
which implies
$$\hat{x} = (A^\top A)^{-1} A^\top b$$
if $A^\top A$ is invertible. Then this $\hat{x}$ suffices. We will revisit this topic later.
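As a sketch, the closed-form solution via the normal equations can be checked against NumPy's built-in least squares routine; the matrix and observation vector below are random placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((20, 3))          # data matrix, m = 20, n = 3 (placeholder)
b = rng.standard_normal(20)               # observation vector (placeholder)

# Normal equations: A^T A x = A^T b  (valid when A^T A is invertible).
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# NumPy's least squares routine solves min ||Ax - b||_2 directly.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)

print(np.allclose(x_normal, x_lstsq))     # True: both give the same minimizer
```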

Classification and the support vector machine
...

Given a data set $x_1, \dots, x_m \in \mathbb{R}^n$, a support-vector machine classifies (separates) these data using an $(n-1)$-dimensional hyperplane. Associate a label $y_i \in \{+1, -1\}$ to each $x_i$. We would like to divide the group of $x_i$ for which $y_i = +1$ from the group of $x_i$ for which $y_i = -1$. Then a hyperplane $\{x : w^\top x + b = 0\}$ is a desired one if for all $i$,
$$\begin{cases} w^\top x_i + b > 0 & \text{if } y_i = +1, \\ w^\top x_i + b < 0 & \text{if } y_i = -1. \end{cases}$$
However, there are infinitely many hyperplanes satisfying our requirements. For example, we would like to classify the black points and the white points in the following picture, and the two dotted lines are both satisfactory. Which one is better? A reasonable choice is to find the "maximum-margin" hyperplane, that is, to make the minimum distance from the points to the hyperplane as large as possible.
![[1694357241954.png]]
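For a rough illustration, the sketch below fits a (nearly) maximum-margin linear classifier with scikit-learn; the 2-D data are made up, and the large penalty parameter `C` is only an assumption used here to approximate the hard-margin case.

```python
import numpy as np
from sklearn.svm import SVC

X = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0],    # class -1 (toy data)
              [4.0, 4.0], [5.0, 4.5], [4.5, 5.0]])   # class +1 (toy data)
y = np.array([-1, -1, -1, 1, 1, 1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)          # large C ~ hard margin
w, b = clf.coef_[0], clf.intercept_[0]
print(w, b)                                          # hyperplane w^T x + b = 0
print(clf.support_vectors_)                          # points attaining the margin
```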

Distance to a hyperplane
...

We first consider the problem of computing the distance from a point to a hyperplane.
Assume the hyperplane is $H = \{x : w^\top x + b = 0\}$ and the point is $x_0$. Again (similar to the least squares method), consider the orthogonal projection of $x_0$ onto $H$. Suppose the orthogonal projection is $x_1$. It is clear that $x_0 - x_1$ is orthogonal to $H$, i.e., parallel to $w$. So there exists $t \in \mathbb{R}$ such that $x_0 - x_1 = t w$. Also, $w^\top x_1 + b = 0$ since $x_1 \in H$.
Now we have
$$w^\top x_0 + b = w^\top (x_1 + t w) + b = t \, \|w\|_2^2,$$
which yields
$$t = \frac{w^\top x_0 + b}{\|w\|_2^2}.$$
The distance from $x_0$ to $H$ is
$$\|x_0 - x_1\|_2 = |t| \, \|w\|_2 = \frac{|w^\top x_0 + b|}{\|w\|_2}.$$
Back to the problem of classification. Now our goal is to solve the following optimization problem:
$$\max_{w,\, b}\ \min_{1 \le i \le m} \frac{|w^\top x_i + b|}{\|w\|_2} \quad \text{s.t.} \quad y_i (w^\top x_i + b) > 0,\ i = 1, \dots, m.$$
But this form is too complicated to solve. We would like to simplify it.
Note that scaling $(w, b)$ to $(\lambda w, \lambda b)$ with $\lambda > 0$ does not change the hyperplane, and we could choose proper $w$ and $b$ so that $\min_{1 \le i \le m} |w^\top x_i + b| = 1$. So the optimization is equivalent to
$$\max_{w,\, b}\ \frac{1}{\|w\|_2} \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1,\ i = 1, \dots, m,$$
which is further equivalent to
$$\min_{w,\, b}\ \frac{1}{2}\|w\|_2^2 \quad \text{s.t.} \quad y_i (w^\top x_i + b) \ge 1,\ i = 1, \dots, m.$$
The last form is easy to solve (since it is actually a convex optimization problem).
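As a sketch of how the last form can be solved numerically, the following hands the convex problem to SciPy's general-purpose solver on made-up 2-D data; in practice a dedicated quadratic programming solver would be used instead.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 1.0], [2.0, 1.5],      # class -1 (toy data)
              [4.0, 4.0], [5.0, 4.5]])     # class +1 (toy data)
y = np.array([-1.0, -1.0, 1.0, 1.0])

def objective(z):                          # z = (w1, w2, b); minimize 0.5 * ||w||^2
    return 0.5 * np.dot(z[:2], z[:2])

# One constraint y_i (w^T x_i + b) - 1 >= 0 per data point.
constraints = [{"type": "ineq",
                "fun": lambda z, i=i: y[i] * (X[i] @ z[:2] + z[2]) - 1.0}
               for i in range(len(y))]

res = minimize(objective, x0=np.zeros(3), constraints=constraints)
w, b = res.x[:2], res.x[2]
print(w, b)                                        # maximum-margin hyperplane
print(np.abs(X @ w + b) / np.linalg.norm(w))       # distances; the minimum is ~ 1/||w||_2
```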

Remark

The constraints $y_i (w^\top x_i + b) \ge 1$ are equivalent to our assumption $\min_{1 \le i \le m} |w^\top x_i + b| = 1$, because our goal is to minimize the norm of $w$. If $\min_{1 \le i \le m} y_i (w^\top x_i + b) > 1$, the corresponding $(w, b)$ cannot be the optimal solution, since scaling it down would further decrease $\|w\|_2$ while keeping the constraints satisfied.