
Theory & Applications of Online Learning

Shai Shalev-Shwartz Yoram Singer

ICML, July 5th 2008

Motivation - Spam Filtering

For t = 1, 2, . . . , T

Receive an email

Expert advice: Apply d spam filters to get xt ∈ {+1,−1}d

Predict ŷt ∈ {+1,−1}

Receive true label yt ∈ {+1,−1}

Suffer loss ℓ(yt, ŷt)


Goal – Low Regret

We don’t know in advance the best performing expert

We’d like to find the best expert in an online manner

We’d like to make as few filtering errors as possible

This setting is called "regret analysis". Our goal:

∑_{t=1}^{T} ℓ(ŷt, yt) − min_i ∑_{t=1}^{T} ℓ(xt,i, yt) ≤ o(T)
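The regret bound above compares the learner's cumulative loss to that of the single best expert in hindsight. A minimal sketch of this comparison (not from the tutorial; function names are illustrative):

```python
# Illustrative sketch: regret of a learner against the best single
# expert in hindsight, under 0-1 loss. Names are hypothetical.

def zero_one_loss(y_hat, y):
    return 1 if y_hat != y else 0

def regret(predictions, expert_advice, labels):
    """predictions[t]: learner's guess; expert_advice[t][i]: expert i's
    vote at round t; labels[t]: true label y_t."""
    learner_loss = sum(zero_one_loss(p, y) for p, y in zip(predictions, labels))
    n_experts = len(expert_advice[0])
    best_expert_loss = min(
        sum(zero_one_loss(x[i], y) for x, y in zip(expert_advice, labels))
        for i in range(n_experts)
    )
    return learner_loss - best_expert_loss
```

A learner has low regret when this difference grows sublinearly in T.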

Regret Analysis

Low regret means that we do not lose much from not knowing future events

We can perform almost as well as someone who observes the entire sequence and picks the best prediction strategy in hindsight

No statistical assumptions

We can also compete with changing environment

In many cases, data arrives sequentially while predictions are required on-the-fly

Applicable also in adversarial and competitive environments (e.g. spam filtering, stock market)

Can adapt to changing environment

Simple algorithms

Theoretical guarantees

Online-to-batch conversions, generalization properties

Why Online?

Outline

Tutorial’s goals: provide design and analysis tools for online algorithms

Part I: What prediction tasks are possible

Regression with squared-loss
Classification with 0-1 loss

Convexity is a key property

Online Convex Optimization

[Diagram: regression maps directly to online convex optimization; non-convex problems (experts’ advice, classification, structured output) reach it via convexification and randomization]

Outline

Tutorial’s goals: provide design and analysis tools for online algorithms

Part II: An algorithmic framework for online convex optimization

[Diagram: derived algorithms organized along two axes: update type (additive vs. multiplicative) and update complexity (gradient-based vs. follow-the-leader)]

Part III: Derived algorithms

• Perceptrons (aggressive, conservative)
• Passive-Aggressive algorithms for the hinge-loss
• Follow the regularized leader (online SVM)
• Prediction with expert advice using multiplicative updates
• Online logistic regression with multiplicative updates

Outline

Tutorial’s goals: provide design and analysis tools for online algorithms

Part IV: Application - Mail filtering

• Algorithms derived from framework for online convex optimization:

• Additive & multiplicative dual steppers
• Aggressive update schemes: instantaneous dual maximizers

• Mail filtering by online multiclass categorization

Outline

Part V: Not covered due to lack of time

• Improved algorithms and regret bounds: self-tuning, logarithmic regret for strongly convex losses
• Other notions of regret: internal regret, drifting hypotheses
• Partial feedback: bandit problems, reinforcement learning
• Online-to-batch conversions

Problem I: Regression

Task: guess the next element of a real-valued sequence

What could constitute a good prediction strategy?

Online Regression

For t = 1, 2, . . .

Predict a real number ŷt ∈ R

Receive yt ∈ R

Suffer loss (ŷt − yt)2

Regression (cont.)

Follow-The-Leader

Predict: ŷt = (1/(t−1)) ∑_{i=1}^{t−1} yi

Similar to Maximum Likelihood

Regret Analysis

The FTL predictor satisfies:

∀y⋆:  ∑_{t=1}^{T} (ŷt − yt)² − ∑_{t=1}^{T} (y⋆ − yt)² ≤ O(log T)

FTL is minimax optimal (outside scope)

Regression (cont.)

Proof Sketch

Be-The-Leader: ỹt = (1/t) ∑_{i=1}^{t} yi

The regret of BTL is at most 0 (elementary)

FTL is close enough to BTL (simple algebra):

(ŷt − yt)² − (ỹt − yt)² ≤ O(1/t)

Summing over t (a harmonic series) completes the proof
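The FTL predictor above can be simulated directly. A minimal sketch (the first-round prediction, undefined on the slide, is taken to be 0 here):

```python
import random

# Sketch: FTL for online regression predicts the running mean of past
# labels. Empirically its regret stays far below T, consistent with
# the O(log T) bound.

def ftl_regret(ys):
    total, learner_loss = 0.0, 0.0
    for t, y in enumerate(ys, start=1):
        y_hat = total / (t - 1) if t > 1 else 0.0   # mean of y_1..y_{t-1}
        learner_loss += (y_hat - y) ** 2
        total += y
    best = total / len(ys)                          # best fixed prediction in hindsight
    best_loss = sum((best - y) ** 2 for y in ys)
    return learner_loss - best_loss

random.seed(0)
ys = [random.uniform(-1.0, 1.0) for _ in range(10000)]
print(ftl_regret(ys))   # small relative to T = 10000
```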

Problem II: Classification

Guess the next element of a binary sequence

Online Prediction

For t = 1, 2, . . .

Predict a binary number ŷt ∈ {+1,−1}

Receive yt ∈ {+1,−1}

Suffer the 0-1 loss:

ℓ(ŷt, yt) = { 1 if yt ≠ ŷt; 0 otherwise }

Classification (cont.)

No algorithm can guarantee low regret!

Proof Sketch

Adversary can force the cumulative loss of the learner to be as large as T by using yt = −ŷt

The loss of the constant prediction y⋆ = sign(∑_t yt) is at most T/2

Regret is at least T/2
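The adversarial argument can be made concrete: against any deterministic learner (here, a majority-of-past-labels rule, chosen for illustration) the adversary plays yt = −ŷt, so the learner errs on every round while the best constant prediction errs at most T/2 times.

```python
# Sketch of the lower-bound argument with a concrete deterministic learner.

def majority(history):
    return 1 if sum(history) >= 0 else -1   # deterministic prediction rule

T = 100
history, learner_loss = [], 0
for _ in range(T):
    y_hat = majority(history)
    y = -y_hat                 # adversary flips the learner's prediction
    learner_loss += 1          # learner is wrong on every round
    history.append(y)

best = 1 if sum(history) >= 0 else -1
best_loss = sum(1 for y in history if y != best)
print(learner_loss, best_loss)   # 100 50 -> regret of T/2
```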

Intermediate Conclusion

Two similar problems

Predict the next real-valued element with squared loss

Predict the next binary-valued element with 0-1 loss

Size of the decision set does not matter!

In the first problem, loss is convex and decision set is convex

Is convexity sufficient for predictability?

Online Convex Optimization

Online Convex Optimization

For t = 1, 2, . . . , T

Learner picks wt ∈ S

Environment responds with a convex loss ℓt : S → R

Learner suffers loss ℓt(wt)

An abstract game between learner and environment:
• Game board is a convex set S
• Learner plays with vectors in S
• Environment plays with convex functions over S

Online Convex Optimization -- Example I

Regression

S = R

Learner predicts element ŷt = wt ∈ S

A true target yt ∈ R defines a loss function ℓt(w) = (w − yt)²
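The abstract game, instantiated with this regression example, can be sketched as code. The learner callable and loop structure below are illustrative, not the tutorial's:

```python
# Sketch: the online convex optimization protocol, with S = R,
# squared losses, and an FTL learner.

def play_oco(pick, losses):
    """pick(t) -> w_t in S, chosen before loss t is revealed."""
    total = 0.0
    for t, loss_t in enumerate(losses):
        w_t = pick(t)          # learner commits to a point in S
        total += loss_t(w_t)   # environment's convex loss is then revealed
    return total

ys = [1.0, 3.0, 2.0]
losses = [lambda w, y=y: (w - y) ** 2 for y in ys]   # loss_t(w) = (w - y_t)^2

def ftl(t):
    return sum(ys[:t]) / t if t > 0 else 0.0          # mean of past targets

print(play_oco(ftl, losses))   # 1.0 + 4.0 + 0.0 = 5.0
```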

Online Convex Optimization -- Example II

Regression with Experts Advice

S = {w ∈ Rd : wi ≥ 0, ‖w‖1 = 1}

Learner picks wt ∈ S

Learner predicts ŷt = 〈wt,xt〉

A pair (xt, yt) defines a loss function over S: ℓt(w) = (⟨w,xt⟩ − yt)²
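A minimal sketch of this experts-advice loss (helper names are illustrative): weights lie on the probability simplex and the loss is the squared error of the weighted prediction.

```python
# Sketch: the loss function of Example II for a weight vector on the
# probability simplex.

def in_simplex(w, tol=1e-9):
    return all(wi >= 0 for wi in w) and abs(sum(w) - 1.0) < tol

def expert_loss(w, x, y):
    y_hat = sum(wi * xi for wi, xi in zip(w, x))   # prediction <w, x>
    return (y_hat - y) ** 2

w = [0.25, 0.25, 0.5]          # a point in S
assert in_simplex(w)
x, y = [1.0, -1.0, 1.0], 1.0   # three experts' predictions and the target
print(expert_loss(w, x, y))    # (0.5 - 1.0)^2 = 0.25
```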

Coping with Non-convex Loss Functions

Method I: Convexification

Find a surrogate convex loss function

Mistake bound model

Method II: Randomization

Allow randomized predictions

Analyze the expected regret

Loss in expectation is convex

Convexification and Mistake Bound

Non-convex loss: the mistake indicator, a.k.a. the 0-1 loss

Recall that regret can be as large as T/2

Surrogate loss function: hinge-loss

ℓhi(wt, (xt, yt)) = [1 − yt⟨wt, xt⟩]+

where [a]+ = max{a, 0} and

ℓ0−1(ŷt, yt) = { 1 if yt ≠ ŷt; 0 otherwise }

By construction:

ℓ0−1(ŷt, yt) ≤ ℓhi(wt, (xt, yt))
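The domination of the 0-1 loss by the hinge loss can be checked numerically. A sketch (the weight vector and examples below are arbitrary illustrations):

```python
# Sketch: verifying that the hinge loss upper-bounds the 0-1 loss,
# which is what makes it a valid convex surrogate.

def zero_one(y_hat, y):
    return 1.0 if y_hat != y else 0.0

def hinge(w, x, y):
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return max(1.0 - margin, 0.0)                 # [1 - y<w, x>]_+

w = [0.5, -0.25]
for x, y in [([1.0, 2.0], 1), ([1.0, 2.0], -1), ([4.0, 0.0], 1)]:
    score = sum(wi * xi for wi, xi in zip(w, x))
    y_hat = 1 if score >= 0 else -1
    assert zero_one(y_hat, y) <= hinge(w, x, y)   # surrogate dominates 0-1
```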
