
Hi everyone and welcome [NOISE] to Lecture 9 uh, for CS230.

Uh, today we are going to discuss an advanced topic uh,

that will be kind of the,

the marriage between deep learning and

another field of AI which is reinforcement learning,

and we will see a practical uh,

application in how deep learning methods can be plugged in another family of algorithm.

So, it's interesting because deep learning methods and deep neural networks

have been shown to be very good uh, function approximators.

Essentially, that's what they are.

We're giving them data so that they can approximate a function.

There are a lot of different fields which require these function approximators,

and deep learning methods can be plugged in all these methods.

This is one of these examples.

So, we'll first uh, motivate uh,

the, the setting of reinforcement learning.

Why do we need reinforcement learning?

Why can't we use deep learning methods to solve everything?

There is a set of problems that we cannot solve with

deep learning alone, and

reinforcement learning applications are examples of that.

Uh, we will see an example uh,

to introduce an algorithm, a reinforcement learning algorithm called Q-Learning,

and we will add deep learning to this algorithm and make it Deep Q-Learning.

[NOISE] Uh, as we've seen with uh,

generative adversarial networks and also deep neural networks,

most models are hard to train.

We had to come up with Xavier initialization, with dropout,

with batch norm, and myriads,

myriads of methods to make these deep neural networks train.

For GANs, we also had to use specific methods,

tricks, and hacks in order to train them.

So, here we will see some of the tips and tricks to train Deep Q-Learning,

which is a reinforcement learning algorithm.

[NOISE] And at the end,

we will have a

guest speaker coming uh,

to talk about advanced topics which are mostly research,

which combine deep learning and reinforcement learning.

Sounds good? Okay. Let's go.

[NOISE] So, deep reinforcement learning is a very recent field I would say.

Although reinforcement learning

has existed for a long time,

uh, only recently has it been shown that using deep learning as a way to

approximate the functions that play

a big role in reinforcement learning algorithms works very well.

So, one example is AlphaGo and uh,

you probably all have heard of it.

It's Google DeepMind's

AlphaGo that has beaten world champions at the game of Go,

which is a very old strategy game,

and the one on your right,

"Human-level control through deep reinforcement learning," is also a

Google DeepMind paper that came

out and hit the headlines on the front page of Nature,

which is one of the leading

multidisciplinary peer-reviewed journals in the world.

And they've shown that with deep learning plugged in a reinforcement learning setting,

they can train an agent that beats human level in a variety of games and in fact,

these are Atari games.

So, they've shown actually that their algorithm,

the same algorithm reproduced for a large number of games,

can beat humans on most of these games, though not all of them.

So, these are two examples;

although they use different sub-techniques of reinforcement learning,

they both include some deep learning aspect.

And today we will mostly talk about human-level control through

deep reinforcement learning, the so-called Deep Q-Network,

presented in this paper.

So, let's, let's start with,

with motivating reinforcement learning using the, the AlphaGo setting.

Um, this is a board of Go and the picture comes from DeepMind's blog.

Uh, so Go you can think of it as a strategy game,

where you have a grid that is up to 19 by 19 and you have two players.

One player has white stones and one player has black stones,

and at every step in the game,

you can place a stone on the board,

on one of the grid intersections.

The goal is to surround your opponent,

so to maximize your territory by surrounding your opponent.

And it's a very complex game for different reasons.

Uh, one reason is that you cannot be shortsighted in this game.

You have to have a long-term strategy.

Another reason is that the board is so big.

It's much bigger than a chessboard, right?

Chessboard is eight by eight.

So, let me ask you a question.

If you had to solve or build an agent that solves this game and beats

humans or plays very well at least with

deep learning methods that you've seen so far, how would you do that?

[NOISE]

Someone wants to try.

[NOISE] So, let's say you have to collect a dataset, because in

classic supervised learning,

we need a dataset with x and y.

What do you think would be your x and y?

[NOISE] Yeah.

The game board is the input, the output probability of victory in that position.

Okay. Input is game board, and output is probability of victory in that position.

So, that's, that's a good one I think. Input output.

What's the issue with that one?

[NOISE] So, yeah.

[inaudible]

Yeah. It's super hard to represent what the probability of winning is from this board.

Even, like nobody can tell.

Even if I ask [NOISE] an experienced human to come and tell us what's

the probability of black winning or white winning in

this setting, they wouldn't be able to tell.

So, this is a little more complicated.

Any other ideas of data sets? Yeah.

[NOISE] [inaudible].

Okay. Good point. So, we could have the grid like this one and then this is the input,

and the output would be the move, the next action taken by probably a professional player.

So, we would just watch professional players playing and we would record their moves,

and we will build a data set of what is a professional move,

and we hope that our network using these input/outputs,

will at some point learn how the professional players play and given an input,

a state of the board,

would be able to decide on the next move.

What's the issue with that?

Yeah.

[inaudible]

Yes. [NOISE] We need a whole lot of data.

Why? And - and you, you said it.

You - you said because, uh,

we need basically to represent all types of positions of the board, all states.

So, let's actually do that.

If we were to compute the number of possible states,

uh, of this board, what would it be?

[NOISE] It's a 19 by 19 board.

[NOISE].

Remember what we did with adversarial examples.

We did it for pixels, right?

[NOISE] Now, we're doing it for the board.

So, the question first is: yeah, you wanna try?

Uh, three to the 19 squared.

Yeah. Three to the power 19 [NOISE] times 19.

Or 19 squared. Yeah. So, why is it that?

[NOISE] So, why is it? Is it this?

Each spot can have no stone,

a white stone, or a black stone.

Yeah. Each spot, and there are 19 times 19 spots,

can have three states basically:

no stone, white stone, or black stone.

And that counts all the possible states.

This is about 10 to the 170.

So, it's super, super big.
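As a quick sanity check of that number, here is the count in plain Python (a rough upper bound, since it ignores Go's legality rules):

import math
num_spots = 19 * 19            # 361 grid intersections
states = 3 ** num_spots        # each spot: empty, a white stone, or a black stone
print(math.log10(states))      # about 172, i.e. on the order of 10^170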

So, we can probably not get even close to that by observing professional players.

First because we don't have enough professional players and because,

uh, we are humans and we don't have infinite life.

So, the professional players can not play forever.

They might get tired as well.

Uh, but, so one issue is the state space is too big.

Another one is that the ground truth probably would be wrong.

It's not because you're a professional player that you will play

the best move every time, right?

Every player has their own strategy.

So, the ground truth we're

using here is not necessarily true,

and our network might not be able to beat these human players.

What we are looking into here is an algorithm that beats humans.

Okay. Second one, too many states in the game as you mentioned,

and third one we will likely not generalize.

The reason we will not generalize is because in classic supervised learning,

we're looking for patterns.

If I ask you to build an algorithm to detect cats versus dogs,

it will look for what the pattern of a cat is versus what the pattern of a dog is,

and in the convolutional filters, it will learn that.

In this case, it's about a strategy.

It's not about a pattern.

So, you have to understand the process

of winning this game in order to make the next move.

You cannot generalize if you don't understand this process of long-term strategy.

So, we have to incorporate that,

and that's where RL comes into place.

RL is reinforcement learning,

a method that could be described in one sentence

as automatically learning to make good sequences of decisions.

So, it's about the long-term.

It's not about the short-term,

and we would use it generally when we have delayed labels,

like in this game.

The label that you mentioned at the beginning was probability of victory.

This is a long-term label.

We cannot get this label now but over time,

the closer we get to the end,

the better we are at seeing the victory or not,

and it's for making sequences of decisions.

So, we make a move then the opponent makes a move.

Then we make another move, and

all the decisions of these moves are correlated with each other.

Like you have to plan in, in advance.

When you are human you do that,

when you play chess, when you play Go [NOISE].

So examples of RL applications can be robotics

and it's still a research topic how deep RL can change robotics,

but think about having a robot here that you wanna send over there.

What you're teaching the robot is: if you get there, it's good, right? It's good.

You achieve the task, but I

cannot give you the probability of getting there at every point.

I can help you out by giving you

a reward when you arrive there and let you trial and error.

So, the robot will try: randomly

initialized, the robot would just fall down at first.

It gets a negative reward.

Then, it repeats.

This time the robot knows that it shouldn't fall down.

It shouldn't go down.

It should probably go this way.

So, through trial and error and reward on the long-term,

the - the robot is supposed to learn this pattern.

Another one is games and that's the one we will see today.

Uh, games can be represented

as a set of rewards for a reinforcement learning algorithm.

So, this is where you win,

this is where you lose.

Let the algorithm play and figure out what

winning means and what losing means, until it learns.

Okay. The problem with using deep learning alone is that the algorithm

will not learn because this reward is too long-term.

So, we're using reinforcement learning, and finally advertisements.

So, a lot of advertisements,

um, are real-time bidding.

So, given a budget, you wanna know when to invest this budget,

and this is a long-term strategy planning as well,

that reinforcement learning can help with.

Okay. [NOISE] So, this was the motivation of reinforcement learning.

We're going to jump to a concrete example that is

a super vanilla example to understand Q-Learning.

So, let's start with this game or environment.

So, we call that an environment generally and it has several states.

In this case five states.

So, we have these states and we can define rewards, which are the following.

So, let's see what is our goal in this game.

We define it as maximize the return or the reward on the long-term.

And the reward is the

numbers that you have here,

which were defined by a human.

So, this is where the human defines the reward.

Now, what's the game? The game has five states.

State one is a trash can and has a reward of plus two.

State two is a starting state, initial state,

and we assume that we would start in

the initial state with the plastic bottle in our hand.

The goal would be to throw this plastic bottle in a can.

[NOISE] If it hits the trash can, we get +2.

If we get to state five,

we get to the recycle bin,

and we can get +10.

Super important application.

State four has a chocolate.

So, what happens is if you go to state four,

you get a reward of 1 because you can eat the chocolate,

and you can also throw the,

the chocolate in the, in - in the,

in the recycle bin hopefully.

Does this setting make sense?

So, these states are of three types.

One is the starting or initial state, which is brown.

The [NOISE] normal state, which is

neither a starting nor an ending state,

is gray.

And the blue states are terminal states.

So, if we get to a terminal state,

we end a game, or an episode let's say.

Does the setting make sense?

Okay, and you have two, two possible actions.

You have to move. Either you go on the left or you go on the right.

An additional rule we

will add is that the garbage collector will come in three minutes,

and every step takes you one minute.

So, you cannot spend more than three minutes in this game.

In other words, you cannot stay at the chocolate and eat chocolate forever.

You have to, to move at some point.

Okay. So, one question I'd have is how do you define the long-term return?

Because we said we want a long-term return.

We don't want, we don't care about short-term returns.

[NOISE] What do you

think is a good way to define the long-term return here? [NOISE] Yeah.

Sum of the terminal state.

The sum of the terminal states.

No, the sum of how many points you have when you reach the terminal state.

The sum of how many points you have when you reach a terminal state.

So, let's say I'm in this state two,

I have zero reward right now.

If I reach the terminal state on

your left, the +2,

I get +2 reward and I finish the game.

If I go on the right instead, I reach the +10.

You are saying that the long-term return can be

the sum of all the rewards I got to get there, so +11.

So, this is one way to define the long-term return.

Any other ideas?

[NOISE]

[inaudible] reward.

Yeah, we probably want to incorporate the time steps

and reduce the reward as time passes,

and in fact this would be called a discounted return,

versus what you said, which would be called a return.

Here we'll use a discounted return, and it has several advantages.

Some are mathematical, because the return

you describe, which is not discounted, might not converge.

It might go up to plus infinity.

The discounted return will converge with an appropriate discount.

So intuitively also, why is the discounted return intuitive?

It's because time is always an important factor in our decision-making.

People prefer cash now to cash in 10 years, right?

Or similarly, you can consider that the robot has a limited life expectancy,

like it has a battery and loses battery every time it moves.

So you want to take into account this discount: if the chocolate is close,

I go for it, because I know that if the chocolate is too far,

I might not get there because I'm losing some battery,

some energy, for example.

So this is the discounted return.

Now, if we take gamma equals one, which means we have no discount,

is the best strategy to follow in this setting to go

to the left or to go to the right, starting in the initial state two?

Right. And the reason is,

it's a simple computation.

On one side I get +2,

on the other side I get +11.

What if my discount was 0.1?

Which one would be better?

Yeah, the left would be better, going directly to the +2.

And the reason is because we compute it in our mind.

We just do 0 plus 0.1 times 1,

which gives us 0.1,

plus 0.1 squared times 10, which gives us another 0.1, so 0.2 in total.

And it's less than 2. We know it.
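A minimal check of that comparison in Python, assuming the rewards sketched on the slide (going left from state two pays +2 immediately; going right pays 0, then +1 for the chocolate, then +10 for the recycle bin):

def discounted_return(rewards, gamma):
    # Sum of gamma^t * r_t over time steps t.
    return sum(r * gamma ** t for t, r in enumerate(rewards))

for gamma in (1.0, 0.1):
    left = discounted_return([2], gamma)
    right = discounted_return([0, 1, 10], gamma)
    print(gamma, left, right)  # gamma=1: 2 vs 11; gamma=0.1: 2 vs 0.2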

Okay. So now we're going to assume that the discount is 0.9.

And it's a very common discount to

use in reinforcement learning and we use a discounted return.

So the general question here and it's the core

of reinforcement learning in this case of Q-Learning is,

what do we want to learn?

And this is really, really,

think of it as a human,

what would you like to learn?

What are the numbers you need to have

in order to be able to make decisions really quickly,

assuming you had a lot more states and actions than this?

Any ideas of what we want to learn?

What would help our decision-making?

Optimal action at each state.

Yeah. That's exactly what we want to learn.

For a given state, tell me the actions that I can take.

And for that I need to have a score for all the actions in every state.

In order to store the scores,

we need a matrix, right?

So, this is our matrix. We will call it a Q table.

It's going to be of shape,

number of states times number of actions.

If I have this matrix of scores and the scores are correct,

I'm in state three,

I can look at the third row of this matrix and see what the maximum value is:

is it the first one or the second one?

If it's the first one, I go to the left;

if the second one is the maximum, I go to the right.

This is what we would like to have.

Does that make sense, this Q table?

So now, let's try to build a Q table for this example.

If you had to build it, you would first think of it as a tree.

Oh and by the way, every entry of this Q table tells you how good

it is to take this action in that state.

State corresponding to the row,

action corresponding to the column.

So now, how do we get there? We can build a tree.

And that's, that's similar to what we would do in our mind.

We start in S2. In S2, we have two options.

Either we go to S1,

we get 2, or we go to S3 and we get 0.

From S1,

we cannot go anywhere, it's a terminal state.

But from S3, we can go to S2,

and get 0 by going back,

or we can go to S4 and get 1.

That makes sense? From S4, same.

We can get 0 by going back to S3 or we can go to S5 and get +10.

Now, here I just have my immediate reward for every state.

What I would like to compute is

the discounted return for all the states, because ultimately,

what should lead my decision-making in a state is,

if I take this action,

I get to a new state.

What's the maximum reward I can get from there in the future?

Not just the reward I get in that state.

If I take the other action,

I get to another state.

What's the maximum reward I could get from that state?

Not just the immediate reward that I get from going to that state.

So what we will do - we can do it together.

Let's say we want to compute the values of

the actions from S3,

working from right to left.

From S3, I can either go to S4 or S2.

Going to S4, I know that the immediate reward was 1,

and I know that from S4,

I can get +10.

This is the maximum I can get.

So I can discount this 10 by multiplying it by 0.9:

10 times 0.9 gives us 9,

plus 1, which was the immediate reward,

and that gives us 10.

So 10 is the score that we give to the action go right from state S3.

Now, what if we do it one step before, from S2?

From S2, I know that I can go to S3,

and going to S3 I get zero reward.

So the immediate reward is zero.

But I know that from S3,

I can get 10 reward ultimately on the long-term.

I need to discount this reward from one step.

So I multiply this 10 by 0.9 and I get 0 plus 0.9 times 10 which gives me 9.

So now, in state two going right will give us a long-term reward of 9.

Makes sense? And you do the same thing.

You can copy back that going from S4 to S3 will give you 0 plus the maximum

you can get from S3, which was 10, discounted by 0.9,

or you can do the same for going left from S3.

From S2, I can go left and get +2,

or I can go right and get 9, so the best I can get from S2 is 9.

The immediate reward for going from S3 to S2 is 0,

and I will discount the 9 by 0.9 and get 8.1.

So that's the process we would do to compute that.

And you see that it's an iterative algorithm.

I will just copy back all these values in my matrix.
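Here is a minimal sketch in Python of that iterative computation for this five-state example, applying the same backup repeatedly until the values stop changing. The state encoding and variable names are my own; action 0 is left, action 1 is right, and gamma is 0.9.

import numpy as np

n_states, n_actions, gamma = 5, 2, 0.9
terminal = {0, 4}                               # S1 and S5, zero-indexed
# next_state[s][a] and reward[s][a] for the non-terminal states S2, S3, S4
next_state = {1: (0, 2), 2: (1, 3), 3: (2, 4)}
reward = {1: (2, 0), 2: (0, 1), 3: (0, 10)}

Q = np.zeros((n_states, n_actions))
for _ in range(20):                             # a few sweeps are enough to converge here
    for s in next_state:
        for a in range(n_actions):
            s2 = next_state[s][a]
            future = 0.0 if s2 in terminal else gamma * Q[s2].max()
            Q[s, a] = reward[s][a] + future
print(Q)    # row for S2 is [2, 9], row for S3 is [8.1, 10], row for S4 is [9, 10]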

And now, if I'm in state two,

I can clearly say that the best action seems to be

to go to the right, because the long-term discounted reward is 9,

while the long-term discounted reward for going to the left is 2.

And I'm done. That's Q-Learning.

I solved the problem.

I had, I had a problem statement.

I found a matrix that tells me in every state what action I should take.

I'm done. So, why do we need deep learning?

That's the question we will try to answer.

So the best strategy to follow with 0.9 is still right,

right, right, and the way I see it is,

I just look at my matrix at every step.

And I follow always the maximum of my row.

So, from state two,

9 is the maximum, so I go right.

From state three, 10 is the maximum so I still go right.

And from state four, 10 is the maximum, so I go right again.

So I take the maximum over all the actions in a specific state.

Okay. Now, one interesting thing to follow is that when you do this iterative algorithm,

at some point, it should converge.

And ours converged to some values that

represent the discounted rewards for every state and action.

There is an equation that this Q-function follows.

And we know that the optimal Q-function follows this equation.

The one we have here follows this equation.

This equation is called the Bellman equation.

And it has two terms.

One is R and one is discount times the maximum of the Q scores over all the actions.

So, how does that make sense?

Given that you're in a state S,

you want to know the score of going,

of taking action a in the state.

The score should be the reward that you get by going there,

plus the discount times the maximum you can get in the future.
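Written out in symbols (my own notation for what was just described):

Q*(s, a) = R(s, a) + gamma * max over a' of Q*(s', a'),

where s' is the state you reach by taking action a in state s, and the max runs over the actions a' available from s'.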

That's actually what we used in the iteration.

Does this Bellman equation make sense?

Okay. So remember this is going to be

very important in Q-learning, this Bellman equation.

It's the equation that is satisfied by the optimal Q table or Q-function.

And if you try out all these entries,

you will see that it follows this equation.

So when Q

is not optimal, it's not following this equation yet.

We would like Q to follow this equation.

Another point of vocabulary in reinforcement learning is a policy.

Policies are sometimes denoted pi or mu.

Pi of S is equal to the argmax over the actions of the optimal Q that you have.

What it means is exactly our decision process.

Given that we are in state S, we look at

the row of state S in our Q table, over all the columns, and we take the maximum.

And this is what pi of S is telling us.

It's telling us, "This is the action you should take."

So, pi, our policy is our decision-making.

Okay. It tells us what's the best strategy to follow in a given state.
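In symbols, again using my own notation for what was just said: pi(s) = argmax over a of Q*(s, a), that is, read the row of the Q-table for state s and pick the action with the largest score.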

Any questions so far?

Okay, and so I have a question for you.

Why is deep learning helpful?

Yes?

The number of states is - is large, is way too large to store.

Yeah. That's very easy.

Number of states is way too large to store a table like that.

So, if you have a small number of states and a small number of actions,

then you can easily use the Q-table.

At every state,

looking into the Q-table is super quick

to find out what you should do.

But ultimately, this Q-table will get bigger

and bigger depending on the application, right?

And the number of states for Go is 10 to the power 170 approximately,

which means that this matrix should have a number of rows

equal to 1 followed by 170 zeros.

You know what I mean. It's very big.

And number of actions is also going to be bigger.

In Go, you can place your stone anywhere on the board that is available, of course.

Okay. So, many - way too many states and actions.

So, we would need to come up with

maybe a function approximator that can give us the action based on the state,

instead of having to store this matrix.

That's where deep learning will come.

So, just to recap these first 30 minutes.

In terms of vocabulary, we learned what an environment is.

It's the - it's the general game definition.

An agent is the thing we're trying to train, the decision-maker.

A state, an action,

reward, total return, a discount factor.

The Q-table which is the matrix of entries representing

how good it is to take action A in state S,

a policy which is our decision-making function,

telling us what's the best strategy to apply in a state.

and the Bellman equation, which is satisfied by the optimal Q-table.

Now, we will tweak this Q-table into a Q-function.

And that's where we - we shift from Q-learning to deep Q-learning.

So, find a Q-function to replace the Q-table.

Okay? So, this is the setting.

We have our problem statement. We have our Q-table.

We want to change it into a function approximator that will be our neural network.

Does that make sense how deep learning comes into reinforcement learning here?

So now, we take a state as input, forward propagate it in the deep network,

and get an output which is an action - an action score.

For all the actions.

It makes sense to have an output layer that is the size of the number of

actions because we don't wanna - we don't wanna

give an action as input and the state as input,

and get the score for this action taken in this state.

Instead, we can be much quicker.

We can just give the state as inputs,

get all the distribution of scores over the output,

and we just select the maximum of this vector,

which will tell us which action is best.

So if - if we're in state two let's say here.

We are in state two and we forward propagate state two,

we get two values which are the scores of going left and right from state two.

We can select the maximum of those and it will give us our action.

The question is how to train this network? We know how to train it.

We've been learning it for nine weeks.

Compute the loss, back propagate.

Can you guys think of some issues that,

that make this setting different from a classic supervised learning setting?

The reward change is dynamic.

Yes?

The reward change is dynamic.

The reward change is dynamic.

So, the reward doesn't change. The reward is set.

You define it at the beginning. It doesn't change dynamically.

But I think what you meant is that the Q-scores change dynamically.

Yeah.

That's true. The Q-scores change dynamically.

But that's - that's probably okay because our network changed.

Our network is now the Q-score.

So, when we update the parameters of the network,

it updates the Q-scores.

What's-what's another issue that we might have?

No labels.

No labels. Remember in supervised learning,

you need labels to train your network. What are the labels, here?

[NOISE]. And don't say

compute the Q-table, use them as labels.

It's not gonna work. [NOISE]. Okay. So, that's

the main issue that makes this problem very different from classic supervised learning.

So, let's see how - how deep learning can be tweaked a little.

And we want you to see these techniques because they - they're

helpful when you read a variety of research papers.

We have our network given a state gives

us two scores that represent actions for going left and right from the state.

The loss function that we will define,

is it a classification problem or a regression problem?

[NOISE] Regression problem because

the Q-score doesn't have to be a probability between zero and one.

It's just a score that you want to give.

And that should match the long-term discounted reward.

In fact, the loss function we can use is the L2 loss function,

y minus the Q-score squared.

So, let's say we do it for the Q going to the right.

The question is, what is y?

What is the target for this Q?

And remember what I copied on the top of the slide is the Bellman equation.

We know that the optimal Q should follow this equation. We know it.

The problem is that this equation depends on its own Q.

You know like, you have Q on both sides of the equation.

It means if you set the label to be r plus gamma times the max of Q stars,

then when you will back propagate,

you will also have a derivative here.

Let me - let me go into the details.

Let's define a target value.

Let's assume that going, uh,

left is better than going right at this point in time.

So, we initialize the network randomly.

We forward propagate state two in the network,

and the Q-score for left is more than the Q-score for right.

So, the action we will take at this point is going left.

Let's define our target y as the immediate reward you get when you go left,

plus gamma times the maximum of all the Q-values you can get from the next state.

So, let me spend a little more time on that because it's a little complicated.

I'm in s. I move to s next

using a move to the left.

I get immediate reward r,

and I also get a new state, s prime or s next.

I can forward propagate this state in

the network and understand what is the maximum I can get from this state.

Take the maximum value and plug it in here.

So, this is hopefully what the optimal Q should follow.

It's a proxy to a good label.

It means we know that the Bellman equation tells us the best Q satisfies this equation.

But in fact, this equation is not true yet, because in the true equation we have Q star here,

not Q; Q star, which is the optimal Q.

What we hope is that if we use this proxy as our label,

and we learn the difference between where we are now in this proxy,

we can then update the proxy,

get closer to the optimality.

Train again, update the proxy,

get closer to optimality,

train again, and so on.

Our only hope is that these will converge.

So, does it make sense how this is different from deep learning?

The labels are moving.

They're not static labels.

We define a label to be a best guess of what would be the best Q-function we have.

Then we compute the loss of where the Q-function is right now compared to this.

We back propagate so that the Q-function gets closer to our best guess.

Then, now that we have a better Q-function,

we can have a better guess.

So, we make a better guess,

and we fix this guess.

And now, we compute the difference between this Q-function that we have and our best guess.

We back propagate up.

We get to our best guess.

We can update our best guess again.

And we hope that doing that iteratively will end with the convergence

and a Q-function that will be very close to satisfying the Bellman equation,

the optimal Bellman equation.

Does it make sense? This is the most complicated part of Q-learning. Yeah?

So, are you saying we generate [inaudible] of the Q-function?

We generate the output of the network,

we get the Q function,

we compare it to our current best guess

of the best Q-function.

What is the best Q function?

The one that satisfies the Bellman equation.

But we're never actually given the Bellman equation.

We don't but we - we guess it based on the Q we have.

Okay.

So basically, when you have Q you can compute

this Bellman equation and it will give you some values.

These values are probably closer to where you want to get

than where you are now.

Where you are now is further from this optimality,

and you want to reduce this gap;

to close the gap,

you back propagate. Yes?

Is there a possibility that this will diverge?

So, the question is,

is there a possibility for this to diverge?

So, this is a broader discussion that would take a full lecture to prove.

So, I put a paper here from

Francisco Melo which proves the convergence of this algorithm. So,

it converges, and in fact,

it converges because we're using a lot of tips and tricks that we will see later.

But if you want to see the math behind it, and it's a,

it's a full lecture of proof,

I invite you to look at this simple proof of convergence.

Okay. Okay. So, this is the case where our left score is higher than the right score,

and we have two terms in our target:

the immediate reward for taking action left, and also the discounted

maximum future reward when you are in the next state, S next.

Okay. The tricky part is, let's say

we compute that. We can do it; we have everything,

we have everything to compute our target.

We have R, which is defined by the,

by the human at the beginning,

and we can also get this number because we know that if we

take action left we then get S next,

and we forward propagate S next in the network.

We take the maximum output, and that's this term.

So, we have everything in this equation.

The problem now is if I plug this and

my Q score in my loss function and I ask you to back propagate.

Back propagation is what W equals W

minus alpha times the derivative of the loss function with respect to W,

the parameters of the network.

Which term will have a non-zero derivative?

Obviously, the second term, Q of S taking the action go left, will have

a non-zero derivative because it depends on the parameters of the network W,

but Y will also have a non-zero derivative.

Because you have Q here.

So, how do you handle that?

You actually get a feedback loop in

this back propagation that makes the network unstable.

What we do is that we consider this fixed;

we will consider this Q fixed.

The Q that is in our target is going to be fixed for many iterations,

let's say 100,000 or a million iterations, until we get close to it

and our gradient is small; then we will update it and fix it again.

So, we actually have two networks in parallel,

one that is fixed and one that is not fixed.

Okay. And the second case is similar.

If the Q score to go to the right was more than the Q score to go to the left,

we would define our target as the immediate reward of going to

the right plus gamma times the maximum Q score we can get

if we're in the next state and take the best action.

Does this makes sense? It's the most complicated part of Q-learning.

This is the hard part to understand.

So, immediate reward to go to

the right and discounted maximum future reward when you're in state S next, going to the right.

So, this whole term is fixed for backprop.

So, no derivative. If we do that then there is no problem,

Y is just a number.

We come back to our original supervised learning setting.

Y is a number and we compute the loss and we back propagate, no difference.

Okay. So, compute dL - dL over dW and update W using stochastic gradient descent methods.

RMS prop Adam whatever you guys want.
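To make that concrete, here is a minimal sketch of the update just described, written in PyTorch as one possible framework (the lecture does not prescribe one). q_net is the network being trained, target_net is the separate fixed copy mentioned above, and all names are my own.

import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, s, a, r, s_next, gamma=0.9):
    # y = r + gamma * max_a' Q_target(s', a'); the target network is kept out of
    # the gradient, so y is treated as a plain number during backprop.
    with torch.no_grad():
        y = r + gamma * target_net(s_next).max(dim=1).values
    # Q(s, a) for the action that was actually taken.
    q = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    loss = F.mse_loss(q, y)        # the L2 loss between the Q-score and the target
    optimizer.zero_grad()
    loss.backward()                # dL/dW flows only through q_net
    optimizer.step()               # e.g. RMSprop or Adam
    return loss.item()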

So, let's go over this, this full DQN,

deep Q-network implementation, and this slide is

a pseudocode to help you understand how this entire algorithm works.

We will actually plug in many methods in this, in this pseudocode.

So, please focus right now,

and - and if you understand this,

you understand the entire rest of the lecture.

We initialize our Q-network parameters just as we initialize a network in deep learning,

we loop over episodes.

So, let's define an episode to be one game: going

from the start to the end, to a terminal state, is one episode.

We can also sometimes define an episode to be

longer, like in Breakout, which is the game with the paddle,

where usually it's 20 points.

The first player to get 20 points finishes the game.

So, an episode would be 20 points.

Once your loop over episodes starts, start from an initial state S. In our case,

there is only one initial state, which is state two. Then loop over time steps.

Forward propagate the state S, state two, in the Q-network,

execute the action A which has the maximum Q-score,

observe an immediate reward R and the next state S prime.

Compute the target Y, and to compute Y we know that

we need to take S prime and forward propagate it in the network again.

And then, compute the loss function,

and update the parameters with gradient descent.

Does this loop make sense?

It's very close to what we do in general.

The only difference would be this part

like we compute target Y using a double forward propagation.

So, with forward propagation,

we forward propagate two times in each loop.

Do you guys have any questions on,

on this pseudocode?

Okay. So, we will now see a concrete application of a Deep Q-Network.

So, this was the theoretical part.

Now, we are going to the practical part which is going to be more fun.

So, let's look at this game, it's called Breakout.

The goal when you play Breakout is to destroy all the bricks

without having the ball pass the line on the bottom.

So, we have a paddle and our decisions can be idle, stay,

stay where you are, move the paddle to the right or move the paddle to the left.

Right? And this demo,

and you have the credits on the bottom of the slide, uh,

shows that after training on Breakout using Q-learning, they

get a super intelligent agent which figures

out a trick to finish the game very quickly.

professional players know this trick, but, uh,

in Breakout you can actually try to dig a tunnel to get on the other side of the bricks,

and then, you will destroy all the bricks super

quickly from top to bottom instead of bottom-up.

What's super interesting is that the network figured out

this on its own without human supervision,

and this is the kind of thing we want.

Remember, if we were to use the Go board as input and the professional player's move as output,

we would not figure out that type of stuff most of the time.

So, my question is,

what's the input of the Q-network in this setting?

Our goal is to destroy all the bricks. So, play Breakout.

What should be the input?

[NOISE]

Try something.

[inaudible] position of bricks.

Position of the paddle,

position of the bricks.

What else? Ball position.

Okay. Yeah I agree. So, this is what we will call a feature representation.

It means when you're in an environment you can extract some features, right?

And these are examples of features.

Giving the position of the ball is one feature,

giving the position of the bricks,

another feature, giving the position of the paddle, another feature.

Which are good features for this game,

but if you want to get the entire information you'd better do [NOISE] something else.

Yeah.

The pixels? You don't want any human supervision.

You don't wanna put features you - you just.

Okay. Take the pixels,

take the game, you can control the paddle, take the pixel.

Oh, yeah. This is a good input to the Q-network.

What's the output? I said it earlier.

Probably the output of the network will be 3

Q-values representing the actions going left,

going right, and staying idle in a specific state,

and that state is the input of the network.

So, given a pixel image we want to predict Q scores for the three possible actions.

Now, what's the issue with that?

Do you think that would work or not?

Can someone think of something going wrong here?

Looking at the inputs.

[NOISE]

Okay. I'm gonna help you. If I give-yeah, you wanna try?

[inaudible].

Oh yeah, good point. Based on this image,

you cannot know if the ball is going up or down.

Actually, it's super hard because the

action you take highly depends on whether the ball is going up or down, right?

And even if the ball is going down,

you don't even know in which direction it's going down.

So, there is a problem here definitely.

There is not enough information to make a decision on the action to take.

And if it's hard for us,

it's gonna be hard for the network.

So, what's a hack to prevent that?

It's to take successive frames.

So, instead of one frame,

we can take four successive frames.

And here, the same setting as we had before but we see that the ball is going up.

We see which direction is going up,

and we know what action we should take because we know the slope of the ball and also,

uh, also if it's going up or down.

That make sense? Okay. So, this is called the preprocessing.

Given a state, compute a function Phi of S that gives

you the history of this state, which is the sequence of the four last frames.

What other preprocessing can we do?

And this is something I want you to be quick.

Like, we we learned it together in deep learning, input preprocessing.

Remember that second lecture where the question was what resolution should we use?

Remember, you have a cat recognition,

what resolution would you wanna use? Here same thing.

If we can reduce the size of the inputs, let's do it.

If we don't need all that information, let's do it.

For example, do you think the colors are important?

Very minor. I don't think they're important.

So, maybe we can gray scale everything.

That converts three channels into one channel,

which is amazing in terms of computation.

What else? I think we can crop a lot of these.

Like maybe there's a line here we don't need to make any decision.

We don't need the scores maybe.

So actually, there's some games where the score is important for decision-making.

The example is football like, or soccer.

Uh, when you're winning 1-0

and you're playing against a strong team, you'd better

get back and defend to keep this 1-0.

So, the score is actually important in the decision-making process.

And in fact, uh, there are famous coaches in

football who have this technique called park the bus,

where you just put your whole team in front of the goal once you have scored a goal.

So, this is an example. So, here there is no park the bus.

But, uh, we can definitely get rid of the score,

which removes some pixels and reduces the number of computations,

and we can reduce to grayscale.

One important thing to be careful about when you reduce to grayscale is

that grayscale is a dimensionality reduction technique.

It means you lose information.

But you know, if you have three channels and you reduce everything to one channel,

sometimes you will have different colored pixels which end up with

the same grayscale value, depending on the grayscale conversion that you use.

And it's been seen that you lose some information sometimes.

So, let's say the ball and some bricks have the same grayscale value,

then you would not differentiate them.

Or let's say the paddle and the background have the same grayscale value,

then you would not differentiate them.

So, you have to be careful of that type of stuff.

And there's other methods that do grayscale in other ways like luminance.
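A minimal preprocessing sketch along those lines: convert to grayscale, crop away the score area, downsample, and stack the last four frames. The crop bounds and the 84 by 84 size are my own assumptions, not values given in the lecture, and OpenCV is just one possible choice of library.

import numpy as np
import cv2    # OpenCV, used here only for color conversion and resizing

def preprocess_frame(frame_rgb):
    gray = cv2.cvtColor(frame_rgb, cv2.COLOR_RGB2GRAY)    # three channels down to one
    cropped = gray[34:194, :]                             # drop the score area (assumed bounds)
    return cv2.resize(cropped, (84, 84)) / 255.0          # smaller image, values in [0, 1]

def phi(last_four_frames):
    # Stack the four most recent preprocessed frames so the network can infer
    # the ball's direction and speed, as discussed above.
    return np.stack([preprocess_frame(f) for f in last_four_frames], axis=0)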

So, we have our Phi of S which is this,

which is this uh,

input to the Q network,

and the deep Q network architecture is going to be

a convolutional neural network because we're working with images.

So, we forward propagate that,

this is the architecture from Mnih, Kavukcuoglu,

Silver et al. from DeepMind.

CONV ReLU, CONV ReLU, CONV ReLU,

two fully connected layers and you get your Q-scores.
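A sketch of a Q-network in that spirit, in PyTorch: three conv plus ReLU blocks, then two fully connected layers ending in one Q-score per action. The layer sizes are my own assumptions in the style of the DeepMind architecture, not values read off the slide.

import torch.nn as nn

def make_q_network(n_actions, in_channels=4):              # four stacked frames as input
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
        nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(64 * 7 * 7, 512), nn.ReLU(),             # 7x7 assumes 84x84 inputs
        nn.Linear(512, n_actions),                         # one Q-score per action
    )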

And we get back to our training loop.

So, what do we need to change in our training loop here?

Is we said that one frame is not enough.

So, we preprocess all the frames.

So, the initial state is converted to Phi of

s. The forward propagation state is Phi of s and so on.

So, everywhere we had s or s prime,

we convert to Phi of s or Phi of s prime which gives us the history.

Now, there are a lot more techniques that we can plug in here,

and we will see three more.

One is keeping track of the terminal state.

In this loop, we should keep track of

the terminal state because we said if we reach a terminal state,

we want to end the loop, break the loop.

Now, the reason is because of the y target.

So basically, we have to

create a Boolean to detect the terminal state before looping through the time steps.

And inside the loop,

we wanna check if the new s prime we are going to is a terminal state.

If it's a terminal state,

then I can stop this loop and go back, play another episode.

So, play another, start at another starting state, and continue my game.

Now, this y target that we compute is different if we are in a terminal state or not.

Because if we're in a terminal state,

there is no reason to have a discounted long-term reward.

There's nothing behind that terminal state.

So, if we're in a terminal state, we just set it to the immediate reward and we break.

If we're not in a terminal state,

then we would add this discounted future reward.
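As a small sketch of that check (my own names; done is a flag saying whether S next is terminal):

import torch

def compute_target(target_net, r, phi_s_next, done, gamma=0.9):
    if done:
        # Terminal state: there is nothing behind it, so the target is just r.
        return torch.tensor(float(r))
    # Otherwise: immediate reward plus the discounted maximum future reward.
    with torch.no_grad():
        return r + gamma * target_net(phi_s_next).max(dim=1).values.squeeze(0)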

Any questions on that?

Yep, another issue that we're seeing here, and which makes, uh,

this reinforcement learning setting super different from the classic

supervised learning setting, is that we only train on what we explore.

It means I'm starting in a state s. I compute,

I forward propagate this Phi of s in my network.

I get my vector of Q-values.

I select the best Q-value, the largest.

I get a new state because I can move now from state s to s prime.

So, I have a transition: from s, take action A,

get s prime; or from Phi of s, take action A, get Phi of s prime.

Now, this is, is what I will use to train my network.

I can forward propagate Phi of s prime again in the network,

and get my y target.

Compare my y to my Q and then backpropagate.

The issue is I may never explore this state transition again.

Maybe I will never get there anymore.

It's super different from what we do in supervised learning where you have a dataset,

and your dataset can be used many times.

With batch gradient descent or with any gradient descent algorithm.

One epoch, you see all the data points.

So, if you do two epochs, you see every data point two times.

If you do 10 epochs, you see every data point 10 times.

So, it means that every data point can be used several times to

train your algorithm in the classic deep learning setting that we've seen together.

In this case, it doesn't seem possible because we only train when we explore,

and we might never get back there.

Especially because the training will be influenced by where we go.

So, maybe there are some places where we will never

go because while we train and while we learn,

it will, it will kind of direct our

decision process and we will never train on some parts of the game.

So, this is why we have other techniques to keep this training stable.

One is called experience replay.

So, as I said, here's what we're currently doing.

We have Phi of s, forward propagate, get a.

From taking action a,

we observe an immediate reward r,

and a new state Phi of s prime.

Then from Phi of s prime,

we can take a new action a prime,

observe a new reward r prime,

and the new state Phi of s prime prime, and so on.

And each of these is called a state transition,

and can be used to train.

So, one experience leads to one iteration of gradient descent.

E1, E2, E3,

Experience 1, Experience 2, Experience 3.

And the network will be trained on Experience 1,

then trained on Experience 2, then trained on Experience 3.

What we're doing with experience replay is the following.

We will observe experience 1 because we start in a state,

we take an action.

We see another state and a reward and this is called experience 1.

We will create a replay memory.

You can think of it as a data structure in computer science

and you will place this Experience 1 tuple in this replay memory.

Then from there, we will experience Experience 2.

We'll put Experience 2 in the replay memory.

Same with Experience 3, put it in the replay memory and so on.

Now, during training, what we will do is we will first train on Experience

1 because it's the only experience we have so far.

Next step, instead of training on E 2,

we will train on a sample from E 1 and E 2.

It means we will take one out of the replay memory and use this one for training.

But we will still continue to experiment something else and we will sample from there.

And at every step,

the replay memory will become bigger and bigger and while we train,

we will not necessarily train on the step we explore,

we will train on a sample which is the replay memory plus the new state we explored.

Why is it good? Because E 1, as you see, can be useful

many times in the training, and maybe E 1 was a critical experience,

like a very important data point for learning our Q-function, and so on.

Does the replay memory make sense?

So, several advantages.

One is data efficiency.

We can use data many times.

Don't have to use one data point only one time.

Another very important advantage of

experience replay is that if you don't use experience replay,

you have a lot of correlation between the successive data points.

So, let's say the ball is on the bottom right here,

and the ball is going to the top left.

For the next 10 data points,

the ball is always going to go to the top left.

And it means the action you can take,

is always the same.

It actually doesn't matter a lot because the ball is going up.

But most likely you wanna follow where the ball is going.

So, the action will be to go towards the ball for 10 actions in a row.

And then the ball would bounce on the wall and on the top and go back down here,

down to the bottom left-to the bottom right.

What will happen if your paddle is here is that,

for 10 steps in a row you will send your paddle on the right.

Remember what we said when we

asked the question: if you have to train a cat

versus dog classifier with batches of images of cats and

batches of images of dogs, and you train first on the cats, then on the dogs,

then on the cats, then on the dogs,

it will not converge, because your network will be super

biased towards predicting cat after seeing 10 images of cats,

and super biased towards predicting dog when it sees 10 images of dogs.

That's what's happening here.

So, you wanna de-correlate all these experiences.

You want to be able to take one experience,

take another one that has nothing to do with it and so on.

This is what experience replay does.

And the third one is that you're

basically trading computation and memory against exploration.

Exploration is super costly.

The state space might be super big,

but you probably have enough computation,

you can have a lot of computation and you have memory space,

so let's use an experience replay.

Okay. So, let's add experience replay to our code here.

The transition resulting from this part

is added

to the replay memory D, and will not necessarily be used in this iteration.

So, what's happening is: we forward propagate phi of S,

we get an action, and we observe a reward.

And this action leads to a state phi of S prime.

This is an experience.

Instead of training on this experience,

I'm just going to take it, put it in the replay memory.

Add experience to replay memory.

And what I will train on is not this experience;

it's a randomly sampled mini-batch of transitions from the replay memory.

So, you see, we're exploring but we're not training on what we explore.

We're training on the replay memory,

but the replay memory is dynamic.

It changes. And update using the sample transitions.

So, the sample transition from the replay memory will

be used to do the update. That's the hack.
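A minimal sketch of such a replay memory (my own names): transitions are stored as tuples, and training draws a random mini-batch, which both reuses data and, as discussed above, breaks the correlation between successive experiences.

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)    # oldest experiences fall off the end

    def add(self, phi_s, a, r, phi_s_next, done):
        self.buffer.append((phi_s, a, r, phi_s_next, done))

    def sample(self, batch_size=32):
        # A random mini-batch of past transitions, not just the latest one.
        return random.sample(self.buffer, batch_size)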

Now, another hack we want,

the last hack we want to talk about is exploration versus exploitation.

So, as a human, and let's say you're commuting to Stanford

every day and you know the road you're commuting at. You know it.

You always take the same road and you're biased towards taking this road.

Why? Because the first time you took it it went well.

And the more you take it,

the more you learn about it; not that it's good to know

the tricks of how to drive fast, but you know the tricks,

you know that this light

is going to be green at that moment, and so on.

So, you, you, you build a very good expertise in this road, super expert.

But maybe there's another road that you don't wanna try that is better.

You just don't try it because you're focused on that road.

You're doing exploitation.

You exploit what you already know.

Exploration would be - okay let's do it.

I'm gonna try another road today,

I might get late to the course but maybe I will have

a good discovery and I would like this road and I will take it later on.

There is a trade-off between these two, because the RL algorithm is going to

figure out some strategies that are super good,

and will try to do local search around these to get better and better.

But you might have another minimum that is better than this one, and you don't explore it.

Using the algorithm we currently have,

there's no trade-off between exploitation and exploration.

We're almost doing only exploitation.

So, how to incentivize this exploration. Do you guys have an idea?

So, right now, when we're in a state S,

we're forward propagating the state

in the network and we always take the action that is the best action.

So we're exploiting. We're exploiting what we already know. We take the best action.

Instead of taking this best action,

what can we do? Yeah.

Monte Carlo sampling.

Monte Carlo sampling, good point. Another one, you wanted to try something else?

Could have a parameter that's

the ratio times you take the best action versus exploring another action.

Okay. Take a hyper-parameter that tells you when you can explore,

when you can exploit.

Is that what you mean?

Yeah, that's a good point.

So, I think that that's a solution.

You can take a hyper-parameter that is a probability telling

you with this probability explore,

otherwise with one minus this probability exploit.

That's what, that's what we're going to do.

So, let's look at why doing only exploitation, without exploration, doesn't work.

We're in initial state one, S1.

And we have three options.

Either we use action A1 to go to S2 and we get a reward of zero,

or we use action A2,

get to S3 and get a reward of 1, or we use action A3 and go to S4,

and get a reward of 1,000.

So, this is obviously where we wanna go.

We wanna go to S4 because it has the maximum reward.

And we don't need to do much computation in our head.

It's simple, there is no discount, it's direct.

Just after initializing the Q-network,

you get the following Q-values.

Forward propagate S1 in the Q-network and get 0.5 for taking action 1,

0.4 for taking action 2,

0.3 for taking action 3.

So, this is obviously not good but our network,

it was randomly initialized.

What it's telling us is that 0.5 is the maximum.

So, we should take action 1. So, let's go.

Take action 1, observe S 2.

You observe a reward of 0.

Our target because it's a terminal state is only equal to the reward.

There is no additional term.

So, we want our target to match our Q. Our target is zero.

So, Q should match zero.

So, we train and we get the Q that should be zero.

Does that makes sense?

Now, we do another round of iteration.

We look: we are in S1,

we get back to the beginning of the episode, and we see that

our Q-function tells us that action two is the best,

because 0.4 is now the maximum value.

It means go to S3.

I go to S3, I observe reward of 1.

What does it mean? It's a terminal state.

So, my target is 1.

Y equals 1. I want the Q to match my Y.

So, my Q should be 1.

Now, I continue to the third step.

The Q-function says take A2.

I take A2; nothing happens.

I already matched the reward.

Fourth step, take A2 again. You see what happens?

We will never go there.

We'll never get there because we're not exploring.

So, instead of doing that, what we are saying is that five percent of the time,

take a random action to explore, and 95 percent of the time follow your exploitation.

Okay. So, that's where we add it.

With probability epsilon, the hyper-parameter take random action A,

otherwise do what we were doing before,

exploit. Does that make sense?
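A minimal sketch of that epsilon-greedy rule (my own names; q_net returns one Q-score per action for a batched input):

import random
import torch

def select_action(q_net, phi_s, n_actions, epsilon=0.05):
    if random.random() < epsilon:
        return random.randrange(n_actions)               # explore: random action
    with torch.no_grad():
        return int(q_net(phi_s).argmax(dim=1).item())    # exploit: best action under current Q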

Okay, cool.

So, now we plugged in

all these tricks in our pseudo code and this is our new pseudo code.

So, we have to initialize a replay memory which we did not have to do earlier.

In blue, you can find the replay memory added lines.

In orange, you can find the added lines for checking the terminal state and in purple,

you can find the added lines related to epsilon-greedy,

exploration versus exploitation.

And finally in bold, the pre-processing.
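Putting the pieces together, here is a compact sketch of that full loop, reusing the helpers sketched earlier (make_q_network, ReplayMemory, phi, select_action). The environment helpers reset_frames and step_frames, and all sizes and hyperparameter values, are my own assumptions rather than values from the slide.

import copy
import torch
import torch.nn.functional as F

def train_dqn(env, n_actions, num_episodes=1000, batch_size=32,
              gamma=0.9, epsilon=0.05, target_update_every=10_000):
    q_net = make_q_network(n_actions)                  # online network
    target_net = copy.deepcopy(q_net)                  # fixed copy used for the targets
    optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-4)
    memory = ReplayMemory()                            # replay memory
    step = 0
    for _ in range(num_episodes):
        frames = env.reset_frames()                    # assumed helper: first four frames
        phi_s = torch.from_numpy(phi(frames)).float().unsqueeze(0)
        done = False
        while not done:
            a = select_action(q_net, phi_s, n_actions, epsilon)   # epsilon-greedy
            frames, r, done = env.step_frames(a)       # assumed helper: act and observe
            phi_s_next = torch.from_numpy(phi(frames)).float().unsqueeze(0)
            memory.add(phi_s, a, r, phi_s_next, done)  # store, don't train on it directly
            if len(memory.buffer) >= batch_size:
                s_b, a_b, r_b, s2_b, d_b = zip(*memory.sample(batch_size))
                s_b, s2_b = torch.cat(s_b), torch.cat(s2_b)
                a_b = torch.tensor(a_b)
                r_b = torch.tensor(r_b, dtype=torch.float32)
                d_b = torch.tensor(d_b, dtype=torch.float32)
                with torch.no_grad():                  # fixed target, terminal check included
                    y = r_b + gamma * (1 - d_b) * target_net(s2_b).max(dim=1).values
                q = q_net(s_b).gather(1, a_b.unsqueeze(1)).squeeze(1)
                loss = F.mse_loss(q, y)
                optimizer.zero_grad(); loss.backward(); optimizer.step()
            phi_s = phi_s_next
            step += 1
            if step % target_update_every == 0:        # refresh the fixed copy
                target_net.load_state_dict(q_net.state_dict())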

Any questions on that?

So, that's it; we wanted to see a variant

of how deep learning can be used in a setting

that is not necessarily the classic supervised learning setting.

Can you see that the main advantage of deep learning in

this case is it's a good function approximator?

Your convolutional neural network can extract a lot of information from

the pixels that we're not able to get with other networks.

Okay. So, let's see what we have here.

We have our super Atari bot that's gonna dig a tunnel,

and it's going to destroy all the bricks super quickly.

It's satisfying to see that after building it.

So, this is work from DeepMind's team,

and you can find this video on YouTube.

Okay, another thing I wanted to say quickly is

what's the difference between with and without human knowledge?

You will see a lot of people - a lot of papers - mentioning

that this algorithm was trained with human knowledge,

or this algorithm was trained without any human in the loop.

Why is human knowledge very important?

Like, think about it.

Just playing one game as

a human and teaching that to the algorithm will help the algorithm a lot.

When the algorithm sees this game,

what it sees is pixels.

What do we see when we see that game?

We see that there is a key here.

We know the key is usually a good thing.

So, we have a lot of context, right?

As a human, we know: I'm probably gonna go for the key.

I'm not gonna go for this thing, no.

Uh, same with the ladder. What is the ladder?

We directly identify that the ladder is something we can go up and down.

We identify that this rope is probably

something I can use to jump from one side to the other.

So as a human, there is a lot more background information that

we have, even without realizing it.

So, there's a huge difference between

algorithms trained with human in the loop and without human in the loop.

This game is actually Montezuma's Revenge.

When the DQN paper came out

in Nature - the second version of the paper -

they showed that they beat human performance on

49 games that are the same type of games as Breakout.

But this one was the hardest one.

So, they couldn't beat humans on this one. And the reason was that

there's a lot of information and also the game is very long.

I think Ramtin Keramati is going to talk about it a little later.

But in order to win this game,

you have to go through a lot of different stages,

and it's super long.

So, it's super hard for the algorithm to explore the whole state space.

Okay. So, on this slide I will show you

a few more games that the DeepMind team has solved. Pong is one.

Seaquest is another one,

and Space Invaders, which you might know - it's probably

the most famous of the three. I hope you know it.

Okay. So, that said,

I'm gonna hand over the microphone - we're lucky to have an RL expert.

So, Ramtin Keramati is a fourth-year PhD student,

uh, in RL, working with Professor Brunskill at Stanford.

And he will tell us a little bit about his experience and he will show

us some advanced applications of deep learning in RL,

and how these plug in together.

Thank you. Thanks Kian for that introduction.

Okay. Can everyone hear me now?

Right, good. Cool. Okay first,

I have like, 8-9 minutes.

You have more.

I have more?

Yes.

Okay. Great, okay first question.

After seeing the lecture so far,

how many of you are thinking that RL is actually cool? Like, honestly.

That's like, oh that's a lot.

Okay. [LAUGHTER] That's a lot.

Okay. My hope is that after showing you some more advanced topics here,

the percentage is gonna increase even more.

So, let's [LAUGHTER] let's see.

Uh, it's almost impossible to talk about recent

advancements in RL without mentioning AlphaGo.

I think I wrote it down in a table: Go has almost 10 to the, uh,

power of 170 different configurations of the board.

And that's roughly more

than - I mean, that's more than the estimated number of atoms in the universe.

So, one traditional algorithm, from before deep learning and so on,

was tree search in RL,

which is basically: go

exhaustively search all the possible actions that you can take,

and then take the best one.

For a game like Go, that's almost impossible.

So, what they do - that's also a paper from DeepMind - is they

train a neural network for that.

They kind of merge the tree search with deep learning, with a neural network.

They have two kinds of networks.

One is called the value network.

The value network basically consumes this image of

the board and tells you the probability that you win from this position.

So, if the value is higher,

then the probability of winning is higher.

How does it help you? When you wanna search for the action,

you don't have to roll out until the end of the game,

because the end of the game is a lot of steps away,

and it's almost impossible to go to the end of the game in all these simulations.

So, it helps you estimate the value of each position beforehand:

after 48 steps or 58 steps,

whether you're gonna win that game or lose that game.

There's another network, the policy network, which helps us choose the action.
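
To make the two networks concrete, here is a minimal PyTorch sketch of my own of a value head and a policy head sharing a board encoder; this is a toy illustration, not AlphaGo's actual architecture (which is much deeper and is combined with Monte Carlo tree search).

import torch
import torch.nn as nn

class BoardNet(nn.Module):
    # Shared encoder for a 19x19 board with a few input feature planes.
    def __init__(self, planes=3, n_moves=19 * 19 + 1):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(planes, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Flatten(),
        )
        self.policy_head = nn.Linear(64 * 19 * 19, n_moves)                    # distribution over moves
        self.value_head = nn.Sequential(nn.Linear(64 * 19 * 19, 1), nn.Tanh()) # estimated outcome in [-1, 1]

    def forward(self, board):
        h = self.encoder(board)
        return self.policy_head(h), self.value_head(h)

net = BoardNet()
logits, value = net(torch.zeros(1, 3, 19, 19))
print(logits.shape, value.shape)  # torch.Size([1, 362]) torch.Size([1, 1])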

But I think the most interesting thing about AlphaGo is that it's trained from scratch.

So, it trains from nothing,

and they have a trick called self-play: there are two AIs playing against each other.

You keep the best one fixed,

and you have another one that is trying to beat that previous version of itself.

And after it can beat the previous version of itself

reliably, many times, then you replace

the fixed one with this new one, and you repeat.
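
A rough sketch of that self-play loop, under assumptions of my own: play_match(a, b) plays one game and reports whether agent a won, and train_step improves the challenger. Neither function is code from the lecture or from DeepMind.

import copy

def self_play(challenger, play_match, train_step, n_rounds=1000,
              eval_games=100, win_threshold=0.55):
    best = copy.deepcopy(challenger)                       # the frozen current champion
    for _ in range(n_rounds):
        train_step(challenger)                             # the challenger keeps learning
        wins = sum(play_match(challenger, best) for _ in range(eval_games))
        if wins / eval_games >= win_threshold:             # beats the champion reliably
            best = copy.deepcopy(challenger)               # promote it and keep going
    return best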

So, this is a training curve of the self-play of AlphaGo, as you see.

And it takes a lot of compute.

So, that's kind of crazy.

But finally they beat the human.

Okay. Another type of algorithm, and this is like,

a whole different class of algorithms, called policy gradients. Uh -

We have developed an algorithm called trust region policy optimization.

Can I stop that? [LAUGHTER]

This method was able to learn locomotion controllers for - [OVERLAPPING]

Can you mute the sound please?

Okay, great. So, policy gradient algorithm.

[LAUGHTER].

Well, what I can do is restart this from here. Uh -

No. That is not. Doesn't work.

Okay.

Okay. So, here, like in the DQN that you've seen,

uh, you come and, like,

compute the Q-value of each state.

And then what you do is take the argmax of this with

respect to the action, and then you choose the action that you want to take, right?

But what you care about at the end of the day is the action -

the mapping from a state to an action,

which we call a policy, right?

So, what you care about at the end of the day is actually the policy.

Like, what action should I take?

It's not really the Q-value itself, right?

So, this class of methods, called policy gradients,

tries to directly optimize for the policy.

So, rather than updating the Q function,

I compute the gradient of my policy.

I update my policy network again, and again, and again.
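
As a minimal sketch of a policy gradient update - this is vanilla REINFORCE in PyTorch, my own illustration, and deliberately simpler than TRPO or PPO, which the videos are about:

import torch
import torch.nn as nn
from torch.distributions import Categorical

policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))  # state -> action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    # Increase the log-probability of actions that led to high return.
    log_probs = Categorical(logits=policy(states)).log_prob(actions)
    loss = -(log_probs * returns).mean()        # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Dummy batch just to show the shapes.
reinforce_update(torch.randn(8, 4), torch.randint(0, 2, (8,)), torch.randn(8))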

So, let's see these videos.

So, this is like this guy

that is trying to reach the pink, uh,

ball over there, and sometimes gets hit by some external forces.

And this is, uh, a recent algorithm, called PPO.

This is a policy gradient method.

It tries to reach for that ball.

So, I think that you've heard of, ah, OpenAI

Five, like the bot that is playing Dota.

So, this is, like,

completely a PPO algorithm.

And they have, like,

a lot of compute behind that,

and I guess I have the numbers here.

There are 180 years of play in one day.

This is how much compute they have. Uh, so that's fine.

There's another, even funnier video.

It's called Competitive Self-Play.

Again, the same idea with policy gradients.

You put two agents in front of each other,

and they just try to beat each other.

And if one beats the other, it gets a reward.

The most interesting part is that - for example, in that game,

the purpose is just to pull the other one out, right?

But they discover some emergent behavior which -

for us humans it makes sense, but for them to learn it out of nothing is kind of cool.

[NOISE]

So there's one risk here when they're playing -

oh, this, uh, this guy's trying to kick the ball in.

One risk here is to overfit.

[LAUGHTER] That's also cool.

[LAUGHTER] Oh, yeah.

One technical point before we move on,

one technical point here is that -

wait, where is it, no, the next one.

Okay. Here,

our two agents are playing with each other,

and we are just updating against the best other agent, like previously,

when we were doing self-play.

The risk is that you overfit to the actual agent that you have in front of you.

So, uh, the agent in front of you is powerful,

but you might overfit to it,

and if I, uh,

put in an agent that is not that powerful but is

using some simple trick that the powerful agent,

like, never uses, then you might just lose the game, right?

So, one trick here to make it more stable is that,

rather than playing against only one agent,

you alternate between different versions of the agent itself,

so it, like, learns all these skills together.

It doesn't overfit to one particular opponent.
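
A tiny sketch of that trick, as my own illustration: keep a pool of past snapshots of the agent and sample the opponent from the pool, instead of always playing against the single latest version.

import copy
import random

class OpponentPool:
    # Keeps snapshots of past versions of the agent and samples an opponent from
    # the whole pool, so the learner does not overfit to the single latest agent.
    def __init__(self):
        self.snapshots = []

    def add(self, agent):
        self.snapshots.append(copy.deepcopy(agent))

    def sample(self):
        return random.choice(self.snapshots)

# Usage sketch: call pool.add(current_agent) every so often, then train against pool.sample().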

So, there's another, uh,

thing called meta learning.

Meta learning is a whole different type of algorithm again,

[NOISE] and the purpose is that

a lot of tasks are similar to each other, right?

For example, walking to the left and

walking to the right, or walking forward -

they're essentially the same task.

[NOISE] So, the point is,

rather than training on a single task, like go left or go right,

you train on a distribution of tasks that are similar to each other.

[NOISE] And then the idea is that,

for each specific task,

I should learn with like, uh,

very few gradient steps,

so very few updates should be enough for me.

So, if I learn -

okay, play this video. At the beginning,

this agent has been trained with meta learning before,

but it doesn't know how to move.

Just look at the number of gradient steps:

after two or three gradient steps,

it totally knows how to move.

That normally takes a lot of steps to train,

[NOISE] but here it takes so few only because of the meta learning approach that we've used.

[NOISE] Meta learning is also cool. I mean, uh,

the algorithm is from Berkeley, from Chelsea Finn,

who is now also coming to Stanford.

It's called Model-Agnostic Meta-Learning.
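
A very rough sketch of the meta-learning idea, as my own illustration: this is a Reptile-style first-order variant, simpler than the actual MAML algorithm, but it shows the structure of adapting to a sampled task with a few gradient steps and then updating the shared initialization.

import copy
import torch
import torch.nn as nn

def meta_step(model, sample_task_loss, inner_steps=3, inner_lr=0.01, meta_lr=0.1):
    # First-order meta-learning step: adapt a copy of the model to one sampled
    # task with a few gradient steps, then nudge the shared initialization
    # toward the adapted weights.
    task_loss = sample_task_loss()                     # loss function for one sampled task
    adapted = copy.deepcopy(model)
    opt = torch.optim.SGD(adapted.parameters(), lr=inner_lr)
    for _ in range(inner_steps):                       # very few updates per task
        opt.zero_grad()
        task_loss(adapted).backward()
        opt.step()
    with torch.no_grad():
        for p, q in zip(model.parameters(), adapted.parameters()):
            p += meta_lr * (q - p)                     # move the initialization toward the solution

# Toy usage: each task is a regression toward a different random target.
model = nn.Linear(1, 1)
def sample_task_loss():
    target = torch.randn(1)
    return lambda m: ((m(torch.randn(16, 1)) - target) ** 2).mean()

for _ in range(100):
    meta_step(model, sample_task_loss)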

[NOISE] So, all right.

Another point: this, uh,

very interesting game, Montezuma's Revenge,

that Kian talked about.

How much time do we have?

[inaudible].

All right. Uh, yeah. So, you've seen,

uh, the exploration-exploitation dilemma, right?

So it's bad if you don't explore;

you're gonna fail many times.

So, if you do the exploration with the scheme

that you just saw, like epsilon-greedy -

this is a map of the Montezuma game,

and you're gonna see all the different moves of the agent in that game

if you do exploration randomly.

And, uh, the game, I think, has, like,

21 or 20-something different rooms that are hard to reach.

[NOISE] So, there's this recent paper, I

think from Google Brain, from Marc Bellemare and team.

It's called Unifying Count-Based Exploration and Intrinsic Motivation.

Exploration is essentially a very hard challenge,

mostly in situations where the reward is sparse.

For example, in this game,

the first reward that you get is when you reach the key, right?

[NOISE] And from the top to here,

it's almost like 200 steps,

and getting the sequence of actions over 200 steps exactly right by,

like, random exploration is almost impossible, so you're never gonna do that.

[NOISE] What, uh, a very interesting trick here is

that you're kind of keeping counts of how many times you visited each state,

[NOISE] and then if you visit a state

that has, like, uh,

fewer counts, then you give a reward to the agent,

so we call it the intrinsic reward.

So, it kind of makes the -

Let's change your mic really quick. [NOISE]

[LAUGHTER] Right here, I keep it.

[NOISE] S - so, it changes the way

[NOISE] the agent looks for the reward; the environment [inaudible] also,

like, incentivizes it,

it has an incentive to go and search around,

because it's gotta increase the counts of the states that it has never seen before.

So, this gets the agent to actually explore and look more,

so it just [NOISE] goes down, usually to different rooms and stuff like that.
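
A minimal sketch of that count-based bonus, as my own illustration: the paper actually uses pseudo-counts from a density model over frames, while this naive version just counts a small discrete state key.

from collections import defaultdict
import math

class CountBonus:
    # Gives a larger intrinsic reward for states that have been visited less often,
    # which pushes the agent toward parts of the game it has never seen.
    def __init__(self, beta=0.1):
        self.counts = defaultdict(int)
        self.beta = beta

    def intrinsic_reward(self, state_key):
        self.counts[state_key] += 1
        return self.beta / math.sqrt(self.counts[state_key])

bonus = CountBonus()
total_reward = 0.0 + bonus.intrinsic_reward(("room_1", 3, 5))   # extrinsic + intrinsic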

[NOISE] So, that's the idea,

and this game is interesting: if you search for it,

there's a lot of people that recently are trying to solve the game,

and a lot of research includes Montezuma's as one of the games,

and it's just fun also to see the agent play.

Any question on that? [NOISE]

[inaudible]

[LAUGHTER] Any question? [NOISE] Well, I -

There is also [NOISE] another interesting [NOISE] point

that would be just fun to know about.

It's called imitation learning.

Imitation learning covers the case where,

well, I mean, for RL agents,

sometimes you don't know the reward. Like,

for example, in the Atari games,

the reward is, like, very well-defined, right?

If I get the key, I get the reward,

that's just obvious, but sometimes,

like, defining the reward is hard.

For example, when the car, like the blue one,

wants to drive on

some highway, what is the definition of the reward, right?

So, we don't have a clear definition of that.

But, on the other hand, you have a person,

like a human expert, that can drive for us

and show us, "Oh, this is the right way of driving," right?

So, in this situation,

we have something called imitation learning, where you try

to mimic the behavior of an expert.

[NOISE] Not by exactly copying it,

because if we just copy and then you show us completely different states,

then we don't know what to do;

instead, we learn from the expert's behavior.

And, for example,

there is a paper called Generative Adversarial Imitation Learning,

which was, like, from Stefano's group here at Stanford,

and that was also interesting.
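
The simplest form of imitation learning is behavioral cloning: treat the expert's state-action pairs as a supervised dataset. Here is a minimal PyTorch sketch of my own (GAIL itself goes further and learns a discriminator instead of copying directly).

import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 3))  # state -> action logits
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

def clone_step(expert_states, expert_actions):
    # Supervised learning: predict the expert's action from the state.
    loss = loss_fn(policy(expert_states), expert_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Dummy expert batch just to show the shapes.
clone_step(torch.randn(32, 8), torch.randint(0, 3, (32,)))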

[NOISE] Well, I think that's it for the advanced topics.

If you have any questions, I'm here.

Kian. [NOISE] Question?

[NOISE] No? [NOISE]

Okay. Sorry. Just, uh, for,

for, uh, next week,

so there is no assignment.

We have now finished C5, and you know about sequence models now.

Uh, we all need to take a lot of time for the project.

The project is a big part of this class, and, um,

[NOISE] there's gonna be, um,

[NOISE] project team mentorship.

And this Friday, we'll have the sections on reading research papers.

We'll go over the object detection YOLO

and YOLO v2 papers from Redmon et al.

Okay. See you guys. Thank you.


知识点

重点词汇
infinite [ˈɪnfɪnət] n. 无限;[数] 无穷大;无限的东西(如空间,时间) adj. 无限的,无穷的;无数的;极大的 {cet4 cet6 ky ielts :6045}

detection [dɪˈtekʃn] n. 侦查,探测;发觉,发现;察觉 {cet4 cet6 gre :6133}

denoted [diˈnəutid] 表示,指示(denote的过去式和过去分词) { :6148}

tricky [ˈtrɪki] adj. 狡猾的;机警的 { :6391}

randomly ['rændəmlɪ] adv. 随便地,任意地;无目的地,胡乱地;未加计划地 { :6507}

francisco [fræn'sɪskəʊ] n. 弗朗西斯科(男子名,等于Francis) { :6607}

advancements [ædˈvænsmənts] n. (级别的)晋升( advancement的名词复数 ); 前进; 进展; 促进 { :6629}

inaudible [ɪnˈɔ:dəbl] adj. 听不见的;不可闻的 { :6808}

algorithm [ˈælgərɪðəm] n. [计][数] 算法,运算法则 { :6819}

algorithms [ˈælɡəriðəmz] n. [计][数] 算法;算法式(algorithm的复数) { :6819}

mimic [ˈmɪmɪk] vt. 模仿,摹拟 n. 效颦者,模仿者;仿制品;小丑 adj. 模仿的,模拟的;假装的 {toefl ielts gre :6833}

beforehand [bɪˈfɔ:hænd] adj. 提前的;预先准备好的 adv. 事先;预先 {cet4 cet6 ky toefl ielts :6844}

commuting [kə'mju:tɪŋ] n. 乘公交车上下班;经常往来 { :6867}

chess [tʃes] n. 国际象棋,西洋棋 n. (Chess)人名;(英)切斯 {zk gk cet4 cet6 ky :6948}

unstable [ʌnˈsteɪbl] adj. 不稳定的;动荡的;易变的 {cet4 cet6 toefl :6975}

imitation [ˌɪmɪˈteɪʃn] n. 模仿,仿造;仿制品 adj. 人造的,仿制的 {cet6 ky toefl ielts gre :6994}

optimal [ˈɒptɪməl] adj. 最佳的;最理想的 {cet6 toefl :7002}

myriads ['mɪrɪədz] n. 无数,极大数量( myriad的名词复数 ) { :7106}

gradient [ˈgreɪdiənt] n. [数][物] 梯度;坡度;倾斜度 adj. 倾斜的;步行的 {cet6 toefl :7370}

gradients [ˈgreɪdi:ənts] n. 渐变,[数][物] 梯度(gradient复数形式) { :7370}

alternate [ɔ:lˈtɜ:nət] n. 替换物 adj. 交替的;轮流的 vt. 使交替;使轮流 vi. 交替;轮流 {cet6 ky toefl ielts gre :7396}

marc [mɑ:k] n. 机读目录;(水果,种子等经压榨后的)榨渣 n. (Marc)人名;(塞)马尔茨;(德、俄、法、荷、罗、瑞典、西、英)马克 { :7422}

intrinsic [ɪnˈtrɪnsɪk] adj. 本质的,固有的 {cet6 ky toefl ielts :7449}

reinforcement [ˌri:ɪnˈfɔ:smənt] n. 加固;增援;援军;加强 { :7506}

wh [ ] abbr. 瓦特小时(Watt Hours);白宫(White House);白色(white) { :7515}

idle [ˈaɪdl] adj. 闲置的;懒惰的;停顿的 vt. 虚度;使空转 vi. 无所事事;虚度;空转 {cet4 cet6 ky ielts gre :7526}

phd [ ] abbr. 博士学位;哲学博士学位(Doctor of Philosophy) {ielts :7607}

expectancy [ɪkˈspektənsi] n. 期望,期待 {ielts :7655}

blog [blɒg] n. 博客;部落格;网络日志 { :7748}

Et ['i:ti:] conj. (拉丁语)和(等于and) { :7820}

compute [kəmˈpju:t] n. 计算;估计;推断 vt. 计算;估算;用计算机计算 vi. 计算;估算;推断 {cet4 cet6 ky toefl ielts :7824}

dropout [ˈdrɒpaʊt] n. 中途退学;辍学学生 {ielts :7969}

unifying [ˈju:nifaiŋ] 使统一;(unify的ing形式)使成一体 { :8008}

Finn [fin] n. 芬兰人 爱尔兰巨人 { :8047}

converged [kən'vɜ:dʒd] v. 聚集,使会聚(converge的过去式) adj. 收敛的;聚合的 { :8179}

converge [kənˈvɜ:dʒ] vt. 使汇聚 vi. 聚集;靠拢;收敛 {cet6 toefl ielts gre :8179}

converges [kənˈvə:dʒz] v. (线条、运动的物体等)会于一点( converge的第三人称单数 ); (趋于)相似或相同; 人或车辆汇集; 聚集 { :8179}

berkeley ['bɑ:kli, 'bә:kli] n. 贝克莱(爱尔兰主教及哲学家);伯克利(姓氏);伯克利(美国港市) { :8189}

hack [hæk] n. 砍,劈;出租马车 vt. 砍;出租 vi. 砍 n. (Hack)人名;(英、西、芬、阿拉伯、毛里求)哈克;(法)阿克 {gre :8227}

hacks [hæks] n. (Hacks)人名;(德)哈克斯 老马(hack的复数) 出租汽车 { :8227}

pi [paɪ] abbr. 产品改进(Product Improve) { :8364}

intuitive [ɪnˈtju:ɪtɪv] adj. 直觉的;凭直觉获知的 {gre :8759}

derivative [dɪˈrɪvətɪv] n. [化学] 衍生物,派生物;导数 adj. 派生的;引出的 {toefl gre :9140}

convergence [kən'vɜ:dʒəns] n. [数] 收敛;会聚,集合 n. (Convergence)人名;(法)孔韦尔让斯 { :9173}

proxy [ˈprɒksi] n. 代理人;委托书;代用品 {toefl ielts :9178}

paddle [ˈpædl] n. 划桨;明轮翼 vt. 拌;搅;用桨划 vi. 划桨;戏水;涉水 {gk ky toefl ielts :9187}

sub [sʌb] n. 潜水艇;地铁;替补队员 vi. 代替 { :9196}

replay [ˈri:pleɪ] n. 重赛;重播;重演 vt. 重放;重演;重新比赛 { :9256}

neural [ˈnjʊərəl] adj. 神经的;神经系统的;背的;神经中枢的 n. (Neural)人名;(捷)诺伊拉尔 { :9310}

fran [fræn] abbr. framed-structure analysis 框架分析; franchise 特权,公民权 { :9383}

sparse [spɑ:s] adj. 稀疏的;稀少的 {toefl ielts gre :9557}

invaders [ɪn'veɪdəz] n. 侵略者(invader的复数);侵入种 { :9804}

approximate [əˈprɒksɪmət] adj. [数] 近似的;大概的 vt. 近似;使…接近;粗略估计 vi. 接近于;近似于 {cet4 cet6 ky toefl ielts gre :9895}

robotics [rəʊˈbɒtɪks] n. 机器人学 { :10115}

carlo ['kɑrloʊ] n. 卡洛(男子名) { :10119}

propagate [ˈprɒpəgeɪt] vt. 传播;传送;繁殖;宣传 vi. 繁殖;增殖 {cet6 toefl ielts gre :10193}

propagating [ˈprɔpəɡeitɪŋ] v. 传播(propagate的ing形式);繁殖 adj. 传播的;繁殖的 { :10193}

Monte ['mɒntɪ] n. 始于西班牙的纸牌赌博游戏 n. (Monte)人名;(英)蒙特(教名Montague的昵称);(意、葡、瑞典)蒙特 { :10325}

pixels ['pɪksəl] n. [电子] 像素;像素点(pixel的复数) { :10356}

pixel [ˈpɪksl] n. (显示器或电视机图象的)像素(等于picture element) { :10356}

generalize [ˈdʒenrəlaɪz] vi. 形成概念 vt. 概括;推广;使...一般化 {cet6 ky toefl ielts gre :10707}

gamma [ˈgæmə] n. 微克;希腊语的第三个字母 n. (Gamma)人名;(法)加马;(阿拉伯)贾马 {toefl :10849}

tweaked [twiːk] 拧(tweak的过去分词) 调整(tweak的过去分词) { :10855}

tweak [twi:k] n. 扭;拧;焦急 vt. 扭;用力拉;开足马力 { :10855}

mute [mju:t] adj. 哑的;沉默的;无声的 vt. 减弱……的声音;使……柔和 n. 哑巴;弱音器;闭锁音 n. (Mute)人名;(塞)穆特 {cet4 cet6 ky gre :10937}

infinity [ɪnˈfɪnəti] n. 无穷;无限大;无限距 {cet6 gre :11224}

optimize [ˈɒptɪmaɪz] vt. 使最优化,使完善 vi. 优化;持乐观态度 {ky :11612}

reliably [rɪ'laɪəblɪ] adv. 可靠地;确实地 { :11630}

restart [ˌri:ˈstɑ:t] n. 重新开始;返聘 vt. [计] 重新启动;重新开始 vi. [计] 重新启动;重新开始 { :11902}

ex [eks] n. 前妻或前夫 prep. 不包括,除外 { :12200}

diverge [daɪˈvɜ:dʒ] vt. 使偏离;使分叉 vi. 分歧;偏离;分叉;离题 {cet6 toefl gre :12262}

Demo [ˈdeməʊ] n. 演示;样本唱片;示威;民主党员 n. (Demo)人名;(意、阿尔巴)德莫 { :12334}

propagation [ˌprɒpə'ɡeɪʃn] n. 传播;繁殖;增殖 {cet6 gre :12741}

computations [kɒmpjʊ'teɪʃnz] n. 计算,估计( computation的名词复数 ) { :12745}

computation [ˌkɒmpjuˈteɪʃn] n. 估计,计算 {toefl :12745}

epoch [ˈi:pɒk] n. [地质] 世;新纪元;新时代;时间上的一点 {cet6 ky toefl ielts gre :12794}

epochs [ ] 时代(epoch的复数形式) 时期(epoch的复数形式) { :12794}

intuitively [ɪn'tju:ɪtɪvlɪ] adv. 直观地;直觉地 { :14665}

multidisciplinary [ˌmʌltidɪsəˈplɪnəri] adj. 有关各种学问的 { :14907}

adversarial [ˌædvəˈseəriəl] adj. 对抗的;对手的,敌手的 { :15137}

breakout [ˈbreɪkaʊt] n. 爆发;突围;越狱;脱逃 { :15289}

stanford ['stænfәd] n. 斯坦福(姓氏,男子名);斯坦福大学(美国一所大学) { :15904}

mu [mju:] n. 希腊语的第12个字母;微米 n. (Mu)人名;(中)茉(广东话·威妥玛) { :16619}

meth [meθ] n. 甲安菲他明(一种兴奋剂) n. (Meth)人名;(柬)梅 { :16881}

optimization [ˌɒptɪmaɪ'zeɪʃən] n. 最佳化,最优化 {gre :16923}

iteration [ˌɪtəˈreɪʃn] n. [数] 迭代;反复;重复 { :17595}

chan [tʃæn] n. 通道(槽,沟) n. (Chan)人名;(法)尚;(缅)钱;(柬、老、泰)占 { :17670}

dataset ['deɪtəset] n. 资料组 { :18096}

BOT [bɒt] n. 马蝇幼虫,马蝇 n. (Bot)人名;(俄、荷、罗、匈)博特;(法)博 { :18864}


难点词汇
exhaustively [ɪɡ'zɔ:stɪvlɪ] adv. 耗尽一切地 { :20316}

MIC [maɪk] abbr. 军界,工业界集团(Military-Industrial Complex) n. (Mic)人名;(罗)米克 { :21352}

generative [ˈdʒenərətɪv] adj. 生殖的;生产的;有生殖力的;有生产力的 { :21588}

epsilon [ˈepsɪlɒn] n. 希腊语字母之第五字 { :22651}

locomotion [ˌləʊkəˈməʊʃn] n. 运动;移动;旅行 {toefl gre :22712}

dynamically [daɪ'næmɪklɪ] adv. 动态地;充满活力地;不断变化地 { :23174}

recap [ˈri:kæp] n. 翻新的轮胎 vt. 翻新胎面;扼要重述 { :23344}

chessboard [ˈtʃesbɔ:d] n. 棋盘 { :23620}

pong [pɒŋ] n. 乒乓球;恶臭;难闻的气味 adj. 乒乓的 vi. 发出难闻的气味 n. (Pong)人名;(东南亚国家华语)榜;(柬)邦;(泰)蓬;(中)庞(广东话·威妥玛) { :23635}

pseudo ['sju:dəʊ] n. 伪君子;假冒的人 adj. 冒充的,假的 { :24260}

phi [faɪ] n. 希腊文的第21个字母 n. (Phi)人名;(柬、老)披 { :24548}

iterative ['ɪtərətɪv] adj. [数] 迭代的;重复的,反复的 n. 反复体 { :25217}

xavier ['zʌvɪə] n. 泽维尔(男子名) { :26299}

bellman ['belmən] n. 更夫;传达员;鸣钟者 { :26872}

mentorship ['mentɔːʃɪp] n. 导师制,辅导教师;师徒制 { :27920}

Boolean [ ] adj. 布尔数学体系的 { :27921}

stochastic [stə'kæstɪk] adj. [数] 随机的;猜测的 { :28398}

dw [ ] abbr. 发展的宽度(Developed Width);蒸馏水(Distilled Water);双重墙(Double Wall);双重载(Double Weight) { :29507}

Atari [ ] n. 雅达利(美国一家电脑游戏机厂商) { :29876}

dimensionality [dɪˌmenʃə'nælɪtɪ] n. 维度;幅员;广延 { :29902}

optimality [ɒptɪ'mælɪtɪ] n. [数] 最佳性 { :30883}

tuple [tʌpl] n. 元组,数组 { :31456}

luminance [ˈlu:mɪnəns] n. [光][电子] 亮度;[光] 发光性(等于luminosity);光明 { :32601}

shortsighted ['ʃɔ:t'saɪtɪd] adj. 目光短浅的;近视的 { :32694}

raster ['ræstə] n. [电子] 光栅;试映图 { :33252}

initialized [ɪ'nɪʃlaɪzd] adj. 初始化;初始化的;起始步骤 v. 初始化(initialize的过去分词);预置 { :37736}

initializing [ ] 正在初始化 { :37736}

initialize [ɪˈnɪʃəlaɪz] vt. 初始化 { :37736}

classifier [ˈklæsɪfaɪə(r)] n. [测][遥感] 分类器; { :37807}

DL [ ] abbr. 分升(deciliter);数据传输线路(Data Link);基准面(Datam Ievel);延迟线(Delay Line) { :39786}

initialization [ɪˌnɪʃəlaɪ'zeɪʃn] n. [计] 初始化;赋初值 { :40016}

iteratively [ ] [计] 迭代的 { :48568}

Montezuma [ ] 蒙特苏马 { :49277}

melo ['meləʊ] n. <口>情节剧 { :49586}


生僻词
abled ['eibld] a. [前面往往带副词] (体格)强壮的, 身体(或体格)健全的;无残疾的

approximator [ə'prɒksɪmeɪtə] n. 接近者, 近似者

approximators [ ] [网络] 变动型模糊限制语

backpropagate [ ] [网络] 反向传播

bellemare [ ] n. (Bellemare)人名;(法)贝勒马尔

Brunskill [ ] n. 布伦斯基尔

convolutional [kɒnvə'lu:ʃənəl] adj. 卷积的;回旋的;脑回的

DeepMind [ ] n. 深刻的见解 [网络] 心灵深处;初恋汽水;深层思想

google [ ] 谷歌;谷歌搜索引擎

grayscale ['grei,skeil] 灰度;灰度图;灰度级;灰度模式

incentivize [ɪn'sentɪvaɪz] 以物质刺激鼓励

kian [ ] [网络] 奇恩;奇安;吉安

meta [ ] [计] 元的

metas ['metəz] abbr. metastasis 转移; metastasize 转移

overfit [ ] [网络] 过拟合;过度拟合;过适应

preprocess [pri:'prəʊses] vt. 预处理;预加工

preprocessing [prep'rəʊsesɪŋ] n. 预处理;预加工

pseudocode ['sju:dəʊˌkəʊd] n. 伪代码;假码;虚拟程序代码

redmon [ ] [网络] 雷德蒙市;洪爷

relu [ ] [网络] 关节轴承

seaquest [ ] [网络] 水美净;海探险号

youtube ['ju:tju:b] n. 视频网站(可以让用户免费上传、观赏、分享视频短片的热门视频共享网站)


词组
a grid [ ] [网络] 一格;栅格;网格

a hack [ ] [网络] 网络攻击

a robot [ ] [网络] 一个机器人;到机器人

a trash [ ] None

alternate between [ ] [网络] 时而…时而;在两种状态中交替变化

an algorithm [ ] [网络] 规则系统;运算程式

back prop [ ] 后撑;后支柱

back propagation [ˈbækˌprɔpəˈgeɪʃən] [网络] 反向传播;误差反向传播;反向传播算法

Bellman equation [ ] [数] 贝尔曼方程

bias towards [ ] [网络] 对……有利的偏见

class of algorithm [ ] 演算法类别

correlate with [ ] [网络] 找出一一对应的关系;与…相关;使相互关联

descent algorithm [ ] 下降算法

descent method [ ] un. 下降法 [网络] 下山法;降方法

dimensionality reduction [ ] un. 维数减缩 [网络] 降维;维归约;维度缩减

et al [ ] abbr. 以及其他人,等人

et al. [ˌet ˈæl] adv. 以及其他人;表示还有别的名字省略不提 abbr. 等等(尤置于名称后,源自拉丁文 et alii/alia) [网络] 等人;某某等人;出处

et. al [ ] adv. 以及其他人;用在一个名字后面 [网络] 等;等人;等等

feedback loop [ ] un. 反馈环路;反馈回路;回授电路 [网络] 反馈循环;回馈回路;反馈电路

forward propagation [ ] 正向传播

from scratch [frɔm skrætʃ] adj. 从零开始的;白手起家的 [网络] 从头开始;从头做起;从无到有

garbage collector [ˈɡɑ:bidʒ kəˈlektə] n. 收垃圾的 [网络] 垃圾收集器;垃圾回收器;被垃圾回收器

gradient algorithm [ ] [网络] 其中包括梯度运算;其中包括剃度运算

gradient descent [ ] n. 梯度下降法 [网络] 梯度递减;梯度下降算法;梯度递减的学习法

gradient descent algorithm [ ] [网络] 梯度下降算法;梯度陡降法;梯度衰减原理

hack to [ ] vt.劈,砍

imitation learning [ ] 模仿学习

in bold [ ] [网络] 粗体地;黑体地;粗体的

in the loop [ ] [网络] 灵通人士;通灵人士;圈内

infinite life [ ] 无限寿命

intrinsic reward [ ] [网络] 内在奖励;内在报酬;内在的奖励

intrinsic rewards [ ] 内在报酬

iterative algorithm [ ] [网络] 迭代算法;叠代演算法;迭代法

learning algorithm [ ] [网络] 学习演算法;学习算法;学习机制

life expectancy [laif ɪkˈspektənsi:] n. 预期寿命;预计存在(或持续)的期限 [网络] 平均寿命;平均余命;期望寿命

loop break [ ] 末端钩环

loop in [ ] loop sb in,把某人拉进圈子

loop through [ ] un. 电路接通 [网络] 依次通过;环通输出接口;环通输入接口

Monte Carlo [mɔnti 'kɑ:lәu] n. 【旅】蒙特卡罗 [网络] 蒙特卡洛;蒙地卡罗;蒙特卡罗法

Montezuma's revenge [ ] [网络] 魔宫寻宝;复仇;腹泻

multiply by [ ] v. 乘 [网络] 乘以;乘上;使相乘

neural network [ˈnjuərəl ˈnetwə:k] n. 神经网络 [网络] 类神经网路;类神经网络;神经元网络

neural networks [ ] na. 【计】模拟脑神经元网络 [网络] 神经网络;类神经网路;神经网络系统

no derivative [ ] 禁止改作

object detection [ ] [科技] 物体检测

optimize for [ ] vt.为...而尽可能完善

pixel image [ˈpiksəl ˈimidʒ] [医]像素显像

play chess [ ] na. 下象棋 [网络] 下棋;下国际象棋;着棋

plug in [plʌɡ in] v. 插入;插插头;插上 [网络] 插件;连接;插上电源

plus infinity [ ] [网络] 正无穷大;正无限大

pseudo code [ˈsu:dəʊ kəud] n. 【计】伪码 [网络] 虚拟码;伪代码;假码

raster algorithm [ ] [计] 光栅算法

recycle bin [ˌri:ˈsaikl bin] un. 回收站 [网络] 资源回收筒;资源回收桶;垃圾箱

reinforcement learning [ ] 强化学习

reinforcement learning algorithm [ ] [计] 强化式学习算法

space invader [ ] 空间侵犯者:聊天时站得太靠近而侵犯到对方的个人空间,或坐在某人旁边时坐得太近而碰触到别人的脚或手臂的人。

stay idle [stei ˈaidl] [网络] 吃闲饭;投闲置散;显得没事干

terminal state [ ] 终点状态

the algorithm [ ] [网络] 算法

the instant [ðə ˈinstənt] [网络] 刹那;瞬间;我认许刹那

the Loop [ ] [网络] 主循环;大回圈区;卢普区

the matrix [ ] [网络] 黑客帝国;骇客任务;骇客帝国

the Max [ ] [网络] 麦克斯;牛魔王;电子产品配件

the terminal [ ] [网络] 幸福终点站;航站情缘;机场客运站

to compute [ ] [网络] 计算;用计算机计算

to initialize [ ] 初始化

training loop [ ] 培训回路

trash can [træʃ kæn] n. (街道上或公共建筑物里的)垃圾箱 [网络] 垃圾桶;垃圾筒;垃圾箱图片


惯用语
conv relu
for example
good point
so now
the reward change is dynamic



单词释义末尾数字为词频顺序
zk/中考 gk/中考 ky/考研 cet4/四级 cet6/六级 ielts/雅思 toefl/托福 gre/GRE
* 词汇量测试建议用 testyourvocab.com