Okay. Let's get started, guys.
So welcome to lecture number 4.
Um, today we will go over two topics that are not discussed,
uh, in the videos.
Uh, you've been learning C2M1 and C2M2,
if I'm not mistaken.
So you've learned about, uh,
what, uh, an optimization algorithm is,
how to tune it, and so on.
Today, we're going to go a little further, uh,
you should have the background to understand 80 percent of this, uh, lecture.
There's gonna be 20 percent that I want you to look back at
after you've seen the BatchNorm videos, for those of you who haven't seen them.
So we split the lecture in two parts,
and I put back the attendance code at the,
at the very end of the lecture so don't worry.
Ah, one topic is attacking networks with adversarial examples.
Ah, the second one is Generative Adversarial Networks, GANs.
[NOISE] And although these two topics have a common word, which is "adversarial,"
they are two separate topics.
You will understand why each is called "adversarial."
So let's get started with adversarial examples.
And in 2013, ah,
Christian Szegedy and his team have, uh,
published a paper called Intriguing Properties of Neural Networks.
What they noticed is that
several machine learning models, including
the state-of-the-art ones that you will learn about, ah,
networks like VGG-16 and VGG-19,
are vulnerable to something called adversarial examples.
These adversarial examples — you're going to learn what they are, in three parts.
First, by explaining how these examples, in
the context of images, can attack a network in its blind spots,
and, and make the network classify these images as something totally wrong.
Second, how to defend against these types of examples,
and third, why networks are vulnerable to these types of examples.
This is a little bit more theoretical,
and we're going to go over it on the board.
The, the papers that are listed on the bottom are the two big papers that,
that started this field of research.
So I would advise you to go and,
and read them, because we have only an hour and a half to go over two big topics,
um, in, in, in deep learning and,
ah, we will not have time to go into the details of everything.
Okay. So let's set up the goal.
The goal is, given a pre-trained network —
so a network trained on ImageNet, on 1,000 classes, millions of images —
ah, to find an input image that is not an iguana,
so it doesn't look like the animal, but that the network classifies as an iguana.
We will call this an adversarial example if we manage to find it.
Okay. Yeah.
Ah, what was the magic code, for those who came in late?
Uh, let me - so 284889,
let me write it down on the board so that you can -
Thank you.
Can you guys see? [NOISE] Okay. Let's move on.
So we have a network pre-trained on ImageNet, and it's a very good network.
Ah, what I want is to fool it
by giving it an image that doesn't
look like an iguana but gets classified as one.
So if I give it a cat image to start with,
the network is obviously going to give me a vector of
probabilities that has its highest value at the cat entry,
because it's a good network.
And you can guess what's the output layer of this network —
it's probably a softmax.
Now what I want is to find
an image x that is going to be classified as an iguana.
Okay. Does the, the,
the setting make sense to everyone?
Okay. Now as usual, uh,
this should remind you of something
we've seen together: neural style transfer.
You remember the, the art generation thing,
where we wanted to generate an image based on
the content of the first image and the style of another image.
And in that problem,
the main difference with classic
supervised learning was that we fixed the parameters of the network,
which was also pre-trained,
and we back-propagated all the
way back to the input image
so that it looks like the content of the content image
and the style of the style image. The first thing we did is that we defined a loss.
We, we tried to, to,
to phrase what exactly we want.
So how would we phrase what we want here?
Any ideas?
Okay. Complicated. Yep.
An image that provides minimum cost.
An image that provides minimum cost.
Okay. What's the cost you're talking about?
Cost of the, the difference between the expected iguana and the non-expected iguana.
Expected iguana and non-expected iguana.
So if we're sort of going back to the training setting,
we're trying to train it on
an image, and we want it to think that [NOISE] this cat is an iguana.
Yeah. Okay. So you want,
ah, this image to minimize a certain loss function,
and the loss function would be the distance
between the output you get and the output you want.
Okay. Yeah. So I would say,
we want to find x, the image,
such that y-hat of x,
which is the result of the forward propagation,
is equal to y-iguana,
which is a one-hot vector with the one at the position of iguana.
So we minimize a loss between y-hat of x and y-iguana,
which can be an L2 loss,
can be an L1 loss,
can be a cross-entropy in practice.
Ah, the L2 one, ah, works well here.
So you see that minimizing this loss function
would lead our image x to be classified as an iguana.
And then the process is very similar to neural style transfer,
where we will iteratively update the input.
So we will start with x,
we will forward propagate it.
And remember, we're not training the network, right?
We'll just take the, the gradient of the loss with respect to the input,
and update the input using a gradient descent step, until we
get something that is classified as an iguana.
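To make that loop concrete, here is a minimal sketch in PyTorch — an assumption on my part, since the lecture doesn't prescribe a framework; `model` stands for the frozen pre-trained classifier and `target_class` would be the iguana's index:

```python
import torch
import torch.nn.functional as F

def forge_adversarial(model, target_class, steps=200, lr=0.1):
    """Optimize the input pixels (not the weights) until the frozen model predicts target_class."""
    model.eval()
    for p in model.parameters():
        p.requires_grad_(False)                    # the network is fixed; only x moves
    x = torch.rand(1, 3, 224, 224, requires_grad=True)
    optimizer = torch.optim.Adam([x], lr=lr)       # the optimizer holds x, not model.parameters()
    target = torch.tensor([target_class])
    for _ in range(steps):
        optimizer.zero_grad()
        loss = F.cross_entropy(model(x), target)   # could equally be an L2 loss to the one-hot y
        loss.backward()                            # the gradient flows all the way back to the pixels
        optimizer.step()
    return x.detach()
```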
Yeah, any question on that?
But this doesn't necessarily mean that the x that you get in —
Okay. So you mentioned that it
doesn't guarantee that x is going to look like an iguana.
The only thing it's guaranteeing is that
this x will be classified as an iguana.
[NOISE] We will, we will talk about that now.
Er, another question in the back I thought. Yeah.
For the last question — we missed the one that, for —
Oh yeah, it could be.
Yeah. So in this case it's not
a vector of, of n classes,
where it could have been a cross-entropy.
Okay. So yeah, that's true.
We - are we guaranteed that the forged image x,
this one, i - is going to look like an iguana?
Who thinks it's going to look like an iguana?
Okay. Majority of people.
So can someone tell me why it's not going to look like an iguana?
[NOISE].
[inaudible]
Okay. So you're saying, uh,
the loss function is,
is very loose — it doesn't
put any constraints on what the image should look like.
That's true. Actually, the answer to this question is,
it depends. We don't know.
Maybe it looks like an iguana or maybe it doesn't.
But in terms of probabilities,
there's a high chance that it doesn't look like an iguana.
So the reason is here. Let's say this is our space of input images.
And the interesting thing is that, even if as humans, on
a daily basis, we deal with images of the real world —
so like,
if you look at a TV,
uh, that has no signal,
you see static,
but in other contexts,
we usually see real-world distribution images —
a network is just a function:
it means it takes an image.
Any input image that fits the,
the first layer would,
would be — would produce an output, right?
So this is the whole space of input images that the network can see.
Um, this is the space of real images,
it's a lot smaller.
Can someone tell me what's the size of the, the,
the space of possible input images for a network?
[NOISE].
Huh? Sorry.
Yeah.
Uh, it's not infinite.
It's, it's a lot, but not — [NOISE]
It's the number of possible pixel combinations.
Okay. Uh, yeah, there is an idea here. Someone there?
I also said the same thing — it's just the number of possible permutations of pixels.
Yeah, that's true.
So more precisely — you would start with: how many values can one pixel take?
There are 255 — 256 possible values per pixel,
and then what's the size of an image?
Let's say 64 by 64 by 3,
and your result would be 256 to the power of 64 times 64 times 3:
you fix the first pixel to one of its
256 possible values, then the second one can be anything else,
then the third one can be anything else,
and you end up with a very big number.
So this is a huge number.
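You can get a feel for how huge with a two-line check:

```python
# Count of possible 64x64x3 images with 256 values per pixel channel.
n = 256 ** (64 * 64 * 3)
print(len(str(n)))  # 29593 -- the count itself has almost thirty thousand digits
```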
And the space of real images is here.
Now if we had to plot the space of images classified as an iguana by the network,
it would be something like that.
Right. And you see that there is a small overlap
of real images and the space of images classified by — as an iguana by the network.
And this is where we probably are not.
We're probably in the green part that is not real-looking,
because we didn't constrain our image to look real. We're
going to constrain it a little bit more, because in practice,
these types of attacks are not too dangerous, because as a,
as a human, we would see that the pictures look like garbage.
The dangerous attack is if the picture looks like a cat,
but the network sees it as an iguana and a human see it as a cat.
Can someone think of, uh,
of, like, dangerous applications of this?
[NOISE] Face recognition: it could show a face of — you,
you could show your, your —
a picture of your face, and it pushes the network [NOISE] to think it's the face of someone else.
What else? Yeah.
Breaking CAPTCHAs, and breaking, like, defenses against bots.
Yeah. Breaking CAPTCHAs.
If you know what the output —
what output you want, you can force the network to think that this CAPTCHA,
uh, reads as something else.
Or in general, I would say, like, social media:
if someone is posting
violent content online,
there is — all these companies have detection systems.
If people can use adversarial examples that still look violent,
but are not detected as violent by those systems,
they could still publish their violent pictures.
Uh, think about self-driving cars.
A stop sign that looks like a stop sign for everyone,
but when the self-driving car sees it, it's not a stop sign.
So these are dangerous attacks.
Okay. And in fact, the picture we generated
previously would look like that. It's nothing special.
So now let's constrain our problem a little bit more.
We're going to say we want the picture to look like a cat but be classified as an iguana.
Okay. So now say we have our neural network.
If we give it a cat it's going to predict that it's a cat.
What we want is still give it a cat but predict that it's an iguana.
Okay. I, I'll go quickly over that, because it's very similar to what we did before,
so I'll just show the setup.
Okay, exactly the same thing.
Now, the way we phrase the problem changes.
Instead of saying we want only y hat of x equals y - iguana,
we have another constraint.
What's the other constraint?
The picture x should be close to a picture of a cat.
So we want x equal or very close to x-cat.
And in terms of loss function,
what it does is that it adds
another term, which is going to decide how close x should be to x-cat.
If we minimize this loss now,
we should have an image that looks like a cat because of the second term,
and that is predicted as an iguana because of the first term.
And I guess you guys are very familiar with this type of thought process now.
Okay, and same process:
we forward propagate, back-propagate to the input, and update it iteratively.
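A sketch of the two-term loss, reusing the hypothetical PyTorch setup from before; the weight `lam` that balances the two terms is made up for illustration:

```python
import torch
import torch.nn.functional as F

def constrained_loss(model, x, x_cat, target_class, lam=0.05):
    """First term: be classified as the target (iguana). Second term: stay close to the cat."""
    fool_term = F.cross_entropy(model(x), torch.tensor([target_class]))
    stay_close_term = ((x - x_cat) ** 2).mean()    # L2 distance to the reference cat image
    return fool_term + lam * stay_close_term
```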
Now our question is,
what should be the initial image we start with?
We didn't talk about that in the previous example. [NOISE] Yeah.
White noise?
White noise.
Yeah, possibly white noise.
Any other, uh, proposals?
Maybe a cat.
A cat? Yeah, which cat?
The one in the loss [inaudible].
Because it's - is the closest one to what we want to get.
So if we want to have a fast process,
we'd better start with exactly this cat,
which is the one we put in our loss function here, right?
If we put another cat,
it's going to be a little longer, because we have to
change the pixels to match the cat we defined in the loss.
That's what we told our loss function.
If we start with white noise,
it will take even longer, because we have to change the pixels a lot,
so that it looks real and then it looks like the cat that we defined here —
so that this term is also minimized. Yeah.
So when you write that loss function,
it seems like you are
just, like, minimizing the RMSE error to the actual cat picture, right?
Yeah?
Isn't that —
actually a really bad way to,
like, say two images are similar?
Yeah. This is, this is empirical,
the fact that we use that type of, of loss function.
But in practice, it could have been any distance between x and x-cat,
and any distance between y-hat and y-cat — yeah,
and y-iguana, sorry. Yes.
So when you say x-cat is [inaudible] —
Yeah.
[inaudible]
Exactly.
Can't you make a constraint,
like a complex loss function that takes a bunch of cats,
and then it puts, like, something like a minimum over it, right?
The minimum distance between [inaudible].
Can we just look at this wide [inaudible]?
I'm not sure about the second method.
But just to repeat the point you mentioned:
here, we had to choose a cat.
It means the x-cat is actually an image of a cat.
So what if we don't know what the cat should look like —
we just want a random cat to come out and be classified as an iguana?
We're going to see, uh, generative networks
afterwards, which can be used to do that type of stuff.
But, uh, but for the second part of the question,
I'm not sure what the answer would be.
Okay, let's move on.
So we start the optimization with the cat image that we specified in the loss function.
Okay. And so then we have an image of a cat that originally
was classified as 92 percent cat and we modified a few pixels.
So you can see that this image looks a little blurry.
So by doing this modification,
the network will think it's an iguana.
Okay? And sometimes this modification can be so slight that we
can't even notice it. Sounds good.
Now, let's add something else to this,
uh, to this, uh,
to this, uh, drawing.
We add a third set which is the space of images that look real to a human.
So that's interesting because the,
the space of images that look real to a human is
actually bigger than space - than the space of real images.
An example is this one.
This is probably an image that looks real to a human,
but it's not an image that we could see in,
in daily life, because of these slight pixel changes.
Okay? So this is the space of dangerous adversarial examples.
They look real to a human, but they're not actually real.
They might be used to fool a model.
Okay. Now let's see a video, uh,
on a real-world example of adversarial examples.
So for those who cannot read,
they're taking, uh, a camera which,
which classifies — which has a classifier running on it.
And the camera classifies the first image as
a library, and the second image — that is, the same to our eyes — as a prison.
So the second image has slightly different pixels, but
it's hard to see for a human. Same here.
So the, the, the classifier labels
the first image as a washer with, ah, high
confidence, and the second one as a completely different object.
Yeah.
So that's, uh, a small example of — of what can, what can be done.
Okay. Now let's go on —
we've seen how to generate these adversarial examples.
It's an optimization process on the input.
We will see, uh,
what are the types of attacks that we can
lead, and what are the defenses against these adversarial examples.
So we would usually,
uh, split the attacks into two parts:
non-targeted attacks and targeted attacks.
So a non-targeted attack means that we just want to find
an adversarial example that is going to fool the model — any wrong output will do.
While a targeted attack is when we want to force
this adversarial example to be output — to make the model output a specific class that we chose.
These are two different type of attacks that,
that are widely discussed in, in the research.
Another way to categorize attacks is by the knowledge of the attacker.
For those of you who did some security,
you know that we talk about white-box attacks, black-box attacks.
So one interesting thing is that,
uh, a black-box attack — a white-box attack is when you have access to the network.
So we have our image and the pre-trained network.
We have full access to,
to all the parameters and, and the architecture.
So it's probably an easier attack.
Right? We can, we can back-propagate all the way
back to the image and update the image, like we did.
A black-box attack is when the model is hidden,
so that we don't have access to its parameters.
So the question is how do we attack in
black-box attack if we cannot back-propagate because we don't have access to the layers?
Any ideas? Yeah.
So you know, you would tweak the image a little
bit and you will see how it changes the loss.
Looking at these, you can,
you can have an estimate of the gradient,
even if the model is a black-box model.
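That suggestion — estimating the gradient from queries alone — could be sketched with finite differences; `query_loss` is a hypothetical function returning whatever scalar the black-box model lets you read:

```python
import numpy as np

def estimate_gradient(query_loss, x, h=1e-3):
    """Central-difference estimate of d(loss)/dx using only forward queries."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = h
        grad.flat[i] = (query_loss(x + e) - query_loss(x - e)) / (2 * h)
    return grad  # two queries per pixel: expensive, but needs no access to the layers
```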
This assumes that you can query the model,
right? You can query it.
What if you cannot even query the model or you can query it one time only,
it's to send the adversarial example.
How would you do that? So this becomes more complicated.
So, there is an interesting property of
these adversarial examples, which is that they're highly transferable.
It means: I have a model here that is,
uh, an animal classifier.
I don't have access to it.
I cannot even query it.
I still wanna fool it.
What I'm going to do is that I'm going to build my own animal classifier and
forge an adversarial example on it.
It's highly likely that it's going to
be an adversarial example for the other one as well.
So, this is called transferability,
and it's still a, uh, research topic, okay?
We're trying to understand why this happens and,
uh, also, uh, how to defend against that.
One way to defend is to — we're going to see it after; I'm not gonna say it now, sorry.
Uh, does that make sense or no, this transferability?
It probably is because two animal classifiers learn similar boundaries.
And maybe these pixels that are play — we're playing
with are changing also the output of the other network.
Let's go over some kind of defenses.
So, one solution to defend against
these adversarial examples is to create a safety net.
It's, uh, a net that — like a firewall —
you would put before your network.
Every image that comes in will be classified as fake — like, forged — or real
by this network, and you only take those which are real and no — not adversarial.
Okay, but we can also build an adversarial example that,
that fools this network, right?
Whether we need black-box or white-box,
we can just create an adversarial net — example for this network.
It's true. But the issue is that now we have two constraints.
We have to fool the first one and the second one at the same time.
Of course, there is a chance that the second one is going to be fooled as well.
We don't know, okay?
It just makes it more complex.
There is no good defense at this point to — to — to all types of adversarial examples.
This is an option that people are researching.
So, the paper is here if you want to check it out.
Can you guys think of another solution?
[NOISE].
I've got one.
Yeah.
Just, like, multiple — in terms of loss functions [inaudible]
adversarial example loss functions, and train on them.
Train on multiple loss functions of different networks?
Yes.
So, you're talking about an ensemble.
Maybe we can - maybe we can create five networks to do our tasks,
and it's highly unlikely that the adversarial example is going
to fool the five networks the same way, right?
Any other idea? Yes.
Uh, generate adversarial examples and train on them.
Exactly. Generate adversarial examples and train on those, okay?
So, you will generate a cat image that is adversarial.
So, some pixels have been changed to fool a network.
You will label it as the human sees it.
So as a cat because you want the network to still
see that as a cat and you will train on those.
We've seen that generating adversarial examples is super
costly, and also we don't know if it can cover all the types of adversarial examples.
Maybe we are going to miss some.
So, it is another option.
Now, another solution is to
train on adversarial examples at the same time as we train on - on normal examples.
So, look at this loss function.
This loss function — the loss has two terms.
One is the classic loss function we would use —
so, say, a cross-entropy for
a classification — and the second one is
the same loss function, but we give it the adversarial version of x.
So, what's the complexity of that at every iteration?
For every example,
we're going to have to generate
an adversarial example at every step, right?
Because we have x; what we wanna do is
forward propagate it,
generate x-adversarial with the technique we saw,
use it to calculate the second term, and then back-propagate.
This is super costly as well, and it's very similar to what you said —
it's just done online, all the time, okay?
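A rough sketch of one such training step — the single-gradient-step attack inside it and the weight `lam` are my assumptions, not the lecture's exact recipe:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer, eps=0.1, lam=0.5):
    """Train on the combined loss: the clean term plus the same loss on an adversarial twin of x."""
    # Generate x_adversarial with one gradient step on the input.
    x_adv = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + eps * x_adv.grad.sign()).detach()  # fast-gradient-sign-style perturbation

    # Combined objective: classify both the clean batch and the perturbed batch correctly.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + lam * F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```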
So, what is interesting is what we're going to see next.
There's another technique that
we're not going to talk about.
There is the paper here if you want to check it.
It's another way to do adversarial training.
Uh, but what I would like to talk about is more,
from a theoretical perspective,
why are neural network vulnerable to adversarial examples?
So, let's, let's do some,
some work on the board.
Yeah,
So, you expect to be able to [inaudible] — can't you just [inaudible]?
Every time you come up with a defense,
someone will come up with an attack, and it's a race between humans.
So, this is the same type of problem.
And security problems are open-ended.
Okay. So, let's go over, uh,
something interesting that is more on the — on the theoretical side.
So, let me - let me write down something.
Uh, so, one question we ask ourselves
is why do adversarial example exist? What's the reason?
And Ian Goodfellow came up
with Explaining and Harnessing Adversarial Examples — one of the seminal papers —
where they argue that, although many people in the past have — have
attributed the existence of adversarial examples to the
high non-lineari — non-linearities of neural networks and to overfitting —
so, because we overfit to a specific dataset,
we actually don't understand what cats are;
we just understand what,
what we've been trained on —
uh, they argue that it's actually the linear parts of networks that
are the cause of the existence of adversarial examples. So, let's see why.
And the example I'm gonna — I'm gonna look at is a linear one.
So, together we've seen logistic regression.
So, before the activation,
we have y-hat equals w x plus b.
So, the model here is just going
to be y-hat equals w x plus b.
And our first example is going to be a six-dimensional input.
Okay. We have a single unit,
but the activation is just the identity.
So here, what happens is simply w x plus b.
Okay? And then we get y-hat.
And we probably use an L1 or L2 loss, because it's a regression problem, to,
uh, to train this network.
Now let's look at our first example.
Our first example where, uh,
where it's — where we trained our network.
So the network has been trained — sorry.
The network has been trained, and
w equals one, three,
minus one, two, two, three.
This is w. And you know, like,
because we defined x to be a vector of size 6,
w also has six entries, and let's say b is zero.
So the network computes y-hat equals w x.
So now, we're going to look at these inputs.
We're giving a new input to the network.
And the net — th — the input is going to be one,
minus one, two, zero, three, minus two.
Okay. So I'm going to forward propagate this to get y-hat equals w x.
[NOISE].
And this value is going to be 1 times 1, plus 3 times minus 1,
minus 1 times 2, plus 2 times 0, plus 2 times 3, plus 3 times minus 2 —
so 1 minus 3, minus 2, plus 0, plus 6, minus 6.
If I didn't make a mistake — up, up — okay.
[NOISE] Okay.
And so we — we — we basically get minus 4.
And so this is the — the — the first — the first example, forward propagated.
Now, the question is, [NOISE] how to change x
into x-star,
such that y-hat changes a lot while x-star stays close to x.
So this is basically the problem of adversarial examples.
Can we find an example that is very close to
x, but whose output is very different?
And we're trying to build one.
So the interesting part is to - is to identify how we should modify x.
And the key quantity is the derivative of y-hat with respect to x.
If you take the derivative of y-hat with respect to x,
you know that the definition of this term is — is, like,
the impact on y-hat of
small changes of x, right?
How — what's the impact of small changes of x to — on — on the output?
And if you compute it, what do you get?
W.
W? Everybody agrees?
What's — what's the shape of this thing?
The shape of that is the same as the shape of x.
So it should be w-transpose.
Remember, w here is a row vector.
Okay. Now it's interesting to - to see this because if we
I will call it,
Can you write bigger?
Yeah. Sorry. And can you see the top one?
Yeah.
You said yes or no?
Yes.
Okay. [NOISE] So what if x-star equals x plus epsilon times w-transpose?
I will call epsilon the size of the perturbation.
Now, if we forward propagate x-star,
it means we do y-hat-star equals w x-star plus b,
and b would be zero at this point.
We're going to get w x plus epsilon times w w-transpose.
And w times w-transpose is a scalar — the squared norm of w.
So this is the same as w-squared.
So what is interesting?
It's interesting because the —
the smart part is that this term is always going to be positive.
It means we — we moved x a little bit, because we
can make this change small by changing epsilon.
But it's going to push y-hat to a larger value for sure.
And if I had subtracted epsilon w-transpose instead,
it would push y-hat to a smaller value.
And the — the interesting thing is, now,
if we compute x-star,
and we take epsilon to be a small value like 0.2,
you can make the calculation.
What we get is — is this.
So x is 1, minus 1, 2, 0, 3, minus 2,
and we add 0.2 times w —
so plus 0.2, plus 0.2 times 3, minus 0.2, plus 0.4, plus 0.4, plus 0.6.
So if you look at that,
all the positive values have been pushed on the right. You agree?
And all the negative values — uh, sorry, sorry.
No, that's my bad. No, no, that's not it.
So let — let's finish the calculation, and I'll give the insight after.
We get 1.2, minus 0.4,
1.8, 0.4, 3.4, and minus 1.4.
So this is our x-star that we hope to be adversarial.
Okay. Let's compute y-hat-star to see what happens.
It's w x-star plus b, and b is zero.
So what we get when we multiply w by x-star is 1.2 —
[NOISE]
1.2, minus 1.2,
minus 1.8, plus 0.8,
plus 6.8, and minus 4.2,
[NOISE] which is going to give us 1.6.
All right.
So we see that a very slight change of x has pushed y-hat from minus 4 to 1.6.
And so a few things we want to notice here.
[NOISE].
So, insights on this — on this small example.
The first one is that, uh,
if w is large,
then x-star is not similar to x, right?
The larger the w, the less x-star is — is likely to be like x.
And specifically, if one entry of w is very large,
then x-i-star, the pixel corresponding to this entry, is going to be very different from x-i.
Um, so if w is large,
x-star is going to be different from x.
So what we're going to do is that we are going to take
the sign — sign of w, instead of taking w. What's the reason why we do that?
Because the interesting part is the sign of — of the w. It means,
if we play correctly with the sign of w,
we will always push y-hat in the direction we want:
this term, w times epsilon sign-of-w-transpose, is epsilon times the sum of the absolute values of w.
Because every entry here —
this w-i times sign of w-i — is positive.
And the second insight is that, as x grows in dimension,
the impact of adding epsilon
times sign of w on y-hat increases.
And so what's interesting to notice is that we can keep epsilon as small as possible.
It means x and x-star will be very similar, but as we grow in dimension,
we're going to get more terms in this sum — a lot more terms.
And the change in y-hat is going to grow and grow and grow and grow and grow.
And so one reason why adversarial examples
exist for images is because the dimension is very high:
64 by 64 by 3.
So we can make epsilon very small and take the sign of w, and
we will still get y-hat to be far from the original value that it had.
So epsilon doesn't grow with the dimension,
but the impact of this term increases with the dimension.
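You can verify all the board arithmetic in a few lines of numpy:

```python
import numpy as np

w = np.array([1., 3., -1., 2., 2., 3.])     # the trained weights from the board
x = np.array([1., -1., 2., 0., 3., -2.])    # the original input
eps = 0.2

print(w @ x)                       # -4.0, the original prediction
print(w @ (x + eps * w))           # -4 + eps * ||w||^2 = 1.6, the perturbed prediction
print(w @ (x + eps * np.sign(w)))  # -4 + eps * sum(|w_i|) = -1.6
```

With sign(w), each input entry moves by exactly plus or minus epsilon, yet the shift in y-hat is epsilon times the sum of the absolute values of w — a sum that grows with the dimension, which is exactly the second insight.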
[NOISE] Okay.
[NOISE].
The one-hot [inaudible].
Yeah.
It puts it right between these two, that gives [inaudible].
Okay. So you — like, you'd try to un-adversarialize [inaudible] the cat?
Yeah.
Yeah. I — I don't know if that has been done.
I don't think that has been done.
So you're talking about taking an adversarial image of a cat,
converting it into a normal image of the cat, and then classifying the cat.
Yeah.
Maybe, yeah.
So it's a topic of research.
Uh, okay, let's move on because we don't have too much time.
So just to conclude:
what we're going to keep as
a general way to generate adversarial examples is this formula:
x-adversarial equals x plus epsilon times the sign of the gradient of the cost with respect to x.
[NOISE]
This is going
to be a fast way to generate adversarial examples.
So this method is called the Fast Gradient Sign Method.
So basically, what we're doing is that we can — we — we are linearizing
the cost function around x.
And we're saying that what applied to the linear example
is going to also apply, through this general formula, to deeper networks.
So we're pushing the input in the direction of the sign of the gradient,
which is going to impact the output highly, okay?
So that's the Fast Gradient Sign Method.
Now you might say that, okay,
we did this example on a linear model,
but neural networks are not linear,
they are highly non-linear.
Well, in practice, we are trying to make them behave linearly.
With ReLU activations,
all that type of methods —
even in the choices we make,
we do all we can to put the network in a near-linear regime,
because we want fast training.
Okay? And one last thing that I'll mention for
adversarial examples is: if I have a network like this —
[NOISE]
so a fully connected network,
with three-dimensional inputs — up, yeah —
and then one hidden unit here, and then the output —
what's interesting is that computing the chain rule on — on — on this network
will give you that the derivative of the loss with respect to
x is equal to the derivative of the loss function with respect to z-one-one,
here, times the derivative of z-one-one with respect to x.
Let's say we're — we're going — we're going —
there is actually a sum over the units here.
But anyway. Uh, just let me illustrate the point.
Uh, what we're — what we're saying is that — what we — what we
try to do with neural networks is to have this gradient be high.
Because if this gradient is not high,
we're not able to train the parameters of
this layer.
Because if you want to do the same thing with the — with w-one-one,
which is the parameter related to this unit,
you would need to go through this chain rule.
Correct? So we need this gradient to be high.
And if this gradient is high,
the gradient with respect to the input is also going to be high,
because you use the same gradient in the chain rule.
So networks that are — that train well are,
uh, vulnerable to adversarial examples, because of this observation.
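To put that argument in symbols — my notation, not the board's — take the first layer's pre-activation $z^{[1]} = W^{[1]} x + b^{[1]}$. The chain rule gives

$$\frac{\partial \mathcal{L}}{\partial W^{[1]}} = \frac{\partial \mathcal{L}}{\partial z^{[1]}} \, x^{\top}, \qquad \frac{\partial \mathcal{L}}{\partial x} = \bigl(W^{[1]}\bigr)^{\top} \frac{\partial \mathcal{L}}{\partial z^{[1]}}.$$

Both expressions share the factor $\partial \mathcal{L} / \partial z^{[1]}$: training needs it to be non-vanishing so the weights can move, and the very same factor makes the input gradient — the handle the attacker pushes on — non-vanishing too.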
So any question on, on adversarial examples?
Before we move on, I think we don't have time and I would like to,
to go over the, the GANs with you guys.
So let's move on to GANs.
I'll stick around to answer questions on that part.
So the general question we're asking now is,
uh, do neural networks understand the data?
Because we've seen that some,
some data points look like they would be real,
uh, but the neural networks don't understand them.
So more generally, uh, can we build generative networks that
can generate data that looks real?
Let's say — this is what we will call generative models.
We'll start by motivating it,
and then we look at something called the GAN game:
a generator and a discriminator
that are going to help each other improve,
and finally we'll see that GANs are hard to train, uh,
we'll see some tips to train them, and finally,
go over some nice results and methods to evaluate GANs, okay?
So, uh, the motivation behind generative models is to build
computers with an understanding of our world, okay?
So by, by that we mean that we want to collect a lot of data,
use it to train a model that can generate
images that look like they're real even if they're not,
so a dog that has never existed can be generated by this network.
Um, and finally, uh,
the number of parameters of the model, uh,
is smaller than the amount of data,
we already talked about that,
and this is the key point.
It's because there is too much data in the world —
any image counts as data for a generative network —
and there are not enough parameters to memorize it all.
You know, you have — the network needs to understand the data,
because it doesn't have enough parameters to memorize every example.
So let's talk about data distributions.
So these are samples from real images that have been taken,
and if you plot this real data distribution in a 2-D map,
uh, it would look like something like that.
I made it up, but this is a 2-D view of
the image space, similar to what we talked about for adversarial examples,
and this green shape is the space of real-world images.
Now, uh, if you train a generator and generate some images that look like this —
and these images come from StackGAN, uh, from that paper —
uh, this distribution, if the generator is not good,
is not going to match the real-world distribution.
So our goal here is to do something so
that the red distribution matches the real-world distribution,
then to train the network so that it realizes what we want.
So this is our generator and it's what counts,
is what, what we want to train ultimately.
We want to give it, let's say,
a random number or a random code z,
and we want it to output an image.
But of course, because it's not trained initially,
it's going to output a random image —
it looks like something like that: random pixels.
Now, this image doesn't look very good.
What we want is these images to look like
generated images that are very similar to the real world.
So how are we going to help this generator train?
It's not like what we did in classic supervised learning,
because we don't have,
uh, we don't really have inputs and labels,
you know, there is no label.
We could maybe give it an image of a cat and ask it to output another cat,
but we want the network to be able to output things that don't exist,
things that we've never seen.
Right. So we want the network to understand what a cat
is, but not memorize the cats it has seen.
So the way we're going to do it is through
a small game between this network, called the generator G,
and another network called the discriminator D.
So let's look at how it works.
We have a database of real images,
and we're going to start with this distribution on the bottom,
which is the real-world data distribution,
is the distribution of the images in this database.
Now our generator has this distribution initially,
it means the pixels that you see here
probably follow a distribution that doesn't match the real world.
We'll define the discriminator D,
and the goal of the discriminator is to tell whether an image is real or generated.
So we're going to give several images to this discriminator:
sometimes we will give it generated images,
and sometimes we will give it real-world images.
What we want is that this discriminator is a binary classifier that outputs
one if the image is real, and zero if the image was generated, okay?
So let's say we give it x coming from the generated images — it should give us zero,
because we want the discriminator to detect that x was actually G of z.
If the image came from our database of real images,
we want the discriminator to say one.
So it seems like the discriminator would be easy to train, right?
It's just a binary classifier.
We can define a loss function:
it's the binary cross-entropy.
And the good thing is, we can have as many labels as we want —
like, it's, it's free:
we have this database and we label it all as one —
it's just: these images exist,
let's label them as one for the discriminator —
and everything that comes out of the generator, let's label it as zero for the discriminator.
So basically, data is not costly at all at this point.
The way we will train is that we will back-propagate
the gradient to the discriminator to train the discriminator,
using a binary cross-entropy loss.
But what we ultimately want is to train the generator, that's what we want.
At the end, we were not going to use the discriminator,
we just want to generate images.
So we are going to direct the gradient to go back to the generator.
And why can this gradient go back to the generator?
The reason is that x is G of z;
it means we can back-propagate
the gradient all the way back to the input of the discriminator.
But this input depends on the input of the generator, if the image was generated.
So we can also send
the gradient to the generator. Does it make sense?
There is a direct relation between z and the loss function,
in the case where the image was generated.
If the image was real,
then the generator couldn't get the gradient,
because x doesn't depend on z or on the features and parameters of the generator.
Okay? So we would run the updates,
um, simultaneously on two batches:
one for the true data and one from, from generated data.
Does this scheme make sense to everyone?
Yeah?
So you said there were two batches [inaudible]?
So there are many methods — your question is about mixing the batches?
Usually we would use, uh, we would,
we would use one batch of each.
But in, in practice,
you can try other things.
Yeah. So there are many methods that are being tried to train GANs properly.
We're going to go over
the details of that when we see the loss functions.
So we hope that the generated distribution converges to the real one,
and if it matches,
we're going to just take the generator and generate images;
normally, it should be able to generate images that look real,
[NOISE] that look like they came from this distribution.
Okay? Sounds good?
So now let's talk more about the training procedure and
try to figure out what the loss functions should be in this case.
What should be the cost of the discriminator?
Assuming, assuming we give it two batches:
one for real data, so real images,
and one for generated data that comes from G. [NOISE]
Yes.
The same basic loss function we use for every binary classifier, right?
The same basic loss function we use for binary class — for the binary classification case.
It's true — we're going to adapt it a little,
but it's the same idea.
So this is what it can look like.
We're going to call it JD,
cost function of the discriminator.
It has two terms. What does the first term say?
What does the second term say?
And you can recognize the binary cross-entropy here.
[NOISE].
The only difference is that we have
a label that is y_real and a label that is y_generated.
In practice, y_real and y_generated are always going to be set to fixed values.
We know that y_generated is zero and we know that y_real is one.
So we can just drop these two label coefficients, because they're both equal to one.
The first term is telling us this should correctly
label real data as one, the cross-entropy term.
The first term of a binary cross-entropy.
The second term is going to tell us,
D should correctly label generated data as zero.
So the difference with classic cross-entropy we've seen is that,
this one is computed over two batches — a batch of real images
and a batch of generated images — with the labels fixed in advance.
So we both want the D to correctly identify real data,
and also correctly identify fake data.
That's why we have two terms.
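As a sketch in the same hypothetical PyTorch setup, assuming D ends in a sigmoid so its output is a probability:

```python
import torch
import torch.nn.functional as F

def discriminator_loss(D, G, x_real, z):
    """Two-term binary cross-entropy with fixed labels: real -> 1, generated -> 0."""
    d_real = D(x_real)
    d_fake = D(G(z).detach())      # detach: this loss should train D only, not G
    return F.binary_cross_entropy(d_real, torch.ones_like(d_real)) + \
           F.binary_cross_entropy(d_fake, torch.zeros_like(d_fake))
```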
Now, what about the generator?
What do you think should be the cost function of the generator? Yes.
So just about that cost function.
If I'm putting in data that's from the generator,
I won't run the first part, because I don't have a,
uh, a y_real if I have the — an input that's coming from the generator.
Yeah. Exactly.
It's about half of this.
Yeah. But in your batch, we have had, like,
a certain number of real example,
a certain number of generated examples.
The generated examples have no impact on the first cross-entropy,
and same for the real examples on the second cross-entropy. Any other questions?
Okay. So coming back to the cross - to the - to the cost of the generator.
What should it be? This is a tiny bit complicated.
Let's move - let's move on because we don't have too much time.
The cost of the generator basically should say that G should try to
fool D. [NOISE] The goal is for G to generate real-looking samples.
And in order to generate real samples,
we want to fool D. If G managed to fool D and D is very good,
it means G is very good, right?
The problem is that it's a game.
Because if D is bad and G fools D,
it doesn't mean that G is good.
Because G - because D is bad,
it doesn't detect very well the real versus fake examples.
We want D to go up — to be very good — and G to go up at the same time,
until D outputs,
like, random probabilities, because it cannot
distinguish the samples coming from G versus the real samples.
So this cost function is basically saying, uh,
for generated images, we want D to classify them as one.
That's what it's saying. We want to fool D,
okay? Yeah.
Uh, just a little bit of a side question. Um, I
can kind of see — so if you're implementing this,
I can kind of see how you would, uh, you know,
implement it for D, but how would you implement it for G, if you're actually implementing this?
Um, is there — has there been a module to, like, train this?
Because it's not immediately obvious how you do this setup.
So, you know, like,
if you're using — so, how to implement that?
If you're using a deep learning framework,
you've been building a graph, right?
And at the end of your graph,
you've been building your cost function D that is very close to a binary cross-entropy.
Uh, what you're going to do is just define a node that is going to be minus
the cost function of D. It's going — every time you are going to call the function J of G,
it's going to run the graph that you defined for J of D and run,
uh, an in — an opposite operation — a negation on it.
So now you have two different cost functions.
How can they propagate [inaudible]?
These are two different cost functions.
Propagate [inaudible]?
Yeah.
We're not going to [inaudible].
We are going to — to return [inaudible]
to a [inaudible].
So, you know, you — you — you back-propagate
the — on — on D. And when you train G,
you would flip — you would flip the sign. That's all we do.
The same thing with the sign flipped.
In terms of implementation, it's just, uh, another operation.
Okay. Now, let's look at the graphs of these costs.
Let's look [NOISE] at the graph of the generator's cost.
So I'm going to plot against —
oh sorry — D of G of z.
So what does this mean?
This axis is the output of D when given a generated example, G of z.
It's going to be between zero and one, because it's a probability.
D is a binary classifier with a sigmoid, uh, output, probably.
Um, if we plot log of D of G of z, it looks like —
so, like, this type of thing.
This would be log of D of G of z.
Does it make sense? That's the logarithm.
Um, if I plot minus that, minus that —
so let me — let me plot minus —
or let me — let me do something else.
Let me plot minus the logarithm of D of G of z.
This is it. Do, do you guys agree?
Now, what I'm going to do is that I'm going to plot another function, which is this one:
that is the logarithm of one minus D of
G of z, okay?
So the question is,
right now, what we're doing is that we're saying the,
the cost function of the generator is logarithm of 1 minus D of G of z.
So it looks like this,
right? It looks like this one.
[NOISE] What's the issue with this one?
What do you think is the issue with this cost function looking at it like that?
It goes to minus infinity.
Sorry?
It goes to minus infinity.
Can you say it louder?
It goes to minus infinity
at one — that's what you mean?
Yeah.
Yeah. And so the, the consequence of that is that
the gradient here is going to be very large,
the closer we go to one.
But the closer we are to zero,
the lower is the gradient.
And it's the reverse phenomenon for this lo — logarithm:
the gradient is very high —
and by very high I mean in absolute value —
very high when we're close to zero,
but it's very low when we go close to one, okay?
So which loss function do you think would be better?
A loss function that looks like this one or a loss function that looks like
this one to train our generator?
The broader question is where are we early in the training?
Are we close to here or are we close to there?
What does it mean to be close to there?
Close to one? [NOISE]
Hmm?
It means D thinks the generated, uh, samples are real.
That's here. This place is the contrary:
D thinks that generated samples are fake.
It means it correctly finds out that they're fake.
Early on, we're generally here,
because the discriminator is better than the generator,
and it's very easy for the discriminator to figure out that a sample is fake,
because this garbage looks very different from real-world data.
So early on, we're here.
So which function is the best one to — to — to — to — to be our cost?
[inaudible]
Yeah. So probably, this one is better.
So we have to use a mathematical trick to change this into that.
Right. And the mathematical trick is pretty standard.
Right now, we're minimizing something that is in log of one minus X.
We can say that doing so is the same as maximizing something that is in log of X.
Do you agree? Simple flip.
And we can also say that it is the same as minimizing something in minus log of X.
Does it make sense? So we are going to use this mathematical trick
to convert our function — a saturating cost,
as we would say — into a non-saturating cost that is going to look more like this.
Let's see what it looks like.
So to sum up,
our cost function currently looks like that.
It's a saturating cost.
Because early on, the gradient is flat where we are,
we cannot train G. We're going to do a flip
and convert this into another function that is a non-saturating cost.
Okay. Yeah. Well, actually, yeah.
So the reason the blue one is like that is because I added a minus sign.
So I'm flipping this.
Okay? And it's the same thing —
it's just the — the sign of the gradient that is going to be different.
Like that, the gradient is high at the beginning and low at the end. That makes sense?
[NOISE] So we're going to do the — use this flip.
And so we have a new training procedure now, where J of D
didn't change, but J of G changed.
We have a minus log:
we have minus the log of,
uh, D of G of z.
Does that make sense to everyone?
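In code, the flip amounts to "label the fakes as real and reuse the same cross-entropy" — same hypothetical setup as the discriminator sketch:

```python
import torch
import torch.nn.functional as F

def generator_loss(D, G, z):
    """Non-saturating cost: minimize -log D(G(z)) instead of log(1 - D(G(z)))."""
    d_fake = D(G(z))               # no detach here: gradients must reach G
    return F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))  # = -mean(log D(G(z)))
```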
Good. And actually, so this is a fun thing:
some researchers created a large, uh,
study of many, many different GANs.
It shows what people have tried.
And you can see that people have tried all types of losses to make GANs work.
So it looks — it looks complicated here.
But actually, the MM GAN is the first one we saw together.
It's the minimax loss function.
The second one is the non-saturating one that we just saw.
So you see, between the first two,
the only difference is that, on the generator,
we get the log of one minus D of x-hat becoming log — minus log of D of x-hat.
Okay. Now, another trick to train GANs is to use the fact that,
uh — to use the fact that D is usually easier to train
than G. But as D improves, G can improve.
If D doesn't improve, G cannot improve.
So you can see the — the — the — the performance
of D as an upper bound on what G can achieve.
Because of that, we will usually train D more times than we train G.
So we will basically train, for num_iterations:
K steps of D, one step of G; K steps of D,
one step of G; and so on.
So that the discriminator becomes better, then the — the generator can catch up;
it becomes better, then G can catch up,
and so on. Does that make sense?
There are also methods that use, like,
different learning rates for D and G to take this into account —
to train the discriminator faster.
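Putting the schedule together — k, the code dimension, and the loss helpers from the earlier sketches are all assumptions:

```python
import torch

def train_gan(D, G, data_loader, opt_D, opt_G, k=5, z_dim=100):
    """Alternate k discriminator updates with one generator update, as described above."""
    for x_real in data_loader:
        for _ in range(k):                              # D trains more often...
            z = torch.randn(x_real.size(0), z_dim)
            opt_D.zero_grad()
            discriminator_loss(D, G, x_real, z).backward()
            opt_D.step()
        z = torch.randn(x_real.size(0), z_dim)          # ...then G catches up
        opt_G.zero_grad()
        generator_loss(D, G, z).backward()
        opt_G.step()
```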
Okay. Uh, because we don't have too much time,
I'm going to skip this.
We are going to see it probably next week, uh,
together, after you guys have seen the BatchNorm videos.
Okay. It's cool. So just to sum up.
Some - some tips to train GANs is to modify the cost function.
We've seen one modification, there are many more.
Uh, keeping D up-to-date with respect to G — so updating D
more than you update G; using Virtual BatchNorm, which is a variant —
so it's a different type of BatchNorm that is used here;
and something called one-sided la — label
smoothing, which I'm not going to talk about today because we don't have time.
So let's see some nice result now,
and that's the funniest part.
Um, so some of you have worked with word vectors,
and you — you might know that word vectors, uh,
are vectors that can encode the meaning of words.
And you can compute operations sometimes on these - on these words.
So if you take, um,
if you take king minus queen,
it should be equal to man minus woman.
Operations like that.
That happens in the space of word vectors.
You can use a generator to generate faces,
and the paper is listed on the bottom here.
So you give a code that is a random code and it will give you an image of a - a face.
You can give it a second code,
it's going to give you a second image that is
different from the first one because the code was different.
You can give it a third one,
it's going to give you a third fa - third face.
The fun part is,
if you take code one minus code two plus code three.
So basically, image of a man with glasses minus image of
a man plus image of a woman will give you an image of a woman with glasses.
So [LAUGHTER] —
so this is interesting, because it means that
the latent codes encode semantic information: you can do arithmetic in the latent space.
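A hypothetical sketch of that arithmetic, with `G` standing for a trained face generator; in practice the three codes would be ones you've already decoded and inspected, not fresh random draws:

```python
import torch

z_man_glasses = torch.randn(1, 100)   # code that decodes to a man with glasses
z_man = torch.randn(1, 100)           # code that decodes to a man
z_woman = torch.randn(1, 100)         # code that decodes to a woman

z_new = z_man_glasses - z_man + z_woman  # "man with glasses" - "man" + "woman"
image = G(z_new)                          # hopefully: a woman with glasses
```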
Okay. Let's look at something even better.
So you can use GANs for image generation.
Of course, these are very nice samples.
You see that sometimes,
GANs have problems with — with the — [LAUGHTER].
But — but — but the — but these samples
are from StackGAN++, which is a — is a very impressive GAN
that has generated — that has been state of the art for a long time.
Okay. So let's see something fun.
Something called image-to-image translation.
So, uh, actually, the - the -
the project winners last quarter in Spring was a project dealing with exactly that.
Generating satellite images based on the map image.
So given the map image, generate the satellite image using a GAN.
So you see that, instead of giving a random code,
you could give a very detailed code.
The code can be this image.
Right? And you have to find a way to constrain your network in a certain - with - in
a certain way to push it to output
exactly the satellite image that corresponded to this map image.
There are many other results that are fun.
Converting day scenes to night scenes,
um, and apples to oranges and oranges to apples.
So let's do a - a case study together.
Let's say our goal is to convert horses to zebras.
Can you tell me what data we need?
Let's go quickly so that we have some time.
Horses and zebras.
Yeah. Horses and zebras.
Do you need paired images?
You know, like, do you need to have the same image of a horse as a zebra?
No.
Yeah. So the problem is, uh, okay,
we could have labeled images, you know,
like, uh, a horse and its,
uh, zebra version.
Uh, and we could train a network to take one and output the other.
Unfortunately, we don't — not —
not every horse has a zebra version.
Uh, so instead, we're going to do,
uh, unpaired training.
It means we have a database of horses and a database of zebras.
But these are different horses and different zebras.
They're not one-to-one - there's no one-to-one mapping between them.
There's no mapping at all. What architecture do you wanna use?
GAN?
Nice.
[LAUGHTER] GAN, not a [inaudible].
Okay. So let's see about the architecture and the cost.
So I'm going over it very quickly because it's a -
it's a very fun GAN with - it's called CycleGAN.
So the way we are going to work it out is: we have a horse, called
capital H. We want to generate the zebra version of it.
So we give it to a generator that we call G1.
You can call it H2Z,
like horse-to-zebra.
It should give us this horse H as a zebra, right?
And in fact, if we're training a GAN,
we need a discriminator.
So, we will add a discriminator that is going to be a binary classifier, to tell us
if this image is a real zebra or not.
So this discriminator is going to take in some images of zebras, probably —
or — yeah, zebras or horses [NOISE] —
and it's going to also take the generated images,
and it's going to see which one is fake and which one is real.
On the other hand, we're going to do — and this is the key point:
we need to enforce the fact that this horse, G1 of H,
should be the same horse as H. In order to do that,
we're going to create another gen — generator, which is going to take the generated image,
and generate back the input image.
And this is where we will be able to enforce the constraint that G2 of G1 of
H should be equal to H. Do you see why this loop is super important?
Because if we don't have this loop,
we don't have a constraint on the fact that the horse —
the — the zebra should be the horse as a zebra,
the same horse as H. So we'll do that, and
we have a second discriminator to decide if this image is real.
This is one step: H2Z.
Another step might be Z2H, where we start with a zebra,
give it to Generator 2,
get the horse version of the zebra, and cycle back.
So this is the general pattern using CycleGANs.
And what I'd like to go over is what loss should we minimize in
order to enforce the fact that we want
the horse to be converted to a zebra that is the same as the horse.
Can someone give me the terms that we need?
Someone wants to give it a try?
Go for it. Two minutes. Yes.
So you want to make sure that the picture at
the end — of the zebra that you started off with —
matches the zebra that you started with, or
the horse that you started off with matches the horse that you had originally.
Okay.
But at the same time, you also need to have Discriminator 2
identifying that the image is a real zebra or a real horse -
Yeah.
- because you don't want it to just sort of input
in the sample image and it output back to you the sample image.
Yeah, correct.
So I think you'd want to add the output of the cost function for Discriminator
2 to the cost that you get at for comparing the starting images.
Okay, that's correct. So you're saying we need
the classic cost functions that we've seen previously,
plus another one that is the matching between H and G2 of G1 of H,
and Z and G1 of G2 of Z.
Yes.
Correct. So we have all these terms.
One term to train G1,
which is the classic term we've seen,
differentiate real images from generated images.
What about G1? Same: we are using the non-saturating cost on generated images.
Same for D2. Same for G2. These are classics.
The one we need to add to all of this is
the cycle costs which is the distance between this term,
G2 of G1 of H and H,
and the same thing for zebras.
Does that make sense? So you have the intuition to build that type of loss.
We just sum everything and it gives us the cost function we're looking for. Yeah.
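A sketch of that cycle term — the L1 distance and the weight `lam` are assumptions in the spirit of the CycleGAN paper:

```python
def cycle_cost(G1, G2, H, Z, lam=10.0):
    """Cycle-consistency: horse -> zebra -> horse should land back on the original horse, and vice versa."""
    horse_cycle = (G2(G1(H)) - H).abs().mean()  # G2(G1(H)) should reconstruct H
    zebra_cycle = (G1(G2(Z)) - Z).abs().mean()  # G1(G2(Z)) should reconstruct Z
    return lam * (horse_cycle + zebra_cycle)    # added on top of the usual D1/G1 and D2/G2 terms
```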
Can we use the same,
uh, D1 as D2?
It's the same [inaudible] recognized [inaudible]
Oh, the same cost function for D1 and D2?
Yeah. Could you use the same -
So, the, the - you could but it's not going to work that well.
I think — so I think there's a — there's a tiny mistake here,
which is that, uh, the,
the small h at the bottom should be a small z,
and the small h on top should be a small z as well.
Because Discriminator 1 is going to receive
generated samples that look like zebras, because they came out of G1.
So you want the real database that you give it to be zebras as well —
to force — to force Generator 1 to output things that look like zebras.
Okay? And this is my favorite.
So you can convert faces to ramen.
[LAUGHTER] It's the most fun application I found.
It's from Naritomi.
So Japanese research labs are working hard to,
to, to do face2ramen [LAUGHTER].
And actually, in two — in two to three weeks,
you will learn, um, about face detection.
And if you learn that, maybe you can start a project to, like,
detect the face and then replace it with ramen.
[LAUGHTER] Because, I don't know, this is also a funny,
funny work by Naritomi.
Okay. Oh, this is a super cool application as well.
So let's look at that.
Okay. So we have — so this model is,
um, learning edges and generating cats based on the edges.
So I'm gonna - I'm gonna to try to draw a cat.
[LAUGHTER] Okay, sorry.
I cannot see [LAUGHTER].
Again, I'm not a good drawer - [LAUGHTER]. It's a cat.
Okay. It's going.
I hope it's gonna work. [LAUGHTER] Okay.
Yeah,
I,
I don't think it worked,
but it's supposed to work.
So you can generate cats based on,
on edges and you can do it for different things.
You can do it for a shoe.
So all these models have been trained for that. Okay.
Yeah, I have a question.
Yes, go for it.
[NOISE] So, so for this model,
would you have to train it specifically for the things that you want it to generate?
Like two things — so cats and shoes in this case?
Uh, sorry. Can you repeat?
Is it [inaudible]?
Yes — you have to train it specifically for the domain.
So, like, these models are different models that have [NOISE] been trained separately.
Okay.
Okay. I'm looking for my presentation,
[NOISE] I missed it. The presentation disappeared.
Okay. Another application is super resolution.
You can give a lower resolution image and generate
the super resolution version of it using GANs.
And this is pretty cool, because you can get,
uh, a high-resolution image,
down-sample it, and use this as the training data:
[NOISE] like, you have
the high-resolution version of the lower-resolution image for free.
Um, other applications can be privacy-preserving.
So some people have been working on - you know in medical - uh,
in the medical space privacy is a huge issue.
You cannot share a dataset of patients
among medical teams — it's a common issue —
so people have been looking at generating a synthetic dataset.
If you train a model on this dataset,
it's going to give you the same type of parameters as the other one,
but this dataset is anonymous.
So they can share the synthetic dataset with each
other and train their models on that, without
being able to access, uh,
the information of the patient and who it is.
Um, manufacturing is important as well:
so GANs can generate, um,
very specific, uh, objects that can replace bones for humans.
So same for dental crowns.
If you lose a tooth, uh, the,
the technician can take a picture and decide what the,
the crown should look like.
The GAN can generate it.
Um, another topic is how to evaluate GANs, you know.
Um, you might say we can just look at the images and see if they
look real, and it will give us an idea if the GAN is working well.
But you also want to make sure the GAN didn't just memorize samples
from the real data you gave to the — to the — to the discriminator.
Uh, so how do you check that?
It's very complicated.
So one method is human evaluation,
where you would, uh,
[NOISE] you would build a software,
push it on the cloud and people all around the world are
gonna select which images look generated,
which images look not generated to see if a human can, can,
can compare your GAN to real-world data,
and how your GAN performs.
So it would look like that.
A labeling interface.
You can - you can do different experiments like you can show very quickly
an image for a fraction of a second and ask them was it real or not,
or you can give them unlimited time.
Different experiments can be led.
Uh, there is another one that is more automated.
You know, every time you train a GAN,
you want to do that to verify if the GAN is working well. It takes a lot of time.
So instead of using humans,
why don't we use a very good network that is good at classification?
We're going to give our image samples to
this network — the Inception network — and see what it thinks.
Does it think that it's a dog or not?
Does it look like a dog for the network or not?
And we can scale it and make it very quick.
And there is a metric, the Inception score,
that we can talk about next week, when we'll have time.
Uh, it measures the quality of
the samples, and also it measures the diversity of the samples.
I'll go over it next week, hopefully.
Uh, there is another distance that is very popular, uh —
the Fréchet Inception Distance — that has been used a lot.
And I, I — I'd advise you to check some of
these papers if you're more interested in it for, for your projects.
So just to end: um, for next Wednesday,
we'll have, uh, C2M3 and also the whole C3 modules.
[NOISE] Uh, you'll have three quizzes.
Be careful: these two modules,
C3M1 and C3M2, are longer than ca — than normal modules.
They're like wide case studies, so take your time,
and go over them, um,
and you have one programming assignment.
Uh, make sure you understand the BatchNorm videos,
so that we can go over the virtual BatchNorm hopefully next week together.
Um, and in the hands-on section this Friday, uh,
you will receive your project proposal feedback as soon as possible, uh,
and meet with your project TAs to go over the proposal and
to make decisions regarding the next steps for your projects.
Uh, I'll stick around in case you have any questions. Okay. Thanks, guys.
知识点
重点词汇
infinite [ˈɪnfɪnət] n. 无限;[数] 无穷大;无限的东西(如空间,时间) adj. 无限的,无穷的;无数的;极大的 {cet4 cet6 ky ielts :6045}
gauge [ɡeɪdʒ] n. 计量器;标准尺寸;容量规格 vt. 测量;估计;给…定规格 {cet4 cet6 ky toefl ielts gre :6046}
detection [dɪˈtekʃn] n. 侦查,探测;发觉,发现;察觉 {cet4 cet6 gre :6133}
attacker [əˈtækə(r)] n. 攻击者;进攻者 { :6197}
download [ˌdaʊnˈləʊd] vt. [计] 下载 {gk :6382}
radically ['rædɪklɪ] adv. 根本上;彻底地;以激进的方式 {ky toefl :6472}
discriminate [dɪˈskrɪmɪneɪt] vi. 区别;辨别 vt. 歧视;区别;辨别 {ky toefl ielts gre :6572}
proximity [prɒkˈsɪməti] n. 接近,[数]邻近;接近;接近度,距离;亲近 {cet6 toefl ielts :6588}
overlapping [əʊvə'læpɪŋ] adj. 重叠;覆盖 v. 与…重叠;盖过(overlap的ing形式) {toefl :6707}
overlap [ˌəʊvəˈlæp] n. 重叠;重复 vi. 部分重叠;部分的同时发生 vt. 与…重叠;与…同时发生 {cet6 ky toefl ielts gre :6707}
unlimited [ʌnˈlɪmɪtɪd] adj. 无限制的;无限量的;无条件的 {cet6 :6742}
MA [mɑ:] abbr. 文学硕士(Master of Arts);磁放大器(magnetic amplifier);主报警信号(main alarm) { :6756}
inaudible [ɪnˈɔ:dəbl] adj. 听不见的;不可闻的 { :6808}
algorithm [ˈælgərɪðəm] n. [计][数] 算法,运算法则 { :6819}
algorithms [ˈælɡəriðəmz] n. [计][数] 算法;算法式(algorithm的复数) { :6819}
mimic [ˈmɪmɪk] vt. 模仿,摹拟 n. 效颦者,模仿者;仿制品;小丑 adj. 模仿的,模拟的;假装的 {toefl ielts gre :6833}
plucked [plʌkt] [纺] 粗细不匀 { :6870}
dental [ˈdentl] n. 齿音 adj. 牙科的;牙齿的,牙的 {ky toefl :7161}
vice [vaɪs] prep. 代替 n. 恶习;缺点;[机] 老虎钳;卖淫 adj. 副的;代替的 vt. 钳住 n. (Vice)人名;(塞)维采 {gk cet4 cet6 ky ielts :7210}
latent [ˈleɪtnt] adj. 潜在的;潜伏的;隐藏的 {cet6 ky toefl ielts gre :7284}
numerical [nju:ˈmerɪkl] adj. 数值的;数字的;用数字表示的(等于numeric) {cet6 ky toefl ielts :7312}
gradient [ˈgreɪdiənt] n. [数][物] 梯度;坡度;倾斜度 adj. 倾斜的;步行的 {cet6 toefl :7370}
gradients [ˈgreɪdi:ənts] n. 渐变,[数][物] 梯度(gradient复数形式) { :7370}
binary [ˈbaɪnəri] adj. [数] 二进制的;二元的,二态的 { :7467}
Et ['i:ti:] conj. (拉丁语)和(等于and) { :7820}
compute [kəmˈpju:t] n. 计算;估计;推断 vt. 计算;估算;用计算机计算 vi. 计算;估算;推断 {cet4 cet6 ky toefl ielts :7824}
intuition [ˌɪntjuˈɪʃn] n. 直觉;直觉力;直觉的知识 {cet6 ky toefl ielts gre :7905}
conditional [kənˈdɪʃənl] n. 条件句;条件语 adj. 有条件的;假定的 { :8076}
converged [kən'vɜ:dʒd] v. 聚集,使会聚(converge的过去式) adj. 收敛的;聚合的 { :8179}
converge [kənˈvɜ:dʒ] vt. 使汇聚 vi. 聚集;靠拢;收敛 {cet6 toefl ielts gre :8179}
encode [ɪnˈkəʊd] vt. (将文字材料)译成密码;编码,编制成计算机语言 { :8299}
encoding [ɪn'kəʊdɪŋ] n. [计] 编码 v. [计] 编码(encode的ing形式) { :8299}
validation [ˌvælɪ'deɪʃn] n. 确认;批准;生效 { :8314}
downside [ˈdaʊnsaɪd] n. 负面,缺点;下降趋势;底侧 adj. 底侧的 { :8709}
implicitly [ɪm'plɪsɪtlɪ] adv. 含蓄地;暗中地 { :8775}
quiz [kwɪz] n. 考查;恶作剧;课堂测验 vt. 挖苦;张望;对…进行测验 {gk cet4 cet6 ky :8784}
derivative [dɪˈrɪvətɪv] n. [化学] 衍生物,派生物;导数 adj. 派生的;引出的 {toefl gre :9140}
neural [ˈnjʊərəl] adj. 神经的;神经系统的;背的;神经中枢的 n. (Neural)人名;(捷)诺伊拉尔 { :9310}
activations [,æktɪ'veɪʃən] n. [电子][物] 激活;活化作用 { :9314}
activation [ˌæktɪ'veɪʃn] n. [电子][物] 激活;活化作用 { :9314}
neuron [ˈnjʊərɒn] n. [解剖] 神经元,神经单位 {cet6 toefl :9397}
salient [ˈseɪliənt] n. 凸角;突出部分 adj. 显著的;突出的;跳跃的 n. (Salient)人名;(西)萨连特 {toefl gre :9408}
en [en] n. 半方;字母N prep. 在…中 n. (En)人名;(芬、柬)恩 { :9798}
metric [ˈmetrɪk] adj. 公制的;米制的;公尺的 n. 度量标准 {cet4 cet6 ky ielts :10163}
propagate [ˈprɒpəgeɪt] vt. 传播;传送;繁殖;宣传 vi. 繁殖;增殖 {cet6 toefl ielts gre :10193}
propagated [ˈprɔpəɡeitid] 传播 { :10193}
inception [ɪnˈsepʃn] n. 起初;获得学位 n. 《盗梦空间》(电影名) {gre :10325}
grad [græd] n. 毕业生;校友 n. (Grad)人名;(英、法、德、罗、瑞典)格拉德 { :10355}
pixels ['pɪksəl] n. [电子] 像素;像素点(pixel的复数) { :10356}
pixel [ˈpɪksl] n. (显示器或电视机图象的)像素(等于picture element) { :10356}
generalize [ˈdʒenrəlaɪz] vi. 形成概念 vt. 概括;推广;使...一般化 {cet6 ky toefl ielts gre :10707}
tweak [twi:k] n. 扭;拧;焦急 vt. 扭;用力拉;开足马力 { :10855}
saturating [ˈsætʃəreitɪŋ] v. 浸湿,浸透( saturate的现在分词 ); 使…大量吸收或充满某物 { :11157}
infinity [ɪnˈfɪnəti] n. 无穷;无限大;无限距 {cet6 gre :11224}
delve [delv] n. 穴;洞 vi. 钻研;探究;挖 vt. 钻研;探究;挖 n. (Delve)人名;(英)德尔夫 {gre :11237}
axes [ˈæksi:z] n. 轴线;轴心;坐标轴;斧头(axe的复数) { :11322}
malicious [məˈlɪʃəs] adj. 恶意的;恶毒的;蓄意的;怀恨的 {cet6 toefl gre :11330}
washer [ˈwɒʃə(r)] n. [机] 垫圈;洗涤器;洗衣人 { :11379}
seminal [ˈsemɪnl] adj. 种子的;精液的;生殖的 adj. 有创造力的,对未来有影响的;重大的 {gre :11387}
optimize [ˈɒptɪmaɪz] vt. 使最优化,使完善 vi. 优化;持乐观态度 {ky :11612}
propagation [ˌprɒpə'ɡeɪʃn] n. 传播;繁殖;增殖 {cet6 gre :12741}
multiplication [ˌmʌltɪplɪˈkeɪʃn] n. [数] 乘法;增加 {cet6 :12748}
personalized [ˈpəːs(ə)n(ə)lʌɪzd] adj. 个性化的;个人化的 v. 个性化(personalize的过去式);个人化 { :13175}
buggy [ˈbʌgi] n. 童车;双轮单座轻马车 adj. 多虫的 {toefl gre :13418}
entropy [ˈentrəpi] n. [热] 熵(热力学函数) { :13494}
blurry [ˈblɜ:ri] adj. 模糊的;污脏的;不清楚的 { :13819}
zebra [ˈzebrə] n. [脊椎] 斑马 adj. 有斑纹的 {zk gk cet4 cet6 ky :13912}
zebras ['zɪbrəz] n. 斑马( zebra的名词复数 ) { :13912}
logistic [lə'dʒɪstɪkl] adj. 后勤学的;[数] 符号逻辑的 { :14538}
dimensional [dɪ'menʃənəl] adj. 空间的;尺寸的 {toefl :15066}
adversarial [ˌædvəˈseəriəl] adj. 对抗的;对手的,敌手的 { :15137}
transferable [trænsˈfɜ:rəbl] adj. 可转让的;[数] 可转移的 { :16039}
APP [æp] abbr. 应用(Application);穿甲试验(Armor Piercing Proof) n. (App)人名;(英)阿普 { :16510}
mu [mju:] n. 希腊语的第12个字母;微米 n. (Mu)人名;(中)茉(广东话·威妥玛) { :16619}
optimization [ˌɒptɪmaɪ'zeɪʃən] n. 最佳化,最优化 {gre :16923}
firewall [ˈfaɪəwɔ:l] n. 防火墙 vt. 用作防火墙 { :17087}
iteration [ˌɪtəˈreɪʃn] n. [数] 迭代;反复;重复 { :17595}
summation [sʌˈmeɪʃn] n. 和;[生理] 总和;合计 {gre :17935}
dataset ['deɪtəset] n. 资料组 { :18096}
permutations [pɜ:mju:'teɪʃnz] n. [数] 排列(permutation的复数) { :18648}
iguana [ɪˈgwɑ:nə] n. 鬣蜥蜴 { :18852}
BOT [bɒt] n. 马蝇幼虫,马蝇 n. (Bot)人名;(俄、荷、罗、匈)博特;(法)博 { :18864}
难点词汇
unsupervised [ˌʌn'sju:pəvaɪzd] adj. 无人监督的;无人管理的 { :19787}
perturbation [ˌpɜ:təˈbeɪʃn] n. [数][天] 摄动;不安;扰乱 { :19948}
rephrased [ri:ˈfreizd] v. 改述,改撰( rephrase的过去式和过去分词 ) { :20756}
rephrase [ˌri:ˈfreɪz] vt. 改述;重新措辞 { :20756}
encrypted [inˈkriptid] v. 把…编码;把…加密(encrypt的过去分词) { :21117}
generative [ˈdʒenərətɪv] adj. 生殖的;生产的;有生殖力的;有生产力的 { :21588}
epsilon [ˈepsɪlɒn] n. 希腊语字母之第五字 { :22651}
annotation [ˌænə'teɪʃn] n. 注释;注解;释文 { :22939}
deterministic [dɪˌtɜ:mɪ'nɪstɪk] adj. 确定性的;命运注定论的 { :23481}
unconstrained [ˌʌnkən'streɪnd] adj. 不勉强的;非强迫的;不受约束的 { :23653}
IM [ ] abbr. 感应电动机(Induction Motor) abbr. 即时通信(Instant Messaging) { :24105}
encoder [ɪn'kəʊdə] n. 编码器;译码器 { :24604}
GE [ʒei] abbr. 美国通用电气公司(General Electric Co.);总能量(gross energy) n. (Ge)人名;(朝)揭;(俄)格 { :25836}
xavier ['zʌvɪə] n. 泽维尔(男子名) { :26299}
embeddings [ɪm'bɛd] v. [医] 植入;埋藏(embed的ing形式) { :27523}
logarithm [ˈlɒgərɪðəm] n. [数] 对数 { :27896}
transferability [ˌtrænsˌfɜ:rə'bɪlətɪ] n. 可转移性;可转让性 { :28407}
scalar [ˈskeɪlə(r)] n. [数] 标量;[数] 数量 adj. 标量的;数量的;梯状的,分等级的 { :28925}
ramen [rɑmən] n. (方便)拉面,拉面 { :29936}
unpaired ['ʌn'peəd] adj. 不成双的;无对手的;无配偶的 { :30070}
scalable ['skeɪləbl] adj. 可攀登的;可去鳞的;可称量的 { :30540}
sigmoid ['sɪgmɔɪd] n. 乙状结肠(等于sigmoidal);S状弯曲 adj. 乙状结肠的;C形的;S形的 { :31478}
doppelganger ['dɒpelɡɑ:nɡər] n. 面貌极相似的人;幽灵 { :31488}
discriminator [dɪ'skrɪmɪneɪtə] n. [电子] 鉴别器;辨别者 { :34448}
classifiers [ ] (classifier 的复数) n. 分类者, 分粒器, 分级机, 汉语中的量词 [计] 分类符, 分类器 { :37807}
classifier [ˈklæsɪfaɪə(r)] n. [测][遥感] 分类器; { :37807}
iterate [ˈɪtəreɪt] vt. 迭代;重复;反复说;重做 {toefl gre :38640}
Goodfellow [ ] [人名] [英格兰人姓氏] 古德费洛绰号,意气相投的伙伴,来源于中世纪英语,含义是“好+伙伴”(good+fellow) { :39174}
initialization [ɪˌnɪʃəlaɪ'zeɪʃn] n. [计] 初始化;赋初值 { :40016}
VER [vɜ:] n. DOS命令:显示DOS版本号 { :45633}
iteratively [ ] [计] 迭代的 { :48568}
生僻词
anonymized [ ] 隐去姓名资料 使匿名
backpropagate [ ] [网络] 反向传播
Coursera [ ] [网络] 免费在线大学课程;免费在线大;斯坦福
crypto ['krɪptəʊ] n. 秘密赞同者;秘密党员
denoise [di:'nɔiz] 降噪, 消除干扰
denoising [ ] [网络] 去噪;去噪声;小波去噪
derivate ['derɪveɪt] n. 导数;派生词;派生的事物 adj. 引出的;系出的
ensembling [ ] [网络] 综合
Frechet [ ] [网络] 弗雷歇(Fréchet);弗雷歇距离
generalizable ['dʒenərəlaɪzəbl] adj. 可概括的,可归纳的
growingly [ ] adv. growing的变形
kurakin [ ] [网络] 库拉金;寄给阿金
linearize ['lɪnɪəraɪz] vt. 使线性化
linearizing ['liniәraiziŋ] 线性化的
logit ['lɒdʒɪt] 分对数
medias [ ] n. (西)丝袜
minibatch [ ] [网络] 小批量
minibatches [ ] [网络] 小批量
minimax ['mɪnəˌmæks] n. 极小化极大,极小极大(极大中的极小),鞍点
oppositive [ә'pɔzitiv] a. 反对的;相反的
outputted ['aʊt.pʊt] n. 产量;产品;【电】发电力;供给量 [网络] 输出;产出;输出量
overfit [ ] [网络] 过拟合;过度拟合;过适应
overfitting [ ] n. 过适;[数] 过度拟合 v. 过适(overfit现在分词)
quizzes [kwiziz] n. 小测验(quiz复数形式);智力比赛 v. 测验;盘问(quiz的第三人称单数形式)
relu [ ] [网络] ReLU 激活函数(线性整流单元)
softmax [ ] [网络] 柔性最大传递函数;前回收的日志文件的百分比;西风狂诗曲系列篇章
tako [ ] [地名] [日本] 多古
takuya [ ] [网络] 拓也;寺田拓哉;卓也
trainable [t'reɪnəbl] adj. 可训练的,可教育的
versa ['vɜ:sə] adj. 反的
wx [ ] abbr. weather 天气; weather report 气象报告; watts second 瓦特秒; waxy 蜡(状)的
zhang [ ] n. 张,章(中国姓氏)
zi [,zi 'aɪ] abbr. 美国本土,后方地带(等于zone of interior)
词组
a batch [ei bætʃ] un. 一批 [网络] 同批产品;罩式炉;详细
a dot [ ] [网络] 阿顿;阿突
a flip [ ] [网络] 翻筋斗
A minus [ ] [网络] A减
an algorithm [ ] [网络] 规则系统;运算程式
and vice versa [ ] [网络] 反之亦然;反过来也一样;科技中的政府
AX (axis) [ ] 线
binary classification [ ] 二元分类
classify as [ ] [网络] 归类为;分类为;出库类型
column vector [ ] un. 列向量;列向和;纵矢量 [网络] 列矢量;行向量;向量了
correlate to [ ] [网络] 使相互关联;相关;与…相互关联
cross entropy [ ] 交叉熵
descent algorithm [ ] 下降算法
door mat [ ] un. 门前擦鞋棕垫;蹭鞋胶垫;门口踏脚垫 [网络] 门垫;门前棕垫;乡村小花脚踏垫
dot product [dɔt ˈprɔdʌkt] un. 点积;标量积 [网络] 点乘;数量积;内积
Epsilon sign [ ] 《英汉医学词典》Epsilon sign 艾泼斯龙征
et al [ ] abbr. 以及其他人,等人
et al. [ˌet ˈæl] adv. 以及其他人;表示还有别的名字省略不提 abbr. 等等(尤置于名称后,源自拉丁文 et alii/alia) [网络] 等人;某某等人;出处
et. al [ ] adv. 以及其他人;用在一个名字后面 [网络] 等;等人;等等
flip that [ ] None
forward propagation [ ] 正向传播
generator output [ ] 电机输出[功率]
gradient descent [ ] n. 梯度下降法 [网络] 梯度递减;梯度下降算法;梯度递减的学习法
gradient descent algorithm [ ] [网络] 梯度下降算法;梯度陡降法;梯度衰减原理
gradient sign [ ] 坡度标
high gradient [ ] 高梯度
in the proximity of [ ] na. 在…附近 [网络] 在...附近
latent code [ ] 隐性分类码
linear network [ ] [网络] 线性网络;线性网路;线性神经网络
linear operation [ ] [网络] 线性运算;它并不是个线性操作;线性演算
linear regression [ ] un. 线性回归;直线回归 [网络] 线性回归分析;线性回归法;线性衰退
logarithm function [ ] n.对数函数
logistic regression [loˈdʒɪstɪk rɪˈɡrɛʃən] n. 逻辑回归 [网络] 吉斯回归;逻辑斯回归;罗吉斯回归
maximum probability [ ] un. 最高概率 [网络] 最大概率;最大机率
minus one [ ] [网络] 桃花源;幸福意外;谢谢你捧场
minus sign [ˈmainəs sain] n. 负号 [网络] 减号;减号的故事;负符号
negative infinity [ ] 负无穷大,负无限大
neural network [ˈnjuərəl ˈnetwə:k] n. 神经网络 [网络] 类神经网路;类神经网络;神经元网络
neural networks [ ] na. 【计】模拟脑神经元网络 [网络] 神经网络;类神经网路;神经网络系统
object detection [ ] [科技] 物体检测
optimization problem [ ] un. 最佳化问题 [网络] 最优化问题;次要最佳化问题
optimization problems [ ] [网络] 最佳化问题;最优化问题;最适化问题
optimization process [ ] un. 最优化过程 [网络] 优化历程;最佳化处理
overlap with [ ] vt.与...相一致
pixel image [ˈpiksəl ˈimidʒ] [医]像素显像
pixel value [ ] [网络] 像素值;像素数值;像素单元值
probability distribution [ ] un. 概率分布 [网络] 机率分布;机率分配;确率分布
probability distributions [ ] [网络] 概率分布;学过几率分布;机会率分布
probability of [ ] na. (飞弹不被击落的)概率 [网络] 变异概率
row vector [ ] un. 行向量;单行矩阵;行矢量 [网络] 列向量;列矢量;列向量使用
salient feature [ ] un. 特征 [网络] 特点;鲜明特征
salient features [ ] na. 特点 [网络] 特征;特色;突出特点
small perturbation [ ] 小微扰
test validation [ ] [网络] 测验效度
the algorithm [ ] [网络] 算法
the ax [ ] 斧子
the downside [ ] [网络] 不利方面;缺点
the equilibrium [ ] [网络] 平衡;那种平静
the FA [ ] [网络] 英格兰足总;英国足球协会;英国足总
the vice [ ] [网络] 罪恶谷
time derivative [ ] 时间导数,时间微商
to compute [ ] [网络] 计算;用计算机计算
to download [ ] 下载
to forge [ ] [网络] 煅炼;稳步前进;假造
to propagate [ ] [网络] 传播;传种;推展
to skip [ ] [网络] 略过;跳越;跳过
to update [ ] [网络] 更新;重要更新公告;每月更新
vice versa [ˌvaɪs ˈvɜ:sə] adv. 反之亦然;反过来也一样 [网络] 小爸爸大儿子;反过来亦然;反过来的
web app [ ] [网络] 网页应用;网络应用;应用程序
惯用语
does that make sense
does that makes sense
i don't know
i mean
if you
in fact
let's say
one question
plus 0
so yeah
you know
you're fooling the network
单词释义末尾数字为词频顺序
zk/中考 gk/中考 ky/考研 cet4/四级 cet6/六级 ielts/雅思 toefl/托福 gre/GRE
* 词汇量测试建议用 testyourvocab.com
