
Hello everyone. Welcome to the second lecture of CS230.

So as I said earlier, uh,

you can go on menti.com, uh,

from your smartphones or your computers,

and enter this code, 845709.

Uh, we will use this tool for interactive questions

during the lecture and we will also use it to track attendance.

Uh, I'll add it at the end of the lecture,

but, uh, if you have time do it now.

[NOISE] Let's start the lecture,

while you guys are doing that.

Okay. So today's lecture is going to be about deep learning intuition,

and the goal is to give you a systematic way to think about projects,

everything related to deep learning.

It includes how to collect your data,

how to label your data,

how to choose an architecture,

but also how to design a proper loss function to optimize.

So all of these decisions are decisions you're going to have to make during your projects.

And we'll try to give you here an overview of,

uh, this systematic way of thinking for different projects.

It's going to be high level,

more than other lectures,

but we hope it gives you a good start for your project.

We will start with a ten-minute recap on what you've seen in the first week about neural networks.

So as you know, you can think of machine learning, and deep learning in general, as modeling a function that takes an input, which can be an image, speech, natural language, or a CSV file, gives it to a box, and gets an output that can be a classification.

Is there a cat in this image, output one, or is there no cat in this image, output zero?

And I think a good way to remember what is a model is to

define it as architecture plus parameters.

Architecture is, uh, the design that you choose.

So logistic regression is the first one you've seen.

You will see shallow neural networks, deep neural networks,

then you will see convolutional neural networks,

and recurrent neural networks.

So these are all types of

architectures and you can choose to make them deeper or shallower.

Parameters are the core parts.

They're the numbers that make your function

take this cat as inputs and convert it to an output.

So these are millions of numbers,

and the goal of machine learning and deep learning

is to find all these numbers.

So we're all, uh,

trying hard to find numbers basically,

millions of numbers in matrices.

If you give it this cat and you forward propagate it through the model, you get an output.

You will have to compare this output to the ground truth.

Uh, the function used to do so is called the loss function.

You've seen an example of a loss function this week.

That is the logistic loss function.

Uh, we will see more loss functions later on.

Uh, computing the gradient of this loss function is going to tell you how much you should move your parameters in order to make the loss go down, so in order to make this function recognize cats better than before.

We do that many, many times, until you find the right parameters to plug into your architecture; you can then give it your cat and get an output.

What is very interesting in deep learning is that many things can change.

You can change the input.

We talked about natural language speech,

structured and unstructured data in general.

You can change the output: it can be a classification algorithm, it can be a multi-class algorithm.

I can ask you, give me the breed of the cats,

instead of asking you give me just the cat,

which makes the problem more complicated.

It can also be a regression problem.

I, I give you the cat and I ask you give me the age of the cat,

which is much more complicated again. Does that make sense?

Okay. Another thing that can change is the architecture,

we talked about it earlier.

And finally, the loss function.

I think the loss function is something that people struggle with, understanding which loss function to choose for a specific project, and we're going to put a huge emphasis on that today.

Okay. And, of course,

in the architecture you can change the activation functions,

in this optimization loop you can choose a specific optimizer.

We're going to see, in about three weeks, all the optimizers: Adam, stochastic gradient descent, batch gradient descent, RMSprop, and momentum.

And finally, all the hyperparameters.

What is the learning rate of this loop?

What is the batch that I'm using for my optimization?

We are going to see all that together,

but there's a bunch of things that can change in this scheme.

Any questions on that, in general?

So far so good.

Okay. So let's take the first architecture that we've seen together, Logistic Regression.

As you know, an image in computer science can be represented by a 3D matrix.

Each matrix represents a certain color.

RGB, red, green, blue.

We can take all these numbers from this 3D matrix and put them in a vector.

We flatten it in order to give it to our logistic regression.

We forward propagate it.

We multiply it by w, which is our parameter, add b, which is our bias, give it to the sigmoid function, and get an output.

If the network is trained properly,

we should get a number that is more than 0.5

here to tell us that there is a cat in this image.

So this is the basic scheme.
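As a rough illustration of this forward pass, here is a minimal sketch in Python; the shapes and random values are placeholders, not a trained model:

    import numpy as np

    def sigmoid(z):
        # Squash any real number into (0, 1) so it can be read as a probability.
        return 1 / (1 + np.exp(-z))

    # Placeholder shapes: a 64x64 RGB image flattened into a column vector.
    x = np.random.rand(64 * 64 * 3, 1)          # flattened pixel values in [0, 1]
    w = np.random.randn(1, 64 * 64 * 3) * 0.01  # weights of the single output neuron
    b = 0.0                                     # bias

    # Forward propagation of logistic regression: y_hat = sigmoid(w x + b)
    y_hat = sigmoid(w @ x + b)
    print("P(cat) =", y_hat.item())             # above 0.5 would be read as "cat"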

Now, uh, my question for you is,

if I want to do the same thing but,

uh, I want to have a classifier that can classify several animals.

So on the image there could be a giraffe,

there could be an elephant or there could be a cat.

How would you modify this architecture?

Yes?

[NOISE] [inaudible]

Yes, exactly. So that's a good point.

We could add several units.

So several neurons, one for each animal and we will call it, multi-logistic regression.

So it could be something like that.

So we have a full connection here; before, all the inputs were connected to this one neuron, and now we added two neurons.

And each neuron is going to be responsible for one animal.

How do we know which neuron is responsible for which animal?

Is the network going to figure it out on its own,

or do we have to help it?

[NOISE] [inaudible]

Exactly, the label is important.

So what is going to tell your model this neuron should focus on cat,

this neuron should focus on elephant,

this neuron should focus on giraffe?

It's the way you label your data.

So how should we label this data now, if we were to do this specific task?

Any ideas? Yeah.

Uh, [NOISE] One-hot vector.

One-hot vector. Okay. So one-hot vector means a vector with all zeros and a single one. Any other ideas?

[NOISE] One, two, three.

[NOISE] One, two, three.

So I assume you're saying that each integer would correspond to a certain animal [NOISE]?

Okay. Any other ideas?

Modifying the loss function.

Modifying the loss function.

You mean, you want to put more weight on one animal,

so you modify the loss function?

Or what exactly - [NOISE]

It was more like towards the one-hot encoding, but [inaudible]

I see, with the one-hot encoding. So I agree with the one-hot encoding.

I think there's a downside to the one-hot encoding.

What is the downside of the one-hot encoding?

[NOISE] [inaudible]

Yes. So you're saying that if we have a lot of animals, the labels contain mostly zeros and a single one, so there's a huge imbalance there [NOISE].

I don't think that's an issue because these neurons are

independent from each other right now.

So yeah it, it could run into an issue if you have,

uh, you have really a lot of animals, that's true.

But there is another problem with it.

The problem is: if you one-hot encode your labels, do you think you would be able to detect an image with both a giraffe and an elephant in it?

You will not be able to do so.

You need a multi-hot encoding.

So in this case, if there is a cat in the image, I would use a one-hot: I would say zero, one, zero as my label. But if I have a dog and a cat in the image, I would say one, one, zero.

Okay. The one-hot encoding works very well when you

have the constraint of having only one animal per image.

And in this case, you would not use an activation function called Sigmoid,

you would use another one, which is?

[NOISE] Softmax.

Softmax, yeah. The Softmax function,

we're going to see it together.

And for those of you who took 229,

you've probably heard of it.
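To make the labeling choice concrete, here is a small Python sketch of one-hot versus multi-hot labels and of the softmax versus independent sigmoids they pair with; the class order and scores are made up for illustration:

    import numpy as np

    # Class order assumed for illustration: [cat, elephant, giraffe].
    y_one_hot   = np.array([0, 1, 0])  # exactly one animal per image -> use with softmax
    y_multi_hot = np.array([1, 0, 1])  # a cat AND a giraffe -> use with independent sigmoids

    def softmax(z):
        # Turns scores into probabilities that sum to 1 (mutually exclusive classes).
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def sigmoid(z):
        # Each output is an independent probability between 0 and 1.
        return 1 / (1 + np.exp(-z))

    scores = np.array([2.0, 0.5, -1.0])  # hypothetical raw outputs of the three neurons
    print(softmax(scores))               # sums to 1: pick the single most likely animal
    print(sigmoid(scores))               # each in (0, 1): several animals can be "on"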

Okay. So what I wanted to explain here is,

the way you choose your labeling is very important, and it's a decision you should make prior to starting the project.

Okay. In terms of notation, in this class we're going to use the following. A square bracket one will denote all the activations of the first layer. So the square bracket denotes the layer, and the subscript denotes the index of the neuron in the layer.

Okay? And of course you can stack these neurons on top of each other to make the,

the network more complex,

depending on the task you're solving.

Okay. Now, the concept I wanted to introduce in this recap was the concept of encoding.

Uh, you probably - some of you have probably seen this image before.

If you have, uh, a network that is not too shallow, you will notice that what the first neurons see are very precise representations of the data. So these are pixel-level representations of the data.

X3i is probably, one of the three channels of the 3D matrix, just one number.

So what this neuron sees,

is going to be a pixel level representation of the image.

Okay? What this neuron sees, the second layer,

the one in the hidden layer,

is going to see the representation outputted by all the neurons in the first layer.

These are going to be more high level, more complex.

Because the first neurons will see pixels,

they are going to output a little more detailed information,

like, I found an edge here,

I found an edge there, and so on.

Give it to the second layer.

The second layer is going to see

more complex information and is going to give it to the third layer,

which is going to assemble some high level complex features that could be eyes,

nose, mouth, depending on what network you've been training.

So this is an extraction of what's

happening in each layer when the network was trained on,

uh, face recognition. Yes.

Um, doesn't this only apply to [inaudible] [NOISE] networks,

because the combination [NOISE] [inaudible] does not necessarily, uh, [inaudible] .

Yeah, yeah, yeah. So here I gave you a fully-connected network, but that's true. These types of visuals are more often observed in convolutional neural networks because these are filters, but this happens in this type of network as well; it's just harder to visualize.

Okay. So, this is what we call an encoding.

It means if I extract the information from this layer,

so all the numbers that are coming out of these edges,

I extract them, I will have a complex representation of my input data.

If I extract the numbers that are at the end of the first layer,

I will have a lower level representation of my data.

That might be edges, okay?

We're going to use this encoding,

ah, throughout this lecture.
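Here is a minimal, untrained Python sketch of what "extracting an encoding" means in practice: read off the activation vector of some layer. The layer sizes and random weights are placeholders:

    import numpy as np

    def relu(z):
        return np.maximum(0, z)

    # A tiny, untrained fully-connected network; weights are random placeholders.
    x = np.random.rand(64 * 64 * 3, 1)                 # flattened input image
    W1, b1 = np.random.randn(100, 64 * 64 * 3) * 0.01, np.zeros((100, 1))
    W2, b2 = np.random.randn(20, 100) * 0.01, np.zeros((20, 1))

    a1 = relu(W1 @ x + b1)   # low-level encoding (closer to pixels and edges)
    a2 = relu(W2 @ a1 + b2)  # higher-level encoding (more abstract features)

    # "Extracting the encoding" simply means reading off one of these activation vectors.
    print(a1.shape, a2.shape)   # (100, 1) (20, 1)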

Any questions on that?

Okay. So let's build intuition on concrete applications.

We're going to start, ah, with a short warm-up with the Day'n'Night classification,

and then quickly move to Face verification and Face recognition.

And after that, we'll do some Art generation and finish with a Trigger-word detection.

If we have time, we'll talk about how to ship a model, which is shipping architecture plus parameters, okay,

with an emphasis, as I said,

on the architecture, the loss,

the training strategy, to help you make decisions during your project.

[NOISE] So, let's start with the first game.

[NOISE] Ah, we're given an image and we have to

build a network that tells us if the image is taken during the day,

label zero, or was taken at night, label one.

[NOISE] So, first question is,

what dataset do we need to collect?

Images that are captured.

Um?

Images that are captured during the day and during the night, and they're labeled.

Okay. Labeled images captured during the day and during the night.

I agree, though probably,

oh, yeah, let me ask the question. How many images?

[LAUGHTER] That was wrong, actually.

[LAUGHTER] How many images,

like how do you get this number?

[NOISE] Can someone give

me an estimate of how many images you need in order to solve this problem,

and explain how you get this estimate.

A number that's similar to the number of parameters.

You're saying a number similar to the number of parameters that you have in the network?

Yeah.

So I think it's better to think of it in the other way around.

The network comes after,

so you, right now, you don't know what networks you will use.

So you cannot decide the number of data points based on your parameters.

Later on, based on how your network is flexible,

you can add more data,

and a - ah, that's probably what you meant.

But, at first, you want to get,

you want to get the number. Yeah.

More o - more images than pixels within an image?

More images than pixels within an image.

Ah, I don't think that has anything to do with the pixels within an image.

You can have a very simple task, like,

you have only images that are red and green,

and you want to classify red and green.

[NOISE] The image can be giant,

you can have a lot of pixels, it's not gonna change the number of data points you need.

Maybe images that have computation resources [inaudible]?

Okay. So, you're talking about computation resources,

so m - the more images we have,

probably the more computation resources we will need, is that what you mean?

Yeah, there is something like that.

I think, in general, ah,

you want to try to gauge the complexity of the task.

So, let's say, we did a problem that was cat recognition.

Detect if there is a cat on an image or not.

In this problem, we remember that with 10,000 images,

we managed to train a pretty good classifier.

How do you compare this problem to the cat problem?

You think it's easier or harder?

I think it's easier.

Easier. Yeah, I agree. That's probably easier.

[NOISE] So in terms of complexity, this task looks less complex than the cat recognition task, so you would probably need less data.

That's a rule of thumb. The second rule of thumb and why I get to this image is,

what do we exactly want to do?

Do we want to classify pictures that were taken outside,

which seems even easier?

Or do we want also the network to classify complicated pictures?

What, what do I mean by complicated pictures?

Inside your house. Um?

Inside your house.

So like, let's say, on a picture you have a window on the right side.

A human would be able to say it's the day because I see the window,

but for the network, it's going to take much longer to learn that,

much longer than for pictures taken outside.

What else? What are other complicated, okay, in the back.

Uh, like dawn or twilights or edges, um -

Dawn, twilight, sunrise, sunset, in general?

It's complicated because you have to define it and you have to teach your network what,

what does that mean, is it night or day.

Okay. So, depending on what task you want to solve,

it's going to tell you if you need more data or less data.

I think, for this task,

if you take outside pictures,

10,000 images is going to be enough,

but if you want the network to detect indoor as well,

you probably need 100,000 images or something.

And this is based on comparing with projects you did in the past,

so it's gonna come with experience.

Now, as you know, when you have a dataset,

you need to split it between train,

validation, and test sets.

Some of you have heard that.

We are going to see it together even more.

You need to train your network on a specific set and test it on another one.

How do you think you should split these 10,000 images?

Um? 50-50 between train and test?

80-20.

80-20? I think we go more towards 80-20, because the test set is made to analyze whether your network is doing well on real-world data or not.

I think 2,000 images is enough to get that sense, probably,

and you want to put complicated examples in this dataset as well,

so I would go towards 80-20.

And the bigger the datasets,

the more I would put in the train set,

so if I have one million images,

I would put even more like, 98 percent, maybe,

in the train set, and two percent to test my model, okay?
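A minimal Python sketch of such a split, with the 80/20 and file names as placeholder values:

    import random

    def split_dataset(examples, train_fraction=0.8, seed=0):
        # Shuffle, then cut: e.g. 80% train / 20% test for ~10,000 images,
        # or 98% / 2% when the dataset is much larger (e.g. one million images).
        rng = random.Random(seed)
        examples = examples[:]
        rng.shuffle(examples)
        cut = int(len(examples) * train_fraction)
        return examples[:cut], examples[cut:]

    paths = ["img_%d.jpg" % i for i in range(10000)]   # placeholder file names
    train, test = split_dataset(paths, train_fraction=0.8)
    print(len(train), len(test))   # 8000 2000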

Now, I wrote bias here.

What do I mean by bias?

You just have a correct, like, balance between classes.

Yes. You need a correct balance between classes.

You don't want to give 9,000 dark images and 1,000 day images.

You want a balance between these two to teach your network to recognize both classes.

Okay. What should be the input of your network?

Um? The pixel image.

Yeah. So, this is an example of a pixel image.

It's the Louvre Museum during the day.

[NOISE] Harder question.

What should be the resolution of this image, and why do we care?

The more resolution [inaudible] [NOISE]

Okay. That's great. So, you said,

let me repeat for SCPD students as well,

as low as you can,

in order to achieve good results.

Why do we want low resolution?

It's because in terms of computation, it's going to be better.

Remember, if I have a 32 by 32 image, how many pixels are there?

If it's color, I have 32 times 32 times three.

If I have 400 by 400,

I have 400 by 400 by three. It's a lot more.

So I want to minimize the resolution in order to

still be able to achieve good performance.

So what does it mean to still achieve good performance?

How do I get this number?

I'd continue with a similar resolution as opposed to the, uh, partial [inaudible].

Okay. Similar resolution as you expect the algorithm in real life to work on?

Yeah. Probably, I agree. What else?

What other rule of thumb can you use in order to choose this resolution?

Perhaps, um, we compare it to the performance of the [inaudible] we can tell if it's there [inaudible].

Yeah.

Great idea. Compare to human performance.

So what I do, so there is one way to do it,

which is the brute force way, I would say.

We will train models on different resolutions and then compare the results,

or you can be smart and use human performance as a comparison.

So I would print this image or

several images like these in different resolutions on paper.

And I would go see humans and say classify those,

classify those, and classify those.

And I would compare human performance on all these three types of resolution,

in order to decide what's the minimum resolution that I can use,

in order to get perfect human performance.

So by doing that, I got that 64 by 64 by three was enough resolution,

for a human, to detect if an image is taken during the day or during the night.

And this is a pretty small resolution in imaging,

but it seems like a small, like an easy task.

If you have to find the breed of a cat, you probably need more, because some cats look very alike, and you need a high resolution to distinguish them, and maybe training for the human as well.

I know only three breeds of cats so I wouldn't be able to do it anyway.

What should be the output of the model?

Labels about the image.

Labels, so Y equals zero for day,

Y equal one for night. I agree.

What should be the last activation of the network?

[NOISE] The last function?

Sigmoid.

Sigmoid. We saw that Sigmoid takes a number between minus infinity and plus infinity and puts it between zero and one so that we can interpret it as a probability.

What architecture would you use?

Fully-connected or convolutional.

Fully-connected or convolutional. I think,

later this quarter, you will see that convolutionals perform well in imaging,

so we would directly use a convolutional,

but I think a shallow network,

fully-connected or convolutional, would do the job pretty well.

You don't need a deep network because you gauge the complexity of this task.

[NOISE] And what should be the loss function, finally?

[NOISE]

It could be, um, a maximum likelihood function like, uh, log-likelihood.

Yeah. So, the log-likelihood. So, it's also called the logistic loss, that's the one you're talking about [NOISE].

So, the way you get this number and you'll prove it in CS 229.

We're not going to prove it here.

But basically, you interpret your data in a probabilistic way and you

take the maximum likelihood estimation of the data which gives you this formula,

for those of you who did the math behind.

You can ask in office hours,

TA is going to help you understand it more properly.

Okay. And of course,

this means that if y equals zero,

we want y hat the prediction to be close to zero.

If y equal one we want y hat the prediction to be close to one.
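As a small numerical check of that behavior, here is a sketch of the logistic loss in Python (the example predictions are made up):

    import numpy as np

    def logistic_loss(y, y_hat, eps=1e-12):
        # L(y, y_hat) = -( y * log(y_hat) + (1 - y) * log(1 - y_hat) )
        y_hat = np.clip(y_hat, eps, 1 - eps)   # avoid log(0)
        return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

    print(logistic_loss(1, 0.9))   # small loss: prediction close to the label
    print(logistic_loss(1, 0.1))   # large loss: prediction far from the label
    print(logistic_loss(0, 0.1))   # small loss again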

Okay. So, this was the warm up.

Now we're going to delve into Face verification.

Any question on day and night classification? Yes.

You said that you increase the data without the percentage that

changes so you have

a kind of [inaudible].

So, your - the question is about how you choose

the size of the test set versus the train set.

In general, you would first ask: how many images or data points do I need in order to be able to understand what my model does in the real world?

This can depend on the task.

Like if I talk about - if I - if I tell you about speech recognition,

you want to figure out if your model is doing well for all accents in the world.

So, your test set might be very big and very distributed.

In this case, you might have a few examples that are during the day,

few during the night and a few at dawn,

on sunset, sunrise and also indoor.

Three of those is going to give you a number.

So, there's no good number.

There is like you have to gauge it.

Okay one more question.

How do you choose that loss function [inaudible]?

Yeah, that's a good question. So, how do you choose the loss function?

We're going to see in the next, uh,

in the next slides how to choose loss functions but for this one specifically,

you choose this one because it's a convex function for a classification problem.

It's easier to optimize than other loss functions.

So, there is a proof but - but I will not go over it here.

If you know the L1 loss, which compares y to y hat, this one is harder to optimize for a classification problem; we would use it for regression problems.

Okay. [NOISE] So, our new game is: the school wants to use face verification to validate student IDs in facilities like the gym.

So, you know, when you enter the gym,

you swipe your ID and then, uh,

I guess the person sees your face on the screen based on

this ID and looks at your face in real and comparison let's say.

So, now we want to put a camera and have you swipe and

the camera is going to compare this image to

the image in the database. Does that make sense?

To let you in or not. So, what's -

what dataset do we need to solve this problem? What should we collect?

Yeah. Okay. Between the ID and the image.

Yeah, so probably schools have databases because when you enter

the school you submit your image and you also are given a card, an ID.

So, you have this mapping.

Okay. What else do we need?

So, pictures of every student labeled with their names, that's what you say.

So, this is a picture of Bertrand.

This is a picture when he was younger.

And that's the one he gave to the school when he arrived.

What should be the input of our model? Is it this picture?

More photos of him.

More photos of him.

More photos of him. I'm asking just like the input of the model.

Like we probably need more photos of him as well but

what's - what's going to be the image we give to the model?

Exactly the person standing for verification.

Exactly, the person standing in front of the camera when entering the gym.

So, this is the entrance of the gym

and Bertrand is trying to enter the gym. So, it's him.

Okay. What should be the resolution?

Those of you who have done projects in imaging,

what do you think should be the resolution?

256 by 256.

256 by 256, any other idea more precisely.

I think in general [NOISE] you will go over 400,

so 400 by 400.

What's the reason? Why do we need

64 for - for day and night and 400 for face verification?

The video takes different shapes.

Yeah. There's more details to detect.

So, like distance between the eyes probably,

size of the nose, mouth,

uh, general - general features of the face.

These are harder to detect for a 64 by 64 image.

And you can test it, you can go outside and show

two pictures of people that look like each

other and ask people can you differentiate those two person or not.

And you'll see that with less than that sometimes it's - people are struggling.

Is color important?

Is color important. That's a good question.

We should have talked about it in day and night actually.

Is color important. Because if you remove the color,

you basically divide by three the number of pixels, right?

So, if we could do it without color,

we would do it without color.

In this case, color is going to be important because, uh,

probably you want your camera to work in, uh,

different settings, day and night as well.

So, the luminosity is different,

the brightness and also we all have

different colors and we need to all be detected, compared to each other.

Yeah. I might go somewhere in an island and come back, uh,

you know, full of color but,

uh, but I still want to be able to access the gym.

Uh, output. What should be the output?

The question on the resolution, is that a minimum resolution or is that like a -

I think if you had unlimited computational power, you would take more resolution, but that's a trade-off between computation and resolution.

So, output is going to be one,

if it's you and zero if it's not you in which case they will not let you in.

Okay. Now, uh, the question is what architecture should we use to solve

this problem now that we collected

the data set of mapping between student IDs and images.

The question is how do you know how many images you need to train the network -

The question is - [OVERLAPPING] [inaudible].

How do you know how many,

many images you need to train the network.

You don't know, you can find an estimate.

It's going to depend on your architecture.

But in general, uh,

the more complex a task, the more data you will need.

And we will see something called error analysis in

about four weeks which is once your network works,

you're going to give it a lot of examples.

Detect which examples are misclassified by

your network and you're going to add more of these in the training set.

So, you're going to boost your datasets.

Okay. Talking about the architecture.

If I ask you, what's the easiest way to compare two images, what would you do?

Like these two images,

the database image and the input image.

Some sort of hash.

Some sort of hash, what do you mean by that.

Taking the input run, uh,

set a specific function on it and then there.

Okay. Take this, run it through a specific function, take this, run it through a specific function, and compare the two values.

That's great. That's a good idea.

And the more basic one is just compute the distance, uh, between the pixels.

Just compute the distance between the pixels and you get if it's the same person or not.

Unfortunately, it doesn't work and

a few reasons are the background lighting can be different.

And so if I do this minus this,

this pixel which is let's say dark is going to have a value of zero,

this pixel which is white is going to have a value of 255,

the distance is gigantic but it's still the same person.

That's a problem. A person can wear makeup, can grow a beard, can be younger in a picture; the ID can be outdated.

So, it doesn't work to just compare these two pictures together; we need to find a function that we will apply to these two images and that will give us a better representation of the image.

So, that's what we're going to do now.

What we're going to do is encode information about the picture, using the encoding that we talked about, into a vector. So, we want a vector that would represent features like distance between eyes, nose, mouth, color, hair, all this type of stuff, in a vector.

So, on the left is the picture of Bertrand from the ID. We will run it through a network and hopefully we can find a good encoding from this network. Then we will take the picture of Bertrand at the facility, run it through the deep network, and get another vector. And hopefully, if we train the network properly, these two vectors should be close to each other.

Let's say we have a threshold that is 0.5, and 0.4 is the distance between these two. It's less than the threshold, so I would say Bertrand is the right person. It's you. Does this scheme make sense?
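A minimal Python sketch of this comparison step; the 128-dimensional encodings and the 0.5 threshold are placeholder values rather than outputs of a real trained network:

    import numpy as np

    def verify(enc_id, enc_camera, threshold=0.5):
        # Compare the encoding of the ID photo with the encoding of the camera photo.
        distance = np.linalg.norm(enc_id - enc_camera)
        return distance < threshold, distance

    # Hypothetical 128-d encodings; in practice they come out of the trained network.
    enc_id = np.random.rand(128)
    enc_camera = enc_id + 0.01 * np.random.randn(128)   # same person, slightly different photo

    same_person, d = verify(enc_id, enc_camera)
    print(same_person, d)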

What does the 128-d represent?

What does the 128-d vector represent?

The real question is can I say that the third entry corresponds to something specific?

It's complicated to say but depending on

what network you choose and the training process you choose,

it will give you a different network, a different vector.

So, that's what we're going to talk about now.

The question is how do I know that this vector is good?

Like right now, if I take a random network,

I give my image to it, it's going to output a random vector.

This vector is not going to contain any useful information.

I want to make sure that this information is useful and that's

how I will design my loss function.

Okay. So, just to recap, we gather all the students' face encodings in a database. Once we have this, given a new picture, we compute the distance between the new picture and all the vectors in the database to see if we find a match. Oh sorry, here we compare the vector of the input image with the vector corresponding to the ID image.

Okay. Now, talking about the loss and the training, to figure out whether this vector corresponds to something meaningful. First, we need more data, because we need our model to understand the features of the face in general, and a university that has 1,000 students is probably not going to be enough; 1,000 images are not enough to push a model to understand all the features of the face.

Instead we will go online find open datasets with millions of pictures of faces

and help the model learn from

these faces to then use it inside the facility. There was a question in the back.

Why couldn't [inaudible] work out like we did with the, like the -

[inaudible] but every student is uh, one?

That's another option. So the question is: why can't we continue with the one-hot encoding?

We could build a classifier that has n output neurons, n corresponding to the number of students in the school, and you take an image, you run it through the network, and it's going to tell you which student it is.

What's the issue with that?

Every year students enter the school, so you would have to modify your network every year, because you have more students and you need a larger output vector. We don't want to retrain our network all the time.

Okay, so what we really want, if we want to put it in words, is, oh, there's a mistake here. What we really want is: if I give you two pictures of the same person, I want a similar encoding, I want the vectors to be similar.

If I give you two pictures of different persons, I want different encodings, I want the vectors to be very different, and we are going to

rely on these two assumptions and these two thoughts in order to generate uh,

our loss function by giving it triplets,

triplets means three pictures: one that we call anchor,

that is the person, a person,

one that we call positive,

that is the same person as the anchor but a different picture of

that person and the third one that we call negative,

that is a picture of someone else.

And now what we want to do is to minimize the encoding distance between the anchor and the positive, and maximize the encoding distance between the anchor and the negative. Do these two thoughts make sense?

So now my question for you is,

what should be the loss function?

What should be the loss function,

so please go on Menti and enter the code, and there are three options here, A, B, and C. Choose which of these you think should be the right loss function to use for this problem.

Uh, you have it on your phone as well, like issue, yeah,

it's small on the screen but you can see it on, on its cutoff?

It's better here? [NOISE]

We can't see the URL [inaudible]. It's too small.

[NOISE]

845709, can you see it on your phone?

So by Enc of A,

I mean the encoding vector of the anchor,

by Enc of P, I mean the encoding vector of

the positive image after you run them through the network.

[NOISE]

Okay 30 more seconds.

[NOISE]

Okay.

I - 20 more seconds.

Okay let's see what we have.

Okay. So, two-thirds of the people think that it's the first answer, A, so I'll read it for everyone:

the loss is equal to the L2 distance between the encoding of A and the encoding of

P minus the L2 distance between the encoding of A and the encoding of N. So,

someone who has answered this,

do you wanna give uh, an explanation? Yes.

We're trying to minimize the first difference, between A and the positive, and we're trying to maximize the difference between A and the negative, so we subtract, so the [inaudible].

Yes, that's correct.

So I repeat what you said [NOISE] for [inaudible] students. We want to maximize the distance between the encoding of A and the encoding of the negative,

that's why we have the minus sign here,

because we want the loss to go down and to go down we put a minus sign and we

maximize this term and on the other hand

we wanna minimize the other term because it's a positive term,

okay, so I agree with the answer.

Okay, that was the first time you use this tool,

it's gonna be quicker next time.

Okay, so we have figured out what the loss function should be.

Now that we designed our loss function, we're able to use an optimization algorithm: run an image through the network, sorry, run three images through the network, like that.

Gets three outputs encoding of A,

encoding of P, encoding of N,

compute the loss, take the gradient of the loss

and update the parameters in order to minimize the loss.

Hopefully after doing that many times we would get an encoding that

represents features of the face because

the network will have to figure out who are the same people,

who are different people.

Does it make sense? This is called the triplet loss.

And I cheated a little bit in the,

in the quiz, I didn't write this alpha.

The true loss function contains a small alpha, you know why?

Yes?

So we don't have negative loss?

[NOISE] Yeah, that's not exactly the role of the alpha. In order to not have a negative loss, what you can do is take the maximum of the loss and zero and train on that, but there is another reason why we have this alpha.

Yes?

[inaudible] to have uh, difference between like false

negative and false positive like which one do you prefer?

Which one do you prefer based on false negatives and false positives; no, it's not about that.

So sometimes you have an alpha in loss function to put a weight on

some classes but this is an additional alpha,

it's not a multiplicative alpha.

So, it has nothing to do with that. Yeah?

To penalize large weight.

To penalize [NOISE] large weight,

so you're talking about [NOISE] penalization.

If we had weights in

this formula next to the alpha like alpha times the norm of the weights,

this would be regularization,

but here this term doesn't penalize weight.

[inaudible].

It's not going to affect the gradient, it's not going to affect the weights, but the reason we have it here is because, let's say, the encoding function is just the zero function.

What we are going to have is that the encoding of A equals zero, so it's zero minus zero here and zero minus zero there, and so we will have basically a perfect loss of zero, and we still didn't train our network; we just learned the null function.

So this alpha is called the margin, and it pushes your network to learn something meaningful instead of stabilizing itself on zeros. Okay?

[NOISE] [inaudible]?

Yeah, so it also has to do with the initialization, but we didn't talk about initialization yet; we only saw zero initialization together, I think.

Another way to avoid the network becoming stable on zero is to change the initialization scheme, and in two weeks we're going to see different initialization schemes together.

[NOISE] Yeah?

[inaudible]. [NOISE]

So, the question is how do we know that

this network is going to be robust to rotations of the image,

or scaling of the image,

or translation of the image?

We know it because, in the dataset, we are going to give, let's say, your picture and your picture scaled, and we're going to tell the network this is the same person. So, the network will have to learn that scaling doesn't mean it's not the same person. It has to learn this feature. Okay. One more question

and then we move on. In front, yes.

So why is it starting at zero a problem?

Can't we just make it negatives loss value?

Yeah. That's a good question. Why is it a problem to stabilize at zero? It's because it's common to keep the loss function positive,

and in the paper that you can find, this FaceNet paper,

they don't train exactly this loss,

they train the maximum of this loss and zero.

Yeah. Okay. So you train and you get the right function.
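Putting the pieces together, here is a minimal Python sketch of the triplet loss as just discussed, with the margin and the max with zero; the encoding size and the value of alpha are placeholders:

    import numpy as np

    def triplet_loss(enc_a, enc_p, enc_n, alpha=0.2):
        # L = max( ||Enc(A) - Enc(P)||^2 - ||Enc(A) - Enc(N)||^2 + alpha , 0 )
        # alpha is the margin that rules out the trivial all-zero encoding,
        # and the max with zero keeps the loss positive, as in the FaceNet paper.
        pos = np.sum((enc_a - enc_p) ** 2)
        neg = np.sum((enc_a - enc_n) ** 2)
        return max(pos - neg + alpha, 0.0)

    # Hypothetical 128-d encodings of anchor, positive, and negative images.
    a = np.random.rand(128)
    p = a + 0.05 * np.random.randn(128)   # same person, different picture
    n = np.random.rand(128)               # different person
    print(triplet_loss(a, p, n))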

Now, let's make the problem a little more complicated.

What we did so far was face verification,

we're going to do face recognition.

What's the difference? The difference is there is no more ID.

So now you just have a camera in the facility,

you enter, the camera looks at you and find you.

How would you design this new network?

Yes, in the back.

[inaudible] you've added in an element now of detection as well, because before you'd sort of stand in front of it and it knew that every picture had a face; now it needs to detect the face.

Okay. So you're saying maybe we need to add an element to the pipeline that is a detection element.

That's true in general for face recognition.

Uh, let's say you have a picture that is quite big. You want to use a first network that identifies the face, finds it in the picture, detects it, and then crops the face and gives it to another network. That's true.

That could also be used in verification as well. Yeah.

[inaudible] because they are

taking more and more time to go through all the faces in your database.

Great. So the difference maybe with what you're saying is

maybe we can use a verification algorithm that you've trained.

But instead of looking one-to-one comparison we look at one to N comparison.

So we have the pictures of all the students in the database.

What we can do is run all these database pictures in the model,

get a vector that represents them,

right? We get the vectors.

Now, you enter the facility,

we get your picture, we run it through the model,

we get your vector and we can compare this vector to

all the vectors in the database to identify you.

What's the complexity of this?

It's the number of students: for every prediction, you have to go over the whole database. And a common model that you can use to do that is K-Nearest Neighbors.

So, of course, if you have only one picture per student, it's not going to be very precise. But if you collect three pictures per student and you run a two-nearest-neighbors algorithm, it will decide that if the two nearest pictures are of the same person, it's likely that this person is the same as the person in those pictures.
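A minimal Python sketch of that nearest-neighbor lookup over the database of encodings; the names, encodings, and k are placeholders:

    import numpy as np

    def recognize(query_enc, database, k=2):
        # database: list of (student_name, encoding). A linear scan means one
        # comparison per entry, so the cost grows with the number of students.
        ranked = sorted(database, key=lambda item: np.linalg.norm(item[1] - query_enc))
        top_k = [name for name, _ in ranked[:k]]
        # Majority vote among the k nearest encodings.
        return max(set(top_k), key=top_k.count)

    database = [("alice", np.random.rand(128)) for _ in range(3)] + \
               [("bob", np.random.rand(128)) for _ in range(3)]
    print(recognize(np.random.rand(128), database, k=2))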

Okay? Now, let's make it a little more complicated.

You probably saw that on your,

on your phones, uh,

sometimes you take a picture and it recognizes that it's,

uh, your grandmother or your grandfather or your mother and father.

Uh, what's happening behind is that there's some clustering happening.

It means we have a bunch of images and we wanna cluster them together.

So this is also another algorithm that you see in CS229 and CS229A,

which is K-Means algorithm.

And this is a clustering algorithm: we take all the vectors that we have in the database. Let's say, sorry, you have your phone, you have thousands of pictures of, let's say, 20 different people. What you want is to cluster all the pictures of the same person separately. What you will do is encode all the pictures into vectors, and then you will run a clustering algorithm like K-means in order to cluster those into groups.

These are the vectors that look like each other,

these are the vectors that look like each other.

Okay? And then you can simply give

folders to the users with all the pictures of your mom,

all the pictures of your dad and so on.

How to, uh, define the K in this case.

Sometimes like obviously all the people [inaudible].

Good question. How - how do you define the K?

So someone has an idea actually.

[inaudible].

Yeah. So one way is to, as you said, try different values, train your clustering algorithm, and look at how small a certain loss you defined is. There's actually an algorithm called X-means that is used; you might search for that if you want to find the K. There is also a method called the Elbow Method that you might want to search for as well to figure out the K. Okay.
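A rough Python sketch of K-means over face encodings, with a simple loop over several K values in the spirit of the elbow method; the data, K values, and iteration count are placeholders:

    import numpy as np

    def kmeans(encodings, k, n_iters=50, seed=0):
        # Plain K-means on face encodings: assign each point to the nearest centroid,
        # then recompute each centroid as the mean of its points, and repeat.
        rng = np.random.default_rng(seed)
        centroids = encodings[rng.choice(len(encodings), k, replace=False)]
        for _ in range(n_iters):
            dists = np.linalg.norm(encodings[:, None] - centroids[None, :], axis=2)
            labels = dists.argmin(axis=1)
            centroids = np.array([encodings[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        loss = float(np.sum((encodings - centroids[labels]) ** 2))
        return labels, loss

    X = np.random.rand(200, 128)   # hypothetical encodings of 200 photos
    # Elbow method: try several K and look where the loss stops dropping sharply.
    for k in (2, 5, 10, 20):
        print(k, kmeans(X, k)[1])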

And, as you said, maybe we need just to detect

the face first and then crop and give it to the algorithm.

One more question on face verification and recognition.

So would you also use the,

like factor of [inaudible].

Sorry, can you - can you repeat louder?

Do you also need to use that vector that you trained for [inaudible]?

Do you need to use the vector that you trained for classification?

Um, sorry, I do, I do not understand.

So you mean could -

Yeah. So is the vector after you've changed the [inaudible]?

Oh, so where is the encoding coming from?

That's what you mean in, in the network?

Yeah.

Okay. Good question.

So you have a deep network and you want to decide where you should take the encoding from.

In this case, the more complex the task,

the deeper you would go.

But for face verification,

what you want and you know it as a human,

you want to know features like, uh,

distance between eyes, nose and stuff,

and so you have to go deeper.

You need the first layers to figure out the edges, give the edges to the second layer, the second layer to figure out the nose, the eyes, give it to the third layer, the third layer to figure out the distances between the eyes, the distance between the ears.

So you would go deeper and get the encoding

deeper because you know that you want high level features.

Okay. Art generation: given a picture, make it look beautiful.

As usual, data. What do we need?

A little complicated because we have to define what beautiful is.

[NOISE] So data some beautiful pictures?

I don't know, maybe my concept of beautiful is different than yours.

[NOISE] A certain style that we want.

Data in the certain style that we want. That's a good point.

So we might say that beautiful means paintings,

like paintings are usually beautiful.

So you want to have a, that kind of a style. Yeah, that's true.

So let's say we have any data that we, we want.

What we're going to do and the way we define this problem

is let's take an image that we call the content image,

and here again you have the Louvre Museum.

And let's take an image that we call the style image,

and this is a painting that we find beautiful.

What we want is to generate an image

that looks like it's the content of the content image,

but painted by the painter of the style image.

So this style image is Claude Monet and here we have the Louvre painted by Claude Monet,

even if, uh, he was dead when this pyramid was created.

So that's our goal and this is what we would call art generation.

There are other methods, but this is one.

So how do we do that?

What architectures do we need?

And please try to use what we've seen in the past two applications together.

[NOISE] What training scheme, what application,

what, what architecture

[NOISE].

No one wants to try?

Yes.

[inaudible]

Yeah.

[inaudible]

So you're saying we,

we give - we take some style images,

we give it as input to a network and the network outputs yes or no,

like one or zero?

So do we want to generate?

We wanna generate an image, yes.

Okay. So given [inaudible].

Okay. Yes, probably.

So what you're proposing is: we get an image, which is the content image, and we have a network that is the style network, which will style this image, and we will get a styled version of the content.

Right. So it will take certain features of that style and use this to change the output.

[NOISE] Yeah. So we use certain features of

the style and change this style according to what the network is.

So this is actually done.

This is one method. That's not the one we will see today.

But [NOISE], uh, the issue with this method, which is a small issue, is that you have to train your network to learn one style. The network learns one style; you give it the content and it gives you the content with the specific style of the model.

What we want to do is to have a model that is not restricted to a specific style.

I wanna be able to give a painting of Picasso and get this picture painted by Picasso.

So the difference here is that we're not,

we're not going to learn parameters of a network like we did

for face verification or for day and night classification.

We're going to learn an image.

So, you remember when we talked about backpropagation of the gradient to the parameters; we're not going to do that. We're going to backpropagate all the way back to the image. Let's see how it works.

So, first, we have to understand what content means and what style means. To do that, we're going to use encoding. We're going to use the ideas that we talked about earlier.

Giving the content image to a network that is very good

will allow us to extract some information about the content of this image.

We specifically saw together that earlier layers will detect the edges.

The edges are usually a good representation of the content of the image.

So I might have a very good network,

give my content image,

extract the information from the first layer,

this information is going to be the content of the image.

Now, the question is how do I get the style?

I wanna give my style image and find a way to extract the style.

That's what we're going to learn later in this course,

it's a technique called Gram matrix.

And the important thing to remember is that,

the style is non-localized information.

If I show you the pictures in the previous slide, oh sorry, here, you see that in the generated picture, although in the style image there was a tree on the left side, there's no tree in the generated image. It means when I extracted the style, I just extracted non-localized information: what's the technique that Claude Monet used to paint? I didn't want to extract the tree that was in the style image; I don't want the content.

Okay. So we're going to take a network that understands images very well,

and they're common online.

You can find ImageNet classification networks online that were trained to recognize thousands of objects.

This network is going to understand basically anything you give it.

If I give it the Louvre Museum, it's going to find all the edges very easily, it's going to figure out that it's during the day, it's going to figure out there are buildings on the sides, and all the features of the image, because it was trained for months on thousands of classes.

Let's say we have this network,

we give our content image to it and we extract information from the first few layers.

This information we call content C, the content of the content image. Does that make sense?

Now, I give the style image and I will use another method, called the Gram matrix, to extract style S, the style of the style image, okay?

And now the question is; what should be the loss function?

So let's go on Menti.

So same code as usual, just open it.

If you wanna repeat - you can repeat the code if you want,

845709, and these are the three proposals for the loss function.

So reminder, content C means content of the content image,

style S means style of the style image,

style G means style of the generated image,

content G means content of the generated image.

Take like a minute.

It's too small?

Oh, the code, up, 845709.

So why do you need to have ImageNet here? You don't actually need to classify an image [inaudible]. So why do you need to use ImageNet, [inaudible]?

Why - so just repeating the question,

why do we need to use ImageNet?

Because we, we don't really need to classify an image and it's going to waste time.

Uh, the reason we use ImageNet is because ImageNet understands our pictures.

So if, if you give the content image

to a network that doesn't understand pictures very well,

you're not going to get the edges very well. So you want a network -

I don't care about the classification of the pictures.

You don't care about the classification output,

you just cut the network in the middle,

extract the layers in the middle.

Okay. Let's see what the answers are according to you guys.

So if we are getting style - style of it, you are not training anything, right?

So yeah, I repeat, we're not training anything here.

We're getting a model that exists and we use this model.

But we are going to talk about the training after.

Okay. Someone who has answered the second option, and I will read it out loud: the loss is the L2 distance between the style of the style image and the generated style, plus the L2 distance between the generated image's content and the content image's content. Yeah.

We want to minimize both the [inaudible]. [NOISE]

So yeah, we wanna minimize both terms here.

So we want the content of the content image to

look like the content of the generated image,

so we wanna minimize the L2 distance of these two.

And the reason we use a plus is because we also wanna

minimize the difference of styles between the generated and the style image.

So you see, we don't have any term that says the style of the content image minus the style of the generated image is minimized. This is the loss we want.
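A minimal Python sketch of this loss, with the Gram matrix used as the style representation; the feature shapes and weights are placeholders, not the activations of a real ImageNet network:

    import numpy as np

    def gram_matrix(features):
        # features: (channels, height*width) activations of one layer.
        # The Gram matrix is the cross-correlation between channels: it discards
        # spatial (localized) information and keeps what we call the "style".
        return features @ features.T

    def style_transfer_loss(content_c, content_g, style_s, style_g,
                            content_weight=1.0, style_weight=1.0):
        # J = ||Style_S - Style_G||^2 + ||Content_C - Content_G||^2 (weighted sum)
        content_term = np.sum((content_c - content_g) ** 2)
        style_term = np.sum((gram_matrix(style_s) - gram_matrix(style_g)) ** 2)
        return content_weight * content_term + style_weight * style_term

    # Hypothetical activations: 64 channels on a 32x32 feature map, flattened spatially.
    cc, cg = np.random.rand(64, 1024), np.random.rand(64, 1024)
    ss, sg = np.random.rand(64, 1024), np.random.rand(64, 1024)
    print(style_transfer_loss(cc, cg, ss, sg))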

Okay, up now.

Okay. So just going over the architecture again.

So the loss function we're going to use will be the one we saw.

And so one thing that I want to emphasize here is we're not training the network,

there is no parameter that we train.

The parameters are in the ImageNet classification network,

we use them we don't train them.

What we will train is the image.

So you get an image and you start with white noise,

you run this image through

the classification network but you don't care about the classification of this image.

ImageNet is going to give a random class to this image, totally random.

Um, instead, you will extract content G and style G, okay?

So from this image,

you run it and you extract information from this network using the same techniques that you've used to extract content C and style S. Content C and style S you already have.

You're able to compute the loss function because now

you have the four terms of the loss function.

You compute the derivatives,

instead of stopping in the network,

you go all the way back to the pixels of the image and you

decide how much should I move the pixels in order to make this loss go down,

and you do that many times.

You do that many times. And the more you do that,

the more this is going to look like the content of the content image,

and the style of the style image. Yeah, one question.

So for each new example of content and style images

you need to do a new training like this?

Yeah. So the downside of this network,

is that although it has the flexibility to work with any style and any content, every time you want to generate an image you have to do this training loop. Whereas the other network that you talked about doesn't need that, because the model is trained to convert the content to a style: you just give it and that's it.

Do you have to train the network on many,

kind of like Monet images or you only need to do those kind of like Monet?

Which network you talk about, this network?

Yes.

Yeah. So do we need to train this network on Monet images? Usually not.

This network is trained on millions of images.

It's basically seen everything you can imagine. Yeah.

So you only need to give one art piece to it and then it will

be able to back-propagate properly into any [inaudible].

Uh, what do you mean by back-propagate properly? Here you're not training the network. You are taking this image, computing the backpropagation, and going back to the image, only updating the image; you don't update the network.

Where does the rps -?

It comes from Content C and Style S; it comes from the Style S. So, for the loss function, the baseline is that you have Content C and Style S, because you've chosen a content picture and a style picture, and now at every step you will find the new Content G and Style G. Backpropagate, update, [NOISE] give it again, get the new Content G and Style G, update again, and so on.

[NOISE] No, the art image only touches the neural network one time: you extract Style S and then that's all, you don't use it again.

Okay let's do one more question here.

Why do you start white noise instead of the content or the style?

Good question. Why do you start with white noise instead of the content or the style?

Actually do you think it's better to start with the content or the style?

Probably the style.

Probably the style? I think probably the content, because the edges at least look like the content, which is going to help your network converge quicker.

Yeah, that's true, you don't have to start with white noise. In general the baseline is to start with white noise so that anything can happen; if you give it the content to start with, it's going to have a bias towards the content, but if you train longer.

Okay one more question and then we can move on.

So this style and content [inaudible].

ImageNet doesn't understand what content and style are, but ImageNet finds the edges in the image, and so you can give it the content image and extract the first few layers to get information about them, because when it was trained on classification, it needed to find the edges.

To find that a dog is a dog, you first need to find the edges of the dog, so it's trained to do so. And for the style, it's complicated to understand the style, but the network finds all the features in the image, and then we use a post-processing technique called the Gram matrix in order to extract what we call style. It's basically a cross-correlation of all the features of the network.

We will learn it together later on.

Okay, let's move on to the next application because we don't have too much time.

So this is the one I prefer: given a 10-second audio clip, detect the word "activate". So you know, we talked about trigger word detection; there are many companies that have this wake word thing where you have a device at home, and when you say a certain word it activates itself.

So here is the same thing for the word activate.

What data do we need? Do we need a lot or not?

Probably a lot, because there are many accents, and one thing that is counter-intuitive is that if two humans, let's say two women, speak, as a human you would say these voices are pretty similar, right?

You can detect the word.

What the network sees is a list of numbers that are totally different from

one person to another because

the frequencies we use in our voices are totally different from each other.

So the numbers are very different although as a human we feel that it's very similar.

So we need a lot of 10-second audio clips, that's it.

What should be the distribution?

It should contain as many accents as you can, as many female and male voices, kids and adults, and so on.

What should be the input of the network?

It should be a 10 second audio clip that we can represent like that.

The 10 second audio clip is going to contain some positive words, in green.

Positive word is activate and it's also going to

contain negative words in pink like kitchen,

lion, whatever, words that are not activate and we want only to detect the positive word.

What should be the sample rate?

Again, same question: you would test on humans,

and you would also talk to an expert in speech recognition

to know what's the best sample rate to use for speech processing.

What should be the output? Any ideas?

[NOISE]

Okay, yeah any other?

Classification, yes no.

Classification, yes no. So zero or one.

Actually let's make a test, let - let's do a test.

So we have three audio speech here,

speech one, speech two, speech three.

I don't know if we have the sound here. Do we have the sound?

[NOISE] Maybe we have it now, okay let's try.

[FOREIGN]

So this is labeled one. [LAUGHTER] Nobody speaks Italian in the room? Second one.

[FOREIGN]

Okay what's the wake word?

Has anybody found what was the, the trigger word?

We need more data.

We need more.

[LAUGHTER]

So you know what's funny is, to me this is the right scheme to label,

like it's definitely possible, but it seems that even for humans this labeling scheme is super hard.

We're not able to find what's happening; like, I don't know,

even though I made this slide I don't even remember. No kidding.

Now let's try something else, Okay?

So now we have a different labeling scheme that

tells us also where the wake word is happening.

Let's hear it again.

[FOREIGN]

Okay, what's the trigger word?

Pomeriggio.

Pomeriggio means uh, afternoon in Italian.

Okay. So you see, what I am trying to illustrate is:

compare the human to the computer and you will get what's the right labeling scheme to use,

and of course the labeling scheme here is going to be better for the model than the first one, and we just proved it.

Uh, the, the important thing is to know that the first one would also work,

we just need a ton of data.

We need a lot more data to make the first labeling scheme work

than we need for the second one, does that make sense?

So yeah, we will use something like that.

[inaudible] . [NOISE]

Good question, actually this is not the best labeling scheme.

As you said, should the one come before or after the word was said?

What do you guys think? Before?

After.

After, yeah. You will see that recurrent neural networks basically look at

the data just as humans do, temporally from the beginning to the end.

In this case you need to hear the word in order to detect it,

so you're going to put the one right after the word was said.

Another issue that we have with this is that there are too many zeros,

it's highly unbalanced so the network is pushed to always predict zeros.

So what we do as a hack,

and there's a lot of hacks like that happening in papers if you read them.

We're going to add several ones after the word was said;

I would add 20 ones, basically, okay?

So this is our labeling scheme now.
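A minimal sketch of that labeling hack, with made-up numbers of time steps; the only point is that the label is all zeros except a run of ones placed right after the trigger word ends:

```python
import numpy as np

def make_labels(num_steps, word_end_step, n_ones=20):
    # 0 everywhere, then a run of ones starting right AFTER the trigger word was said,
    # to fight the class imbalance described above.
    y = np.zeros(num_steps, dtype=np.float32)
    y[word_end_step : word_end_step + n_ones] = 1.0
    return y

labels = make_labels(num_steps=1000, word_end_step=600)   # illustrative numbers
```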

What should be the last activation of our network?

[NOISE] Sigmoid function,

yeah, sigmoid but sequential.

For every time step you would use a sigmoid to output zero or one, basically.

Don't worry if you don't understand specifically what networks we're using,

you're going to learn it in a few weeks.

So the architecture should,

should be like a recurrent neural network, probably.

Uh, convolutional neural networks might work as well,

we'll see it later on in the course and

the loss function should be the same as before, but we should make it sequential.

For every time step we should use a loss function like that, and we should sum them over all the time steps. Sounds good?
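As a sketch, assuming per-time-step sigmoid outputs, this is just the logistic loss applied at every time step and summed over time:

```python
import numpy as np

def sequential_logistic_loss(y_true, y_pred, eps=1e-7):
    # y_true, y_pred: arrays of shape (num_steps,), with y_pred in (0, 1) from the sigmoid.
    y_pred = np.clip(y_pred, eps, 1 - eps)
    per_step = -(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    return per_step.sum()   # sum the per-time-step losses over all time steps

loss = sequential_logistic_loss(np.array([0, 0, 1, 1]), np.array([0.1, 0.2, 0.8, 0.6]))
```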

So, another insight on this project - I'll take your question after - is what was critical to the success of this project.

I think there are two things that are really critical when you build such a project.

The first one is to have a strategic data acquisition pipeline, so let's talk more about that.

We said that our data should be 10-second audio clips that contain positive and negative words from many different accents.

How would you collect this data?

[NOISE]

That's right. [NOISE]

You say you pay people to give you 10 seconds of their voice?

[LAUGHTER]

[inaudible] I think you,

you can take your phone,

go around campus and that's actually how we did it,

we took our phones, we went around campus and we got some audio recordings.

So one way to do it is to go and get 10-second audio recordings from different people with a large distribution of accents, and then what do you do?

You label? You label by hand?

That's one method; is it long or short?

Is it quick or not? It's super slow, yeah.

[inaudible]

Oh, subtitles in movies.

Uh, that's a good idea actually.

You could, depending on the licensing of the movie,

[LAUGHTER] take the audio from a movie, get the subtitles, and look for activate.

And every time the subtitles say "Activate", you could label your data.

That's super fun. That's super good actually.

You could label automatically using that.

Yeah. So, that's a good idea.

I think there's another way to do it that is closer to

that which is we're going to collect three databases.

The first one is going to be the positive word database,

the second one is going to be the negative word database,

the third one is going to be the background noise database.

So, I take the background, 10 seconds.

I insert randomly from one to three negative words, and I insert randomly from one to three positive words, making sure they don't overlap with a negative word.

Okay? What's the main advantage of this method?

Programmatic generation of samples.

Yeah, programmatic generation of samples and automated labeling.

I can label. I know where I inserted my positive words.

[NOISE] So, I just add ones where I inserted it.

I can generate millions of data examples like that just

because I found the right strategy to, to create data.
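Here is a rough sketch of that synthesis strategy; the helper names, durations, and retry logic are illustrative, not the actual pipeline, and real code would also mix the waveforms, but it shows how the labels come for free because we know where each positive word was inserted:

```python
import random

def synthesize_labels(n_negative, n_positive, clip_len=1.0, total_len=10.0):
    # Place 1-second word clips at random, non-overlapping positions on a 10-second background.
    placed = []  # (start_time, is_positive)
    for is_positive in [False] * n_negative + [True] * n_positive:
        for _ in range(100):  # retry until this clip doesn't overlap anything already placed
            start = random.uniform(0, total_len - clip_len)
            if all(abs(start - s) >= clip_len for s, _ in placed):
                placed.append((start, is_positive))
                break
    # Automatic labels: we put a 1 right after each positive word ends.
    return sorted(round(start + clip_len, 2) for start, pos in placed if pos)

print(synthesize_labels(n_negative=random.randint(1, 3), n_positive=random.randint(1, 3)))
```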

You see the difference between the two methods.

The one where you have to go out and collect data and the one where you just go out,

collect positive words, negative words,

and then find background noise on YouTube or wherever you have the right license to use.

It's, it's a big difference and this can make,

[NOISE] can make your company succeed compared to another company.

It's very common. All right.

So, I would go on campus, take one-second audio clips of positive words, and put them in the green database.

I would take one-second audio clips of negative words, from the same people as well, and put them in the pink database,

get background noise from anywhere I can find it - it's very cheap - and then create the synthetic data and label it automatically.

And you know, with like five positive words, five negative words,

five backgrounds, you can create a lot of data points.

Okay. So, this is

an important technique that you might want to think about in your projects.

The second thing that is important for the success of

such a project is the architecture search and hyperparameter tuning.

So, all of you will have complicated projects where you will be lost at first regarding the architecture to use.

It's a complicated process to find the architecture, but you should not give up.

And the first thing I would say is talk to the experts.

So, let me tell you the story of this project.

Um, first I, I started

like looking at the literature

and figuring out what network I could use for this project.

And I ended up using this for the beginning part: I used a Fourier transform to extract features from the speech.

Who's familiar with spectrograms or Fourier transforms?

So, for the others, think about audio speech as a 1D signal.

Every 1D signal can be decomposed into a sum of sines and cosines, each with a specific frequency and amplitude.

And so I can convert a 1D signal into a matrix, with one axis that is the frequency and one axis that is the time, going from 0 to 10 seconds.

And I will get the amplitude of each frequency.

So maybe this one is a strong frequency, this one is a strong frequency, this one is a low one, and so on.

For every time step. This is the spectrogram of an audio speech.

You're going to learn a little bit more about that.
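A minimal sketch of that conversion, using a short-time Fourier transform with made-up window sizes and an assumed sample rate:

```python
import numpy as np

def spectrogram(signal, window=256, hop=128):
    # Slice the 1D signal into overlapping windows and take one FFT per window:
    # rows are time steps, columns are frequency bins (amplitudes).
    frames = [signal[i:i + window] for i in range(0, len(signal) - window, hop)]
    return np.abs(np.array([np.fft.rfft(f * np.hanning(window)) for f in frames]))

sr = 8000                               # assumed sample rate
t = np.arange(0, 10, 1 / sr)            # a 10-second clip
audio = np.sin(2 * np.pi * 440 * t)     # toy signal: a 440 Hz tone
spec = spectrogram(audio)               # shape: (num_time_steps, num_frequency_bins)
```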

So, after I got the spectrogram which is better than the 1D signal for the network,

I would use an LSTM which is a recurrent neural network and add

a sigmoid layer after it to get probabilities between zero and one.

I will threshold them: everything more than 0.5 I will consider a one, everything less a zero.

I tried for a long time fitting this network on the data, it didn't work.

But one day I was working on campus and I ran into a friend who is an expert in speech recognition.

He's worked a lot on all these problems and he knew exactly that this was not going to work.

He could have told me.

So, he told me, "There's several issues with this network.

The first one is your hyperparameters in the Fourier transform, they're wrong.

Go on my GitHub, you will find what hyperparameters I used for this Fourier transform.

You will find specifically what sample rate,

what window size, what frequencies I used."

So, that was better. Then he said,

"One issue is that your recurrent neural network is too big.

It's super hard to train. Instead, you should reduce it."

So, I've used - so,

he told me to use a convolution to reduce the number of time steps of my audio clip.

You will learn about all these layers later.

Ah, and also to use batch norm, which is a specific type of layer that makes the training easier.

And finally, you get your sigmoid layer and you output zeros and ones.

But because the output has fewer time steps than the input, you have to expand it.

So you need an expansion algorithm, just a script that expands every zero into two zeros,

every one into two ones, and so on.
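A sketch of that expansion step; the factor of 2 is just the example given, and the real factor depends on how much the convolution shrank the time axis:

```python
import numpy as np

def expand(predictions, factor=2):
    # Repeat each output prediction so the label sequence matches the longer input time axis.
    return np.repeat(predictions, factor)   # [0, 1, 0] -> [0, 0, 1, 1, 0, 0]

print(expand(np.array([0, 1, 0])))
```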

And now I get another architecture that I managed to train within a day.

And this was all because I was lucky enough to find

the experts and get advice from this person.

So, I think you will run into the same problems as I ran into, during your projects.

The important thing is spend more time figuring out who's the expert

and who can tell you the answer rather than trying out random things.

I think this is a - an important thing to think about.

Okay. So, don't give up, and also use error analysis, which we are going to see later.

Ah, we have two more minutes.

So, I'm not gonna go over this one.

I'm just going to talk about it quickly.

There is another way to solve wake word detection.

And the other way is to use the triplet loss algorithm.

Instead of using anchor, positive, and negative faces, you can use one-second audio clips.

The anchor is the word activate, the positive is the word activate said differently, and the negative is another word.

You will train your network to encode activate into a certain vector and then compare

the distances between vectors to figure out if activate is present or not.
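A sketch of such a triplet loss on audio embeddings; the embedding size and margin are illustrative:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    # Push the anchor ("activate") closer to the positive ("activate" said differently)
    # than to the negative (any other word), by at least a margin.
    pos_dist = np.sum((anchor - positive) ** 2)
    neg_dist = np.sum((anchor - negative) ** 2)
    return max(pos_dist - neg_dist + margin, 0.0)

a, p, n = np.random.randn(3, 128)   # three 128-dimensional embeddings from the encoder
print(triplet_loss(a, p, n))
```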

Okay. We have about two more minutes.

So, I'm going to [NOISE] Oh, sorry.

My bad [LAUGHTER] just on me [LAUGHTER].

Ah, just to finish,

ah, with two more slides.

Ah, now that you've seen some loss functions, I want to show you another one,

and I want you to tell me what application this beautiful loss corresponds to.

This is one of the most beautiful losses I've seen in my life.

[LAUGHTER] So, someone can tell me what's the application,

what problem are we trying to solve if we use this loss function?

Speech recognition.

Speech recognition, no. It's not the case. Good try. Yes.

Regression.

Regression. That's true.

It's a regression problem but it's a specific regression problem.

Bounding box.

Good. Bounding box the object detection.

This is object detection.

So, I, I put the paper here you can check it

out but how do you know that it's object detection?

I've done it before.

Oh, you've done it before.

[LAUGHTER] Okay.

So, this is the loss function of a network called YOLO.

And the reason you can tell it's bounding boxes is that if you look at the first term,

you see that it's comparing predicted x to true x and predicted y to true y.

This is the center of a bounding box, xy.

The second term is W and H; W and H stand for the width and height of a bounding box.

And it's trying to minimize the distance between

the true bounding box and the predicted bounding box basically.

The third term has an indicator function with objects.

It's saying, "If there is an object,

you should have a high probability of objectness."

The fourth term is saying that if there is no object,

you should have a lower probability of objectness.

And finally, the last term is telling you that you have to find the class that is in this box.

Is it a cat? Is it a dog? Is it an elephant? Whatever.

So, this is an object detection loss function.

Actually, do you know why you have a square root here?

[NOISE] [inaudible] that.

The reason we have the square root is because you want to penalize errors on small bounding boxes more than on big bounding boxes.

So, if I give you an image of a human like that and a cat like this:

the box on the inside is the ground truth, a very tight box, and this one the same,

and the boxes outside are the predictions.

So, these are the predictions and the other ones are the ground truth.

What is interesting is that a two-pixel error on this cat is much more important than a two-pixel error on this human, because the box is smaller.

So, that's why you use a square root: to penalize errors on small boxes more than on big boxes.
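To see the effect numerically (box widths made up): the same 2-pixel width error costs roughly ten times more on a 20-pixel box than on a 200-pixel box once you compare square roots.

```python
import numpy as np

def sqrt_width_term(w_true, w_pred):
    # The (sqrt(w_pred) - sqrt(w_true))^2 term from the width/height part of the loss.
    return (np.sqrt(w_pred) - np.sqrt(w_true)) ** 2

print(sqrt_width_term(20, 22))    # small box (the cat):   ~0.048
print(sqrt_width_term(200, 202))  # large box (the human): ~0.005
```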

Okay. And finally the final slide, okay.

Let's go over that. So, just recalling what we have for next week.

Ah, you have two modules to complete for next Wednesday,

which are C1M3 with the corresponding quiz and programming assignments,

and C1M4 with one quiz and two programming assignments.

You're going to build your first deep neural network.

This is all going to be on the web - it's

already on the website and we'll publish the slides now.

Ah, you have TA project mentorship that is mandatory this week.

So, TA project mentorships are mandatory this week, then again after the project proposal,

after the project milestone, and before the final project submission.

Okay. And in Friday TA sections, you're going to do some neural style transfer and art generation,

and fill in the AWS form - I don't know if it's been done yet.

We're going to try to give you some credits for your projects, with GPUs.

[NOISE] Okay. Thanks guys.

