Hi everyone. Uh, welcome to lecture number seven.
Um, so, up to now, uh,
I believe, can you hear me in the back? Is it easy?
Okay. So, in the last set of modules that you've seen,
you've learned about convolutional neural networks and how they
can be applied to imaging, notably.
Uh, you've played with different types of layers including pooling,
max pooling, average pooling, and convolutional layers.
You've also seen some classification, uh,
with the most classic architectures,
uh, all the way up to more recent ones.
Uh, and then you jumped into advanced applications like YOLO,
uh, and the Fast R-CNN,
Faster R-CNN series with an optional video.
And finally, uh, face recognition and neural style transfer.
So, today, we are going to build on top of everything you've seen in this set of modules,
to try to understand what's happening inside these networks.
Because you, you, you notice after seeing, uh,
the set of modules up to now that a lot of, uh,
improvements of these networks come from trial and error.
So, we try something, uh,
we run an experiment,
sometimes the model improves, sometimes it doesn't.
We use a mix
of methods that would make our model improve.
It's not satisfactory from a scientific standpoint,
so people are also searching how can we find, uh,
an effective way to improve our models,
not only with trial and error,
but with theory that goes into the network and explains what's happening inside.
So, today, we will focus on that.
We first, uh, we'll see three methods,
saliency maps, occlusion sensitivity, and class activation maps,
which are used to kind of understand what was the decision process of the network.
Given this output, how can we map back
the output decision on the input space to
see which part of the inputs were important for this decision.
And later on, we will go into more
detail into the network by looking at
what happens at an activation level,
at a layer level,
and at a network level with another set of methods.
We will spend some time on the deconvolution,
uh, it's a cool, it's a cool type of, uh,
more advanced layer.
Uh, if we have time, we'll go over a fun application called Deep Dream, um,
which is super cool visuals for some of you who know it.
Okay? Let's go.
Menti code is on the board,
if you guys need to, to sign up.
So, uh, as usual,
we'll go over some contextual information and small case studies,
so don't hesitate to participate.
So, you've built an animal classifier for a pet shop,
um, and you gave it to them.
It's, it's super good.
It's been trained, uh, on ImageNet plus some other data.
And what, what is a little worrying is that
the pet shop is a little reluctant to use your network,
because they don't understand the decision process of the model.
So, how can you quickly show that the model is actually looking at a specific animal,
let's say a cat, if I give it an input that is a cat.
We've seen that together,
one time, everybody remembers?
So, I'll go quickly. Uh, you have a network,
here is a dog given as an input to a CNN.
The CNN, assuming the constraint is that there is one animal per image, was trained with
a softmax output.
And what we want is to take the derivative of
the score of dog with respect to the input,
to know which parts of the inputs were influencing this score.
So, this derivative has the size of the input.
It's a matrix of numbers.
If the numbers are large in absolute value,
it means the corresponding pixels had a big influence on the score of dog.
Okay? What do you think the score of dog is?
Is it the output probability or no?
What -
[NOISE]
Yeah?
Score of the dog?
It's the score of the dog, yeah.
But is it, uh, 0.85, that's what I mean?
[NOISE] No, there are actually formulas used
Yes. So, it's the,
it's the score that is pre-softmax.
It's the score that comes before the softmax.
So, as a reminder, here's a softmax.
So, you get a vector, that is a set of scores that are not necessarily probabilities,
they are just scores between minus infinity and plus infinity.
You give them to the softmax;
the softmax, what it's going to do is that it's going to output
a vector where the sum of
all the probabilities in this vector are going to sum up to one.
Okay?
Instead of using the probability,
if we use the score of dog,
we will get a better representation here.
The reason is, in order to maximize this number,
score of dog divided by the sum of the scores of al - all animals,
or like maybe I,
I should write exponential of score of dog divided
by sum of exponentials of all scores,
one way is to minimize the scores of the other classes.
So, you see, so maybe moving a certain pixel changes
the general output of the network.
But it actually doesn't have an influence on the score of dog
one layer before. Does it make sense?
So, that's why we would use, uh,
the scores pre-softmax instead of
using the scores post-softmax that are the probabilities.
Okay. And what's fun is, here you cannot see that well,
the slides are online if you wanna - if you wanna look at it on your computers.
But you have some of the pixels at
the same positions as the dog is on the input image that are stronger.
So, we see some white pixels there.
And this can be used to segment the - the dog probably.
So, you could use a simple segmentation algorithm,
uh, on top of this map.
It doesn't work too - too well in practice,
so we have better methods to do segmentation,
but this can be done as well.
So, this is what is called saliency maps,
and it's a common technique to quickly, uh, check what the network is looking at.
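To make this concrete, here is a minimal sketch of the saliency map computation, assuming a PyTorch classifier `model` whose forward pass returns the pre-softmax scores; the function name and tensor shapes are illustrative, not the code used in class.

```python
import torch

def saliency_map(model, image, class_idx):
    """Gradient of the pre-softmax class score with respect to the input pixels.

    image: tensor of shape (1, 3, H, W); model(image) returns logits (pre-softmax).
    """
    model.eval()
    image = image.clone().requires_grad_(True)
    scores = model(image)              # (1, num_classes), scores before the softmax
    scores[0, class_idx].backward()    # d(score of dog) / d(pixels)
    # One value per pixel: max of |gradient| over the three color channels.
    return image.grad.abs().max(dim=1)[0].squeeze(0)   # (H, W)
```

Large values in the returned map mark the pixels that move the dog score the most, which is why thresholding it gives the rough segmentation mentioned above.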
So, here's another scenario.
Now you've built the animal classifier for the pet shop,
they're still a little scared,
but you wanna prove that the model is actually looking
at the input image at the right position.
You don't need to be quick but you have to be very precise.
[NOISE] Yeah?
So, going back from the last slide,
is the map the output probability?
No, the derivative.
Okay.
It's the values of the de - the derivative.
Oh, okay. So, it's like the gradients at [inaudible]
So, you - you take the score of dog,
you differentiate it with respect to the input,
it gives you a matrix that's exactly the same size as the x.
And you use - you use like a specific color scheme to see which pixels had the largest values.
Perfect, thank you.
Okay. So, here we have our CNN.
The dog is forward propagated and we get an output,
uh, probability score for the dog.
Now, you want a method that is more
precise than the previous one but not necessarily too fast.
And this one, we've talked about it a little bit, it's called occlusion sensitivity.
So, the idea here is to put a gray square on the dog here.
And we forward propagate this new image through the network.
What we get is another probability distribution
that is probably similar to the one we had before,
because the gray square doesn't seem to impact too much of the image.
At -, uh, at least from a human perspective,
we still see a dog, right?
So, the score of dog might be high, 83 percent probably.
What we can say, is that we can build
a probability map corresponding to the class dog and ha - and we
will write down on this map how confident is
the network if the gray square is at a specific location.
So, for our first location,
it seems that the network is very confident,
so let's put a red square here.
Now, I'm going to move the gray square a little bit.
I'm shifting it just as we do for a convolution filter,
to send again this new image in the network.
It's going to give me
a new probability distribution output and the score of dog might change.
So, looking at the score of dog,
I'm going to say, okay,
the network is still very confident that there is a dog here, and I continue.
I shift it again, here same,
network's still very confident that there is a dog.
Now, I shift the, the, the square, um,
so that the - the face of the dog is partially hidden.
Probability of dog will probably go down,
because the network cannot see one eye of the dog.
It's not confident that there's a dog anymore.
So, probably, the confidence of the network went down.
I'm going to put a, a square that is tending to be blue, and I continue.
I shift it again and here we don't see the dog face anymore.
So, probably the network might,
might classify this as a chair, right?
Because the chair is more obvious than the dog now.
So, I'm gonna put a blue square here and we're going to continue.
Here, we don't see the tail of the dog,
it's still fine, the network is pretty confident,
And what I will look at now is
this probability map which tells me roughly where the dog is.
So, here we used a pretty big gray square compared to the size of the image.
Sorry, the smaller the gray square,
the more precise this probability map is going to be,
but the longer it takes to compute.
So if you have time, if you can,
you can take your time with the pet shop to explain to them, uh,
what's happening, you would do that. Yeah?
Would you ever, in an
We will see that in the next slide. That's correct.
So let's see more examples.
Here, we have three classes and these, these,
these images have been generated by Matthew Zeiler and Rob Fergus.
This paper, Visualizing and Understanding Convolutional Networks,
is one of the papers that
led the research in, in network visualization.
So, I'd advise you to take a look at it,
and we will refer to it a lot of times in this lecture.
So, now we have three examples.
One is a Pomeranian,
which is this type of cute dog, a car wheel,
which is the true class of the second image,
and an Afghan hound,
which is this type of dog here on the last image.
So, if you do the same thing as we did before that's what you would see.
So, just to clarify,
here we see a blue color.
It means when the gray square was positioned here or centered at this location,
the network was less confident that the true class was the Pomeranian;
the confidence of Pomeranian went down,
because the confidence of tennis ball went up.
And another interesting thing to notice is on the last picture here.
You see that there is a,
a red color on the top left of the image.
And this is exactly what - what you mentioned, Adam, it's that,
when the square was on the face of the human,
the network was much more confident that the true class was the dog.
Because you removed a lot of information that was,
uh, meaningful for the network, which was the face of the human.
And similarly, if you put the square on the dog,
the class that the network was outputting was probably human. Does that make sense?
Okay. So, this is called occlusion sensitivity,
and it's the second method that you now have seen for
interpreting where the network looks at on an input.
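As a rough sketch of the procedure just described, assuming again a PyTorch `model` returning logits; the patch size, stride, and gray value are placeholder choices.

```python
import torch
import torch.nn.functional as F

def occlusion_map(model, image, class_idx, patch=40, stride=20, gray=0.5):
    """Slide a gray square over the image and record the class probability.

    image: tensor (1, 3, H, W). Low values in the returned map mean the square
    hid something the class score depended on.
    """
    model.eval()
    _, _, H, W = image.shape
    rows, cols = (H - patch) // stride + 1, (W - patch) // stride + 1
    heat = torch.zeros(rows, cols)
    with torch.no_grad():
        for i in range(rows):
            for j in range(cols):
                occluded = image.clone()
                y, x = i * stride, j * stride
                occluded[:, :, y:y + patch, x:x + patch] = gray  # gray square here
                probs = F.softmax(model(occluded), dim=1)
                heat[i, j] = probs[0, class_idx]
    return heat
```

Blue (low) cells in this map are the positions where hiding the patch hurt the class probability, which is exactly the effect seen on the Pomeranian's face.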
So, let's move to class activation maps.
So, I don't know if you remember,
but two weeks ago, Pranav, when he
discussed the techniques that he has used in his research,
he explained that - he did a chest x-ray project.
And he manages to,
to tell the doctor where the network is looking at when predicting
a certain disease based on a chest X-ray, right? You remember that?
So, this was done through class activation maps,
and that's what we're going to see now.
So, one important thing to notice is that we discussed that
classification networks seem to have a very good sense of localization,
and we can see it with the two methods that we previously discussed.
Think also of the detection networks that you've studied in this set of modules.
The YOLOv2 network,
because classification has a lot of data,
a lot more than detection,
has been trained on classification first,
builds a very good feature extractor,
and is then fine-tuned for detection.
Okay. And so the core idea of class activation maps is that
CNNs have a very good ability to localize objects,
even if they were trained only on image level labels.
So, we have this network.
There is a very classic network used for classification.
We give it a kid and a dog.
Uh, this class activation map work is coming from MIT,
the MIT lab with Bolei Zhou.
And we forward propagate this image of
a kid and a dog through the network, which has some CONV layers.
And at the end, you usually flatten the volume,
and run it through several fully connected layers
which are going to play the role of the classifier,
and send it to a softmax,
and get the probability output.
Now, what we're going to do is that we are going to prove that
this CNN is actually able to localize objects.
So, we're going to convert this same network in another network.
And the part which is going to change is only the last part.
The problem with flattening is that you lose all the spatial information.
You have a volume that has spatial information,
although it's been going through some max pooling,
so it's been down sampled and you lost some part of the resolution.
You flatten it,
run it through a fully connected layer, and then it's over.
You - it's, it's super hard to find out what
the activation corresponds to on the input space.
So, instead of flattening plus fully connected layers,
we're going to use global average pooling.
We're going to explain what it is.
Then a fully connected layer with a softmax, and we get the probability output.
And we're going to show that now this network can
be trained very quickly because we just need to train one layer,
the fully connected here,
and can show where the network looks at.
The same as the previous network.
So, let's talk about it more in detail.
Assume this was the last CONV layer,
and it outputs a volume,
a volume that is sized four by four by six.
So, six filters were used in the last CONV layer.
And so we have six feature maps now.
I'm going to convert this using
a global average pooling.
What is global average pooling?
It's just taking these feature maps.
Each of them averaging them into one number.
So, now instead of having a four by four by six volume,
I have a one by one by six volume,
but we can call it a vector.
So, what's interesting is that this number,
actually holds the information of the whole feature map that
came before, averaged into one number.
I'm going to put these in a vector,
and I'm going to call them a.
As usual a_1, a_2,
a_3, a_4, a_5, a_6.
As I said, I'm going to train a fully-connected layer here with the softmax activation,
and the outputs are going to be the probabilities.
So, what is interesting about that?
It's that the feature maps here as you know will contain some visual patterns.
So, if I look at the first feature map,
I can plot it here,
so these are the values.
And of course, this one has many more values.
It's not a four by four, it's much more numbers.
But this - you can say that this is the feature map,
and it seems that the filter fired at these locations.
There was a visual pattern in the inputs that activated the feature map,
and the filters which generated this feature map here in this location.
Same for the second one, there's probably two objects or
two patterns that activated the filters that generated this feature map,
So we have six of those.
And after I've trained my fully connected layers here - my fully connected layer,
I look at the score of dog.
Score of dog is 91 percent.
What I can do is to know this 91 percent,
how much did it come from these feature maps?
And how can I know it?
It's because now I have a direct mapping using the weights.
I know that the weight number one here,
this edge you see it,
is how much the score was dependent on the orange feature map.
If you look at the green edge,
it is the weight that has multiplied
this feature map to give birth to the output score of dog.
So, this weight is telling me how much this feature map, the green one,
has influence on the output.
So, now what I can do is to sum all of these,
a weighted sum of all these feature maps.
And if I just do this weighted sum,
I will get another feature map.
Something like that. And you notice that,
this one seems to be highly influenced by the green one,
the green feature map, yeah.
It means probably the weight here was higher.
It probably means that the second feature map of
the last CONV layer was important for detecting the dog.
Okay. And then, once I get this feature map,
this feature map is not the size of the input image, right?
It's the size of the height and width of the output of the last CONV.
So, the only thing I'm going to do is,
I am going to up sample it back simply,
so that it fits the size of the input image,
and I'm going to overlay it on the input image as a heatmap.
The reason it's called class activation map is because
this feature map is dependent on the class you're talking about.
If I was using, uh,
let's say I was using car here,
if I was using car,
the weights would have been different, right?
Look at the edges that connect
the first activation to the activation of the previous layer.
These weights are different. So, if I sum
all of these feature maps I'm going to get something else.
And what you can notice is,
probably if I look at the class of human the weights number one might be very high,
because it seems that this visual pattern that
activated the first feature map was the face of the kid.
Okay. So, what is super cool is that you can get your network,
and just change the last few layers into
global average pooling plus a softmax fully connected layer.
And you can do that, and only retrain that last part.
It requires a small fine tuning.
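A minimal sketch of the class activation map computation, assuming you already have the last CONV layer's feature maps for one image and the weights of the single fully connected layer that follows the global average pooling; names and shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def class_activation_map(feature_maps, fc_weights, class_idx, out_size):
    """CAM = weighted sum of the last conv feature maps, upsampled to the input size.

    feature_maps: (K, h, w), output of the last conv layer for one image.
    fc_weights:   (num_classes, K), weights of the fully connected layer
                  that sits after global average pooling.
    """
    w = fc_weights[class_idx]                           # (K,) weights for this class
    cam = (w[:, None, None] * feature_maps).sum(dim=0)  # (h, w) weighted sum
    cam = cam - cam.min()
    cam = cam / (cam.max() + 1e-8)                      # rescale to [0, 1]
    # Upsample back to the input resolution so it can be overlaid on the image.
    cam = F.interpolate(cam[None, None], size=out_size,
                        mode="bilinear", align_corners=False)
    return cam[0, 0]
```

Changing `class_idx` changes which row of weights is used, which is why the same feature maps give a different map for dog than for human.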
Yeah.
So are these like saliency maps?
It's a different vocabulary,
I would use saliency maps for the first method we saw,
the derivative of the score with respect to the input.
Uh, it's not the same thing here,
it's just an up sampling to the,
to the input space based on the feature maps of the last CONV layer.
So it's mostly just examining the weights and sort of doing like
a weighted sum on them, not so much that different from before?
Yes.
Good [NOISE].
Any other questions on class activation maps? Yes.
Does taking the average not kill the spatial information?
Yeah. That's a good question. So, taking
the average, does it kill the spatial information?
So, let me, let me write down the formula here.
This is the score that we're interested in,
let's say the score of dog, call it class c. What you could say is that this score
is a sum over k equals one to six of w_k,
which is the, the weight that,
that connects the output activation to the previous layer,
times a_k, the activation of the previous layer.
And a_k, uh, let's say we, we,
we use the global average pooling,
so it's the sum of A_k over the locations i, j, divided by 16.
Can you see in the back? Roughly? So, what I'm saying is that here,
I have my global average pooling that happened
here, and I can divide it by the right number,
so divided by 16, four by four.
Okay. I can switch the two sums,
so I can say that this thing is a sum over i, j,
the locations, of the sum over k equals one to six of
w_k times A_k at i, j, so the
weighted sum of the feature maps at that location.
Does it make sense? Does this make sense?
So I, I still have the,
the, the locations, I just moved,
I just moved the sums around, and what I can say is that the inner sum is
the value at location i, j of the class activation map;
it is a class score for this location i, j, and I'm summing it over all locations.
So, just by flipping what the average pooling was doing over the locations,
I can say that by weighting, using my weights,
all the activations in a specific location across all the feature maps,
I get the score of this position with regard to the final output.
[NOISE] The reason we're not losing it is because we know,
we know what the feature maps are.
Right. We know what they are and we know that they've been averaged exactly,
so we exactly can map it back.
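Written out, the sum swap is just this, using the lecture's four by four by six example and writing $w_k^c$ for the weight from the $k$-th averaged feature map to class $c$:

\[
S_c \;=\; \sum_{k=1}^{6} w_k^c \, a_k
\;=\; \sum_{k=1}^{6} w_k^c \cdot \frac{1}{16}\sum_{i,j} A_k(i,j)
\;=\; \frac{1}{16}\sum_{i,j} \underbrace{\sum_{k=1}^{6} w_k^c \, A_k(i,j)}_{M_c(i,j)}
\]

so the class activation map $M_c(i,j)$ is the per-location class score, and the average pooling only sums it over locations, it never throws away where things happened.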
Were you giving only one weight to each [inaudible]
Yeah. Because we, we assume that each filter that
generated these feature maps detects one, one specific thing.
So, like if, if this is the feature map it means assuming the filter was detecting dog,
that we're going to see just,
just something here meaning that there is a dog here and if there was a dog on
the lower part of the image we would also
have strong values there as well.
I, I say, if you wanna see more of the math behind it,
check the paper, but this is the idea.
You can flip the sums in
the global average pooling and show that you keep the spatial information.
The thing is you do the global average pooling,
but you don't lose the feature maps because you know where they
were in the output of the CONV, right?
So, you're not, you're not deleting this information.
So, the,
the activation a_k is the sum divided by 16, that's instead of taking the average, right, for that [inaudible]
Yeah.
[NOISE] Okay, let's move on and watch a cool video on how act - class activation maps work.
This video was from the authors of the class activation map paper.
And it's, uh, it's live so it's very quick.
So, you can see that the network is looking at this speed boat.
Okay.
So, what we've seen are methods that are roughly mapping back the output to
the input space and helping us understand
which parts of the input mattered most.
Now we're going to try to go deeper
in the network, to understand how
the network sees our world, not necessarily related to a specific input, but in general.
Okay. So, the pet shop now trusts your model
because you - you've used
class activation maps to show that the model is looking at the right place,
uh, but they got a little scared when you did that.
And they asked you to explain what the model thinks a dog is.
So, you have this trained network,
and you have an output probability.
Yeah, let me take one in the back. Yeah.
Um, what are some good ways to do this for non-image data?
Non-image data, that's a, that's a good question.
It's actually, so the reason we're seeing images is that they're the easiest to visualize,
um, but if you look at, let's say, time series data,
so, either speech or natural language,
the main way to visualize those is, uh,
with the attention method,
uh, are you familiar with that?
So, in the next set of modules that you're going to
start this week, and you will just study
in the next two weeks, you will see attention models,
which will tell you which part of a sentence was important,
let's say to output a certain word, assuming you're doing machine translation.
You know some languages,
they don't have a direct one to one mapping.
It means I might say,
uh, I love cats,
but in another language maybe [NOISE] this same sentence
will be 'cats I love' or something like that.
And you want an [NOISE] attention model to
show you which word 'cats' in the output was referring to in the input.
I think it's, it's, it's okay.
Okay, sorry guys [NOISE].
[NOISE] So, going back to the presentation.
Now, we're going to continue with images.
And so the new thing is the pet shop
is a little scared and asked you to explain what the network thinks a dog is.
What's the representation of dog for the network?
So, here, we're going to use a method that we've
already seen together called gradient ascent,
which is defining an objective,
that is technically the score of the dog,
minus a regularization term.
What the regularization term does
is it's - it's saying that x should look natural.
It's not necessarily L2 regularization,
it can be something else,
and we - we will discuss it in the next slide,
but don't think about it right now.
What we will do is we will compute
the back-propagation of this objective function all the way back to the input,
and perform gradient ascent on the input pixels.
So, it's an iterative process, it
takes longer than the class activation map.
And we repeat the process, forward propagate, back propagate,
and update the pixels and so on.
You guys are familiar with that?
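A minimal sketch of this gradient ascent loop, assuming a PyTorch `model` that returns pre-softmax scores and pixels kept in [0, 1]; the step size, number of steps, and the simple L2 penalty are placeholder choices.

```python
import torch

def class_visualization(model, class_idx, steps=200, lr=1.0, l2_reg=1e-3,
                        size=(1, 3, 224, 224)):
    """Gradient ascent on the pixels to maximize a pre-softmax class score.

    Objective: score[class_idx] - l2_reg * ||x||^2. The L2 term is the
    regularizer that keeps pixel values from blowing up.
    """
    model.eval()
    x = torch.rand(size, requires_grad=True)       # start from a random image
    for _ in range(steps):
        score = model(x)[0, class_idx]             # pre-softmax score
        objective = score - l2_reg * (x ** 2).sum()
        model.zero_grad()
        if x.grad is not None:
            x.grad.zero_()
        objective.backward()
        with torch.no_grad():
            x += lr * x.grad                       # ascent step on the pixels
            x.clamp_(0.0, 1.0)                     # keep pixels in the image range
    return x.detach()
```

The clamp at the end is the "keep values in a valid pixel range" constraint mentioned below, and swapping the L2 term for an occasional blur is the Yosinski-style regularization discussed next.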
So let's see what - what we can get.
So, actually, if you take an ImageNet classification network,
and you perform this on the classes of goose or ostrich,
you can see what the network is
looking at, or what the network thinks these classes look like.
So, for the goose you can recognize the shape
somehow, but these are - are still quite hard to interpret.
It's not super easy to see, and even worse here on the screen,
better on your computers.
But you can see,
you can see orange color for the flamingo.
It means that pushing the pixels to an orange color would
actually lead to a higher score of the class flamingo.
If you use a better regularizer,
you might get better pictures.
So, this is for one class, and this is for another one.
So, a few things that are interesting to see,
is that in order to maximize the score of a class,
what the network generates often contains many copies of the object.
It means that having 10 of them in the image leads to
a higher score of the class than having just one.
Talking about the regularization term.
It says that for the L2 regularization,
we don't want to have extreme values of pixels.
It doesn't help much to have one pixel with an extreme value,
one pixel with a low value and so on.
So, we're going to penalize
all the pixels so that all the values are around each other,
and then we can re-scale it between zero and 20 - 255 if you want.
One thing to notice is that the optimization doesn't
constrain the inputs to be between zero and 255.
You can go to any value,
while an image is stored with numbers between zero and 255,
so you might want to clip the pixel values.
This is another type of regularization.
One thing that led to beautiful pictures was what Jason Yosinski and his team did:
they forward propagated, back propagated,
updated the pixels, and blurred the image a little every few steps.
Because what - what is not useful for the visualization
is if you have high frequency variation between pixels,
it doesn't help to visualize
if you have many pixels close to each other that have many different values.
Instead, you want to have a smooth transition among pixels,
and this is another type of regularization called blurring, a Gaussian blur.
Okay? So, this method actually makes a lot of sense in - in - in scientific terms.
You're - you're maximizing an objective function that
gives you what the network sees as flamingo,
which would maximize the score of flamingo.
So, we call it also class model visualization.
So, does a more realistic class model visualization mean a more accurate model?
Um, does a more realistic class model visualization mean a more accurate model?
So, it's hard to map the accuracy of the model based on this visualization,
but it's a good way to sanity-check what the network has learned.
Yeah. We're going to - to see more of this later.
I think the most interesting part is actually on this slide is,
we - we did it for the class score,
but we could have done it with any activation.
So, let's say I stop in the middle of the network,
and I define my objective function to be this activation.
I'm going to back propagate from there all the way to the input and do the same gradient ascent.
It will tell me what this activation represents.
What does this activation fire for?
So, that's even more interesting I think than looking at
the inputs and then the output. Does that make sense?
That we could do it on any activation?
Yep.
[NOISE] Any questions on that? [NOISE]
Okay. So, now, we're going to do another trick which is data-set search.
It's actually one of the most useful, I think.
Not fast, but very useful.
So, the pet shop loved the previous technique,
and asks if there are other alternatives to - to
show what - what an activation in the middle of a network is thinking.
You take an image, forward propagate it through the network.
Now, what you're going to do is select a feature map,
let's say this one, where at this layer,
and the feature map is of size five by five by 256.
It means that the CONV layer here had 256 filters, right?
You are going to look at these feature maps and select probably,
uh, yeah, what you're going to do is select one of the feature maps, okay?
We select one out of 256 feature maps,
and we're going to learn - run a lot of data,
forward propagate it through the network,
and look which data points have had the maximum activation of this feature map.
So, let's say we do it with the first feature map.
We notice that these are the top five images that really fired this feature map,
like with high activation values.
What it tells us, is that probably this feature map is
detecting shirts. Could do the same thing,
let's say we take the second feature map,
and we look which data points have maximized the activations of this feature map,
out of a lot of data.
And we see that this is what we got,
the top five images.
Probably means that the other feature map seems to be activated when seeing edges.
So, the second one is much more likely to
appear earlier in the network obviously than later.
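A sketch of this data-set search, assuming a PyTorch `model`, a `layer` module inside it, and a `dataset` that yields (image, label) pairs; the forward hook is just one convenient way to grab the feature map.

```python
import heapq
import torch

def top_activating_images(model, layer, channel, dataset, k=5):
    """Find the k dataset images that most activate one feature map.

    layer: a module inside `model`; channel: index of the feature map.
    Returns (mean_activation, dataset_index) pairs, best first.
    """
    grabbed = {}
    def hook(_, __, output):                 # record the layer's output
        grabbed["out"] = output
    handle = layer.register_forward_hook(hook)

    best = []                                # min-heap keeping the k best
    model.eval()
    with torch.no_grad():
        for idx, (image, _) in enumerate(dataset):
            model(image.unsqueeze(0))
            score = grabbed["out"][0, channel].mean().item()
            heapq.heappush(best, (score, idx))
            if len(best) > k:
                heapq.heappop(best)
    handle.remove()
    return sorted(best, reverse=True)
```

The images you get back are then cropped to the receptive field of that feature map, which is what the next part explains.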
So, one thing that you may ask is,
do these images seem cropped?
Like I don't think that this was an image in the data-set,
it's probably a crop of an image from the data-set.
What do you think this crop corresponds to?
[NOISE]
Any idea how we cropped
the image, and why these are cropped?
[NOISE] Like, why - why didn't I show you the full images?
How was I able to show you the cropped?
[NOISE].
[inaudible]
That's correct. So, let's say we pick an activation,
an activation in the network.
This activation, for a given input, doesn't see the full image.
Right? Doesn't see it.
What it sees is a small crop of the input, its receptive field.
Does that make sense? So, let's look at another slide.
Here, we have a picture of units,
64 by 64 by 3.
It's our input. We run it through a five-layer ConvNet.
The volumes get smaller in height and width, but bigger in depth.
If I tell you what this activation is seeing.
If you map it back, you look at the stride and the filter size you've used,
you could say that this is the part that this filter is seeing.
This - this -, uh, this activation is seeing.
It means the pixel that was up there had no influence on this activation,
and it makes sense when you think of it.
You're - you're - the - the easiest way to think about it is looking at the - the top picks,
the - the - the top entry on the
You have the input image, you put a filter here.
This filter gives you one number, right?
This number, this activation only depends on this part of the image,
but then if you add a second layer,
the activation there will take the outputs of several filters,
so the deeper you go, the more part of the image the activation will see.
So, if you look at an activation in layer 10,
it will see much - a much larger part of the input
than an activation in layer one.
So, that's why - that's why probably the pictures that I showed here,
these ones are very small part
small crops of the image,
which means the activation I was talking about here is probably earlier in the network.
It sees a much smaller part of the input.
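The growth of this receptive field follows a standard recurrence; here is a tiny sketch, with a made-up layer list rather than the exact five-layer ConvNet on the slide.

```python
def receptive_field(layers):
    """Receptive field of one unit after a stack of conv/pool layers.

    layers: list of (filter_size, stride) pairs, first layer first.
    Recurrence: r grows by (f - 1) * jump, and jump multiplies by the stride.
    """
    r, jump = 1, 1
    for f, s in layers:
        r = r + (f - 1) * jump
        jump = jump * s
    return r

# A unit after one 3x3 conv sees 3 pixels; after a deeper stack it sees far more.
print(receptive_field([(3, 1)]))                          # 3
print(receptive_field([(3, 1), (2, 2), (3, 1), (2, 2)]))  # 10
```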
[inaudible] [NOISE]
Yeah, yeah. So, what you look at it which activation was maximum.
You look at this one and then you match this one back to crop. Does that make sense?
Okay, so here's units again,
up and same, this one would correspond more in the center of the image.
This
Okay cool. So, let's talk about deconvolution.
This is gonna be the hardest part of the lecture,
but probably helping with - with more intuition.
Remember, that was the generator we discussed before.
And we said that giving a code to the generator,
the generator is able to output an image.
So, there is something happening here that we didn't talk about,
is how can we start with a 100-dimensional code and
output a 64 by 64 by 3 image? That seems weird.
We could use, you might say,
a fully connected layer with a lot of parameters;
another option is to use a deconvolution.
So, the convolution compresses an image
into a smaller volume in height and width, deeper in - in depth,
while the deconvolution will do the reverse.
It will up-sample the height and width of an image.
So, that would be useful in this case.
Another case where it would be useful is segmentation.
You remember our case studies, uh,
for image segmentation. You take an image,
give it to an encoder network.
It's going to compress it.
So, it's going to lower the height and width.
The interesting thing about this compressed volume
is that it holds a lot of meaningful information.
But what we want ultimately,
is to get a segmentation mask that has the size of the input,
and this volume is much smaller.
So we need a deconvolution network to up-sample it.
So, deconvolutions are used in these cases.
Today the case we're going to talk about is visualization.
Remember the gradient ascent method?
We define an objective function by choosing an activation in the middle of the network,
and we want the objective to be equal to this activation, to find
the input image that maximizes this activation through an iterative process.
Now, we don't want to use an iterative process.
We want to use a reconstruction of this activation
directly in the input space by one backward pass.
So, let's say I select this feature map out of
256, sorry, the volume is 5 by 5 by 256.
What I'm going to do is,
I'm going to identify the maximum activation in this feature map.
It's this one, third column second row.
I'm going to set all the others to zero.
Just this one I keep it,
because it seems that this one has detected something.
Don't wanna talk about the others.
I'm going to try to reconstruct what it corresponds to in the input space.
So, I'm going to go through
the reverse operations.
I will unpool, I will un-ReLU,
let's say, that word doesn't really exist.
But un-ReLU and deconv.
And I will do it several times because this activation went through several of them.
So I will do it again and again until I see, oh,
this specific activation that I selected in
the feature map fired because it saw the ears of the dog.
And as you see, this image is cropped again.
It's not the entire image,
it's just the part that the activation has seen.
And if you look at where the activation is located on the feature map,
it makes sense that this is the part that corresponds to it.
We are going to define what we mean by unpool,
what do we mean by un-ReLU,
and what do we mean by de-conv.
Okay. Yes.
So, if we had [inaudible].
Would we have just gotten a reconstruction of the whole image?
So, the difference is, you mean if we don't set the other activations to zero?
Then this reconstruction would include the contribution of all of them.
It would be more noisy.
Doesn't, doesn't necessarily mean you will not get the full image,
because probably the other activations probably didn't even fire,
means they didn't detect anything else.
It's just that it's gonna - it's gonna add some noise to this reconstruction.
Okay, so let's talk about deconvolution a little bit on the board.
[NOISE] So, to start with deconvolution,
and you, you guys can take notes if you want.
We are going to spend about 20 minutes on the board now to discuss deconvolution, okay?
[NOISE] To understand the deconvolution,
we first need to understand the convolution.
We've seen it, uh, from a
computer science perspective, but actually,
what we are going to do here is we are going to frame
the convolution as a matrix multiplication.
You're going to see that it's actually possible.
So let's start with a 1D conv.
For the 1D convolution,
I will take an input x which is of size 12,
x1, x2, x3, x4,
x5, x6, x7, x8.
So, 8 values plus a padding of 2 on each side,
which gives me the 12 that I mentioned.
So, the input is a one-dimensional vector which has padding of two on both sides.
I will give it to a layer that will be a 1D conv.
And this layer would have only one filter.
And the filter size will be four.
We will also use a stride equal to two.
[NOISE] So, my first question is,
what's the size of the output?
Can you guys compute it quickly
and tell me what's the size of the output.
[NOISE]. Input size 12,
[NOISE] filter of size four,
stride of two, padding of two.
Five, yeah I heard you, yeah.
So, remember, use the formula: (n plus 2p minus f) over s, plus 1, which is (8 plus 4 minus 4) over 2, plus 1, equals 5.
So, what I'm going to get is Y1,
Y2, Y3, Y4, Y5.
[NOISE] So, I'm going to focus on this specific convolution for now.
And I'm going to show now that we can define it as,
as a matrix multiplication.
So, the way to do it is,
I guess the easiest way is to write the system of equation
that is underlying here. What is Y1?
Y1 is the filter applied to the four first values here. This makes sense?
So, if I define my filter as being y W1,
W2, W3, and W4,
what I'm gonna get is that Y1 equals W1 times
zero plus W2 times zero plus W3 times x1 plus W4 times x2.
This makes sense? Just the convolution,
Y2 is going to be same thing,
but with a stride of two, going two down.
So, it's going to give me W1 times x1 plus W2 times
x2 plus W3 times x3 plus W4 times x4.
Correct? Everybody is following?
No.
We will do it for all the y's until Y5,
and we know that Y5 is the dot product between
the filter and the four last numbers here, summing them.
So, it will give me W1 times x7 plus W2 times
x8 plus W3 times zero plus W4 times zero.
[NOISE]
Okay. Now what we're going to do is to try to write down
y as W times x, a matrix multiplication.
We need to find what this w matrix is.
And looking at this system of equation,
it seems that it's not impossible. So let's try to do it.
I will write my Y vector here, Y_1,
Y_2, Y_3, Y_4, Y_5.
And I will write my matrix here and my vector x here.
So first question is,
what do you think will be the shape of
this w matrix? Um?
5 by 12.
5 by 12. Correct. We know that this is 5 by 1,
this is 12 by 1,
so of course w is going to be 5 by 12.
Right?
So, now, let's fill in the x vector: 0,
0, x_1, x_2, x_3, all the way to x_8, then 0, 0.
Can you guys see in the back or no?
Yeah? Okay. Cool. Ah, so,
I'm going to fill in this matrix regarding this system of equation.
I know that the Y1 would be w_1 times 0,
w_2 times 0, w_3 times x_1, w_4 times x_2.
So this vector is going to multiply the first row here.
So I just have to place my ws here.
w_1 will come here, multiply 0,
w_2 will come here, w_3 would come here,
and w_4 would come here.
And all the rest would be filled in with 0s, right?
I don't want any other terms in there.
How about the second row of this matrix?
I know that Y_2 has to be equal to this second equation.
And I know that it's going to give me w_1x_1 plus w_2x_2 plus w_3x_3 plus w_4x_4.
x_1 is the third input on this vector, third - third entry.
So, I would need to shift what I had in
the previous row with a stride of two, it will give me that.
I should get the second equation up there.
And so on and you understand what happens, right?
This pattern will just shift with the stride of two on the side.
So, I would get zeros here and I will get my w_1,
w_2, w_3, w_4 and then zeros.
And all the way down here.
And all the way down here,
what I will get is w_4,
w_3, w_2, w_1 and zeros.
So the only thing I wanna mention here
is that the convolution can be
framed as a matrix multiplication.
So why did you have the weights
on the left side, that's when multiplying the - [NOISE]
For the top row, why the zeros are on the right side?
Yes.
Because I don't want Y hat - Y_1 to be dependent on x_3 to x_8.
So I want these entries to be zero.
Okay. Oh, because of the stride and the window size.
Okay.
Thank you.
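Here is the same construction in code, a small numpy sketch with made-up filter values, checking that the W matrix built row by row gives the same y as sliding the filter with a stride of two.

```python
import numpy as np

def conv_as_matrix(w, n_in=12, stride=2):
    """Build the 5x12 matrix W so that W @ x_padded equals the 1D convolution."""
    f = len(w)
    n_out = (n_in - f) // stride + 1              # (12 - 4) / 2 + 1 = 5
    W = np.zeros((n_out, n_in))
    for i in range(n_out):
        W[i, i * stride: i * stride + f] = w      # the filter, shifted by the stride
    return W

w = np.array([1.0, 2.0, 3.0, 4.0])                          # w1..w4 (made up)
x = np.concatenate([[0, 0], np.arange(1.0, 9.0), [0, 0]])   # 2 zeros, x1..x8, 2 zeros
W = conv_as_matrix(w)
y = W @ x
# Same result as sliding the filter directly with stride 2:
y_direct = np.array([w @ x[i: i + 4] for i in range(0, 9, 2)])
assert np.allclose(y, y_direct)
```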
So why is this important for
the intuition behind the deconvolution and the existence of the deconvolution?
It's because if we manage to write down y equals w times x,
we probably can write down x equals w inverse times
y, if w is an invertible matrix, and we can define this
to be our deconvolution.
what's the shape of this new matrix?
Yes.
We have 12 by 1 on one side,
5 by 1 on the other, it has to be 12 by 5. So it's flipped compared to
w. So one thing we're going to do here is we're going to make an assumption.
First assumption is that w is an invertible matrix.
And on top of that, we're going to make a stronger assumption which
is that w is an orthogonal matrix.
And without going into the details here,
same as for other results we proved in this class,
we made some assumptions that are not always true.
This assumption is not going to be always true.
One, one intuition that you can have is,
if I'm using a filter that is,
ah, assume the filter is an
So like, ah,
zero, zero,
In this case,
Why? A matrix that is
I dot-product them together, it should give me zero.
Same with the rows, you can see it.
So, what's interesting is that, ah,
if the stride was four,
there will be no
It would give me an
Here a stride is two but if I replace this w_1 by
zero, zero,
so plus one, zero, zero,
you can see that the
The zeros will multiply the ones and the ones
will multiply the zeros, it gives me a zero
So, this is a case where it works.
Practices doesn't always work.
The reason we're making this assumption is because we wanna make a reconstruction, right?
So, we wanna be able to have this w minus one, this,
this, this
But at, at first-order
we can assume that
even if this assumption is not always true.
In the case where w is
I know that the
Or another way to write it,
is that for
w
So, what it tells me is that x is going to be w
So, let's see what we get from that.
Let me write down the Menti code.
So, let's say now we have our x and we wanna reconstruct it,
or rather, we will have our y and we want to generate our x using this method.
So, I would, what I would write is to understand the 1D deconv.
We can use the following illustrations,
where we have x here,
which is zero, zero, x_1,
x_2, x_3, all the way down to x_8.
Okay? And I will have my w transpose matrix here,
multiplying my y vector,
Y_1, Y_2, Y_3, Y_4, and Y_5 here.
And so, I know that this matrix will be the transpose of the w we built.
So, I can just write down the transpose.
The transpose will be w_1, w_2, w_3, w_4.
Okay? I will shift it down with a stride
of two and so on.
[NOISE]
And this whole thing will be W Transpose.
So, th - the small issue here is that this, in
practice, is going to be very similar to a convolution,
but it's going to be a tiny little bit different in terms of implementation.
Another question I might ask is,
how can we do the same thing with the same pattern as we have here?
It means the stride is going from left to right,
instead of going from up to down.
I'm going to introduce that with a technique called sub-pixel convolution.
And for those of you who read papers on segmentation and on visualization,
you will run into this term. So, let's see how it works.
I just wanna do the same operation,
but instead of doing it with a stride going from top to bottom,
I want to do it with a stride going from left to right.
O - one, one thing you wanna,
you wanna notice here,
is that, uh, the two lines that I wrote here are cropped.
And the reason is because we're using a padded input.
Here, we will just crop the two top lines.
And same for the two last lines.
They will be cropped. Look at that.
W1 will multiply Y1,
and this one will multiply Y2 and so on.
So, this would generate values at the padded positions,
but I don't want that to happen because I wanna get the padded zeros here.
So, I will just crop that.
In this matrix it's actually going to be smaller than it seems,
and is going to generate my X1 through X8 and then I will
pad the top values and the bottom values.
Okay, just the height.
So, let's look at the sub-pixel convolution. I have my input.
And I will do something quite fun.
I will perform a sub-pixel operation on Y. What does it mean?
I will insert zeros almost everywhere.
I will insert them, and I will get 0,
0, Y1, 0, Y2,
0, Y3, 0, Y4,
0, Y5 and 0, 0.
Even more, one more 0 here, one more 0 here.
So, this vector is just the vector Y with
some zeros inserted around it and also in the middle between the elements of Y.
Now, why is that interesting?
It's interesting because I can now write down my convolution by flipping my weight.
[NOISE]
So, let me explain a little bit what happened here.
What we wanted is,
in order to be able to efficiently compute the deconvolution
the same way as we've learned
We wanted to have the weights
scattered from left to right with a stride moving from left to right.
What we did, is that we used a sub-pixel version of Y by inserting zeros,
and we divided the stride by two.
So, instead of having a stride of two as we had in our convolution,
we have a stride of one in our deconvolution.
So, notice that I shift my weights from one at every step,
when I move from one row to another.
Second thing is, I flipped my weights.
I flipped my weights. So, instead of having W1, W2,
W3, W4, now I have W4, W3, W2, W1.
And what you could see is looking at that,
first, look at this row,
the first row that is not cropped.
The result of the dot product of this row with this vector is going to be Y1 times W3,
plus Y2 times W1.
Yeah? Now, let's look what happened here.
I look at my first row here,
the dot product of this first row with my Y here is going to be - sorry,
sorry, we - these two are cropped as well.
And same here. So, looking at my first non-cropped row
here, the dot product with this vector is going to be W3 times Y1,
plus W2 - sorry, plus W1 times Y2.
So, exactly the same thing as I got there.
So, these two operations are exactly the same operations. They're the same thing.
You get the same results two different way of doing it.
One, is using a weird operation with strides going from top to bottom.
And the second one is exactly a convolution. This is a convolution.
Convolution plus flipped weights plus inserted zeros,
And on top of that,
padding here and there.
So, this was the hardest part.
Okay? Does it give you more intuition on the deconvolution here?
You know now how a convolution can be framed as
a matrix multiplication.
And you know also that under these assumptions,
the way we will compute the deconvolution is by flipping the weights,
dividing the stride by two, and inserting zeros.
If we just do that, we're doing the deconvolution.
For the implementation,
the way you wanna remember it is:
just flip the filter,
insert zeros sub-pixel, and finally divide the stride.
And that's the de-convolution.
So, super complex thing to understand but this is the intuition behind it.
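The whole argument fits in a short numpy check: reconstructing with W transpose gives exactly the same vector as inserting zeros between the y's, padding, and sliding the flipped filter with a stride of one. The filter and y values here are made up.

```python
import numpy as np

w = np.array([1.0, 2.0, 3.0, 4.0])             # filter w1..w4
y = np.array([5.0, -1.0, 2.0, 0.5, 3.0])       # conv output, size 5

# The 5x12 convolution matrix from before: each row is the filter, shifted by the stride.
W = np.zeros((5, 12))
for i in range(5):
    W[i, 2 * i: 2 * i + 4] = w

# Route 1: multiply by the transpose (assuming W is roughly orthogonal,
# so W.T stands in for W^-1).
x_rec = W.T @ y                                 # reconstruction, size 12

# Route 2: sub-pixel convolution. Insert zeros between the y's, pad,
# then slide the *flipped* filter with a stride of one.
y_up = np.zeros(2 * len(y) - 1)
y_up[::2] = y                                   # y1 0 y2 0 y3 0 y4 0 y5
y_pad = np.pad(y_up, (3, 3))
w_flip = w[::-1]
x_sub = np.array([w_flip @ y_pad[i: i + 4] for i in range(12)])

assert np.allclose(x_rec, x_sub)                # the two routes give the same vector
```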
Now, let, let's try to have an intuition of how it would work in two-dimension.
Uh, let me write it down.
The sub-pixel convolution, we already have that [inaudible] [NOISE]
Why do we use that?
Yeah.
Because in terms of implementation this is the same as what we've been using here.
It's, it's very similar,
while this one is another implementation.
So, you could do both, it's the same,
it is the same operation.
But in practice this one is easier to understand because it,
it's exactly the same operation as the convolution,
with flipped weights and inserted zeros.
That's why I wanted to show that. Yeah.
So, uh, what, what happens when,
uh, the assumption [OVERLAPPING].
When the assumption doesn't hold?
Yeah.
So, then the reconstruction is not going to be exact,
but what we want is to be able to see a reconstruction.
And if we use this method we will still see a reconstruction.
In practice, if we had really W minus one, it would be exact.
So, uh, let me go over the 2D,
uh, the 2D example.
We are going to go a little over time because we have
two hours technically for - one hour and 50 minutes,
and uh, and let me go over the 2D example.
And then we will answer this question on why we need to make this assumption.
So, here is the interpretation of the 2D deconvolution.
Let me write it down here.
[NOISE]
The intuition behind the 2D deconv is, I get my inputs.
Which is five by five,
and this I call it x. I forward propagate it using a filter of size two-by-two,
in a conv layer,
and a stride of two.
This is my convolution. What I get.
So, if you do five minus two,
plus the padding which is zero,
divided by two,
I forgot the plus one actually here,
plus one, and you floor it.
So - so, five minus two divided by two gives you,
uh, one point five, you floor it, plus one.
Um, so it will give you two by two,
yeah, two by two.
A y of two by two. That's what you get.
What you're going to do here,
is you're going to deconvolve it.
In order to up-sample it,
in order to get back to the five-by-five,
you're going to use a stride of one.
And what we said is that we need to divide this stride by two, right?
So, we need a stride of one,
and the filter will be the same, two-by-two.
And you remember that what we've seen,
is that the filter is the same.
It's just that it's going to be flipped.
So, you will use a filter of two-by-two, but flipped.
We hope to get a five-by-five output,
which is going to be our reconstructed input.
And the way we're going to do it,
is this is the intuition behind it. Yeah.
Sorry, is it two by two? [NOISE].
Five minus two divided by two. Yeah, it's two by two.
Okay. Two by two.
Thanks. [OVERLAPPING]. Two by two.
Five-by-five here.
That's what we hope to get.
The way we will do it, is we will take the filter,
its size is two by two.
We will put it here.
And we will multiply all the weights of this filter by y11.
All the weights will be multiplied by this one value.
So, I will get four values here,
which are going to be w4 times y11,
w3 times y11 and so on.
Now, I will shift this with a stride of one.
And I will put my filter again here.
And I will multiply all the entries by y12 and so on.
And you see that this entry has an
So, it will, it will be updated at every step of the convolution.
It's not like what happened in the forward pass.
So, this is the intuition behind the 2D convolution.
In 3D, you would have,
uh, a volume here.
So, your filter is going to be a volume.
What you're going to do is you're going to put the volume here,
multiplied by y11.
you would put it again on top of it and multiply
by y11 all the weights of the filter and so on.
It's a little complicated,
but this is the intuition behind deconvolution.
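The "place a scaled copy of the filter and sum the overlaps" intuition can be written directly; a small numpy sketch, not tied to the exact sizes on the board.

```python
import numpy as np

def deconv2d(y, w, stride=1):
    """Up-sample y by placing a copy of the filter, scaled by each entry of y.

    y: (h, w) input to the deconvolution; w: (f, f) filter.
    Overlapping placements are summed, which is why one output cell can be
    updated several times.
    """
    h_in, w_in = y.shape
    f = w.shape[0]
    out = np.zeros(((h_in - 1) * stride + f, (w_in - 1) * stride + f))
    for i in range(h_in):
        for j in range(w_in):
            out[i * stride: i * stride + f,
                j * stride: j * stride + f] += y[i, j] * w   # scaled filter, added in place
    return out

y = np.arange(4.0).reshape(2, 2)       # a 2x2 input
w = np.ones((2, 2))                    # a 2x2 filter
print(deconv2d(y, w, stride=1).shape)  # (3, 3): with stride 1 the placements overlap
print(deconv2d(y, w, stride=2).shape)  # (4, 4): no overlap
```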
Okay, let's get back to the lecture.
I'm going to take one question here if you guys need
[NOISE] Don't worry if you don't understand deconvolution truly.
The important part is that you get the intuition here and you understand how we do it.
So, let me make a comment.
[NOISE] Why do we need to make this assumption and do we need to make it?
[NOISE] When we want to visualize,
we need to make this assumption because we don't want
to train new weights for the deconvolution.
What we know is that the activation we selected here on
the feature map is - has gone through the entire pipeline of the ConvNet.
So, to reconstruct it, we need the filters that were actually used.
We need to pass them to the deconvolution and use their transpose.
If we're doing the segmentation,
like we talked about for the live cell segmentation,
we don't need to do this assumption.
We're just saying that this is a procedure that is a deconvolution,
and we will train the weights of the deconvolution.
So, there is no need to make this assumption,
it's just that we have a technique that is dividing the stride by
two and inserting zeros, and then boom,
we get an operation
that is an up-sampling, and we train its weights.
So, there's two use cases.
One where you use the weights and one where you don't.
In this case, we don't want to train new weights,
we wanna use the network's weights. So let's see.
Let's see a - a more visual version of the deconvolution.
So, we do the sub-pixel image.
This is my image, four by four,
I insert zeros and I pad it,
I get a nine by nine image.
I have my filter like that.
And this filter will slide over the input.
I will - it will move with a stride of one,
so I will place it on my input,
and at every step I will perform a convolution operation.
I will get a value here.
The value is blue because as you can see the weights that
affected the output were only the blue weights.
I would use a stride of one, boom.
Now, the weights that affect my output are the green ones and so on.
And I would just continue.
And now one step down.
I see that the weights that are impacting this output value are the purple ones.
So, I would put a purple square here and so on.
So, I just do the convolution like that.
What's interesting is that the values that are blue in my six by six output,
were generated only using the blue values of the filter,
the blue weights in the filter.
The ones that are green were only
used-were only generated using the green values of my filter.
So, actually this up-sampling,
or deconvolution, could have been done with four separate convolutions,
with the blue weights, green weights,
purple weights and yellow weights.
And then, just - just interleave the results so that together they form the output.
Just take the output of each of these convs and mix them to give out a six by six output.
The only thing you need to know is we have an input four by four
and we get an output six by six. That's what we wanted.
We wanted to up-sample the four by four.
We can do that with this deconvolution.
So, let's see what happens now.
We understood what, uh,
what deconv was doing.
So, we're able to deconv.
What we need to do is also to unpool and to unReLU.
Fortunately, it's easier than the deconv.
So, we're not gonna do board work anymore.
So, let's see how unpool works.
If I give you this, uh,
input to the pooling - to the max pooling layer.
The output is obviously going to be this one,
42 is the maximum of these four numbers.
Assuming we're using a two-by-two filter with stride of two,
six is the maximum of the red numbers and seven the - the orange ones.
Now, question. I give you back the outputs and I tell you, give me the input.
Can you give me the input or no?
No.
No, why - why? [NOISE] You only keep the maximum.
So, you - you lost all the other numbers.
I don't know anymore the zero,
one and minus one that were the red numbers here
because they didn't pass through the maximum.
So, max pool is not invertible,
from a mathematical standpoint.
What we can do is remember where the maximums were.
How can we do that? [NOISE].
Spread it out.
Spread it out. That's a good point, we could spread out the six among the four values.
That would be an approximation.
A better way, if we manage to remember where the maximums came from,
is to cache their positions.
We cache them
using a matrix that is very easy to store,
of zeros and ones.
And we pass it to the unpooling.
And now we can put the values back where they belong,
because we know where 6 was,
we know where 12 was,
we know where 42 was and 7 was.
But it's still not the exact input.
Think about maxpool during backpropagation.
It's exactly the same thing.
These numbers 0, 1, - 1,
they had no impact on the loss function at the end,
because they didn't pass through the max.
So, actually with the switches you can have the exact reconstruction of what mattered,
because the other values didn't affect the loss during the forward pass.
That - that make sense?
Okay. So, this is maxpooling,
unpooling, unmaxpooling.
And we can use it with the switches. We can reconstruct.
Why not just cache the whole input?
Yeah, why don't we just cache everything?
We could - could cache the entire thing.
But in terms of memory, for reasons
of efficiency we will just use this switching because it's enough.
But not for unpooling though.
Yeah, yeah, for unpooling you're right, we could cache everything.
But then it's cheating, like you - you kept it, it's like, just give it back.
Okay.
So, what we need to do, in fact,
is to pass the switches and the filters back
to the unpooling deconv in order to reconstruct.
Switches are the cached positions of the maximums,
and filters are the filters that I will transpose under this assumption on the board.
Okay. And so on and so on,
and I get my reconstruction.
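In PyTorch this pairing already exists: max pooling can return the switches, and MaxUnpool2d puts the kept values back where they came from. A tiny sketch with numbers similar to the board example.

```python
import torch
import torch.nn as nn

# Max pooling that remembers the switches (positions of the maxima),
# and unpooling that puts each kept value back at its original position.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.tensor([[[[ 1.0,  0.0,  5.0,  6.0],
                    [ 3.0, 42.0, -1.0,  2.0],
                    [12.0,  9.0,  0.0,  4.0],
                    [ 2.0,  1.0,  7.0,  3.0]]]])

pooled, switches = pool(x)           # 2x2 maxima (42, 6, 12, 7) + where they were
restored = unpool(pooled, switches)  # maxima return to their positions,
print(restored[0, 0])                # every other entry is zero
```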
I just need to explain the un-ReLU now.
I give you this input to a ReLU.
All the negative numbers are going to be set to zero,
and the others are going to be kept.
Now, let's say I'm doing a backward pass through this ReLU.
What do I get if I give you that?
This is the gradient coming from the next layer,
and I'm asking you what are the gradients with respect to the input.
[NOISE] How does the ReLU behave in the backward pass?
[NOISE].
Zeros? [NOISE] Which ones are zero?
Um, the negative.
The negatives are going to be zero.
The negatives in this yellow matrix are going to be zeros during the backward pass.
Are you guys sure? [NOISE] Think always
about what was the influence of the input on
the loss function and you will find out what was the backpropagation.
Look at this number. This number here, - 2.
Did this number have,
the fact that it was - 2,
did it have any influence on the loss function?
No, it could have been - 10,
it could have been - 20.
It's not gonna impact the loss function.
So, what do you think should be the number here?
Zero.
Zero. Even if the number that is coming back,
the gradient is 10.
So, what do you think should be the ReLU backward output?
[NOISE]
Same idea as max-pooling.
What we need to do is to remember the switches.
Remember which of these values had an impact on the loss.
We pass the switches,
all these values here that are kind of a y, you know this is a y.
All these ones had no impact on the loss function.
So, when you backpropagate,
their gradient should be zero,
it doesn't matter what comes from the next layer.
It's not gonna make the loss go down.
So, these are all zeros and the rest they just pass.
Why do they pass with the same value?
Because ReLU for positive numbers was 1.
So, this number 1 here that passed the ReLU during the forward pass,
it was not modified.
Its gradient is going to be 1.
Now, in this visualization method,
we're not going to use ReLU backward.
We're going to use something we call ReLU DeconvNet let's say.
The reason we're not, the intuition behind why we're not
using ReLU backward is because what we're interested
in is to know which pixels of the input positively affected the,
the activation that we're talking of.
So, what we're going to do is that we're just going to apply a ReLU to the signal coming back.
We're not going to do the exact ReLU backward.
Another reason is when we reconstruct,
we wanna have the minimum influence from the forward pass,
because we don't really want our reconstruction to depend on the specific input image.
We would like our reconstruction to
just look at this activation and reconstruct what happened.
So, that's what we're going to use.
Again, this is an approximation,
and it's not going to be an exact inverse.
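The difference between the two is easiest to see on a tiny example; the numbers are made up.

```python
import torch

forward_input = torch.tensor([[ 1.0, -2.0],
                              [-0.5,  3.0]])
signal_coming_back = torch.tensor([[-4.0, 10.0],
                                   [ 2.0, -1.0]])

# Regular ReLU backward: keep the positions where the *forward* input was
# positive (the switches); only those values influenced the loss.
relu_backward = signal_coming_back * (forward_input > 0).float()
# tensor([[-4.,  0.],
#         [ 0., -1.]])

# "DeconvNet" ReLU used for the visualization: just apply a ReLU to the signal
# coming back, so only positive contributions are propagated, independently
# of what happened in the forward pass.
relu_deconvnet = signal_coming_back.clamp(min=0)
# tensor([[ 0., 10.],
#         [ 2.,  0.]])
```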
Okay.
So we can unpool, un-ReLU, deconv, and find out what this activation corresponds to.
It took time to understand it,
but it's super fast to do now.
It's just one pass, not an iterative process.
We could do it with every layer.
So, let's say we do it with the first block of conv, ReLU, maxpool.
I go here. I choose an activation.
I, I, I, I find the maximum activation.
I set all the others to 0.
I unpool, un-ReLU, deconv, and I find out that
this specific activation was looking at edges like that.
So, let's visualize now
what's happening inside the network.
So, all the visualization we're going to see now can be found in
Matthew Zeiler's and Rob Fergus'
paper, Visualizing and Understanding Convolutional Networks.
I'm going to explain what they correspond to, but check,
check out their papers if you want to understand more into detail.
So, what happens here is that on,
on the top left, you have nine pictures.
These are the cropped pictures of the data set that
activated the first filter of the first layer maximum.
So, we have a first filter on the first layer and we run
all the data sets and we recorded what are the main pictures that activate this filter.
These were the main ones. And we did the same thing for
all the filters of the first layer and there are nine times nine of them.
There are a lot of them, I think.
In the bottom here you have the filters,
which are the weights that were plotted.
Just take the filter, plot the weights.
This is th - this is important only for the first layer.
When you go deeper into your network,
the filter itself cannot be interpreted.
It's super hard to understand it.
Here, because the weights are directly multiplying the pixels,
the first layer weights can be interpreted directly.
For example, let's look at the third one,
the third filter here on the first row.
The third filter has weights that are kind of
like an oriented edge detector.
And in fact if you look at the images that maximally activated
the feature map corresponding to this filter,
they're all like cropped images that contain edges with that orientation.
That's what happens. Now, the,
the deeper we go, the more fun we have.
So let's go. Results on layer two.
What's happened here is they took 50,000 images,
they forward propagated them through the network.
They recorded which image is the maximum,
the one that's maximized the activation of
the feature map corresponding to the first filter of layer two,
second filter and so on for all the filters.
Let's look at one of them.
We can see that's, okay,
we have a circle on this one.
It means that this,
the filter corresponding
[NOISE] to this feature map has been activated through probably a wheel or something like that.
So, that the image of the wheel was the one that maximized
the activation of this one and then we use the deconv method to reconstruct it.
Any questions on that? Yeah.
What if the
Good question, yeah. What if the
You would use the same,
the same type of method and you would try to
Okay, let's go a little deeper.
forward propagate all the images of the data set,
find the nine images that are
the maximum activate - that lead to the maximum activation of the first filter.
These are plotted on top here.
What you can see is like for this filter,
that is the sixth row first filter,
the features are more invariant.
So, this filter actually was activated to many different types of circles,
it's still activated although the circles were different sized.
Can go even deeper up third layer.
What's interesting is that the deeper you go,
the more complexity you see.
So, at the beginning we were seeing only edges,
now we see much more complex figures.
You can see a face here,
in this - in this entry.
It means that this filter activated for when
it sees this - when it has seen a data point that had this face,
then we reconstructed it with the deconv method and
cropped it on the face.
Uh, the face is kind of red,
it means that the more red it was,
the more activation it led to.
And same top nine for layer three.
So, these are the nine images that actually led to the face.
These are the nine images that maximize the, the,
the activation of the feature map corresponding to that filter and so on.
So, here is a very funny video.
[inaudible] [NOISE].
Can you stand up? [NOISE].
And realization layers,
we can switch back and forth between showing
the actual activations and showing images
So, he's - he's giving his own image to the network right now.
By the time we get to the fifth convolutional layer,
the features being computed are much more abstract.
So, these are the activations of that layer.
For example, this unit seems to respond to faces.
We can further investigate this unit in a couple of ways.
First, we can synthesize images that maximize its activation,
using new regularization techniques that are described in [OVERLAPPING].
Our paper, the one we talked about.
These are shown here.
[OVERLAPPING] We can also plot the images from the training set that activate this unit the most,
as well as pixels from those images most responsible for
the high activations,
computed via the deconvolution approach.
This feature responds to multiple faces in different locations.
And by looking at the deconv,
we can see that it would respond more strongly if we had even darker eyes and
We can also confirm that it cares about the head and shoulders,
but ignores the arms and
We can even see that it fires to some extent for cat faces.
Using back-prop or deconv,
we can see that this unit depends most strongly
on a couple of units in the previous layer conv4,
and about a dozen or so in conv3.
So they're trying to track back track where - which
So, let's look at another unit.
So, what is this unit doing?
From the top nine images,
we may conclude that it fires for different types of clothing,
but examining the
detecting not clothing
In the live plot, we can see that it's activated by my shirt and
smoothing out half of my shirt causes that half of the activations to decrease.
Finally, here's another interesting neuron.
This one has learned to look for printed text in a variety of sizes, colors, and fonts.
This is pretty cool because we never asked
the network to look for text.
The only labels we provided were at the very last layer.
So, the only reason the network learned features like texts and faces in
the middle was to support final decisions at that last layer.
For example, the text detector may provide good evidence of
a book seen on edge, and detecting many books
next to each other might be a good way of detecting a bookcase,
which was one of the categories we trained the net to recognize.
In this video, we've shown some of the features of
the DeepViz toolbox and what
we've learned by using it. You can download it and try it yourselves.
Yeah, so they had a toolbox,
which is exactly what you visualized here,
and you could test it yourself. It
takes time to get it to run,
but it's worth it if you want to visualize all the activations.
Okay. So, uh, let's go quickly.
We'll spend about three minutes on the optional Deep Dream one because it's fun.
And yeah, feel free to jump in and ask questions.
So, the Deep Dream one is, uh,
implemented by Google in a blog post.
The idea here is to generate images using this knowledge of
visualization, and how they do that is quite interesting.
They would take an input,
forward propagate it through the network, and at
a specific layer that we call the dream layer,
pick the activations and set the gradient to be equal to these activations.
They set the gradient at this layer, and then they back propagate.
So, earlier, what we did is that we defined a new objective function
that was equal to an activation, and we tried to maximize this objective function.
Here, they're doing it even more strongly.
They take the activations and they set the gradient equal to them.
And so the stronger the activation,
the stronger it's going to become later on, and so on.
So, they are trying to see what the network is
activating for and increase this activation even more.
So, forward propagate the image,
set the gradient of the dreaming layer to be equal to its activation,
then back propagate all the way back to the input and update the pixels of the image.
Do that several times, and every time the activations will change.
So, you have to set the new activations again to
be the gradients of the dream layer and back propagate,
and ultimately, you will see things happening.
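As a rough sketch of that loop, assuming a PyTorch model; the dream_layer handle, the step size, and the number of steps are illustrative choices, not from the lecture:

import torch

def deep_dream(model, dream_layer, image, steps=20, lr=0.01):
    """Amplify whatever `dream_layer` already responds to in `image`:
    the gradient flowing back from that layer is set to its own activation."""
    stored = {}
    handle = dream_layer.register_forward_hook(lambda m, inp, out: stored.update(act=out))
    img = image.clone().requires_grad_(True)
    model.eval()
    for _ in range(steps):
        model(img)                                   # forward pass through the network
        act = stored["act"]
        # Objective whose gradient at the dream layer equals the activation itself:
        loss = 0.5 * (act ** 2).sum()
        model.zero_grad()
        if img.grad is not None:
            img.grad.zero_()
        loss.backward()                              # back-propagate all the way to the pixels
        with torch.no_grad():
            # Gradient ascent on the pixels, normalized for stability.
            img += lr * img.grad / (img.grad.abs().mean() + 1e-8)
    handle.remove()
    return img.detach()

Squaring and summing the activation is just a convenient objective whose gradient at the dream layer is the activation itself, which is the "set the gradient equal to the activation" trick described above.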
So, it's hard to see here on the screen,
but you would have a pig appearing here.
You'd have like a tree somewhere there, and some animals,
and a lot of animals are going to start appearing in this cloud.
It's interesting because it means,
let's say, you see this cloud here?
If the network thought that this cloud looked a little bit like a dog,
then one of the feature maps, the one
generated by the filter that detects dogs, would activate a little bit.
Because we set the gradient to be equal to the activation,
it's going to increase the appearance of the dog in the image, and so on.
And then you will see a dog appearing after a few iterations.
So, it's quite fun.
So, you see a pig-snail,
it's kind of a pig with a snail's carapace.
Camel-bird, dog-fish.
I'd advise you to like look at this on
the slides rather than on the screen, but it's quite fun.
And same, if you give that type of image,
you would see that, because the network thought there was a tower there a little bit,
you will increase the network's confidence in the fact that there is
a tower by changing the image, and the tower would come out.
And so on, it's quite cool.
Uh, yeah, and if you dream at lower layers,
obviously you will see edges or patterns appearing,
because the lower layers tend to detect an edge, and you will
increase the confidence in that edge, so it will create an edge on the image.
This is a fun video I have, Deep Dream on a video.
[MUSIC].
So, everything that the
[MUSIC] And what's funny
is that there are so many animals in the video.
And the reason is [MUSIC].
Gets too
[LAUGHTER] So, one insight that is fun about it is,
uh, if the network - and this is not only for Deep Dream,
it's mostly for gradient ascent.
Let's say we have an output score for a dumbbell,
and we define our objective function to be this dumbbell score,
and we try to find the image that
maximizes this score.
What's interesting is that the network thinks that
the dumbbell always comes with an arm attached to it.
Not only the dumbbell. And you can see it here, you see the hands.
And the reason is it has never seen a dumbbell alone.
So, probably in ImageNet there is no picture of a dumbbell
alone in a corner and labeled as dumbbell.
But instead, it's usually a human trying to push hard.
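A minimal sketch of that kind of class visualization by gradient ascent, assuming a PyTorch classifier whose forward pass returns pre-softmax scores; the class index, the step size, and the simple L2 regularizer are illustrative choices, not the exact recipe from the lecture:

import torch

def visualize_class(model, class_idx, steps=200, lr=0.5, weight_decay=1e-4,
                    shape=(1, 3, 224, 224)):
    """Gradient ascent on the input: find an image that maximizes the
    pre-softmax score of `class_idx`, e.g. the dumbbell class."""
    img = torch.zeros(shape, requires_grad=True)      # start from a blank image
    model.eval()
    for _ in range(steps):
        scores = model(img)                           # pre-softmax scores
        objective = scores[0, class_idx] - weight_decay * (img ** 2).sum()
        model.zero_grad()
        if img.grad is not None:
            img.grad.zero_()
        objective.backward()
        with torch.no_grad():
            img += lr * img.grad                      # ascend the class score
    return img.detach()

The image this produces is what reveals the bias: for a class like dumbbell, the optimized image tends to include the arm as well, because that is what the training data showed.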
Okay. So, just to summarize,
we are now able to answer all the following questions.
What part of the input is responsible for the output? Saliency maps and
class activation maps seem to be the best way to go.
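As a reminder of the first one, a saliency map is a single backward pass of the pre-softmax class score to the input pixels. Here is a minimal sketch assuming a PyTorch classifier; the class_idx argument and the way the channels are collapsed are illustrative choices:

import torch

def saliency_map(model, image, class_idx):
    """Gradient of the pre-softmax class score w.r.t. the input pixels;
    large absolute values mark the pixels the score depends on most."""
    img = image.clone().requires_grad_(True)   # image: (3, H, W)
    model.eval()
    scores = model(img.unsqueeze(0))           # (1, num_classes), pre-softmax
    scores[0, class_idx].backward()
    # Collapse the channel dimension: keep the strongest gradient per pixel.
    return img.grad.abs().max(dim=0).values    # (H, W) map, same size as the input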
What is the role of a given neuron, feature map, or layer?
Deconvolve and reconstruct, search in the dataset for
the top images, and do gradient ascent.
Can we check what the network focuses on?
How does the network see our world?
I would say
maybe Deep Dream is the cool thing for that.
And then, what are the implications and use cases of these methods?
Uh, you can use saliency maps to segment,
though it's not very useful given the new methods we have.
But the deconvolution that we've seen together is
widely used for segmentation and reconstruction,
also for generating [inaudible].
Uh, these visualizations can also be used
to detect if some of the filters are dead.
So, let's say you have a network and you use these visualizations:
you see that, whatever input image you give,
some feature maps are always dark.
It means that the filter that generates
this feature map is never activated.
So, it's not even being trained.
That's the type of insight you can get.
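A minimal sketch of that dead-filter check, assuming a PyTorch model and a small loader of validation images; the layer handle, the threshold, and the loader name are illustrative, not from the lecture:

import torch

def find_dead_filters(model, layer, dataloader, threshold=1e-6, device="cpu"):
    """Flag filters of `layer` whose feature maps stay (almost) all-zero
    no matter which input image is given."""
    stored = {}
    handle = layer.register_forward_hook(lambda m, inp, out: stored.update(out=out))
    max_response = None
    model.eval()
    with torch.no_grad():
        for batch, _ in dataloader:
            model(batch.to(device))
            # Strongest absolute response of each filter over this batch.
            fmap = stored["out"].abs().amax(dim=(0, 2, 3))      # shape (C,)
            max_response = fmap if max_response is None else torch.maximum(max_response, fmap)
    handle.remove()
    return (max_response < threshold).nonzero().flatten()       # indices of "dark" filters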
Okay, thanks guys.
Sorry we went over time.
[NOISE]
知识点
重点词汇
detection [dɪˈtekʃn] n. 侦查,探测;发觉,发现;察觉 {cet4 cet6 gre :6133}
standpoint [ˈstændpɔɪnt] n. 立场;观点 {cet4 cet6 ky toefl :6206}
blurred [blɜ:d] v. 玷污;使…模糊不清;使感觉迟钝(blur的过去式和过去分词) adj. 模糊不清的;被弄污的 { :6364}
blurring [blɜ:rɪŋ] n. 模糊 adj. 模糊的 vi. 模糊(blur的现在分词) { :6364}
download [ˌdaʊnˈləʊd] vt. [计] 下载 {gk :6382}
synthetic [sɪnˈθetɪk] n. 合成物 adj. 综合的;合成的,人造的 {cet4 cet6 ky toefl ielts :6608}
messy [ˈmesi] adj. 凌乱的,散乱的;肮脏的,污秽的;麻烦的 {gk toefl :6651}
messier [ˈmesi:ə] adj. 肮脏的( messy的比较级 ); 混乱的; 难以处理的; 令人厌烦的 { :6651}
overlap [ˌəʊvəˈlæp] n. 重叠;重复 vi. 部分重叠;部分的同时发生 vt. 与…重叠;与…同时发生 {cet6 ky toefl ielts gre :6707}
inaudible [ɪnˈɔ:dəbl] adj. 听不见的;不可闻的 { :6808}
algorithm [ˈælgərɪðəm] n. [计][数] 算法,运算法则 { :6819}
algorithms [ˈælɡəriðəmz] n. [计][数] 算法;算法式(algorithm的复数) { :6819}
simplify [ˈsɪmplɪfaɪ] vt. 简化;使单纯;使简易 {gk cet4 cet6 ky ielts :7074}
torso [ˈtɔ:səʊ] n. 躯干;裸体躯干雕像;未完成的作品;残缺不全的东西 {gre :7317}
reconstruct [ˌri:kənˈstrʌkt] vt. 重建;改造;修复;重现 {toefl :7327}
reconstructed [ri:kən'strʌktɪd] adj. 重建的;改造的 v. 重建;改造(reconstruct的过去式) { :7327}
gradient [ˈgreɪdiənt] n. [数][物] 梯度;坡度;倾斜度 adj. 倾斜的;步行的 {cet6 toefl :7370}
gradients [ˈgreɪdi:ənts] n. 渐变,[数][物] 梯度(gradient复数形式) { :7370}
flattening ['flætnɪŋ] n. 整平;扁率;压扁作用 v. 压扁(flatten的ing形式) { :7436}
flattened ['flætnd] adj. 没精打采的;垂头丧气的 v. 平整;打倒(flatten的过去分词) { :7436}
flatten [ˈflætn] vt. 击败,摧毁;使……平坦 vi. 变平;变单调 n. (Flatten)人名;(德)弗拉滕 {cet6 gre :7436}
fonts ['fɒnts] n. 字体(font的复数) { :7448}
validate [ˈvælɪdeɪt] vt. 证实,验证;确认;使生效 {toefl gre :7516}
blog [blɒg] n. 博客;部落格;网络日志 { :7748}
wrinkles ['rɪŋklz] n. 皱纹;皱褶(wrinkle的复数形式) v. 起皱(wrinkle的第三人称单数形式) { :7819}
Et ['i:ti:] conj. (拉丁语)和(等于and) { :7820}
compute [kəmˈpju:t] n. 计算;估计;推断 vt. 计算;估算;用计算机计算 vi. 计算;估算;推断 {cet4 cet6 ky toefl ielts :7824}
computed [kəmˈpju:tid] v. 计算(compute的过去式) adj. 计算的((compute的过去分词) { :7824}
intuition [ˌɪntjuˈɪʃn] n. 直觉;直觉力;直觉的知识 {cet6 ky toefl ielts gre :7905}
spirals [ˈspaiərəlz] n. 螺旋(线)( spiral的名词复数 ); 螺旋式的上升(或下降) v. 盘旋上升(或下降)( spiral的第三人称单数 ); (物价等)不断急剧地上升(或下降) { :8028}
whirls [hwə:lz] v. (使)飞快移动,使旋转( whirl的第三人称单数 ) { :8035}
hound [haʊnd] n. 猎犬;卑劣的人 vt. 追猎;烦扰;激励 {cet6 ky ielts :8069}
ascents [əˈsents] n. 上升( ascent的名词复数 ); (身份、地位等的)提高; 上坡路; 攀登 { :8121}
ascent [əˈsent] n. 上升;上坡路;登高 {toefl ielts :8121}
hack [hæk] n. 砍,劈;出租马车 vt. 砍;出租 vi. 砍 n. (Hack)人名;(英、西、芬、阿拉伯、毛里求)哈克;(法)阿克 {gre :8227}
encode [ɪnˈkəʊd] vt. (将文字材料)译成密码;编码,编制成计算机语言 { :8299}
encoding [ɪn'kəʊdɪŋ] n. [计] 编码 v. [计] 编码(encode的ing形式) { :8299}
notation [nəʊˈteɪʃn] n. 符号;乐谱;注释;记号法 {cet6 toefl ielts :8312}
validation [ˌvælɪ'deɪʃn] n. 确认;批准;生效 { :8314}
zoom [zu:m] vi. 嗡嗡作响; 急速上升 n. 嗡嗡声; 隆隆声; (车辆等)疾驰的声音; 变焦 vt. 使急速上升; 使猛增 {gk ky :8608}
visualize [ˈvɪʒuəlaɪz] vt. 形象,形象化;想像,设想 vi. 显现 {cet6 ielts :8673}
visualized [vɪʒʊəˌlaɪzd] adj. 直观的;直视的 v. 使形象化;想像(visualize的过去分词) { :8673}
visualizing ['vɪzjʊəlaɪzɪŋ] n. 肉眼观察 { :8673}
downside [ˈdaʊnsaɪd] n. 负面,缺点;下降趋势;底侧 adj. 底侧的 { :8709}
snail [sneɪl] n. 蜗牛;迟钝的人 vt. 缓慢移动 vi. 缓慢移动 {cet6 :8765}
cache [kæʃ] n. 电脑高速缓冲存储器;贮存物;隐藏处 vt. 隐藏;窖藏 vi. 躲藏 {gre :8893}
clarification [ˌklærəfɪ'keɪʃn] n. 澄清,说明;净化 {toefl :8909}
mcdonald [mәk'dɔnәld] 麦当劳(McDonald's) 麦克唐纳(人名) { :8947}
insertion [ɪnˈsɜ:ʃn] n. 插入;嵌入;插入物 { :9116}
derivative [dɪˈrɪvətɪv] n. [化学] 衍生物,派生物;导数 adj. 派生的;引出的 {toefl gre :9140}
neural [ˈnjʊərəl] adj. 神经的;神经系统的;背的;神经中枢的 n. (Neural)人名;(捷)诺伊拉尔 { :9310}
activations [,æktɪ'veɪʃən] n. [电子][物] 激活;活化作用 { :9314}
activation [ˌæktɪ'veɪʃn] n. [电子][物] 激活;活化作用 { :9314}
Fergus ['fә:^әs] 费格斯(姓氏, 男子名) { :9390}
neuron [ˈnjʊərɒn] n. [解剖] 神经元,神经单位 {cet6 toefl :9397}
neurons [ ] n. 神经元,神经细胞(neuron的复数形式) { :9397}
microscopic [ˌmaɪkrəˈskɒpɪk] adj. 微观的;用显微镜可见的 {cet6 toefl gre :9581}
vertically ['vɜ:tɪklɪ] adv. 垂直地 { :9720}
approximate [əˈprɒksɪmət] adj. [数] 近似的;大概的 vt. 近似;使…接近;粗略估计 vi. 接近于;近似于 {cet4 cet6 ky toefl ielts gre :9895}
scientifically [ˌsaɪən'tɪfɪklɪ] adv. 系统地;合乎科学地;学问上 { :9981}
rectangle [ˈrektæŋgl] n. 矩形;长方形 {gk cet6 ky toefl ielts gre :10058}
rosier [ˈrəʊzi:ə] adj. 玫瑰色的( rosy的比较级 ); 愉快的; 乐观的; 一切都称心如意 { :10106}
metric [ˈmetrɪk] adj. 公制的;米制的;公尺的 n. 度量标准 {cet4 cet6 ky ielts :10163}
propagate [ˈprɒpəgeɪt] vt. 传播;传送;繁殖;宣传 vi. 繁殖;增殖 {cet6 toefl ielts gre :10193}
propagated [ˈprɔpəɡeitid] 传播 { :10193}
propagating [ˈprɔpəɡeitɪŋ] v. 传播(propagate的ing形式);繁殖 adj. 传播的;繁殖的 { :10193}
approximation [əˌprɒksɪˈmeɪʃn] n. [数] 近似法;接近;[数] 近似值 { :10242}
diagonal [daɪˈægənl] n. 对角线;斜线 adj. 斜的;对角线的;斜纹的 {toefl gre :10261}
diagonals [daɪˈægənəlz] n. <数>对角线( diagonal的名词复数 ); 斜线 { :10261}
inception [ɪnˈsepʃn] n. 起初;获得学位 n. 《盗梦空间》(电影名) {gre :10325}
pixels ['pɪksəl] n. [电子] 像素;像素点(pixel的复数) { :10356}
pixel [ˈpɪksl] n. (显示器或电视机图象的)像素(等于picture element) { :10356}
generalizing [ˈdʒenərəlaizɪŋ] 归纳 { :10707}
contextual [kənˈtekstʃuəl] adj. 上下文的;前后关系的 { :10846}
synthesized ['sɪnθɪsaɪzd] adj. 合成的;综合的 v. 合成(synthesize的过去分词);综合 { :10905}
blah [blɑ:] n. 废话;空话;瞎说 n. (Blah)人名;(捷)布拉赫 int. 废话 { :10986}
wha [ ] [医][=warmed,humidified air]温暖、潮湿的空气 { :11046}
artificially [ˌɑ:tɪ'fɪʃəlɪ] adv. 人工地;人为地;不自然地 { :11137}
infinity [ɪnˈfɪnəti] n. 无穷;无限大;无限距 {cet6 gre :11224}
delve [delv] n. 穴;洞 vi. 钻研;探究;挖 vt. 钻研;探究;挖 n. (Delve)人名;(英)德尔夫 {gre :11237}
seminal [ˈsemɪnl] adj. 种子的;精液的;生殖的 adj. 有创造力的,对未来有影响的;重大的 {gre :11387}
overlay [ˌəʊvəˈleɪ] n. 覆盖图;覆盖物 vt. 在表面上铺一薄层,镀 { :11456}
optimized ['ɒptɪmaɪzd] adj. 最佳化的;尽量充分利用 { :11612}
horizontally [ˌhɒrɪ'zɒntəlɪ] adv. 水平地;地平地 { :11924}
ex [eks] n. 前妻或前夫 prep. 不包括,除外 { :12200}
dumbbell ['dʌmbel] n. 哑铃;蠢人 { :12245}
bookcase [ˈbʊkkeɪs] n. [家具] 书柜,书架 {gk ielts :12527}
mo [məʊ] abbr. 卫生干事,卫生管员(Medical Officer);邮购(Mail Order);方式(Modus Operandi);邮政汇票(Money Order) { :12537}
oftentimes [ˈɒfntaɪmz] adv. 时常地 { :12676}
propagation [ˌprɒpə'ɡeɪʃn] n. 传播;繁殖;增殖 {cet6 gre :12741}
multiplications [ ] (multiplication 的复数) n. 乘法, 增加, 乘法运算 [医] 增殖; 倍增 { :12748}
kyle [kaɪl] n. (苏)狭海峡,海峡 n. (Kyle)人名;(英)凯尔;(瑞典)许勒;(西)基莱 { :13115}
Afghan [ˈæfɡæn] n. 阿富汗语;阿富汗人 adj. 阿富汗人的;阿富汗的 { :13137}
healthcare ['helθkeə] n. 医疗保健;健康护理,健康服务;卫生保健 {ielts :13229}
segmentation [ˌsegmenˈteɪʃn] n. 分割;割断;细胞分裂 { :13396}
visualization [ˌvɪʒʊəlaɪ'zeɪʃn] n. 形象化;清楚地呈现在心 { :13979}
visualizations [ ] (visualization 的复数) n. 可见性, 形象化 [医] 使显形, 造影[术], 想象 { :13979}
husky [ˈhʌski] adj. 声音沙哑的;有壳的;强壮的 n. 强壮结实之人;爱斯基摩人 {gre :14361}
regenerate [rɪˈdʒenəreɪt] vt. 使再生;革新 adj. 再生的;革新的 vi. 再生;革新 {cet6 ky toefl :14883}
transpose [trænˈspəʊz] n. 转置阵 vt. 调换;移项;颠倒顺序 vi. 进行变换 {gre :14972}
transposed [ ] adj. 移调的;变调的 v. 调换;颠倒顺序;移项(transpore的过去分词) { :14972}
dimensional [dɪ'menʃənəl] adj. 空间的;尺寸的 {toefl :15066}
adversarial [ˌædvəˈseəriəl] adj. 对抗的;对手的,敌手的 { :15137}
retrain [ˌri:ˈtreɪn] vt. 重新教育;再教育 vi. 再训练;再教育 n. (Retrain)人名;(法)雷特兰 { :15253}
retrained [ri:ˈtreind] v. 重新教育,再教育( retrain的过去式和过去分词 ) { :15253}
unbiased [ʌnˈbaɪəst] adj. 公正的;无偏见的 {toefl :15836}
inverse [ˌɪnˈvɜ:s] n. 相反;倒转 adj. 相反的;倒转的 vt. 使倒转;使颠倒 {cet4 ky toefl gre :15867}
inverts [inˈvə:ts] v. 使倒置,使反转( invert的第三人称单数 ) { :15967}
invert [ɪnˈvɜ:t] n. 颠倒的事物;倒置物;倒悬者 adj. 转化的 vt. 使…转化;使…颠倒;使…反转;使…前后倒置 {cet6 ky toefl ielts gre :15967}
normalization [ˌnɔ:məlaɪ'zeɪʃn] n. 正常化;标准化;正规化;常态化 {cet6 ky :16091}
toolbox [ˈtu:lbɒks] n. 工具箱 { :17283}
SE [ ] abbr. 东南方(southeast) { :17431}
iterations [.ɪtə'reɪʃ(ə)n] n. 迭代次数;反复(iteration的复数) { :17595}
notepads [ ] 注释板(notepad的复数) { :17692}
equalized [ˈi:kwəlaizd] v. (使某事物)相等( equalize的过去式和过去分词 ) { :17737}
exponential [ˌekspəˈnenʃl] n. 指数 adj. 指数的 {toefl :17748}
summations [səˈmeɪʃənz] n. 总和( summation的名词复数 ); 加在一起; 总结; 概括 { :17935}
summation [sʌˈmeɪʃn] n. 和;[生理] 总和;合计 {gre :17935}
datasets [ ] (dataset 的复数) [电] 资料组 { :18096}
dataset ['deɪtəset] n. 资料组 { :18096}
ostrich [ˈɒstrɪtʃ] n. 鸵鸟;鸵鸟般的人 {gre :18490}
pelican [ˈpelɪkən] n. [鸟] 鹈鹕 { :18790}
iguana [ɪˈgwɑ:nə] n. 鬣蜥蜴 { :18852}
难点词汇
granular [ˈgrænjələ(r)] adj. 颗粒的;粒状的 {ielts :20261}
subsample ['sʌbsɑ:mpl] n. (从样品中再抽取的)子样品;二次抽样样品 vt. 对…作二次抽样 { :20642}
flamingo [fləˈmɪŋgəʊ] n. [鸟] 火烈鸟 { :21112}
flamingos [fləˈmɪŋgəʊz] n. 红鹳,火烈鸟(羽毛粉红、长颈的大涉禽)( flamingo的名词复数 ) { :21112}
generative [ˈdʒenərətɪv] adj. 生殖的;生产的;有生殖力的;有生产力的 { :21588}
localization [ˌləʊkəlaɪ'zeɪʃn] n. [计] 定位;局限;地方化 { :21883}
NY [ ] abbr. 纽约(美国一座城市,New York) { :21993}
carapace [ˈkærəpeɪs] n. 壳;甲壳 {toefl gre :23667}
occlusion [ə'klu:ʒn] n. 闭塞;吸收;锢囚锋 { :24330}
convoluting [ˈkɔnvəlju:tɪŋ] v. 回旋,卷绕,盘旋( convolute的现在分词 ) { :24355}
orthogonal [ɔ:'θɒgənl] adj. [数] 正交的;直角的 n. 正交直线 { :24671}
dalmatian [dæl'meiʃiәn] n. 达尔马西亚狗;达尔马西亚人 adj. 达尔马西亚的 { :25118}
Dalmatians [dælˈmeiʃiənz] n. 斑点狗( Dalmatian的名词复数 ) { :25118}
iterative ['ɪtərətɪv] adj. [数] 迭代的;重复的,反复的 n. 反复体 { :25217}
invariant [ɪnˈveəriənt] n. [数] 不变量;[计] 不变式 adj. 不变的 { :26080}
xavier ['zʌvɪə] n. 泽维尔(男子名) { :26299}
occluded [əˈklu:did] v. 闭塞的;堵塞;咬合的(occlude 的过去分词) { :27220}
SU [ ] abbr. 后勤部队(Service Unit) n. (Su)人名;(土、柬)苏;(中)苏(普通话·威妥玛) { :27413}
regularize [ˈregjələraɪz] vt. 调整;使有秩序;使合法化 { :29422}
Gaussian ['gaʊsɪən] adj. 高斯的 { :29650}
interpretable [ɪn'tɜ:prɪtəbl] adj. 可说明的;可判断的;可翻译的 { :30754}
convolution [ˌkɒnvəˈlu:ʃn] n. [数] 卷积;回旋;盘旋;卷绕 { :30767}
convolutions [kɒnvə'lu:ʃnz] n. 回旋,盘旋,卷绕( convolution的名词复数 ) { :30767}
trippy ['trɪpɪ] adj. 由致幻药引起幻觉的 { :31207}
saliency [ˌseɪ'ljənsɪ] n. 显著;卓越;特点;凸起 { :33942}
discriminative [dɪs'krɪmɪnətɪv] adj. 区别的,歧视的;有识别力的 { :36291}
subspace ['sʌbspeɪs] n. 子空间 { :36324}
regularization [ˌregjʊlərɪ'zeɪʃən] n. 规则化;调整;合法化 { :37553}
classifier [ˈklæsɪfaɪə(r)] n. [测][遥感] 分类器; { :37807}
initialization [ɪˌnɪʃəlaɪ'zeɪʃn] n. [计] 初始化;赋初值 { :40016}
subpart [sʌb'pɑ:t] n. 子部件 { :41301}
zhou [dʒəu] n. 周(中国姓氏);周朝(中国古代王朝) { :49559}
生僻词
backprop [bæk prɒp] un. 后撑
backpropagate [ ] [网络] 反向传播
backpropagation [ ] n. 反向传播算法 [网络] 反向传播了;反向传播法;传播网络
cla [ ] abbr. communication link analyzer 通讯连接分析器
conv ['kənv] [医][=convalescence]恢复(期),康复(期)
convolutional [kɒnvə'lu:ʃənəl] adj. 卷积的;回旋的;脑回的
convolve [kən'vɒlv] vt. 使卷;使盘旋;使缠绕 vi. 盘旋;卷;缠绕
cro [ ] n. (Cro)人名;(法、意)克罗 阴极射线示波器(Cathode-Ray Oscillograph)
datas [ ] n. 数据输入
deconvolution [di:kɒnvə'lu:ʃən] n. [地质] 反褶积,[计] 去卷积
deconvolutional [ ] [网络] 去卷积
deconvolve [,di:kәn'vɔlv] vt.[计]去…卷积,展开…卷积
deconvolving [,di:kən'vɔlv] vt. 展开…卷积;去…卷积
elementwise [ ] [网络] 元素对元素
gener [ ] [网络] 产生;制造;出生
gla [ ] abbr. γ—亚麻酸(Gamma-Linolenic Acid);大伦敦政府(Greater London Authority);总可出租面积(Gross Leasable Area)
google [ ] 谷歌;谷歌搜索引擎
hartebeest ['hɑ:tɪbi:st] n. 大羚羊(产于非洲)
hyperparameter [ ] [网络] 超参数;分别有一个带有超参数
invertible [ɪn'vɜ:tɪbl] adj. 可逆的;倒转的
kth [ ] abbr. Kungliga Tekniska Hegskolan (Royal Institute of Technology, Stockholm) 斯德哥尔摩皇家工学院
multiplicate ['mʌltɪplɪkeɪt] adj. 多种的;多重的
nx [ ] abbr. next 接下去的; 其次的; 下一个的; nonexpendable 非消耗品
oftentime [ ] [网络] 的时间
Pomeranian [.pɒmә'reiniәn] a. 波美拉尼亚的 n. 波美拉尼亚人, 波美拉尼亚种小狗
relu [ ] [网络] 关节轴承
softmax [ ] [网络] 柔性最大传递函数;前回收的日志文件的百分比;西风狂诗曲系列篇章
thresholding [ ] [网络] 二值化;阈值处理;阈值化
upsample [ ] [网络] 内插滤波进行升采样;升频;对输入信号过采样
upsampled [ ] [网络] 升频
upsampling [ ] [网络] 提升采样;增采样;提昇采样
wx [ ] abbr. weather 天气; weather report 气象报告; watts second 瓦特秒; waxy 蜡(状)的
zeroes [ˈziərəuz] n. (数字)零( zero的名词复数 ); 零点; 零度; 没有
词组
a dot [ ] [网络] 阿顿;阿突
a fox [ ] [网络] 狐狸;一只狐狸;狐理
a hack [ ] [网络] 网络攻击
a max [ ] [网络] 最大值;最大净光合速率;最大聚集率
a toolbox [ ] 工具箱
activation function [ ] 激活函数
activation level [ ] 激动水平
activation mapping [ ] 《英汉医学词典》activation mapping 激动标测法
Afghan hound [ˌæfgæn 'haʊnd] n. 阿富汗猎狗 [网络] 阿富汗猎犬;阿富汗狩猎犬;阿富汗犬
back propagation [ˈbækˌprɔpəˈgeɪʃən] [网络] 反向传播;误差反向传播;反向传播算法
backward path [ ] un. 回程通路;反向通路 [网络] 反向路径
be messy [ ] [网络] 那会很麻烦
black dot [ ] un. 黑斑 [网络] 黑点型;黑点款;圆点图案
blah blah [ ] [网络] 等等;生活废话;磨嘴皮子
blog post [ ] [网络] 博客文章;博客帖子;部落格文章
convolution operation [ ] un. 褶积运算 [网络] 卷积运算
delve into [ ] [网络] 钻研;深入研究;探究
dot product [dɔt ˈprɔdʌkt] un. 点积;标量积 [网络] 点乘;数量积;内积
edge detection [ ] un. 边缘检测;边检测 [网络] 边缘侦测;边界检测;边沿检测
edge detector [ ] un. 边缘检测器 [网络] 边缘觉察器;边缘检测算子;信号缘侦测器
equalize to [ ] (或with)使相等;使相同;使平等
et al [ ] abbr. 以及其他人,等人
et al. [ˌet ˈæl] adv. 以及其他人;表示还有别的名字省略不提 abbr. 等等(尤置于名称后,源自拉丁文 et alii/alia) [网络] 等人;某某等人;出处
et. al [ ] adv. 以及其他人;用在一个名字后面 [网络] 等;等人;等等
flip all [ ] [WIN]全部翻转
forward propagation [ ] 正向传播
Gaussian blur [ ] [网络] 高斯模糊;高斯模糊滤镜;高度模糊
gradient descent [ ] n. 梯度下降法 [网络] 梯度递减;梯度下降算法;梯度递减的学习法
hack that [ ] [网络] 这样砍
identity matrix [ ] un. 〔数〕幺矩阵;纯量矩阵;恒等矩阵;单位矩阵 [网络] 单位化矩阵;单位阵;产生单位矩阵
intermediate layer [ ] un. 中间层;过渡层 [网络] 中层;中间界面层;中间过渡层
invertible matrix [ ] n. 非奇异方阵 [网络] 可逆矩阵;可泄矩阵;反矩阵
iterative process [ ] un. 迭代过程;迭绕法 [网络] 迭代程序;迭代估计控制;反复式
kit fox [kit fɔks] 小狐,小狐毛皮; 敏狐
machine translation [məˈʃi:n trænsˈleiʃən] n. 机器翻译;计算机翻译 [网络] 机骗译;机译;机器翻译技术
mathematical operation [ ] un. 数学运算 [网络] 数字运算;数学计算
mathematical operations [ ] [数] 数学运算
mathematical perspective [ ] 《英汉医学词典》mathematical perspective 几何透视
microscopic imaging [ ] 显微成像
minus infinity [ ] [网络] 负无穷大;负无限大
minus one [ ] [网络] 桃花源;幸福意外;谢谢你捧场
multiply by [ ] v. 乘 [网络] 乘以;乘上;使相乘
neural network [ˈnjuərəl ˈnetwə:k] n. 神经网络 [网络] 类神经网路;类神经网络;神经元网络
neural networks [ ] na. 【计】模拟脑神经元网络 [网络] 神经网络;类神经网路;神经网络系统
object detection [ ] [科技] 物体检测
orthogonal matrices [ ] 正交矩阵
orthogonal matrix [ɔ:ˈθɔɡənl ˈmeɪtrɪks] [网络] 正交矩阵;正交阵;直交矩阵
per se [ˌpɜ: ˈseɪ] adv. 本身;本质上 [网络] 自身;本来;本身餐厅
pixel image [ˈpiksəl ˈimidʒ] [医]像素显像
plus infinity [ ] [网络] 正无穷大;正无限大
plus zero [ ] un. 正零
reconstruction method [ ] 重建法
saliency map [ ] [网络] 显著性地图;显著性图;显著图
set to zero [ ] un. 调到零位;调零 [网络] 设置为零;置零;零调整
simple matrix [ ] 单纯矩阵
spatial information [ ] 空间信息
spatial localization [ ] 《英汉医学词典》spatial localization 空间定位
spiral whirl [ ] [网络] 螺旋形旋涡
spiral whirls [ ] 螺旋形旋涡
synthetic image [ ] 综合图象
the ass [ ] [网络] 驴子;菊门;深渊
the downside [ ] [网络] 不利方面;缺点
the fox [ ] [网络] 狐狸;女狐;沙狐
the matrix [ ] [网络] 黑客帝国;骇客任务;骇客帝国
the Max [ ] [网络] 麦克斯;牛魔王;电子产品配件
the purple [ ] 帝位;王位;显位;红衣主教的职位
the reconstruction [ ] [网络] 重构法;构建;战地雄心
the snail [ ] [网络] 蜗牛;井底的蜗牛;丝瓜花上蜗牛
time zero [ ] 计时起点,时间零点
to clip [ ] [网络] 夹娃娃机;擦撞;到剪切板
to compute [ ] [网络] 计算;用计算机计算
to encode [ ] [网络] 编码;内码;骗码
to overlay [ ] 覆盖
to summarize [ ] [网络] 总结;总结来说;概括
to update [ ] [网络] 更新;重要更新公告;每月更新
validation set [ ] 验证集
vector operation [ ] un. 向量运算 [网络] 矢量操作;矢量运算;向量操作
visualization method [ ] 显像法
visualize doing [ ] 历历描绘......于心
zero in [ˈziərəu in] na. 调整(枪炮的)射距;把(火力)对准目标 [网络] 归零;瞄准;瞄准锁定
zero in on [ˈziərəu in ɔn] (使)瞄准…,(使)对准…,对…集中火力[注意力]
Zero Minus [ ] [网络] 绝对零点
zero out [ ] na. 给…以免税待遇 [网络] 清零了;取消;置零
zero zero [ˈziərəu ˈziərəu] 零
惯用语
12 by 5
and in fact
and now
and so
and so on
does that make sense
does that makes sense
in practice
occlusion sensitivity
plus one
same thing
so now
that makes sense
单词释义末尾数字为词频顺序
zk/中考 gk/高考 ky/考研 cet4/四级 cet6/六级 ielts/雅思 toefl/托福 gre/GRE
* 词汇量测试建议用 testyourvocab.com
