
So far we have mostly talked about different convolutional neural networks for images, and in the last few lectures we talked about videos. There we covered some basic components of video such as motion, and we also talked about tracking an object in a video. Today we are going to cover a more fundamental problem in video, which is classification: we are going to take the entire video at once and try to recognize whether some concept is observed in it or not. In video, that concept is basically an action.

What we have done for images is all about recognizing spatial patterns. When we look at an image like this, we can see a lot of different things. So what do you see in this image? Based on the patterns in the background, it looks like a playing field, and based on the uniforms that each person is wearing, it is likely to be baseball. So perhaps this is a snapshot taken from a baseball game; that is what we can tell from the image. And maybe these two people are colliding, so perhaps it is in the middle of the game, with some dramatic action happening.

But if you consider the video, you can get more. We can see that this is indeed a baseball field, but what is actually happening here is not part of the game; it is more like a fight breaking out during the game. So by looking at multiple frames we get more context than the spatial patterns alone, because we also recognize the temporal dynamics happening in the pixels. So in videos we eventually want to discover spatio-temporal patterns.

An action is a semantic class of the spatio-temporal patterns in a video. It is very similar to a class label for images, but the label is not about objects or static concepts; it is about spatio-temporal patterns. Each action is characterized by spatial content plus its dynamics or motion. Some examples include dancing, cycling, playing baseball, swimming, and so on. Below you see some snapshots taken from an action recognition benchmark dataset, and you can see that now

the pixels are moving. So there is another dimension which is the temporal dimension.

By taking all of these frames as input, we want to decide whether some action is observed in the video or not. So it is basically a classification task, but the label is defined by spatio-temporal patterns.

So how do you recognize the spatial and temporal contexts? You are pretty familiar with recognizing the spatial context.

Considering each frame as an image, maybe we can still apply what we have learned so far, something like convolutional neural networks, to recognize the content inside each frame. But how do you recognize the temporal content? That part is less clear. We did talk about motion and tracking, but that is not really about capturing features of the spatio-temporal volume. All the operations we have learned so far actually handle each frame independently, as separate images. We only carry information across time, for example by propagating a posterior from the previous frame, but when it comes to the current frame we just recognize it as an image. So we have not really dealt with temporal features.

So how do we design a model that handles these temporal features? In today's lecture we are going to learn fairly old but standard algorithms for recognizing actions.

So we are going to eventually learn only one architecture in this lecture. And the basic idea is pretty simple.

But we are going to learn some of the legacy work that led in that direction. We will first learn how to deal with motion as an additional input, and then we will develop this concept so that the model can discover motion on its own: instead of explicitly extracting the motion and using it as an additional input, we take the entire set of frames as input and extract features from it, and we hope that some of those features capture a notion of motion discovered by the model.

We are going to start with very early approaches to understand what kinds of features are useful for recognizing actions. We will study the basic concept of representing a video and learn about the impact of different architectures for recognizing video content. In particular, we are going to learn about the impact of leveraging temporal features.

We are going to start with representing a video, but before that, let us first think about how we represent an image. A video is a collection of frames, and each frame is just an image with width, height, and channels. The video is basically a stack of frames along the temporal dimension. Technically speaking, an image is a 3D tensor with three dimensions: width, height, and channels, so to be precise a video is a four-dimensional tensor whose fourth dimension is the temporal dimension.

Because we cannot visualize a four-dimensional tensor, we are going to use this kind of representation: we treat the temporal dimension as if it were stacked along the channel dimension. It is similar to treating each frame as an RGB image and drawing only a three-dimensional tensor; we simply drop the RGB channel axis and show only the width, the height, and the temporal dimension. But don't be confused: we are actually dealing with a four-dimensional tensor.

To simplify it even further, we will use this kind of representation: the x axis is the temporal axis and the y axis is the spatial axis, so y is basically H times W (and perhaps times C).
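To make the shape conventions concrete, here is a minimal sketch in PyTorch (an assumption on my part; the lecture does not prescribe a framework, and the sizes are arbitrary):

```python
import torch

# A single RGB frame: (channels, height, width).
frame = torch.randn(3, 224, 224)

# A video clip of T frames: (channels, time, height, width),
# i.e. a 4D tensor per clip (5D once a batch dimension is added).
T = 16
clip = torch.randn(3, T, 224, 224)

# The simplified picture in the slides keeps time as one axis and
# collapses (C, H, W) into a single "spatial" axis.
flattened = clip.permute(1, 0, 2, 3).reshape(T, -1)   # (T, C*H*W)
print(frame.shape, clip.shape, flattened.shape)
```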

With this representation, we are going to try different architectures to extract information from the video. We will start with the simplest option, which is treating the video as frame-wise data. Perhaps the simplest way to recognize the content of a video is to take a random frame from it and apply an image convolutional neural network, so that we extract features from that single frame.

This captures only the spatial pattern, but the spatial pattern itself already tells us a lot. If you think about a normal baseball game, it usually comes with particular spatial structures, right? People wear specific uniforms, and the field has specific patterns like the diamond shape. So just by looking at a single frame you can still tell a lot about the video. That is the motivation behind this architecture: we randomly sample a frame and apply an image convolutional neural network, so the classification is based only on the spatial context. This can be a reasonable choice, since the spatial patterns are closely related to the content.
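As a quick sketch of this single-frame baseline (assuming PyTorch and a torchvision ResNet-18 backbone purely for illustration):

```python
import torch
import torchvision

backbone = torchvision.models.resnet18(num_classes=101)   # ordinary 2D image CNN

def single_frame_baseline(clip):
    """clip: (C, T, H, W) video tensor; classify from one random frame."""
    T = clip.shape[1]
    t = torch.randint(0, T, (1,)).item()     # pick a random frame index
    frame = clip[:, t]                       # (C, H, W): an ordinary image
    return backbone(frame.unsqueeze(0))      # (1, num_classes) logits

logits = single_frame_baseline(torch.randn(3, 16, 112, 112))
```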

However, it cannot capture the dynamic information in the video. It only looks at a single frame, so it does not see the dynamics or the motion. If you think about the very first example we saw in this lecture, it will classify that clip simply as a baseball game, even though the actual content was a fight. So spatial information alone cannot solve the entire problem, and it can even act as a bias.

If you want to leverage the fact that the video is a collection of frames, maybe you want a slightly more advanced model that takes multiple frames and applies a convolutional neural network to each frame independently. This gives us a feature for each frame. We then concatenate these features into one gigantic vector and apply a fully connected layer. We refer to this model as late fusion, because the temporal fusion of information happens only at the late stage of the model. We share the parameters of the convolutional network applied to each frame, and classification happens at the last layer on the concatenation of all these features.

If you compare this model with the previous one, it looks at multiple frames, whereas the previous version looked at only one. So in principle it can handle some dynamics across frames, because we are looking at multiple frames at once.
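Here is a rough sketch of the late fusion idea (assuming PyTorch; the tiny backbone, feature size, and frame count are placeholders, not the paper's configuration):

```python
import torch
import torch.nn as nn

class LateFusion(nn.Module):
    def __init__(self, num_frames=4, feat_dim=256, num_classes=101):
        super().__init__()
        # Shared 2D CNN applied to each sampled frame independently.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Temporal fusion happens only here, at the very end.
        self.classifier = nn.Linear(num_frames * feat_dim, num_classes)

    def forward(self, clip):                    # clip: (B, C, T, H, W)
        feats = [self.backbone(clip[:, :, t]) for t in range(clip.shape[2])]
        return self.classifier(torch.cat(feats, dim=1))

logits = LateFusion()(torch.randn(2, 3, 4, 112, 112))   # (2, 101)
```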

The problem is that this temporal fusion happens only at the late part of the model, so it may not be powerful enough to discover more interesting temporal features: most of the fine-grained features, especially the spatial ones, are already gone at the late part of the convolutional network because of the downsampling layers. We are basically applying the concatenation after all of the downsampling, which means that interesting features present in the input observations may already be lost by the fusion stage.

That is one downside: we may miss interesting features at the lower levels. Another issue is that it takes multiple frames but not the entire video; in this illustration we are taking just two frames. So it is still not sufficient to capture motion or more fine-grained temporal features, and it misses a lot of information in both the temporal and spatial dimensions.

Another approach is early fusion. Instead of fusing the temporal information at the late part of the model, we do it at the very first layer, where we define an operator called 3D convolution. This concept of 3D convolution is the most important one in this lecture, and I will come back to it later in a slightly different context.

Think about a 3D convolution filter applied directly to the input frames rather than to an intermediate feature map; the filter looks like this. In the definition of convolution we have learned so far, the filter's channel dimension is the same as the input's channel dimension, and the other two dimensions define the spatial size of the filter. The filter is applied to the image in a sliding-window manner over the spatial dimensions, right? For a 3D convolution filter, we additionally define a filter size along the temporal dimension (the stacked channel dimension in our illustration), and now the filter does a sliding window not only over the spatial dimensions but also over the temporal dimension, because the temporal extent of the filter

is much smaller than the temporal extent of the video. Here L is the number of frames in the video. We mentioned earlier that we treat the temporal dimension as if it were concatenated along the channel dimension; that is how the data is drawn in the illustration. You can treat the 3 RGB channels and the L frames as separate dimensions or as one concatenated dimension; it is basically the same. Then we define our kernel with a temporal size d that is much smaller than L. Ignoring the RGB part for now, our data is H times W times L and our kernel is k times k times d, with L much larger than d. So the data is a three-dimensional tensor and the kernel is also a three-dimensional tensor.

For an image, the channel dimension of the filter is the same as the channel dimension of the data; that was the 2D case. But for a video with 3D convolution, we define the filter's temporal extent to be smaller than the input's. So now the filter can move not only over the spatial dimensions but also along the temporal dimension: the convolution happens in a three-dimensional space, not just a two-dimensional one, and as a result you get another 3D tensor as the output of this convolution.
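Here is a minimal shape check of that idea (a PyTorch sketch of my own; the kernel sizes are arbitrary):

```python
import torch
import torch.nn as nn

L = 16                                      # number of frames
clip = torch.randn(1, 3, L, 112, 112)       # (batch, RGB, time, H, W)

# 3D convolution: temporal kernel size d=4 is much smaller than L=16,
# so the kernel slides along the temporal axis as well as over H and W.
conv3d = nn.Conv3d(in_channels=3, out_channels=8,
                   kernel_size=(4, 3, 3), padding=(0, 1, 1))
out = conv3d(clip)
print(out.shape)    # torch.Size([1, 8, 13, 112, 112]): time shrinks to L - 4 + 1
```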

That is roughly the concept of 3D convolution. We said that the extra, channel-like dimension is the temporal dimension, and the filtering now happens over the temporal dimension as well; in other words, we apply the convolution over the temporal dimension to discover temporal patterns.

In the early fusion model, we apply this 3D convolution only at the very first layer, and the rest of the convolutional neural network is just 2D convolutions, because the output of that first layer is again a 3D tensor, so from the perspective of the following layers it is no different from an ordinary feature map. Compared to the previous architecture, late fusion, where the information fusion across the temporal dimension happened at the very last layers, in this early fusion approach the temporal fusion happens only at the very first layer of the network, and the rest of the network just treats the resulting spatio-temporal feature as an ordinary feature for the CNN.
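A rough sketch of early fusion (my own simplified PyTorch configuration, with the first filter's temporal extent covering the whole clip; not the exact architecture from the paper):

```python
import torch
import torch.nn as nn

class EarlyFusion(nn.Module):
    def __init__(self, num_frames=10, num_classes=101):
        super().__init__()
        # One 3D convolution whose temporal extent spans the whole clip,
        # so all temporal information is fused at the input level.
        self.fuse = nn.Conv3d(3, 64, kernel_size=(num_frames, 7, 7),
                              stride=(1, 2, 2), padding=(0, 3, 3))
        # Everything after this point is an ordinary 2D CNN.
        self.rest = nn.Sequential(
            nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, clip):                 # clip: (B, 3, T, H, W)
        x = self.fuse(clip).squeeze(2)       # temporal axis collapses to 1
        return self.rest(x)                  # -> (B, num_classes)

logits = EarlyFusion()(torch.randn(2, 3, 10, 112, 112))
```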

The idea is pretty simple, right? It is just a 3D convolution. This early fusion approach considers motion information by using 3D convolution filters; it integrates all temporal information at once at the input level, and the rest of the network just uses standard 2D filters. The benefit is that we can now discover temporal features at a fine granularity, very similar to how the very early layers of a convolutional neural network discover low-level visual primitives like edges or textures. The downside is that this single 3D convolution has to abstract away all of the temporal information at the input layer alone. It must handle all the spatio-temporal patterns at the very first layer, which means you cannot do the hierarchical abstraction that is the key to convolutional neural networks.

Finally, there is something called slow fusion. The idea is to apply 3D convolutions throughout the network, like an image convolutional neural network that has a receptive field, except that the network now also has a receptive field over the temporal dimension: convolution is a local operator, and with 3D convolution this notion of locality is also defined over time.

So we have receptive fields defined not only over the spatial dimensions but also over the temporal dimension. For instance, in this illustration, the size of the 3D convolution filter along the temporal dimension is four, so at the very first layer the temporal receptive field is four, and at later layers the receptive field grows with each additional 3D convolution operator. You can see that even at the very last layer the receptive field does not cover the entire video; the whole 3D CNN is still a sliding window over the spatio-temporal dimensions. So the model incorporates longer and longer-term temporal dynamics in its later layers, but we still lose some temporal information before the classification.
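To see how the temporal receptive field grows under stacked 3D convolutions, here is a small shape experiment (a sketch with made-up layer sizes, not the paper's slow fusion configuration):

```python
import torch
import torch.nn as nn

# Two stacked 3D convolutions, each with temporal kernel size 4 (no temporal padding).
# Layer 1 sees 4 consecutive frames; layer 2 combines 4 outputs of layer 1,
# so each of its units covers 4 + (4 - 1) = 7 input frames.
slow_fusion_stem = nn.Sequential(
    nn.Conv3d(3, 16, kernel_size=(4, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
    nn.Conv3d(16, 32, kernel_size=(4, 3, 3), padding=(0, 1, 1)), nn.ReLU(),
)

clip = torch.randn(1, 3, 16, 112, 112)        # 16 frames
out = slow_fusion_stem(clip)
print(out.shape)   # (1, 32, 10, 112, 112): 16 -> 13 -> 10 along time
```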

I am going to skip this part because it is no longer really relevant.

Here are the results, comparing the performance of all these different architectures on the benchmark. If you just look at a single frame and classify the action, the accuracy is something like this: this is the top-1 accuracy and this is the top-5 accuracy. It actually works reasonably well, because the spatial pattern already tells us a lot.

If you apply early fusion, it actually loses some performance. This shows that collapsing all the information about the video at the very first layer abstracts away too much. Compared to a single image it has more frames as input, but the output dimension of the first convolution layer is the same as that of an image convolution layer, so it basically compresses the information at a higher ratio and may lose more of it. There is another downside of early fusion: the single-frame approach treats the video the same as an image, so it can use a convolutional neural network pre-trained on a large-scale image dataset and fully leverage those pre-trained features, whereas the early fusion model cannot, because it has to learn the very first 3D convolution layer from scratch (even though the other layers can still be initialized from a pre-trained model). Because it cannot fully leverage the pre-trained parameters, this model loses some performance. So early fusion even underperforms the single-frame model, which is sort of a surprise.

The late fusion model performs roughly on par with the single-frame model. This means that fusing the information only at the very last stage does not give us much, because most of the spatial and temporal detail is already gone by the end of the model. Slow fusion, on the other hand, gives slightly better performance, which shows that there are interesting features along the temporal dimension that can be harnessed by the model if the architecture allows it.

Those are the basic models. Similar to images, we can also apply the concept of pre-training. For images, we pre-trained our convolutional neural network on a large-scale dataset with a different task and then leveraged those features at the fine-tuning stage. We can apply the same concept here: we can train our video action recognition model on a large-scale action recognition benchmark and then apply the pre-trained model to smaller downstream action recognition tasks. That turns out to give a performance boost: if we train the model from scratch on a new action recognition dataset, the performance is around 41 percent, but if we start from a model pre-trained on a large action recognition dataset, we get a significant boost. This observation is consistent with what we saw for images.

So pre-training on the Sports-1M dataset and fine-tuning on another action recognition dataset actually gives a clear improvement. And here are some qualitative examples of action recognition results. They are based not just on a single frame but on the whole video, and you can see that the model is doing a reasonable job.

To summarize so far, we looked at some very early studies on designing convolutional neural networks for action recognition. The lessons are: first, the spatial pattern itself is already quite informative. Second, if we harness the temporal dimension further, we can discover more interesting features, which turns out to improve action recognition. And third, similar to images, pre-training helps.

One disappointing aspect of this study is that the improvement obtained by considering motion was not very significant, much lower than our expectations. This is the performance: if you compare the slow fusion model, the most advanced model that harnesses the temporal dimension, with the single-frame model, the improvement is marginal. That is quite discouraging, because the 3D convolution model is also much more expensive: it looks at multiple frames and applies convolution over more dimensions, so it is computationally far more demanding, yet it performs similarly to the single-frame model. Why is the motion not useful?

Our follow-up question, since this result is kind of counterintuitive, is: why is motion not so useful here, and what is the problem? Is it a problem with the model, with our assumptions, or with our dataset? It turns out that all of these play a role. The model is not optimal: even if the concept of 3D convolution is reasonable, small details of the architecture may be wrong, and such configuration details are actually pretty critical. We learned that lesson from the image classification task, where we studied many different architectures, starting from AlexNet all the way to ResNet. The performance gap between AlexNet and ResNet is huge, yet in terms of the model it is just a difference in configuration. By finding a more optimal configuration we can still gain a lot, so that could be one reason why we did not see much improvement here.

It also turns out that the dataset itself is problematic. You do not have to take this issue too seriously for now;

I will talk about it again later. It turns out that most of these datasets have a very strong spatial bias in the actions, so just by looking at the background you can already tell a lot. For instance, how do you recognize videos of different sports? How do you differentiate basketball from baseball? Look at the stadium; you do not have to look at anything else. Just by looking at the stadium you can tell the class of the sport, which means that by looking at a single frame you can tell which class the video belongs to. So there is a strong correlation between the action class and the spatial patterns in these benchmarks, especially the Sports-1M benchmark and the UCF benchmark.

If we really want to test whether our model is actually leveraging temporal information or not, there should be less spatial bias in the action categories. The benchmark should contain more interesting actions, like fighting in a baseball stadium, where there is less correlation between the action and the scene. Later works discovered these issues and tried to remove the spatial bias in the actions, or to develop better benchmarks with less correlation between the action class and the spatial patterns, but that is another story.

Anyhow, the model itself needed improvement. The follow-up work was quite surprised by this study and actually tested whether motion itself is useful for action recognition or not.

Their idea was also very simple. Their assumption was that the architectural configuration of the previous study was simply not good at leveraging motion. So instead of letting the model discover these temporal features, the authors of this paper explicitly extracted temporal features and fed them as input to the model. What could these temporal features be? What feature represents the motion in a video? Optical flow is one example.

Optical flow is basically the pixel-wise displacement between two frames. When we have a video, we can not only feed the collection of frames as input, but also extract the optical flow between each pair of frames and feed that to the model. There are a lot of different ways to obtain the optical flow: you can compute it between two consecutive frames, or you can extract longer-term trajectories by post-processing the flow. If you consider three frames, there are two optical flows, right? Each optical flow tells you how a pixel moves into the next frame, so by chaining multiple of them you get the trajectory of the pixel. The authors also studied whether the representation of this temporal information matters, so they tested many different variants, from the naive form of the optical flow up to long-term trajectories.
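As an illustration of what such an input looks like, here is one common way to compute dense optical flow between consecutive frames (a sketch using OpenCV's Farneback method, which is just one of many estimators and not necessarily the one used in the paper):

```python
import cv2
import numpy as np

def flow_stack(frames):
    """frames: list of HxWx3 uint8 RGB frames.
    Returns an HxWx(2*(N-1)) array of stacked (dx, dy) flow fields."""
    grays = [cv2.cvtColor(f, cv2.COLOR_RGB2GRAY) for f in frames]
    flows = []
    for prev, nxt in zip(grays[:-1], grays[1:]):
        # Dense flow: one (dx, dy) displacement per pixel.
        flow = cv2.calcOpticalFlowFarneback(prev, nxt, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return np.concatenate(flows, axis=2)

# Example: 11 frames -> 10 flow fields -> 20 input channels for a temporal stream.
frames = [np.random.randint(0, 255, (112, 112, 3), dtype=np.uint8) for _ in range(11)]
print(flow_stack(frames).shape)   # (112, 112, 20)
```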

This optical flow gives us additional information on top of the images: from the images you can tell the spatial patterns, while the optical flow highlights the motion, which part of the object is moving and in which direction it is heading. So by leveraging both the frame-level information and the optical flow, we can use the temporal information much more easily; there is no need for the model to discover temporal features on its own. That is the important point: we feed an explicit form of the temporal representation, and the model just learns how to exploit it.

The way the authors designed the model is something like this; it is called the two-stream network. The first stream is the spatial stream, which is basically just an image convolutional neural network: it takes a single frame from the video and produces an output. The other stream, the temporal stream, takes the motion extracted from multiple frames, treats it as a separate input, and applies another convolutional neural network to produce its own output. At the very last layer, we combine the predictions from both branches into a final prediction, where the loss is defined. So it uses two independent CNNs to extract features for motion and for spatial patterns separately; even the predictions are made independently, and the two streams combine their information only at the very end.
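Here is a minimal sketch of that structure (my own simplified stand-in backbones; the original paper used deeper CNNs and fused softmax scores at test time):

```python
import torch
import torch.nn as nn

def small_cnn(in_ch, num_classes=101):
    return nn.Sequential(
        nn.Conv2d(in_ch, 64, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(128, num_classes),
    )

class TwoStream(nn.Module):
    def __init__(self, num_flow_channels=20, num_classes=101):
        super().__init__()
        self.spatial = small_cnn(3, num_classes)                   # one RGB frame
        self.temporal = small_cnn(num_flow_channels, num_classes)  # stacked flow maps

    def forward(self, rgb_frame, flow_stack):
        # Late fusion of class scores: average the two streams' predictions.
        return (self.spatial(rgb_frame).softmax(-1) +
                self.temporal(flow_stack).softmax(-1)) / 2

scores = TwoStream()(torch.randn(2, 3, 112, 112), torch.randn(2, 20, 112, 112))
```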

The benefit of this architecture is that it uses external motion information and removes the need for the model to learn the temporal features itself; we are basically isolating the usefulness of motion from the difficulty of discovering it. Another benefit is that you can fully leverage a CNN pre-trained on the ImageNet dataset, because the spatial stream is just a frame-wise convolutional neural network, so you can reuse the pre-trained parameters. Only the temporal stream needs to be learned from scratch.

Here are the results on the UCF-101 dataset. But before we talk about them, any questions?

These are the results. The very first row is the model trained from scratch, and the next rows are variants that start from a pre-trained model. We see the same observation here: starting from a pre-trained model, even the spatial ConvNet, the frame-wise model, performs reasonably well. If you use only the temporal ConvNet, the performance is something like this; the different rows correspond to different choices of the optical flow representation, i.e., how the motion is represented, but you can see that it generally performs very well. In fact, it performs significantly better than using only the spatial pattern. So motion indeed helps in action recognition; the problem with the prior work was not that motion is useless, but that the architecture was not optimal for fully leveraging the motion information. That is a relief, right? Motion itself is useful; we just did not know how to exploit it.

Another interesting observation is that by taking motion as the input and learning the filters of this temporal ConvNet, the network somehow learns to discover temporal features. Here the figure shows the learned filters; the channel dimension of a filter in this case corresponds to time. Looking at each filter slice independently shows what kind of feature it discovers over the spatial dimensions, and visualizing the slices along the temporal dimension shows what kind of temporal features it learns. You can see that this kind of filter actually discovers temporal structure, a kind of temporal weighting, so it learns temporal features together with spatial features in the motion stream.

To summarize, we talked about two-stream networks. They use two independent CNNs to leverage spatial and temporal information respectively, and they enable transfer learning by initializing the spatial stream with a pre-trained network. Most importantly, they demonstrated that motion itself is useful, and that by designing a better architecture to fully leverage the motion we can indeed get an improvement in action recognition.

The limitation of the two-stream network is that it extracts temporal features using hand-crafted representations like optical flow and then leverages that information. The lesson we learned from convolutional neural networks and deep learning is that hand-crafted features always leave room for improvement, because there are always corner cases where they are not optimal. Learning everything end-to-end could further improve the performance, because the model can learn what to discover. So now our mission is to design a spatio-temporal convolutional neural network that beats the optical flow input. If we can somehow learn something beyond optical flow, that is a promising direction.

Let me add a few words about optical flow. Optical flow itself is not perfect. I want to remind you that optical flow was derived from a first-order Taylor expansion; all the equations followed from it, and it relies on assumptions like small motion and brightness constancy. We covered the assumptions made in optical flow, and they are not always valid.
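As a quick recap of where those assumptions come from (a standard derivation, summarizing the earlier optical flow lecture): brightness constancy assumes $I(x+\Delta x,\, y+\Delta y,\, t+\Delta t) = I(x, y, t)$, and a first-order Taylor expansion of the left-hand side, which is only valid for small displacements, gives the optical flow constraint

$$I_x u + I_y v + I_t = 0,$$

so large motions and brightness changes directly violate the derivation.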

Optical flow gives very inaccurate motion estimates around object boundaries, and very large motions are easily missed because they break the first-order Taylor assumption. There are many corner cases where the optical flow can be extremely noisy, and by feeding it as an additional input, the model inherits all of these limitations. By instead letting the model learn to discover the motion cues useful for action recognition, it can deal with these corner cases and gain further improvement, just like what happened for images when we moved from hand-crafted features such as SIFT to learned deep features. We want that kind of transition to happen in the temporal dimension as well.

So we are going to revisit the 3D convolution. I already discussed this, so here it just comes with a fancier illustration. With 2D convolution you apply the filtering over the spatial dimensions in a sliding-window manner, and it gives you another image. If you apply this 2D convolution over multiple frames by concatenating them along the channel dimension, you get a convolution with a much deeper channel dimension, but it is basically the same thing. 3D convolution is different: it also works in a sliding-window manner, but its kernel has a temporal extent smaller than the full temporal length of the input, so the convolution is defined over not only the spatial dimensions but also the temporal dimension, and the output is again a spatio-temporal tensor.
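The difference in shapes is easy to see in code (a small sketch of my own; the clip length and channel counts are arbitrary):

```python
import torch
import torch.nn as nn

L = 16
clip = torch.randn(1, 3, L, 112, 112)             # (batch, RGB, time, H, W)

# 2D convolution over stacked frames: fold time into channels (3*L input channels).
# Time is collapsed immediately; the output has no temporal axis left.
stacked = clip.permute(0, 2, 1, 3, 4).reshape(1, 3 * L, 112, 112)
y2d = nn.Conv2d(3 * L, 64, kernel_size=3, padding=1)(stacked)
print(y2d.shape)                                   # (1, 64, 112, 112)

# 3D convolution: a small temporal kernel slides along time, so the output
# keeps a temporal axis and 3D layers can be stacked hierarchically.
y3d = nn.Conv3d(3, 64, kernel_size=3, padding=1)(clip)
print(y3d.shape)                                   # (1, 64, 16, 112, 112)
```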

The authors of this paper applied these 3D convolutions over multiple layers in a very careful manner. At a very high level, this is basically the same as the slow fusion model you saw earlier. The slow fusion model did not give us much improvement, but by finding a better configuration, they actually obtained a clear improvement.

Ideally, what we expect is that the lessons from image convolutional neural networks extend to the temporal dimension. In convolutional networks we have hierarchical feature extraction, going from low-level image features to high-level features, mainly because deeper layers have more nonlinearities but also a much larger receptive field: at the very early layers we discover low-level visual primitives, and at the higher layers we discover more complex patterns. We hope something similar happens here, because the 3D convolutions are also applied in a hierarchical manner. Maybe at the earlier layers we discover low-level temporal visual primitives, like how the pixels are moving.

At the higher layers, we hope these temporal features are aggregated to capture higher-level semantics over the temporal dimension. As an example, let's say I am moving my hands like this. At the lower layers, maybe the network just discovers the motion of individual pixels, very similar to how, for images, we first find pixel-level color differences, i.e., edges; here we may discover pixel-level differences between consecutive frames, since the temporal receptive field is still very small. At the higher layers, the network observes many more frames over a much larger spatial extent, so it may discover larger-scale semantics of the temporal motion, like arms waving. At even higher layers it aggregates more features and discovers higher-level semantics, something like dance gestures, the basic motions of a dance. We hope that this kind of hierarchical temporal abstraction is learned by the hierarchy of 3D convolutions.

It turned out that increasing the number of 3D convolution layers, from only a few early layers to the entire network, actually improves the performance, which is a promising signal that there is temporal information worth discovering by the model. They also found that by learning the temporal features from scratch, the model performs better than models that rely on explicitly extracted, hand-crafted motion: compare the two-stream network, which reaches about 88 percent, with this 3D convolutional network, which performs even better.

This kind of design is called C3D: a convolutional neural network built with 3D convolution filters. It is similar in spirit to the fully convolutional network; a fully convolutional network is not one specific architecture but rather the idea of designing every part of the network as convolutions. Likewise, designing the model with 3D convolution layers is a general idea, and such models are usually called C3D models. These C3D-style models are a popular choice you will see whenever you are dealing with videos, and there are of course much more advanced variants of the C3D architecture, which you will learn about in the paper presentations.
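For concreteness, here is a toy C3D-style network built entirely from 3D convolutions and 3D pooling (a heavily shrunken sketch of my own, not the published C3D configuration):

```python
import torch
import torch.nn as nn

def conv3d_block(in_ch, out_ch):
    return nn.Sequential(
        nn.Conv3d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool3d(kernel_size=2),   # halves time, height, and width
    )

# Every layer is spatio-temporal, so temporal abstraction grows with depth.
c3d_toy = nn.Sequential(
    conv3d_block(3, 32),
    conv3d_block(32, 64),
    conv3d_block(64, 128),
    nn.AdaptiveAvgPool3d(1), nn.Flatten(),
    nn.Linear(128, 101),               # e.g. 101 action classes
)

logits = c3d_toy(torch.randn(2, 3, 16, 112, 112))   # (2, 101)
```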

Obviously, this 3D convolution approach also comes with limitations. The first is that the computation is very expensive: there is an additional dimension for the sliding window, which is costly by itself, and you are processing a whole stack of frames, which is another burden, so computationally it is very demanding. The number of parameters can also explode with 3D convolution filters, since we are dealing with videos. And it is not always easy to exploit pre-trained models, because the 3D convolution kernels are now different from 2D ones. These are the issues you need to solve to improve the C3D architecture. I am not going to cover all of those approaches here; you will learn about them through the paper presentations, where you can choose some of the more advanced models.

So to summarize, we learned about action recognition, which is basically the classification of spatio-temporal patterns. It turned out that motion is important, and designing a proper model that can exploit motion information through end-to-end learning is the key to action recognition. We learned about some basic architectures based on 3D convolution, and they will serve as a foundation.

By the way, did you check the third assignment? It is due in less than two weeks, so please prepare for it. Another thing I want to mention is that, because of some personal matters, we may not be able to cover some transformer-based vision models in the lectures. I will teach the basic idea of transformers and how they are applied to different tasks in the lecture, and then I will list a number of transformer-based vision models for your paper presentation session so that you can study them further. Based on that plan, we have only one lecture remaining next week. Does that work for you?

All right, that's all for today, and I'll see you next time.

