TOWARDSAI.NET
Transformers for Videos
Author(s): Sarvesh Khetan

Originally published on Towards AI.

Note: this post continues an earlier blog post in which I discussed different models for the video classification task.

A video is nothing but a sequence of images, so to make use of this sequence information researchers apply sequence models such as RNN / LSTM / GRU / Transformers to video datasets. Since the transformer is the most prominent sequence model, I will only discuss transformers below, but you can design similar architectures with other sequence models too.

Taking inspiration from the vision transformers that we saw in the image classification task, researchers designed an analogous transformer architecture for videos.

The issue with this architecture is that it computes far too much attention, so the model takes a long time to run, and researchers wanted to reduce this computation. They proposed more efficient architectures, and one of them was proposed by Google in 2021 in its paper Video Vision Transformers (ViViT). First, let us understand the intuition behind this architecture, and then look at the architecture diagram that implements it.

In the architecture I have shown only 1 spatial transformer and 1 temporal transformer, but you can add more of these transformer layers to improve your system. Just remember two things:

Spatial positional encoding and temporal positional encoding are inputs only to the very first spatial and temporal transformers, respectively.

You will have to reshape the inputs every time before passing them to the spatial transformer and to the temporal transformer, to make sure the attention is calculated correctly, as shown in the intuition.
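To make the cost argument and the reshaping step more concrete, here is a minimal NumPy sketch. It is not the ViViT implementation: the function names (`attention_pairs`, `self_attention`, `factorised_block`) are hypothetical, the attention has no learned projections, heads, MLP, residual connections, layer norm, or positional encodings, and I am assuming that the spatial transformer attends over the patches within one frame while the temporal transformer attends over the same patch position across frames.

```python
import numpy as np


def attention_pairs(num_frames, patches_per_frame):
    """Count query-key pairs in one attention layer (single head)."""
    t, n = num_frames, patches_per_frame
    joint = (t * n) ** 2   # every token attends to every token in the clip
    spatial = t * n ** 2   # within each frame: n tokens attend to n tokens
    temporal = n * t ** 2  # per patch position: t tokens attend to t tokens
    return joint, spatial + temporal


def self_attention(x):
    """Parameter-free scaled dot-product self-attention.

    x has shape (batch, seq, dim); attention is computed along `seq`.
    """
    scores = x @ x.transpose(0, 2, 1) / np.sqrt(x.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ x


def factorised_block(tokens):
    """One spatial step followed by one temporal step.

    tokens has shape (batch, frames, patches, dim).
    """
    b, t, n, d = tokens.shape

    # Spatial step: fold frames into the batch axis so that attention
    # only sees the n patches belonging to a single frame.
    x = tokens.reshape(b * t, n, d)
    x = self_attention(x)
    x = x.reshape(b, t, n, d)

    # Temporal step: bring the patch axis next to the batch axis and fold it
    # in, so that attention only sees one patch position across t frames.
    x = x.transpose(0, 2, 1, 3).reshape(b * n, t, d)
    x = self_attention(x)

    # Restore the original (batch, frames, patches, dim) layout.
    return x.reshape(b, n, t, d).transpose(0, 2, 1, 3)


if __name__ == "__main__":
    joint, factorised = attention_pairs(num_frames=8, patches_per_frame=196)
    print(joint, factorised)

    clip = np.random.default_rng(0).normal(size=(2, 8, 196, 32))
    print(factorised_block(clip).shape)
```

For a clip of 8 frames with 196 patches each, joint attention scores 2,458,624 token pairs per layer, while the factorised version scores 319,872, roughly 7.7 times fewer. The two `reshape` calls are the "reshaping every time" mentioned above: without them, attention would run along the wrong axis.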
Published via Towards AI