Authors:
Mr. Shubhanshu Khatana
Mr. Siddhant Saxena
Advisor:
Prof. Sunil Kumar
Paper and Code Coming Soon...
Visual Question Answering (Visual Q/A, or VQA) has been a significant research domain and has driven much of the progress in image captioning and classification architectures in computer vision. The same question-answering and classification methods, however, do not carry over to streaming data, i.e. videos. Image-captioning models are inefficient when faced with a video's long sequence of frames; they perform poorly even on short clips, which has left long untrimmed videos as an even more neglected class of data.

To address this, we propose a pipeline that brings together existing approaches for handling such data. It works as follows. The long untrimmed video is first passed through a Temporal Segment Network (TSN), which divides the sequence into segments along its length. A sampling strategy is then chosen (dense sampling, sparse sampling, or action-based random sampling), which introduces stochasticity into the system. The frames selected according to a relevancy metric are then handed to a simpler image-captioning model, producing text descriptions that cover the desired span of the video and together serve as its summary.

The text descriptions are fed to a BERT encoder, which converts them into lower-dimensional embeddings. These embeddings are used to build a knowledge graph that captures the correlation between the embeddings, and hence between the actions performed in different parts/segments of the video. This enables Visual Q/A: a text-based query is traversed over the data represented by the knowledge graph. The correlations found between the segments of the video, using the information embedded by the text encoder following the captioning model, are what facilitate VQA over a long untrimmed video.
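
As a minimal sketch of the segmentation/sampling stage (not the exact implementation; the function name `sample_frames` and the two modes shown are assumptions made for illustration), a TSN-style sampler can be written as:

```python
import numpy as np

def sample_frames(num_frames, num_segments, mode="sparse", rng=None):
    """Pick one frame index per temporal segment of a video.

    mode="sparse": one random frame from each equal-length segment (TSN-style).
    mode="dense" : a run of consecutive frames from a random starting point.
    """
    rng = rng or np.random.default_rng()
    if mode == "dense":
        start = rng.integers(0, max(num_frames - num_segments, 1))
        return np.arange(start, start + num_segments).clip(max=num_frames - 1)
    # Sparse (TSN-style) sampling: split the video into equal segments and
    # draw one frame index uniformly at random from each segment.
    bounds = np.linspace(0, num_frames, num_segments + 1, dtype=int)
    return np.array([rng.integers(lo, max(hi, lo + 1))
                     for lo, hi in zip(bounds[:-1], bounds[1:])])

# Example: 8 frames sampled from a 1,200-frame video.
print(sample_frames(1200, 8))
```

The stochasticity mentioned above comes from drawing a different frame within each segment on every pass, so repeated runs summarise slightly different views of the same video.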
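
The captioning and BERT-encoding stages can be sketched with off-the-shelf Hugging Face components; the specific checkpoints used here (`nlpconnect/vit-gpt2-image-captioning`, `bert-base-uncased`) are stand-ins chosen for illustration, not the models fixed by the project:

```python
import torch
from transformers import AutoModel, AutoTokenizer, pipeline

# Hypothetical example checkpoints; any captioner / text-encoder pair would do.
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
bert = AutoModel.from_pretrained("bert-base-uncased")

def caption_and_embed(frames):
    """frames: list of PIL images sampled from the video segments."""
    # One caption per sampled frame; together they summarise the video.
    captions = [captioner(frame)[0]["generated_text"] for frame in frames]
    inputs = tokenizer(captions, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = bert(**inputs)
    # Use the [CLS] token representation as the per-segment embedding.
    embeddings = out.last_hidden_state[:, 0, :]
    return captions, embeddings
```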
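
A minimal stand-in for the knowledge-graph and querying steps, using pairwise cosine similarity between segment embeddings as the correlation measure and a nearest-caption lookup for answering a text query; `encode_fn`, the similarity threshold, and this retrieval scheme are illustrative assumptions rather than the full method:

```python
import networkx as nx
import torch.nn.functional as F

def build_segment_graph(embeddings, captions, threshold=0.6):
    """Connect segments whose caption embeddings are strongly correlated."""
    sims = F.cosine_similarity(embeddings.unsqueeze(1),
                               embeddings.unsqueeze(0), dim=-1)
    graph = nx.Graph()
    for i, caption in enumerate(captions):
        graph.add_node(i, caption=caption)
    for i in range(len(captions)):
        for j in range(i + 1, len(captions)):
            if sims[i, j] >= threshold:
                graph.add_edge(i, j, weight=float(sims[i, j]))
    return graph

def answer_query(query, graph, embeddings, encode_fn):
    """Return the caption of the segment most relevant to a text query.

    encode_fn is assumed to apply the same BERT encoding used for the captions.
    """
    q = encode_fn(query)                        # shape (1, hidden_dim)
    scores = F.cosine_similarity(q, embeddings, dim=-1)
    best = int(scores.argmax())
    return graph.nodes[best]["caption"], best
```

Edges in this graph link segments whose described actions are similar, which is the correlation structure the query traversal exploits.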