arXiv / 2023

Explaining Vision and Language through Graphs of Events in Space and Time

Mihai Masala, Nicolae Cudlenco, Traian Rebedea, Marius Leordeanu

Foundation Models, Large Language Models, Multimodal Learning, Popular and Landmark Papers

Artificial Intelligence is making great advances and is starting to bridge the gap between vision and language. However, we are still far from understanding, explaining and explicitly controlling visual content from a linguistic perspective, because we still lack a common explainable representation between the two domains. In this work we address this limitation and propose the Graph of Events in Space and Time (GEST), with which we can represent, create and explain both visual and linguistic stories. We provide a theoretical justification of our model and an experimental validation, which shows that GEST can bring solid complementary value alongside powerful deep learning models. In particular, GEST can help improve text-to-video generation at the content level, as it is easily incorporated into our novel video generation engine. Additionally, by using efficient graph matching techniques, GEST graphs can also improve comparisons between texts at the semantic level.

0 citations, 0 influential

Learning resources:
- PDF: arXiv PDF (https://arxiv.org/pdf/2309.08612v1)
- arXiv: arXiv abstract page (https://arxiv.org/abs/2309.08612v1)
- Google Scholar: Google Scholar references (https://scholar.google.com/scholar?q=Explaining%20Vision%20and%20Language%20through%20Graphs%20of%20Events%20in%20Space%20and%20Time)
- Papers with Code: Papers with Code search (https://paperswithcode.com/search?q=Explaining%20Vision%20and%20Language%20through%20Graphs%20of%20Events%20in%20Space%20and%20Time)
- YouTube: YouTube explanations (https://www.youtube.com/results?search_query=Explaining%20Vision%20and%20Language%20through%20Graphs%20of%20Events%20in%20Space%20and%20Time+paper+explained)