ejecvnlp Open Access Journal

European Journals of Emerging Computer Vision and Natural Language Processing

Publication Frequency: 2 issues per year.

  • Peer-Reviewed & International Journal

Open Access

ARTICLE

SYNERGIES OF SIGHT AND LANGUAGE: A JOURNEY INTO HOW MACHINES LEARN TO SEE AND SPEAK

1 Department of Cognitive Robotics, Delft University of Technology, Delft, Netherlands
2 Artificial Intelligence and Vision Lab, University of Tokyo, Tokyo, Japan

ABSTRACT VIEWS: 21   |   FILE VIEWS: 25   |   PDF: 25   HTML: 0   OTHER: 0   |   TOTAL: 46

Abstract

For decades, we've dreamed of creating machines that can see the world as we do and talk to us about it. This once-distant dream is now a vibrant reality at the intersection of two of artificial intelligence's most ambitious fields: Computer Vision (CV) and Natural Language Processing (NLP). This article takes you on a comprehensive journey through this exciting landscape, exploring how we're teaching computers to connect pixels to prose. We'll start by exploring the fundamental questions that drive this field, like the "symbol grounding problem"—the puzzle of how words get their meaning—and use frameworks like Bloom's Taxonomy to map out what it truly means for a machine to "understand." We'll break down the core tasks of vision into the "3Rs" (Recognition, Reconstruction, Reorganization) and language into its essential layers (Syntax, Semantics, Pragmatics) to see exactly where these two worlds meet.

The heart of this survey is a deep dive into the toolbox of modern AI. We'll explore how machines learn to represent the world, from the early days of handcrafted visual features to the powerful deep learning models that create today's "image embeddings" and "word embeddings." From there, we'll investigate the ingenious architectures that fuse these senses together, including shared embedding spaces, the elegant encoder-decoder models that power image captioning, the clever attention mechanisms that let models "focus" on what's important, and the modular networks that allow for compositional reasoning.
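To give a concrete flavor of what an attention mechanism does mechanically, here is a minimal, framework-free Python sketch. The vectors and dimensions are toy values invented for illustration, not drawn from the article: a query (such as a caption decoder's current state) scores a set of image-region features, and a softmax over those scores produces the "focus" weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, region_features):
    """Soft attention: score each image-region feature by its dot
    product with the query, normalize the scores with softmax, and
    return the weights plus the weighted-average context vector."""
    scores = [sum(q * r for q, r in zip(query, region))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context

# Toy example: three 2-D "region embeddings"; the query points at the
# second region, so that region receives the largest attention weight.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 2.0]
weights, context = attend(query, regions)
```

The same pattern, with learned projections of the query and regions, underlies the attention layers used in modern captioning and VQA models.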

We'll see these methods in action as we examine the key battlegrounds where progress is measured: tasks like generating captions for images and videos, answering complex questions about a scene (VQA), and retrieving images with a simple text query. We'll then venture into the world of robotics, where this technology is giving machines the ability to follow our commands, learn by watching us, and engage in grounded, meaningful dialogue. By weaving together insights from landmark studies, we'll paint a picture of the field's incredible achievements. Finally, we'll have a candid discussion about the tough challenges that lie ahead—the need for genuine commonsense, the subtle biases that creep into our data, and the quest for true understanding—as we look toward a future of embodied, communicative AI that promises to change our relationship with technology forever.
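Text-query image retrieval in a shared embedding space reduces, at inference time, to a nearest-neighbor search. The sketch below is a hypothetical toy (hand-made 3-D vectors standing in for real learned embeddings): the text query and each image live in the same space, and images are ranked by cosine similarity to the query.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(text_embedding, image_embeddings):
    """Return image indices ranked by similarity to the text query."""
    return sorted(range(len(image_embeddings)),
                  key=lambda i: cosine(text_embedding, image_embeddings[i]),
                  reverse=True)

# Toy shared space: each image is a 3-D embedding (invented values).
images = {"dog": [0.9, 0.1, 0.0],
          "beach": [0.1, 0.8, 0.3],
          "city": [0.0, 0.2, 0.9]}
names = list(images)
query = [0.85, 0.2, 0.05]   # hypothetical embedding of a dog-related query
order = retrieve(query, [images[n] for n in names])
best = names[order[0]]      # → "dog"
```

Real systems learn the two encoders jointly so that matching image-text pairs land close together, but the retrieval step itself is exactly this ranking.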


Keywords

Computer Vision, Natural Language Processing, Vision and Language, Multimodal Learning


How to Cite

SYNERGIES OF SIGHT AND LANGUAGE: A JOURNEY INTO HOW MACHINES LEARN TO SEE AND SPEAK. (2024). European Journals of Emerging Computer Vision and Natural Language Processing, 1(01), 92-100. https://parthenonfrontiers.com/index.php/ejecvnlp/article/view/130
