SYNERGIES OF SIGHT AND LANGUAGE: A JOURNEY INTO HOW MACHINES LEARN TO SEE AND SPEAK
- Authors
-
Dr. Leona V. Merake, Department of Cognitive Robotics, Delft University of Technology, Delft, Netherlands
Dr. Arvind S. Tomura, Artificial Intelligence and Vision Lab, University of Tokyo, Tokyo, Japan
-
- Keywords:
- Computer Vision, Natural Language Processing, Vision and Language, Multimodal Learning
- Abstract
-
For decades, we've dreamed of creating machines that can see the world as we do and talk to us about it. This once-distant dream is now a vibrant reality at the intersection of two of artificial intelligence's most ambitious fields: Computer Vision (CV) and Natural Language Processing (NLP). This article takes you on a comprehensive journey through this exciting landscape, exploring how we're teaching computers to connect pixels to prose. We'll start by exploring the fundamental questions that drive this field, like the "symbol grounding problem"—the puzzle of how words get their meaning—and use frameworks like Bloom's Taxonomy to map out what it truly means for a machine to "understand." We'll break down the core tasks of vision into the "3Rs" (Recognition, Reconstruction, Reorganization) and language into its essential layers (Syntax, Semantics, Pragmatics) to see exactly where these two worlds meet.
The heart of this survey is a deep dive into the toolbox of modern AI. We'll explore how machines learn to represent the world, from the early days of handcrafted visual features to the powerful deep learning models that create today's "image embeddings" and "word embeddings." From there, we'll investigate the ingenious architectures that fuse these senses together, including shared embedding spaces, the elegant encoder-decoder models that power image captioning, the clever attention mechanisms that let models "focus" on what's important, and the modular networks that allow for compositional reasoning.
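To make the encoder-decoder and attention ideas concrete, here is a minimal sketch in PyTorch of how an attention-based captioner can be wired together. It is an illustrative toy under stated assumptions, not any particular published model: the class name AttentionCaptioner, the layer sizes, and the assumption that each image arrives as a bag of precomputed CNN region features are hypothetical choices made for brevity.

```python
import torch
import torch.nn as nn

class AttentionCaptioner(nn.Module):
    """Toy encoder-decoder captioner with additive attention over image regions."""

    def __init__(self, vocab_size, feat_dim=512, hid_dim=256, emb_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)      # word embeddings
        self.init_h = nn.Linear(feat_dim, hid_dim)          # seed decoder state from the image
        self.att_feat = nn.Linear(feat_dim, hid_dim)        # project regions for attention
        self.att_hid = nn.Linear(hid_dim, hid_dim)          # project decoder state for attention
        self.att_score = nn.Linear(hid_dim, 1)              # scalar relevance score per region
        self.rnn = nn.GRUCell(emb_dim + feat_dim, hid_dim)  # one decoding step
        self.out = nn.Linear(hid_dim, vocab_size)           # next-word logits

    def forward(self, regions, captions):
        # regions: (B, R, feat_dim) precomputed region features; captions: (B, T) token ids
        h = torch.tanh(self.init_h(regions.mean(dim=1)))
        logits = []
        for t in range(captions.size(1)):
            # additive attention: score each region against the current decoder state
            scores = self.att_score(torch.tanh(
                self.att_feat(regions) + self.att_hid(h).unsqueeze(1)))
            alpha = torch.softmax(scores, dim=1)             # (B, R, 1) attention weights
            context = (alpha * regions).sum(dim=1)           # attended image vector
            x = torch.cat([self.embed(captions[:, t]), context], dim=-1)
            h = self.rnn(x, h)
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                    # (B, T, vocab_size)

# Toy usage: 2 images, 49 regions each, 5-token captions over a 100-word vocabulary.
model = AttentionCaptioner(vocab_size=100)
feats = torch.randn(2, 49, 512)
caps = torch.randint(0, 100, (2, 5))
print(model(feats, caps).shape)  # torch.Size([2, 5, 100])
```

At every decoding step the softmax over region scores plays the role of the "focus" described above: the decoder weighs most heavily the regions relevant to the word it is about to emit.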
We'll see these methods in action as we examine the key battlegrounds where progress is measured: tasks like generating captions for images and videos, answering complex questions about a scene (VQA), and retrieving images with a simple text query. We'll then venture into the world of robotics, where this technology is giving machines the ability to follow our commands, learn by watching us, and engage in grounded, meaningful dialogue. By weaving together insights from landmark studies, we'll paint a picture of the field's incredible achievements. Finally, we'll have a candid discussion about the tough challenges that lie ahead—the need for genuine commonsense reasoning, the subtle biases that creep into our data, and the quest for true understanding—as we look toward a future of embodied, communicative AI that promises to change our relationship with technology forever.
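The retrieval task mentioned above has an especially simple skeleton once a shared embedding space exists, so here is a second small sketch. The embeddings below are random placeholders standing in for vectors that trained image and text encoders would produce; in a real system the two encoders would be trained (e.g., with a contrastive or ranking loss) so that matching image–caption pairs end up close together.

```python
import torch
import torch.nn.functional as F

# Placeholder embeddings: in practice these come from an image encoder and a
# text encoder that project both modalities into the same 256-d space.
image_embs = F.normalize(torch.randn(1000, 256), dim=-1)  # 1,000 indexed images
query_emb = F.normalize(torch.randn(1, 256), dim=-1)      # one encoded text query

# Text-based image retrieval then reduces to nearest-neighbour search:
# cosine similarity of the query against every image embedding.
scores = query_emb @ image_embs.T          # (1, 1000) similarity scores
top5 = scores.topk(5, dim=-1).indices      # indices of the 5 best matches
print("best-matching image indices:", top5.tolist()[0])
```

Normalizing both sides makes the dot product exactly cosine similarity, so the ranking is insensitive to the scale of the embeddings.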
- References
-
1. Aditya, S., Yang, Y., Baral, C., Fermüller, C., & Aloimonos, Y. (2015). From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292.
2. Aksoy, E. E., Abramov, A., Dörr, J., Ning, K., Dellen, B., & Wörgötter, F. (2011). Learning the semantics of object–action relations by observation. The International Journal of Robotics Research, 30(10), 1229-1249.
3. Aloimonos, Y., & Fermüller, C. (2015). The cognitive dialogue: A new model for vision implementing common sense reasoning. Image and Vision Computing, 34, 42-44.
4. Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., & Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1), 2773-2832.
5. Anandkumar, A., Hsu, D., & Kakade, S. M. (2012a). A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory.
6. Anandkumar, A., Liu, Y. K., Hsu, D. J., Foster, D. P., & Kakade, S. M. (2012b). A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (pp. 917-925).
7. Anderson, A. J., Bruni, E., Bordignon, U., Poesio, M., & Baroni, M. (2013). Of words, eyes and brains: Correlating image-based distributional semantic models with neural representations of concepts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1960-1970).
8. Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016a). Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
9. Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016b). Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 39-48).
10. Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3), 463.
11. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). Vqa: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425-2433).
12. Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469-483.
13. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
14. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley framenet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics-Volume 1 (pp. 86-90).
15. Bakir, G. H. (2007). Predicting structured data. MIT press.
16. Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., ... & Schneider, N. (2012). Abstract meaning representation (AMR) 1.0 specification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
17. Bandura, A. (1974). Psychological Modeling: Conflicting Theories. Transaction Publishers.
18. Barbu, A., Bridge, A., Burchill, Z., Coroian, D., Dickinson, S., Fidler, S., ... & Salvi, D. (2012a). Video in sentences out. In UAI 2012.
19. Barbu, A., Michaux, A., Narayanaswamy, S., & Siskind, J. M. (2012b). Simultaneous object detection, tracking, and event recognition. In ACS 2012.
20. Barnard, K., Duygulu, P., Forsyth, D., De Freitas, N., Blei, D. M., & Jordan, M. I. (2003). Matching words and pictures. Journal of Machine Learning Research, 3, 1107–1135.
21. Barnard, K., & Forsyth, D. (2001). Learning the semantics of words and pictures. In Proceedings of the 8th IEEE International Conference on Computer Vision, 2001 (ICCV 2001), Vol. 2. IEEE, 408–415.
23. Baroni, M. (2016). Grounding distributional semantics in the visual world. Language and Linguistics Compass, 10(1), 3–13.
24. Barranco, F., Fermüller, C., & Aloimonos, Y. (2014). Contour motion estimation for asynchronous event-driven cameras. Proc. IEEE, 102(10), 1537–1556.
25. Barron, J. T., & Malik, J. (2015). Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell., 37(8), 1670–1687.
26. Bay, H., Ess, A., Tuytelaars, T., & Van Gool, L. (2008). Speeded-up robust features (SURF). Comput. Vis. Image Understand., 110(3), 346–359.
27. Bay, H., Tuytelaars, T., & Van Gool, L. (2006). Surf: Speeded up robust features. In Computer Vision–ECCV 2006. Springer, 404–417.
29. Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell., 19(7), 711–720.
30. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neur. Comput., 15(6), 1373–1396.
31. Beltagy, I., Roller, S., Cheng, P., Erk, K., & Mooney, R. J. (2015). Representing meaning with a combination of logical form and vectors. arXiv preprint arXiv:1505.06816.
32. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8), 1798–1828.
33. Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3, 1137–1155.
34. Bengio, Y., Larochelle, H., Lamblin, P., Popovici, D., Courville, A., Simard, C., ... & Erhan, D. (2007). Deep architectures for baby AI.
35. Berg, A., Deng, J., & Fei-Fei, L. (2010). Large scale visual recognition challenge (ILSVRC), 2010.
36. Berg, T. L., & Berg, A. C. (2009). Finding iconic images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009 (CVPR Workshops 2009). IEEE, 1–8.
37. Berg, T. L., Berg, A. C., Edwards, J., Maire, M., White, R., Teh, Y. W., ... & Forsyth, D. A. (2004). Names and faces in the news. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04), Vol. 2. IEEE, II–848.
38. Berg, T. L., Berg, A. C., & Shih, J. (2010). Automatic attribute discovery and characterization from noisy web data. In Computer Vision–ECCV 2010. Springer, 663–676.
39. Berg, T. L., Forsyth, D., & others. (2006). Animals on the web. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 1463–1470.
40. Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., ... & Plank, B. (2016). Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res., 55, 409–442.
41. Blei, D. M., & Jordan, M. I. (2003). Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Informaion Retrieval. ACM, 127–134.
42. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022.
43. Bloom, B. S., & others. (1956). Taxonomy of educational objectives. Vol. 1: Cognitive domain. McKay, New York, NY, 20–24.
44. Bronstein, A. M., Bronstein, M. M., & Kimmel, R. (2005). Three-dimensional face recognition. Int. J. Comput. Vis., 64(1), 5–30.
45. Bruni, E., Boleda, G., Baroni, M., & Tran, N. K. (2012). Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 136–145.
46. Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal distributional semantics. J. Artif. Intell. Res., 49, 1–47.
47. Byron, D., Koller, A., Oberlander, J., Stoia, L., & Striegnitz, K. (2007). Generating instructions in virtual environments (GIVE): A challenge and an evaluation testbed for NLG.
48. Cangelosi, A. (2006). The grounding and sharing of symbols. Pragm. Cogn., 14(2), 275–285.
49. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr, E. R., & Mitchell, T. M. (2010). Toward an architecture for never-ending language learning. In AAAI, Vol. 5. 3.
50. Carrasco, M. (2011). Visual attention: The past 25 years. Vis. Res., 51(13), 1484–1525.
51. Carreira, J., & Sminchisescu, C. (2010). Constrained parametric min-cuts for automatic object segmentation. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3241–3248.
52. Chang, A. X., Savva, M., & Manning, C. D. (2014). Semantic parsing for text to 3d scene generation. ACL 2014, 17.
53. Chao, Y. W., Wang, Z., He, Y., Wang, J., & Deng, J. (2015). HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision. 1017–1025.
55. Chemero, A. (2003). An outline of a theory of affordances. Ecological Psychology, 15(2), 181–195.
56. Chen, D. L., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1. Association for Computational Linguistics, 190–200.
57. Chen, D. L., & Mooney, R. J. (2008). Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning. ACM, 128–135.
59. Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollar, P., & Zitnick, C. L. (2015a). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
60. Chen, X., Shrivastava, A., & Gupta, A. (2013). Neil: Extracting visual knowledge from web data. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 1409–1416.
61. Chen, Z., Lin, W., Chen, Q., Chen, X., Wei, S., Jiang, H., & Zhu, X. (2015b). Revisiting word embedding for contrasting meaning. In Proceedings of ACL.
62. Chelba, C., et al. (2014). One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association.
66. Clark, S., & Pulman, S. (2007). Combining symbolic and distributional models of meaning. In AAAI Spring Symposium: Quantum Interaction. 52–55.
67. Cohen, M. D., & Bacdayan, P. (1994). Organizational routines are stored as procedural memory: Evidence from a laboratory study. Organiz. Sci., 5(4), 554–568.
68. Cohen, N., Sharir, O., & Shashua, A. (2016). On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Annual Conference on Learning Theory. 698–728.
69. Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. Syntax Sem. Struct. Stat. Transl., 103.
70. Choi, M. J., Torralba, A., & Willsky, A. S. (2012). Context models and out-of-context objects. Pattern Recogn. Lett., 33(7), 853–862.
72. Coradeschi, S., Loutfi, A., & Wrede, B. (2013). A short review of symbol grounding in robotic and intelligent systems. KI-Künstliche Intell., 27(2), 129–136.
73. Coradeschi, S., & Saffiotti, A. (2000). Anchoring symbols to sensor data: Preliminary report. In AAAI/IAAI. 129–135.
74. Cowan, N. (2008). What are the differences between long-term, short-term, and working memory? Progr. Brain Res., 169, 323–338.
75. Darrell, T. (2010). Learning Representations for Real-world Recognition. UCB EECS Colloquium.
76. Das, P., Xu, C., Doell, R., & Corso, J. (2013). A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2634–2641.
77. Daumé III, H. (2007). Frustratingly easy domain adaptation. ACL 2007, 256.
78. Daumé III, H., Langford, J., & Marcu, D. (2009). Search-based structured prediction. Mach. Learn., 75(3), 297–325.
79. Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., & Fergus, R. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems. 1269–1277.
80. Dodge, J., et al. (2012). Detecting visual text. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 762–772.
81. Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625–2634.
- Published
- 2024-12-27
- Section
- Articles
- License
-
All articles published by The Parthenon Frontiers and its associated journals are distributed under the terms of the Creative Commons Attribution (CC BY 4.0) International License unless otherwise stated.
Authors retain full copyright of their published work. By submitting their manuscript, authors agree to grant The Parthenon Frontiers a non-exclusive license to publish, archive, and distribute the article worldwide. Authors are free to:
- Share their article on personal websites, institutional repositories, or social media platforms.
- Reuse their content in future works, presentations, or educational materials, provided proper citation of the original publication.