ejecvnlp Open Access Journal

European Journals of Emerging Computer Vision and Natural Language Processing

Publication Frequency: 2 issues per year.

  • Peer-Reviewed & International Journal

Open Access

ARTICLE

SYNERGIES OF SIGHT AND LANGUAGE: A JOURNEY INTO HOW MACHINES LEARN TO SEE AND SPEAK

1 Department of Cognitive Robotics, Delft University of Technology, Delft, Netherlands
2 Artificial Intelligence and Vision Lab, University of Tokyo, Tokyo, Japan

ABSTRACT VIEWS: 21   |   FILE VIEWS: 25   |   PDF: 25   HTML: 0   OTHER: 0   |   TOTAL: 46

Abstract

For decades, we've dreamed of creating machines that can see the world as we do and talk to us about it. This once-distant dream is now a vibrant reality at the intersection of two of artificial intelligence's most ambitious fields: Computer Vision (CV) and Natural Language Processing (NLP). This article takes you on a comprehensive journey through this exciting landscape, exploring how we're teaching computers to connect pixels to prose. We'll start by exploring the fundamental questions that drive this field, like the "symbol grounding problem"—the puzzle of how words get their meaning—and use frameworks like Bloom's Taxonomy to map out what it truly means for a machine to "understand." We'll break down the core tasks of vision into the "3Rs" (Recognition, Reconstruction, Reorganization) and language into its essential layers (Syntax, Semantics, Pragmatics) to see exactly where these two worlds meet.

The heart of this survey is a deep dive into the toolbox of modern AI. We'll explore how machines learn to represent the world, from the early days of handcrafted visual features to the powerful deep learning models that create today's "image embeddings" and "word embeddings." From there, we'll investigate the ingenious architectures that fuse these senses together, including shared embedding spaces, the elegant encoder-decoder models that power image captioning, the clever attention mechanisms that let models "focus" on what's important, and the modular networks that allow for compositional reasoning.
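To give a concrete flavor of what an attention mechanism does mechanically, here is a minimal, framework-free Python sketch. The vectors and dimensions are toy values invented for illustration, not drawn from the article: a query (such as a caption decoder's current state) scores a set of image-region features, and a softmax over those scores produces the "focus" weights.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, region_features):
    """Soft attention: score each image-region feature by its dot
    product with the query, normalize the scores with softmax, and
    return the weights plus the weighted-average context vector."""
    scores = [sum(q * r for q, r in zip(query, region))
              for region in region_features]
    weights = softmax(scores)
    dim = len(region_features[0])
    context = [sum(w * region[d] for w, region in zip(weights, region_features))
               for d in range(dim)]
    return weights, context

# Toy example: three 2-D "region embeddings"; the query points at the
# second region, so that region receives the largest attention weight.
regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
query = [0.0, 2.0]
weights, context = attend(query, regions)
```

The same pattern, with learned projections of the query and regions, underlies the attention layers used in modern captioning and VQA models.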

We'll see these methods in action as we examine the key battlegrounds where progress is measured: tasks like generating captions for images and videos, answering complex questions about a scene (VQA), and retrieving images with a simple text query. We'll then venture into the world of robotics, where this technology is giving machines the ability to follow our commands, learn by watching us, and engage in grounded, meaningful dialogue. By weaving together insights from landmark studies, we'll paint a picture of the field's incredible achievements. Finally, we'll have a candid discussion about the tough challenges that lie ahead—the need for genuine commonsense, the subtle biases that creep into our data, and the quest for true understanding—as we look toward a future of embodied, communicative AI that promises to change our relationship with technology forever.
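Text-query image retrieval in a shared embedding space reduces, at inference time, to a nearest-neighbor search. The sketch below is a hypothetical toy (hand-made 3-D vectors standing in for real learned embeddings): the text query and each image live in the same space, and images are ranked by cosine similarity to the query.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def retrieve(text_embedding, image_embeddings):
    """Return image indices ranked by similarity to the text query."""
    return sorted(range(len(image_embeddings)),
                  key=lambda i: cosine(text_embedding, image_embeddings[i]),
                  reverse=True)

# Toy shared space: each image is a 3-D embedding (invented values).
images = {"dog": [0.9, 0.1, 0.0],
          "beach": [0.1, 0.8, 0.3],
          "city": [0.0, 0.2, 0.9]}
names = list(images)
query = [0.85, 0.2, 0.05]   # hypothetical embedding of a dog-related query
order = retrieve(query, [images[n] for n in names])
best = names[order[0]]      # → "dog"
```

Real systems learn the two encoders jointly so that matching image-text pairs land close together, but the retrieval step itself is exactly this ranking.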


Keywords

Computer Vision, Natural Language Processing, Vision and Language, Multimodal Learning


How to Cite

SYNERGIES OF SIGHT AND LANGUAGE: A JOURNEY INTO HOW MACHINES LEARN TO SEE AND SPEAK. (2024). European Journals of Emerging Computer Vision and Natural Language Processing, 1(01), 92-100. https://parthenonfrontiers.com/index.php/ejecvnlp/article/view/130
