European Journals of Emerging Computer Vision and Natural Language Processing

REAL-TIME AUDITORY GUIDANCE FOR THE VISUALLY IMPAIRED: AN F-RCNN APPROACH IN ASSISTIVE ROBOTICS

Authors
  • Dr. Eryndor V. Mallin
    School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
  • Dr. Taisia L. Khoren
    Department of Electrical Engineering and Information Technology, ETH Zurich, Switzerland
Keywords:
Computer Vision, Natural Language Processing, Multimedia Robotics, Human-Robot Interaction
Abstract

The convergence of Computer Vision (CV) and Natural Language Processing (NLP), two of the most dynamic research areas in machine learning, is charting a transformative course for robotics. This article examines the integration of these two pivotal domains of artificial intelligence to enhance the capabilities of multimedia robotics applications. We explore how robots that simultaneously interpret visual data from their environment and comprehend human language can achieve unprecedented levels of interaction and operational sophistication. The discussion covers the foundational principles of CV and NLP, tracing the evolution of techniques from classical methods to advanced deep learning models [9]. We examine methodologies for fusing visual and linguistic data, focusing on architectures that enable robots to perform complex tasks such as object recognition and manipulation based on verbal commands [14]. Significant attention is given to a practical application of this synergy: an assistive technology for visually impaired individuals that pairs a smartphone with a server running a Faster Region-based Convolutional Neural Network (F-RCNN) to identify obstacles and provide real-time auditory guidance. We analyze the applications, benefits, and inherent challenges of this integration, drawing on a wide array of research. Through a comprehensive review of the existing literature, we illustrate the profound impact of this synergy on creating more intelligent, autonomous, and intuitive robotic systems. The findings suggest that continued advances in the fusion of CV and NLP will be instrumental in realizing the full potential of social and industrial robots in our society [1].
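The paper's own implementation is not reproduced on this page; as a minimal sketch, assuming a COCO-pretrained torchvision Faster R-CNN as a stand-in for the F-RCNN server model, the server side of such an obstacle-guidance loop might look as follows. The detect_obstacles and guidance_message helpers, the 0.7 score threshold, and the left/ahead/right heuristic are illustrative assumptions, not the authors' method.

    # Sketch of the server side of the smartphone/F-RCNN pipeline described
    # in the abstract. Assumptions (not from the paper): a COCO-pretrained
    # torchvision Faster R-CNN stands in for the authors' model; the guidance
    # heuristic, class names, and threshold are illustrative placeholders.
    import torch
    import torchvision
    from torchvision.transforms.functional import to_tensor
    from PIL import Image

    weights = torchvision.models.detection.FasterRCNN_ResNet50_FPN_Weights.DEFAULT
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=weights)
    model.eval()
    CLASS_NAMES = weights.meta["categories"]  # COCO category names

    def detect_obstacles(frame: Image.Image, threshold: float = 0.7):
        """Run the detector on one camera frame; return (label, box) pairs."""
        with torch.no_grad():
            pred = model([to_tensor(frame)])[0]
        return [
            (CLASS_NAMES[int(label)], box.tolist())
            for label, score, box in zip(pred["labels"], pred["scores"], pred["boxes"])
            if float(score) >= threshold
        ]

    def guidance_message(detections, frame_width: int) -> str:
        """Turn detections into a short spoken instruction (toy heuristic)."""
        if not detections:
            return "Path clear."
        # Detections are sorted by confidence; report the strongest one.
        name, (x1, _, x2, _) = detections[0]
        center = (x1 + x2) / 2.0
        if center < frame_width / 3:
            side = "to your left"
        elif center > 2 * frame_width / 3:
            side = "to your right"
        else:
            side = "ahead"
        return f"{name} {side}."

In the full system, the smartphone would stream camera frames to this server and render the returned message through its on-device text-to-speech engine, closing the real-time auditory feedback loop the abstract describes.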

References

1. G. Yin, Intelligent framework for social robots based on artificial intelligence-driven mobile edge computing, Computers & Electrical Engineering, 96, Part B (2021).

2. M. Fisher, R. C. Cardoso, E. C. Collins, C. Dadswell, L. A. Dennis, C. Dixon, ... and M. Webster, An overview of verification and validation challenges for inspection robots, Robotics, 10, 67 (2021).

3. A. Jamshed and M. M. Fraz, NLP Meets Vision for Visual Interpretation - A Retrospective Insight and Future directions, 2021 International Conference on Digital Futures and Transformative Technologies (ICoDT), 1-8 (2021).

4. W. Fang, P. E. D. Love, H. Luo, L. Ding, Computer vision for behaviour-based safety in construction: A review and future directions, Advanced Engineering Informatics, 43 (2020).

5. H. Sharma, Improving Natural Language Processing tasks by Using Machine Learning Techniques, 2021 5th International Conference on Information Systems and Computer Networks (ISCON), 1-5 (2021).

6. J. Malik, P. Arbeláez, J. Carreira, K. Fragkiadaki, R. Girshick, G. Gkioxari, S. Gupta, B. Hariharan, A. Kar, and S. Tulsiani, The three R’s of computer vision: Recognition, reconstruction and reorganization, Pattern Recognition Letters, 72, 4-14 (2016).

7. P. Gärdenfors, The Geometry of Meaning: Semantics Based on Conceptual Spaces, MIT Press (2014).

8. E. Dockrell, D. Messer, R. George, and A. Ralli, Beyond naming patterns in children with WFDs—Definitions for nouns and verbs, Journal of Neurolinguistics, 16, 191-211 (2003).

9. A. Torfi, R. A. Shirvani, Y. Keneshloo, N. Tavaf, and E. A. Fox, Natural language processing advancements by deep learning: A survey, arXiv preprint arXiv:2003.01200 (2020).

10. W. Graterol, J. Diaz-Amado, Y. Cardinale, I. Dongo, E. Lopes-Silva, and C. Santos-Libarino, Emotion detection for social robots based on NLP transformers and an emotion ontology, Sensors, 21, 1322 (2021).

11. Z. Shao, W. Wu, Z. Wang, W. Du, and C. Li, SeaShips: A large-scale precisely annotated dataset for ship detection, IEEE Transactions on Multimedia, 20, 2593-2604 (2018).

12. https://monkeylearn.com/blog/natural-language-processing-challenges/, last visited 1/2/2022.

13. C. Zhang, Z. Yang, X. He and L. Deng, Multimodal Intelligence: Representation Learning, Information Fusion, and Applications, IEEE Journal of Selected Topics in Signal Processing, 14, 478-493 (2020).

14. S. Tellex, T. Kollar, S. Dickerson, M. R. Walter, A. G. Banerjee, S. J. Teller, and N. Roy, Understanding natural language commands for robotic navigation and mobile manipulation. In Proceedings of the AAAI Conference on Artificial Intelligence, 25, 1507-1514 (2011).

15. Y. Yang, C. Teo, H. Daumé III, and Y. Aloimonos, Corpus-guided sentence generation of natural images, In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, 444-454 (2011).

16. T. S. Motwani, R. J. Mooney, Improving Video Activity Recognition using Object Recognition and Text Mining, In Proceedings of the 20th European Conference on Artificial Intelligence (ECAI-2012), 600-605 (2012).

17. N. Krishnamoorthy, G. Malkarnenkar, R. J. Mooney, K. Saenko and S. Guadarrama, Generating Natural-Language Video Descriptions Using Text-Mined Knowledge, In Proceedings of the 27th AAAI Conference on Artificial Intelligence (AAAI-2013), 541-547 (2013).

18. J. Thomason, S. Venugopalan, S. Guadarrama, K. Saenko, and R. Mooney, Integrating language and vision to generate natural language descriptions of videos in the wild, In Proceedings of the 25th International Conference on Computational Linguistics (COLING) (2014).

19. Y. Yang, C. L. Teo, C. Fermüller, and Y. Aloimonos, Robots with language: Multi-label visual recognition using NLP, In IEEE International Conference on Robotics and Automation, 4256-4262 (2013).

20. S. Aditya, Y. Yang, C. Baral, C. Fermuller, and Y. Aloimonos, From images to sentences through scene description graphs using commonsense reasoning and knowledge, arXiv preprint (2015).

21. R. Bernardi, R. Cakici, D. Elliott, A. Erdem, E. Erdem, N. I. Cinbis, F. Keller, A. Muscat, and B. Plank, Automatic description generation from images: A survey of models, datasets, and evaluation measures, Journal of Artificial Intelligence Research, 55, 409-442 (2016).

22. P. Das, C. Xu, R. Doell, and J. Corso, A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2634-2641 (2013).

23. Y. Gong, Q. Ke, M. Isard, and S. Lazebnik, A multi-view embedding space for modeling internet images, tags, and their semantics, International Journal of Computer Vision, 106, 210-233 (2014).

24. A. Karpathy and L. Fei-Fei, Deep visual-semantic alignments for generating image descriptions, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 3128-3137 (2015).

25. R. Schwartz, R. Reichart and A. Rappoport, Symmetric pattern based word embeddings for improved word similarity prediction, In Proceedings of CoNLL, 258-267 (2015).

26. N. Shukla, C. Xiong, and S. C. Zhu, A unified framework for human-robot knowledge transfer, In Proceedings of the 2015 AAAI Fall Symposium Series (2015).

27. C. Silberer, V. Ferrari, and M. Lapata, Models of semantic representation with visual attributes, In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, 572-582 (2013).

28. R. Socher, A. Karpathy, Q. V. Le, C. D. Manning, and A. Y. Ng, Grounded compositional semantics for finding and describing images with sentences, Transactions of the Association for Computational Linguistics, 2, 207-218 (2014).

29. M. Tapaswi, M. Bäuml, and R. Stiefelhagen, Book2Movie: Aligning video scenes with book chapters, In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 1827-1835 (2015).

30. I. Abdalla Mohamed, A. Ben Aissa, L. F. Hussein, A. I. Taloba, and T. Kallel, A new model for epidemic prediction: COVID-19 in Kingdom of Saudi Arabia case study, Materials Today: Proceedings (2021).

31. A. I. Taloba and S. S. I. Ismail, An Intelligent Hybrid Technique of Decision Tree and Genetic Algorithm for E-Mail Spam Detection, Ninth International Conference on Intelligent Computing and Information Systems (ICICIS), 99-104 (2019).

32. A. I. Taloba, M. R. Riad and T. H. A. Soliman, Developing an efficient spectral clustering algorithm on large scale graphs in Spark, Eighth International Conference on Intelligent Computing and Information Systems (ICICIS), 292-298 (2017).

33. D. Tang, B. Qin, and T. Liu, Document modeling with gated recurrent neural network for sentiment classification, In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1422-1432 (2015).

34. Q. Le and T. Mikolov, Distributed representations of sentences and documents, In International Conference on Machine Learning, 1188-1196 (2014).

35. S. R. Bowman, G. Angeli, C. Potts, and C. D. Manning, A large annotated corpus for learning natural language inference, arXiv preprint arXiv:1508.05326 (2015).

36. Y. Yang, W. Yih, and C. Meek, WikiQA: A challenge dataset for open-domain question answering, In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 2013-2018 (2015).

37. W. Yih, M. Richardson, C. Meek, M.-W. Chang, and J. Suh, The value of semantic parse labeling for knowledge base question answering, In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), 201-206 (2016).

38. A. Fukui, D. H. Park, D. Yang, A. Rohrbach, T. Darrell, and M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, arXiv preprint arXiv:1606.01847 (2016).

39. N. Kalchbrenner, E. Grefenstette, and P. Blunsom, A convolutional neural network for modelling sentences, arXiv preprint arXiv:1404.2188 (2014).

40. T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean, Distributed representations of words and phrases and their compositionality, Advances in Neural Information Processing Systems, 26 (2013).

Published
2024-12-30
Section
Articles
License

All articles published by The Parthenon Frontiers and its associated journals are distributed under the terms of the Creative Commons Attribution (CC BY 4.0) International License unless otherwise stated. 

Authors retain full copyright of their published work. By submitting their manuscript, authors agree to grant The Parthenon Frontiers a non-exclusive license to publish, archive, and distribute the article worldwide. Authors are free to:

  • Share their article on personal websites, institutional repositories, or social media platforms.

  • Reuse their content in future works, presentations, or educational materials, provided proper citation of the original publication.

How to Cite

REAL-TIME AUDITORY GUIDANCE FOR THE VISUALLY IMPAIRED: AN F-RCNN APPROACH IN ASSISTIVE ROBOTICS. (2024). European Journals of Emerging Computer Vision and Natural Language Processing, 1(01), 101-108. https://parthenonfrontiers.com/index.php/ejecvnlp/article/view/131
