ejecvnlp Open Access Journal

European Journals of Emerging Computer Vision and Natural Language Processing

eISSN: Applied
Publication Frequency: 2 issues per year

  • Peer Reviewed & International Journal

Open Access

ARTICLE

REAL-TIME AUDITORY GUIDANCE FOR THE VISUALLY IMPAIRED: AN F-RCNN APPROACH IN ASSISTIVE ROBOTICS

1 School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
2 Department of Electrical Engineering and Information Technology, ETH Zurich, Switzerland


Abstract

The convergence of Computer Vision (CV) and Natural Language Processing (NLP), two of the most dynamic research areas in machine learning, is charting a transformative course for the field of robotics. This article examines the integration of these two pivotal domains of artificial intelligence to enhance the capabilities of multimedia robotics applications. We explore how robots, by simultaneously interpreting visual data from their environment and comprehending human language, can achieve unprecedented levels of interaction and operational sophistication. The discussion navigates through the foundational principles of CV and NLP, highlighting the evolution of techniques from classical methods to advanced deep learning models [9]. We examine the methodologies behind fusing visual and linguistic data, focusing on architectures that enable robots to perform complex tasks such as object recognition and manipulation based on verbal commands [14]. A significant focus is placed on a practical application of this synergy: an assistive technology for visually impaired individuals, which uses a smartphone paired with a server running a Faster Region-based Convolutional Neural Network (F-RCNN) to identify obstacles and provide real-time auditory guidance. This article presents an in-depth analysis of the applications, benefits, and inherent challenges of this integration, drawing upon a wide array of research. Through a comprehensive review of existing literature, we illustrate the profound impact of this synergy on creating more intelligent, autonomous, and intuitive robotic systems. The findings suggest that continued advancement in the fusion of CV and NLP will be instrumental in realizing the full potential of social and industrial robots in our society [1].
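The abstract describes a pipeline in which an F-RCNN server returns obstacle detections that are then converted into spoken guidance for the user. The article does not give implementation details, but the post-processing step could be sketched roughly as follows; the function names, confidence threshold, and left/centre/right partitioning here are illustrative assumptions, not the authors' actual method:

```python
# Hypothetical post-processing sketch: turning detector output into a
# guidance utterance. Each detection is (label, confidence, (x1, y1, x2, y2)).

CONF_THRESHOLD = 0.6  # assumed confidence cutoff for announcing an obstacle

def direction(box, frame_width):
    """Map a box's horizontal centre to a coarse spoken direction."""
    cx = (box[0] + box[2]) / 2
    if cx < frame_width / 3:
        return "to your left"
    if cx > 2 * frame_width / 3:
        return "to your right"
    return "ahead"

def guidance_message(detections, frame_width):
    """Compose a short utterance from sufficiently confident detections."""
    parts = [f"{label} {direction(box, frame_width)}"
             for label, conf, box in detections
             if conf >= CONF_THRESHOLD]
    return "; ".join(parts) if parts else "path clear"

if __name__ == "__main__":
    dets = [("person", 0.92, (40, 10, 120, 200)),   # left third of a 480 px frame
            ("chair", 0.45, (300, 50, 380, 180))]   # below threshold, ignored
    print(guidance_message(dets, 480))  # person to your left
```

In a full system the resulting string would be handed to a text-to-speech engine on the smartphone; that step is omitted here to keep the sketch self-contained.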


Keywords

Computer Vision, Natural Language Processing, Multimedia Robotics, Human-Robot Interaction

References

1. G. Yin, Intelligent framework for social robots based on artificial intelligence-driven mobile edge computing, Computers & Electrical Engineering, 96, Part B, (2021).

2. M. Fisher, R. C. Cardoso, E. C. Collins, C. Dadswell, L. A. Dennis, C. Dixon, et al., An overview of verification and validation challenges for inspection robots, Robotics, 10, 67 (2021).

3. A. Jamshed and M. M. Fraz, NLP Meets Vision for Visual Interpretation - A Retrospective Insight and Future directions, 2021 International Conference on Digital Futures and Transformative Technologies (ICoDT), 1-8 (2021).

4. W. Fang, P. E. D. Love, H. Luo, L. Ding, Computer vision for behaviour-based safety in construction: A review and future directions, Advanced Engineering Informatics, 43, (2020).

5. H. Sharma, Improving Natural Language Processing tasks by Using Machine Learning Techniques, 2021 5th International Conference on Information Systems and Computer Networks (ISCON), 1-5 (2021).


How to Cite

REAL-TIME AUDITORY GUIDANCE FOR THE VISUALLY IMPAIRED: AN F-RCNN APPROACH IN ASSISTIVE ROBOTICS. (2024). European Journals of Emerging Computer Vision and Natural Language Processing, 1(01), 101-108. https://parthenonfrontiers.com/index.php/ejecvnlp/article/view/131
