1. Aditya, S., Yang, Y., Baral, C., Fermüller, C., & Aloimonos, Y. (2015). From images to sentences through scene description graphs using commonsense reasoning and knowledge. arXiv preprint arXiv:1511.03292.
2. Aksoy, E. E., Abramov, A., Dörr, J., Ning, K., Dellen, B., & Wörgötter, F. (2011). Learning the semantics of object–action relations by observation. The International Journal of Robotics Research, 30(10), 1229–1249.
3. Aloimonos, Y., & Fermüller, C. (2015). The cognitive dialogue: A new model for vision implementing common sense reasoning. Image and Vision Computing, 34, 42–44.
4. Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., & Telgarsky, M. (2014). Tensor decompositions for learning latent variable models. Journal of Machine Learning Research, 15(1), 2773–2832.
5. Anandkumar, A., Hsu, D., & Kakade, S. M. (2012a). A method of moments for mixture models and hidden Markov models. In Conference on Learning Theory.
6. Anandkumar, A., Liu, Y. K., Hsu, D. J., Foster, D. P., & Kakade, S. M. (2012b). A spectral algorithm for latent Dirichlet allocation. In Advances in Neural Information Processing Systems (pp. 917–925).
7. Anderson, A. J., Bruni, E., Bordignon, U., Poesio, M., & Baroni, M. (2013). Of words, eyes and brains: Correlating image-based distributional semantic models with neural representations of concepts. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (pp. 1960–1970).
8. Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016a). Learning to compose neural networks for question answering. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies.
9. Andreas, J., Rohrbach, M., Darrell, T., & Klein, D. (2016b). Neural module networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 39–48).
10. Andrews, M., Vigliocco, G., & Vinson, D. (2009). Integrating experiential and distributional data to learn semantic representations. Psychological Review, 116(3), 463.
11. Antol, S., Agrawal, A., Lu, J., Mitchell, M., Batra, D., Zitnick, C. L., & Parikh, D. (2015). VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision (pp. 2425–2433).
12. Argall, B. D., Chernova, S., Veloso, M., & Browning, B. (2009). A survey of robot learning from demonstration. Robotics and Autonomous Systems, 57(5), 469–483.
13. Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
14. Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley FrameNet project. In Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics—Volume 1 (pp. 86–90).
15. Bakir, G. H. (2007). Predicting structured data. MIT press.
16. Banarescu, L., Bonial, C., Cai, S., Georgescu, M., Griffitt, K., Hermjakob, U., ... & Schneider, N. (2012). Abstract meaning representation (AMR) 1.0 specification. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing.
17. Bandura, A. (1974). Psychological Modeling: Conflicting Theories. Transaction Publishers.
18. Barbu, A., Bridge, A., Burchill, Z., Coroian, D., Dickinson, S., Fidler, S., ... & Salvi, D. (2012a). Video in sentences out. In UAI 2012.
19. Barbu, A., Michaux, A., Narayanaswamy, S., & Siskind, J. M. (2012b). Simultaneous object detection, tracking, and event recognition. In ACS 2012.
20. Barnard, K., Duygulu, P., Forsyth, D., De Freitas, N., Blei, D. M., & Jordan, M. I. (2003). Matching words and pictures. Journal of Machine Learning Research, 3, 1107–1135.
21. Barnard, K., & Forsyth, D. (2001). Learning the semantics of words and pictures. In Proceedings of the 8th IEEE International Conference on Computer Vision, 2001 (ICCV 2001), Vol. 2. IEEE, 408–415.
23. Baroni, M. (2016). Grounding distributional semantics in the visual world. Language and Linguistics Compass, 10(1), 3–13.
24. Barranco, F., Fermüller, C., & Aloimonos, Y. (2014). Contour motion estimation for asynchronous event-driven cameras. Proc. IEEE, 102(10), 1537–1556.
25. Barron, J. T., & Malik, J. (2015). Shape, illumination, and reflectance from shading. IEEE Trans. Pattern Anal. Mach. Intell., 37(8), 1670–1687.
26. Bay, H., Ess, A., Tuytelaars, T., & Van Gool, L. (2008). Speeded-up robust features (SURF). Comput. Vis. Image Understand., 110(3), 346–359.
27. Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. In Computer Vision–ECCV 2006. Springer, 404–417.
29. Belhumeur, P. N., Hespanha, J. P., & Kriegman, D. J. (1997). Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Anal. Mach. Intell., 19(7), 711–720.
30. Belkin, M., & Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neur. Comput., 15(6), 1373–1396.
31. Beltagy, I., Roller, S., Cheng, P., Erk, K., & Mooney, R. J. (2015). Representing meaning with a combination of logical form and vectors. arXiv preprint arXiv:1505.06816.
32. Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Trans. Pattern Anal. Mach. Intell., 35(8), 1798–1828.
33. Bengio, Y., Ducharme, R., Vincent, P., & Janvin, C. (2003). A neural probabilistic language model. J. Mach. Learn. Res., 3, 1137–1155.
34. Bengio, Y., Larochelle, H., Lamblin, P., Popovici, D., Courville, A., Simard, C., ... & Erhan, D. (2007). Deep architectures for baby AI.
35. Berg, A., Deng, J., & Fei-Fei, L. (2010). Large scale visual recognition challenge (ILSVRC), 2010.
36. Berg, T. L., & Berg, A. C. (2009). Finding iconic images. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 2009 (CVPR Workshops 2009). IEEE, 1–8.
37. Berg, T. L., Berg, A. C., Edwards, J., Maire, M., White, R., Teh, Y. W., ... & Forsyth, D. A. (2004). Names and faces in the news. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’04), Vol. 2. IEEE, II–848.
38. Berg, T. L., Berg, A. C., & Shih, J. (2010). Automatic attribute discovery and characterization from noisy web data. In Computer Vision–ECCV 2010. Springer, 663–676.
39. Berg, T. L., & Forsyth, D. A. (2006). Animals on the web. In Proceedings of the 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 1463–1470.
40. Bernardi, R., Cakici, R., Elliott, D., Erdem, A., Erdem, E., Ikizler-Cinbis, N., ... & Plank, B. (2016). Automatic description generation from images: A survey of models, datasets, and evaluation measures. J. Artif. Intell. Res., 55, 409–442.
41. Blei, D. M., & Jordan, M. I. (2003). Modeling annotated data. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 127–134.
42. Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. J. Mach. Learn. Res., 3, 993–1022.
43. Bloom, B. S., et al. (1956). Taxonomy of educational objectives. Vol. 1: Cognitive domain. McKay, New York, NY, 20–24.
44. Bronstein, A. M., Bronstein, M. M., & Kimmel, R. (2005). Three-dimensional face recognition. Int. J. Comput. Vis., 64(1), 5–30.
45. Bruni, E., Boleda, G., Baroni, M., & Tran, N. K. (2012). Distributional semantics in technicolor. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers-Volume 1. Association for Computational Linguistics, 136–145.
46. Bruni, E., Tran, N. K., & Baroni, M. (2014). Multimodal distributional semantics. J. Artif. Intell. Res., 49, 1–47.
47. Byron, D., Koller, A., Oberlander, J., Stoia, L., & Striegnitz, K. (2007). Generating instructions in virtual environments (GIVE): A challenge and an evaluation testbed for NLG.
48. Cangelosi, A. (2006). The grounding and sharing of symbols. Pragm. Cogn., 14(2), 275–285.
49. Carlson, A., Betteridge, J., Kisiel, B., Settles, B., Hruschka Jr, E. R., & Mitchell, T. M. (2010). Toward an architecture for never-ending language learning. In Proceedings of the 24th AAAI Conference on Artificial Intelligence.
50. Carrasco, M. (2011). Visual attention: The past 25 years. Vis. Res., 51(13), 1484–1525.
51. Carreira, J., & Sminchisescu, C. (2010). Constrained parametric min-cuts for automatic object segmentation. In Proceedings of the 2010 IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 3241–3248.
52. Chang, A. X., Savva, M., & Manning, C. D. (2014). Semantic parsing for text to 3D scene generation. In Proceedings of the ACL 2014 Workshop on Semantic Parsing, 17.
53. Chao, Y. W., Wang, Z., He, Y., Wang, J., & Deng, J. (2015). HICO: A benchmark for recognizing human-object interactions in images. In Proceedings of the IEEE International Conference on Computer Vision. 1017–1025.
55. Chemero, A. (2003). An outline of a theory of affordances. Ecological Psychology, 15(2), 181–195.
56. Chen, D. L., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies—Volume 1. Association for Computational Linguistics, 190–200.
57. Chen, D. L., & Mooney, R. J. (2008). Learning to sportscast: A test of grounded language acquisition. In Proceedings of the 25th International Conference on Machine Learning. ACM, 128–135.
59. Chen, X., Fang, H., Lin, T. Y., Vedantam, R., Gupta, S., Dollár, P., & Zitnick, C. L. (2015a). Microsoft COCO captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
60. Chen, X., Shrivastava, A., & Gupta, A. (2013). Neil: Extracting visual knowledge from web data. In Proceedings of the 2013 IEEE International Conference on Computer Vision (ICCV). IEEE, 1409–1416.
61. Chen, Z., Lin, W., Chen, Q., Chen, X., Wei, S., Jiang, H., & Zhu, X. (2015b). Revisiting word embedding for contrasting meaning. In Proceedings of ACL.
62. Chelba, C., et al. (2014). One billion word benchmark for measuring progress in statistical language modeling. In Proceedings of the 15th Annual Conference of the International Speech Communication Association.
66. Clark, S., & Pulman, S. (2007). Combining symbolic and distributional models of meaning. In AAAI Spring Symposium: Quantum Interaction. 52–55.
67. Cohen, M. D., & Bacdayan, P. (1994). Organizational routines are stored as procedural memory: Evidence from a laboratory study. Organiz. Sci., 5(4), 554–568.
68. Cohen, N., Sharir, O., & Shashua, A. (2016). On the expressive power of deep learning: A tensor analysis. In Proceedings of the 29th Annual Conference on Learning Theory. 698–728.
69. Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of SSST-8, Eighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 103.
70. Choi, M. J., Torralba, A., & Willsky, A. S. (2012). Context models and out-of-context objects. Pattern Recogn. Lett., 33(7), 853–862.
72. Coradeschi, S., Loutfi, A., & Wrede, B. (2013). A short review of symbol grounding in robotic and intelligent systems. KI-Künstliche Intell., 27(2), 129–136.
73. Coradeschi, S., & Saffiotti, A. (2000). Anchoring symbols to sensor data: Preliminary report. In AAAI/IAAI. 129–135.
74. Cowan, N. (2008). What are the differences between long-term, short-term, and working memory? Progr. Brain Res., 169, 323–338.
75. Darrell, T. (2010). Learning Representations for Real-world Recognition. UCB EECS Colloquium.
76. Das, P., Xu, C., Doell, R., & Corso, J. (2013). A thousand frames in just a few words: Lingual description of videos through latent topics and sparse object stitching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2634–2641.
77. Daumé III, H. (2007). Frustratingly easy domain adaptation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, 256–263.
78. Daumé III, H., Langford, J., & Marcu, D. (2009). Search-based structured prediction. Mach. Learn., 75(3), 297–325.
79. Denton, E. L., Zaremba, W., Bruna, J., LeCun, Y., & Fergus, R. (2014). Exploiting linear structure within convolutional networks for efficient evaluation. In Advances in Neural Information Processing Systems. 1269–1277.
80. Dodge, J., et al. (2012). Detecting visual text. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, 762–772.
81. Donahue, J., Hendricks, L. A., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2625–2634.