A STATIC ANALYSIS FRAMEWORK UTILIZING LARGE LANGUAGE MODELS FOR IDENTIFYING MALICIOUS OFFICE OPEN XML FILES

Authors

  • Dr. Carlos M. Ruiz, School of Computing and Information Systems, University of Melbourne, Australia
  • Prof. Anna Petrova, Faculty of Computational Mathematics and Cybernetics, Lomonosov Moscow State University, Russia

Abstract

Office Open XML (OOXML) documents are a primary vector for malware distribution because they are ubiquitous in enterprise and personal computing. The complexity of the OOXML format makes it easy to conceal malicious payloads, which often evade traditional security measures. Conventional detection methods rely predominantly on signature-based scanning and predefined rules, and they are frequently outpaced by rapidly evolving malware, particularly polymorphic code, zero-day exploits, and advanced social engineering tactics. This paper proposes a static analysis framework that uses the contextual understanding and reasoning capabilities of Large Language Models (LLMs) to identify malicious OOXML documents. Our methodology systematically deconstructs the OOXML package into a structured, human-readable JSON representation. This representation is then passed to an LLM, which, guided by a role-based prompt, performs a semantic analysis of the document's constituent parts. The model examines VBA macro code, XML relationship files, embedded objects, and metadata for indicators of malicious intent. This approach goes beyond simple pattern matching and enables a holistic assessment of the document's structure and content. The framework shows high potential for accurately identifying malicious documents, including those that employ heavy obfuscation or novel attack vectors, and thus offers a meaningful advance in the defense against document-based cyber threats.
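
The deconstruction and prompting steps described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (ooxml_to_json, build_prompt, make_demo_package), the JSON layout, and the prompt wording are all assumptions made for illustration, and the sketch stops before any call to an LLM. It relies only on the fact that an OOXML file is a ZIP archive of XML and binary parts.

```python
import io
import json
import zipfile


def ooxml_to_json(data: bytes) -> str:
    """Flatten an OOXML package (a ZIP archive) into a JSON string.

    Hypothetical layout: {"parts": {part_name: {...}}}.
    """
    parts = {}
    with zipfile.ZipFile(io.BytesIO(data)) as package:
        for name in sorted(package.namelist()):
            raw = package.read(name)
            if name.endswith((".xml", ".rels")):
                # XML parts and relationship files are kept as readable text.
                parts[name] = {
                    "kind": "xml",
                    "text": raw.decode("utf-8", errors="replace"),
                }
            else:
                # Binary parts (e.g. vbaProject.bin, embedded objects) are only
                # summarised here; a fuller pipeline would extract their contents.
                parts[name] = {"kind": "binary", "size": len(raw)}
    return json.dumps({"parts": parts}, indent=2)


def build_prompt(package_json: str) -> str:
    """Wrap the JSON in an illustrative role-based prompt (wording is assumed)."""
    return (
        "You are a malware analyst reviewing an Office Open XML document.\n"
        "Inspect the macros, relationships, embedded objects, and metadata below "
        "and state whether the document looks malicious, with reasons.\n\n"
        + package_json
    )


def make_demo_package() -> bytes:
    """Build a tiny in-memory package so the sketch runs without a real file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as package:
        package.writestr("[Content_Types].xml", "<Types/>")
        package.writestr(
            "word/_rels/document.xml.rels",
            '<Relationships><Relationship Target="http://example.invalid/t.dotm" '
            'TargetMode="External"/></Relationships>',
        )
        package.writestr("word/vbaProject.bin", b"\x00\x01\x02")
    return buffer.getvalue()


if __name__ == "__main__":
    print(build_prompt(ooxml_to_json(make_demo_package())))
```

Keeping relationship files as text matters in a sketch like this because external targets, such as the placeholder remote template above, are among the structural indicators a model could weigh alongside macro code.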

Published

2024-12-11