European Journal of Emerging Artificial Intelligence (EJEAI)

Open Access Journal
Publication Frequency: 2 issues per year
Peer Reviewed & International Journal

ARTICLE

OPTIMIZING SHAP EXPLANATIONS: A COST-EFFECTIVE DATA SAMPLING METHOD FOR ENHANCED INTERPRETABILITY

1 School of Data Science, The Chinese University of Hong Kong, Hong Kong
2 Department of Computer Engineering, Politecnico di Milano, Italy


Abstract

The proliferation of complex machine learning (ML) models in critical domains such as healthcare, finance, and real estate has underscored the urgent need for Explainable Artificial Intelligence (XAI) [2, 3, 4, 31, 43, 45]. While these "black-box" models often achieve superior predictive performance, their inherent lack of transparency hinders trust, accountability, and the ability to effectively debug and refine them. SHapley Additive exPlanations (SHAP) is a widely recognized model-agnostic XAI method that provides detailed insights into individual feature contributions to model predictions, robustly grounded in cooperative game theory [39, 58, 61]. However, a significant drawback of SHAP, particularly when applied to large datasets or computationally intensive models, is its substantial computational overhead, often rendering its application impractical in resource-constrained or real-time operational environments [60, 64]. This article proposes and rigorously investigates a data-efficient strategy for achieving SHAP interpretability by leveraging intelligent data reduction techniques. Specifically, we explore the application of Slovin's formula, a statistical sampling technique traditionally employed in survey research, as a low-cost heuristic for data reduction. Unlike more complex feature selection or dimensionality reduction methods, Slovin's formula requires minimal prior statistical knowledge of the dataset's properties, offering a straightforward, accessible, and efficient alternative for subsampling without extensive preprocessing. Through controlled experiments on synthetically generated datasets, we demonstrate that by judiciously sampling a representative subset of the original data, SHAP explanations can be generated with significantly reduced computational cost while maintaining a high degree of fidelity to the explanations derived from the full dataset. 
Our findings highlight a U-shaped trade-off in SHAP value stability: mid-ranked features tend to remain stable under subsampling, whereas features with extreme (very low or very high) importance exhibit larger fluctuations. Furthermore, we observe that categorical features and features with non-skewed distributions generally remain more robust, while highly skewed target distributions introduce increased variability. Crucially, the effectiveness and reliability of Slovin's formula diminish when the subsample-to-sample ratio falls below a critical threshold of approximately 5%. This empirical evaluation underscores the potential of our cost-effective approach to democratize access to advanced interpretability, enabling faster model insights, improved debugging, and broader, more sustainable deployment of transparent AI systems across domains.
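The sampling step described above rests on Slovin's formula, n = N / (1 + N·e²), where N is the population (dataset) size and e the tolerated margin of error. A minimal sketch of how such a subsampling step could look in Python follows; the function names and the choice of simple random sampling are our own illustrative assumptions, not the authors' published code:

```python
import math
import random


def slovin_sample_size(population_size: int, margin_of_error: float = 0.05) -> int:
    """Slovin's formula: n = N / (1 + N * e^2), rounded up to a whole sample."""
    return math.ceil(population_size / (1 + population_size * margin_of_error ** 2))


def subsample(rows: list, margin_of_error: float = 0.05, seed: int = 0) -> list:
    """Draw a simple random subsample whose size is set by Slovin's formula."""
    n = slovin_sample_size(len(rows), margin_of_error)
    rng = random.Random(seed)  # fixed seed for reproducible explanations
    return rng.sample(rows, n)


# Example: a 10,000-row dataset at e = 0.05 reduces to 385 rows.
subset = subsample(list(range(10_000)))
```

The resulting subset would then be handed to a SHAP explainer in place of the full dataset (for example, as the background/reference data). Note the abstract's caveat: when the subsample falls below roughly 5% of the original data, the resulting explanations become unreliable.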


Keywords

SHapley Additive exPlanations (SHAP), Explainable Artificial Intelligence (XAI), Slovin's formula, data sampling, model interpretability, computational cost

References

[1] A. M. Abdullahi. 2023. The challenges of advancing inclusive education: the case of Somalia's higher education. Journal of Law and Sustainable Development, 11, 2, e422–e422.

[2] A. A. Adeniran, A. P. Onebunne, and P. William. 2024. Explainable AI (XAI) in healthcare: enhancing trust and transparency in critical decision-making. World Journal of Advanced Research and Reviews, 23, 2647–2658.

[3] Q. An, S. Rahman, J. Zhou, and J. J. Kang. 2023. A comprehensive review on machine learning in healthcare industry: classification, restrictions, opportunities and challenges. Sensors, 23, 9, 4178.

[4] Z. Asimiyu. 2024. Balancing explainable AI and security: machine learning for IoT, finance, and real estate. Preprint (2024).

[5] S. Athey and G. W. Imbens. 2019. Machine learning methods that economists should know about. Annual Review of Economics, 11, 1, 685–725.


How to Cite

OPTIMIZING SHAP EXPLANATIONS: A COST-EFFECTIVE DATA SAMPLING METHOD FOR ENHANCED INTERPRETABILITY. (2024). European Journal of Emerging Artificial Intelligence, 1(01), 71-89. https://parthenonfrontiers.com/index.php/ejeai/article/view/49
