Accelerating Post-Silicon Debug: An Ensemble Machine Learning and Explainable AI Approach for Platform Boot Failures
Abstract
As modern server platforms grow in complexity, debugging boot failures in FPGA-controlled power-up sequences becomes increasingly difficult, especially in post-silicon environments where issues are hard to reproduce and visibility is limited. This work introduces a machine learning-based framework that automatically classifies platform boot states from Control and Status Register (CSR) data read through the Baseboard Management Controller (BMC). An ensemble model combining Neural Networks, Random Forest, Extreme Gradient Boosting (XGBoost), and a binary refinement classifier differentiates accurately among four platform boot conditions. The solution integrates Explainable Artificial Intelligence (XAI) techniques to highlight the key signals that influence each decision, giving engineers insight for faster triage. A Python-based inference script connects pre-silicon training and post-silicon deployment by mapping real-time hardware readings to the input format of the model. The experimental results demonstrate high accuracy, reduced overlap between boot state classes, and effective generalization to previously unseen datasets. The framework makes post-silicon debug faster and clearer and reduces dependence on traditional techniques.
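The mapping step mentioned in the abstract, from raw hardware readings to the model's input format, might look roughly like the sketch below. It is only an illustration under stated assumptions: the register names (`CSR_STATUS`, `CSR_FAULT`), the bit-field layout, and the helper names are hypothetical and do not come from the paper, which does not describe its inference script at this level of detail.

```python
from typing import Dict, List, Tuple

# Hypothetical layout (not from the paper):
# feature name -> (register name, bit offset, bit width).
FEATURE_LAYOUT: Dict[str, Tuple[str, int, int]] = {
    "pwr_good": ("CSR_STATUS", 0, 1),
    "seq_state": ("CSR_STATUS", 4, 4),
    "fault_code": ("CSR_FAULT", 0, 8),
}


def extract_field(value: int, offset: int, width: int) -> int:
    """Return the `width`-bit field that starts at bit `offset` of `value`."""
    return (value >> offset) & ((1 << width) - 1)


def csr_to_features(readings: Dict[str, int]) -> List[float]:
    """Decode raw CSR values into a feature vector in a fixed order.

    The order follows FEATURE_LAYOUT, so the same layout must be used
    when the training data is prepared.
    """
    features: List[float] = []
    for _name, (register, offset, width) in FEATURE_LAYOUT.items():
        raw = readings[register]  # raises KeyError if a register is missing
        features.append(float(extract_field(raw, offset, width)))
    return features


# Example readings, as they might be returned by a BMC query (values made up).
readings = {"CSR_STATUS": 0x31, "CSR_FAULT": 0x0A}
print(csr_to_features(readings))  # [1.0, 3.0, 10.0]
```

Keeping the layout in a single table means the pre-silicon training pipeline and the post-silicon inference script can share one definition of the feature order, which is the kind of consistency the mapping step needs.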