IEEE Access, vol. 13, pp. 215748-215770, 2025 (SCI-Expanded, Scopus)
Medical text classification (MTC) poses significant challenges in health informatics due to contextual complexity, class imbalance, and limited labeled data. This study introduces HeaRT (Health Representation Transformer), a novel and explainable classification framework tailored for sparsely labeled and semantically complex medical texts. To the best of our knowledge, this is the first study to systematically combine TF-based lexical weighting, SBERT contextual embeddings, and cluster-enhanced structural features within a unified multi-vector fusion architecture for MTC. Pairing TF-based weighting with SBERT embeddings allows HeaRT to capture both lexical and contextual nuances. A hybrid feature selection mechanism based on ANOVA and SHAP reduces redundancy and improves interpretability. To further enhance representational capacity, cluster-derived features from K-means, DBSCAN, and Agglomerative Clustering are added, introducing topological structure awareness into the learning process. The proposed framework is evaluated on two benchmark datasets, Medical Abstracts and the Biomedical Text Dataset, and compared against state-of-the-art models such as BERT and Doc2Sequence. Experimental results show that HeaRT achieves an F1-score of 60.74% with AdaBoost on Medical Abstracts and 94.02% with LightGBM on the Biomedical Text Dataset. Paired t-tests indicate that these gains are statistically significant (p < 0.05). These findings position HeaRT as a robust, interpretable, and extensible solution for medical text classification, with strong potential for deployment in clinical decision support, biomedical literature mining, and healthcare informatics applications.
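The multi-vector fusion idea described above can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The abstract does not state how the vectors are fused, so plain concatenation is an assumption here. The `embedding` argument is a hypothetical stand-in for an SBERT sentence vector, `centroids` is a hypothetical stand-in for cluster centers obtained from a previously fitted K-means model, and using centroid distances as the cluster-derived features is likewise an assumption. The function names `tf_vector`, `centroid_distances`, and `fuse` are illustrative only.

```python
from collections import Counter
import math


def tf_vector(tokens, vocab):
    """Relative term frequency of each vocabulary word in one document."""
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero for empty documents
    return [counts[word] / total for word in vocab]


def centroid_distances(vector, centroids):
    """Euclidean distance from a vector to each cluster centroid.

    Assumption: distances to centroids serve as the cluster-derived features.
    """
    return [math.dist(vector, centroid) for centroid in centroids]


def fuse(tokens, embedding, vocab, centroids):
    """Concatenate lexical, contextual, and cluster-derived views.

    Assumption: fusion is simple concatenation of the three vectors.
    """
    lexical = tf_vector(tokens, vocab)
    structural = centroid_distances(embedding, centroids)
    return lexical + list(embedding) + structural


# Toy inputs, all hypothetical: a 2-word vocabulary, a 2-dimensional
# "embedding", and two cluster centroids in the embedding space.
vocab = ["fever", "cough"]
tokens = ["fever", "fever", "cough", "rash"]
embedding = [0.0, 3.0]
centroids = [[0.0, 0.0], [4.0, 0.0]]

fused = fuse(tokens, embedding, vocab, centroids)
print(fused)  # [0.5, 0.25, 0.0, 3.0, 3.0, 5.0]
```

In a fuller pipeline, the fused vector would then pass through feature selection (the abstract names ANOVA and SHAP) before reaching a classifier such as AdaBoost or LightGBM. Those stages are omitted here to keep the sketch self-contained.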