HeaRT: An Innovative Health Representation Transformer With Clustered Feature Augmentation for Biomedical Text Classification

PINAR, MERVE; Altinel, AYŞE; AKTAŞ, ABDULSAMET

doi:10.1109/access.2025.3646756

PINAR M., Altinel A. B., AKTAŞ A.

IEEE Access, vol. 13, pp. 215748-215770, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 13
Publication Date: 2025
DOI: 10.1109/access.2025.3646756
Journal Name: IEEE Access
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
Pages: pp. 215748-215770
Keywords: biomedical natural language processing, cluster-enhanced feature augmentation, medical text classification, multi-vector feature fusion, SBERT embeddings
Marmara University Affiliated: Yes

Abstract

Medical text classification (MTC) poses significant challenges in health informatics due to contextual complexity, class imbalance, and limited labeled data. This study introduces HeaRT (Health Representation Transformer), a novel and explainable classification framework tailored for sparsely labeled and semantically complex medical texts. To the best of our knowledge, this is the first study to systematically combine TF-based lexical weighting, SBERT contextual embeddings, and cluster-enhanced structural features within a unified multi-vector fusion architecture for MTC. The TF-based weighting and the SBERT embeddings together capture both lexical and contextual nuances. A hybrid feature selection mechanism based on ANOVA and SHAP reduces redundancy and improves interpretability. To further enhance representational capacity, cluster-derived features from K-means, DBSCAN, and Agglomerative Clustering are added, introducing topological structure awareness into the learning process. The proposed framework is evaluated on two benchmark datasets, Medical Abstracts and the Biomedical Text Dataset, and is compared against state-of-the-art models such as BERT and Doc2Sequence. Experimental results show that HeaRT achieves an F1-score of 60.74% with AdaBoost on Medical Abstracts and 94.02% with LightGBM on the Biomedical Text Dataset. Paired t-test results confirm the statistical significance of these gains (p < 0.05). These findings establish HeaRT as a robust, interpretable, and extensible solution for medical text classification, with strong potential for deployment in clinical decision support, biomedical literature mining, and healthcare informatics applications.
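
The multi-vector fusion described in the abstract can be illustrated with a minimal scikit-learn sketch. It is not the authors' implementation; several details are assumptions made for illustration. The helper name `build_fused_features`, all hyperparameters, and the toy texts are hypothetical. "TF-based weighting" is read here as L2-normalised term frequency. The dense vectors are a random stand-in for SBERT embeddings so that the example runs without downloading a model. The SHAP stage of the hybrid selection is omitted, and DBSCAN and Agglomerative Clustering contribute only their raw cluster labels.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering, KMeans
from sklearn.ensemble import AdaBoostClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, f_classif


def build_fused_features(texts, labels, embeddings, k_lexical=5, n_clusters=2, seed=0):
    """Hypothetical helper: fuse lexical, dense, and cluster-derived features."""
    embeddings = np.asarray(embeddings, dtype=float)

    # Lexical view: term frequencies only (use_idf=False switches IDF off).
    tf = TfidfVectorizer(use_idf=False).fit_transform(texts)

    # ANOVA F-test keeps the k terms that best separate the classes.
    # (The SHAP-based half of the hybrid selection is not shown.)
    k = min(k_lexical, tf.shape[1])
    lexical = SelectKBest(f_classif, k=k).fit_transform(tf, labels).toarray()

    # Cluster view on the dense embeddings:
    # K-means contributes the distance to each centroid,
    # DBSCAN and Agglomerative Clustering contribute one label column each.
    km_dist = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_transform(embeddings)
    db_label = DBSCAN(eps=0.5, min_samples=2).fit_predict(embeddings).reshape(-1, 1)
    ag_label = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(embeddings).reshape(-1, 1)

    # Fusion: concatenate all views column-wise into one matrix.
    return np.hstack([lexical, embeddings, km_dist, db_label, ag_label])


# Toy data, invented for illustration only.
texts = [
    "patient reports chest pain and shortness of breath",
    "echocardiogram shows reduced ejection fraction",
    "elevated troponin suggests myocardial infarction",
    "biopsy confirms malignant tumor in the left lung",
    "chemotherapy reduced tumor size after three cycles",
    "metastatic carcinoma detected in lymph nodes",
]
labels = [0, 0, 0, 1, 1, 1]

# Random stand-in for SBERT sentence embeddings (assumption, 8 dimensions).
embeddings = np.random.default_rng(0).normal(size=(len(texts), 8))

X = build_fused_features(texts, labels, embeddings)
clf = AdaBoostClassifier(n_estimators=25, random_state=0).fit(X, labels)
print(X.shape)
```

In a realistic setting the random matrix would be replaced by actual sentence embeddings, and the selector and clustering models would be fitted on training folds only, since DBSCAN and Agglomerative Clustering in scikit-learn offer no `predict` method for unseen documents.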