Publications
2026
- From object difficulty to image scoring: A strategy for active learning in object detection
  Duc Tai Phan, Nhut Minh Nguyen, Khang Phuc Nguyen, Phuong-Nam Tran, Nhat Truong Pham, Linh Le, Choong Seon Hong, and Duc Ngoc Minh Dang
  Knowledge-Based Systems, Apr 2026
Object detection (OD) faces the costly hurdle of large-scale dataset annotation. Active Learning (AL) is a promising solution that selects the most beneficial samples for annotation. Applying AL to OD is challenging, as it requires addressing uncertainties in both classifying and localizing objects while combining object-level information into a single image-level decision. Current methods, like uncertainty sampling, often rely on output-level uncertainty or heuristic signal combinations. However, relying on static model outputs limits their ability to capture underlying feature-space instabilities. To address these limitations, we propose Feature Difficulty-based AL (FDAL), a novel framework that shifts the focus from output-level heuristics to active latent-space manipulation. By systematically interpolating unlabeled features toward class anchors, FDAL identifies prediction inconsistencies to reveal hidden traits unknown to the model. This anchor-guided interpolation unifies classification and localization uncertainties, capturing latent instabilities that traditional paradigms miss. Experiments on four OD benchmarks demonstrate that FDAL achieves state-of-the-art (SOTA) results, consistently improving detection accuracy while substantially reducing annotation costs. FDAL outperforms SOTA methods by 0.8% in detection accuracy on Pattern analysis, statistical modelling, and computational learning (PASCAL) Visual Object Classes (PASCAL VOC), 1.99% on Karlsruhe Institute of Technology and Toyota Technological Institute (KITTI), 1.68% on Cityscapes, and 0.27% on Microsoft Common Objects in Context (MS COCO). FDAL’s selection is highly efficient, taking 0.06 seconds per image on PASCAL VOC and up to 0.63 seconds per image on Cityscapes, enabling scalability to large unlabeled pools. FDAL offers a practical, efficient, and effective solution for advancing AL in OD.
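The anchor-guided interpolation idea behind FDAL can be illustrated with a toy sketch: blend an unlabeled feature toward each class anchor and count how often the prediction flips. The function name, step sizes, and nearest-anchor classifier below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def fdal_instability(feat, anchors, classify, steps=(0.25, 0.5, 0.75)):
    """Toy instability score: interpolate a feature toward each class
    anchor and count how often the predicted class changes.
    `classify` maps a feature vector to a class index."""
    base = classify(feat)
    flips, total = 0, 0
    for anchor in anchors:
        for a in steps:
            mixed = (1.0 - a) * feat + a * anchor  # anchor-guided interpolation
            flips += int(classify(mixed) != base)
            total += 1
    return flips / total  # higher => more latent instability => more informative

# toy nearest-anchor classifier over two class anchors
anchors = [np.array([0.0, 0.0]), np.array([10.0, 10.0])]
classify = lambda f: int(np.argmin([np.linalg.norm(f - a) for a in anchors]))

print(fdal_instability(np.array([4.9, 4.9]), anchors, classify))  # near the boundary: higher score
print(fdal_instability(np.array([0.5, 0.5]), anchors, classify))  # deep inside class 0: lower score
```

In an AL loop, such per-image scores would rank the unlabeled pool so the highest-scoring images are sent for annotation.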
@article{PHAN2026115946,
  author   = {Phan, Duc Tai and Nguyen, Nhut Minh and Nguyen, Khang Phuc and Tran, Phuong-Nam and Pham, Nhat Truong and Le, Linh and Hong, Choong Seon and Dang, Duc Ngoc Minh},
  title    = {From object difficulty to image scoring: A strategy for active learning in object detection},
  journal  = {Knowledge-Based Systems},
  volume   = {342},
  pages    = {115946},
  month    = apr,
  year     = {2026},
  issn     = {0950-7051},
  keywords = {Active learning, Object detection, Computer vision, Difficulty, Instability},
  doi      = {10.1016/j.knosys.2026.115946},
  url      = {https://www.sciencedirect.com/science/article/pii/S0950705126006726}
}

- Swin Transformer V2 for Optical Chemical Structure Recognition: Comparison with Convolutional Neural Networks and Swin Transformer Variants
  Thanh Trung Nguyen, Nhut Minh Nguyen, Duc Tai Phan, Quang Nhan Hoang, Tri Minh Pham, and Duc Ngoc Minh Dang
  In The 22nd International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON 2025), Bangkok, Thailand, Mar 2026
Optical Chemical Structure Recognition (OCSR) is designed to transform images of chemical structures into formats that computers can process, facilitating the extraction of molecular data from scientific publications on a large scale. Although modern deep learning methods have shown encouraging results, the impact of different visual backbone designs on understanding molecular structures has not been thoroughly investigated. This research compares various visual backbones for OCSR using an encoder-decoder architecture. The results show that while Convolutional Neural Network (CNN) architectures deliver excellent accuracy at the character level, Swin Transformer V2 generates more accurate molecular structures, as evidenced by superior Tanimoto similarity metrics. Moreover, even though Swin Transformer V2 has similar computational demands as the original Swin Transformer (V1), it exhibits enhanced training reliability and better capability for structural modeling, underscoring its effectiveness as a dependable visual backbone for OCSR applications.
@inproceedings{Nguyen2605:Swin2,
  author    = {Nguyen, Thanh Trung and Nguyen, Nhut Minh and Phan, Duc Tai and Hoang, Quang Nhan and Pham, Tri Minh and Dang, Duc Ngoc Minh},
  title     = {Swin Transformer V2 for Optical Chemical Structure Recognition: Comparison with Convolutional Neural Networks and Swin Transformer Variants},
  booktitle = {The 22nd International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON 2025)},
  address   = {Bangkok, Thailand},
  month     = mar,
  year      = {2026},
  keywords  = {Optical Chemical Structure Recognition (OCSR); Swin Transformer; Deep Learning; Molecular Graph Extraction; Computer Vision}
}
2025
- YOLOv5-Powered Smart Parking System with IoT-Based Real-Time Slot Monitoring
  Phuong Lam Nguyen, Duc Tai Phan, Thanh Trung Nguyen, Nhut Minh Nguyen, and Duc Ngoc Minh Dang
  In 2025 RIVF International Conference on Computing and Communication Technologies (RIVF 2025), Ho Chi Minh City, Vietnam, Dec 2025
Finding a parking space has become a daily struggle in an era of rapid urbanization and increasing car ownership, contributing to traffic congestion, lost time, and environmental stress. To transform urban parking management, this study presents a novel Smart Parking System (SPS) that seamlessly integrates deep learning and Internet of Things (IoT) technologies. Our SPS delivers a streamlined workflow, featuring real-time slot occupancy detection via a fine-tuned YOLOv5 model analyzing camera feeds, automated entry/exit handling triggered by IoT sensors, and a specialized License Plate Recognition (LPR) model. Additionally, it includes secure cloud-based data logging for effortless tracking and billing. The web dashboard provides real-time, color-coded visualizations of slot availability, supporting efficient parking management. Experimental evaluation demonstrates an accuracy of 90% in license plate recognition and over 99.5% in slot occupancy detection, highlighting the system’s reliability and practical applicability for smart city deployment.
@inproceedings{11365041,
  author    = {Nguyen, Phuong Lam and Phan, Duc Tai and Nguyen, Thanh Trung and Nguyen, Nhut Minh and Dang, Duc Ngoc Minh},
  title     = {YOLOv5-Powered Smart Parking System with IoT-Based Real-Time Slot Monitoring},
  booktitle = {2025 RIVF International Conference on Computing and Communication Technologies (RIVF 2025)},
  address   = {Ho Chi Minh City, Vietnam},
  pages     = {693-698},
  month     = dec,
  year      = {2025},
  keywords  = {YOLO;Analytical models;Smart cities;Automated parking;Real-time systems;Internet of Things;License plate recognition;Monitoring;Traffic congestion;Intelligent sensors;Smart Parking System;License Plate Recognition;Real-Time Slot Monitoring;YOLOv5;Optical Character Recognition;Computer Vision},
  doi       = {10.1109/RIVF68649.2025.11365041},
  isbn      = {979-8-3315-7790-2}
}

- GloMER: Towards Robust Multimodal Emotion Recognition via Gated Fusion and Contrastive Learning
  Nhut Minh Nguyen, Duc Tai Phan, and Duc Ngoc Minh Dang
  In 17th International Conference on Management of Digital Ecosystems, Ho Chi Minh City, Vietnam, Nov 2025
Speech Emotion Recognition (SER) enhances Human-Computer Interaction (HCI) across healthcare, education, and customer service. Although multimodal approaches that combine audio and text show promise, they face challenges in aligning heterogeneous modalities and achieving balanced feature fusion. In this work, we propose GloMER, a novel multimodal SER architecture that integrates a self-alignment strategy with contrastive learning and a Gated Multimodal Unit (GMU) for adaptive fusion. The self-alignment mechanism ensures semantic consistency while preserving modality diversity. The GMU dynamically regulates modality contributions, ensuring balanced integration and context-aware representations under imbalanced conditions. GloMER achieves state-of-the-art (SOTA) performance on two benchmark datasets, delivering consistent improvements in multimodal emotion recognition. Ablation studies show that combining classification with alignment improves embedding discriminability, while gated fusion balances modalities and captures complementary cues. These findings establish GloMER as a robust framework for advancing multimodal emotion recognition.
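The Gated Multimodal Unit used for adaptive fusion can be sketched in a few lines: a learned sigmoid gate decides, per hidden dimension, how much each modality contributes. The weight shapes and random inputs below are illustrative; GloMER's actual projections and training are in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gmu_fuse(x_audio, x_text, Wa, Wt, Wz):
    """Minimal Gated Multimodal Unit: project each modality, then mix
    them with a gate computed from both inputs."""
    h_a = np.tanh(Wa @ x_audio)   # audio projection
    h_t = np.tanh(Wt @ x_text)    # text projection
    z = 1.0 / (1.0 + np.exp(-(Wz @ np.concatenate([x_audio, x_text]))))  # gate in (0, 1)
    return z * h_a + (1.0 - z) * h_t  # convex, per-dimension mixing

d_a, d_t, d_h = 8, 6, 4
Wa = rng.standard_normal((d_h, d_a))
Wt = rng.standard_normal((d_h, d_t))
Wz = rng.standard_normal((d_h, d_a + d_t))
fused = gmu_fuse(rng.standard_normal(d_a), rng.standard_normal(d_t), Wa, Wt, Wz)
print(fused.shape)  # (4,)
```

Because the gate is a convex combination of two tanh projections, the fused representation stays bounded while the gate can lean on whichever modality is more reliable for a given sample.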
- DAAL: Dual Ambiguity in Active Learning for Object Detection with YOLOE
  Duc Tai Phan, Nhut Minh Nguyen, and Duc Ngoc Minh Dang
  In 17th International Conference on Management of Digital Ecosystems, Ho Chi Minh City, Vietnam, Nov 2025
The high cost of annotating vast datasets poses a significant challenge in object detection, hindering the development of robust models. By selectively annotating the most instructive samples, active learning has become a viable strategy for addressing this problem and improving model performance and labeling efficiency. However, the two main types of model uncertainty—epistemic, which captures the model’s uncertainty in classifying an object, and aleatoric, which addresses the inherent ambiguity in an object’s presence and location—are frequently not balanced in traditional active learning techniques. In this paper, we introduce DAAL (Dual Ambiguity Active Learning), a novel framework that quantifies and combines both epistemic and aleatoric ambiguity into a single, weighted score. Epistemic ambiguity measures the model’s indecision in assigning semantic labels, while aleatoric ambiguity assesses the conviction in object presence and localization. By combining these, DAAL selects the most informative images for annotation, optimizing model performance under limited labeling budgets. Extensive experiments on popular benchmarks demonstrate that DAAL consistently outperforms traditional methods, achieving superior accuracy under the same limited labeling budget. This affirms its effectiveness in creating more efficient annotation workflows for object detection.
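The weighted combination of the two ambiguity types can be sketched as follows. The margin-based epistemic term and the objectness-based aleatoric term are plausible instantiations assumed for illustration; DAAL's exact formulas are in the paper.

```python
import numpy as np

def daal_image_score(class_probs, objectness, w=0.5):
    """Toy dual-ambiguity score.  `class_probs` is (N, C) softmax output
    per detected box, `objectness` is (N,) confidence that a box contains
    an object.  Epistemic: 1 minus the top-2 class margin (label
    indecision).  Aleatoric: distance of objectness from a confident 0 or
    1 (presence ambiguity).  Weighted sum, max-pooled over boxes."""
    sorted_p = np.sort(class_probs, axis=1)
    epistemic = 1.0 - (sorted_p[:, -1] - sorted_p[:, -2])  # small margin => ambiguous label
    aleatoric = 1.0 - np.abs(2.0 * objectness - 1.0)       # objectness near 0.5 => ambiguous presence
    return float(np.max(w * epistemic + (1.0 - w) * aleatoric))

# an image with a near-tie class prediction outranks a confident one
ambiguous = daal_image_score(np.array([[0.50, 0.45, 0.05]]), np.array([0.55]))
confident = daal_image_score(np.array([[0.95, 0.03, 0.02]]), np.array([0.99]))
print(ambiguous > confident)  # True
```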
- CemoBAM: Advancing Multimodal Emotion Recognition through Heterogeneous Graph Networks and Cross-Modal Attention Mechanisms
  Nhut Minh Nguyen, Thu Thuy Le, Thanh Trung Nguyen, Duc Tai Phan, Anh Khoa Tran, and Duc Ngoc Minh Dang
  In 2025 25th Asia-Pacific Network Operations and Management Symposium (APNOMS’2025), Kaohsiung, Taiwan, Sep 2025
Multimodal Speech Emotion Recognition (SER) offers significant advantages over unimodal approaches by integrating diverse information streams such as audio and text. However, effectively fusing these heterogeneous modalities remains a significant challenge. We propose CemoBAM, a novel dual-stream architecture that effectively integrates a Cross-modal Heterogeneous Graph Attention Network (CH-GAT) with a Cross-modal Convolutional Block Attention Mechanism (xCBAM). In the CemoBAM architecture, the CH-GAT constructs a heterogeneous graph that models intra- and inter-modal relationships, employing multi-head attention to capture fine-grained dependencies across audio and text feature embeddings. The xCBAM enhances feature refinement through a cross-modal transformer with a modified 1D-CBAM, employing bidirectional cross-attention and channel-spatial attention to emphasize emotionally salient features. CemoBAM surpasses previous state-of-the-art (SOTA) methods by 0.32% on the IEMOCAP dataset and 3.25% on the ESD dataset. Comprehensive ablation studies validate the impact of Top-K graph construction parameters, fusion strategies, and the complementary contributions of both modules. The results highlight CemoBAM’s robustness and potential for advancing multimodal SER applications.
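The bidirectional cross-attention at the heart of xCBAM can be sketched without learned projections: tokens of one modality attend over the other, and vice versa. Shapes and inputs below are illustrative assumptions; the paper's module additionally applies channel-spatial (CBAM-style) attention.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(query_seq, key_seq, d):
    """One direction of bidirectional cross-attention (toy version with
    no learned Q/K/V projections): each query token takes a weighted
    average of the other modality's tokens."""
    scores = query_seq @ key_seq.T / np.sqrt(d)  # (n_q, n_k) similarity
    return softmax(scores, axis=-1) @ key_seq    # (n_q, d) context

rng = np.random.default_rng(1)
audio = rng.standard_normal((5, 16))  # 5 audio frames
text = rng.standard_normal((7, 16))   # 7 text tokens
audio_ctx = cross_attend(audio, text, 16)  # audio attends to text
text_ctx = cross_attend(text, audio, 16)   # text attends to audio
print(audio_ctx.shape, text_ctx.shape)  # (5, 16) (7, 16)
```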
@inproceedings{11181320,
  author    = {Nguyen, Nhut Minh and Le, Thu Thuy and Nguyen, Thanh Trung and Phan, Duc Tai and Tran, Anh Khoa and Dang, Duc Ngoc Minh},
  title     = {CemoBAM: Advancing Multimodal Emotion Recognition through Heterogeneous Graph Networks and Cross-Modal Attention Mechanisms},
  booktitle = {2025 25th Asia-Pacific Network Operations and Management Symposium (APNOMS'2025)},
  address   = {Kaohsiung, Taiwan},
  month     = sep,
  year      = {2025},
  keywords  = {Human computer interaction;Emotion recognition;Attention mechanisms;Speech recognition;Speech enhancement;Electrostatic discharges;Transformers;Robustness;Noise measurement;Multimodal emotion recognition;Speech emotion recognition;Cross-modal heterogeneous graph attention;Crossmodal convolutional block attention mechanism;Feature fusion},
  doi       = {10.23919/APNOMS67058.2025.11181320},
  isbn      = {978-4-88552-356-4}
}

- ALMUS: Enhancing Active Learning for Object Detection with Metric-Based Uncertainty Sampling
  Duc Tai Phan, Nhut Minh Nguyen, Khang Phuc Nguyen, Tri Minh Pham, and Duc Ngoc Minh Dang
  In 2025 25th Asia-Pacific Network Operations and Management Symposium (APNOMS’2025), Kaohsiung, Taiwan, Sep 2025
Object detection is critical in computer vision but often requires large amounts of labeled data for effective training. Active learning (AL) has emerged as a promising solution to reduce the annotation burden by selecting the most informative samples for labeling. However, existing AL methods for object detection primarily focus on uncertainty sampling, which may not effectively balance the dual challenges of classification and localization. In this study, we explore active learning for object detection, with the objective of optimizing model performance while substantially reducing the demand for annotated data. We propose a novel Active Learning with Metric-based Uncertainty Sampling (ALMUS) that works effectively for the object detection task. This approach prioritizes selecting images containing objects from categories where the model exhibits suboptimal performance, as determined by category-specific evaluation metrics. To balance the annotation budget across different object classes, we propose a dynamic allocation strategy that considers the difficulty of each class and the distribution of object instances within the dataset. This combination of strategies enables our method to effectively address the dual challenges of classification and localization in object detection tasks while still focusing on the rarest and most challenging classes. We conduct extensive experiments on the PASCAL VOC 2007 and 2012 datasets, demonstrating that our method outperforms several active learning baselines. Our results indicate that the proposed approach enhances model performance and accelerates convergence, making it a valuable contribution to the field of active learning in object detection.
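The dynamic budget allocation described above can be sketched as a weighting over per-class difficulty and rarity. The specific weighting (inverse AP times inverse frequency) and the class names are illustrative assumptions; only the idea of combining class difficulty with instance distribution comes from the abstract.

```python
def allocate_budget(per_class_ap, instance_counts, budget):
    """Toy difficulty-aware allocation: classes with low AP (hard) and
    few instances (rare) receive a larger share of the annotation budget."""
    total = sum(instance_counts.values())
    weights = {
        c: (1.0 - per_class_ap[c]) * (1.0 - instance_counts[c] / total)
        for c in per_class_ap
    }
    norm = sum(weights.values())
    return {c: round(budget * w / norm) for c, w in weights.items()}

quota = allocate_budget(
    per_class_ap={"car": 0.85, "bus": 0.40, "bicycle": 0.55},
    instance_counts={"car": 800, "bus": 50, "bicycle": 150},
    budget=100,
)
print(quota)  # the rare, low-AP "bus" class gets the largest share
```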
@inproceedings{11181447,
  author    = {Phan, Duc Tai and Nguyen, Nhut Minh and Nguyen, Khang Phuc and Pham, Tri Minh and Dang, Duc Ngoc Minh},
  title     = {ALMUS: Enhancing Active Learning for Object Detection with Metric-Based Uncertainty Sampling},
  booktitle = {2025 25th Asia-Pacific Network Operations and Management Symposium (APNOMS'2025)},
  address   = {Kaohsiung, Taiwan},
  pages     = {6},
  month     = sep,
  year      = {2025},
  keywords  = {Location awareness;Measurement;Training;Uncertainty;Annotations;Active learning;Object detection;Robustness;Resource management;Convergence;Active Learning;Object Detection;Uncertainty Sampling;Metric-based Sampling},
  doi       = {10.23919/APNOMS67058.2025.11181447},
  isbn      = {978-1-23456-789-0}
}
2024
- Improving Face Attendance Checking System with Ensemble Learning
  Duc Tai Phan, Phuong-Nam Tran, and Duc Ngoc Minh Dang
  In 2024 RIVF International Conference on Computing and Communication Technologies (RIVF 2024), Danang, Vietnam, Dec 2024
An accurate and efficient attendance system is essential in modern industrial and educational settings. However, traditional methods often suffer from inaccuracies, inefficiencies, and vulnerability to fraud. Recent work typically relies on a single model for face recognition, which fails whenever that model underperforms. This paper presents an advanced face attendance system using ensemble learning to overcome the limitations of single-model approaches. The proposed system achieves near-perfect accuracy under optimal conditions by integrating the strengths of multiple deep learning models, including ResNet, VGGFace, and FaceNet. The ensemble approach boosts the robustness and reliability of the attendance system, making it a promising solution for real-world deployment in educational and workplace environments. The key contribution of this work is the development of a face attendance system that utilizes the complementary capabilities of different models to deliver significantly improved accuracy and resilience compared to standalone methods.
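A minimal way to combine per-model identity predictions is majority voting with abstention, sketched below. This is an illustrative fusion rule; the paper's actual ensembling strategy over ResNet, VGGFace, and FaceNet may differ.

```python
from collections import Counter

def ensemble_identify(predictions):
    """Majority-vote fusion over per-model identity predictions.
    Returns the winning identity, or None (abstain) when no strict
    majority exists, so disagreement triggers a manual check rather
    than a wrong attendance record."""
    identity, count = Counter(predictions).most_common(1)[0]
    return identity if count > len(predictions) / 2 else None

print(ensemble_identify(["alice", "alice", "bob"]))  # 'alice'
print(ensemble_identify(["alice", "bob", "carol"]))  # None
```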
@inproceedings{11009085,
  author    = {Phan, Duc Tai and Tran, Phuong-Nam and Dang, Duc Ngoc Minh},
  title     = {Improving Face Attendance Checking System with Ensemble Learning},
  booktitle = {2024 RIVF International Conference on Computing and Communication Technologies (RIVF 2024)},
  address   = {Danang, Vietnam},
  pages     = {5},
  month     = dec,
  year      = {2024},
  doi       = {10.1109/RIVF64335.2024.11009085},
  isbn      = {979-8-3315-0507-3}
}

- Towards Real-Time Vietnamese Traffic Sign Recognition on Embedded Systems
  Phuong-Nam Tran, Nhat Truong Pham, Nam Van Hai Phan, Duc Tai Phan, Cuong Tuan Nguyen, and Duc Ngoc Minh Dang
  In 2024 15th International Conference on Information and Communication Technology Convergence (ICTC 2024), Jeju Island, Republic of Korea, Oct 2024
AI development has brought many significant changes in various aspects of our daily lives in recent years. Integrating AI technology into various applications has revolutionized multiple domains, and one particularly vital area is traffic sign recognition, which significantly enhances driver safety. This paper presents an approach to traffic sign recognition specifically designed for the Jetson Nano 2GB device. By utilizing the YOLOv8 Nano model, the proposed approach achieves a remarkable frame rate of up to 32 frames per second (FPS). To optimize inference speed on Jetson with limited memory, the approach incorporates TensorRT and quantization techniques. In addition, this paper introduces a dataset called the Vietnamese Traffic Sign Detection Database 100 (VTSDB100). This dataset is an extension of the VTSDB46 dataset and encompasses a comprehensive collection of 100 different classes of traffic signs. These signs were captured in diverse locations within Ho Chi Minh City, Vietnam, providing a rich and diverse dataset for training and evaluating traffic sign recognition models. Extensive experiments and analyses were conducted using various object detection methods on the VTSDB100 dataset. The findings highlight the potential of deploying the proposed approach on resource-constrained devices and provide valuable insights for further research and development in AI-powered driver safety systems.
@inproceedings{10827558,
  author    = {Tran, Phuong-Nam and Pham, Nhat Truong and Phan, Nam Van Hai and Phan, Duc Tai and Nguyen, Cuong Tuan and Dang, Duc Ngoc Minh},
  title     = {Towards Real-Time Vietnamese Traffic Sign Recognition on Embedded Systems},
  booktitle = {2024 15th International Conference on Information and Communication Technology Convergence (ICTC 2024)},
  address   = {Jeju Island, Republic of Korea},
  month     = oct,
  year      = {2024},
  keywords  = {Training;Quantization (signal);Databases;Web services;Urban areas;Object detection;Real-time systems;Safety;Internet of Things;Artificial intelligence;Traffic Sign Recognition;Object Detection;Quantization;Vietnamese Traffic Sign dataset;Deep learning},
  doi       = {10.1109/ICTC62082.2024.10827558},
  isbn      = {979-8-3503-6463-7}
}