Deep Learning Based Frameworks for the Detection and Classification of Soniferous Fish
Ziqi Huang, Dominik Ochs, M. Clara P. Amorim, Paulo J. Fonseca, Mayank Goel, Nuno Jardim Nunes, Manuel Vieira, Manuel Lopes
Passive Acoustic Monitoring (PAM) is emerging as a valuable tool for assessing fish populations in natural habitats. This study compares two deep learning-based frameworks: (1) a multi-label classification system (SegClas) combining Convolutional Neural Networks (CNNs) and Long Short-Term Memory (LSTM) networks, and (2) an object detection approach (ObjDet) using a YOLO-based model to detect, classify, and count sounds produced by soniferous fish in the Tagus estuary, Portugal. The target species, the Lusitanian toadfish (Halobatrachus didactylus), meagre (Argyrosomus regius), and weakfish (Cynoscion regalis), exhibit overlapping vocalization patterns, posing classification challenges. Results show that both methods achieve high accuracy (over 96%) and F1 scores above 87% for species-level sound identification, demonstrating their effectiveness under varied noise conditions. ObjDet generally offers slightly higher classification performance (F1 up to 92%) and can annotate each vocalization for more precise counting. However, it requires bounding-box annotations and incurs a higher computational cost (inference time of ca. 1.95 seconds per hour of recording). In contrast, SegClas relies on segment-level labels and provides faster inference (ca. 1.46 seconds per hour). This study also compares both counting strategies, each offering distinct advantages for different ecological and operational needs. Our results highlight the potential of deep learning-based PAM for fish population assessment.
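To make the SegClas architecture concrete, the sketch below shows one plausible way a CNN front end over spectrogram segments could feed an LSTM with a sigmoid multi-label head, one output per target species. This is an illustrative assumption only; the layer sizes, the class names, and the `SegClasSketch` module are hypothetical and do not reproduce the authors' implementation.

```python
# Minimal sketch (assumption, not the paper's actual model): CNN features over a
# spectrogram segment are flattened per time frame, passed through an LSTM, and
# mapped to independent sigmoid outputs for multi-label species classification.
import torch
import torch.nn as nn

class SegClasSketch(nn.Module):
    def __init__(self, n_mels=64, n_classes=3):
        super().__init__()
        # CNN encoder: pools only the frequency axis so the time axis is preserved.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d((2, 1)),
        )
        self.lstm = nn.LSTM(input_size=32 * (n_mels // 4),
                            hidden_size=64, batch_first=True)
        self.head = nn.Linear(64, n_classes)  # one logit per species

    def forward(self, spec):                   # spec: (batch, 1, n_mels, time)
        feats = self.cnn(spec)                 # (batch, 32, n_mels // 4, time)
        b, c, f, t = feats.shape
        seq = feats.permute(0, 3, 1, 2).reshape(b, t, c * f)
        out, _ = self.lstm(seq)                # run over time frames
        return torch.sigmoid(self.head(out[:, -1]))  # multi-label probabilities

# Example: a batch of 8 spectrogram segments with 64 mel bins and 100 time frames.
probs = SegClasSketch()(torch.randn(8, 1, 64, 100))
print(probs.shape)  # torch.Size([8, 3])
```

In this reading, each recording segment receives independent per-species probabilities, which is what allows overlapping vocalizations of the three target species to be labeled simultaneously; the ObjDet alternative instead localizes each vocalization with a bounding box before counting.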