Which mAP or IoU threshold is acceptable will be discussed by the IMO and the relevant experts at a
later stage on the journey towards Maritime Autonomous Surface Ships (MASS)
(IMO, 2024). There are already many opportunities and challenges within the maritime scientific com-
munity (Felski and Zwolak, 2020).
Models for object detection
The study by Koch et al. (2024) compares five commonly used architectures for object detection. Table
2 summarises properties of the five architectures investigated. The computational cost of an architecture
plays an important role in the suitability of AI models used on vessels. The metric Floating Point
Operations (FLOPs) is established to quantify this computational cost and refers to how many floating-
point operations are required to run a single inference of a given model. It is used to evaluate
performance and resource requirements and provides information about the computational complexity
of a model as well as its efficiency. The higher the number of FLOPs, the more computing power is
required; a model with a high FLOP count is therefore less suitable for real-time applications because
it increases latency. Additionally, the European AI Act uses the FLOPs metric to determine the
criticality of an AI system and to define the regulatory threshold.
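The FLOP count of a network is the sum of the costs of its individual layers. As an illustration of how such a figure arises, the following sketch estimates the FLOPs of a single 2-D convolution layer; the counting convention (one multiply-accumulate = 2 FLOPs, bias ignored) is an assumption, since published FLOP figures use varying conventions.

```python
# Illustrative FLOPs estimate for one 2-D convolution layer.
# Assumed convention: one multiply-accumulate = 2 FLOPs, bias ignored.

def conv2d_flops(c_in, c_out, kernel, h_out, w_out):
    """FLOPs = 2 * C_in * K^2 * C_out * H_out * W_out."""
    return 2 * c_in * kernel * kernel * c_out * h_out * w_out

# Example: a 3x3 convolution, 64 -> 128 channels, on a 160x160 feature map.
flops = conv2d_flops(64, 128, 3, 160, 160)
print(f"{flops / 1e9:.2f} GFLOPs")  # -> 3.77 GFLOPs
```

Summing such per-layer estimates over all layers yields the model-level GFLOP figures reported in Table 2.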
In Table 2, five selected architectures, additionally trained with maritime data, are compared using the
COCO dataset in order to evaluate their performance potential.
Architecture      | Type               | Parameters | FLOPs | mAP (COCO)
Faster R-CNN      | CNN                | 42 million | 134 G | 37.0
RetinaNet         | CNN                | 38 million | 152 G | 36.4
DETR              | Vision Transformer | 41 million | 87 G  | 42.0
Focal Transformer | Vision Transformer | 39 million | 265 G | 45.5
YOLOv7            | CNN                | 37 million | 105 G | 51.4
Table 2: Comparison of popular architectures in the object detection domain in general applications. The aim is a high
mAP and a low number of FLOPs (Koch et al., 2024). Vision Transformer and CNN are the Deep Learning architectures used.
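The trade-off stated in the caption (high mAP at low FLOPs) can be made concrete as a simple selection rule. The sketch below picks the best architecture from Table 2 under a FLOPs budget; the budget of 150 GFLOPs is an arbitrary example for an on-board real-time system, not a value from the source.

```python
# Sketch: choose an architecture by maximising mAP under a FLOPs budget.
# Rows are approximate values from Table 2: (name, GFLOPs, mAP on COCO).

TABLE_2 = [
    ("Faster R-CNN",      134, 37.0),
    ("RetinaNet",         152, 36.4),
    ("DETR",               87, 42.0),
    ("Focal Transformer", 265, 45.5),
    ("YOLOv7",            105, 51.4),
]

def best_under_budget(rows, max_gflops):
    """Return the row with the highest mAP whose FLOPs fit the budget."""
    candidates = [r for r in rows if r[1] <= max_gflops]
    return max(candidates, key=lambda r: r[2]) if candidates else None

print(best_under_budget(TABLE_2, 150))  # -> ('YOLOv7', 105, 51.4)
```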
Models for semantic segmentation
Semantic segmentation is the task of assigning each pixel in an image to a class and thus segmenting
the scene into meaningful regions. It is particularly useful for detecting areas rather than individual
objects. This finer-grained understanding of the environment enables additional applications such as
estimating the horizon line and the inclination angles of ships, as well as providing context
information for other systems (Koch et al., 2024).
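As a minimal illustration of the horizon-line application mentioned above, the sketch below scans a per-pixel class mask column by column for the sky-to-water transition. The class ids (0 = sky, 1 = water) and the simple column-wise scan are assumptions for illustration only, not the method of the cited study.

```python
# Toy horizon-line estimation from a semantic mask.
# Assumed labelling: class 0 = sky, class 1 = water (illustrative only).

def horizon_row_per_column(mask):
    """For each column, return the first row where sky turns to water."""
    rows, cols = len(mask), len(mask[0])
    return [
        next((r for r in range(rows) if mask[r][c] == 1), rows)
        for c in range(cols)
    ]

# Toy 4x4 mask: sky on top, water below, horizon tilted by one pixel.
mask = [
    [0, 0, 0, 0],
    [0, 0, 0, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]
print(horizon_row_per_column(mask))  # -> [2, 2, 2, 1]
```

Fitting a straight line to the per-column transition rows would then give the horizon's position and inclination angle.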
Table 3 compares several architectures with metrics evaluated on the generic ADE20K dataset and the
Cityscapes dataset from the automotive sector (Koch et al., 2024).
Architecture | Type               | Parameters   | FLOPs | mIoU (ADE20K)
DeepLabV3    | CNN                | 42 million   | 106 G | 44.1
OCR          | Vision Transformer | 10.5 million | 340 G | 45.3
SegFormer    | Vision Transformer | 44 million   | 79 G  | 50.0
ViT-Adapter  | Vision Transformer |              | 403 G | 52.5
Table 3: Comparison of common architectures in the area of semantic segmentation in general applications evaluated using the ADE20K
dataset (Koch et al., 2024)
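The mIoU values in Table 3 are per-class intersection-over-union scores averaged over all classes. The following sketch computes the metric on flattened toy masks with two classes; real benchmarks such as ADE20K average over all annotated classes of full-resolution masks.

```python
# Simplified mean Intersection over Union (mIoU) on flattened class masks.

def miou(pred, target, num_classes):
    """Average per-class IoU over all classes present in pred or target."""
    ious = []
    for cls in range(num_classes):
        inter = sum(p == cls and t == cls for p, t in zip(pred, target))
        union = sum(p == cls or t == cls for p, t in zip(pred, target))
        if union:  # skip classes absent from both masks
            ious.append(inter / union)
    return sum(ious) / len(ious)

# Toy flattened masks with classes 0 and 1.
pred   = [0, 0, 1, 1, 1, 0]
target = [0, 1, 1, 1, 0, 0]
print(round(miou(pred, target, 2), 3))  # -> 0.5
```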