Figure 3: Starting from maritime images (under “Input image”), a distinction is made between semantic segmentation (under “Segmentation mask”) and object detection (under “Object detection”) (Chen et al., 2021).
object detection and semantic segmentation (Figure 3). Both tasks represent an active research area in computer vision. Whilst object detection is used to locate and classify objects within an image with the help of bounding boxes, semantic segmentation is used to divide an image into multiple regions on a pixel-by-pixel basis. These tasks have been dominated by machine learning for several years now. Two methods, or more specifically two classes of architectures, have emerged: Convolutional Neural Networks (CNN) and Vision Transformers (Koch et al., 2024).
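To make the distinction concrete, the following minimal Python sketch contrasts the two output formats: a list of labelled bounding boxes for object detection versus a dense per-pixel class map for semantic segmentation. All names, coordinates, and class ids are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

H, W = 480, 640                      # assumed image height and width in pixels

# Object detection: each detected object is a bounding box (x_min, y_min,
# x_max, y_max) plus a class label and a confidence score.
detections = [
    {"box": (120, 200, 260, 310), "label": "vessel", "score": 0.91},
    {"box": (400, 180, 520, 290), "label": "buoy",   "score": 0.74},
]

# Semantic segmentation: one class index per pixel, i.e. a dense H x W map.
# Assumed class ids: 0 = background/sea, 1 = vessel, 2 = buoy.
segmentation_mask = np.zeros((H, W), dtype=np.uint8)
segmentation_mask[200:310, 120:260] = 1   # pixels belonging to the vessel

print(len(detections), "objects detected")
print("pixels labelled 'vessel':", int((segmentation_mask == 1).sum()))
```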
Performance metric
The ISO 16273 standard currently defines the required accuracy of a night vision system as a human observer being able to detect a specified target nine out of ten times under voyage conditions. Performance metrics are used to evaluate how well an object detection model performs. Metrics commonly used to compare the performance of different object detection models include “Mean Average Precision” (mAP) and “Mean Intersection over Union” (mIoU). These are explained in the following section.
Object detection - “Mean Average Precision” (mAP)
In order to estimate the mAP, all detections must be classified into three categories (a short code sketch follows the list):
True positive (TP): Correctly detected objects
False positive (FP): Incorrect detections (i.e. cases in which the model detects objects that are not actually present or misclassifies objects)
False negative (FN): Omitted objects (i.e. objects that were not recognised by the model even though they actually exist).
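As a hedged illustration of how these counts are obtained in practice, the following Python sketch matches detections to ground-truth boxes via Intersection over Union (IoU). The greedy matching, the 0.5 threshold, and all box coordinates are common conventions and assumptions, not prescribed by the text above; class labels and confidence ordering are ignored for brevity.

```python
def iou(a, b):
    """Intersection over Union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union else 0.0

def classify_detections(detections, ground_truth, iou_threshold=0.5):
    """Classify detections as TP or FP and count missed objects (FN)."""
    matched, tp = set(), 0
    for det in detections:
        # find the best still-unmatched ground-truth box for this detection
        candidates = [(iou(det, gt), i) for i, gt in enumerate(ground_truth)
                      if i not in matched]
        if candidates:
            best_iou, best_i = max(candidates)
            if best_iou >= iou_threshold:
                matched.add(best_i)
                tp += 1
    fp = len(detections) - tp       # detections with no real object behind them
    fn = len(ground_truth) - tp     # real objects the model failed to detect
    return tp, fp, fn

# A Figure-4-like scenario (coordinates assumed): three detections,
# two of which overlap a real vessel.
gt_boxes  = [(120, 200, 260, 310), (400, 180, 520, 290)]
det_boxes = [(125, 205, 255, 305), (10, 10, 60, 60), (405, 185, 515, 285)]
print(classify_detections(det_boxes, gt_boxes))   # -> (2, 1, 0)
```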
These three cases are illustrated in Figure 4. Precision, as defined in ISO/TS 4213, indicates how often the model predicts an object correctly. In Figure 4, three objects have been detected, but only two were correctly identified as vessels; this results in a precision (p) of 66% for this model.
p = TP / (TP + FP)
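Plugging in the counts from the Figure 4 example (two true positives, one false positive) reproduces the value stated above:

```python
tp, fp = 2, 1                      # counts from the Figure 4 example
precision = tp / (tp + fp)         # 2 / 3 ≈ 0.667
print(f"p = {precision:.1%}")      # -> p = 66.7%, the ~66% quoted above
```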