This possibly originated from the same group of data being
contributed to an archive by two different data contributors.
Once a set of data was homogenized, its data were integrated
into a unique table. This final merging focused
on the removal of duplicates between the sets of data. Although
some duplicates are known (e.g. MOBY, BOUSSOLE,
AERONET-OC, and NOMAD data are found in
SeaBASS and MERMAID), others are unknown (e.g. how
many of GeP&CO, ICES, AMT, and HOT are within NO-
MAD, SeaBASS, and MERMAID). Therefore, duplicates
were identified using the metadata (“dataset” and “sub-
dataset”) when possible, and temporal-spatial matches, as
an additional precaution. For temporal-spatial matches, several
thresholds were used, but typically 5 min and 200 m
were taken to be sufficient to identify most duplicated data,
which reflected small differences in time, latitude, and lon-
gitude, between the different sets of data. Larger thresh-
olds were used in some cases as a cautionary procedure.
This was the case when searching for NOMAD data in
other data sets, because NOMAD includes a few cases
where merging of radiometric and pigment data was done
with large spatial-temporal thresholds (Werdell and Bailey,
2005). A large temporal threshold was also used when integrating
observations from the three data sources that did
not have time available (ESTOC, MAREDAT, and TPSS).
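The temporal-spatial matching described above can be illustrated with a minimal sketch. It assumes each observation is a dictionary with hypothetical keys "time", "lat", and "lon", uses a haversine great-circle distance on a spherical Earth, and applies the typical 5 min and 200 m thresholds as strict upper bounds. It is an illustration only and not the code used to build the compilation.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_M = 6_371_000.0  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_duplicate(a, b, max_dt=timedelta(minutes=5), max_dist_m=200.0):
    """True if two observations are closer than both thresholds."""
    close_in_time = abs(a["time"] - b["time"]) < max_dt
    close_in_space = haversine_m(a["lat"], a["lon"],
                                 b["lat"], b["lon"]) < max_dist_m
    return close_in_time and close_in_space
```

Larger thresholds, such as those used when searching for NOMAD data, would be passed through the `max_dt` and `max_dist_m` arguments.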
In regard to all data, if duplicates were found, data from
the NOMAD data set were selected first, followed by data
from individual projects or contributors (MOBY, BOUSSOLE,
AERONET-OC, AMT, HOT, GeP&CO, AWI, BARENTSSEA,
BATS, CALCOFI, CCELTER, CIMT, ESTOC,
IMOS, PALMER, TPSS, and TARA) and finally by data from the
remaining data sets (SeaBASS, MERMAID, ICES, ARC-
SSPP, BIOCHEM, BODC, COASTCOLOUR, MAREDAT,
and SEADATANET). This procedure was chosen to preserve
the NOMAD data set as a whole, since it is widely used in
ocean colour validation. It should be noted that, by this pro-
cedure, data from individual projects or contributors may be
listed under NOMAD (e.g. some PALMER data are found
in NOMAD with metadata string “nomad_palmer_lter”). After
giving priority to NOMAD, the priority was generally
given to data from individual projects or contributors, but
due to an incremental approach where only new data are
added to previous versions of the compilation, some data
from individual projects or contributors (BATS, CALCOFI,
CIMT, PALMER, and TPSS), added in later stages, may
be found under other data sources. This occurs mainly for
BATS and CALCOFI, which have their earlier chlorophyll
data in SeaBASS with metadata strings “seabass_bats*” and
“seabass_cal*”, and CIMT, which has some of its data under
COASTCOLOUR. After all data from a given source were
free of duplicates, they were merged consecutively by variable
in the final table. During this process, we also searched
for rows (stations) that were separated from each other by
time differences of less than 5 min and horizontal spatial differences
of less than 200 m. When such rows were found, the
Earth Syst. Sci. Data, 14, 5737–5770, 2022
A. Valente et al.: A compilation of global bio-optical in situ data
observations in those rows were merged into a single row.
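A minimal sketch of this row merging is given below. It assumes rows are dictionaries with hypothetical keys "time", "lat", and "lon" plus one key per variable, with None marking a missing value. How conflicting values were resolved is not stated in the text, so the sketch simply keeps the first non-missing value for each variable. That choice is an assumption of the example.

```python
import math
from datetime import datetime, timedelta


def distance_m(lat1, lon1, lat2, lon2, radius_m=6_371_000.0):
    """Haversine distance in metres (spherical Earth, an assumption)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * radius_m * math.asin(math.sqrt(a))


def merge_close_rows(rows, max_dt=timedelta(minutes=5), max_dist_m=200.0):
    """Merge rows closer than both thresholds into a single row."""
    merged = []
    for row in sorted(rows, key=lambda r: r["time"]):
        for kept in merged:
            near_in_time = abs(row["time"] - kept["time"]) < max_dt
            near_in_space = distance_m(row["lat"], row["lon"],
                                       kept["lat"], kept["lon"]) < max_dist_m
            if near_in_time and near_in_space:
                # Fill only the variables still missing in the kept row.
                for key, value in row.items():
                    if kept.get(key) is None:
                        kept[key] = value
                break
        else:
            # No nearby row was found, so this row starts a new station.
            merged.append(dict(row))
    return merged
```

In this sketch the merged row keeps the time and position of the earliest row in each group.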
The compiled merged data were compared with the original
sets to certify that no errors occurred during the merging. As
a final step, a water-column (station) depth was recorded for
each observation, which was the closest water-column depth
from the ETOPO1 global relief model (National Geophys-
ical Data Center ETOPO1; Amante and Eakins, 2009). For
observations where the closest water depth was above sea
level (e.g. data collected very near the coast), it was given
the value of zero.
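The depth assignment can be sketched as follows, assuming a regular latitude-longitude grid of elevations in metres that are negative below sea level. The function name, the argument layout, and the toy grid in the usage note are hypothetical. Reading the ETOPO1 file itself and handling longitude wrap-around are not shown.

```python
def water_depth_m(lat, lon, grid_lats, grid_lons, elevation_m):
    """Water-column depth (m, positive down) at the nearest grid node.

    elevation_m[i][j] is the elevation at (grid_lats[i], grid_lons[j]),
    assumed negative below sea level.
    """
    i = min(range(len(grid_lats)), key=lambda k: abs(grid_lats[k] - lat))
    j = min(range(len(grid_lons)), key=lambda k: abs(grid_lons[k] - lon))
    elevation = elevation_m[i][j]
    # Nodes above sea level (e.g. very near the coast) are given zero depth.
    return max(0.0, -elevation)
```

For example, with `grid_lats = [20.0, 21.0]`, `grid_lons = [-157.0, -156.0]`, and `elevation_m = [[-1200.0, 15.0], [-3000.0, -5.0]]`, a station at (20.1, -156.9) is assigned a depth of 1200 m, whereas a station at (20.2, -156.1) falls on a node above sea level and is assigned zero.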
Data processing thus included two major steps: pre-
processing and merging. The first step was related to the pro-
cessing of each of the 27 contributing data sets and aimed to
identify problems and convert the data of interest to a stan-
dard format. The second step dealt with the integration of
all the contributing sets of data into a unified data set and
included the elimination of duplicated data between the indi-
vidual sets of data. In the next subsections, a brief overview
of each original set of data is provided.
2.2 Pre-processing of each set of data
2.2.1 Marine Optical BuoY (MOBY)
MOBY is a fixed mooring system operated by the National
Oceanic and Atmospheric Administration (NOAA) that
has provided a continuous time series of water-leaving radiance
and surface irradiance in the visible region of the spectrum
since 1997. The site is located a few kilometres west of the
Hawaiian Island of Lanai where the water depth is about
1200 m. Since its deployment, MOBY measurements have
been the primary basis for the on-orbit vicarious calibrations
of the SeaWiFS and MODIS ocean colour sensors. A full
description of the MOBY system and processing is provided
in Clark et al. (2003). Data are freely available for scientific
use at the MOBY Gold directory. The products of interest are
the “scientific time series” files, which refer to MOBY data
averaged over sensor-specific wavelengths and particular
hours of the day (around 20:00-23:00 UTC). For this work,
the satellite band-average products for SeaWiFS, MODIS
AQUA, MERIS, VIIRS-SNPP, VIIRS-JPSS (also known
as NOAA-20 VIIRS), OLCI-S3A, and OLCI-S3B were
compiled from the “R2017 reprocessing”. The “inband”
average subproduct was used, and to maintain the highest
quality, only data determined from the upper two arms
(“Lw1”) and flagged “good” quality were acquired. Data
from the MOBY203 deployment were discarded due to the
absence of surface irradiance data. The compiled variable
was the remote-sensing reflectance, “rrs”, which was computed
from the original water-leaving radiance (“Lw”) and
surface irradiance (“Es”). The water-leaving radiances were
corrected for the bidirectional nature of the light field (Morel
and Gentili, 1996; Morel et al., 2002) using the same lookup
table and method as that used in the SeaWiFS Data Analysis
System (SeaDAS) processing code. The MOBY data were
https://doi.org/10.5194/essd-14-5737-2022