A. Valente et al.: A compilation of global bio-optical in situ data
ENTSSEA, BATS, BIOCHEM, BODC, CALCOFI, CCEL-
TER, CIMT, COASTCOLOUR, ESTOC, IMOS, MARE-
DAT, PALMER, SEADATANET, TPSS, and TARA. One re-
quirement for “chla_fluor” measurements was that they were
made using in vitro methods (i.e. based on extractions of
chlorophyll-a). Although this severely decreased the number
of observations, since in vivo fluorometry (e.g. fluorometers
mounted on CTDs) is widely available in oceanographic
databases, it was decided to exclude such data because of
potential problems with the calibration of in situ fluorometer
data. The variable “chla_hplc” was calculated by summing
all reported chlorophyll-a derivatives, including divinyl
chlorophyll-a, epimers, allomers, and chlorophyllide-a.
The two chlorophyll variables are retained separately in
the database to facilitate their use. HPLC measurements
could be considered of higher quality, but fluorometric mea-
surements are more numerous. Thus, one option for users is
to use “chla_fluor” only when there are no “chla_hplc” mea-
surements available. To be consistent with satellite-derived
chlorophyll values, which are derived from the light emerging
from the upper layer of the ocean, all chlorophyll observations
in the top 10 m (replicates at the same depth, or
measurements at multiple depths) were averaged if the coefficient
of variation among observations was less than 50 %;
otherwise they were discarded. The averages were then assigned
to the surface. The depth of 10 m was chosen as a
compromise between clear oligotrophic and turbid eutrophic
waters. Other methods, such as chlorophyll depth-averages
using local attenuation conditions (Morel and Maritorena,
2001), require observations at multiple depths, which, given
our decision to use only in vitro measurements, would have
considerably reduced the final number of observations.
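The surface-averaging rule described above can be illustrated with a minimal Python sketch. The function name `surface_chlorophyll`, the `(depth, value)` input layout, and the use of the sample standard deviation for the coefficient of variation are assumptions made for illustration; the text does not specify how the coefficient of variation was computed or how single observations were handled.

```python
from statistics import mean, stdev


def surface_chlorophyll(observations, max_depth_m=10.0, max_cv=0.5):
    """Average chlorophyll observations from the top layer.

    observations: iterable of (depth_m, chla) pairs (hypothetical layout).
    Returns the surface value, or None when there is nothing usable.
    """
    values = [chla for depth, chla in observations if depth <= max_depth_m]
    if not values:
        return None  # no observation within the top layer
    if len(values) == 1:
        return values[0]  # assumption: a single value is kept as is
    avg = mean(values)
    if avg <= 0:
        return None  # coefficient of variation is undefined
    cv = stdev(values) / avg  # assumption: sample standard deviation
    # Keep the average only if the observations agree well enough.
    return avg if cv < max_cv else None
```

For example, `surface_chlorophyll([(2, 1.0), (8, 1.2), (25, 5.0)])` ignores the 25 m sample and averages the two shallow values, whereas two shallow values of 0.1 and 1.0 would be discarded because their coefficient of variation exceeds 50 %.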
Regarding the inherent optical properties (“aph”, “adg”,
“bbp”), if not already calculated and provided in the con-
tributed data sets, they were computed from related variables
that were available: particle absorption (“ap”), detrital ab-
sorption (“ad”), coloured dissolved organic matter (CDOM)
absorption (“ag”), and total backscattering (“bb”). The following
equations were used: “adg = ad + ag”, “ap = aph + ad”,
and “bb = bbp + bbw”. For the last equation, the
variable “bbw” was computed using “bbw = bw/2”, where
“bw” is the scattering coefficient of seawater derived from
Zhang et al. (2009). The diffuse attenuation coefficient for
downward irradiance (“kd”) did not require any conversion
and was compiled as originally acquired. Observations of in-
herent optical properties (surface values) and diffuse atten-
uation coefficient for downward irradiance were acquired in
total from six data sources designed for ocean colour vali-
dation and applications (SeaBASS, NOMAD, MERMAID,
AWI, COASTCOLOUR, TPSS), thus already subject to the
processing routines of these data sets. Concerning total sus-
pended matter, these data were compiled as originally avail-
able from MERMAID and COASTCOLOUR.
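As a rough illustration of how the inherent optical properties might be derived from the related variables, the sketch below rearranges the relations under the assumption that particle absorption is the sum of phytoplankton and detrital absorption (ap = aph + ad), together with adg = ad + ag, bb = bbp + bbw, and bbw = bw/2. The function name `derive_iops` and its keyword arguments are hypothetical; values are assumed to be single-wavelength coefficients in m⁻¹.

```python
def derive_iops(ap=None, ad=None, ag=None, bb=None, bw=None):
    """Derive "adg", "aph", and "bbp" from whichever inputs are available.

    All arguments are single-wavelength coefficients in 1/m (assumed).
    Returns a dict containing only the quantities that could be computed.
    """
    derived = {}
    if ad is not None and ag is not None:
        derived["adg"] = ad + ag  # detrital plus CDOM absorption
    if ap is not None and ad is not None:
        derived["aph"] = ap - ad  # rearranged from ap = aph + ad
    if bb is not None and bw is not None:
        bbw = bw / 2.0  # backscattering of seawater
        derived["bbp"] = bb - bbw  # rearranged from bb = bbp + bbw
    return derived
```

Calling `derive_iops(ap=0.05, ad=0.01, ag=0.02)` would return only "adg" and "aph", since no backscattering inputs were supplied.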
The merged data set was compiled from 27 sets of in
situ data, which were obtained individually either from
https://doi.org/10.5194/essd-14-5737-2022
archives that incorporate data from multiple contribu-
tors (SeaBASS, NOMAD, MERMAID, ICES, ARCSSPP,
BIOCHEM, BODC, COASTCOLOUR, MAREDAT, SEA-
DATANET), or from particular contributors, measurement
programs, or projects (MOBY, BOUSSOLE, AERONET-
OC, HOT, GeP&CO, AMT, AWI, BARENTSSEA, BATS,
CALCOFI, CCELTER, CIMT, ESTOC, IMOS, PALMER,
TPSS, TARA), and were subsequently homogenized and
merged. Data contributors are listed in Table 2 and in the aux-
iliary material. There were methodological differences be-
tween data sets. Therefore, after acquisition, and prior to any
merging, each set of data was pre-processed for quality con-
trol and converted to a common format. During this process,
data were discarded if they had: (1) unrealistic or missing
date and geographic coordinate fields; (2) poor quality (e.g.
original flags) or method of observation that did not meet the
criteria for the data set (e.g. in situ fluorescence for chloro-
phyll concentration); and (3) spuriously high or low data. For
the last, the following limits were imposed: for “chla_fluor”
and “chla_hplc” [0.001-100] mg m⁻³; for “rrs” [0-0.15]
sr⁻¹; for “aph”, “adg”, and “bbp” [0.0001-10] m⁻¹; for
“tsm” [0-1000] g m⁻³; and for “kd” [aw(λ)-10] m⁻¹, where
“aw” is the pure water absorption coefficient derived from
Pope and Fry (1997). Also, during this stage, three metadata
strings were attributed to each observation: “dataset”,
“subdataset”, and “contributor”. The “dataset” contains the name
of the original set of data and can only be one of the following:
“aoc”, “boussole”, “mermaid”, “moby”, “nomad”,
“seabass”, “hot”, “ices”, “amt”, “gepco”, “arcsspp”, “awi”,
“barentssea”, “bats”, “biochem”, “bodc”, “calcofi”, “cec”,
“ccelter”, “cimt”, “estoc”, “imos”, “maredat”, “palmer”,
“seadatanet”, “tpss”, and “tara”. The “subdataset” starts with
the “dataset” identifier and is followed by additional information
about the data, as <dataset>_<cruise/station/site>
(e.g. “seabass_car81”). The “contributor” contains the name
of the data contributor. An effort was made to homogenize
the names of data contributors from the different sets of data.
These three metadata are the link to trace each observation
to its origin and were propagated throughout the processing.
Finally, this processing stage ended with each set of data being
scanned for replicate variable data and replicate station
data, which, when found, were averaged if the coefficient of
variation was less than 50 %; otherwise they were discarded.
Replicates were defined as multiple observations of the same
variable, with the same date, time, latitude, longitude, and
depth. Replicate station data were defined as multiple mea-
surements of the same variable, with the same date, time, lat-
itude, and longitude. For the latter case, a search window of
5 min in time and 200 m in distance was given to account for
station drift. A small number of observations that were identified
as replicates had different “subdataset” identifiers (i.e.
different cruise names). These observations were considered
suspicious and discarded if the values were different. If the
values were the same, one of the observations was retained.
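The range limits and the replicate handling described in this pre-processing stage could be sketched in Python as follows. This is only an illustration under stated assumptions: the names `LIMITS`, `within_limits`, and `merge_replicates` and the record layout are hypothetical, the bounds are treated as inclusive, the sample standard deviation is used for the coefficient of variation, and "kd" is left out because its lower bound is the wavelength-dependent pure water absorption. The 5 min and 200 m search window for station drift is not implemented here.

```python
from collections import defaultdict
from statistics import mean, stdev

# Valid ranges quoted in the text (units as given there).
LIMITS = {
    "chla_fluor": (0.001, 100.0),
    "chla_hplc": (0.001, 100.0),
    "rrs": (0.0, 0.15),
    "aph": (0.0001, 10.0),
    "adg": (0.0001, 10.0),
    "bbp": (0.0001, 10.0),
    "tsm": (0.0, 1000.0),
}


def within_limits(variable, value):
    """Return True if value lies inside the range for variable."""
    low, high = LIMITS[variable]
    return low <= value <= high  # assumption: inclusive bounds


def merge_replicates(records, max_cv=0.5):
    """Average replicate observations that agree, drop those that do not.

    records: dicts with keys "variable", "date", "time", "lat", "lon",
    "depth", and "value" (hypothetical layout). Replicates share all
    fields except "value".
    """
    groups = defaultdict(list)
    for rec in records:
        key = (rec["variable"], rec["date"], rec["time"],
               rec["lat"], rec["lon"], rec["depth"])
        groups[key].append(rec["value"])

    merged = {}
    for key, values in groups.items():
        if len(values) == 1:
            merged[key] = values[0]  # not a replicate, keep unchanged
            continue
        avg = mean(values)
        # Groups with a zero mean are dropped here because the
        # coefficient of variation is undefined (an assumption).
        if avg > 0 and stdev(values) / avg < max_cv:
            merged[key] = avg
    return merged
```

A typical use would be to filter each record with `within_limits` first and then pass the surviving records to `merge_replicates`, so that out-of-range values do not distort the replicate averages.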
Earth Syst. Sci. Data, 14, 5737-5770, 2022