A. Valente et al.: A compilation of global bio-optical in situ data
Earth Syst. Sci. Data, 8, 235-252, 2016
www.earth-syst-sci-data.net/8/235/2016/
observations at multiple depths, which, given our decision to
use only in vitro measurements, would have considerably
reduced the final number of observations.
With regard to the inherent optical properties (aph, adg,
bbp), if not already calculated and provided in the contributed
datasets, they were computed from related variables that were
available: particle absorption (ap), detrital absorption (ad),
coloured dissolved organic matter (CDOM) absorption (ag) and
total backscattering (bb). The following equations were used:
adg = ad + ag, ap = aph + ad and bb = bbp + bbw. For the last
equation, the variable bbw was computed using bbw = bw / 2,
where bw is the scattering coefficient of seawater derived from
Zhang et al. (2009). The diffuse attenuation coefficient for
downward irradiance (kd) did not require any conversion and was
compiled as originally acquired. Observations of inherent
optical properties (surface values) and of the diffuse
attenuation coefficient for downward irradiance were acquired
from three data sources particularly designed for ocean-colour
validation (SeaBASS, NOMAD, MERMAID) and were thus already
subject to the processing routines of these datasets.
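The relations above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the processing code used for the compilation: the function name derive_iops and its keyword arguments are hypothetical, and bw is assumed to be supplied by the caller (the seawater scattering model of Zhang et al. (2009) is not reproduced here).

```python
def derive_iops(ap=None, ad=None, ag=None, bb=None, bw=None,
                aph=None, adg=None, bbp=None):
    """Fill in aph, adg and bbp from related variables when they are missing.

    All inputs are coefficients in m^-1 at a single wavelength; None means
    "not available". Values that were already provided are left untouched.
    """
    if adg is None and ad is not None and ag is not None:
        adg = ad + ag              # adg = ad + ag
    if aph is None and ap is not None and ad is not None:
        aph = ap - ad              # rearranged from ap = aph + ad
    if bbp is None and bb is not None and bw is not None:
        bbw = bw / 2.0             # bbw = bw / 2
        bbp = bb - bbw             # rearranged from bb = bbp + bbw
    return {"aph": aph, "adg": adg, "bbp": bbp}


# Illustrative values only (not taken from the dataset).
example = derive_iops(ap=0.05, ad=0.01, ag=0.02, bb=0.004, bw=0.002)
```

A variable is derived only when every input it needs is present; otherwise it stays None, so it would simply remain missing in the compiled table.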
The merged dataset was compiled from 10 sets of in
situ data, which were obtained individually either from
archives that incorporate data from multiple contributors
(SeaBASS, NOMAD, MERMAID and ICES) or from particular
measurement programs or projects (MOBY, BOUSSOLE,
AERONET-OC, HOT, GeP&CO, AMT) and were
subsequently homogenised and merged. Data contributors
are listed in Table 2. There were methodological differences
between datasets. Therefore, after acquisition, and prior to
any merging, each set of data was preprocessed for quality
control and conversion to a common format. During this
process, data were discarded if they had (1) unrealistic or
missing date, time and geographic coordinate fields; (2) poor
quality (e.g. original flags) or a method of observation that
did not meet the criteria for the dataset (e.g. in situ
fluorescence for chlorophyll concentration); and (3) spuriously
high or low values. For the latter, the following limits were
imposed: for chla_fluor and chla_hplc [0.001-100] mg m⁻³;
for rrs [0-0.15] sr⁻¹; for aph, adg and bbp [0.0001-10] m⁻¹;
for kd [aw(λ)-10] m⁻¹, where aw is the pure-water absorption
coefficient derived from Pope and Fry (1997). Also during
this stage, three metadata strings were attributed to each
observation: dataset, subdataset and pi. The dataset contains
the name of the original set of data, and can only be one
of the following: “aoc”, “boussole”, “mermaid”, “moby”,
“nomad”, “seabass”, “hot”, “ices”, “amt” or “gepco”. The
subdataset starts with the dataset identifier and is followed
by additional information about the data, in the format
<dataset>_<cruise/station/site> (e.g. seabass_car71). The pi
contains the name of the principal investigator(s). An effort
was made to homogenise the names of principal investigators
from the different sets of data. These three metadata strings
link each observation to its origin and were propagated
throughout the processing. Finally, this processing
stage ended with each set of data being scanned for replicate
variable data and replicate station data, which, when found,
were averaged if the coefficient of variation was less than
50%; otherwise they were discarded. Replicate variable data
were defined as multiple observations of the same variable with
the same date, time, latitude, longitude and depth. Replicate
station data were defined as multiple measurements of the same
variable with the same date, time, latitude and longitude. For
the latter case, a search window of 5 min in time and 200 m
in distance was allowed, to account for station drift. A small
number of observations that were identified as replicates had
different subdataset identifiers (i.e. a different cruise name).
These observations were considered suspicious, and were
discarded, if the values were different. If the values were the
same, one of the observations was retained. Such cases possibly
originated from the same group of data being contributed to
an archive by two different principal investigators.
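The range limits and the replicate rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used for the compilation: the names LIMITS, within_limits and collapse_replicates are hypothetical, the bounds are treated as inclusive, the coefficient of variation uses the sample standard deviation, and the kd check is omitted because its lower bound aw(λ) depends on wavelength.

```python
import statistics

# Limits quoted in the text (assumed inclusive). kd is left out because its
# lower bound, aw(lambda), is wavelength dependent.
LIMITS = {
    "chla_fluor": (0.001, 100.0),   # mg m^-3
    "chla_hplc": (0.001, 100.0),    # mg m^-3
    "rrs": (0.0, 0.15),             # sr^-1
    "aph": (0.0001, 10.0),          # m^-1
    "adg": (0.0001, 10.0),          # m^-1
    "bbp": (0.0001, 10.0),          # m^-1
}


def within_limits(variable, value):
    """Return True if value lies inside the accepted range for variable."""
    low, high = LIMITS[variable]
    return low <= value <= high


def collapse_replicates(values, max_cv=0.5):
    """Average replicate values if their coefficient of variation is below
    max_cv (50% in the text); return None to signal that they are discarded.
    """
    if not values:
        raise ValueError("at least one value is required")
    if len(values) == 1:
        return values[0]
    mean = statistics.fmean(values)
    if mean == 0:
        return None                 # CV undefined; discarding is an assumption
    cv = statistics.stdev(values) / abs(mean)   # sample standard deviation
    return mean if cv < max_cv else None
```

For example, two replicates of 1.0 and 1.2 have a coefficient of variation of about 13% and would be averaged, whereas 0.1 and 1.0 exceed 50% and would be discarded.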
Once each set of data was homogenised, all data were
integrated into a single table. This final merging focused
on the removal of duplicates between the sets of data. Although
some duplicates are known (e.g. MOBY, BOUSSOLE, AERONET-OC
and NOMAD data are found in the SeaBASS and MERMAID sets of
data), others are unknown (e.g. how much of GeP&CO, ICES, AMT
and HOT is within NOMAD, SeaBASS and MERMAID). Therefore,
duplicates were identified using the metadata (dataset and
subdataset) when possible and temporal-spatial matches as an
additional precaution. For temporal-spatial matches, several
thresholds were used, but typically 5 min and 200 m were
taken to be enough to identify most duplicated data, which
reflected small differences in time, latitude and longitude
between the different sets of data. Larger thresholds were
used in some cases as a precaution. This was the case when
searching for NOMAD data in other datasets, because NOMAD
includes a few cases where radiometric and pigment data were
merged using large spatial-temporal thresholds (Werdell and
Bailey, 2005). With regard
to all data, if duplicates were found, data from the NOMAD
dataset were selected first, followed by data from individual
projects (MOBY, BOUSSOLE, AERONET-OC, AMT, HOT and GeP&CO)
and finally by data from the remaining datasets (SeaBASS,
MERMAID and ICES). This procedure was chosen to preserve the
NOMAD dataset as a whole, since it is widely used in
ocean-colour validation. After all data were free of
duplicates, they were merged consecutively by variable in the
final table. During this process, we also searched for rows
(stations) that were separated from each other by time
differences of less than 5 min and horizontal spatial
differences of less than 200 m. When such rows were found, the
observations in those rows were merged into a single row.
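The temporal-spatial matching and the dataset priority described above can be illustrated with a short Python sketch. It is not the procedure actually implemented for the compilation: the record layout (dictionaries with dataset, variable, time, lat and lon keys), the function names, the haversine distance on a spherical Earth and the strict "less than" comparisons are all assumptions made for illustration. Only the typical 5 min and 200 m thresholds and the priority order come from the text.

```python
import math
from datetime import datetime, timedelta

# Priority from the text: NOMAD first, then individual projects, then the
# remaining archives. The order within each group is not specified.
RANK = {
    "nomad": 0,
    **dict.fromkeys(["moby", "boussole", "aoc", "amt", "hot", "gepco"], 1),
    **dict.fromkeys(["seabass", "mermaid", "ices"], 2),
}


def distance_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres (haversine, spherical Earth)."""
    radius = 6371000.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(min(1.0, a)))


def is_match(a, b, max_dt=timedelta(minutes=5), max_dist_m=200.0):
    """True if two records are closer than both thresholds."""
    close_in_time = abs(a["time"] - b["time"]) < max_dt
    close_in_space = distance_m(a["lat"], a["lon"], b["lat"], b["lon"]) < max_dist_m
    return close_in_time and close_in_space


def drop_duplicates(observations):
    """Keep one record per duplicate group, preferring higher-priority datasets."""
    kept = []
    for obs in sorted(observations, key=lambda o: RANK[o["dataset"]]):
        duplicate = any(
            obs["dataset"] != k["dataset"]
            and obs["variable"] == k["variable"]
            and is_match(obs, k)
            for k in kept
        )
        if not duplicate:
            kept.append(obs)
    return kept
```

Because the records are visited in priority order, a SeaBASS record that matches a NOMAD record of the same variable is dropped while the NOMAD record is kept. The pairwise comparison is quadratic in the number of records, which is acceptable for a sketch but would need spatial or temporal indexing for a large compilation.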
The compiled merged data were compared with the original
sets to verify that no errors occurred during the merging. As
a final step, a water-column (station) depth was recorded for
each observation, which was the closest water-column depth
from the ETOPO1 global relief model (National Geophys