        A Datamodel for the AuxTel spectrograph (LATISS)
        ===================

(The cognoscenti will recognise certain similarities to the PFS data model,
https://github.com/Subaru-PFS/datamodel)

This document outlines the data model for LATISS (The LSST Atmospheric Transmission Imager
and Slitless Spectrograph) which is a slitless spectrograph on the 1.2m auxilliary telescope
on Cerro Pachon, approximately 100m from LSST.

There will be implementations of python classes that represent many of descriptions in this document,
supporting reading and writing FITS files that conform to this model.
To use them, add the python directory in this package to your $PYTHONPATH
and import e.g. atmospec.datamodel.latissConfig

The following symbolic names will be used in the discussion:

xxx
NCOLUMN     The number of columns in the detector
NROW        The number of rows that spectra nominally extend over in the dispersion direction.
        This will change depending on the dispersive element used
NSIMROW     Number of rows for simulated data.  May be different in different files, depending
        on the desired oversampling.
NCOARSE     Number of wavelengths used in computing the full covariance matrix of the spectra
xxx


The values of NCOLUMN and NROW will differ for raw and reduced data.
Possible values are given in the section, "Parameters"

In various places SHA-1 is refered to, which is a 160-bit hash, as used by e.g. git
(https://en.wikipedia.org/wiki/SHA-1).  We truncate these hashes to 64bits (so as to fit
in standard 64-bit integers), and abbreviate these SHA-1 values to 8 hexadecimal digits,
2^32 ~ 4e9 possible values, in filenames.  In filenames, hexadecimal characters shall be
written in lower case (0x12345abcdef not 0X12345ABCDEF or 0x12345ABCDEF).


The following variables are used in filename formats (which are written in python notation):

site:

T: Tuscon
B: BNL
N: NCSA (xxx remove?)
P: Princeton (xxx remove?)
S: Summit
I: SLAC I&T
X: AuxTel offline
F: simulation (fake)

category:

A: Science
B: Lab

visit:

An incrementing exposure number, unique at any site

disperser:

0 to n, where the value of n is number of different dispersive elements
and 0 represents no disperser, i.e. direct imaging
We consider the different dispersive elements as fundamental as they have different:
  * transmission functions
  * wavelength solutions
  * resolutions
  * flatfielding properties/datasets

tract:
(xxx will we have any need for this given one source density == 1/pointing)
An integer in the range (0, 99999) specifying an area of the sky

patch:
(xxx will we have any need for this given one source density == 1/pointing)
A string of the form "m,n" specifying a region within a tract

objId:

A unique 64-bit object ID for an object.  For example, the LSST object ID from the database.
(xxx we won't have LSST object IDs for quite a while, need a stop-gap. Can we use Gaia obj IDs?)

objIds are written out using %08x formats (rather than e.g. %08d) as a 64 bit integer is a large number.

catId:

A small integer specifying the source of the objId.  Currently only
0: Simulated
1: Gaia
are defined.

latissConfigId:

An integer uniquely specifying the configuration of LATISS; specifically a SHA-1 of the
(disperser, ra, dec) tuples (with position rounded to the nearest arcsecond) truncated to 64 bits.

See calculate_latissConfigId() in
python/atmospec/datas model/utils.py

We include the standard prefix "0x" in filenames to visually separate SHA-1s from the 64-bit objId; this is
especially important in latissObject filenames which have both an objId and a hash.

An alternative to a SHA-1 would be the visit number of the first data taken with this configuration but that
would have the disadvantage that it would not be available until the data was taken.

Lab data may choose to set the latissConfigId to 0x0. As a convenience to simulators, if the disperserIds
in a latissConfig file are consecutive integers starting at 1 and all ra/dec values are 0.0 the latissConfigId
is taken to be 0x0


latissVisitHash:

An integer uniquely defining the set of visits contributing to a reduced spectrum;
this will be calculated as a SHA-1 truncated to 64 bits
See calculate_latissVisitHash() in
python/latiss/datamodel/utils.py

As a special case, if the only visit is 0, latissVisitHash shall be taken to be 0x0 for the convenience
of simulation code.

MaskedImage:

A data type which consists of three images:
 A floating point image
 An integer mask image
 A floating point variance image


-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Raw Data (from the spectrograph camera, both direct and dispersed imaging)

"PF%1s%1s%06d%1d%1d.fits" % (site, category, visit, spectrograph, armNum)

Format: as written by the DAQ system.  I believe that the "A" (science) frames are currently a single image
extension of size > NCOLUMN*NROW due to extended registers and overclocks, even for the red and blue arms
which are physically two devices.

N.b. the restriction to filenames of the form 4 letters followed by 8 decimal digits comes from Subaru.

Note that the order of the last two numbers is "spectrograph, armNum" whereas in all other filenames
we use the order "arm, spectrograph" but now arm is a string (e.g. r2, n4).


---------------------------
PFS:










-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Calibration products (i.e. processed flats/biases/darks ready to be used in processing)

"pfsFlat-%06d-%1s%1d.fits" % (visit0, arm, spectrograph)
"pfsBias-%06d-%1s%1d.fits" % (visit0, arm, spectrograph)
"pfsDark-%06d-%1s%1d.fits" % (visit0, arm, spectrograph)

(While theoretically we don't need to distinguish between the r and m chips, it's simpler to be
consistent, so we will have 16, not 12, biases/darks for the complete spectrograph)

The visit is replaced by visit0, the first visit for which these flats/biases/darks are valid; in general
there will be many files for each (arm, spectrograph) but with different values of visit0.  Note that we
don't know the upper range of validity when these files are created, so it cannot be encoded in the filename.

Visit0 is usually the last input visit used in preparing the calibration files. The calibration file that is
used by the pipeline is selected based on a list of valid dates provided to the calibration database; visit0
is only used to provide a unique filename and a sensible sort order.

Single extension fits files of size NCOLUMN*NROW

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The positions and widths of fiber traces in the 2-D flat-field images.

"pfsFiberTrace-%06d-%1s%1d.fits" % (visit0, arm, spectrograph)

Note that visit0 is defined under "Calibration products"

Each fiber trace is represented as a MaskedImage, with the "image" consisting of the amplitude of the trace,
and a bit FIBERTRACE set in the "mask" to identify which pixels belong to the trace.

In the pfsFiberTrace file, these fibre profiles are stored as a subimage of a single large MaskedImage, packed
from left to right and all with the same number of rows (n.b. because of trace curvature and overlap this
MaskedImage may have more then the number of columns in a flat-fielded data image).  The image/mask/variance
are stored in 3 HDUs, followed by an HDU describing where fibre traces appear in the flat-fielded images (this
information is also used to pack/unpack the traces into the image/mask/variance HDUs).  FiberTraces which
could not be traced in the Quartz image will have zero widths.

HDU #0 PDU
HDU #1 IMAGE      Image representing the FiberTrace profiles     [FLOAT]               NROW*sum_i(NCOL_i)
HDU #2 MASK       Mask  representing the FiberTrace profiles     [INT32]               NROW*sum_i(NCOL_i)
HDU #3 VARIANCE   Variance representing the FiberTrace profiles  [FLOAT]               NROW*sum_i(NCOL_i)
HDU #4 ID_BOX     Fiber ID and bounding boxes                    [BINARY FITS TABLE]   NFIBER*5

The PDU has keys to define which data were used to construct the FiberTrace:
SPECTROGRAPH   Which spectrograph
ARM            The arm of the spectrograph
VISIT%03d      Visits which were used
I.e. if only visit 5830 were used, the r1 file would have ARM='r', SPECTROGRAPH=1, VISIT000=5394

The individual parameters per fiber are stored in HDU #4:
    FIBERID     NFIBER*32-bit               int
    MINX        NFIBER*32-bit               int
    MAXX        NFIBER*32-bit               int
    MINY        NFIBER*32-bit               int
    MAXY        NFIBER*32-bit               int
where the bounding box of the trace is defined by the lower left and upper right corners (MINX, MINY) and
(MAXX, MAXY) respectively.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The state of the PFI and corresponding targetting information

"pfsConfig-0x%08x.fits" % (pfsConfigId)

The choice of a hex format is because the the pfsConfigId is a SHA-1

Format:

HDU #0 PDU
HDU #1          Fits binary table

lists fiberId, catId, objId, ra, dec, fiber flux, MCS centroid, ... for each object. The types should be:
  fiberId        32-bit int
  catId          32-bit int
  tract      32-bit int
  patch          3-byte string
  objId          64-bit int
  ra             32-bit float
  dec            32-bit float
  fiberMag       5*32-bit float
  MCS centroid   pair of 32-bit floats

N.b. fiberIds start at 1.
N.b. "MCS centroid, ..." is a placeholder for state of the PFI

The HDU1 header should contain a set of 5 keywords FILTERN (FILTER0...FILTER4) giving the
names of the filters in the fiberMag entry.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Reduced but not combined single spectra from a single exposure (flux and wavelength calibrated)

"pfsArm-%06d-%1s%1d.fits" % (visit, arm, spectrograph)

N.b. visit numbers are only unique at a site, but we don't preserve this in derived product names.  There will
keywords in the header to disambiguate this.

The file will have several HDUs:

HDU #0 PDU
HDU #1 FLUX        Flux in units of nJy                    [FLOAT]        NROW*NFIBER
HDU #2 COVAR       Near-diagonal part of HDU 1's covariance    [FLOAT]        NROW*3*NFIBER
HDU #3 MASK        Pixel mask                              [32-bit INT]   NROW*NFIBER
HDU #4 WAVELENGTH  Wavelength solution in nm (vacuum)          [FLOAT]        NROW*NFIBER
HDU #5 SKY         Sky flux in same units as HDU1              [FLOAT]        NROW*NFIBER
HDU #6 CONFIG      Information about the PFI, targetting, etc. [BINARY FITS TABLE]

Note that the data need not be resampled onto a uniform grid, as a wavelength is provided for each pixel.

The COVAR data contains the diagonal COVAR[fiberId][0][0:]
                        +-1      COVAR[fiberId][1][0:-1]
and
                        +-2      COVAR[fiberId][2][0:-2]
terms in the covariance matrix.

At a minimum the CONFIG table contains columns for pfsConfigId and visit (and only one row!  But it's
a table not header keywords for consistency with the pfsObject file).

It would be possible to denormalise additional information about the observation into this HDU.  I do not
know if it should also contain things we've learned about the observation from the analysis that created
this file;  I'd be inclined to put it in a separate temporary file, and then build a single file describing
the entire PFS from the 12 temporaries.  Alternatively, this could be done at the database level.

Note that the pfsFiberTrace file contains enough information to recover the raw counts and the widths
of the fiber traces.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The sky model for a single exposure

"pfsSky-%06d-%1s%1d.fits" % (visit, arm, spectrograph)

Format: TBD

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

The PSF for a single exposure

"pfsPSF-%06d-%1s%1d.fits" % (visit, arm, spectrograph)

Format: TBD

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Combined spectra.

In SDSS we used "spPlate" files, named by the plugplate and MJD of observation but this is not suitable for
PFS where we:
1.  Will split observations of the same object over multiple nights
2.  Will potentially reconfigure the PFI between observations.

I don't think it makes sense to put multiple spectra together based on sky coordinates as we may go back and
add more observations later, so I think we're forced to separate files for every object.  That's a lot of
files, but maybe not too bad?  We could use a directory structure based on HSC's (tract, patch) -- note that
these are well defined even if we are not using HSC data to target.  An alternative would be to use a
healpix or HTM id.

Because we may later obtain more data on a given object, or decide that some data we have already taken is
bad, or process a number of subsets of the available data, there may be more than one set of visits used
to produce a pfsObject file for a given object.  We therefore include both the number of visits (nVisit)
and a SHA-1 hash of the visits, pfsVisitHash.  We use both as nVisits may be ambiguous, while pfsVisitHash
isn't human-friendly;  in particular it doesn't sort in a helpful way.  It seems improbable that we will
ever have more than 100 visits, but as the pfsVisitHash is unambiguous it seemed safer to only allow for
larger values of nVisit, but record them only modulo 100.

 "pfsObject-%05d-%s-%3d-%08x-%02d-0x%08x.fits" % (tract, patch, catId, objId, nVisit % 100, pfsVisitHash)

The path would be
tract/patch/pfsObject-*.fits

The file will have several HDUs:

HDU #0 PDU
HDU #1 FLUX        Flux in units of nJy                    [FLOAT]        NROW
HDU #2 FLUXTBL     Binary table                                [FITS BINARY TABLE] NROW
           Columns for:
           wavelength in units of nm (vacuum)      [FLOAT]
           intensity in units of nJy           [FLOAT]
           intensity error same units as intensity [FLOAT]
           mask                                    [32-bit INT]
HDU #3 COVAR       Near-diagonal part of HDU 2's covariance    [FLOAT]        NROW*3
HDU #4 COVAR2      Low-resolution non-sparse estimate covariance [FLOAT]      NCOARSE*NCOARSE
HDU #5 MASK        Pixel mask                              [32-bit INT]   NROW
HDU #6 SKY         Sky flux in same units as HDU1              [FLOAT]        NROW
HDU #7 CONFIG      Information about the PFI, targetting, etc. [BINARY FITS TABLE]

The PDU must contain at least the keywords
tract       tract       INT
patch       patch       STRING
catId       catId           INT
objId       objId       INT
    pfsVHash    pfsVisitHash    INT
(N.b. the keywords are case-insensitive).  Other HDUs should specify INHERIT=T.

The wavelengths are specified via the WCS cards in the header (e.g. CRPIX1, CRVAL1) for HDU #1, and
explicitly in the table for HDU #2.  We chose these two representations for the data due to the
difficulty in resampling marginally sampled data onto a regular grid,  while recognising the convenience
of such a grid when rebinning, performing PCAs, or stacking spectra.  For highest precision the data
in HDU #2 is likely to be used.

See pfsObject for definition of the COVAR data

What resolution should we use for HDU #1?  The instrument has a dispersion per pixel which is roughly constant
(in the blue arm Jim-sensei calculates that it varies from 0.70 to 0.65 (going red) A/pix; in the red, 0.88 to
0.82, and in the IR, 0.84 to 0.77).  We propose that we sample at 0.8 A/pixel.

The second covariance table (COVAR2) is the full covariance at low spectral resolution, maybe 10x10. It's
really only 0.5*NCOARSE*(NCOARSE + 1) numbers, but it doesn't seem worth the trouble to save a few bytes.
This covariance is needed to model the spectrophotometric errors.

The CONFIG table has nVisit rows, and at a minimum contains columns for visit and pfsConfigId:
visit:       32-bit int
pfsConfigId: 64-bit int

The targetting information that SDSS put in the PLUGMAP table is available from the pfsConfigId column, which
links to the relevant pfsConfig files.  This could be denormalised into the CONFIG table if desired.

Note that we don't keep the SDSS "AND" and "OR" masks -- if needs be we could set two mask bits to capture
the same information, but in practice SDSS's OR masks were not very useful.

For data taken with the medium resolution spectrograph, HDU #1 is expected to be at the resolution of
the medium arm, and to omit the data from the blue and IR arms.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

Simulations

We need to specify a format for the spectra going into the simulator.  At first glance we could adopt the
pfsObject file, along with a pfsConfig file to map them to the sky.  This would work, but there is a problem
with resolution, in particular emission lines.  Also, the resolution of the pfsObject files for
medium-resolution spectra may be different from that for low-resolution (extra-galactic) spectra (and the
simulators are probably not interested in covariances and masks)

The PSF of an extracted emission line is not the same as the marginal shape of the fibre spot, and I don't
think that the simulators want to know the details of the spectrograph anyway!  This leads us to going to
higher resolution, and this would be possible (e.g. R ~ 1e4?).  An alternative would be to provide the
continuum spectrum as an array, possibly even at PFS's resolution and an additional table of (lambda,
amplitude, width) for emission lines.  Will a continuum + lines model like this work for the Galactic
evolution stellar spectra?

For now, I propose:
 "pfsSimObject-%05d-%s-%3d-%08x.fits" % (tract, patch, catId, objId)

The tract, patch, catId, objId values will also be available as header keywords.

The file will have two HDUs:

HDU #0 PDU
HDU #1             Flux in units of nJy                    [FLOAT]        NSIMROW
HDU #2 WAVELENGTH  Wavelength solution in nm (vacuum)          [FLOAT]        NSIMROW

The simulation team will also need to provide a matching pfsConfig file, and may well choose to set
tract, patch = 0, "0,0"
The catId/objId is only needed to allow the set of pfsSimObject files and a pfsConfig file to specify a
set of simulated exposures.

Note that the wavelength need not be uniformly sampled.

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

        Example File Names
        ------------------

PFSA00066611.fits
Raw science exposure, visit 666, taken on the summit with spectrograph 1 using the blue arm
PFLB00123423.fits
Raw up-the-ramp exposure, visit 1234, taken at LAM with spectrograph 2 using the IR arm

pfsConfig-0xad349fe2.fits
The fiber and targetting information for the PFI configuration with hash 0xad349fe2

pfsFlat-000333-m2.fits
Flat for spectrograph 2, medium resolution arm, valid for 333 <= visit < ??
pfsBias-000333-r1.fits
Bias for spectrograph 1, red arm
pfsBias-000333-m1.fits
Bias for spectrograph 1, medium resolution arm (identical to pfsBias-000333-1r.fits)
pfsDark-000333-b3.fits
Dark for spectrograph 3, blue arm
pfsFiberTrace-000333-n2.fits
Fiber traces for spectrograph 2, IR arm

pfsArm-000666-b1.fits
Extracted spectra for visit 666, from spectrograph 1's blue arm
pfsPSF-000666-b1.fits
The 2-D PSF for spectrograph 1's blue arm in visit 666

pfsObject-07621-2,2-001-02468ace-03-0xdeadbeef.fits
Combined spectrum for HSC (the "001") object 0x02468ace, located in tract 7621, patch 2,2.

The pfsVisitHash SHA-1 (calculated from the 3 visits included in this reduction) is 0xdeadbeef.

pfsSimObject-00000-0,0-000-13579bdf.fits
A simulated spectrum (the "000") for object 0x13579bdf, located in tract 0, patch 0,0

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

        Parameters
        ----------

The raw CCD images will have:
NAMP=8
AMPCOLS=520
CCDROWS=4224

OVERROWS = 76 # as of 2016-02-01
OVERCOLS = 32 # as of 2016-02-01

LEADINROWS = 48 # necked rows
LEADINCOLS  = 8 # real leadin pixels

So the raw CCD data will have
NROW = 4300 #  CCDROWS + OVERROWS
NCOL = 4416 # NAMP*(AMPCOLS + OVERCOLS)

And the ISR-extracted CCD images will probably be close to:
NROW = 4176  # CCDROWS - LEADINROWS
NCOL = 4096  # NAMP*(AMPCOLS - LEADINCOLS)
This value of NROW is applicable to the pfsArm files.

The raw H4RG images will be in the range
NROW = 4096 to 8192
NCOL = 4096
(n.b. there's a 4-pixel reference pixel border)

The ISR-extracted H4RG images will be no bigger than
NROW = 4088
NCOL = 4088

-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-

        Open Questions and TBD
        ----------------------

Pin down the HDUs in pfsArm files

Specify more details about expected header keywords

Provide examples of all these files

Decide where to put information about the analysis resulting in pfsArm files

Define the 1-D outputs

Define pfsObject files for medium resolution data

Consider using healpix or HTM ID not tract/patch

Do we need to save the covariance of the wavelength solution

Should we save the per-pfsConfigId combined spectra?  Note that this is doable within
my naming scheme now we have added a pfsVisitHash and an nVisit field

Reruns in directory tree, and also headers of course

DB of bad visits

Define format of pfsPSF files.  Choosing a better name is probably hopeless

Need a DB (as well as the butler) to map visit to visit0 for calibration products

Do we want to model the sky over the entire exposure?  I put in a place holder that is
per spectrograph/arm but it isn't obvious that this is correct.  E.g. there's spatial
structure over the whole focal plane, and the lines in the red and IR are closely related.

Think of a better name than NCOARSE

Do we really want HDU #1 (rebinned spectra) in medium-resolution pfsObject files to
omit the blue and IR data?

