PyCuAmpcor: code cleanup, add docstrings
parent
a393282b69
commit
38646456d3
|
|
@ -10,13 +10,13 @@
|
|||
|
||||
## 1. Introduction
|
||||
|
||||
Ampcor (Amplitude cross correlation) in InSAR processing offers an estimate of spatial displacements (offsets) with the feature tracking (also called as speckle tracking or pixel tracking) method. The offsets are in dimensions of a pixel or sub-pixel (with additional oversampling).
|
||||
Ampcor (amplitude cross correlation) in InSAR processing estimates spatial displacements (offsets) with the feature tracking method (also called speckle tracking or pixel tracking). The offsets are measured at pixel or sub-pixel precision (with additional oversampling).
|
||||
|
||||
In practice, we
|
||||
|
||||
* choose a rectangular window, $R(x,y)$, from the reference image, serving as the template,
|
||||
|
||||
* choose a series of windows of the same size, $S(x+u, y+v)$, from the search image around the same location but offsetted by $(u,v)$;
|
||||
* choose a series of windows of the same size, $S(x+u, y+v)$, from the search image; the search windows are shifted in location by $(u,v)$;
|
||||
|
||||
* perform cross-correlation between the search windows and the reference window, to obtain the normalized correlation surface $c(u,v)$;
|
||||
|
||||
|
|
@ -24,13 +24,13 @@ In practice, we
|
|||
|
||||
A detailed formulation can be found, e.g., in the note by J. P. Lewis on [the frequency domain approach](http://scribblethink.org/Work/nvisionInterface/nip.html).
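
For illustration only, here is a minimal NumPy sketch of the time-domain normalized cross-correlation described above (window sizes and names are hypothetical; this is not the GPU implementation):

```
import numpy as np

def normalized_cross_correlation(ref, search):
    """Normalized correlation surface c(u, v) of a reference window
    against a larger search window (time domain, for illustration)."""
    h, w = ref.shape
    H, W = search.shape
    ref0 = ref - ref.mean()
    ref_norm = np.sqrt((ref0 * ref0).sum())
    corr = np.zeros((H - h + 1, W - w + 1))
    for u in range(H - h + 1):
        for v in range(W - w + 1):
            win0 = search[u:u + h, v:v + w] - search[u:u + h, v:v + w].mean()
            denom = ref_norm * np.sqrt((win0 * win0).sum())
            corr[u, v] = (ref0 * win0).sum() / denom if denom > 0 else 0.0
    return corr

# the peak location gives the integer offset (u, v):
# offset = np.unravel_index(np.argmax(corr), corr.shape)
```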
|
||||
|
||||
PyCuAmpcor follows the same procedure as the FORTRAN code, ampcor.F, in RIOPAC. In order to optimize the performance on GPU, some implementations are slightly different. In the [list the procedures](#5-list-of-procedures), we show the detailed steps of PyCuAmpcor, as well as its differences to ROIPAC.
|
||||
PyCuAmpcor follows the same procedure as the FORTRAN code, ampcor.F, in ROIPAC. In order to optimize the performance on GPU, some implementations are slightly different. In the [list of procedures](#5-list-of-procedures), we show the detailed steps of PyCuAmpcor, as well as their differences from ROIPAC.
|
||||
|
||||
## 2. Installation
|
||||
|
||||
### 2.1 Installation with ISCE2
|
||||
|
||||
PyCuAmpcor is included in [ISCE2](https://github.com/isce-framework/isce2), and can be compiled/installed by CMake or Scons, together with ISCE2. An installation guide can be found at [isce-framework](https://github.com/isce-framework/isce2#building-isce).
|
||||
PyCuAmpcor is included in [ISCE2](https://github.com/isce-framework/isce2), and can be compiled/installed by CMake or SCons, together with ISCE2. An installation guide can be found at [isce-framework](https://github.com/isce-framework/isce2#building-isce).
|
||||
|
||||
Some special notices for PyCuAmpcor:
|
||||
|
||||
|
|
@ -41,11 +41,14 @@ Some special notices for PyCuAmpcor:
|
|||
* CMake, use the Release build type *-DCMAKE_BUILD_TYPE=Release*
|
||||
* SCons, it is disabled by default with the -DNDEBUG flag in SConscript
|
||||
|
||||
* PyCuAmpcor requires GPUs with CUDA support and compute-capabilities >=2.0. You may (must in some cases, e.g., sm_35 with CUDA) specify the targeted compute capability by
|
||||
* PyCuAmpcor requires CUDA-enabled GPUs with compute capability >= 2.0. You may specify the targeted architecture by
|
||||
|
||||
* CMake, add the flag *-DCMAKE_CUDA_FLAGS="-arch=sm_60"*, sm_35 for K40/80, sm_60 for P100, sm_70 for V100.
|
||||
|
||||
* SCons, modify the *scons_tools/cuda.py* file by adding *-arch=sm_60* to *env['ENABLESHAREDNVCCFLAG']*.
|
||||
|
||||
Note that if the *-arch* option is not specified, CUDA 10 uses sm_30 as the default while CUDA 11 uses sm_52. GPUs with compute capability lower than the targeted one will not run the compiled code properly.
|
||||
|
||||
### 2.2 Standalone Installation
|
||||
|
||||
You may also install PyCuAmpcor as a standalone package.
|
||||
|
|
@ -103,14 +106,17 @@ rm $outprefix$outsuffix*
|
|||
cuDenseOffsets.py --reference $reference --secondary $secondary --ww $ww --wh $wh --sw $sw --sh $sh --mm $mm --kw $kw --kh $kh --gross $gross --rr $rgshift --aa $azshift --oo $oo --deramp $deramp --outprefix $outprefix --outsuffix $outsuffix --gpuid $gpuid --usemmap $usemmap --mmapsize $mmapsize --nwac $nwac --nwdc $nwdc
|
||||
```
|
||||
|
||||
In the above script, the computation starts from the (mm+sh, mm+sw) pixel in the reference image, take a series of template windows of size (wh, ww) with a skip (sh, sw), cross-correlate with the corresponding windows in the secondary image, and iterate till the end of the images. The output offset fields are stored in *outprefix+outputsuffix+'bip'*, which is in BIP format, i.e., each pixel has two bands of float32 data, (offsetDown, offsetAcross). The total number of pixels is given by the total number of windows (numberWindowDown, numberWindowAcross), which is computed by the script and also saved to the xml file.
|
||||
Note that in PyCuAmpcor, the following names for directions are equivalent:
|
||||
* row, height, down, azimuth, along the track.
|
||||
* column, width, across, range, along the line of sight.
|
||||
|
||||
In the above script, the computation starts from the (mm+sh, mm+sw) pixel in the reference image, takes a series of template windows of size (wh, ww) with a skip (sh, sw), cross-correlates them with the corresponding windows in the secondary image, and iterates until the end of the images. The output offset fields are stored in *outprefix+outsuffix+'.bip'*, which is in BIP format, i.e., each pixel has two bands of float32 data, (offsetDown, offsetAcross). The total number of pixels is given by the total number of windows (numberWindowDown, numberWindowAcross), which is computed by the script and also saved to the xml file.
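
As an illustrative sketch (the file name and window counts below are hypothetical; take the actual values from the generated xml file), the BIP output can be read back with NumPy:

```
import numpy as np

# hypothetical file name and dimensions; read the actual values from the xml file
numberWindowDown, numberWindowAcross = 100, 120
offsets = np.fromfile('offsets.bip', dtype=np.float32)
offsets = offsets.reshape(numberWindowDown, numberWindowAcross, 2)
offsetDown   = offsets[:, :, 0]   # band 1: down/azimuth offsets
offsetAcross = offsets[:, :, 1]   # band 2: across/range offsets
```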
|
||||
|
||||
If you are interested in a particular region instead of the whole image, you may specify the location of the starting pixel (in reference image) and the number of windows desired by adding
|
||||
|
||||
```
|
||||
--startpixelac $startPixelAcross --startpixeldw $startPixelDown --nwa $numberOfWindowsAcross --nwd $numberOfWindowsDown
|
||||
```
|
||||
This option is also helpful for debugging.
|
||||
|
||||
PyCuAmpcor supports two types of gross offset fields,
|
||||
* static (--gross=0), i.e., a constant shift between reference and secondary images. The static gross offsets can be passed by *--rr $rgshift --aa $azshift*. Note that the margin as well as the starting pixel may be adjusted.
|
||||
|
|
@ -178,18 +184,15 @@ objOffset.runAmpcor()
|
|||
|
||||
PyCuAmpcor now exclusively uses the GDAL driver to read images; only single-precision binary data are supported. (Image heights/widths are still required as inputs; they are mainly used for dimension checking. We will update the code later to read them with the GDAL driver.) Multi-band images are not currently supported, but support can be added if desired.
|
||||
|
||||
The offset output is arranged in BIP format, with each pixel (azimuth offset, range offset). In addition to a static gross offset (i.e., a constant for all search windows), PyCuAmpcor supports varying gross offsets as inputs (e.g., for glaciers, users can compute the gross offsets with the velocity model for different locations and use them as inputs for PyCuAmpcor. See 2.1 for details.
|
||||
|
||||
The offsetImage only ouputs the (dense) offset values computed from the cross-correlations. Users need to add offsetImage and grossOffsetImage to obtain the total offsets.
|
||||
The offset output is arranged in BIP format, with each pixel holding (azimuth offset, range offset). In addition to a static gross offset (i.e., a constant for all search windows), PyCuAmpcor supports varying gross offsets as inputs (e.g., for glaciers, users can compute the gross offsets with the velocity model for different locations and use them as inputs for PyCuAmpcor).
|
||||
|
||||
The offsetImage only outputs the (dense) offset values computed from the cross-correlations. Users need to add offsetImage and grossOffsetImage to obtain the total offsets.
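
As a short sketch continuing the reading example above (file names and dimensions are hypothetical), the total offsets are the element-wise sum of the two BIP outputs:

```
import numpy as np

shape = (100, 120, 2)   # (numberWindowDown, numberWindowAcross, bands); hypothetical
offsets      = np.fromfile('offsets.bip', dtype=np.float32).reshape(shape)
grossOffsets = np.fromfile('gross_offsets.bip', dtype=np.float32).reshape(shape)
totalOffsets = offsets + grossOffsets   # total offset per window (down, across)
```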
|
||||
|
||||
The dimension/direction names used in PyCuAmpcor are:
|
||||
* the inner-most dimension x(i): row, height, down, azimuth, along the track.
|
||||
* the outer-most dimension y(j): column, width, across, range, along the line of sight.
|
||||
* C/C++/Python use row-major indexing: a[i][j] -> a[i*WIDTH+j]
|
||||
* FORTRAN/BLAS/CUBLAS use column-major indexing: a[i][j]->a[i+j*LENGTH]
|
||||
|
||||
Note that ampcor.F in general uses y for rows and x for columns which is opposite to PyCuAmpcor.
|
||||
Note that ampcor.F and GDAL in general use y for rows and x for columns.
|
||||
|
||||
Note also that the PyCuAmpcor parameters here refer to the names used by the PyCuAmpcor Python class. They may differ from those used in C/C++/CUDA, or from the cuDenseOffsets.py arguments.
|
||||
|
||||
|
|
@ -245,7 +248,7 @@ and the number of windows may be computed along lines as
|
|||
objOffset.numberWindowDown = (referenceImageHeight-2*margin-2*halfSearchRangeDown-windowSizeHeight) // skipSampleDown
|
||||
```
|
||||
|
||||
If there is a gross offset, you may also need to subtract that when computing the number of windows.
|
||||
If there is a gross offset, you may also need to subtract it when computing the number of windows.
|
||||
|
||||
The output offset fields will be of size (numberWindowDown, numberWindowAcross). The total number of windows is numberWindows = numberWindowDown\*numberWindowAcross.
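
For example, a small Python sketch of this bookkeeping (parameter values are hypothetical; the across direction mirrors the formula shown above for the down direction):

```
# hypothetical parameter values; replace with your own
referenceImageHeight, referenceImageWidth = 10000, 8000
windowSizeHeight, windowSizeWidth = 64, 64
halfSearchRangeDown, halfSearchRangeAcross = 20, 20
skipSampleDown, skipSampleAcross = 128, 64
margin = 0

numberWindowDown = (referenceImageHeight - 2*margin - 2*halfSearchRangeDown
                    - windowSizeHeight) // skipSampleDown
numberWindowAcross = (referenceImageWidth - 2*margin - 2*halfSearchRangeAcross
                      - windowSizeWidth) // skipSampleAcross
numberWindows = numberWindowDown * numberWindowAcross
print(numberWindowDown, numberWindowAcross, numberWindows)
```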
|
||||
|
||||
|
|
@ -269,7 +272,7 @@ With a constant gross offset, call
|
|||
|
||||
to set the starting pixels of all reference and secondary windows.
|
||||
|
||||
The starting pixel for the seconday window will be (referenceStartPixelDownStatic-halfSearchRangeDown+grossOffsetDown, referenceStartPixelAcrossStatic-halfSearchRangeAcross+grossOffsetAcross).
|
||||
The starting pixel for the secondary window will be (referenceStartPixelDownStatic-halfSearchRangeDown+grossOffsetDown, referenceStartPixelAcrossStatic-halfSearchRangeAcross+grossOffsetAcross).
|
||||
|
||||
For cases where you choose a varying grossOffset, you may use two numpy arrays to pass the information to PyCuAmpcor, e.g.,
|
||||
|
||||
|
|
@ -336,12 +339,12 @@ This step provides an initial estimate of the offset, usually with a large searc
|
|||
**Difference to ROIPAC**
|
||||
|
||||
* ROIPAC only offers the time-domain algorithm. The frequency-domain algorithm is faster and is set as the default in PyCuAmpcor.
|
||||
* RIOPAC proceeds from here only for windows with *good* match. To maintain parallelism, PyCuAmpcor proceeds anyway while leaving the *filtering* to users in post processing.
|
||||
* ROIPAC proceeds from here only for windows with a *good* match, or with high coherence. To maintain parallelism, PyCuAmpcor proceeds anyway, leaving the *filtering* to users in post-processing.
|
||||
|
||||
|
||||
### 5.3 Extract a smaller window from the secondary window for oversampling
|
||||
|
||||
* From the secondary window, we extract a smaller window of size (windowSizeHeightRaw+2\*halfZoomWindowSizeRaw, windowSizeWidthRaw+2\*halfZoomWindowSizeRaw) with the center determined by the peak position. If the peak postion, e.g., along height, is OffsetInit (taking values in \[0, 2\*halfSearchRangeDownRaw\]), the starting position to extract will be OffsetInit+halfSearchRangeDownRaw-halfZoomWindowSizeRaw.
|
||||
* From the secondary window, we extract a smaller window of size (windowSizeHeightRaw+2\*halfZoomWindowSizeRaw, windowSizeWidthRaw+2\*halfZoomWindowSizeRaw) with the center determined by the peak position. If the peak position, e.g., along height, is OffsetInit (taking values in \[0, 2\*halfSearchRangeDownRaw\]), the starting position to extract will be OffsetInit+halfSearchRangeDownRaw-halfZoomWindowSizeRaw.
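
In pseudo-Python, the extraction start along the height direction reads as follows (values are hypothetical; boundary adjustments done in the GPU code are omitted):

```
# hypothetical values
halfSearchRangeDownRaw = 20
halfZoomWindowSizeRaw = 4
windowSizeHeightRaw = 64
offsetInitDown = 23   # raw correlation peak row, in [0, 2*halfSearchRangeDownRaw]

startDown = offsetInitDown + halfSearchRangeDownRaw - halfZoomWindowSizeRaw
extractHeight = windowSizeHeightRaw + 2 * halfZoomWindowSizeRaw
# extracted rows: secondaryWindow[startDown : startDown + extractHeight, :]
```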
|
||||
|
||||
**Parameters**
|
||||
|
||||
|
|
@ -356,8 +359,8 @@ ROIPAC extracts the secondary window centered at the correlation surface peak.
|
|||
|
||||
### 5.4 Oversampling reference and (extracted) secondary windows
|
||||
|
||||
* oversample both the reference and the (extracted) secondary windows by a factor of 2, which is to avoid aliasing in the complex multiplication of the SAR images. The oversampling is performed with FFT (zero padding), same as in RIOPAC.
|
||||
* A deramping procedure is in general required for complex signals before oversampling, to shift the band center to 0. The procedure is only designed to remove a linear phase ramp. It doesn't work for InSAR TOPS mode, whose ramp goes quadratic. Instead, the amplitudes are taken before oversampling.
|
||||
* Oversample both the reference and the (extracted) secondary windows by a factor of 2, to avoid aliasing in the complex multiplication of the SAR images. The oversampling is performed with FFT (zero padding), the same as in ROIPAC (see the sketch after this list).
|
||||
* A deramping procedure is in general required for complex signals before oversampling, to shift the band center to 0. The procedure is only designed to remove a linear phase ramp, so it doesn't work for TOPSAR, whose phase ramp is quadratic; instead, the amplitudes are taken before oversampling.
|
||||
* The amplitudes (real) are then taken for each pixel of the complex signals in the reference and secondary windows.
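
Below is a minimal NumPy sketch of FFT (zero-padding) oversampling of a complex window; it ignores deramping and boundary details and is not the cuFFT implementation:

```
import numpy as np

def oversample_fft(win, factor=2):
    """Oversample a 2-D complex window by zero padding its spectrum (sketch)."""
    h, w = win.shape
    spec = np.fft.fftshift(np.fft.fft2(win))
    padded = np.zeros((h * factor, w * factor), dtype=spec.dtype)
    r0 = (h * factor - h) // 2
    c0 = (w * factor - w) // 2
    padded[r0:r0 + h, c0:c0 + w] = spec
    return np.fft.ifft2(np.fft.ifftshift(padded)) * factor * factor

# amplitudes are then taken, e.g., np.abs(oversample_fft(win))
```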
|
||||
|
||||
**Parameters**
|
||||
|
|
@ -410,33 +413,4 @@ Note that this offset does not include the pre-defined gross offset. Users need
|
|||
|
||||
**Difference to ROIPAC**
|
||||
|
||||
RIOPAC by default uses the sinc interpolator (one needs to change the FORTRAN code to use FFT). There is no differnce with the sinc interpolator, while for FFT, RIOPAC always enlarges the window to a power of 2.
|
||||
|
||||
|
||||
## 6. Additional Notes
|
||||
|
||||
### 6.1 Sinc Oversampler
|
||||
|
||||
The sinc oversampler/interpolator may be selected to oversample the correlation surface with *--corr-osm=1* in *cuDenseOffsets.py*, or *objOffset.corrSurfaceOverSamplingMethod=1*.
|
||||
|
||||
The sinc interpolating formula is defined as
|
||||
|
||||
$$x(t) = \sum_{n=-\infty}^{\infty} x_n f( \Omega_c t-n )$$
|
||||
|
||||
|
||||
with $f(x) = \text{sinc}(x)$, or a composite filter such as $\text{sinc}(x)$ convolved with a Hamming window, as used in ampcor.
|
||||
|
||||
```
|
||||
parameter(MAXDECFACTOR=4096) ! maximum lags in interpolation kernels
|
||||
r_fintp(0:MAXINTLGH) ! interpolation kernel values
|
||||
i_decfactor = 4096 ! Range migration decimation Factor
|
||||
parameter (MAXINTKERLGH=256) !maximum interpolation kernel length
|
||||
MAXINTLGH=MAXINTKERLGH*MAXDECFACTOR ! maximum interpolation kernel array size
|
||||
i_weight = 1
|
||||
r_pedestal = 0.0
|
||||
r_beta = .75
|
||||
r_relfiltlen = 6.0
|
||||
r_fintp(0:MAXINTLGH)
|
||||
```
|
||||
|
||||
Note that these parameters are hardwired; you need to change the source code to change these parameters.
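
As an illustrative sketch only (not the ampcor kernel generator), a Hamming-weighted sinc kernel with the relative filter length and decimation factor listed above could be tabulated in Python as:

```
import numpy as np

# hypothetical stand-ins for the hardwired parameters above
beta = 0.75        # r_beta: bandwidth fraction
filt_len = 6       # r_relfiltlen: kernel length in samples
dec_factor = 4096  # i_decfactor: number of fractional lags tabulated
pedestal = 0.0     # r_pedestal: window pedestal

# tap positions covering filt_len samples at dec_factor fractional shifts
n = np.arange(filt_len * dec_factor) / dec_factor - filt_len / 2.0
kernel = np.sinc(beta * n)                                  # band-limited sinc
window = (1.0 - pedestal) * np.hamming(n.size) + pedestal   # Hamming weighting
kernel *= window
```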
|
||||
ROIPAC by default uses the sinc interpolator (the FFT method is included, but one needs to change the FORTRAN code to switch). For the sinc interpolator, there is no difference in implementation. For FFT, ROIPAC always enlarges the window to a size that is a power of 2.
|
||||
|
|
|
|||
|
|
@ -1,22 +1,24 @@
|
|||
#include "GDALImage.h"
|
||||
#include <iostream>
|
||||
|
||||
#include <stdio.h>
|
||||
#include <stdlib.h>
|
||||
#include <unistd.h>
|
||||
#include <fcntl.h>
|
||||
#include <assert.h>
|
||||
#include "cudaError.h"
|
||||
#include <errno.h>
|
||||
#include <unistd.h>
|
||||
|
||||
|
||||
/**
|
||||
* \brief Constructor
|
||||
* @file GDALImage.h
|
||||
* @brief Implementations of GDALImage class
|
||||
*
|
||||
* @param filename a std::string with the raster image file name
|
||||
*/
|
||||
|
||||
// my declaration
|
||||
#include "GDALImage.h"
|
||||
|
||||
// dependencies
|
||||
#include <iostream>
|
||||
#include "cudaError.h"
|
||||
|
||||
/**
|
||||
* Constructor
|
||||
* @brief Create a GDAL image object
|
||||
* @param filename a std::string with the raster image file name
|
||||
* @param band the band number
|
||||
* @param cacheSizeInGB read buffer size in GigaBytes
|
||||
* @param useMmap whether to use memory map
|
||||
*/
|
||||
GDALImage::GDALImage(std::string filename, int band, int cacheSizeInGB, int useMmap)
|
||||
: _useMmap(useMmap)
|
||||
{
|
||||
|
|
@ -31,7 +33,7 @@ GDALImage::GDALImage(std::string filename, int band, int cacheSizeInGB, int useM
|
|||
int count = _poDataset->GetRasterCount();
|
||||
if(band > count)
|
||||
{
|
||||
std::cout << "The desired band " << band << " is greated than " << count << " bands available";
|
||||
std::cout << "The desired band " << band << " is greater than " << count << " bands available";
|
||||
throw;
|
||||
}
|
||||
|
||||
|
|
@ -72,8 +74,6 @@ GDALImage::GDALImage(std::string filename, int band, int cacheSizeInGB, int useM
|
|||
&_pixelSize,
|
||||
&pnLineSpace,
|
||||
papszOptions);
|
||||
|
||||
// check it
|
||||
if(!_poBandVirtualMem)
|
||||
throw;
|
||||
|
||||
|
|
@ -83,38 +83,52 @@ GDALImage::GDALImage(std::string filename, int band, int cacheSizeInGB, int useM
|
|||
else { // use a buffer
|
||||
checkCudaErrors(cudaMallocHost((void **)&_memPtr, _bufferSize));
|
||||
}
|
||||
|
||||
// make sure memPtr is not Null
|
||||
if (!_memPtr)
|
||||
{
|
||||
std::cout << "unable to locate the memory buffer\n";
|
||||
throw;
|
||||
|
||||
}
|
||||
// all done
|
||||
}
|
||||
|
||||
|
||||
/// load a tile of data h_tile x w_tile from CPU (mmap) to GPU
|
||||
/// @param dArray pointer for array in device memory
|
||||
/// @param h_offset Down/Height offset
|
||||
/// @param w_offset Across/Width offset
|
||||
/// @param h_tile Down/Height tile size
|
||||
/// @param w_tile Across/Width tile size
|
||||
/// @param stream CUDA stream for copying
|
||||
void GDALImage::loadToDevice(void *dArray, size_t h_offset, size_t w_offset, size_t h_tile, size_t w_tile, cudaStream_t stream)
|
||||
/**
|
||||
* Load a tile of data h_tile x w_tile from CPU to GPU
|
||||
* @param dArray pointer for array in device memory
|
||||
* @param h_offset Down/Height offset
|
||||
* @param w_offset Across/Width offset
|
||||
* @param h_tile Down/Height tile size
|
||||
* @param w_tile Across/Width tile size
|
||||
* @param stream CUDA stream for copying
|
||||
* @note Need to use size_t type to pass the parameters to cudaMemcpy2D correctly
|
||||
*/
|
||||
void GDALImage::loadToDevice(void *dArray, size_t h_offset, size_t w_offset,
|
||||
size_t h_tile, size_t w_tile, cudaStream_t stream)
|
||||
{
|
||||
|
||||
size_t tileStartOffset = (h_offset*_width + w_offset)*_pixelSize;
|
||||
|
||||
char * startPtr = (char *)_memPtr ;
|
||||
startPtr += tileStartOffset;
|
||||
|
||||
if (_useMmap)
|
||||
checkCudaErrors(cudaMemcpy2DAsync(dArray, w_tile*_pixelSize, startPtr, _width*_pixelSize,
|
||||
w_tile*_pixelSize, h_tile, cudaMemcpyHostToDevice,stream));
|
||||
else {
|
||||
if (_useMmap) {
|
||||
// direct copy from memory map buffer to device memory
|
||||
checkCudaErrors(cudaMemcpy2DAsync(dArray, // dst
|
||||
w_tile*_pixelSize, // dst pitch
|
||||
startPtr, // src
|
||||
_width*_pixelSize, // src pitch
|
||||
w_tile*_pixelSize, // width in Bytes
|
||||
h_tile, // height
|
||||
cudaMemcpyHostToDevice,stream));
|
||||
}
|
||||
else { // use a cpu buffer to load image data to gpu
|
||||
|
||||
// get the total tile size in bytes
|
||||
size_t tileSize = h_tile*w_tile*_pixelSize;
|
||||
// if the size is bigger than existing buffer, reallocate
|
||||
if (tileSize > _bufferSize) {
|
||||
// maybe we need to make it to fit the pagesize
|
||||
// TODO: fit the pagesize
|
||||
_bufferSize = tileSize;
|
||||
checkCudaErrors(cudaFree(_memPtr));
|
||||
checkCudaErrors(cudaMallocHost((void **)&_memPtr, _bufferSize));
|
||||
|
|
@ -126,17 +140,18 @@ void GDALImage::loadToDevice(void *dArray, size_t h_offset, size_t w_offset, siz
|
|||
_memPtr, // pData
|
||||
w_tile*h_tile, 1, // nBufXSize, nBufYSize
|
||||
_dataType, //eBufType
|
||||
0, 0, //nPixelSpace, nLineSpace in pData
|
||||
NULL //psExtraArg extra resampling callback
|
||||
0, 0 //nPixelSpace, nLineSpace in pData
|
||||
);
|
||||
|
||||
if(err != CE_None)
|
||||
throw;
|
||||
throw; // throw if reading error occurs; message reported by GDAL
|
||||
|
||||
// copy from buffer to gpu
|
||||
checkCudaErrors(cudaMemcpyAsync(dArray, _memPtr, tileSize, cudaMemcpyHostToDevice, stream));
|
||||
}
|
||||
// all done
|
||||
}
|
||||
|
||||
/// destructor
|
||||
GDALImage::~GDALImage()
|
||||
{
|
||||
// free the virtual memory
|
||||
|
|
|
|||
|
|
@ -1,61 +1,65 @@
|
|||
// -*- c++ -*-
|
||||
/**
|
||||
* \brief Class for an image described GDAL vrt
|
||||
* @file GDALImage.h
|
||||
* @brief Interface with GDAL vrt driver
|
||||
*
|
||||
* only complex (pixelOffset=8) or real(pixelOffset=4) images are supported, such as SLC and single-precision TIFF
|
||||
* Reads image files with the GDAL vrt driver, including SLC and GeoTIFF images
|
||||
* @warning Only single precision images are supported: complex(pixelOffset=8) or real(pixelOffset=4).
|
||||
* @warning Only single-band files are currently supported.
|
||||
*/
|
||||
|
||||
// code guard
|
||||
#ifndef __GDALIMAGE_H
|
||||
#define __GDALIMAGE_H
|
||||
|
||||
// dependencies
|
||||
#include <string>
|
||||
#include <gdal_priv.h>
|
||||
#include <cpl_conv.h>
|
||||
|
||||
class GDALImage{
|
||||
|
||||
class GDALImage{
|
||||
public:
|
||||
// specify the types
|
||||
using size_t = std::size_t;
|
||||
|
||||
private:
|
||||
size_t _fileSize;
|
||||
int _height;
|
||||
int _width;
|
||||
int _height; ///< image height
|
||||
int _width; ///< image width
|
||||
|
||||
// buffer pointer
|
||||
void * _memPtr = NULL;
|
||||
void * _memPtr = NULL; ///< pointer to buffer
|
||||
|
||||
int _pixelSize; //in bytes
|
||||
int _pixelSize; ///< pixel size in bytes
|
||||
|
||||
int _isComplex;
|
||||
int _isComplex; ///< whether the image is complex
|
||||
|
||||
size_t _bufferSize;
|
||||
int _useMmap;
|
||||
size_t _bufferSize; ///< buffer size
|
||||
int _useMmap; ///< whether to use memory map
|
||||
|
||||
// GDAL temporary objects
|
||||
GDALDataType _dataType;
|
||||
CPLVirtualMem * _poBandVirtualMem = NULL;
|
||||
GDALDataset * _poDataset = NULL;
|
||||
GDALRasterBand * _poBand = NULL;
|
||||
|
||||
public:
|
||||
//disable default constructor
|
||||
GDALImage() = delete;
|
||||
// constructor
|
||||
GDALImage(std::string fn, int band=1, int cacheSizeInGB=0, int useMmap=1);
|
||||
// destructor
|
||||
~GDALImage();
|
||||
|
||||
// get class properties
|
||||
void * getmemPtr()
|
||||
{
|
||||
return(_memPtr);
|
||||
}
|
||||
|
||||
size_t getFileSize()
|
||||
{
|
||||
return (_fileSize);
|
||||
}
|
||||
|
||||
size_t getHeight() {
|
||||
int getHeight() {
|
||||
return (_height);
|
||||
}
|
||||
|
||||
size_t getWidth()
|
||||
int getWidth()
|
||||
{
|
||||
return (_width);
|
||||
}
|
||||
|
|
@ -70,9 +74,10 @@ public:
|
|||
return _isComplex;
|
||||
}
|
||||
|
||||
// load data from cpu buffer to gpu
|
||||
void loadToDevice(void *dArray, size_t h_offset, size_t w_offset, size_t h_tile, size_t w_tile, cudaStream_t stream);
|
||||
~GDALImage();
|
||||
|
||||
};
|
||||
|
||||
#endif //__GDALIMAGE_H
|
||||
// end of file
|
||||
|
|
|
|||
|
|
@ -1,8 +1,8 @@
|
|||
PROJECT = CUAMPCOR
|
||||
|
||||
LDFLAGS = -lcuda -lcudart -lcufft -lgdal
|
||||
CXXFLAGS = -std=c++11 -fpermissive -fPIC -shared
|
||||
NVCCFLAGS = -std=c++11 -ccbin g++ -m64 \
|
||||
CXXFLAGS = -std=c++11 -fpermissive -DNDEBUG -fPIC -shared
|
||||
NVCCFLAGS = -std=c++11 -m64 -DNDEBUG \
|
||||
-gencode arch=compute_35,code=sm_35 \
|
||||
-gencode arch=compute_60,code=sm_60 \
|
||||
-Xcompiler -fPIC -shared -Wno-deprecated-gpu-targets \
|
||||
|
|
@ -70,4 +70,4 @@ pyampcor: $(OBJS)
|
|||
rm -f PyCuAmpcor.cpp && python3 setup.py build_ext --inplace
|
||||
|
||||
clean:
|
||||
rm -rf *.o *so build *~ PyCuAmpcor.cpp ctest *.dat
|
||||
rm -rf *.o *so build *~ PyCuAmpcor.cpp *.dat
|
||||
|
|
|
|||
|
|
@ -1,6 +1,7 @@
|
|||
#
|
||||
# PYX file to control Python module interface to underlying CUDA-Ampcor code
|
||||
#
|
||||
|
||||
from libcpp.string cimport string
|
||||
import numpy as np
|
||||
cimport numpy as np
|
||||
|
|
@ -9,23 +10,21 @@ cimport numpy as np
|
|||
cdef extern from "cudaUtil.h":
|
||||
int gpuDeviceInit(int)
|
||||
void gpuDeviceList()
|
||||
int gpuGetMaxGflopsDeviceId()
|
||||
|
||||
def listGPU():
|
||||
gpuDeviceList()
|
||||
|
||||
def findGPU():
|
||||
return gpuGetMaxGflopsDeviceId()
|
||||
|
||||
def setGPU(int id):
|
||||
return gpuDeviceInit(id)
|
||||
|
||||
def version():
|
||||
return "2.0.0"
|
||||
|
||||
cdef extern from "cuAmpcorParameter.h":
|
||||
cdef cppclass cuAmpcorParameter:
|
||||
cuAmpcorParameter() except +
|
||||
int algorithm ## Cross-correlation algorithm: 0=freq domain 1=time domain
|
||||
int deviceID ## Targeted GPU device ID: use -1 to auto select
|
||||
int deviceID ## Targeted GPU device ID
|
||||
int nStreams ## Number of streams used to overlap data transfers and compute kernels
|
||||
int derampMethod ## Method for deramping 0=None, 1=average, 2=phase gradient
|
||||
|
||||
|
|
@ -449,7 +448,7 @@ cdef class PyCuAmpcor(object):
|
|||
self.c_cuAmpcor.runAmpcor()
|
||||
|
||||
|
||||
|
||||
# end of file
|
||||
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -13,9 +13,7 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
|
||||
// load reference image chunk
|
||||
loadReferenceChunk();
|
||||
|
||||
//std::cout << "load reference chunk ok\n";
|
||||
|
||||
// take amplitudes
|
||||
cuArraysAbs(c_referenceBatchRaw, r_referenceBatchRaw, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
|
|
@ -24,6 +22,7 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
r_referenceBatchRaw->outputToFile("r_referenceBatchRaw", stream);
|
||||
#endif
|
||||
|
||||
// compute and subtract mean values (for normalized)
|
||||
cuArraysSubtractMean(r_referenceBatchRaw, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
|
|
@ -33,6 +32,7 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
|
||||
// load secondary image chunk
|
||||
loadSecondaryChunk();
|
||||
// take amplitudes
|
||||
cuArraysAbs(c_secondaryBatchRaw, r_secondaryBatchRaw, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
|
|
@ -41,9 +41,7 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
r_secondaryBatchRaw->outputToFile("r_secondaryBatchRaw", stream);
|
||||
#endif
|
||||
|
||||
//std::cout << "load secondary chunk ok\n";
|
||||
|
||||
//cross correlation for none-oversampled data
|
||||
//cross correlation for un-oversampled data
|
||||
if(param->algorithm == 0) {
|
||||
cuCorrFreqDomain->execute(r_referenceBatchRaw, r_secondaryBatchRaw, r_corrBatchRaw);
|
||||
} else {
|
||||
|
|
@ -51,10 +49,11 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
}
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
// dump the unrenormalized correlation surface
|
||||
// dump the un-normalized correlation surface
|
||||
r_corrBatchRaw->outputToFile("r_corrBatchRawUnNorm", stream);
|
||||
#endif
|
||||
|
||||
// normalize the correlation surface
|
||||
cuCorrNormalize(r_referenceBatchRaw, r_secondaryBatchRaw, r_corrBatchRaw, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
|
|
@ -64,23 +63,18 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
|
||||
// find the maximum location of the non-oversampled correlation
|
||||
// 41 x 41, if halfsearchrange=20
|
||||
//cuArraysMaxloc2D(r_corrBatchRaw, offsetInit, stream);
|
||||
cuArraysMaxloc2D(r_corrBatchRaw, offsetInit, r_maxval, stream);
|
||||
|
||||
// Estimation of statistics
|
||||
// Author: Minyan Zhong
|
||||
// Extraction of correlation surface around the peak
|
||||
cuArraysCopyExtractCorr(r_corrBatchRaw, r_corrBatchRawZoomIn, i_corrBatchZoomInValid, offsetInit, stream);
|
||||
|
||||
//cudaDeviceSynchronize();
|
||||
|
||||
// Summation of correlation and data point values
|
||||
cuArraysSumCorr(r_corrBatchRawZoomIn, i_corrBatchZoomInValid, r_corrBatchSum, i_corrBatchValidCount, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
i_corrBatchZoomInValid->outputToFile("i_corrBatchZoomInValid", stream);
|
||||
r_corrBatchSum->outputToFile("r_corrBatchSum", stream);
|
||||
// snr and cov will be outputted anyway
|
||||
#endif
|
||||
|
||||
// SNR
|
||||
|
|
@ -90,20 +84,13 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
cuEstimateVariance(r_corrBatchRaw, offsetInit, r_maxval, r_covValue, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
// debug: output the intermediate results
|
||||
|
||||
// std::cout << "Offset from first search:\n";
|
||||
// offsetInit->debuginfo(stream);
|
||||
// dump the results
|
||||
offsetInit->outputToFile("i_offsetInit", stream);
|
||||
r_maxval->outputToFile("r_maxval", stream);
|
||||
r_corrBatchRawZoomIn->outputToFile("r_corrBatchRawStatZoomIn", stream);
|
||||
i_corrBatchZoomInValid->outputToFile("i_corrBatchStatZoomInValid", stream);
|
||||
//r_snr, r_cov will be always saved to files
|
||||
#endif
|
||||
|
||||
// Using the approximate estimation to adjust secondary image (half search window size becomes only 4 pixels)
|
||||
// offsetInit->debuginfo(stream);
|
||||
// determine the starting pixel to extract secondary images around the max location
|
||||
cuDetermineSecondaryExtractOffset(offsetInit,
|
||||
maxLocShift,
|
||||
|
|
@ -114,17 +101,14 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
// std::cout << "max location adjusted if close to boundary\n";
|
||||
// offsetInit->debuginfo(stream);
|
||||
// std::cout << "and the shift of the center\n";
|
||||
// maxLocShift->debuginfo(stream);
|
||||
offsetInit->outputToFile("i_offsetInitAdjusted", stream);
|
||||
maxLocShift->outputToFile("i_maxLocShift", stream);
|
||||
#endif
|
||||
|
||||
// oversample reference
|
||||
// (deramping now included in oversampler)
|
||||
// (deramping included in oversampler)
|
||||
referenceBatchOverSampler->execute(c_referenceBatchRaw, c_referenceBatchOverSampled, param->derampMethod);
|
||||
// take amplitudes
|
||||
cuArraysAbs(c_referenceBatchOverSampled, r_referenceBatchOverSampled, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
|
|
@ -133,7 +117,7 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
r_referenceBatchOverSampled->outputToFile("r_referenceBatchOverSampled", stream);
|
||||
#endif
|
||||
|
||||
// subtrace the mean value
|
||||
// compute and subtract the mean value
|
||||
cuArraysSubtractMean(r_referenceBatchOverSampled, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
|
|
@ -144,6 +128,7 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
// extract secondary and oversample
|
||||
cuArraysCopyExtract(c_secondaryBatchRaw, c_secondaryBatchZoomIn, offsetInit, stream);
|
||||
secondaryBatchOverSampler->execute(c_secondaryBatchZoomIn, c_secondaryBatchOverSampled, param->derampMethod);
|
||||
// take amplitudes
|
||||
cuArraysAbs(c_secondaryBatchOverSampled, r_secondaryBatchOverSampled, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
|
|
@ -171,19 +156,14 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
cuCorrNormalize(r_referenceBatchOverSampled, r_secondaryBatchOverSampled, r_corrBatchZoomIn, stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
//std::cout << "debug correlation oversample\n";
|
||||
//std::cout << r_referenceBatchOverSampled->height << " " << r_referenceBatchOverSampled->width << "\n";
|
||||
//std::cout << r_secondaryBatchOverSampled->height << " " << r_secondaryBatchOverSampled->width << "\n";
|
||||
//std::cout << r_corrBatchZoomIn->height << " " << r_corrBatchZoomIn->width << "\n";
|
||||
// dump the oversampled correlation surface (normalized)
|
||||
r_corrBatchZoomIn->outputToFile("r_corrBatchZoomIn", stream);
|
||||
#endif
|
||||
|
||||
// remove the last row and col to get even sequences (for sinc oversampler)
|
||||
// remove the last row and col to get even sequences
|
||||
cuArraysCopyExtract(r_corrBatchZoomIn, r_corrBatchZoomInAdjust, make_int2(0,0), stream);
|
||||
|
||||
#ifdef CUAMPCOR_DEBUG
|
||||
//std::cout << "debug oversampling " << r_corrBatchZoomInAdjust << " " << r_corrBatchZoomInOverSampled << "\n";
|
||||
// dump the adjusted correlation Surface
|
||||
r_corrBatchZoomInAdjust->outputToFile("r_corrBatchZoomInAdjust", stream);
|
||||
#endif
|
||||
|
|
@ -219,28 +199,19 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
|
|||
cuSubPixelOffset(offsetInit, offsetZoomIn, offsetFinal,
|
||||
param->oversamplingFactor, param->rawDataOversamplingFactor,
|
||||
param->halfSearchRangeDownRaw, param->halfSearchRangeAcrossRaw,
|
||||
param->halfZoomWindowSizeRaw, param->halfZoomWindowSizeRaw,
|
||||
stream);
|
||||
|
||||
// #ifdef CUAMPCOR_DEBUG
|
||||
// std::cout << "Offsets: Oversampled and Final)\n";
|
||||
// offsetZoomIn->debuginfo(stream);
|
||||
// offsetFinal->debuginfo(stream);
|
||||
// #endif
|
||||
|
||||
// Do insertion.
|
||||
// Offsetfields.
|
||||
// Insert the chunk results to final images
|
||||
cuArraysCopyInsert(offsetFinal, offsetImage, idxDown_*param->numberWindowDownInChunk, idxAcross_*param->numberWindowAcrossInChunk,stream);
|
||||
|
||||
// Old: save max correlation coefficients.
|
||||
//cuArraysCopyInsert(corrMaxValue, snrImage, idxDown_*param->numberWindowDownInChunk, idxAcross_*param->numberWindowAcrossInChunk,stream);
|
||||
// New: save SNR
|
||||
// snr
|
||||
cuArraysCopyInsert(r_snrValue, snrImage, idxDown_*param->numberWindowDownInChunk, idxAcross_*param->numberWindowAcrossInChunk,stream);
|
||||
|
||||
// Variance.
|
||||
cuArraysCopyInsert(r_covValue, covImage, idxDown_*param->numberWindowDownInChunk, idxAcross_*param->numberWindowAcrossInChunk,stream);
|
||||
// all done
|
||||
|
||||
}
|
||||
|
||||
/// set chunk index
|
||||
void cuAmpcorChunk::setIndex(int idxDown_, int idxAcross_)
|
||||
{
|
||||
idxChunkDown = idxDown_;
|
||||
|
|
@ -260,13 +231,11 @@ void cuAmpcorChunk::setIndex(int idxDown_, int idxAcross_)
|
|||
else {
|
||||
nWindowsAcross = param->numberWindowAcrossInChunk;
|
||||
}
|
||||
//std::cout << "DEBUG setIndex" << idxChunk << " " << nWindowsDown << " " << nWindowsAcross << "\n";
|
||||
|
||||
}
|
||||
|
||||
/// obtain the starting pixels for each chip
|
||||
/// @param[in] oStartPixel
|
||||
///
|
||||
/// @param[in] oStartPixel start pixel locations for all chips
|
||||
/// @param[out] rStartPixel start pixel locations for chips within the chunk
|
||||
void cuAmpcorChunk::getRelativeOffset(int *rStartPixel, const int *oStartPixel, int diff)
|
||||
{
|
||||
for(int i=0; i<param->numberWindowDownInChunk; ++i) {
|
||||
|
|
@ -279,7 +248,6 @@ void cuAmpcorChunk::getRelativeOffset(int *rStartPixel, const int *oStartPixel,
|
|||
int idxInAll = (iDown+idxChunkDown*param->numberWindowDownInChunk)*param->numberWindowAcross
|
||||
+ idxChunkAcross*param->numberWindowAcrossInChunk+iAcross;
|
||||
rStartPixel[idxInChunk] = oStartPixel[idxInAll] - diff;
|
||||
//fprintf(stderr, "relative offset %d %d %d %d\n", i, j, rStartPixel[idxInChunk], diff);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
|
@ -314,7 +282,6 @@ void cuAmpcorChunk::loadReferenceChunk()
|
|||
|
||||
// load the data from cpu
|
||||
referenceImage->loadToDevice((void *)c_referenceChunkRaw->devData, startD, startA, height, width, stream);
|
||||
//std::cout << "debug load reference: " << startD << " " << startA << " " << height << " " << width << "\n";
|
||||
|
||||
//copy the chunk to a batch format (nImages, height, width)
|
||||
// if derampMethod = 0 (no deramp), take amplitudes; otherwise, copy complex data
|
||||
|
|
@ -399,6 +366,7 @@ void cuAmpcorChunk::loadSecondaryChunk()
|
|||
}
|
||||
}
|
||||
|
||||
/// constructor
|
||||
cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, GDALImage *secondary_,
|
||||
cuArrays<float2> *offsetImage_, cuArrays<float> *snrImage_, cuArrays<float3> *covImage_,
|
||||
cudaStream_t stream_)
|
||||
|
|
@ -413,14 +381,6 @@ cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, G
|
|||
|
||||
stream = stream_;
|
||||
|
||||
// std::cout << "debug Chunk creator " << param->maxReferenceChunkHeight << " " << param->maxReferenceChunkWidth << "\n";
|
||||
// try allocate/deallocate on the fly to save gpu memory 07/09/19
|
||||
// c_referenceChunkRaw = new cuArrays<float2> (param->maxReferenceChunkHeight, param->maxReferenceChunkWidth);
|
||||
// c_referenceChunkRaw->allocate();
|
||||
|
||||
// c_secondaryChunkRaw = new cuArrays<float2> (param->maxSecondaryChunkHeight, param->maxSecondaryChunkWidth);
|
||||
// c_secondaryChunkRaw->allocate();
|
||||
|
||||
ChunkOffsetDown = new cuArrays<int> (param->numberWindowDownInChunk, param->numberWindowAcrossInChunk);
|
||||
ChunkOffsetDown->allocate();
|
||||
ChunkOffsetDown->allocateHost();
|
||||
|
|
@ -527,10 +487,6 @@ cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, G
|
|||
|
||||
|
||||
// new arrays due to snr estimation
|
||||
// std::cout<< "corrRawZoomInHeight: " << param->corrRawZoomInHeight << "\n";
|
||||
// std::cout<< "corrRawZoomInWidth: " << param->corrRawZoomInWidth << "\n";
|
||||
std::cout << "Size of corr_surface used for statistics: " << param->corrRawZoomInHeight << " x " << param->corrRawZoomInWidth << "\n";
|
||||
|
||||
r_corrBatchRawZoomIn = new cuArrays<float> (
|
||||
param->corrRawZoomInHeight,
|
||||
param->corrRawZoomInWidth,
|
||||
|
|
@ -599,33 +555,11 @@ cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, G
|
|||
#ifdef CUAMPCOR_DEBUG
|
||||
std::cout << "all objects in chunk are created ...\n";
|
||||
#endif
|
||||
|
||||
}
|
||||
|
||||
// destructor
|
||||
cuAmpcorChunk::~cuAmpcorChunk()
|
||||
{
|
||||
/*
|
||||
delete referenceChunkRaw;
|
||||
delete secondaryChunkRaw;
|
||||
delete ChunkOffsetDown;
|
||||
delete ChunkOffsetAcross;
|
||||
delete referenceBatchRaw;
|
||||
delete secondaryBatchRaw;
|
||||
delete referenceChunkOverSampled;
|
||||
delete secondaryChunkOverSampled;
|
||||
delete referenceChunkOverSampler;
|
||||
delete secondaryChunkOverSampler;
|
||||
delete referenceChunk;
|
||||
delete secondaryChunk;
|
||||
delete corrChunk;
|
||||
delete offsetInit;
|
||||
delete zoomInOffset;
|
||||
delete offsetFinal;
|
||||
delete corrChunkZoomIn;
|
||||
delete corrChunkZoomInOverSampled;
|
||||
delete corrOverSampler;
|
||||
delete corrSincOverSampler;
|
||||
delete corrMaxValue;
|
||||
if(param->algorithm == 0)
|
||||
delete cuCorrFreqDomain;
|
||||
*/
|
||||
}
|
||||
|
||||
// end of file
|
||||
|
|
|
|||
|
|
@ -1,6 +1,8 @@
|
|||
/*
|
||||
* cuAmpcorChunk.h
|
||||
* Purpose: a group of chips processed at the same time
|
||||
* @file cuAmpcorChunk.h
|
||||
* @brief Ampcor processor for a batch of windows
|
||||
*
|
||||
*
|
||||
*/
|
||||
|
||||
#ifndef __CUAMPCORCHUNK_H
|
||||
|
|
@ -13,78 +15,82 @@
|
|||
#include "cuSincOverSampler.h"
|
||||
#include "cuCorrFrequency.h"
|
||||
|
||||
|
||||
/**
|
||||
* cuAmpcor processor for a chunk (a batch of windows)
|
||||
*/
|
||||
class cuAmpcorChunk{
|
||||
private:
|
||||
int idxChunkDown;
|
||||
int idxChunkAcross;
|
||||
int idxChunk;
|
||||
int nWindowsDown;
|
||||
int nWindowsAcross;
|
||||
int idxChunkDown; ///< index of the chunk in total batches, down
|
||||
int idxChunkAcross; ///< index of the chunk in total batches, across
|
||||
int idxChunk; ///< linear index of the chunk
|
||||
int nWindowsDown; ///< number of windows in one chunk, down
|
||||
int nWindowsAcross; ///< number of windows in one chunk, across
|
||||
|
||||
int devId;
|
||||
cudaStream_t stream;
|
||||
int devId; ///< GPU device ID to use
|
||||
cudaStream_t stream; ///< CUDA stream to use
|
||||
|
||||
GDALImage *referenceImage;
|
||||
GDALImage *secondaryImage;
|
||||
cuAmpcorParameter *param;
|
||||
cuArrays<float2> *offsetImage;
|
||||
cuArrays<float> *snrImage;
|
||||
cuArrays<float3> *covImage;
|
||||
GDALImage *referenceImage; ///< reference image object
|
||||
GDALImage *secondaryImage; ///< secondary image object
|
||||
cuAmpcorParameter *param; ///< reference to the (global) parameters
|
||||
cuArrays<float2> *offsetImage; ///< output offsets image
|
||||
cuArrays<float> *snrImage; ///< snr image
|
||||
cuArrays<float3> *covImage; ///< cov image
|
||||
|
||||
// gpu buffer
|
||||
// local variables and workers
|
||||
// gpu buffer to load images from file
|
||||
cuArrays<float2> * c_referenceChunkRaw, * c_secondaryChunkRaw;
|
||||
cuArrays<float> * r_referenceChunkRaw, * r_secondaryChunkRaw;
|
||||
|
||||
// gpu windows raw data
|
||||
// windows raw (not oversampled) data, complex and real
|
||||
cuArrays<float2> * c_referenceBatchRaw, * c_secondaryBatchRaw, * c_secondaryBatchZoomIn;
|
||||
cuArrays<float> * r_referenceBatchRaw, * r_secondaryBatchRaw;
|
||||
|
||||
// gpu windows oversampled data
|
||||
// windows oversampled data
|
||||
cuArrays<float2> * c_referenceBatchOverSampled, * c_secondaryBatchOverSampled;
|
||||
cuArrays<float> * r_referenceBatchOverSampled, * r_secondaryBatchOverSampled;
|
||||
cuArrays<float> * r_corrBatchRaw, * r_corrBatchZoomIn, * r_corrBatchZoomInOverSampled, * r_corrBatchZoomInAdjust;
|
||||
|
||||
// offset data
|
||||
cuArrays<int> *ChunkOffsetDown, *ChunkOffsetAcross;
|
||||
|
||||
// oversampling processors for complex images
|
||||
cuOverSamplerC2C *referenceBatchOverSampler, *secondaryBatchOverSampler;
|
||||
|
||||
// oversampling processor for correlation surface
|
||||
cuOverSamplerR2R *corrOverSampler;
|
||||
cuSincOverSamplerR2R *corrSincOverSampler;
|
||||
|
||||
//for frequency domain
|
||||
// cross-correlation processor with frequency domain algorithm
|
||||
cuFreqCorrelator *cuCorrFreqDomain, *cuCorrFreqDomain_OverSampled;
|
||||
|
||||
// save offset results in different stages
|
||||
cuArrays<int2> *offsetInit;
|
||||
cuArrays<int2> *offsetZoomIn;
|
||||
cuArrays<float2> *offsetFinal;
|
||||
cuArrays<int2> *maxLocShift; //record the maxloc from the extract center
|
||||
cuArrays<float> *corrMaxValue;
|
||||
|
||||
|
||||
//SNR estimation
|
||||
|
||||
cuArrays<float> *r_corrBatchRawZoomIn;
|
||||
cuArrays<float> *r_corrBatchSum;
|
||||
cuArrays<int> *i_corrBatchZoomInValid, *i_corrBatchValidCount;
|
||||
|
||||
cuArrays<float> *r_snrValue;
|
||||
|
||||
cuArrays<int2> *i_maxloc;
|
||||
cuArrays<float> *r_maxval;
|
||||
|
||||
// Varince estimation.
|
||||
// SNR estimation
|
||||
cuArrays<float> *r_corrBatchRawZoomIn;
|
||||
cuArrays<float> *r_corrBatchSum;
|
||||
cuArrays<int> *i_corrBatchZoomInValid, *i_corrBatchValidCount;
|
||||
cuArrays<float> *r_snrValue;
|
||||
|
||||
// Variance estimation.
|
||||
cuArrays<float3> *r_covValue;
|
||||
|
||||
public:
|
||||
cuAmpcorChunk() {}
|
||||
//cuAmpcorChunk(cuAmpcorParameter *param_, SlcImage *reference_, SlcImage *secondary_);
|
||||
// constructor
|
||||
cuAmpcorChunk(cuAmpcorParameter *param_,
|
||||
GDALImage *reference_, GDALImage *secondary_,
|
||||
cuArrays<float2> *offsetImage_, cuArrays<float> *snrImage_,
|
||||
cuArrays<float3> *covImage_, cudaStream_t stream_);
|
||||
|
||||
//
|
||||
void setIndex(int idxDown_, int idxAcross_);
|
||||
|
||||
cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, GDALImage *secondary_, cuArrays<float2> *offsetImage_,
|
||||
cuArrays<float> *snrImage_, cuArrays<float3> *covImage_, cudaStream_t stream_);
|
||||
|
||||
|
||||
void loadReferenceChunk();
|
||||
void loadSecondaryChunk();
|
||||
void getRelativeOffset(int *rStartPixel, const int *oStartPixel, int diff);
|
||||
|
|
|
|||
|
|
@ -1,6 +1,12 @@
|
|||
// Implementation of cuAmpcorController
|
||||
/**
|
||||
* @file cuAmpcorController.cu
|
||||
* @brief Implementations of cuAmpcorController
|
||||
*/
|
||||
|
||||
// my declaration
|
||||
#include "cuAmpcorController.h"
|
||||
|
||||
// dependencies
|
||||
#include "GDALImage.h"
|
||||
#include "cuArrays.h"
|
||||
#include "cudaUtil.h"
|
||||
|
|
@ -8,11 +14,27 @@
|
|||
#include "cuAmpcorUtil.h"
|
||||
#include <iostream>
|
||||
|
||||
cuAmpcorController::cuAmpcorController() { param = new cuAmpcorParameter();}
|
||||
cuAmpcorController::~cuAmpcorController() { delete param; }
|
||||
// constructor
|
||||
cuAmpcorController::cuAmpcorController()
|
||||
{
|
||||
// create a new set of parameters
|
||||
param = new cuAmpcorParameter();
|
||||
}
|
||||
|
||||
void cuAmpcorController::runAmpcor() {
|
||||
// destructor
|
||||
cuAmpcorController::~cuAmpcorController()
|
||||
{
|
||||
delete param;
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Run ampcor
|
||||
*
|
||||
*
|
||||
*/
|
||||
void cuAmpcorController::runAmpcor()
|
||||
{
|
||||
// set the gpu id
|
||||
param->deviceID = gpuDeviceInit(param->deviceID);
|
||||
// initialize the gdal driver
|
||||
|
|
@ -26,15 +48,11 @@ void cuAmpcorController::runAmpcor() {
|
|||
cuArrays<float> *snrImage, *snrImageRun;
|
||||
cuArrays<float3> *covImage, *covImageRun;
|
||||
|
||||
// For debugging.
|
||||
// cuArrays<int> *corrValidCountImage;
|
||||
// cuArrays<float> *corrSumImage;
|
||||
|
||||
// nWindowsDownRun is defined as numberChunk * numberWindowInChunk
|
||||
// It may be bigger than the actual number of windows
|
||||
int nWindowsDownRun = param->numberChunkDown * param->numberWindowDownInChunk;
|
||||
int nWindowsAcrossRun = param->numberChunkAcross * param->numberWindowAcrossInChunk;
|
||||
|
||||
//std::cout << "The number of windows to be processed (might be bigger) " << nWindowsDownRun << " x " << param->numberWindowDown << "\n";
|
||||
|
||||
offsetImageRun = new cuArrays<float2>(nWindowsDownRun, nWindowsAcrossRun);
|
||||
offsetImageRun->allocate();
|
||||
|
||||
|
|
@ -56,12 +74,17 @@ void cuAmpcorController::runAmpcor() {
|
|||
covImage = new cuArrays<float3>(param->numberWindowDown, param->numberWindowAcross);
|
||||
covImage->allocate();
|
||||
|
||||
// set up the cuda streams
|
||||
cudaStream_t streams[param->nStreams];
|
||||
cuAmpcorChunk *chunk[param->nStreams];
|
||||
// iterate over cuda streams
|
||||
for(int ist=0; ist<param->nStreams; ist++)
|
||||
{
|
||||
cudaStreamCreate(&streams[ist]);
|
||||
chunk[ist]= new cuAmpcorChunk(param, referenceImage, secondaryImage, offsetImageRun, snrImageRun, covImageRun,
|
||||
// create each stream
|
||||
checkCudaErrors(cudaStreamCreate(&streams[ist]));
|
||||
// create the chunk processor for each stream
|
||||
chunk[ist]= new cuAmpcorChunk(param, referenceImage, secondaryImage,
|
||||
offsetImageRun, snrImageRun, covImageRun,
|
||||
streams[ist]);
|
||||
|
||||
}
|
||||
|
|
@ -69,37 +92,43 @@ void cuAmpcorController::runAmpcor() {
|
|||
int nChunksDown = param->numberChunkDown;
|
||||
int nChunksAcross = param->numberChunkAcross;
|
||||
|
||||
std::cout << "Total number of windows (azimuth x range): " <<param->numberWindowDown << " x " << param->numberWindowAcross << std::endl;
|
||||
std::cout << "to be processed in the number of chunks: " <<nChunksDown << " x " << nChunksAcross << std::endl;
|
||||
// report info
|
||||
std::cout << "Total number of windows (azimuth x range): "
|
||||
<< param->numberWindowDown << " x " << param->numberWindowAcross
|
||||
<< std::endl;
|
||||
std::cout << "to be processed in the number of chunks: "
|
||||
<< nChunksDown << " x " << nChunksAcross << std::endl;
|
||||
|
||||
// iterate over chunks down
|
||||
for(int i = 0; i<nChunksDown; i++)
|
||||
{
|
||||
std::cout << "Processing chunk (" << i <<", x" << ")" << std::endl;
|
||||
std::cout << "Processing chunk (" << i <<", x" << ") out of " << nChunksDown << std::endl;
|
||||
// iterate over chunks across
|
||||
for(int j=0; j<nChunksAcross; j+=param->nStreams)
|
||||
{
|
||||
//std::cout << "Processing chunk(" << i <<", " << j <<")" << std::endl;
|
||||
// iterate over cuda streams to process chunks
|
||||
for(int ist = 0; ist < param->nStreams; ist++)
|
||||
{
|
||||
if(j+ist < nChunksAcross) {
|
||||
|
||||
chunk[ist]->run(i, j+ist);
|
||||
int chunkIdxAcross = j+ist;
|
||||
if(chunkIdxAcross < nChunksAcross) {
|
||||
chunk[ist]->run(i, chunkIdxAcross);
|
||||
}
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
// wait until all streams are done
|
||||
cudaDeviceSynchronize();
|
||||
|
||||
// Do extraction.
|
||||
// extraction of the run images to output images
|
||||
cuArraysCopyExtract(offsetImageRun, offsetImage, make_int2(0,0), streams[0]);
|
||||
cuArraysCopyExtract(snrImageRun, snrImage, make_int2(0,0), streams[0]);
|
||||
cuArraysCopyExtract(covImageRun, covImage, make_int2(0,0), streams[0]);
|
||||
|
||||
// save outputs to files
|
||||
offsetImage->outputToFile(param->offsetImageName, streams[0]);
|
||||
snrImage->outputToFile(param->snrImageName, streams[0]);
|
||||
covImage->outputToFile(param->covImageName, streams[0]);
|
||||
|
||||
|
||||
// also save the gross offsets
|
||||
outputGrossOffsets();
|
||||
|
||||
// Delete arrays.
|
||||
|
|
@ -112,13 +141,19 @@ void cuAmpcorController::runAmpcor() {
|
|||
delete covImageRun;
|
||||
|
||||
for (int ist=0; ist<param->nStreams; ist++)
|
||||
{
|
||||
checkCudaErrors(cudaStreamDestroy(streams[ist]));
|
||||
delete chunk[ist];
|
||||
}
|
||||
|
||||
delete referenceImage;
|
||||
delete secondaryImage;
|
||||
|
||||
}
|
||||
|
||||
/**
|
||||
* Output gross offset fields
|
||||
*/
|
||||
void cuAmpcorController::outputGrossOffsets()
|
||||
{
|
||||
cuArrays<float2> *grossOffsets = new cuArrays<float2>(param->numberWindowDown, param->numberWindowAcross);
|
||||
|
|
@ -130,72 +165,4 @@ void cuAmpcorController::outputGrossOffsets()
|
|||
delete grossOffsets;
|
||||
}
|
||||
|
||||
|
||||
/*
|
||||
void cuAmpcorController::setAlgorithm(int n) { param->algorithm = n; } // 0 - freq domain; 1 - time domain
|
||||
int cuAmpcorController::getAlgorithm() { return param->algorithm; }
|
||||
void cuAmpcorController::setDeviceID(int n) { param->deviceID = n; }
|
||||
int cuAmpcorController::getDeviceID() { return param->deviceID; }
|
||||
void cuAmpcorController::setNStreams(int n) { param->nStreams = n; }
|
||||
int cuAmpcorController::getNStreams() { return param->nStreams; }
|
||||
void cuAmpcorController::setWindowSizeHeight(int n) { param->windowSizeHeight = n; }
|
||||
int cuAmpcorController::getWindowSizeHeight() { return param->windowSizeHeight; }
|
||||
void cuAmpcorController::setWindowSizeWidth(int n) { param->windowSizeWidth = n; }
|
||||
int cuAmpcorController::getWindowSizeWidth() { return param->windowSizeWidth; }
|
||||
void cuAmpcorController::setSearchWindowSizeHeight(int n) { param->searchWindowSizeHeight = n; }
|
||||
int cuAmpcorController::getSearchWindowSizeHeight() { return param->windowSizeHeight; }
|
||||
void cuAmpcorController::setSearchWindowSizeWidth(int n) { param->searchWindowSizeWidth = n; }
|
||||
void cuAmpcorController::setRawOversamplingFactor(int n) { param->rawDataOversamplingFactor = n; }
|
||||
void cuAmpcorController::setZoomWindowSize(int n) { param->zoomWindowSize = n; }
|
||||
void cuAmpcorController::setOversamplingFactor(int n) { param->oversamplingFactor = n; }
|
||||
//void cuAmpcorController::setAcrossLooks(int n) { param->acrossLooks = n; } // 1 - single look
|
||||
//void cuAmpcorController::setDownLooks(int n) { param->downLooks = n; } // 1 - single look
|
||||
//void cuAmpcorController::setSkipSampleAcross(int n) { param->skipSampleAcross = n; }
|
||||
//void cuAmpcorController::setSkipSampleDown(int n) { param->skipSampleDown = n; }
|
||||
void cuAmpcorController::setSkipSampleAcrossRaw(int n) { param->skipSampleAcrossRaw = n; }
|
||||
int cuAmpcorController::getSkipSampleAcrossRaw() { return param->skipSampleAcrossRaw; }
|
||||
void cuAmpcorController::setSkipSampleDownRaw(int n) { param->skipSampleDownRaw = n; }
|
||||
int cuAmpcorController::getSkipSampleDownRaw() { return param->skipSampleDownRaw; }
|
||||
void cuAmpcorController::setNumberWindowDown(int n) { param->numberWindowDown = n; }
|
||||
void cuAmpcorController::setNumberWindowAcross(int n) { param->numberWindowAcross = n; }
|
||||
void cuAmpcorController::setNumberWindowDownInChunk(int n) { param->numberWindowDownInChunk = n; }
|
||||
void cuAmpcorController::setNumberWindowAcrossInChunk(int n) { param->numberWindowAcrossInChunk = n; }
|
||||
//void cuAmpcorController::setRangeSpacing1(float n) { param->rangeSpacing1 = n; } // deprecated
|
||||
//void cuAmpcorController::setRangeSpacing2(float n) { param->rangeSpacing2 = n; } // deprecated
|
||||
//void cuAmpcorController::setImageDatatype1(int n) { param->imageDataType1 = n; }
|
||||
//void cuAmpcorController::setImageDatatype2(int n) { param->imageDataType2 = n; }
|
||||
void cuAmpcorController::setThresholdSNR(float n) { param->thresholdSNR = n; } // deprecated(?)
|
||||
//void cuAmpcorController::setThresholdCov(float n) { param->thresholdCov = n; } // deprecated(?)
|
||||
//void cuAmpcorController::setBand1(int n) { param->band1 = n; }
|
||||
//void cuAmpcorController::setBand2(int n) { param->band2 = n; }
|
||||
void cuAmpcorController::setReferenceImageName(std::string s) { param->referenceImageName = s; }
|
||||
std::string cuAmpcorController::getReferenceImageName() {return param->referenceImageName;}
|
||||
void cuAmpcorController::setSecondaryImageName(std::string s) { param->secondaryImageName = s; }
|
||||
std::string cuAmpcorController::getSecondaryImageName() {return param->secondaryImageName;}
|
||||
void cuAmpcorController::setReferenceImageWidth(int n) { param->referenceImageWidth = n; }
|
||||
void cuAmpcorController::setReferenceImageHeight(int n) { param->referenceImageHeight = n; }
|
||||
void cuAmpcorController::setSecondaryImageWidth(int n) { param->secondaryImageWidth = n; }
|
||||
void cuAmpcorController::setSecondaryImageHeight(int n) { param->secondaryImageHeight = n; }
|
||||
//void cuAmpcorController::setReferenceStartPixelAcross(int n) { param->referenceStartPixelAcross = n; }
|
||||
//void cuAmpcorController::setReferenceStartPixelDown(int n) { param->referenceStartPixelDown = n; }
|
||||
//void cuAmpcorController::setSecondaryStartPixelAcross(int n) { param->secondaryStartPixelAcross = n; }
|
||||
//void cuAmpcorController::setSecondaryStartPixelDown(int n) { param->secondaryStartPixelDown = n; }
|
||||
//void cuAmpcorController::setGrossOffsetMethod(int n) { param->grossOffsetMethod = n; }
|
||||
//int cuAmpcorController::getGrossOffsetMethod() { return param->grossOffsetMethod; }
|
||||
//void cuAmpcorController::setAcrossGrossOffset(int n) { param->acrossGrossOffset = n; }
|
||||
//void cuAmpcorController::setDownGrossOffset(int n) { param->downGrossOffset = n; }
|
||||
//int* cuAmpcorController::getGrossOffsets() {return param->grossOffsets;}
|
||||
|
||||
void cuAmpcorController::setGrossOffsets(int *in, int size) {
|
||||
assert(size = 2*param->numberWindowAcross*param->numberWindowDown);
|
||||
if (param->grossOffsets == NULL)
|
||||
param->grossOffsets = (int *)malloc(size*sizeof(int));
|
||||
mempcpy(param->grossOffsets, in, size*sizeof(int));
|
||||
fprintf(stderr, "copy grossOffsets %d\n", size);
|
||||
}
|
||||
void cuAmpcorController::setOffsetImageName(std::string s) { param->offsetImageName = s; }
|
||||
void cuAmpcorController::setSNRImageName(std::string s) { param->snrImageName = s; }
|
||||
//void cuAmpcorController::setMargin(int n) { param->margin = n; }
|
||||
void cuAmpcorController::setDerampMethod(int n) { param->derampMethod = n; }
|
||||
int cuAmpcorController::getDerampMethod() { return param->derampMethod; }
|
||||
*/
|
||||
// end of file
|
||||
|
|
|
|||
|
|
@ -1,115 +1,35 @@
|
|||
/**
|
||||
* cuAmpcorController.h
|
||||
* Header file for the controller class (interfaces to Python/Cython)
|
||||
* @file cuAmpcorController.h
|
||||
* @brief The controller for running cuAmpcor
|
||||
*
|
||||
* cuAmpcorController is the main processor, and also the interface to Python.
|
||||
* It determines the total number of windows and the starting pixels for each window.
|
||||
* It then divides windows into chunks (batches), and creates cuAmpcorChunk instances
|
||||
* to process each chunk.
|
||||
* A chunk includes multiple windows, to maximize the use of GPU cores.
|
||||
* Different cuAmpcorChunk processors use different cuda streams, to overlap
|
||||
* the kernel execution with data copying.
|
||||
*/
|
||||
|
||||
// code guard
|
||||
#ifndef CU_AMPCOR_CONTROLLER_H
|
||||
#define CU_AMPCOR_CONTROLLER_H
|
||||
|
||||
// dependencies
|
||||
#include "cuAmpcorParameter.h"
|
||||
#include <cstring>
|
||||
|
||||
class cuAmpcorController {
|
||||
public:
|
||||
cuAmpcorParameter *param;
|
||||
cuAmpcorParameter *param; ///< the parameter set
|
||||
// constructor
|
||||
cuAmpcorController();
|
||||
// destructor
|
||||
~cuAmpcorController();
|
||||
// run interface
|
||||
void runAmpcor();
|
||||
// output gross offsets
|
||||
void outputGrossOffsets();
|
||||
/*
|
||||
void setAlgorithm(int);
|
||||
int getAlgorithm();
|
||||
void setDeviceID(int);
|
||||
int getDeviceID();
|
||||
void setNStreams(int);
|
||||
int getNStreams();
|
||||
void setWindowSizeHeight(int);
|
||||
int getWindowSizeHeight();
|
||||
void setWindowSizeWidth(int);
|
||||
int getWindowSizeWidth();
|
||||
void setSearchWindowSizeHeight(int);
|
||||
int getSearchWindowSizeHeight();
|
||||
void setSearchWindowSizeWidth(int);
|
||||
int getSearchWindowSizeWidth();
|
||||
void setRawOversamplingFactor(int);
|
||||
int getRawOversamplingFactor();
|
||||
void setZoomWindowSize(int);
|
||||
int getZoomWindowSize();
|
||||
void setOversamplingFactor(int);
|
||||
int getOversamplingFactor();
|
||||
void setAcrossLooks(int);
|
||||
int getAcrossLooks();
|
||||
void setDownLooks(int);
|
||||
int getDownLooks();
|
||||
void setSkipSampleAcrossRaw(int);
|
||||
int getSkipSampleAcrossRaw();
|
||||
void setSkipSampleDownRaw(int);
|
||||
int getSkipSampleDownRaw();
|
||||
void setNumberWindowMethod(int);
|
||||
int getNumberWindowMethod();
|
||||
void setNumberWindowDown(int);
|
||||
int getNumberWindowDown();
|
||||
void setNumberWindowAcross(int);
|
||||
int getNumberWindowAcross();
|
||||
void setNumberWindowDownInChunk(int);
|
||||
int getNumberWindowDownInChunk();
|
||||
void setNumberWindowAcrossInChunk(int);
|
||||
int getNumberWindowAcrossInChunk();
|
||||
void setRangeSpacing1(float);
|
||||
float getRangeSpacing1();
|
||||
void setRangeSpacing2(float);
|
||||
float getRangeSpacing2();
|
||||
void setImageDatatype1(int);
|
||||
int getImageDatatype1();
|
||||
void setImageDatatype2(int);
|
||||
int getImageDatatype2();
|
||||
void setThresholdSNR(float);
|
||||
float getThresholdSNR();
|
||||
void setThresholdCov(float);
|
||||
float getThresholdCov();
|
||||
void setBand1(int);
|
||||
int getBand1();
|
||||
void setBand2(int);
|
||||
int getBand2();
|
||||
void setReferenceImageName(std::string);
|
||||
std::string getReferenceImageName();
|
||||
void setSecondaryImageName(std::string);
|
||||
std::string getSecondaryImageName();
|
||||
void setReferenceImageWidth(int);
|
||||
int getReferenceImageWidth();
|
||||
void setReferenceImageHeight(int);
|
||||
int getReferenceImageHeight();
|
||||
void setSecondaryImageWidth(int);
|
||||
int getSecondaryImageWidth();
|
||||
void setSecondaryImageHeight(int);
|
||||
int getSecondaryImageHeight();
|
||||
void setStartPixelMethod(int);
|
||||
int getStartPixelMethod();
|
||||
void setReferenceStartPixelAcross(int);
|
||||
int getReferenceStartPixelAcross();
|
||||
void setReferenceStartPixelDown(int);
|
||||
int getReferenceStartPixelDown();
|
||||
void setSecondaryStartPixelAcross(int);
|
||||
int getSecondaryStartPixelAcross();
|
||||
void setSecondaryStartPixelDown(int);
|
||||
int getSecondaryStartPixelDown();
|
||||
void setGrossOffsetMethod(int);
|
||||
int getGrossOffsetMethod();
|
||||
void setAcrossGrossOffset(int);
|
||||
int getAcrossGrossOffset();
|
||||
void setDownGrossOffset(int);
|
||||
int getDownGrossOffset();
|
||||
void setGrossOffsets(int *, int);
|
||||
int* getGrossOffsets();
|
||||
void setOffsetImageName(std::string);
|
||||
std::string getOffsetImageName();
|
||||
void setSNRImageName(std::string);
|
||||
std::string getSNRImageName();
|
||||
//void setMargin(int);
|
||||
//int getMargin();
|
||||
//void setNumberThreads(int);
|
||||
void setDerampMethod(int);
|
||||
int getDerampMethod();*/
|
||||
};
|
||||
#endif
|
||||
|
||||
// end of file
|
||||
|
|
|
|||
|
|
@ -1,5 +1,5 @@
|
|||
/**
|
||||
* cuAmpcorParameter.cu
|
||||
* @file cuAmpcorParameter.cu
|
||||
* Input parameters for ampcor
|
||||
*/
|
||||
|
||||
|
|
@ -100,10 +100,6 @@ void cuAmpcorParameter::setupParameters()
|
|||
exit(EXIT_FAILURE);
|
||||
}
|
||||
|
||||
// modified 02/12/2018 to include one more chunk
|
||||
// e.g. numberWindowDown=102, numberWindowDownInChunk=10, results in numberChunkDown=11
|
||||
// the last chunk will include 2 windows, numberWindowDownInChunkRun = 2.
|
||||
|
||||
numberChunkDown = IDIVUP(numberWindowDown, numberWindowDownInChunk);
|
||||
numberChunkAcross = IDIVUP(numberWindowAcross, numberWindowAcrossInChunk);
|
||||
numberChunks = numberChunkDown*numberChunkAcross;
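For reference, IDIVUP is the usual round-up integer division (its actual definition lives elsewhere in the sources; the macro below is a stand-in), so the counts work out as in the comment above:

```cpp
#include <cstdio>

// stand-in for the IDIVUP macro used above (round-up integer division)
#define IDIVUP(a, b) (((a) + (b) - 1) / (b))

int main()
{
    int numberWindowDown = 102, numberWindowDownInChunk = 10;
    int numberChunkDown = IDIVUP(numberWindowDown, numberWindowDownInChunk);               // 11 chunks
    int lastChunkRun = numberWindowDown - numberWindowDownInChunk * (numberChunkDown - 1); // 2 windows
    std::printf("%d chunks down, last chunk runs %d windows\n", numberChunkDown, lastChunkRun);
    return 0;
}
```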
|
||||
|
|
@ -168,6 +164,7 @@ void cuAmpcorParameter::setStartPixels(int *mStartD, int *mStartA, int *gOffsetD
|
|||
setChunkStartPixels();
|
||||
}
|
||||
|
||||
/// set starting pixels for each window with a varying gross offset
|
||||
void cuAmpcorParameter::setStartPixels(int mStartD, int mStartA, int *gOffsetD, int *gOffsetA)
|
||||
{
|
||||
for(int row=0; row<numberWindowDown; row++)
|
||||
|
|
@ -186,9 +183,9 @@ void cuAmpcorParameter::setStartPixels(int mStartD, int mStartA, int *gOffsetD,
|
|||
setChunkStartPixels();
|
||||
}
|
||||
|
||||
/// set starting pixels for each window with a constant gross offset
|
||||
void cuAmpcorParameter::setStartPixels(int mStartD, int mStartA, int gOffsetD, int gOffsetA)
|
||||
{
|
||||
//fprintf(stderr, "set start pixels %d %d %d %d\n", mStartD, mStartA, gOffsetD, gOffsetA);
|
||||
for(int row=0; row<numberWindowDown; row++)
|
||||
{
|
||||
for(int col = 0; col < numberWindowAcross; col++)
|
||||
|
|
@ -205,6 +202,7 @@ void cuAmpcorParameter::setStartPixels(int mStartD, int mStartA, int gOffsetD, i
|
|||
setChunkStartPixels();
|
||||
}
|
||||
|
||||
/// set starting pixels for each chunk
|
||||
void cuAmpcorParameter::setChunkStartPixels()
|
||||
{
|
||||
|
||||
|
|
@ -228,7 +226,6 @@ void cuAmpcorParameter::setChunkStartPixels()
|
|||
int sChunkED = 0;
|
||||
int sChunkEA = 0;
|
||||
|
||||
// modified 02/12/2018
|
||||
int numberWindowDownInChunkRun = numberWindowDownInChunk;
|
||||
int numberWindowAcrossInChunkRun = numberWindowAcrossInChunk;
|
||||
// modify the number of windows in last chunk
|
||||
|
|
@ -237,7 +234,6 @@ void cuAmpcorParameter::setChunkStartPixels()
|
|||
if(jchunk == numberChunkAcross -1)
|
||||
numberWindowAcrossInChunkRun = numberWindowAcross - numberWindowAcrossInChunk*(numberChunkAcross -1);
|
||||
|
||||
|
||||
for(int i=0; i<numberWindowDownInChunkRun; i++)
|
||||
{
|
||||
for(int j=0; j<numberWindowAcrossInChunkRun; j++)
|
||||
|
|
@ -337,3 +333,4 @@ cuAmpcorParameter::~cuAmpcorParameter()
|
|||
{
|
||||
deallocateArrays();
|
||||
}
|
||||
// end of file
|
||||
|
|
@ -1,9 +1,9 @@
|
|||
/**
|
||||
* cuAmpcorParameter.h
|
||||
* Header file for Ampcor Parameter Class
|
||||
* @file cuAmpcorParameter.h
|
||||
* @brief A class holds cuAmpcor process parameters
|
||||
*
|
||||
* Author: Lijun Zhu @ Seismo Lab, Caltech
|
||||
* March 2017
|
||||
* March 2017; last modified October 2020
|
||||
*/
|
||||
|
||||
#ifndef __CUAMPCORPARAMETER_H
|
||||
|
|
@ -29,129 +29,132 @@
|
|||
/// 4a. Optionally, check that the range of windows is within the SLC image range: param->checkPixelInImageRange()
|
||||
/// Steps 1, 3, 4 are mandatory. If step 2 is missing, default values will be used
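A sketch of that call sequence (the member and method names follow the declarations below; the numeric values are placeholders, and the fact that this is normally driven from the Cython/Python layer is glossed over):

```cpp
#include "cuAmpcorParameter.h"

void exampleParameterSetup()
{
    // 1. create an instance; the constructor fills in default values
    cuAmpcorParameter param;

    // 2. override whatever defaults need changing (illustrative numbers)
    param.windowSizeHeightRaw = 64;
    param.windowSizeWidthRaw  = 64;
    param.halfSearchRangeDownRaw   = 20;
    param.halfSearchRangeAcrossRaw = 20;
    param.numberWindowDown   = 40;
    param.numberWindowAcross = 40;
    param.referenceImageHeight = 10000;
    param.referenceImageWidth  = 10000;
    param.secondaryImageHeight = 10000;
    param.secondaryImageWidth  = 10000;

    // 3. derive the remaining parameters (oversampled sizes, chunk counts, ...)
    param.setupParameters();

    // 4. set per-window start pixels: fixed reference start, constant gross offset
    param.setStartPixels(/*mStartD=*/0, /*mStartA=*/0, /*gOffsetD=*/0, /*gOffsetA=*/0);

    // 4a. optionally verify that all windows fall inside both SLC images
    param.checkPixelInImageRange();
}
```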
|
||||
|
||||
|
||||
class cuAmpcorParameter{
|
||||
public:
|
||||
int algorithm; /// Cross-correlation algorithm: 0=freq domain (default) 1=time domain
|
||||
int deviceID; /// Targeted GPU device ID: use -1 to auto select
|
||||
int nStreams; /// Number of streams to asynchonize data transfers and compute kernels
|
||||
int derampMethod; /// Method for deramping 0=None, 1=average, 2=phase gradient
|
||||
int algorithm; ///< Cross-correlation algorithm: 0=freq domain (default) 1=time domain
|
||||
int deviceID; ///< Targeted GPU device ID: use -1 to auto select
|
||||
int nStreams; ///< Number of CUDA streams used to overlap data transfers with compute kernels
|
||||
int derampMethod; ///< Method for deramping 0=None, 1=average
|
||||
|
||||
// chip or window size for raw data
|
||||
int windowSizeHeightRaw; /// Template window height (original size)
|
||||
int windowSizeWidthRaw; /// Template window width (original size)
|
||||
int searchWindowSizeHeightRaw; /// Search window height (original size)
|
||||
int searchWindowSizeWidthRaw; /// Search window width (orignal size)
|
||||
int windowSizeHeightRaw; ///< Template window height (original size)
|
||||
int windowSizeWidthRaw; ///< Template window width (original size)
|
||||
int searchWindowSizeHeightRaw; ///< Search window height (original size)
|
||||
int searchWindowSizeWidthRaw; ///< Search window width (original size)
|
||||
|
||||
int halfSearchRangeDownRaw; ///(searchWindowSizeHeightRaw-windowSizeHeightRaw)/2
|
||||
int halfSearchRangeAcrossRaw; ///(searchWindowSizeWidthRaw-windowSizeWidthRaw)/2
|
||||
int halfSearchRangeDownRaw; ///< (searchWindowSizeHeightRaw-windowSizeHeightRaw)/2
|
||||
int halfSearchRangeAcrossRaw; ///< (searchWindowSizeWidthRaw-windowSizeWidthRaw)/2
|
||||
// search range is (-halfSearchRangeRaw, halfSearchRangeRaw)
|
||||
|
||||
int searchWindowSizeHeightRawZoomIn;
|
||||
int searchWindowSizeWidthRawZoomIn;
|
||||
int searchWindowSizeHeightRawZoomIn; ///< search window height used for zoom in
|
||||
int searchWindowSizeWidthRawZoomIn; ///< search window width used for zoom in
|
||||
|
||||
int corrStatWindowSize; /// window to estimate snr
|
||||
int corrRawZoomInHeight;
|
||||
int corrRawZoomInWidth;
|
||||
int corrStatWindowSize; ///< correlation surface size used to estimate snr
|
||||
int corrRawZoomInHeight; ///< correlation surface height used for oversampling
|
||||
int corrRawZoomInWidth; ///< correlation surface width used for oversampling
|
||||
|
||||
// chip or window size after oversampling
|
||||
int rawDataOversamplingFactor; /// Raw data overampling factor (from original size to oversampled size)
|
||||
int windowSizeHeight; /// Template window length (oversampled size)
|
||||
int windowSizeWidth; /// Template window width (original size)
|
||||
int searchWindowSizeHeight; /// Search window height (oversampled size)
|
||||
int searchWindowSizeWidth; /// Search window width (oversampled size)
|
||||
int rawDataOversamplingFactor; ///< Raw data oversampling factor (from original size to oversampled size)
|
||||
int windowSizeHeight; ///< Template window height (oversampled size)
|
||||
int windowSizeWidth; ///< Template window width (oversampled size)
|
||||
int searchWindowSizeHeight; ///< Search window height (oversampled size)
|
||||
int searchWindowSizeWidth; ///< Search window width (oversampled size)
|
||||
|
||||
// strides between chips/windows
|
||||
int skipSampleDownRaw; /// Skip size between neighboring windows in Down direction (original size)
|
||||
int skipSampleAcrossRaw; /// Skip size between neighboring windows in across direction (original size)
|
||||
//int skipSampleDown; /// Skip size between neighboring windows in Down direction (oversampled size)
|
||||
//int skipSampleAcross; /// Skip size between neighboring windows in Across direction (oversampled size)
|
||||
int skipSampleDownRaw; ///< Skip size between neighboring windows in Down direction (original size)
|
||||
int skipSampleAcrossRaw; ///< Skip size between neighboring windows in across direction (original size)
|
||||
|
||||
// Zoom in region near location of max correlation
|
||||
int zoomWindowSize; /// Zoom-in window size in correlation surface (same for down and across directions)
|
||||
int halfZoomWindowSizeRaw; /// = half of zoomWindowSize/rawDataOversamplingFactor
|
||||
int zoomWindowSize; ///< Zoom-in window size in correlation surface (same for down and across directions)
|
||||
int halfZoomWindowSizeRaw; ///< half of zoomWindowSize/rawDataOversamplingFactor
|
||||
|
||||
int oversamplingFactor; ///< Oversampling factor for interpolating correlation surface
|
||||
int oversamplingMethod; ///< correlation surface oversampling method 0 = fft (default) 1 = sinc
|
||||
|
||||
|
||||
|
||||
int oversamplingFactor; /// Oversampling factor for interpolating correlation surface
|
||||
int oversamplingMethod; /// 0 = fft (default) 1 = sinc
|
||||
|
||||
|
||||
float thresholdSNR; /// Threshold of Signal noise ratio to remove noisy data
|
||||
float thresholdSNR; ///< Signal-to-noise ratio threshold used to remove noisy data
|
||||
|
||||
//reference image
|
||||
std::string referenceImageName; /// reference SLC image name
|
||||
int imageDataType1; /// reference image data type, 2=cfloat=complex=float2 1=float
|
||||
int referenceImageHeight; /// reference image height
|
||||
int referenceImageWidth; /// reference image width
|
||||
std::string referenceImageName; ///< reference SLC image name
|
||||
int imageDataType1; ///< reference image data type, 2=cfloat=complex=float2 1=float
|
||||
int referenceImageHeight; ///< reference image height
|
||||
int referenceImageWidth; ///< reference image width
|
||||
|
||||
//secondary image
|
||||
std::string secondaryImageName; /// secondary SLC image name
|
||||
int imageDataType2; /// secondary image data type, 2=cfloat=complex=float2 1=float
|
||||
int secondaryImageHeight; /// secondary image height
|
||||
int secondaryImageWidth; /// secondary image width
|
||||
std::string secondaryImageName; ///< secondary SLC image name
|
||||
int imageDataType2; ///< secondary image data type, 2=cfloat=complex=float2 1=float
|
||||
int secondaryImageHeight; ///< secondary image height
|
||||
int secondaryImageWidth; ///< secondary image width
|
||||
|
||||
// total number of chips/windows
|
||||
int numberWindowDown; /// number of total windows (down)
|
||||
int numberWindowAcross; /// number of total windows (across)
|
||||
int numberWindows; /// numberWindowDown*numberWindowAcross
|
||||
int numberWindowDown; ///< number of total windows (down)
|
||||
int numberWindowAcross; ///< number of total windows (across)
|
||||
int numberWindows; ///< numberWindowDown*numberWindowAcross
|
||||
|
||||
// number of chips/windows in a batch/chunk
|
||||
int numberWindowDownInChunk; /// number of windows processed in a chunk (down)
|
||||
int numberWindowAcrossInChunk; /// number of windows processed in a chunk (across)
|
||||
int numberWindowsInChunk; /// numberWindowDownInChunk*numberWindowAcrossInChunk
|
||||
int numberChunkDown; /// number of chunks (down)
|
||||
int numberChunkAcross; /// number of chunks (across)
|
||||
int numberChunks;
|
||||
int numberWindowDownInChunk; ///< number of windows processed in a chunk (down)
|
||||
int numberWindowAcrossInChunk; ///< number of windows processed in a chunk (across)
|
||||
int numberWindowsInChunk; ///< numberWindowDownInChunk*numberWindowAcrossInChunk
|
||||
int numberChunkDown; ///< number of chunks (down)
|
||||
int numberChunkAcross; ///< number of chunks (across)
|
||||
int numberChunks; ///< total number of chunks
|
||||
|
||||
int useMmap; /// whether to use mmap 0=not 1=yes (default = 0)
|
||||
int mmapSizeInGB; /// size for mmap buffer(useMmap=1) or a cpu memory buffer (useMmap=0)
|
||||
int useMmap; ///< whether to use mmap 0=not 1=yes (default = 0)
|
||||
int mmapSizeInGB; ///< size in GB for the mmap buffer (useMmap=1) or the CPU memory buffer (useMmap=0)
|
||||
|
||||
int referenceStartPixelDown0;
|
||||
int referenceStartPixelAcross0;
|
||||
int *referenceStartPixelDown; /// reference starting pixels for each window (down)
|
||||
int *referenceStartPixelAcross;/// reference starting pixels for each window (across)
|
||||
int *secondaryStartPixelDown; /// secondary starting pixels for each window (down)
|
||||
int *secondaryStartPixelAcross; /// secondary starting pixels for each window (across)
|
||||
int grossOffsetDown0;
|
||||
int grossOffsetAcross0;
|
||||
int *grossOffsetDown; /// Gross offsets between reference and secondary windows (down) : secondaryStartPixel - referenceStartPixel
|
||||
int *grossOffsetAcross; /// Gross offsets between reference and secondary windows (across)
|
||||
int referenceStartPixelDown0; ///< first starting pixel in reference image (down)
|
||||
int referenceStartPixelAcross0; ///< first starting pixel in reference image (across)
|
||||
int *referenceStartPixelDown; ///< reference starting pixels for each window (down)
|
||||
int *referenceStartPixelAcross; ///< reference starting pixels for each window (across)
|
||||
int *secondaryStartPixelDown; ///< secondary starting pixels for each window (down)
|
||||
int *secondaryStartPixelAcross; ///< secondary starting pixels for each window (across)
|
||||
int grossOffsetDown0; ///< gross offset static component (down)
|
||||
int grossOffsetAcross0; ///< gross offset static component (across)
|
||||
int *grossOffsetDown; ///< Gross offsets between reference and secondary windows (down)
|
||||
int *grossOffsetAcross; ///< Gross offsets between reference and secondary windows (across)
|
||||
|
||||
int *referenceChunkStartPixelDown;
|
||||
int *referenceChunkStartPixelAcross;
|
||||
int *secondaryChunkStartPixelDown;
|
||||
int *secondaryChunkStartPixelAcross;
|
||||
int *referenceChunkHeight;
|
||||
int *referenceChunkWidth;
|
||||
int *secondaryChunkHeight;
|
||||
int *secondaryChunkWidth;
|
||||
int maxReferenceChunkHeight, maxReferenceChunkWidth;
|
||||
int maxSecondaryChunkHeight, maxSecondaryChunkWidth;
|
||||
int *referenceChunkStartPixelDown; ///< reference starting pixels for each chunk (down)
|
||||
int *referenceChunkStartPixelAcross; ///< reference starting pixels for each chunk (across)
|
||||
int *secondaryChunkStartPixelDown; ///< secondary starting pixels for each chunk (down)
|
||||
int *secondaryChunkStartPixelAcross; ///< secondary starting pixels for each chunk (across)
|
||||
int *referenceChunkHeight; ///< reference chunk height
|
||||
int *referenceChunkWidth; ///< reference chunk width
|
||||
int *secondaryChunkHeight; ///< secondary chunk height
|
||||
int *secondaryChunkWidth; ///< secondary chunk width
|
||||
int maxReferenceChunkHeight, maxReferenceChunkWidth; ///< max reference chunk size
|
||||
int maxSecondaryChunkHeight, maxSecondaryChunkWidth; ///< max secondary chunk size
|
||||
|
||||
std::string grossOffsetImageName;
|
||||
std::string offsetImageName; /// Output Offset fields filename
|
||||
std::string snrImageName; /// Output SNR filename
|
||||
std::string covImageName;
|
||||
std::string grossOffsetImageName; ///< gross offset output filename
|
||||
std::string offsetImageName; ///< Offset fields output filename
|
||||
std::string snrImageName; ///< Output SNR filename
|
||||
std::string covImageName; ///< Output variance filename
|
||||
|
||||
cuAmpcorParameter(); /// Class constructor and default parameters setter
|
||||
~cuAmpcorParameter(); /// Class descontructor
|
||||
// Class constructor and default parameters setter
|
||||
cuAmpcorParameter();
|
||||
// Class destructor
|
||||
~cuAmpcorParameter();
|
||||
|
||||
void allocateArrays(); /// Allocate various arrays after the number of Windows is given
|
||||
void deallocateArrays(); /// Deallocate arrays on exit
|
||||
// Allocate various arrays after the number of windows is given
|
||||
void allocateArrays();
|
||||
// Deallocate arrays on exit
|
||||
void deallocateArrays();
|
||||
|
||||
|
||||
/// Three methods to set reference/secondary starting pixels and gross offsets from input reference start pixel(s) and gross offset(s)
|
||||
/// 1 (int *, int *, int *, int *): varying reference start pixels and gross offsets
|
||||
/// 2 (int, int, int *, int *): fixed reference start pixel (first window) and varying gross offsets
|
||||
/// 3 (int, int, int, int): fixed reference start pixel(first window) and fixed gross offsets
|
||||
// Three methods to set reference/secondary starting pixels and gross offsets from input reference start pixel(s) and gross offset(s)
|
||||
// 1 (int *, int *, int *, int *): varying reference start pixels and gross offsets
|
||||
// 2 (int, int, int *, int *): fixed reference start pixel (first window) and varying gross offsets
|
||||
// 3 (int, int, int, int): fixed reference start pixel (first window) and fixed gross offsets
|
||||
void setStartPixels(int*, int*, int*, int*);
|
||||
void setStartPixels(int, int, int*, int*);
|
||||
void setStartPixels(int, int, int, int);
|
||||
// set starting pixels for each chunk
|
||||
void setChunkStartPixels();
|
||||
void checkPixelInImageRange(); /// check whether
|
||||
void setupParameters(); /// Process other parameters after Python Input
|
||||
// check whether all chunks/windows are within the image range
|
||||
void checkPixelInImageRange();
|
||||
// Process other parameters after Python Input
|
||||
void setupParameters();
|
||||
|
||||
};
|
||||
|
||||
#endif
|
||||
#endif //__CUAMPCORPARAMETER_H
|
||||
//end of file
|
||||
|
|
|
|||
|
|
@ -1,10 +1,11 @@
|
|||
/*
|
||||
* cuAmpcorUtil.h
|
||||
* header file to include the various routines for ampcor
|
||||
* serves as an index
|
||||
* @file cuAmpcorUtil.h
|
||||
* @brief Header file to include various routines for cuAmpcor
|
||||
*
|
||||
*
|
||||
*/
|
||||
|
||||
|
||||
// code guard
|
||||
#ifndef __CUAMPCORUTIL_H
|
||||
#define __CUAMPCORUTIL_H
|
||||
|
||||
|
|
@ -16,7 +17,7 @@
|
|||
#include "float2.h"
|
||||
|
||||
|
||||
//in cuArraysCopy.cu: various utitlies for copy images file in gpu memory
|
||||
//in cuArraysCopy.cu: various utilities for copying images in GPU memory
|
||||
void cuArraysCopyToBatch(cuArrays<float2> *image1, cuArrays<float2> *image2, int strideH, int strideW, cudaStream_t stream);
|
||||
void cuArraysCopyToBatchWithOffset(cuArrays<float2> *image1, const int lda1, cuArrays<float2> *image2,
|
||||
const int *offsetH, const int* offsetW, cudaStream_t stream);
|
||||
|
|
@ -27,6 +28,7 @@ void cuArraysCopyToBatchWithOffsetR2C(cuArrays<float> *image1, const int lda1, c
|
|||
void cuArraysCopyC2R(cuArrays<float2> *image1, cuArrays<float> *image2, int strideH, int strideW, cudaStream_t stream);
|
||||
|
||||
// same routine name overloaded for different data type
|
||||
// extract data from a large image
|
||||
void cuArraysCopyExtract(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut, cuArrays<int2> *offset, cudaStream_t stream);
|
||||
void cuArraysCopyExtract(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut, int2 offset, cudaStream_t stream);
|
||||
void cuArraysCopyExtract(cuArrays<float2> *imagesIn, cuArrays<float> *imagesOut, int2 offset, cudaStream_t stream);
|
||||
|
|
@ -39,26 +41,23 @@ void cuArraysCopyInsert(cuArrays<float3> *imageIn, cuArrays<float3> *imageOut, i
|
|||
void cuArraysCopyInsert(cuArrays<float> *imageIn, cuArrays<float> *imageOut, int offsetX, int offsetY, cudaStream_t stream);
|
||||
void cuArraysCopyInsert(cuArrays<int> *imageIn, cuArrays<int> *imageOut, int offsetX, int offsetY, cudaStream_t stream);
|
||||
|
||||
void cuArraysCopyInversePadded(cuArrays<float> *imageIn, cuArrays<float> *imageOut,cudaStream_t stream);
|
||||
|
||||
void cuArraysCopyPadded(cuArrays<float> *imageIn, cuArrays<float> *imageOut,cudaStream_t stream);
|
||||
void cuArraysCopyPadded(cuArrays<float> *imageIn, cuArrays<float2> *imageOut,cudaStream_t stream);
|
||||
void cuArraysCopyPadded(cuArrays<float2> *imageIn, cuArrays<float2> *imageOut,cudaStream_t stream);
|
||||
void cuArraysSetConstant(cuArrays<float> *imageIn, float value, cudaStream_t stream);
|
||||
|
||||
//in cuDeramp.cu: deramping phase
|
||||
void cuDeramp(int method, cuArrays<float2> *images, cudaStream_t stream);
|
||||
void cuDerampMethod1(cuArrays<float2> *images, cudaStream_t stream);
|
||||
void cuDerampMethod2(cuArrays<float2> *images, cudaStream_t stream);
|
||||
void cpuDerampMethod3(cuArrays<float2> *images, cudaStream_t stream);
|
||||
|
||||
//in cuArraysPadding.cu: various utilities for oversampling padding
|
||||
void cuArraysPadding(cuArrays<float2> *image1, cuArrays<float2> *image2, cudaStream_t stream);
|
||||
void cuArraysPaddingMany(cuArrays<float2> *image1, cuArrays<float2> *image2, cudaStream_t stream);
|
||||
void cuArraysR2C(cuArrays<float> *image1, cuArrays<float2> *image2, cudaStream_t stream);
|
||||
void cuArraysC2R(cuArrays<float2> *image1, cuArrays<float> *image2, cudaStream_t stream);
|
||||
void cuArraysAbs(cuArrays<float2> *image1, cuArrays<float> *image2, cudaStream_t stream);
|
||||
|
||||
// cuDeramp.cu: deramping phase
|
||||
void cuDeramp(int method, cuArrays<float2> *images, cudaStream_t stream);
|
||||
void cuDerampMethod1(cuArrays<float2> *images, cudaStream_t stream);
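For context, "average" deramping usually means estimating a window's mean phase gradient and removing the corresponding linear ramp before oversampling; the host-side sketch below shows that common recipe. Whether cuDerampMethod1 implements exactly this variant is not shown in this header, so treat the code as an assumption.

```cpp
#include <cuda_runtime.h>   // float2, make_float2
#include <cmath>

// Remove a linear phase ramp from one h x w complex window, estimated from the
// average of conj(a)*b over neighboring pixels (a common "average" deramp).
void derampAverageReference(float2 *img, int h, int w)
{
    double cx = 0, sx = 0, cy = 0, sy = 0;
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++) {
            float2 a = img[i * w + j];
            if (j + 1 < w) { float2 b = img[i * w + j + 1];
                cy += a.x * b.x + a.y * b.y;  sy += a.x * b.y - a.y * b.x; }
            if (i + 1 < h) { float2 b = img[(i + 1) * w + j];
                cx += a.x * b.x + a.y * b.y;  sx += a.x * b.y - a.y * b.x; }
        }
    double phx = std::atan2(sx, cx);   // mean phase increment per step down
    double phy = std::atan2(sy, cy);   // mean phase increment per step across
    for (int i = 0; i < h; i++)
        for (int j = 0; j < w; j++) {
            double ph = phx * i + phy * j;
            float c = float(std::cos(ph)), s = float(std::sin(ph));
            float2 a = img[i * w + j];
            // multiply by exp(-i*ph) to take the ramp out
            img[i * w + j] = make_float2(a.x * c + a.y * s, a.y * c - a.x * s);
        }
}
```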
|
||||
|
||||
// cuArraysPadding.cu: various utilities for oversampling padding
|
||||
void cuArraysPadding(cuArrays<float2> *image1, cuArrays<float2> *image2, cudaStream_t stream);
|
||||
void cuArraysPaddingMany(cuArrays<float2> *image1, cuArrays<float2> *image2, cudaStream_t stream);
|
||||
|
||||
//in cuCorrNormalization.cu: utilities to normalize the cross correlation function
|
||||
void cuArraysSubtractMean(cuArrays<float> *images, cudaStream_t stream);
|
||||
void cuCorrNormalize(cuArrays<float> *templates, cuArrays<float> *images, cuArrays<float> *results, cudaStream_t stream);
|
||||
|
|
@ -68,11 +67,11 @@ void cuArraysMaxloc2D(cuArrays<float> *images, cuArrays<int2> *maxloc, cuArrays<
|
|||
void cuArraysMaxloc2D(cuArrays<float> *images, cuArrays<int2> *maxloc, cudaStream_t stream);
|
||||
void cuSubPixelOffset(cuArrays<int2> *offsetInit, cuArrays<int2> *offsetZoomIn, cuArrays<float2> *offsetFinal,
|
||||
int OverSampleRatioZoomin, int OverSampleRatioRaw,
|
||||
int xHalfRangeInit, int yHalfRangeInit, int xHalfRangeZoomIn, int yHalfRangeZoomIn,
|
||||
int xHalfRangeInit, int yHalfRangeInit,
|
||||
cudaStream_t stream);
|
||||
|
||||
void cuDetermineInterpZone(cuArrays<int2> *maxloc, cuArrays<int2> *zoomInOffset, cuArrays<float> *corrOrig, cuArrays<float> *corrZoomIn, cudaStream_t stream);
|
||||
void cuDetermineSecondaryExtractOffset(cuArrays<int2> *maxLoc, cuArrays<int2> *maxLocShift, int xOldRange, int yOldRange, int xNewRange, int yNewRange, cudaStream_t stream);
|
||||
void cuDetermineSecondaryExtractOffset(cuArrays<int2> *maxLoc, cuArrays<int2> *maxLocShift,
|
||||
int xOldRange, int yOldRange, int xNewRange, int yNewRange, cudaStream_t stream);
|
||||
|
||||
//in cuCorrTimeDomain.cu: cross correlation in time domain
|
||||
void cuCorrTimeDomain(cuArrays<float> *templates, cuArrays<float> *images, cuArrays<float> *results, cudaStream_t stream);
|
||||
|
|
|
|||
|
|
@ -1,7 +1,14 @@
|
|||
/**
|
||||
* \file cuArrays.cu
|
||||
* \brief Implementations for cuArrays class
|
||||
*
|
||||
*/
|
||||
|
||||
// dependencies
|
||||
#include "cuArrays.h"
|
||||
#include "cudaError.h"
|
||||
|
||||
// allocate arrays in device memory
|
||||
template <typename T>
|
||||
void cuArrays<T>::allocate()
|
||||
{
|
||||
|
|
@ -9,14 +16,15 @@
|
|||
is_allocated = 1;
|
||||
}
|
||||
|
||||
// allocate arrays in host memory
|
||||
template <typename T>
|
||||
void cuArrays<T>::allocateHost()
|
||||
{
|
||||
hostData = (T *)malloc(getByteSize());
|
||||
//checkCudaErrors(cudaMallocHost((void **)&hostData, getByteSize()));
|
||||
is_allocatedHost = 1;
|
||||
}
|
||||
|
||||
// deallocate arrays in device memory
|
||||
template <typename T>
|
||||
void cuArrays<T>::deallocate()
|
||||
{
|
||||
|
|
@ -24,44 +32,69 @@
|
|||
is_allocated = 0;
|
||||
}
|
||||
|
||||
// deallocate arrays in host memory
|
||||
template <typename T>
|
||||
void cuArrays<T>::deallocateHost()
|
||||
{
|
||||
//checkCudaErrors(cudaFreeHost(hostData));
|
||||
free(hostData);
|
||||
is_allocatedHost = 0;
|
||||
}
|
||||
|
||||
// copy arrays from device to host
|
||||
// use asynchronous for possible overlaps between data copying and kernel execution
|
||||
template <typename T>
|
||||
void cuArrays<T>::copyToHost(cudaStream_t stream)
|
||||
{
|
||||
//std::cout << "debug copy " << is_allocatedHost << " " << is_allocated << " " << getByteSize() << "\n";
|
||||
checkCudaErrors(cudaMemcpyAsync(hostData, devData, getByteSize(), cudaMemcpyDeviceToHost, stream));
|
||||
}
|
||||
|
||||
// copy arrays from host to device
|
||||
template <typename T>
|
||||
void cuArrays<T>::copyToDevice(cudaStream_t stream)
|
||||
{
|
||||
checkCudaErrors(cudaMemcpyAsync(devData, hostData, getByteSize(), cudaMemcpyHostToDevice, stream));
|
||||
}
|
||||
|
||||
// set to 0
|
||||
template <typename T>
|
||||
void cuArrays<T>::setZero(cudaStream_t stream)
|
||||
{
|
||||
checkCudaErrors(cudaMemsetAsync(devData, 0, getByteSize(), stream));
|
||||
}
|
||||
|
||||
// output (partial) data when debugging
|
||||
template <typename T>
|
||||
void cuArrays<T>::debuginfo(cudaStream_t stream) {
|
||||
// output size info
|
||||
std::cout << "Image height,width,count: " << height << "," << width << "," << count << std::endl;
|
||||
// check whether host data is allocated
|
||||
if( !is_allocatedHost)
|
||||
allocateHost();
|
||||
// copy to host
|
||||
copyToHost(stream);
|
||||
|
||||
// set a max output range
|
||||
int range = std::min(10, size*count);
|
||||
// first 10 data
|
||||
for(int i=0; i<range; i++)
|
||||
std::cout << "(" <<hostData[i] << ")" ;
|
||||
std::cout << std::endl;
|
||||
// last 10 data
|
||||
if(size*count>range) {
|
||||
for(int i=size*count-range; i<size*count; i++)
|
||||
std::cout << "(" <<hostData[i] << ")" ;
|
||||
std::cout << std::endl;
|
||||
}
|
||||
}
|
||||
|
||||
// need specializations for x,y components
|
||||
template<>
|
||||
void cuArrays<float2>::debuginfo(cudaStream_t stream) {
|
||||
//std::cout << height << " " << width << " " << count << std::endl;
|
||||
//std::cout << height << " " << width << " " << count << std::endl;
|
||||
std::cout << "Image height,width,count: " << height << "," << width << "," << count << std::endl;
|
||||
if( !is_allocatedHost)
|
||||
allocateHost();
|
||||
copyToHost(stream);
|
||||
|
||||
//cudaStreamSynchronize(stream);
|
||||
//std::cout << "debug debuginfo " << size << " " << count << " " << stream << "\n";
|
||||
|
||||
int range = std::min(10, size*count);
|
||||
|
||||
for(int i=0; i<range; i++)
|
||||
|
|
@ -74,13 +107,32 @@
|
|||
}
|
||||
}
|
||||
|
||||
template<>
|
||||
void cuArrays<float3>::debuginfo(cudaStream_t stream) {
|
||||
std::cout << "Image height,width,count: " << height << "," << width << "," << count << std::endl;
|
||||
if( !is_allocatedHost)
|
||||
allocateHost();
|
||||
copyToHost(stream);
|
||||
|
||||
int range = std::min(10, size*count);
|
||||
|
||||
for(int i=0; i<range; i++)
|
||||
std::cout << "(" <<hostData[i].x << ", " << hostData[i].y << ")" ;
|
||||
std::cout << std::endl;
|
||||
if(size*count>range) {
|
||||
for(int i=size*count-range; i<size*count; i++)
|
||||
std::cout << "(" <<hostData[i].x << ", " << hostData[i].y << ", " << hostData[i].z <<")";
|
||||
std::cout << std::endl;
|
||||
}
|
||||
}
|
||||
|
||||
template<>
|
||||
void cuArrays<int2>::debuginfo(cudaStream_t stream) {
|
||||
//std::cout << height << " " << width << " " << count << std::endl;
|
||||
std::cout << "Image height,width,count: " << height << "," << width << "," << count << std::endl;
|
||||
if( !is_allocatedHost)
|
||||
allocateHost();
|
||||
copyToHost(stream);
|
||||
|
||||
int range = std::min(10, size*count);
|
||||
|
||||
for(int i=0; i<range; i++)
|
||||
|
|
@ -93,25 +145,7 @@
|
|||
}
|
||||
}
|
||||
|
||||
template <>
|
||||
void cuArrays<float>::debuginfo(cudaStream_t stream) {
|
||||
std::cout << height << " " << width << " " << count << std::endl;
|
||||
if( !is_allocatedHost)
|
||||
allocateHost();
|
||||
copyToHost(stream);
|
||||
|
||||
int range = std::min(10, size*count);
|
||||
|
||||
for(int i=0; i<range; i++)
|
||||
std::cout << "(" <<hostData[i] << ")" ;
|
||||
std::cout << std::endl;
|
||||
if(size*count>range) {
|
||||
for(int i=size*count-range; i<size*count; i++)
|
||||
std::cout << "(" <<hostData[i] << ")" ;
|
||||
std::cout << std::endl;
|
||||
}
|
||||
}
|
||||
|
||||
// output to file by copying to host at first
|
||||
template<typename T>
|
||||
void cuArrays<T>::outputToFile(std::string fn, cudaStream_t stream)
|
||||
{
|
||||
|
|
@ -121,6 +155,7 @@
|
|||
outputHostToFile(fn);
|
||||
}
|
||||
|
||||
// save the host data to (binary) file
|
||||
template <typename T>
|
||||
void cuArrays<T>::outputHostToFile(std::string fn)
|
||||
{
|
||||
|
|
@ -130,45 +165,11 @@
|
|||
file.close();
|
||||
}
|
||||
|
||||
/*
|
||||
template<>
|
||||
void cuArrays<float>::outputToFile(std::string fn, cudaStream_t stream)
|
||||
{
|
||||
float *data;
|
||||
data = (float *)malloc(size*count*sizeof(float));
|
||||
cudaMemcpyAsync(data, devData, size*count*sizeof(float), cudaMemcpyDeviceToHost, stream);
|
||||
std::ofstream file;
|
||||
file.open(fn.c_str(), std::ios_base::binary);
|
||||
file.write((char *)data, size*count*sizeof(float));
|
||||
file.close();
|
||||
}*/
|
||||
|
||||
template<>
|
||||
void cuArrays<float2>::outputToFile(std::string fn, cudaStream_t stream)
|
||||
{
|
||||
float *data;
|
||||
data = (float *)malloc(size*count*sizeof(float2));
|
||||
checkCudaErrors(cudaMemcpyAsync(data, devData, size*count*sizeof(float2), cudaMemcpyDeviceToHost, stream));
|
||||
std::ofstream file;
|
||||
file.open(fn.c_str(), std::ios_base::binary);
|
||||
file.write((char *)data, size*count*sizeof(float2));
|
||||
file.close();
|
||||
}
|
||||
|
||||
template<>
|
||||
void cuArrays<float3>::outputToFile(std::string fn, cudaStream_t stream)
|
||||
{
|
||||
float *data;
|
||||
data = (float *)malloc(size*count*sizeof(float3));
|
||||
checkCudaErrors(cudaMemcpyAsync(data, devData, size*count*sizeof(float3), cudaMemcpyDeviceToHost, stream));
|
||||
std::ofstream file;
|
||||
file.open(fn.c_str(), std::ios_base::binary);
|
||||
file.write((char *)data, size*count*sizeof(float3));
|
||||
file.close();
|
||||
}
|
||||
|
||||
// instantiations, required by python extensions
|
||||
template class cuArrays<float>;
|
||||
template class cuArrays<float2>;
|
||||
template class cuArrays<float3>;
|
||||
template class cuArrays<int2>;
|
||||
template class cuArrays<int>;
|
||||
|
||||
// end of file
|
||||
|
|
|
|||
|
|
@ -1,15 +1,17 @@
|
|||
/*
|
||||
* cuArrays.h
|
||||
* Header file for declaring a group of images
|
||||
/**
|
||||
* @file cuArrays.h
|
||||
* @brief Header file for cuArrays class
|
||||
*
|
||||
* Lijun Zhu
|
||||
* Seismo Lab, Caltech
|
||||
* V1.0 11/29/2016
|
||||
*/
|
||||
* A class that describes a batch of images (2D arrays).
|
||||
* Each image has size (height, width)
|
||||
* The number of images is given by (countH, countW), or (1, count).
|
||||
**/
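A small usage sketch of the class declared below (the sizes, the file name, and the single-stream handling are arbitrary choices for illustration):

```cpp
#include "cuArrays.h"
#include <cuda_runtime.h>

void exampleCuArrays()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // a batch of 128 real-valued 64x64 images
    cuArrays<float> images(64, 64, 128);
    images.allocate();       // device buffer
    images.allocateHost();   // host buffer

    images.setZero(stream);                      // clear on the device
    images.copyToHost(stream);                   // device -> host, asynchronous on this stream
    cudaStreamSynchronize(stream);               // wait before touching hostData
    images.outputToFile("images.bin", stream);   // dump the whole batch to a binary file

    cudaStreamDestroy(stream);
    // the destructor releases whatever was allocated
}
```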
|
||||
|
||||
#ifndef __CUIMAGES_H
|
||||
#define __CUIMAGES_H
|
||||
// code guard
|
||||
#ifndef __CUARRAYS_H
|
||||
#define __CUARRAYS_H
|
||||
|
||||
// cuda dependencies
|
||||
#include <cuda.h>
|
||||
#include <driver_types.h>
|
||||
|
||||
|
|
@ -19,27 +21,28 @@
|
|||
#include <ctime>
|
||||
|
||||
|
||||
|
||||
template <typename T>
|
||||
class cuArrays{
|
||||
|
||||
public:
|
||||
int height; // x, row, down, length, azimuth, along the track
|
||||
int height; ///< x, row, down, length, azimuth, along the track
|
||||
int width; // y, col, across, range, along the line of sight
|
||||
int size; // chip size, heigh*width
|
||||
int size; // one image size, height*width
|
||||
int countW; // number of images along width direction
|
||||
int countH; // number of images along height direction
|
||||
int count; // countW*countH, number of images
|
||||
T* devData; // pointer to data in device (gpu) memory
|
||||
T* hostData;
|
||||
T* hostData; // pointer to data in host (cpu) memory
|
||||
|
||||
bool is_allocated;
|
||||
bool is_allocatedHost;
|
||||
bool is_allocated; // whether the data is allocated in device memory
|
||||
bool is_allocatedHost; // whether the data is allocated in host memory
|
||||
|
||||
// default constructor, empty
|
||||
cuArrays() : width(0), height(0), size(0), countW(0), countH(0), count(0),
|
||||
is_allocated(0), is_allocatedHost(0),
|
||||
devData(0), hostData(0) {}
|
||||
|
||||
// single image
|
||||
// constructor for single image
|
||||
cuArrays(size_t h, size_t w) : width(w), height(h), countH(1), countW(1), count(1),
|
||||
is_allocated(0), is_allocatedHost(0),
|
||||
devData(0), hostData(0)
|
||||
|
|
@ -47,7 +50,7 @@ public:
|
|||
size = w*h;
|
||||
}
|
||||
|
||||
//
|
||||
// constructor for multiple images with a total count
|
||||
cuArrays(size_t h, size_t w, size_t n) : width(w), height(h), countH(1), countW(n), count(n),
|
||||
is_allocated(0), is_allocatedHost(0),
|
||||
devData(0), hostData(0)
|
||||
|
|
@ -55,6 +58,7 @@ public:
|
|||
size = w*h;
|
||||
}
|
||||
|
||||
// constructor for multiple images with (countH, countW)
|
||||
cuArrays(size_t h, size_t w, size_t ch, size_t cw) : width(w), height(h), countW(cw), countH(ch),
|
||||
is_allocated(0), is_allocatedHost(0),
|
||||
devData(0), hostData(0)
|
||||
|
|
@ -63,24 +67,29 @@ public:
|
|||
count = countH*countW;
|
||||
}
|
||||
|
||||
// memory allocation
|
||||
void allocate();
|
||||
void allocateHost();
|
||||
void deallocate();
|
||||
void deallocateHost();
|
||||
|
||||
// copy data between device and host memories
|
||||
void copyToHost(cudaStream_t stream);
|
||||
void copyToDevice(cudaStream_t stream);
|
||||
|
||||
// get the total size
|
||||
size_t getSize()
|
||||
{
|
||||
return size*count;
|
||||
}
|
||||
|
||||
long getByteSize()
|
||||
// get the total size in bytes
|
||||
inline long getByteSize()
|
||||
{
|
||||
return width*height*count*sizeof(T);
|
||||
}
|
||||
|
||||
// destructor
|
||||
~cuArrays()
|
||||
{
|
||||
if(is_allocated)
|
||||
|
|
@ -89,12 +98,16 @@ public:
|
|||
deallocateHost();
|
||||
}
|
||||
|
||||
// set zeroes
|
||||
void setZero(cudaStream_t stream);
|
||||
// output when debugging
|
||||
void debuginfo(cudaStream_t stream) ;
|
||||
void debuginfo(cudaStream_t stream, float factor);
|
||||
// write to files
|
||||
void outputToFile(std::string fn, cudaStream_t stream);
|
||||
void outputHostToFile(std::string fn);
|
||||
|
||||
};
|
||||
|
||||
#endif //__CUIMAGES_H
|
||||
#endif //__CUARRAYS_H
|
||||
//end of file
|
||||
|
|
|
|||
|
|
@ -1,22 +1,18 @@
|
|||
/* imagecopy.cu
|
||||
* various utitilies for copying images in device memory
|
||||
/**
|
||||
* @file cuArraysCopy.cu
|
||||
* @brief Utilities for copying/converting images to different format
|
||||
*
|
||||
* Lijun Zhu @ Seismo Lab, Caltech
|
||||
* v1.0 Jan 2017
|
||||
* All methods are declared in cuAmpcorUtil.h
|
||||
*/
|
||||
|
||||
|
||||
// dependencies
|
||||
#include "cuArrays.h"
|
||||
#include "cudaUtil.h"
|
||||
#include "cudaError.h"
|
||||
#include "float2.h"
|
||||
|
||||
/*
|
||||
inline __device__ float cuAbs(float2 a)
|
||||
{
|
||||
return sqrtf(a.x*a.x+a.y*a.y);
|
||||
}*/
|
||||
|
||||
// copy a chunk into a batch of chips for a given stride
|
||||
// cuda kernel for cuArraysCopyToBatch
|
||||
__global__ void cuArraysCopyToBatch_kernel(const float2 *imageIn, const int inNX, const int inNY,
|
||||
float2 *imageOut, const int outNX, const int outNY,
|
||||
const int nImagesX, const int nImagesY,
|
||||
|
|
@ -33,6 +29,8 @@ __global__ void cuArraysCopyToBatch_kernel(const float2 *imageIn, const int inNX
|
|||
imageOut[idxOut] = imageIn[idxIn];
|
||||
}
|
||||
|
||||
// copy a chunk into a batch of chips for a given stride
|
||||
// used to extract chips from a raw image
|
||||
void cuArraysCopyToBatch(cuArrays<float2> *image1, cuArrays<float2> *image2,
|
||||
int strideH, int strideW, cudaStream_t stream)
|
||||
{
|
||||
|
|
@ -47,7 +45,6 @@ void cuArraysCopyToBatch(cuArrays<float2> *image1, cuArrays<float2> *image2,
|
|||
getLastCudaError("cuArraysCopyToBatch_kernel");
|
||||
}
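The index arithmetic these batch-extraction kernels use (visible, for example, in cuArraysCopyC2R_kernel further down) reads more easily as a plain host loop; the reference below assumes the same row-major layout and stride convention:

```cpp
#include <cuda_runtime.h>   // float2

// Host reference for the strided chip extraction: cut nImagesX*nImagesY chips of
// size outNX x outNY out of one inNX x inNY chunk, stepping by strideH/strideW.
void copyToBatchReference(const float2 *imageIn, int inNX, int inNY,
                          float2 *imageOut, int outNX, int outNY,
                          int nImagesX, int nImagesY,
                          int strideH, int strideW)
{
    (void)inNX;  // only needed for bounds checking, omitted here
    for (int ix = 0; ix < nImagesX; ix++)
        for (int iy = 0; iy < nImagesY; iy++)
            for (int ox = 0; ox < outNX; ox++)
                for (int oy = 0; oy < outNY; oy++) {
                    int idxIn  = (ix * strideH + ox) * inNY + iy * strideW + oy;
                    int idxOut = ((ix * nImagesY + iy) * outNX + ox) * outNY + oy;
                    imageOut[idxOut] = imageIn[idxIn];
                }
}
```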
|
||||
|
||||
|
||||
// copy a chunk into a batch of chips for a set of offsets (varying strides), from complex to complex
|
||||
__global__ void cuArraysCopyToBatchWithOffset_kernel(const float2 *imageIn, const int inNY,
|
||||
float2 *imageOut, const int outNX, const int outNY, const int nImages,
|
||||
|
|
@ -69,7 +66,6 @@ void cuArraysCopyToBatchWithOffset(cuArrays<float2> *image1, const int lda1, cuA
|
|||
const int nthreads = 16;
|
||||
dim3 blockSize(nthreads, nthreads, 1);
|
||||
dim3 gridSize(IDIVUP(image2->height,nthreads), IDIVUP(image2->width,nthreads), image2->count);
|
||||
//fprintf(stderr, "copy tile to batch, %d %d\n", lda1, image2->count);
|
||||
cuArraysCopyToBatchWithOffset_kernel<<<gridSize,blockSize, 0 , stream>>> (
|
||||
image1->devData, lda1,
|
||||
image2->devData, image2->height, image2->width, image2->count,
|
||||
|
|
@ -97,7 +93,6 @@ void cuArraysCopyToBatchAbsWithOffset(cuArrays<float2> *image1, const int lda1,
|
|||
const int nthreads = 16;
|
||||
dim3 blockSize(nthreads, nthreads, 1);
|
||||
dim3 gridSize(IDIVUP(image2->height,nthreads), IDIVUP(image2->width,nthreads), image2->count);
|
||||
//fprintf(stderr, "copy tile to batch, %d %d\n", lda1, image2->count);
|
||||
cuArraysCopyToBatchAbsWithOffset_kernel<<<gridSize,blockSize, 0 , stream>>> (
|
||||
image1->devData, lda1,
|
||||
image2->devData, image2->height, image2->width, image2->count,
|
||||
|
|
@ -125,7 +120,6 @@ void cuArraysCopyToBatchWithOffsetR2C(cuArrays<float> *image1, const int lda1, c
|
|||
const int nthreads = 16;
|
||||
dim3 blockSize(nthreads, nthreads, 1);
|
||||
dim3 gridSize(IDIVUP(image2->height,nthreads), IDIVUP(image2->width,nthreads), image2->count);
|
||||
//fprintf(stderr, "copy tile to batch, %d %d\n", lda1, image2->count);
|
||||
cuArraysCopyToBatchWithOffsetR2C_kernel<<<gridSize,blockSize, 0 , stream>>> (
|
||||
image1->devData, lda1,
|
||||
image2->devData, image2->height, image2->width, image2->count,
|
||||
|
|
@ -133,7 +127,7 @@ void cuArraysCopyToBatchWithOffsetR2C(cuArrays<float> *image1, const int lda1, c
|
|||
getLastCudaError("cuArraysCopyToBatchWithOffsetR2C_kernel");
|
||||
}
|
||||
|
||||
//copy a chunk into a series of chips
|
||||
//copy a chunk into a series of chips, from complex to real
|
||||
__global__ void cuArraysCopyC2R_kernel(const float2 *imageIn, const int inNX, const int inNY,
|
||||
float *imageOut, const int outNX, const int outNY,
|
||||
const int nImagesX, const int nImagesY,
|
||||
|
|
@ -148,10 +142,8 @@ __global__ void cuArraysCopyC2R_kernel(const float2 *imageIn, const int inNX, co
|
|||
int idxImageY = idxImage%nImagesY;
|
||||
int idxIn = (idxImageX*strideX+outx)*inNY + idxImageY*strideY+outy;
|
||||
imageOut[idxOut] = complexAbs(imageIn[idxIn])*factor;
|
||||
//printf( "%d\n", idxOut);
|
||||
}
|
||||
|
||||
//tested
|
||||
void cuArraysCopyC2R(cuArrays<float2> *image1, cuArrays<float> *image2,
|
||||
int strideH, int strideW, cudaStream_t stream)
|
||||
{
|
||||
|
|
@ -167,6 +159,7 @@ void cuArraysCopyC2R(cuArrays<float2> *image1, cuArrays<float> *image2,
|
|||
getLastCudaError("cuda Error: cuArraysCopyC2R_kernel");
|
||||
}
|
||||
|
||||
//copy/extract real images from a larger size to a smaller size, with a varying offset for each image
|
||||
__global__ void cuArraysCopyExtractVaryingOffset(const float *imageIn, const int inNX, const int inNY,
|
||||
float *imageOut, const int outNX, const int outNY, const int nImages,
|
||||
const int2 *offsets)
|
||||
|
|
@ -183,10 +176,11 @@ __global__ void cuArraysCopyExtractVaryingOffset(const float *imageIn, const int
|
|||
}
|
||||
}
|
||||
|
||||
/* copy a tile of images to another image, with starting pixels offsets
|
||||
* param[in] imageIn inut images
|
||||
* param[out] imageOut output images of dimension nImages*outNX*outNY
|
||||
*/
|
||||
///
|
||||
/// Copy a tile of images to another image, with starting pixels offsets
|
||||
/// @param[in] imageIn input images
|
||||
/// @param[out] imageOut output images of dimension nImages*outNX*outNY
|
||||
///
|
||||
void cuArraysCopyExtract(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut, cuArrays<int2> *offsets, cudaStream_t stream)
|
||||
{
|
||||
//assert(imagesIn->height >= imagesOut && inNY >= outNY);
|
||||
|
|
@ -215,6 +209,10 @@ __global__ void cuArraysCopyExtractVaryingOffset_C2C(const float2 *imageIn, cons
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* copy/extract complex images from a large size to a smaller size from the location (offsetX, offsetY)
|
||||
* offset is varying for each image
|
||||
*/
|
||||
void cuArraysCopyExtract(cuArrays<float2> *imagesIn, cuArrays<float2> *imagesOut, cuArrays<int2> *offsets, cudaStream_t stream)
|
||||
{
|
||||
//assert(imagesIn->height >= imagesOut && inNY >= outNY);
|
||||
|
|
@ -266,9 +264,10 @@ __global__ void cuArraysCopyExtractVaryingOffsetCorr(const float *imageIn, const
|
|||
}
|
||||
}
|
||||
|
||||
/* copy a tile of images to another image, with starting pixels offsets accouting for boundary
|
||||
* param[in] imageIn inut images
|
||||
* param[out] imageOut output images of dimension nImages*outNX*outNY
|
||||
/**
|
||||
* copy a tile of images to another image, with starting pixel offsets accounting for the boundary
|
||||
* @param[in] imageIn input images
|
||||
* @param[out] imageOut output images of dimension nImages*outNX*outNY
|
||||
*/
|
||||
void cuArraysCopyExtractCorr(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut, cuArrays<int> *imagesValid, cuArrays<int2> *maxloc, cudaStream_t stream)
|
||||
{
|
||||
|
|
@ -335,7 +334,9 @@ __global__ void cuArraysCopyExtract_C2C_FixedOffset(const float2 *imageIn, const
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* copy/extract complex images from a large size to a smaller size from the location (offsetX, offsetY)
|
||||
*/
|
||||
void cuArraysCopyExtract(cuArrays<float2> *imagesIn, cuArrays<float2> *imagesOut, int2 offset, cudaStream_t stream)
|
||||
{
|
||||
//assert(imagesIn->height >= imagesOut && inNY >= outNY);
|
||||
|
|
@ -368,16 +369,15 @@ __global__ void cuArraysCopyExtract_C2C_FixedOffset(const float3 *imageIn, const
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* copy/extract float3 images from a large size to a smaller size from the location (offsetX, offsetY)
|
||||
*/
|
||||
void cuArraysCopyExtract(cuArrays<float3> *imagesIn, cuArrays<float3> *imagesOut, int2 offset, cudaStream_t stream)
|
||||
{
|
||||
//assert(imagesIn->height >= imagesOut && inNY >= outNY);
|
||||
const int nthreads = NTHREADS2D;
|
||||
dim3 threadsperblock(nthreads, nthreads,1);
|
||||
dim3 blockspergrid(IDIVUP(imagesOut->height,nthreads), IDIVUP(imagesOut->width,nthreads), imagesOut->count);
|
||||
//std::cout << "debug copyExtract" << imagesOut->width << imagesOut->height << "\n";
|
||||
//imagesIn->debuginfo(stream);
|
||||
//imagesOut->debuginfo(stream);
|
||||
cuArraysCopyExtract_C2C_FixedOffset<<<blockspergrid, threadsperblock,0, stream>>>
|
||||
(imagesIn->devData, imagesIn->height, imagesIn->width,
|
||||
imagesOut->devData, imagesOut->height, imagesOut->width, imagesOut->count, offset.x, offset.y);
|
||||
|
|
@ -402,8 +402,10 @@ __global__ void cuArraysCopyExtract_C2R_FixedOffset(const float2 *imageIn, const
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
|
||||
/**
|
||||
* copy/extract complex images from a large size to float images (by taking real parts)
|
||||
* with a smaller size from the location (offsetX, offsetY)
|
||||
*/
|
||||
void cuArraysCopyExtract(cuArrays<float2> *imagesIn, cuArrays<float> *imagesOut, int2 offset, cudaStream_t stream)
|
||||
{
|
||||
//assert(imagesIn->height >= imagesOut && inNY >= outNY);
|
||||
|
|
@ -429,7 +431,9 @@ __global__ void cuArraysCopyInsert_kernel(const float2* imageIn, const int inNX,
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* copy/insert complex images from a smaller size to a larger size from the location (offsetX, offsetY)
|
||||
*/
|
||||
void cuArraysCopyInsert(cuArrays<float2> *imageIn, cuArrays<float2> *imageOut, int offsetX, int offsetY, cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = 16;
|
||||
|
|
@ -453,6 +457,9 @@ __global__ void cuArraysCopyInsert_kernel(const float3* imageIn, const int inNX,
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* copy/insert float3 images from a smaller size to a larger size from the location (offsetX, offsetY)
|
||||
*/
|
||||
void cuArraysCopyInsert(cuArrays<float3> *imageIn, cuArrays<float3> *imageOut, int offsetX, int offsetY, cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = 16;
|
||||
|
|
@ -477,7 +484,9 @@ __global__ void cuArraysCopyInsert_kernel(const float* imageIn, const int inNX,
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* copy/insert real images from a smaller size to a larger size from the location (offsetX, offsetY)
|
||||
*/
|
||||
void cuArraysCopyInsert(cuArrays<float> *imageIn, cuArrays<float> *imageOut, int offsetX, int offsetY, cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = 16;
|
||||
|
|
@ -488,7 +497,7 @@ void cuArraysCopyInsert(cuArrays<float> *imageIn, cuArrays<float> *imageOut, int
|
|||
getLastCudaError("cuArraysCopyInsert Float error");
|
||||
}
|
||||
|
||||
//
|
||||
|
||||
|
||||
__global__ void cuArraysCopyInsert_kernel(const int* imageIn, const int inNX, const int inNY,
|
||||
int* imageOut, const int outNY, const int offsetX, const int offsetY)
|
||||
|
|
@ -502,7 +511,9 @@ __global__ void cuArraysCopyInsert_kernel(const int* imageIn, const int inNX, co
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* copy/insert int images from a smaller size to a larger size from the location (offsetX, offsetY)
|
||||
*/
|
||||
void cuArraysCopyInsert(cuArrays<int> *imageIn, cuArrays<int> *imageOut, int offsetX, int offsetY, cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = 16;
|
||||
|
|
@ -512,38 +523,6 @@ void cuArraysCopyInsert(cuArrays<int> *imageIn, cuArrays<int> *imageOut, int off
|
|||
imageOut->devData, imageOut->width, offsetX, offsetY);
|
||||
getLastCudaError("cuArraysCopyInsert Integer error");
|
||||
}
|
||||
//
|
||||
|
||||
|
||||
__global__ void cuArraysCopyInversePadded_kernel(float *imageIn, int inNX, int inNY, int sizeIn,
|
||||
float *imageOut, int outNX, int outNY, int sizeOut, int nImages)
|
||||
{
|
||||
int outx = threadIdx.x + blockDim.x*blockIdx.x;
|
||||
int outy = threadIdx.y + blockDim.y*blockIdx.y;
|
||||
|
||||
if(outx < outNX && outy < outNY)
|
||||
{
|
||||
int idxImage = blockIdx.z;
|
||||
int idxOut = IDX2R(outx, outy, outNY)+idxImage*sizeOut;
|
||||
if(outx < inNX && outy <inNY) {
|
||||
int idxIn = IDX2R(inNX-outx-1, inNY-outy-1, inNY)+idxImage*sizeIn;
|
||||
imageOut[idxOut] = imageIn[idxIn];
|
||||
}
|
||||
else
|
||||
{ imageOut[idxOut] = 0.0f; }
|
||||
}
|
||||
}
|
||||
|
||||
void cuArraysCopyInversePadded(cuArrays<float> *imageIn, cuArrays<float> *imageOut,cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = 16;
|
||||
int nImages = imageIn->count;
|
||||
dim3 blockSize(nthreads, nthreads,1);
|
||||
dim3 gridSize(IDIVUP(imageOut->height,nthreads), IDIVUP(imageOut->width,nthreads), nImages);
|
||||
cuArraysCopyInversePadded_kernel<<<gridSize, blockSize, 0, stream>>>(imageIn->devData, imageIn->height, imageIn->width, imageIn->size,
|
||||
imageOut->devData, imageOut->height, imageOut->width, imageOut->size, nImages);
|
||||
getLastCudaError("cuArraysCopyInversePadded error");
|
||||
}
|
||||
|
||||
|
||||
__global__ void cuArraysCopyPadded_R2R_kernel(float *imageIn, int inNX, int inNY, int sizeIn,
|
||||
|
|
@ -565,6 +544,9 @@ __global__ void cuArraysCopyPadded_R2R_kernel(float *imageIn, int inNX, int inNY
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* copy real images from a smaller size to a larger size while padding 0 for extra elements
|
||||
*/
|
||||
void cuArraysCopyPadded(cuArrays<float> *imageIn, cuArrays<float> *imageOut,cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = 16;
|
||||
|
|
@ -596,7 +578,9 @@ __global__ void cuArraysCopyPadded_C2C_kernel(float2 *imageIn, int inNX, int inN
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* copy complex images from a smaller size to a larger size while padding 0 for extra elements
|
||||
*/
|
||||
void cuArraysCopyPadded(cuArrays<float2> *imageIn, cuArrays<float2> *imageOut,cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = NTHREADS2D;
|
||||
|
|
@ -609,6 +593,7 @@ void cuArraysCopyPadded(cuArrays<float2> *imageIn, cuArrays<float2> *imageOut,cu
|
|||
getLastCudaError("cuArraysCopyInversePadded error");
|
||||
}
|
||||
|
||||
// kernel for cuArraysCopyPadded
|
||||
__global__ void cuArraysCopyPadded_R2C_kernel(float *imageIn, int inNX, int inNY, int sizeIn,
|
||||
float2 *imageOut, int outNX, int outNY, int sizeOut, int nImages)
|
||||
{
|
||||
|
|
@ -629,7 +614,9 @@ __global__ void cuArraysCopyPadded_R2C_kernel(float *imageIn, int inNX, int inNY
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* copy real images to complex images (imaginary part=0) with larger size (pad 0 for extra elements)
|
||||
*/
|
||||
void cuArraysCopyPadded(cuArrays<float> *imageIn, cuArrays<float2> *imageOut,cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = NTHREADS2D;
|
||||
|
|
@ -642,7 +629,7 @@ void cuArraysCopyPadded(cuArrays<float> *imageIn, cuArrays<float2> *imageOut,cud
|
|||
getLastCudaError("cuArraysCopyPadded error");
|
||||
}
|
||||
|
||||
|
||||
// cuda kernel for setting a constant value
|
||||
__global__ void cuArraysSetConstant_kernel(float *image, int size, float value)
|
||||
{
|
||||
int idx = threadIdx.x + blockDim.x*blockIdx.x;
|
||||
|
|
@ -653,6 +640,10 @@ __global__ void cuArraysSetConstant_kernel(float *image, int size, float value)
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Set real images to a constant value
|
||||
* @note use setZero if value=0 because cudaMemset is faster
|
||||
*/
|
||||
void cuArraysSetConstant(cuArrays<float> *imageIn, float value, cudaStream_t stream)
|
||||
{
|
||||
const int nthreads = 256;
|
||||
|
|
@ -662,3 +653,68 @@ void cuArraysSetConstant(cuArrays<float> *imageIn, float value, cudaStream_t str
|
|||
(imageIn->devData, imageIn->size, value);
|
||||
getLastCudaError("cuArraysCopyPadded error");
|
||||
}
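Following the note above, a caller would pick between the two entry points roughly like this (illustrative helper, assuming the usual headers):

```cpp
#include "cuArrays.h"
#include "cuAmpcorUtil.h"

void fillCorrelationSurface(cuArrays<float> *corr, float value, cudaStream_t stream)
{
    if (value == 0.0f)
        corr->setZero(stream);                     // cudaMemsetAsync path, cheaper
    else
        cuArraysSetConstant(corr, value, stream);  // generic kernel above
}
```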
|
||||
|
||||
|
||||
// convert float to float2(complex)
|
||||
__global__ void cuArraysR2C_kernel(float *image1, float2 *image2, int size)
|
||||
{
|
||||
int idx = threadIdx.x + blockDim.x*blockIdx.x;
|
||||
if(idx < size)
|
||||
{
|
||||
image2[idx].x = image1[idx];
|
||||
image2[idx].y = 0.0f;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Convert real images to complex images (set imaginary parts to 0)
|
||||
*/
|
||||
void cuArraysR2C(cuArrays<float> *image1, cuArrays<float2> *image2, cudaStream_t stream)
|
||||
{
|
||||
int size = image1->getSize();
|
||||
cuArraysR2C_kernel<<<IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>(image1->devData, image2->devData, size);
|
||||
getLastCudaError("cuArraysR2C");
|
||||
}
|
||||
|
||||
|
||||
// take real part of float2 to float
|
||||
__global__ void cuArraysC2R_kernel(float2 *image1, float *image2, int size)
|
||||
{
|
||||
int idx = threadIdx.x + blockDim.x*blockIdx.x;
|
||||
if(idx < size)
|
||||
{
|
||||
image2[idx] = image1[idx].x;
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Take real parts of complex images
|
||||
*/
|
||||
void cuArraysC2R(cuArrays<float2> *image1, cuArrays<float> *image2, cudaStream_t stream)
|
||||
{
|
||||
int size = image1->getSize();
|
||||
cuArraysC2R_kernel<<<IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>(image1->devData, image2->devData, size);
|
||||
getLastCudaError("cuArraysC2R");
|
||||
}
|
||||
|
||||
// cuda kernel for cuArraysAbs
|
||||
__global__ void cuArraysAbs_kernel(float2 *image1, float *image2, int size)
|
||||
{
|
||||
int idx = threadIdx.x + blockDim.x*blockIdx.x;
|
||||
if(idx < size)
|
||||
{
|
||||
image2[idx] = complexAbs(image1[idx]);
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Obtain abs (amplitudes) of complex images
|
||||
*/
|
||||
void cuArraysAbs(cuArrays<float2> *image1, cuArrays<float> *image2, cudaStream_t stream)
|
||||
{
|
||||
int size = image1->getSize();
|
||||
cuArraysAbs_kernel<<<IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>(image1->devData, image2->devData, size);
|
||||
getLastCudaError("cuArraysAbs_kernel");
|
||||
}
|
||||
|
||||
// end of file
|
||||
|
|
@ -1,13 +1,12 @@
|
|||
/*
|
||||
* cuArraysPadding.cu
|
||||
* Padding Utitilies for oversampling
|
||||
* @file cuArraysPadding.cu
|
||||
* @brief Utilities for padding zeros to cuArrays
|
||||
*/
|
||||
|
||||
#include "cuAmpcorUtil.h"
|
||||
#include "float2.h"
|
||||
|
||||
//padding zeros in the middle, move quads to corners
|
||||
//for raw chunk data oversampling
|
||||
// cuda kernel for cuArraysPadding
|
||||
__global__ void cuArraysPadding_kernel(
|
||||
const float2 *image1, const int height1, const int width1,
|
||||
float2 *image2, const int height2, const int width2)
|
||||
|
|
@ -21,8 +20,6 @@ __global__ void cuArraysPadding_kernel(
|
|||
int tx2 = height2 -1 -tx;
|
||||
int ty2 = width2 -1 -ty;
|
||||
|
||||
//printf("%d %d %d\n", tx, height1, height2);
|
||||
|
||||
image2[IDX2R(tx, ty, width2)] = image1[IDX2R(tx, ty, width1)];
|
||||
image2[IDX2R(tx2, ty, width2)] = image1[IDX2R(tx1, ty, width1)];
|
||||
image2[IDX2R(tx, ty2, width2)] = image1[IDX2R(tx, ty1, width1)];
|
||||
|
|
@ -30,7 +27,13 @@ __global__ void cuArraysPadding_kernel(
|
|||
|
||||
}
|
||||
}
|
||||
//tested
|
||||
|
||||
/**
|
||||
* Padding zeros in the middle, move quads to corners
|
||||
* @param[in] image1 input images
|
||||
* @param[out] image2 output images
|
||||
* @note This routine is for a single image, no longer used
|
||||
*/
|
||||
void cuArraysPadding(cuArrays<float2> *image1, cuArrays<float2> *image2, cudaStream_t stream)
|
||||
{
|
||||
int ThreadsPerBlock = NTHREADS2D;
|
||||
|
|
@ -38,7 +41,9 @@ void cuArraysPadding(cuArrays<float2> *image1, cuArrays<float2> *image2, cudaStr
|
|||
int BlockPerGridy = IDIVUP (image1->width/2, ThreadsPerBlock);
|
||||
dim3 dimBlock(ThreadsPerBlock, ThreadsPerBlock);
|
||||
dim3 dimGrid(BlockPerGridx, BlockPerGridy);
|
||||
// set output image to 0
|
||||
checkCudaErrors(cudaMemsetAsync(image2->devData, 0, image2->getByteSize(),stream));
|
||||
// copy the quads of input images to four corners of the output images
|
||||
cuArraysPadding_kernel<<<dimGrid, dimBlock, 0, stream>>>(
|
||||
image1->devData, image1->height, image1->width,
|
||||
image2->devData, image2->height, image2->width);
|
||||
|
|
@ -50,7 +55,7 @@ inline __device__ float2 cmplxMul(float2 c, float a)
|
|||
return make_float2(c.x*a, c.y*a);
|
||||
}
|
||||
|
||||
//padding for zoomed-in correlation oversampling/interpolation
|
||||
// cuda kernel for
|
||||
__global__ void cuArraysPaddingMany_kernel(
|
||||
const float2 *image1, const int height1, const int width1, const int size1,
|
||||
float2 *image2, const int height2, const int width2, const int size2, const float factor )
|
||||
|
|
@ -67,7 +72,6 @@ __global__ void cuArraysPaddingMany_kernel(
|
|||
|
||||
int stride1 = blockIdx.z*size1;
|
||||
int stride2 = blockIdx.z*size2;
|
||||
//printf("%d %d %d\n", tx, height1, height2);
|
||||
|
||||
image2[IDX2R(tx, ty, width2)+stride2] = image1[IDX2R(tx, ty, width1)+stride1]*factor;
|
||||
image2[IDX2R(tx2, ty, width2)+stride2] = cmplxMul(image1[IDX2R(tx1, ty, width1)+stride1], factor);
|
||||
|
|
@ -76,6 +80,12 @@ __global__ void cuArraysPaddingMany_kernel(
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Padding zeros for FFT oversampling
|
||||
* @param[in] image1 input images
|
||||
* @param[out] image2 output images
|
||||
* @note To keep the band center at (0,0), move quads to corners and pad zeros in the middle
|
||||
*/
|
||||
void cuArraysPaddingMany(cuArrays<float2> *image1, cuArrays<float2> *image2, cudaStream_t stream)
|
||||
{
|
||||
int ThreadsPerBlock = NTHREADS2D;
|
||||
|
|
@ -91,63 +101,7 @@ void cuArraysPaddingMany(cuArrays<float2> *image1, cuArrays<float2> *image2, cud
|
|||
image2->devData, image2->height, image2->width, image2->size, factor);
|
||||
getLastCudaError("cuArraysPadding_kernel");
|
||||
}
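// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// A 1-D host-side analogue of the padding performed above, assuming the DC
// component sits at index 0 (the usual FFT layout): the low-frequency half of
// the spectrum goes to the front of the padded array, the high-frequency half
// to the back, and zeros fill the middle. The 2-D kernel applies the same idea
// to the four quadrants ("quads") of each image, and additionally scales by an
// amplitude factor.
#include <complex>
#include <vector>

static std::vector<std::complex<float>> padSpectrum1D(
    const std::vector<std::complex<float>> &in, int factor)
{
    const int n = static_cast<int>(in.size());
    std::vector<std::complex<float>> out(n * factor); // zero-filled
    for (int i = 0; i < n / 2; i++) {
        out[i] = in[i];                          // low-frequency half
        out[out.size() - 1 - i] = in[n - 1 - i]; // high-frequency half
    }
    return out;
}
// ----------------------------------------------------------------------------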
|
||||
|
||||
|
||||
// convert float to float2(complex)
|
||||
__global__ void cuArraysR2C_kernel(float *image1, float2 *image2, int size)
|
||||
{
|
||||
int idx = threadIdx.x + blockDim.x*blockIdx.x;
|
||||
if(idx < size)
|
||||
{
|
||||
image2[idx].x = image1[idx];
|
||||
image2[idx].y = 0.0f;
|
||||
}
|
||||
}
|
||||
|
||||
//tested
|
||||
void cuArraysR2C(cuArrays<float> *image1, cuArrays<float2> *image2, cudaStream_t stream)
|
||||
{
|
||||
int size = image1->getSize();
|
||||
cuArraysR2C_kernel<<<IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>(image1->devData, image2->devData, size);
|
||||
getLastCudaError("cuArraysR2C");
|
||||
}
|
||||
|
||||
|
||||
// take real part of float2 to float
|
||||
__global__ void cuArraysC2R_kernel(float2 *image1, float *image2, int size)
|
||||
{
|
||||
int idx = threadIdx.x + blockDim.x*blockIdx.x;
|
||||
if(idx < size)
|
||||
{
|
||||
image2[idx] = image1[idx].x;
|
||||
}
|
||||
}
|
||||
|
||||
//tested
|
||||
void cuArraysC2R(cuArrays<float2> *image1, cuArrays<float> *image2, cudaStream_t stream)
|
||||
{
|
||||
int size = image1->getSize();
|
||||
cuArraysC2R_kernel<<<IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>(image1->devData, image2->devData, size);
|
||||
getLastCudaError("cuArraysC2R");
|
||||
}
|
||||
|
||||
// take real part of float2 to float
|
||||
__global__ void cuArraysAbs_kernel(float2 *image1, float *image2, int size)
|
||||
{
|
||||
int idx = threadIdx.x + blockDim.x*blockIdx.x;
|
||||
if(idx < size)
|
||||
{
|
||||
image2[idx] = complexAbs(image1[idx]);
|
||||
}
|
||||
}
|
||||
|
||||
//tested
|
||||
void cuArraysAbs(cuArrays<float2> *image1, cuArrays<float> *image2, cudaStream_t stream)
|
||||
{
|
||||
int size = image1->getSize();
|
||||
cuArraysAbs_kernel<<<IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>(image1->devData, image2->devData, size);
|
||||
getLastCudaError("cuArraysAbs_kernel");
|
||||
}
|
||||
//end of file
@ -1,17 +1,26 @@
/*
|
||||
* cuCorrFrequency.cu
|
||||
* define a class to save FFT plans and intermediate data for cross correlation in frequency domain
|
||||
* @file cuCorrFrequency.cu
|
||||
 * @brief A class that performs cross correlation in the frequency domain
|
||||
*/
|
||||
|
||||
#include "cuCorrFrequency.h"
|
||||
#include "cuAmpcorUtil.h"
|
||||
|
||||
/*
|
||||
* cuFreqCorrelator Constructor
|
||||
* @param imageNX height of each image
|
||||
* @param imageNY width of each image
|
||||
* @param nImages number of images in the batch
|
||||
* @param stream CUDA stream
|
||||
*/
|
||||
cuFreqCorrelator::cuFreqCorrelator(int imageNX, int imageNY, int nImages, cudaStream_t stream_)
|
||||
{
|
||||
|
||||
int imageSize = imageNX*imageNY;
|
||||
int fImageSize = imageNX*(imageNY/2+1);
|
||||
int n[NRANK] ={imageNX, imageNY};
|
||||
|
||||
// set up fft plans
|
||||
cufft_Error(cufftPlanMany(&forwardPlan, NRANK, n,
|
||||
NULL, 1, imageSize,
|
||||
NULL, 1, fImageSize,
|
||||
|
|
@ -24,6 +33,7 @@ cuFreqCorrelator::cuFreqCorrelator(int imageNX, int imageNY, int nImages, cudaSt
|
|||
cufftSetStream(forwardPlan, stream);
|
||||
cufftSetStream(backwardPlan, stream);
|
||||
|
||||
// set up work arrays
|
||||
workFM = new cuArrays<float2>(imageNX, (imageNY/2+1), nImages);
|
||||
workFM->allocate();
|
||||
workFS = new cuArrays<float2>(imageNX, (imageNY/2+1), nImages);
|
||||
|
|
@ -32,6 +42,7 @@ cuFreqCorrelator::cuFreqCorrelator(int imageNX, int imageNY, int nImages, cudaSt
|
|||
workT->allocate();
|
||||
}
|
||||
|
||||
/// destructor
|
||||
cuFreqCorrelator::~cuFreqCorrelator()
|
||||
{
|
||||
cufft_Error(cufftDestroy(forwardPlan));
|
||||
|
|
@ -41,46 +52,40 @@ cuFreqCorrelator::~cuFreqCorrelator()
|
|||
workT->deallocate();
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Execute the cross correlation
|
||||
* @param[in] templates the reference windows
|
||||
* @param[in] images the search windows
|
||||
* @param[out] results the correlation surfaces
|
||||
*/
|
||||
|
||||
void cuFreqCorrelator::execute(cuArrays<float> *templates, cuArrays<float> *images, cuArrays<float> *results)
|
||||
{
|
||||
// pad the reference windows to the size of the search windows
|
||||
cuArraysCopyPadded(templates, workT, stream);
|
||||
// forward fft to frequency domain
|
||||
cufft_Error(cufftExecR2C(forwardPlan, workT->devData, workFM->devData));
|
||||
cufft_Error(cufftExecR2C(forwardPlan, images->devData, workFS->devData));
|
||||
// cufft doesn't normalize, so manually get the image size for normalization
|
||||
float coef = 1.0/(images->size);
|
||||
// multiply reference with secondary windows in frequency domain
|
||||
cuArraysElementMultiplyConjugate(workFM, workFS, coef, stream);
|
||||
// backward fft to get correlation surface in time domain
|
||||
cufft_Error(cufftExecC2R(backwardPlan, workFM->devData, workT->devData));
|
||||
// extract to get proper size of correlation surface
|
||||
cuArraysCopyExtract(workT, results, make_int2(0, 0), stream);
|
||||
//workT->outputToFile("test",stream);
|
||||
}
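// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// What the FFT pipeline above computes, written out naively in 1-D: pad the
// reference window to the search-window length, then evaluate the circular
// cross correlation c(u) = (1/N) * sum_x t(x) * s(x+u). By the correlation
// theorem, the conjugate multiplication in the frequency domain followed by an
// inverse FFT yields the same result; cuArraysCopyExtract then keeps only the
// valid offsets. The helper name below is made up for the sketch.
#include <vector>

static std::vector<float> circularCrossCorr1D(const std::vector<float> &t,
                                              const std::vector<float> &s)
{
    const int n = static_cast<int>(s.size()); // search-window length
    std::vector<float> c(n, 0.0f);
    for (int u = 0; u < n; u++) {
        float sum = 0.0f;
        // t is implicitly zero-padded to length n, so only its own elements
        // contribute to the sum
        for (int x = 0; x < static_cast<int>(t.size()); x++)
            sum += t[x] * s[(x + u) % n];
        c[u] = sum / n; // cuFFT does not normalize, hence the 1/N coefficient
    }
    return c;
}
// ----------------------------------------------------------------------------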
|
||||
|
||||
|
||||
|
||||
__global__ void cudaKernel_elementMulC(float2 *ainout, float2 *bin, size_t size)
|
||||
{
|
||||
int idx = threadIdx.x + blockIdx.x*blockDim.x;
|
||||
if(idx < size) {
|
||||
cuComplex prod;
|
||||
prod = cuCmulf(ainout[idx], bin[idx]);
|
||||
ainout [idx] = prod;
|
||||
}
|
||||
}
|
||||
|
||||
void cuArraysElementMultiply(cuArrays<float2> *image1, cuArrays<float2> *image2, cudaStream_t stream)
|
||||
{
|
||||
int size = image1->getSize();
|
||||
int threadsperblock = NTHREADS;
|
||||
int blockspergrid = IDIVUP (size, threadsperblock);
|
||||
cudaKernel_elementMulC<<<blockspergrid, threadsperblock, 0, stream>>>(image1->devData, image2->devData, size );
|
||||
getLastCudaError("cuArraysElementMultiply error\n");
|
||||
// all done
|
||||
}
|
||||
|
||||
// a = a^* * b
|
||||
inline __device__ float2 cuMulConj(float2 a, float2 b)
|
||||
{
|
||||
return make_float2(a.x*b.x + a.y*b.y, -a.y*b.x + a.x*b.y);
|
||||
}
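// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// cuMulConj above expands conj(a)*b by hand:
//   conj(a)*b = (a.x*b.x + a.y*b.y) + i*(a.x*b.y - a.y*b.x).
// With std::complex the same product reads as follows, which is handy for
// checking the sign convention on the host.
#include <complex>

static std::complex<float> mulConjRef(std::complex<float> a, std::complex<float> b)
{
    return std::conj(a) * b; // same real/imaginary parts as cuMulConj(a, b)
}
// ----------------------------------------------------------------------------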
|
||||
|
||||
|
||||
__global__ void cudaKernel_elementMulConjugate(float2 *ainout, float2 *bin, size_t size, float coef)
|
||||
// cuda kernel for cuArraysElementMultiplyConjugate
|
||||
__global__ void cudaKernel_elementMulConjugate(float2 *ainout, float2 *bin, int size, float coef)
|
||||
{
|
||||
int idx = threadIdx.x + blockIdx.x*blockDim.x;
|
||||
if(idx < size) {
|
||||
|
|
@ -90,6 +95,12 @@ __global__ void cudaKernel_elementMulConjugate(float2 *ainout, float2 *bin, size
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Perform multiplication of coef*Conjugate[image1]*image2 for each element
|
||||
* @param[inout] image1, the first image
|
||||
* @param[in] image2, the secondary image
|
||||
* @param[in] coef, usually the normalization factor
|
||||
*/
|
||||
void cuArraysElementMultiplyConjugate(cuArrays<float2> *image1, cuArrays<float2> *image2, float coef, cudaStream_t stream)
|
||||
{
|
||||
int size = image1->getSize();
|
||||
|
|
@ -98,3 +109,4 @@ void cuArraysElementMultiplyConjugate(cuArrays<float2> *image1, cuArrays<float2>
|
|||
cudaKernel_elementMulConjugate<<<blockspergrid, threadsperblock, 0, stream>>>(image1->devData, image2->devData, size, coef );
|
||||
getLastCudaError("cuArraysElementMultiply error\n");
|
||||
}
|
||||
//end of file
@ -1,29 +1,37 @@
/*
|
||||
* cuCorrFrequency.h
|
||||
* define a class to save FFT plans and intermediate data for cross correlation in frequency domain
|
||||
* @file cuCorrFrequency.h
|
||||
 * @brief A class that performs cross correlation in the frequency domain
|
||||
*/
|
||||
|
||||
// code guard
|
||||
#ifndef __CUCORRFREQUENCY_H
|
||||
#define __CUCORRFREQUENCY_H
|
||||
|
||||
// dependencies
|
||||
#include "cudaUtil.h"
|
||||
#include "cuArrays.h"
|
||||
|
||||
class cuFreqCorrelator
|
||||
{
|
||||
private:
|
||||
// handles for forward/backward fft
|
||||
cufftHandle forwardPlan;
|
||||
cufftHandle backwardPlan;
|
||||
|
||||
// work data
|
||||
cuArrays<float2> *workFM;
|
||||
cuArrays<float2> *workFS;
|
||||
cuArrays<float> *workT;
|
||||
// cuda stream
|
||||
cudaStream_t stream;
|
||||
|
||||
public:
|
||||
// constructor
|
||||
cuFreqCorrelator(int imageNX, int imageNY, int nImages, cudaStream_t stream_);
|
||||
// destructor
|
||||
~cuFreqCorrelator();
|
||||
// executor
|
||||
void execute(cuArrays<float> *templates, cuArrays<float> *images, cuArrays<float> *results);
|
||||
};
|
||||
|
||||
#endif
|
||||
#endif //__CUCORRFREQUENCY_H
|
||||
// end of file
@ -1,13 +1,31 @@
/*
 * cuCorrNormalization.cu
 * various utilities related to normalization of images
 * including calculating mean, subtract mean, ....
 * @file cuCorrNormalization.cu
 * @brief Utilities to normalize the correlation surface
 *
 * The mean and variance of the normalization factor can be computed from the
 * cumulative/prefix sum (or summed-area table) s(u,v), and s2(u,v).
 * We follow the algorithm by Evghenii Gaburov, Tim Idzenga, Willem Vermin, in the nxcor package.
 * 1. Iterate over rows; for each row, the cumulative sum of the elements in the row
 *    is computed as c_row(u,v) = \sum_{v'<v} f(u, v')
 *    and we keep track of the sum over an area of width Ny, i.e.,
 *    c(u,v) = \sum_{u'<=u} [c_row(u', v+Ny) - c_row(u', v)],
 *    or c(u,v) = c(u-1, v) + [c_row(u, v+Ny) - c_row(u, v)]
 * 2. When the row index reaches the window height, u = Nx-1,
 *    c(u,v) provides the sum of area for the first batch of windows, sa(0,v).
 * 3. Proceeding to row u+1, we compute both c_row(u+1, v) and c_row(u-Nx, v),
 *    i.e., we add the sum from the new row and remove the sum from the first row in c(u,v):
 *    c(u+1,v) = c(u,v) + [c_row(u+1,v+Ny)-c_row(u+1, v)] - [c_row(u-Nx, v+Ny)-c_row(u-Nx, v)].
 * 4. Iterate step 3 over the remaining rows; c(u,v) then provides the sum of areas for each new row of windows.
 *
 */
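// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// Host-side reference of steps 1-4 above for a single secondary window f of
// size (NX, NY) and a template window of size (Nx, Ny): S[u][v] holds the sum
// of f over the Nx-by-Ny patch whose top-left corner is (u, v). The GPU kernel
// keeps the same running quantities per thread/column (and does the same for
// the squared values); this version just spells the recurrences out serially.
#include <vector>

static std::vector<std::vector<float>> windowSums(
    const std::vector<std::vector<float>> &f, int NX, int NY, int Nx, int Ny)
{
    // step 1: per-row prefix sums, crow[u][v] = \sum_{v'<v} f[u][v']
    std::vector<std::vector<float>> crow(NX, std::vector<float>(NY + 1, 0.0f));
    for (int u = 0; u < NX; u++)
        for (int v = 0; v < NY; v++)
            crow[u][v + 1] = crow[u][v] + f[u][v];

    const int outNX = NX - Nx + 1, outNY = NY - Ny + 1;
    std::vector<std::vector<float>> S(outNX, std::vector<float>(outNY, 0.0f));

    // steps 1-2: accumulate the first Nx rows to get the first row of windows
    std::vector<float> c(outNY, 0.0f);
    for (int u = 0; u < Nx; u++)
        for (int v = 0; v < outNY; v++)
            c[v] += crow[u][v + Ny] - crow[u][v];
    S[0] = c;

    // steps 3-4: slide down one row at a time, adding the new row's
    // contribution and removing the oldest row's contribution
    for (int u = Nx; u < NX; u++) {
        for (int v = 0; v < outNY; v++)
            c[v] += (crow[u][v + Ny] - crow[u][v])
                  - (crow[u - Nx][v + Ny] - crow[u - Nx][v]);
        S[u - Nx + 1] = c;
    }
    return S;
}
// ----------------------------------------------------------------------------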
|
||||
|
||||
#include "cuAmpcorUtil.h"
|
||||
#include <cfloat>
|
||||
#include <stdio.h>
|
||||
|
||||
// sum reduction within a block
// the following implementation is compatible with sm_20 and above
// newer architectures may support faster implementations, such as warp shuffle or cooperative groups
|
||||
template <const int Nthreads>
|
||||
__device__ float sumReduceBlock(float sum, volatile float *shmem)
|
||||
{
|
||||
|
|
@ -33,7 +51,7 @@ __device__ float sumReduceBlock(float sum, volatile float *shmem)
|
|||
return shmem[0];
|
||||
}
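// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// One possible warp-shuffle variant of the block sum reduction mentioned in
// the comment above, assuming blocks whose size is a multiple of 32 and an
// architecture/toolkit that provides __shfl_down_sync (sm_30+, CUDA 9+).
// warpSums must point to shared memory with at least Nthreads/32 floats. This
// is a sketch only; the shared-memory version above is what the library uses.
template <const int Nthreads>
__device__ float sumReduceBlockShfl(float sum, float *warpSums)
{
    const int lane = threadIdx.x & 31;
    const int warp = threadIdx.x >> 5;
    // reduce within each warp
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffff, sum, offset);
    // lane 0 of each warp publishes its partial sum
    if (lane == 0) warpSums[warp] = sum;
    __syncthreads();
    // the first warp folds the per-warp partial sums
    sum = (threadIdx.x < Nthreads / 32) ? warpSums[lane] : 0.0f;
    if (warp == 0)
        for (int offset = 16; offset > 0; offset >>= 1)
            sum += __shfl_down_sync(0xffffffff, sum, offset);
    if (threadIdx.x == 0) warpSums[0] = sum;
    __syncthreads();
    return warpSums[0];
}
// ----------------------------------------------------------------------------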
|
||||
|
||||
/* subtracts mean value from the images */
|
||||
// cuda kernel to compute the mean value of the images
|
||||
template<const int Nthreads>
|
||||
__global__ void cuArraysMean_kernel(float *images, float *image_sum, int imageSize, float invSize, int nImages)
|
||||
{
|
||||
|
|
@ -49,28 +67,34 @@ __global__ void cuArraysMean_kernel(float *images, float *image_sum, int imageSi
|
|||
float *imageD = images + imageOffset;
|
||||
|
||||
float sum = 0.0f;
|
||||
// perform the reduction beyond one block
|
||||
// save the results for each thread in block
|
||||
for (int i = tid; i < imageSize; i += Nthreads)
|
||||
sum += imageD[i];
|
||||
// reduction within the block
|
||||
sum = sumReduceBlock<Nthreads>(sum, shmem);
|
||||
|
||||
const float mean = sum * invSize;
|
||||
if(tid ==0) image_sum[bid] = mean;
|
||||
}
|
||||
|
||||
/**
|
||||
* Compute mean values for images
|
||||
* @param[in] images Input images
|
||||
* @param[out] mean Output mean values
|
||||
* @param[in] stream cudaStream
|
||||
*/
|
||||
void cuArraysMeanValue(cuArrays<float> *images, cuArrays<float> *mean, cudaStream_t stream)
|
||||
{
|
||||
const dim3 grid(images->count, 1, 1);
|
||||
//const int Nthreads=512;
|
||||
const int imageSize = images->width*images->height;
|
||||
const float invSize = 1.0f/imageSize;
|
||||
|
||||
cuArraysMean_kernel<512> <<<grid,512,0,stream>>>(images->devData, mean->devData, imageSize, invSize, images->count);
|
||||
cuArraysMean_kernel<NTHREADS> <<<grid,NTHREADS,0,stream>>>(images->devData, mean->devData, imageSize, invSize, images->count);
|
||||
getLastCudaError("cuArraysMeanValue kernel error\n");
|
||||
}
|
||||
|
||||
|
||||
|
||||
/* subtracts mean value from the images */
|
||||
// cuda kernel to compute and subtract the mean value from the images
|
||||
template<const int Nthreads>
|
||||
__global__ void cuArraysSubtractMean_kernel(float *images, int imageSize, float invSize, int nImages)
|
||||
{
|
||||
|
|
@ -85,30 +109,37 @@ __global__ void cuArraysSubtractMean_kernel(float *images, int imageSize, float
|
|||
const int imageOffset = imageIdx * imageSize;
|
||||
float *imageD = images + imageOffset;
|
||||
|
||||
// compute the sum
|
||||
float sum = 0.0f;
|
||||
for (int i = tid; i < imageSize; i += Nthreads)
|
||||
sum += imageD[i];
|
||||
sum = sumReduceBlock<Nthreads>(sum, shmem);
|
||||
|
||||
// compute the mean
|
||||
const float mean = sum * invSize;
|
||||
|
||||
// subtract the mean from each pixel
|
||||
for (int i = tid; i < imageSize; i += Nthreads)
|
||||
imageD[i] -= mean;
|
||||
}
|
||||
|
||||
/**
|
||||
 * Compute and subtract mean values from images
 * @param[inout] images Input/Output images
 * @param[in] stream cudaStream
|
||||
*/
|
||||
void cuArraysSubtractMean(cuArrays<float> *images, cudaStream_t stream)
|
||||
{
|
||||
const dim3 grid(images->count, 1, 1);
|
||||
//const int Nthreads=512;
|
||||
const int imageSize = images->width*images->height;
|
||||
const float invSize = 1.0f/imageSize;
|
||||
|
||||
cuArraysSubtractMean_kernel<512> <<<grid,512,0,stream>>>(images->devData, imageSize, invSize, images->count);
|
||||
cuArraysSubtractMean_kernel<NTHREADS> <<<grid,NTHREADS,0,stream>>>(images->devData, imageSize, invSize, images->count);
|
||||
getLastCudaError("cuArraysSubtractMean kernel error\n");
|
||||
}
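// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// Host-side equivalent of cuArraysSubtractMean for one image: compute the mean
// over all pixels and subtract it in place. The kernel above does the same per
// image, using a block-wide sum reduction instead of the serial loop.
static void subtractMeanHost(float *image, int imageSize)
{
    float sum = 0.0f;
    for (int i = 0; i < imageSize; i++) sum += image[i];
    const float mean = sum / imageSize;
    for (int i = 0; i < imageSize; i++) image[i] -= mean;
}
// ----------------------------------------------------------------------------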
|
||||
|
||||
|
||||
// Summation on extracted correlation surface (Minyan)
|
||||
// cuda kernel to compute summation on extracted correlation surface (Minyan)
|
||||
template<const int Nthreads>
|
||||
__global__ void cuArraysSumCorr_kernel(float *images, int *imagesValid, float *imagesSum, int *imagesValidCount, int imageSize, int nImages)
|
||||
{
|
||||
|
|
@ -141,24 +172,26 @@ __global__ void cuArraysSumCorr_kernel(float *images, int *imagesValid, float *i
|
|||
}
|
||||
}
|
||||
|
||||
void cuArraysSumCorr(cuArrays<float> *images, cuArrays<int> *imagesValid, cuArrays<float> *imagesSum, cuArrays<int> *imagesValidCount, cudaStream_t stream)
|
||||
/**
|
||||
* Compute the variance of images (for SNR)
|
||||
* @param[in] images Input images
|
||||
* @param[in] imagesValid validity flags for each pixel
|
||||
* @param[out] imagesSum variance
|
||||
* @param[out] imagesValidCount count of total valid pixels
|
||||
* @param[in] stream cudaStream
|
||||
*/
|
||||
void cuArraysSumCorr(cuArrays<float> *images, cuArrays<int> *imagesValid, cuArrays<float> *imagesSum,
|
||||
cuArrays<int> *imagesValidCount, cudaStream_t stream)
|
||||
{
|
||||
const dim3 grid(images->count, 1, 1);
|
||||
//const int Nthreads=512;
|
||||
const int imageSize = images->width*images->height;
|
||||
|
||||
cuArraysSumCorr_kernel<512> <<<grid,512,0,stream>>>(images->devData, imagesValid->devData,
|
||||
cuArraysSumCorr_kernel<NTHREADS> <<<grid,NTHREADS,0,stream>>>(images->devData, imagesValid->devData,
|
||||
imagesSum->devData, imagesValidCount->devData, imageSize, images->count);
|
||||
|
||||
getLastCudaError("cuArraysSumValueCorr kernel error\n");
|
||||
|
||||
}
|
||||
|
||||
// end of summation on extracted correlation surface (Minyan)
|
||||
|
||||
|
||||
|
||||
/* intra-block inclusive prefix sum */
|
||||
// intra-block inclusive prefix sum
|
||||
template<int Nthreads2>
|
||||
__device__ void inclusive_prefix_sum(float sum, volatile float *shmem)
|
||||
{
|
||||
|
|
@ -177,7 +210,7 @@ __device__ void inclusive_prefix_sum(float sum, volatile float *shmem)
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
// prefix sum of pixel value and pixel value^2
|
||||
template<const int Nthreads2>
|
||||
__device__ float2 partialSums(const float v, volatile float* shmem, const int stride)
|
||||
{
|
||||
|
|
@ -190,12 +223,10 @@ __device__ float2 partialSums(const float v, volatile float* shmem, const int st
|
|||
inclusive_prefix_sum<Nthreads2>(v*v, shMem2);
|
||||
const float Sum = shMem [tid-1 + stride] - shMem [tid-1];
|
||||
const float Sum2 = shMem2[tid-1 + stride] - shMem2[tid-1];
|
||||
//__syncthreads();
|
||||
|
||||
return make_float2(Sum, Sum2);
|
||||
}
|
||||
|
||||
|
||||
// cuda kernel for cuCorrNormalize
|
||||
template<const int Nthreads2>
|
||||
__global__ void cuCorrNormalize_kernel(
|
||||
int nImages,
|
||||
|
|
@ -211,9 +242,6 @@ __global__ void cuCorrNormalize_kernel(
|
|||
const int imageIdx = blockIdx.z;
|
||||
if (imageIdx >= nImages) return;
|
||||
|
||||
//if(tid ==0 ) printf("debug corrNorm, %d %d %d %d %d %d %d %d %d\n", templateNX, templateNY, templateSize,
|
||||
//imageNX, imageNY, imageSize, resultNX, resultNY, resultSize);
|
||||
|
||||
const int imageOffset = imageIdx * imageSize;
|
||||
const int templateOffset = imageIdx * templateSize;
|
||||
const int resultOffset = imageIdx * resultSize;
|
||||
|
|
@ -222,17 +250,7 @@ __global__ void cuCorrNormalize_kernel(
|
|||
const float *templateD = templateIn + templateOffset;
|
||||
float * resultD = resultOut + resultOffset;
|
||||
|
||||
/*template sum squar */
|
||||
|
||||
float templateSum = 0.0f;
|
||||
|
||||
for(uint i=tid; i<templateSize; i+=Nthreads)
|
||||
{
|
||||
templateSum += templateD[i];
|
||||
}
|
||||
templateSum = sumReduceBlock<Nthreads>(templateSum, shmem);
|
||||
__syncthreads();
|
||||
|
||||
// template sum^2
|
||||
float templateSum2 = 0.0f;
|
||||
for (int i = tid; i < templateSize; i += Nthreads)
|
||||
{
|
||||
|
|
@ -242,94 +260,123 @@ __global__ void cuCorrNormalize_kernel(
|
|||
templateSum2 = sumReduceBlock<Nthreads>(templateSum2, shmem);
|
||||
__syncthreads();
|
||||
|
||||
//if(tid ==0) printf("template sum %d %g %g \n", imageIdx, templateSum, templateSum2);
|
||||
/*********/
|
||||
|
||||
// reset shared memory value
|
||||
shmem[tid] = shmem[tid + Nthreads] = shmem[tid + 2*Nthreads] = 0.0f;
|
||||
__syncthreads();
|
||||
|
||||
// perform the prefix sum and sum^2 for secondary window
|
||||
// see notes above
|
||||
float imageSum = 0.0f;
|
||||
float imageSum2 = 0.0f;
|
||||
int iaddr = 0;
|
||||
const int windowSize = templateNX*imageNY;
|
||||
// iterate until reaching row templateNX of the secondary window,
// at which point the first row of the correlation surface can be computed
|
||||
while (iaddr < windowSize)
|
||||
{
|
||||
// cumulative sum for each row over a width of templateNY
|
||||
const float2 res = partialSums<Nthreads2>(imageD[iaddr + tid], shmem, templateNY);
|
||||
// add to the total, which keeps track of the sum of area for each window
|
||||
imageSum += res.x;
|
||||
imageSum2 += res.y;
|
||||
// move to next row
|
||||
iaddr += imageNY;
|
||||
}
|
||||
|
||||
// row reaches the end of first batch of windows
|
||||
// normalize the first row of the correlation surface
|
||||
if (tid < resultNY)
|
||||
{
|
||||
//if(blockIdx.z ==0) printf("image sum %d %g %g \n", tid, imageSum*templateCoeff, sqrtf(imageSum2*templateCoeff));
|
||||
|
||||
// normalizing factor
|
||||
const float norm2 = (imageSum2 - imageSum*imageSum*templateCoeff)*templateSum2;
|
||||
// normalize the correlation surface
|
||||
resultD[tid] *= rsqrtf(norm2 + FLT_EPSILON);
|
||||
}
|
||||
|
||||
/*********/
|
||||
|
||||
// iterate over the remaining rows
|
||||
while (iaddr < imageSize)
|
||||
{
|
||||
// the prefix sum of the row removed is recomputed, to be subtracted
|
||||
const float2 res1 = partialSums<Nthreads2>(imageD[iaddr-windowSize + tid], shmem, templateNY);
|
||||
// the prefix sum of the new row, to be added
|
||||
const float2 res2 = partialSums<Nthreads2>(imageD[iaddr + tid], shmem, templateNY);
|
||||
imageSum += res2.x - res1.x;
|
||||
imageSum2 += res2.y - res1.y;
|
||||
// move to next row
|
||||
iaddr += imageNY;
|
||||
|
||||
// normalize the correlation surface
|
||||
if (tid < resultNY)
|
||||
{
|
||||
const int ix = iaddr/imageNY;
|
||||
const int addr = (ix-templateNX)*resultNY;
|
||||
|
||||
//printf("test norm %d %d %d %d %f\n", tid, ix, addr, addr+tid, resultD[addr + tid]);
|
||||
|
||||
const int ix = iaddr/imageNY; // get row index
|
||||
const int addr = (ix-templateNX)*resultNY; // get the correlation surface row index
|
||||
const float norm2 = (imageSum2 - imageSum*imageSum*templateCoeff)*templateSum2;
|
||||
resultD[addr + tid] *= rsqrtf(norm2 + FLT_EPSILON);
|
||||
}
|
||||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Normalize a correlation surface
|
||||
* @param[in] templates Reference windows with mean subtracted
|
||||
* @param[in] images Secondary windows
|
||||
* @param[inout] results un-normalized correlation surface as input and normalized as output
|
||||
* @param[in] stream cudaStream
|
||||
 * @warning The current implementation uses one thread per column; therefore,
 * the secondary window width is limited to <=1024, the maximum number of threads in a block.
 * @todo an implementation for arbitrary window widths, which might not be as efficient
|
||||
*/
|
||||
void cuCorrNormalize(cuArrays<float> *templates, cuArrays<float> *images, cuArrays<float> *results, cudaStream_t stream)
|
||||
{
|
||||
const int nImages = images->count;
|
||||
const int imageNY = images->width;
|
||||
const dim3 grid(1, 1, nImages);
|
||||
const float invTemplateSize = 1.0f/templates->size;
|
||||
//printf("test normalize %d %g\n", templates->size, invTemplateSize);
|
||||
if (imageNY <= 64) cuCorrNormalize_kernel< 6><<<grid, 64, 0, stream>>>(nImages,
|
||||
|
||||
if (imageNY <= 64) {
|
||||
cuCorrNormalize_kernel< 6><<<grid, 64, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size,
|
||||
invTemplateSize);
|
||||
else if (imageNY <= 128) cuCorrNormalize_kernel< 7><<<grid, 128, 0, stream>>>(nImages,
|
||||
getLastCudaError("cuCorrNormalize kernel error");
|
||||
}
|
||||
else if (imageNY <= 128) {
|
||||
cuCorrNormalize_kernel< 7><<<grid, 128, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size,
|
||||
invTemplateSize);
|
||||
else if (imageNY <= 256) cuCorrNormalize_kernel< 8><<<grid, 256, 0, stream>>>(nImages,
|
||||
getLastCudaError("cuCorrNormalize kernel error");
|
||||
}
|
||||
else if (imageNY <= 256) {
|
||||
cuCorrNormalize_kernel< 8><<<grid, 256, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size,
|
||||
invTemplateSize);
|
||||
else if (imageNY <= 512) cuCorrNormalize_kernel< 9><<<grid, 512, 0, stream>>>(nImages,
|
||||
getLastCudaError("cuCorrNormalize kernel error");
|
||||
}
|
||||
else if (imageNY <= 512) {
|
||||
cuCorrNormalize_kernel< 9><<<grid, 512, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size,
|
||||
invTemplateSize);
|
||||
else if (imageNY <= 1024) cuCorrNormalize_kernel<10><<<grid,1024, 0, stream>>>(nImages,
|
||||
getLastCudaError("cuCorrNormalize kernel error");
|
||||
}
|
||||
else if (imageNY <= 1024) {
|
||||
cuCorrNormalize_kernel<10><<<grid,1024, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size,
|
||||
invTemplateSize);
|
||||
getLastCudaError("cuCorrNormalize kernel error");
|
||||
}
|
||||
else
|
||||
{
|
||||
fprintf(stderr, "The image size along across direction %d should be smaller than 1024.\n", imageNY);
|
||||
assert(0);
|
||||
}
|
||||
getLastCudaError("cuCorrNormalize kernel error\n");
|
||||
|
||||
fprintf(stderr, "The (oversampled) window size along the across direction %d should be smaller than 1024.\n", imageNY);
|
||||
throw;
|
||||
}
|
||||
|
||||
|
||||
}
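// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// The per-pixel normalization applied by the kernel above, written out on the
// host. rawCorr is the un-normalized correlation value; patchSum/patchSum2 are
// the sum and sum of squares of the secondary patch under the current window
// (the sum-area quantities tracked above); templateSum2 is the sum of squares
// of the (mean-subtracted) reference window; templateSize = Nx*Ny. The kernel
// uses rsqrtf instead of the division below.
#include <cmath>
#include <cfloat>

static float normalizeCorr(float rawCorr, float patchSum, float patchSum2,
                           float templateSum2, int templateSize)
{
    // variance-like term of the secondary patch times the template energy
    const float norm2 = (patchSum2 - patchSum * patchSum / templateSize)
                        * templateSum2;
    return rawCorr / std::sqrt(norm2 + FLT_EPSILON);
}
// ----------------------------------------------------------------------------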
|
||||
|
||||
// end of file
@ -1,16 +1,20 @@
/*
|
||||
* cuCorrTimetime.cu
|
||||
* correlation between two sets of images in time domain
|
||||
* @file cuCorrTimetime.cu
|
||||
* @brief Correlation between two sets of images in time domain
|
||||
*
|
||||
* This code is adapted from the nxcor package.
|
||||
*/
|
||||
|
||||
#include "cuAmpcorUtil.h"
|
||||
|
||||
|
||||
// cuda kernel for cuCorrTimeDomain
|
||||
template<const int nthreads, const int NPT>
|
||||
__global__ void cuArraysCorrTime_kernel(
|
||||
const int nImages,
|
||||
const float *templateIn, const int templateNY, const int templateNX, const int templateSize,
|
||||
const float *imageIn, const int imageNY, const int imageNX, const int imageSize,
|
||||
float *resultOut, const int resultNY, const int resultNX, const int resultSize)
|
||||
const float *templateIn, const int templateNX, const int templateNY, const int templateSize,
|
||||
const float *imageIn, const int imageNX, const int imageNY, const int imageSize,
|
||||
float *resultOut, const int resultNX, const int resultNY, const int resultSize)
|
||||
{
|
||||
__shared__ float shmem[nthreads*(1+NPT)];
|
||||
const int tid = threadIdx.x;
|
||||
|
|
@ -26,14 +30,14 @@ __global__ void cuArraysCorrTime_kernel(
|
|||
const float *templateD = templateIn + templateOffset + tid;
|
||||
float * resultD = resultOut + resultOffset;
|
||||
|
||||
const int q = min(nthreads/resultNX, 4);
|
||||
const int q = min(nthreads/resultNY, 4);
|
||||
const int nt = nthreads/q;
|
||||
const int ty = threadIdx.x / nt;
|
||||
const int tx = threadIdx.x - nt * ty;
|
||||
|
||||
const int templateNXq = templateNX/q;
|
||||
const int jbeg = templateNXq * ty;
|
||||
const int jend = ty+1 >= q ? templateNX : templateNXq + jbeg;
|
||||
const int templateNYq = templateNY/q;
|
||||
const int jbeg = templateNYq * ty;
|
||||
const int jend = ty+1 >= q ? templateNY : templateNYq + jbeg;
|
||||
|
||||
float *shTemplate = shmem;
|
||||
float *shImage = shmem + nthreads;
|
||||
|
|
@ -43,13 +47,13 @@ __global__ void cuArraysCorrTime_kernel(
|
|||
for (int k = 0; k < NPT; k++)
|
||||
corrCoeff[k] = 0.0f;
|
||||
|
||||
int iaddr = yc*imageNX;
|
||||
int iaddr = yc*imageNY;
|
||||
|
||||
|
||||
float img[NPT];
|
||||
for (int k = 0; k < NPT-1; k++, iaddr += imageNX)
|
||||
for (int k = 0; k < NPT-1; k++, iaddr += imageNY)
|
||||
img[k] = imageD[iaddr];
|
||||
for (int taddr = 0; taddr < templateSize; taddr += templateNX, iaddr += imageNX)
|
||||
for (int taddr = 0; taddr < templateSize; taddr += templateNY, iaddr += imageNY)
|
||||
{
|
||||
shTemplate[tid] = templateD[taddr];
|
||||
img [NPT-1] = imageD[iaddr];
|
||||
|
|
@ -59,7 +63,7 @@ __global__ void cuArraysCorrTime_kernel(
|
|||
img[k] = img[k+1];
|
||||
__syncthreads();
|
||||
|
||||
if (tx < resultNX && ty < q)
|
||||
if (tx < resultNY && ty < q)
|
||||
{
|
||||
#pragma unroll 8
|
||||
for (int j = jbeg; j < jend; j++)
|
||||
|
|
@ -78,16 +82,22 @@ __global__ void cuArraysCorrTime_kernel(
|
|||
corrCoeff[k] += shmem[j + nthreads*k];
|
||||
__syncthreads();
|
||||
|
||||
if (tid < resultNX)
|
||||
if (tid < resultNY)
|
||||
{
|
||||
int raddr = yc*resultNX + tid;
|
||||
for (int k = 0; k < NPT; k++, raddr += resultNX)
|
||||
int raddr = yc*resultNY + tid;
|
||||
for (int k = 0; k < NPT; k++, raddr += resultNY)
|
||||
if (raddr < resultSize)
|
||||
resultD[raddr] = corrCoeff[k];
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Perform cross correlation in time domain
|
||||
* @param[in] templates Reference images
|
||||
* @param[in] images Secondary images
|
||||
* @param[out] results Output correlation surface
|
||||
* @param[in] stream cudaStream
|
||||
*/
|
||||
void cuCorrTimeDomain(cuArrays<float> *templates,
|
||||
cuArrays<float> *images,
|
||||
cuArrays<float> *results,
|
||||
|
|
@ -95,52 +105,84 @@ void cuCorrTimeDomain(cuArrays<float> *templates,
|
|||
{
|
||||
/* compute correlation matrix */
|
||||
const int nImages = images->count;
|
||||
const int imageNX = images->width;
|
||||
const int imageNY = images->width;
|
||||
const int NPT = 8;
|
||||
|
||||
|
||||
const dim3 grid(nImages, (results->width-1)/NPT+1, 1);
|
||||
//fprintf(stderr, "corrTimeDomain %d %d %d\n", imageNX, templates->height, results->height);
|
||||
if (imageNX <= 64) cuArraysCorrTime_kernel< 64,NPT><<<grid, 64, 0, stream>>>(nImages,
|
||||
if (imageNY <= 64) {
|
||||
cuArraysCorrTime_kernel< 64,NPT><<<grid, 64, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 128) cuArraysCorrTime_kernel< 128,NPT><<<grid, 128, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 192) cuArraysCorrTime_kernel< 192,NPT><<<grid, 192, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 256) cuArraysCorrTime_kernel< 256,NPT><<<grid, 256, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 384) cuArraysCorrTime_kernel< 384,NPT><<<grid, 384, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 512) cuArraysCorrTime_kernel< 512,NPT><<<grid, 512, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 640) cuArraysCorrTime_kernel< 640,NPT><<<grid, 640, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 768) cuArraysCorrTime_kernel< 768,NPT><<<grid, 768, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 896) cuArraysCorrTime_kernel< 896,NPT><<<grid, 896, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else if (imageNX <= 1024) cuArraysCorrTime_kernel<1024,NPT><<<grid,1024, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
else assert(0);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 128) {
|
||||
cuArraysCorrTime_kernel< 128,NPT><<<grid, 128, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 192) {
|
||||
cuArraysCorrTime_kernel< 192,NPT><<<grid, 192, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 256) {
|
||||
cuArraysCorrTime_kernel< 256,NPT><<<grid, 256, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 384) {
|
||||
cuArraysCorrTime_kernel< 384,NPT><<<grid, 384, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 512) {
|
||||
cuArraysCorrTime_kernel< 512,NPT><<<grid, 512, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 640) {
|
||||
cuArraysCorrTime_kernel< 640,NPT><<<grid, 640, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 768) {
|
||||
cuArraysCorrTime_kernel< 768,NPT><<<grid, 768, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 896) {
|
||||
cuArraysCorrTime_kernel< 896,NPT><<<grid, 896, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else if (imageNY <= 1024) {
|
||||
cuArraysCorrTime_kernel<1024,NPT><<<grid,1024, 0, stream>>>(nImages,
|
||||
templates->devData, templates->height, templates->width, templates->size,
|
||||
images->devData, images->height, images->width, images->size,
|
||||
results->devData, results->height, results->width, results->size);
|
||||
getLastCudaError("cuArraysCorrTime error");
|
||||
}
|
||||
else {
|
||||
fprintf(stderr, "The (oversampled) window size along the across direction %d should be smaller than 1024.\n", imageNY);
|
||||
throw;
|
||||
}
|
||||
}
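// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// Naive host-side version of the time-domain cross correlation computed by
// cuArraysCorrTime_kernel for one reference/secondary pair: every offset (u,v)
// of the result is the dot product between the reference window and the
// secondary patch shifted by (u,v). Row-major storage is assumed.
static void corrTimeDomainHost(const float *tmpl, int tNX, int tNY,
                               const float *img, int iNX, int iNY,
                               float *result /* (iNX-tNX+1) x (iNY-tNY+1) */)
{
    const int rNX = iNX - tNX + 1, rNY = iNY - tNY + 1;
    for (int u = 0; u < rNX; u++)
        for (int v = 0; v < rNY; v++) {
            float sum = 0.0f;
            for (int x = 0; x < tNX; x++)
                for (int y = 0; y < tNY; y++)
                    sum += tmpl[x * tNY + y] * img[(x + u) * iNY + (y + v)];
            result[u * rNY + v] = sum;
        }
}
// ----------------------------------------------------------------------------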
|
||||
// end of file
@ -1,12 +1,15 @@
/*
|
||||
* cuDeramp.cu
|
||||
* Derampling a batch of 2D complex images with GPU
|
||||
* @file cuDeramp.cu
|
||||
* @brief Derampling a batch of 2D complex images with GPU
|
||||
*
|
||||
* Method 1: use Fortran code algorithm
|
||||
* Method 2: use phase gradient
|
||||
* Method 0 or else: no deramping
|
||||
 * A phase ramp is equivalent to a frequency shift in the frequency domain,
 * which needs to be removed (deramping) in order to move the band center
 * to zero. This is necessary before oversampling a complex signal.
|
||||
* Method 1: each signal is decomposed into real and imaginary parts,
|
||||
* and the average phase shift is obtained as atan(\sum imag / \sum real).
|
||||
* The average is weighted by the amplitudes (coherence).
|
||||
* Method 0 or else: skip deramping
|
||||
*
|
||||
* v1.0 2/1/2017, Lijun Zhu
|
||||
*/
|
||||
|
||||
#include "cuArrays.h"
|
||||
|
|
@ -19,10 +22,11 @@
|
|||
#include <iomanip>
|
||||
#include <cmath>
|
||||
#include <limits>
|
||||
// note by Lijun
|
||||
// cuda does not have a good support on volatile vector struct, e.g. float2
|
||||
// I have to use regular float type for shared memory (volatile)
|
||||
|
||||
|
||||
// cuda does not have good support for volatile vector structs, e.g. float2
// have to use the regular float type for (volatile) shared memory data
// the following methods are defined to operate on float2/complex objects through floats
|
||||
inline static __device__ void copyToShared(volatile float *s, const int i, const float2 x, const int block)
|
||||
{ s[i] = x.x; s[i+block] = x.y; }
|
||||
|
||||
|
|
@ -34,102 +38,7 @@ inline static __device__ void addInShared(volatile float *s, const int i, const
|
|||
{ s[i] += s[i+j]; s[i+block] += s[i+j+block];}
|
||||
|
||||
|
||||
__device__ void debugPhase(float2 c1, float2 c2)
|
||||
{
|
||||
float2 cp = complexMulConj(c1, c2);
|
||||
float phase = atan2f(cp.y, cp.x);
|
||||
}
|
||||
|
||||
|
||||
template <const int nthreads>
|
||||
__device__ float sumReduceBlock(float sum, volatile float *shmem)
|
||||
{
|
||||
const int tid = threadIdx.x;
|
||||
shmem[tid] = sum;
|
||||
__syncthreads();
|
||||
|
||||
if (nthreads >=1024) { if (tid < 512) { shmem[tid] = sum = sum + shmem[tid + 512]; } __syncthreads(); }
|
||||
if (nthreads >= 512) { if (tid < 256) { shmem[tid] = sum = sum + shmem[tid + 256]; } __syncthreads(); }
|
||||
if (nthreads >= 256) { if (tid < 128) { shmem[tid] = sum = sum + shmem[tid + 128]; } __syncthreads(); }
|
||||
if (nthreads >= 128) { if (tid < 64) { shmem[tid] = sum = sum + shmem[tid + 64]; } __syncthreads(); }
|
||||
if (tid < 32)
|
||||
{
|
||||
shmem[tid] = sum = sum + shmem[tid + 32];
|
||||
shmem[tid] = sum = sum + shmem[tid + 16];
|
||||
shmem[tid] = sum = sum + shmem[tid + 8];
|
||||
shmem[tid] = sum = sum + shmem[tid + 4];
|
||||
shmem[tid] = sum = sum + shmem[tid + 2];
|
||||
shmem[tid] = sum = sum + shmem[tid + 1];
|
||||
}
|
||||
|
||||
__syncthreads();
|
||||
return shmem[0];
|
||||
}
|
||||
|
||||
|
||||
template<const int nthreads>
|
||||
__global__ void cuDerampMethod2_kernel(float2 *images, const int imageNX, const int imageNY,
|
||||
const int imageSize, const int nImages, const float normCoefX, const float normCoefY)
|
||||
{
|
||||
|
||||
__shared__ float shmem[nthreads];
|
||||
const int tid = threadIdx.x;
|
||||
const int bid = blockIdx.x;
|
||||
//printf("bid %d\n", bid);
|
||||
float2 *imageD = images + bid*imageSize;
|
||||
|
||||
int pixelIdx, pixelIdxX, pixelIdxY;
|
||||
|
||||
float phaseDiffY = 0.0f;
|
||||
for (int i = tid; i < imageSize; i += nthreads) {
|
||||
pixelIdx = i;
|
||||
pixelIdxY = pixelIdx % imageNY;
|
||||
if(pixelIdxY < imageNY -1)
|
||||
{
|
||||
phaseDiffY += complexArg(complexMulConj(imageD[pixelIdx], imageD[pixelIdx+1]));
|
||||
}
|
||||
}
|
||||
|
||||
phaseDiffY=sumReduceBlock<nthreads>(phaseDiffY, shmem);
|
||||
phaseDiffY*=normCoefY;
|
||||
|
||||
float phaseDiffX = 0.0f;
|
||||
for (int i = tid; i < imageSize; i += nthreads) {
|
||||
pixelIdx = i;
|
||||
pixelIdxX = pixelIdx / imageNY;
|
||||
if(pixelIdxX < imageNX -1)
|
||||
{
|
||||
phaseDiffX += complexArg(complexMulConj(imageD[pixelIdx], imageD[pixelIdx+imageNY]));
|
||||
}
|
||||
}
|
||||
phaseDiffX=sumReduceBlock<nthreads>(phaseDiffX, shmem);
|
||||
phaseDiffX*=normCoefX;
|
||||
|
||||
for (int i = tid; i < imageSize; i += nthreads)
|
||||
{
|
||||
int pixelIdx = i;
|
||||
pixelIdxX = pixelIdx/imageNY;
|
||||
pixelIdxY = pixelIdx%imageNY;
|
||||
float phase = pixelIdxX*phaseDiffX + pixelIdxY*phaseDiffY;
|
||||
imageD[pixelIdx] *= make_float2(cosf(phase), sinf(phase));
|
||||
}
|
||||
}
|
||||
|
||||
|
||||
void cuDerampMethod2(cuArrays<float2> *images, cudaStream_t stream)
|
||||
{
|
||||
const dim3 grid(images->count);
|
||||
const int nthreads=512;
|
||||
const int imageSize = images->width*images->height;
|
||||
const float normCoefY = 1.0f/((images->width-1)*images->height);
|
||||
const float normCoefX = 1.0f/((images->height-1)*images->width);
|
||||
cuDerampMethod2_kernel<nthreads> <<<grid, 512,0,stream>>>
|
||||
(images->devData, images->height, images->width, imageSize, images->count, normCoefX, normCoefY);
|
||||
getLastCudaError("cuDerampMethod2 kernel error\n");
|
||||
}
|
||||
|
||||
|
||||
|
||||
// kernel to do sum reduction for float2 within a block
|
||||
template <const int nthreads>
|
||||
__device__ void complexSumReduceBlock(float2& sum, volatile float *shmem)
|
||||
{
|
||||
|
|
@ -154,9 +63,7 @@ __device__ void complexSumReduceBlock(float2& sum, volatile float *shmem)
|
|||
copyFromShared(sum, shmem, 0, nthreads);
|
||||
}
|
||||
|
||||
|
||||
// block id is the image index
|
||||
// thread id ranges over all pixels in one image
|
||||
// cuda kernel for cuDerampMethod1
|
||||
template<const int nthreads>
|
||||
__global__ void cuDerampMethod1_kernel(float2 *images, const int imageNX, int const imageNY,
|
||||
const int imageSize, const int nImages, const float normCoef)
|
||||
|
|
@ -180,7 +87,6 @@ __global__ void cuDerampMethod1_kernel(float2 *images, const int imageNX, int co
|
|||
complexSumReduceBlock<nthreads>(phaseDiffY, shmem);
|
||||
//phaseDiffY *= normCoef;
|
||||
float phaseY=atan2f(phaseDiffY.y, phaseDiffY.x);
|
||||
//__syncthreads();
|
||||
|
||||
float2 phaseDiffX = make_float2(0.0f, 0.0f);
|
||||
for (int i = tid; i < imageSize; i += nthreads) {
|
||||
|
|
@ -207,12 +113,17 @@ __global__ void cuDerampMethod1_kernel(float2 *images, const int imageNX, int co
|
|||
}
|
||||
}
|
||||
|
||||
|
||||
/**
|
||||
* Deramp a complex signal with Method 1
|
||||
* @brief Each signal is decomposed into real and imaginary parts,
|
||||
* and the average phase shift is obtained as atan(\sum imag / \sum real).
|
||||
* @param[inout] images input/output complex signals
|
||||
* @param[in] stream cuda stream
|
||||
*/
|
||||
void cuDerampMethod1(cuArrays<float2> *images, cudaStream_t stream)
|
||||
{
|
||||
|
||||
const dim3 grid(images->count);
|
||||
//int nthreads;
|
||||
const int imageSize = images->width*images->height;
|
||||
const float invSize = 1.0f/imageSize;
|
||||
|
||||
|
|
@ -232,115 +143,19 @@ void cuDerampMethod1(cuArrays<float2> *images, cudaStream_t stream)
|
|||
cuDerampMethod1_kernel<512> <<<grid, 512, 0, stream>>>
|
||||
(images->devData, images->height, images->width,
|
||||
imageSize, images->count, invSize); }
|
||||
|
||||
getLastCudaError("cuDerampMethod1 kernel error\n");
|
||||
|
||||
|
||||
}
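// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// The per-direction phase-ramp estimate behind Method 1, for one image and one
// direction (between horizontally adjacent pixels). Summing the conjugate
// products before taking atan2 weights the average by the pixel amplitudes, as
// noted in the file header. This assumes complexMulConj(a, b) follows the
// conj(a)*b convention used elsewhere in the package.
#include <complex>
#include <cmath>

static float phaseRampAlongRow(const std::complex<float> *z, int height, int width)
{
    std::complex<float> acc(0.0f, 0.0f);
    for (int i = 0; i < height; i++)
        for (int j = 0; j < width - 1; j++)
            acc += std::conj(z[i * width + j]) * z[i * width + j + 1];
    return std::atan2(acc.imag(), acc.real()); // average phase increment per pixel
}
// ----------------------------------------------------------------------------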
|
||||
|
||||
|
||||
|
||||
/*
|
||||
static inline double complexAbs (double2 a)
|
||||
{
|
||||
double r = sqrt(a.x*a.x + a.y*a.y);
|
||||
return r;
|
||||
}*/
|
||||
|
||||
|
||||
|
||||
void cpuDerampMethod3(cuArrays<float2> *imagesD, cudaStream_t stream)
|
||||
{
|
||||
float2 *images = (float2 *) malloc(imagesD->getByteSize());
|
||||
float2 phaseDiffX, phaseDiffY;
|
||||
int idxPixel;
|
||||
|
||||
|
||||
cudaMemcpyAsync(images, imagesD->devData, imagesD->getByteSize(), cudaMemcpyDeviceToHost, stream);
|
||||
|
||||
int count = imagesD->count;
|
||||
int height = imagesD->height;
|
||||
int width = imagesD->width;
|
||||
float2 cprod;
|
||||
float phaseX, phaseY;
|
||||
for (int icount = 0; icount < count; icount ++)
|
||||
{
|
||||
phaseDiffY = make_float2(0.0f, 0.0f);
|
||||
for (int i=0; i<height; i++)
|
||||
{
|
||||
for(int j=0; j<width-1; j++)
|
||||
{
|
||||
idxPixel = icount*width*height + i*width + j;
|
||||
cprod = complexMulConj(images[idxPixel], images[idxPixel+1]);
|
||||
phaseDiffY.x += (cprod.x);
|
||||
phaseDiffY.y += (cprod.y);
|
||||
}
|
||||
}
|
||||
//phaseDiffY /= height*(width-1);
|
||||
if (complexAbs(phaseDiffY) < 1.e-5) {
|
||||
phaseY = 0.0;
|
||||
}
|
||||
else {
|
||||
phaseY = atan2(phaseDiffY.y, phaseDiffY.x);
|
||||
}
|
||||
|
||||
phaseDiffX = make_float2(0.0f, 0.0f);
|
||||
for (int j=0; j<width; j++)
|
||||
{
|
||||
for(int i=0; i<height-1; i++) {
|
||||
idxPixel = icount*width*height + i*width + j;
|
||||
cprod = complexMulConj(images[idxPixel], images[idxPixel+width]);
|
||||
phaseDiffX.x += (cprod.x);
|
||||
phaseDiffX.y += (cprod.y);;
|
||||
}
|
||||
}
|
||||
//phaseDiffX /= (height-1)*width;
|
||||
if (complexAbs(phaseDiffX) < 1.e-5) {
|
||||
phaseX = 0.0;
|
||||
}
|
||||
else {
|
||||
phaseX = atan2(phaseDiffX.y, phaseDiffX.x);
|
||||
}
|
||||
|
||||
//printf("cpu deramp %d (%g,%g) (%g,%g)\n", icount, phaseDiffX.x, phaseDiffX.y, phaseDiffY.x, phaseDiffY.y);
|
||||
|
||||
/*
|
||||
std::setprecision(12);
|
||||
std::cout << "cpu " << icount << " " <<
|
||||
std::setprecision(std::numeric_limits<long double>::digits10 + 1) << phaseX <<
|
||||
" " << std::setprecision(std::numeric_limits<long double>::digits10 + 1) << phaseY << std::endl;
|
||||
std::cout << "cpu " << phaseDiffX.x << " " << phaseDiffX.y << std::endl;
|
||||
std::cout << "cpu " << phaseDiffY.x << " " << phaseDiffY.y << std::endl;
|
||||
*/
|
||||
for(int i=0; i<height; i++)
|
||||
{
|
||||
for(int j=0; j<width; j++)
|
||||
{
|
||||
idxPixel = icount*width*height + i*width + j;
|
||||
float phase = phaseX*i + phaseY*j;
|
||||
images[idxPixel]*=make_float2(cos(phase), sin(phase));
|
||||
}
|
||||
}
|
||||
}
|
||||
cudaMemcpyAsync(imagesD->devData, images, imagesD->getByteSize(), cudaMemcpyHostToDevice, stream);
|
||||
free(images);
|
||||
}
|
||||
|
||||
void cuDeramp(int method, cuArrays<float2> *images, cudaStream_t stream)
|
||||
{
|
||||
// methods 2-3 are for test purposes only, removed for release
|
||||
// note method 0 is designed for TOPSAR: not only is deramping skipped,
// but the amplitude is also taken before oversampling
|
||||
switch(method) {
|
||||
//case 3:
|
||||
// cpuDerampMethod3(images, stream);
|
||||
case 1:
|
||||
cuDerampMethod1(images, stream);
|
||||
break;
|
||||
//case 2:
|
||||
// cuDerampMethod2(images, stream);
|
||||
break;
|
||||
default:
|
||||
break;
|
||||
}
|
||||
}
|
||||
|
||||
// end of file
@ -1,7 +1,8 @@
/*
|
||||
cuEstimateStats.cu
|
||||
|
||||
9/23/2017, Minyan Zhong
|
||||
/**
|
||||
* @file cuEstimateStats.cu
|
||||
* @brief Estimate the statistics of the correlation surface
|
||||
*
|
||||
* 9/23/2017, Minyan Zhong
|
||||
*/
|
||||
|
||||
#include "cuArrays.h"
|
||||
|
|
@ -15,7 +16,7 @@
|
|||
#include <cmath>
|
||||
#include <limits>
|
||||
|
||||
template <const int BLOCKSIZE>
|
||||
// cuda kernel for cuEstimateSnr
|
||||
__global__ void cudaKernel_estimateSnr(const float* corrSum, const int* corrValidCount, const float* maxval, float* snrValue, const int size)
|
||||
|
||||
{
|
||||
|
|
@ -28,50 +29,25 @@ __global__ void cudaKernel_estimateSnr(const float* corrSum, const int* corrVali
|
|||
snrValue[idx] = maxval[idx] * maxval[idx] / mean;
|
||||
}
|
||||
|
||||
/**
|
||||
* Estimate the signal to noise ratio (SNR) of the correlation surface
|
||||
* @param[in] corrSum the sum of the correlation surface
|
||||
 * @param[in] corrValidCount the number of valid pixels contributing to the sum
 * @param[in] maxval the maximum value of the correlation surface
 * @param[out] snrValue the returned SNR value
|
||||
* @param[in] stream cuda stream
|
||||
*/
|
||||
void cuEstimateSnr(cuArrays<float> *corrSum, cuArrays<int> *corrValidCount, cuArrays<float> *maxval, cuArrays<float> *snrValue, cudaStream_t stream)
|
||||
{
|
||||
|
||||
int size = corrSum->getSize();
|
||||
|
||||
//std::cout<<size<<std::endl;
|
||||
|
||||
//corrSum->allocateHost();
|
||||
|
||||
//corrSum->copyToHost(stream);
|
||||
|
||||
//std::cout<<"corr sum"<<std::endl;
|
||||
|
||||
//corrValidCount->allocateHost();
|
||||
|
||||
//corrValidCount->copyToHost(stream);
|
||||
|
||||
//std::cout<<"valid count"<<std::endl;
|
||||
|
||||
//maxval->allocateHost();
|
||||
|
||||
//maxval->copyToHost(stream);
|
||||
|
||||
//std::cout<<"maxval"<<std::endl;
|
||||
|
||||
|
||||
//for (int i=0; i<size; i++){
|
||||
|
||||
// std::cout<<corrSum->hostData[i]<<std::endl;
|
||||
// std::cout<<corrValidCount->hostData[i]<<std::endl;
|
||||
|
||||
// std::cout<<maxval->hostData[i]<<std::endl;
|
||||
|
||||
//}
|
||||
|
||||
cudaKernel_estimateSnr<NTHREADS><<< IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>
|
||||
cudaKernel_estimateSnr<<< IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>
|
||||
(corrSum->devData, corrValidCount->devData, maxval->devData, snrValue->devData, size);
|
||||
|
||||
getLastCudaError("cuda kernel estimate stats error\n");
|
||||
}
|
||||
|
||||
|
||||
template <const int BLOCKSIZE> // number of threads per block.
|
||||
__global__ void cudaKernel_estimateVar(const float* corrBatchRaw, const int NX, const int NY, const int2* maxloc, const float* maxval, float3* covValue, const int size)
|
||||
// cuda kernel for cuEstimateVariance
|
||||
__global__ void cudaKernel_estimateVar(const float* corrBatchRaw, const int NX, const int NY,
|
||||
const int2* maxloc, const float* maxval, float3* covValue, const int size)
|
||||
{
|
||||
|
||||
// Find image id.
|
||||
|
|
@ -135,13 +111,20 @@ __global__ void cudaKernel_estimateVar(const float* corrBatchRaw, const int NX,
|
|||
}
|
||||
}
|
||||
|
||||
/**
|
||||
* Estimate the variance of the correlation surface
|
||||
* @param[in] corrBatchRaw correlation surface
|
||||
* @param[in] maxloc maximum location
|
||||
* @param[in] maxval maximum value
|
||||
* @param[out] covValue variance value
|
||||
* @param[in] stream cuda stream
|
||||
*/
|
||||
void cuEstimateVariance(cuArrays<float> *corrBatchRaw, cuArrays<int2> *maxloc, cuArrays<float> *maxval, cuArrays<float3> *covValue, cudaStream_t stream)
|
||||
{
|
||||
|
||||
int size = corrBatchRaw->count;
|
||||
|
||||
// One dimensional launching parameters to loop over every correlation surface.
|
||||
cudaKernel_estimateVar<NTHREADS><<< IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>
|
||||
cudaKernel_estimateVar<<< IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>
|
||||
(corrBatchRaw->devData, corrBatchRaw->height, corrBatchRaw->width, maxloc->devData, maxval->devData, covValue->devData, size);
|
||||
getLastCudaError("cudaKernel_estimateVar error\n");
|
||||
}
|
||||
//end of file
@ -1,31 +1,13 @@
/*
|
||||
* maxlocation.cu
|
||||
* Purpose: find the location of maximum for a batch of images/vectors
|
||||
* this uses the reduction algorithm similar to summations
|
||||
* @file cuOffset.cu
|
||||
* @brief Utilities used to determine the offset field
|
||||
*
|
||||
* Author : Lijun Zhu
|
||||
* Seismo Lab, Caltech
|
||||
* Version 1.0 10/01/16
|
||||
*/
|
||||
|
||||
#include "cuAmpcorUtil.h"
|
||||
#include <cfloat>
|
||||
|
||||
/*
|
||||
__device__ float atomicMaxf(float* address, float val)
|
||||
{
|
||||
int *address_as_int =(int*)address;
|
||||
int old = *address_as_int, assumed;
|
||||
while (val > __int_as_float(old)) {
|
||||
assumed = old;
|
||||
old = atomicCAS(address_as_int, assumed,
|
||||
__float_as_int(val));
|
||||
}
|
||||
return __int_as_float(old);
|
||||
}*/
|
||||
|
||||
|
||||
// compare two elements
|
||||
// find the max between two elements
|
||||
inline static __device__ void maxPairReduce(volatile float* maxval, volatile int* maxloc,
|
||||
size_t gid, size_t strideid)
|
||||
{
|
||||
|
|
@ -35,7 +17,7 @@ inline static __device__ void maxPairReduce(volatile float* maxval, volatile int
|
|||
}
|
||||
}
|
||||
|
||||
// max reduction kernel, save the results to shared memory
|
||||
// max reduction kernel
|
||||
template<const int BLOCKSIZE>
|
||||
__device__ void max_reduction(const float* const images,
|
||||
const size_t imageSize,
|
||||
|
|
@ -48,8 +30,8 @@ __device__ void max_reduction(const float* const images,
|
|||
int imageStart = blockIdx.x*imageSize;
|
||||
int imagePixel;
|
||||
|
||||
// reduction for elements with i, i+BLOCKSIZE, i+2*BLOCKSIZE ...
|
||||
//
|
||||
// reduction for intra-block elements
|
||||
// i.e., for elements with i, i+BLOCKSIZE, i+2*BLOCKSIZE ...
|
||||
for(int gid = tid; gid < imageSize; gid+=blockDim.x)
|
||||
{
|
||||
imagePixel = imageStart+gid;
|
||||
|
|
@ -78,65 +60,11 @@ __device__ void max_reduction(const float* const images,
|
|||
__syncthreads();
|
||||
}
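// --- Editor's illustrative sketch (not part of PyCuAmpcor) ------------------
// Serial reference for what max_reduction produces for one image: the value
// and flat index of its largest pixel. The device version splits the image
// across BLOCKSIZE threads and then folds the per-thread candidates with
// maxPairReduce in shared memory.
static void maxValLocHost(const float *image, int imageSize,
                          float *maxval, int *maxloc)
{
    float best = image[0];
    int bestIdx = 0;
    for (int i = 1; i < imageSize; i++)
        if (image[i] > best) { best = image[i]; bestIdx = i; }
    *maxval = best;
    *maxloc = bestIdx;
}
// ----------------------------------------------------------------------------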

//kernel and function for 1D array, find both max value and location

// kernel for 2D array(image), find max location only
template <const int BLOCKSIZE>
__global__ void cuMaxValLoc_kernel( const float* const images, float *maxval, int* maxloc, const size_t imageSize, const size_t nImages)
{
__shared__ float shval[BLOCKSIZE];
__shared__ int shloc[BLOCKSIZE];
int bid = blockIdx.x;
if(bid >= nImages) return;

max_reduction<BLOCKSIZE>(images, imageSize, nImages, shval, shloc);

if (threadIdx.x == 0) {
maxloc[bid] = shloc[0];
maxval[bid] = shval[0];
}
}

void cuArraysMaxValandLoc(cuArrays<float> *images, cuArrays<float> *maxval, cuArrays<int> *maxloc, cudaStream_t stream)
{
const size_t imageSize = images->size;
const size_t nImages = images->count;
dim3 threadsperblock(NTHREADS);
dim3 blockspergrid(nImages);
cuMaxValLoc_kernel<NTHREADS><<<blockspergrid, threadsperblock, 0, stream>>>
(images->devData, maxval->devData, maxloc->devData, imageSize, nImages);
getLastCudaError("cudaKernel fine max location error\n");
}

//kernel and function for 1D array, find max location only
template <const int BLOCKSIZE>
__global__ void cudaKernel_maxloc(const float* const images, int* maxloc,
const size_t imageSize, const size_t nImages)
{
__shared__ float shval[BLOCKSIZE];
__shared__ int shloc[BLOCKSIZE];

int bid = blockIdx.x;
if(bid >=nImages) return;

max_reduction<BLOCKSIZE>(images, imageSize, nImages, shval, shloc);

if (threadIdx.x == 0) {
maxloc[bid] = shloc[0];
}
}

void cuArraysMaxLoc(cuArrays<float> *images, cuArrays<int> *maxloc, cudaStream_t stream)
{
int imageSize = images->size;
int nImages = maxloc->size;

cudaKernel_maxloc<NTHREADS><<<nImages, NTHREADS,0, stream>>>
(images->devData, maxloc->devData, imageSize, nImages);
getLastCudaError("cudaKernel find max location 1D error\n");
}

//kernel and function for 2D array(image), find max location only
template <const int BLOCKSIZE>
__global__ void cudaKernel_maxloc2D(const float* const images, int2* maxloc, float* maxval, const size_t imageNX, const size_t imageNY, const size_t nImages)
__global__ void cudaKernel_maxloc2D(const float* const images, int2* maxloc, float* maxval,
const size_t imageNX, const size_t imageNY, const size_t nImages)
{
__shared__ float shval[BLOCKSIZE];
__shared__ int shloc[BLOCKSIZE];

@@ -153,6 +81,14 @@ __global__ void cudaKernel_maxloc2D(const float* const images, int2* maxloc, fl
}
}

/**
* Find both the maximum value and the location for a batch of 2D images
* @param[in] images input batch of images
* @param[out] maxval arrays to hold the max values
* @param[out] maxloc arrays to hold the max locations
* @param[in] stream cudaStream
* @note This routine is overloaded with the routine without maxval
*/
void cuArraysMaxloc2D(cuArrays<float> *images, cuArrays<int2> *maxloc,
cuArrays<float> *maxval, cudaStream_t stream)
{

@@ -181,6 +117,13 @@ __global__ void cudaKernel_maxloc2D(const float* const images, int2* maxloc, co
}
}

/**
* Find (only) the maximum location for a batch of 2D images
* @param[in] images input batch of images
* @param[out] maxloc arrays to hold the max locations
* @param[in] stream cudaStream
* @note This routine is overloaded with the routine with maxval
*/
void cuArraysMaxloc2D(cuArrays<float> *images, cuArrays<int2> *maxloc, cudaStream_t stream)
{
cudaKernel_maxloc2D<NTHREADS><<<images->count, NTHREADS, 0, stream>>>

@@ -188,10 +131,7 @@ void cuArraysMaxloc2D(cuArrays<float> *images, cuArrays<int2> *maxloc, cudaStrea
getLastCudaError("cudaKernel find max location 2D error\n");
}

//determine final offset values
// cuda kernel for cuSubPixelOffset
__global__ void cuSubPixelOffset_kernel(const int2 *offsetInit, const int2 *offsetZoomIn,
float2 *offsetFinal,
const float OSratio,

@@ -204,69 +144,47 @@ __global__ void cuSubPixelOffset_kernel(const int2 *offsetInit, const int2 *offs
}

/// determine the final offset value
/// @param[in]

void cuSubPixelOffset(cuArrays<int2> *offsetInit, cuArrays<int2> *offsetZoomIn, cuArrays<float2> *offsetFinal,
/**
* Determine the final offset value
* @param[in] offsetInit max location (adjusted to the starting location for extraction) determined from
* the cross-correlation before oversampling, in dimensions of pixel
* @param[in] offsetZoomIn max location from the oversampled cross-correlation surface
* @param[out] offsetFinal the combined offset value
* @param[in] OversampleRatioZoomIn the correlation surface oversampling factor
* @param[in] OversampleRatioRaw the oversampling factor of reference/secondary windows before cross-correlation
* @param[in] xHalfRangeInit the original half search range along x, to be subtracted
* @param[in] yHalfRangeInit the original half search range along y, to be subtracted
*
* 1. Cross-correlation is performed at first for the un-oversampled data with a larger search range.
* The secondary window is then extracted to a smaller size (a smaller search range) around the max location.
* The extraction starting location (offsetInit) - original half search range (xHalfRangeInit, yHalfRangeInit)
* = pixel size offset
* 2. Reference/secondary windows are then oversampled by OversampleRatioRaw, and cross-correlated.
* 3. The correlation surface is further oversampled by OversampleRatioZoomIn.
* The overall oversampling factor is OversampleRatioZoomIn*OversampleRatioRaw.
* The max location in the oversampled correlation surface (offsetZoomIn) / overall oversampling factor
* = subpixel offset
* Final offset = pixel size offset + subpixel offset
*/
void cuSubPixelOffset(cuArrays<int2> *offsetInit, cuArrays<int2> *offsetZoomIn,
cuArrays<float2> *offsetFinal,
int OverSampleRatioZoomin, int OverSampleRatioRaw,
int xHalfRangeInit, int yHalfRangeInit,
int xHalfRangeZoomIn, int yHalfRangeZoomIn,
cudaStream_t stream)
{
int size = offsetInit->getSize();
float OSratio = 1.0f/(float)(OverSampleRatioZoomin*OverSampleRatioRaw);
float xoffset = xHalfRangeInit;
float yoffset = yHalfRangeInit;
//std::cout << "subpixel" << xoffset << " " << yoffset << " ratio " << OSratio << std::endl;

cuSubPixelOffset_kernel<<<IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>
(offsetInit->devData, offsetZoomIn->devData,
offsetFinal->devData, OSratio, xoffset, yoffset, size);
getLastCudaError("cuSubPixelOffset_kernel");
//offsetInit->debuginfo(stream);
//offsetZoomIn->debuginfo(stream);

}
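
To make the pixel + subpixel combination described in the docstring above concrete, here is a small stand-alone arithmetic check. All values and variable names below are made up for illustration; the sketch only reproduces the formula stated above (pixel offset = offsetInit - half range, subpixel offset = offsetZoomIn / (OverSampleRatioZoomin * OverSampleRatioRaw)).

// --- illustrative sketch only, not part of this commit ---
#include <cstdio>

int main()
{
    // assumed example values
    int offsetInitX    = 37;   // extraction start from the first (un-oversampled) pass
    int xHalfRangeInit = 32;   // original half search range along x
    int offsetZoomInX  = 12;   // max location on the oversampled correlation surface
    int ratioRaw       = 2;    // OverSampleRatioRaw
    int ratioZoomIn    = 8;    // OverSampleRatioZoomin

    float OSratio        = 1.0f / (float)(ratioZoomIn * ratioRaw);    // 1/16
    float pixelOffset    = (float)(offsetInitX - xHalfRangeInit);     // 37 - 32 = 5 pixels
    float subpixelOffset = OSratio * (float)offsetZoomInX;            // 12/16  = 0.75 pixel
    printf("final offset = %f\n", pixelOffset + subpixelOffset);      // 5.75 pixels
    return 0;
}
// --- end of illustrative sketch ---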

static inline __device__ int dev_padStart(const size_t padDim, const size_t imageDim, const size_t maxloc)
{
int halfPadSize = padDim/2;
int start = maxloc - halfPadSize;
if(start <0) start =0;
else if(maxloc > imageDim-halfPadSize-1) start = imageDim-padDim-1;
return start;
}

//cuda kernel for cuda_determineInterpZone
__global__ void cudaKernel_determineInterpZone(const int2* maxloc, const size_t nImages,
const size_t imageNX, const size_t imageNY,
const size_t padNX, const size_t padNY, int2* padOffset)
{
int imageIndex = threadIdx.x + blockDim.x *blockIdx.x; //image index
if (imageIndex < nImages) {
padOffset[imageIndex].x = dev_padStart(padNX, imageNX, maxloc[imageIndex].x);
padOffset[imageIndex].y = dev_padStart(padNY, imageNY, maxloc[imageIndex].y);
}
}

/*
* determine the interpolation area (pad) from the max location and the padSize
* the pad will be (maxloc-padSize/2, maxloc+padSize/2-1)
* @param[in] maxloc[nImages]
* @param[in] padSize
* @param[in] imageSize
* @param[in] nImages
* @param[out] padStart[nImages] return values of maxloc-padSize/2
*/
void cuDetermineInterpZone(cuArrays<int2> *maxloc, cuArrays<int2> *zoomInOffset, cuArrays<float> *corrOrig, cuArrays<float> *corrZoomIn, cudaStream_t stream)
{
int threadsperblock=NTHREADS;
int blockspergrid=IDIVUP(corrOrig->count, threadsperblock);
cudaKernel_determineInterpZone<<<blockspergrid, threadsperblock, 0, stream>>>
(maxloc->devData, maxloc->size, corrOrig->height, corrOrig->width, corrZoomIn->height, corrZoomIn->width, zoomInOffset->devData);
}
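
As a sanity check of the clamping performed by dev_padStart, here is a host-side mirror of the same arithmetic with a few made-up values. The helper below is local to this example and is not a library function; it only re-states the logic shown above so the edge cases are easy to see.

// --- illustrative sketch only, not part of this commit ---
#include <cstdio>

static int padStart(int padDim, int imageDim, int maxloc)
{
    int halfPadSize = padDim / 2;
    int start = maxloc - halfPadSize;
    if (start < 0) start = 0;                          // clamp at the left/top edge
    else if (maxloc > imageDim - halfPadSize - 1)
        start = imageDim - padDim - 1;                 // clamp at the right/bottom edge
    return start;
}

int main()
{
    // padDim = 16, imageDim = 64
    printf("%d %d %d\n",
           padStart(16, 64, 30),    // interior peak   -> 22
           padStart(16, 64, 3),     // near left edge  -> 0
           padStart(16, 64, 60));   // near right edge -> 47
    return 0;
}
// --- end of illustrative sketch ---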


// cuda device function to compute the shift of center
static inline __device__ int2 dev_adjustOffset(
const int oldRange, const int newRange, const int maxloc)
{

@@ -303,6 +221,7 @@ static inline __device__ int2 dev_adjustOffset(
return make_int2(start, shift);
}

// cuda kernel for cuDetermineSecondaryExtractOffset
__global__ void cudaKernel_determineSecondaryExtractOffset(int2 * maxLoc, int2 *shift,
const size_t nImages, int xOldRange, int yOldRange, int xNewRange, int yNewRange)
{

@@ -319,8 +238,15 @@ __global__ void cudaKernel_determineSecondaryExtractOffset(int2 * maxLoc, int2 *
}
}

///@param[in] xOldRange, yOldRange are (half) search ranges in first step
///@param[in] x
/**
* Determine the secondary window extract offset from the max location
* @param[in] xOldRange, yOldRange are (half) search ranges in the first step
* @param[in] xNewRange, yNewRange are (half) search ranges in the second (oversampling) step
*
* After the first run of cross-correlation, with a larger search range,
* we now choose a smaller search range around the max location for oversampling.
* This procedure is used to determine the starting pixel locations for extraction.
*/
void cuDetermineSecondaryExtractOffset(cuArrays<int2> *maxLoc, cuArrays<int2> *maxLocShift,
int xOldRange, int yOldRange, int xNewRange, int yNewRange, cudaStream_t stream)
{

@@ -330,29 +256,6 @@ void cuDetermineSecondaryExtractOffset(cuArrays<int2> *maxLoc, cuArrays<int2> *m
(maxLoc->devData, maxLocShift->devData, maxLoc->size, xOldRange, yOldRange, xNewRange, yNewRange);
}


__global__ void cudaKernel_maxlocPlusZoominOffset(float *offset, const int * padStart, const int * maxlocUpSample,
const size_t nImages, float zoomInRatioX, float zoomInRatioY)
{
int imageIndex = threadIdx.x + blockDim.x *blockIdx.x; //image index
if (imageIndex < nImages)
{
int index=2*imageIndex;
offset[index] = padStart[index] + maxlocUpSample[index] * zoomInRatioX;
index++;
offset[index] = padStart[index] + maxlocUpSample[index] * zoomInRatioY;
}
}

void cuda_maxlocPlusZoominOffset(float *offset, const int * padStart, const int * maxlocUpSample,
const size_t nImages, float zoomInRatioX, float zoomInRatioY)
{
int threadsperblock=NTHREADS;
int blockspergrid = IDIVUP(nImages, threadsperblock);
cudaKernel_maxlocPlusZoominOffset<<<blockspergrid,threadsperblock>>>(offset, padStart, maxlocUpSample,
nImages, zoomInRatioX, zoomInRatioY);
}
// end of file

@@ -1,15 +1,25 @@
/*
* cuOverSampler.cu
* define cuOverSampler class, to save cufft plans and perform oversampling calculations
* @file cuOverSampler.cu
* @brief Implementations of cuOverSamplerR2R (C2C) class
*/
#include "cuArrays.h"

// my declarations
#include "cuOverSampler.h"

// dependencies
#include "cuArrays.h"
#include "cuArrays.h"
#include "cudaUtil.h"
#include "cudaError.h"
#include "cuAmpcorUtil.h"

// Oversampler for complex data
/**
* Constructor for cuOverSamplerC2C
* @param input image size inNX x inNY
* @param output image size outNX x outNY
* @param nImages batches
* @param stream_ cuda stream
*/
cuOverSamplerC2C::cuOverSamplerC2C(int inNX, int inNY, int outNX, int outNY, int nImages, cudaStream_t stream_)
{

@@ -25,10 +35,13 @@ cuOverSamplerC2C::cuOverSamplerC2C(int inNX, int inNY, int outNX, int outNY, int
int outNYp2 = inNYp2*outNY/inNY;
*/

// set up work arrays
workIn = new cuArrays<float2>(inNXp2, inNYp2, nImages);
workIn->allocate();
workOut = new cuArrays<float2>(outNXp2, outNYp2, nImages);
workOut->allocate();

// set up fft plans
int imageSize = inNXp2*inNYp2;
int n[NRANK] ={inNXp2, inNYp2};
int fImageSize = inNXp2*inNYp2;

@@ -36,9 +49,13 @@ cuOverSamplerC2C::cuOverSamplerC2C(int inNX, int inNY, int outNX, int outNY, int
int fImageOverSampleSize = outNXp2*outNYp2;
cufft_Error(cufftPlanMany(&forwardPlan, NRANK, n, NULL, 1, imageSize, NULL, 1, fImageSize, CUFFT_C2C, nImages));
cufft_Error(cufftPlanMany(&backwardPlan, NRANK, nOverSample, NULL, 1, fImageOverSampleSize, NULL, 1, fImageOverSampleSize, CUFFT_C2C, nImages));
// set cuda stream
setStream(stream_);
}

/**
* Set up cuda stream
*/
void cuOverSamplerC2C::setStream(cudaStream_t stream_)
{
this->stream = stream_;

@@ -46,16 +63,12 @@ void cuOverSamplerC2C::setStream(cudaStream_t stream_)
cufftSetStream(backwardPlan, stream);
}

//tested
void cuOverSamplerC2C::execute(cuArrays<float2> *imagesIn, cuArrays<float2> *imagesOut)
{
//cuArraysCopyPadded(imagesIn, workIn, stream);
cufft_Error(cufftExecC2C(forwardPlan, imagesIn->devData, workIn->devData, CUFFT_INVERSE));
cuArraysPaddingMany(workIn, workOut, stream);
cufft_Error(cufftExecC2C(backwardPlan, workOut->devData, imagesOut->devData, CUFFT_FORWARD));
//cuArraysCopyExtract(workOut, imagesOut, make_int2(0,0), stream);
}

/**
* Execute fft oversampling
* @param[in] imagesIn input batch of images
* @param[out] imagesOut output batch of images
* @param[in] method phase deramping method
*/
void cuOverSamplerC2C::execute(cuArrays<float2> *imagesIn, cuArrays<float2> *imagesOut, int method)
{
cuDeramp(method, imagesIn, stream);

@@ -64,31 +77,41 @@ void cuOverSamplerC2C::execute(cuArrays<float2> *imagesIn, cuArrays<float2> *ima
cufft_Error(cufftExecC2C(backwardPlan, workOut->devData, imagesOut->devData, CUFFT_FORWARD));
}

/// destructor
cuOverSamplerC2C::~cuOverSamplerC2C()
{
// destroy fft handles
cufft_Error(cufftDestroy(forwardPlan));
cufft_Error(cufftDestroy(backwardPlan));
// deallocate work arrays
delete(workIn);
delete(workOut);
}

// end of cuOverSamplerC2C
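
A hedged usage sketch of the C2C oversampler: it relies only on the constructor and execute(imagesIn, imagesOut, deramp_method) declared in cuOverSampler.h and on the cuArrays allocation pattern used in the constructor above. The image sizes, the wrapper function name, and the surrounding stream setup are assumptions made for illustration, not code from the repository.

// --- illustrative sketch only, not part of this commit ---
#include "cuOverSampler.h"
#include "cuArrays.h"
#include <cuda_runtime.h>

void oversampleExample()
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    const int inNX = 64, inNY = 64, outNX = 128, outNY = 128, nImages = 32;
    cuArrays<float2> *in  = new cuArrays<float2>(inNX,  inNY,  nImages);
    cuArrays<float2> *out = new cuArrays<float2>(outNX, outNY, nImages);
    in->allocate();
    out->allocate();
    // ... fill in->devData with the complex windows to be oversampled ...

    cuOverSamplerC2C oversampler(inNX, inNY, outNX, outNY, nImages, stream);
    oversampler.execute(in, out, 0);     // deramp, forward FFT, spectrum padding, inverse FFT
    cudaStreamSynchronize(stream);

    delete in;
    delete out;
    cudaStreamDestroy(stream);
}
// --- end of illustrative sketch ---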

// oversampler for real data
/**
* Constructor for cuOverSamplerR2R
* @param input image size inNX x inNY
* @param output image size outNX x outNY
* @param nImages the number of images
* @param stream_ cuda stream
*/
cuOverSamplerR2R::cuOverSamplerR2R(int inNX, int inNY, int outNX, int outNY, int nImages, cudaStream_t stream)
{

/*
int inNXp2 = nextpower2(inNX);
int inNYp2 = nextpower2(inNY);
int outNXp2 = inNXp2*outNX/inNX;
int outNYp2 = inNYp2*outNY/inNY;
*/

int inNXp2 = inNX;
int inNYp2 = inNY;
int outNXp2 = outNX;
int outNYp2 = outNY;

/* if expanded to 2^n
int inNXp2 = nextpower2(inNX);
int inNYp2 = nextpower2(inNY);
int outNXp2 = inNXp2*outNX/inNX;
int outNYp2 = inNYp2*outNY/inNY;
*/

int imageSize = inNXp2 *inNYp2;
int n[NRANK] ={inNXp2, inNYp2};
int fImageSize = inNXp2*inNYp2;

@@ -110,7 +133,11 @@ void cuOverSamplerR2R::setStream(cudaStream_t stream_)
cufftSetStream(backwardPlan, stream);
}

//tested
/**
* Execute fft oversampling
* @param[in] imagesIn input batch of images
* @param[out] imagesOut output batch of images
*/
void cuOverSamplerR2R::execute(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut)
{
cuArraysCopyPadded(imagesIn, workSizeIn, stream);

@@ -120,6 +147,7 @@ void cuOverSamplerR2R::execute(cuArrays<float> *imagesIn, cuArrays<float> *image
cuArraysCopyExtract(workSizeOut, imagesOut, make_int2(0,0), stream);
}

/// destructor
cuOverSamplerR2R::~cuOverSamplerR2R()
{
cufft_Error(cufftDestroy(forwardPlan));

@@ -128,6 +156,7 @@ cuOverSamplerR2R::~cuOverSamplerR2R()
workSizeOut->deallocate();
}

// end of file


@@ -1,10 +1,11 @@
/*
* cuOverSampler.h
* oversampling with FFT padding method
* define cuOverSampler class, to save cufft plans and perform oversampling calculations
* one float image use cuOverSamplerR2R
* one complex image use cuOverSamplerC2C
* many complex images use cuOverSamplerManyC2C
* @file cuOverSampler.h
* @brief Oversampling with FFT padding method
*
* Define cuOverSampler class, to save cufft plans and perform oversampling calculations
* For float images use cuOverSamplerR2R
* For complex images use cuOverSamplerC2C
* @todo use template class to unify these two classes
*/

#ifndef __CUOVERSAMPLER_H

@@ -13,34 +14,40 @@
#include "cuArrays.h"
#include "cudaUtil.h"

// FFT Oversampler for complex images
class cuOverSamplerC2C
{
private:
cufftHandle forwardPlan;
cufftHandle backwardPlan;
cudaStream_t stream;
cuArrays<float2> *workIn;
cuArrays<float2> *workOut;
cufftHandle forwardPlan; // forward fft handle
cufftHandle backwardPlan; // backward fft handle
cudaStream_t stream; // cuda stream
cuArrays<float2> *workIn; // work array to hold forward fft data
cuArrays<float2> *workOut; // work array to hold padded data
public:
// disable the default constructor
cuOverSamplerC2C() = delete;
// constructor
cuOverSamplerC2C(int inNX, int inNY, int outNX, int outNY, int nImages, cudaStream_t stream_);
// set cuda stream
void setStream(cudaStream_t stream_);
void execute(cuArrays<float2> *imagesIn, cuArrays<float2> *imagesOut);
void execute(cuArrays<float2> *imagesIn, cuArrays<float2> *imagesOut, int deramp_method);
// execute oversampling
void execute(cuArrays<float2> *imagesIn, cuArrays<float2> *imagesOut, int deramp_method=0);
// destructor
~cuOverSamplerC2C();
};

// FFT Oversampler for real (float) images
class cuOverSamplerR2R
{
private:
cufftHandle forwardPlan;
cufftHandle backwardPlan;
cudaStream_t stream;
cuArrays<float2> *workSizeIn;
cuArrays<float2> *workSizeOut;
cudaStream_t stream;

public:
cuOverSamplerR2R() = delete;
cuOverSamplerR2R(int inNX, int inNY, int outNX, int outNY, int nImages, cudaStream_t stream_);
void setStream(cudaStream_t stream_);
void execute(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut);

@@ -48,7 +55,8 @@ public:
};

#endif
#endif //__CUOVERSAMPLER_H
// end of file


@@ -1,34 +1,40 @@
/*
* cuSincOverSampler.cu
/**
* @file cuSincOverSampler.cu
* @brief Implementation for cuSincOverSampler class
*
*/
#include "cuArrays.h"

// my declaration
#include "cuSincOverSampler.h"

// dependencies
#include "cuArrays.h"
#include "cudaUtil.h"
#include "cudaError.h"
#include "cuAmpcorUtil.h"

/**
* cuSincOverSamplerR2R constructor
* @param i_covs oversampling factor
* @param stream cuda stream
*/
cuSincOverSamplerR2R::cuSincOverSamplerR2R(const int i_covs_, cudaStream_t stream_)
: i_covs(i_covs_)
{
setStream(stream_);
stream = stream_;
i_intplength = int(r_relfiltlen/r_beta+0.5f);
i_filtercoef = i_intplength*i_decfactor;
checkCudaErrors(cudaMalloc((void **)&r_filter, (i_filtercoef+1)*sizeof(float)));
cuSetupSincKernel();
}

void cuSincOverSamplerR2R::setStream(cudaStream_t stream_)
{
stream = stream_;
}

/// destructor
cuSincOverSamplerR2R::~cuSincOverSamplerR2R()
{
checkCudaErrors(cudaFree(r_filter));
}

// cuda kernel for cuSetupSincKernel
__global__ void cuSetupSincKernel_kernel(float *r_filter_, const int i_filtercoef_,
const float r_soff_, const float r_wgthgt_, const int i_weight_,
const float r_soff_inverse_, const float r_beta_, const float r_decfactor_inverse_)

@@ -53,7 +59,9 @@ __global__ void cuSetupSincKernel_kernel(float *r_filter_, const int i_filtercoe
}
}

/**
* Set up the sinc interpolation kernel (coefficient)
*/
void cuSincOverSamplerR2R::cuSetupSincKernel()
{
const int nthreads = 128;

@@ -71,6 +79,8 @@ void cuSincOverSamplerR2R::cuSetupSincKernel()
getLastCudaError("cuSetupSincKernel_kernel");
}

// cuda kernel for cuSincOverSamplerR2R::execute
__global__ void cuSincInterpolation_kernel(const int nImages,
const float * imagesIn, const int inNX, const int inNY,
float * imagesOut, const int outNX, const int outNY,

@@ -83,7 +93,7 @@ __global__ void cuSincInterpolation_kernel(const int nImages,
// get the xy threads for output image pixel indices
int idxX = threadIdx.x + blockDim.x*blockIdx.x;
int idxY = threadIdx.y + blockDim.y*blockIdx.y;
// cuda: to make sure extra allocated threads do nothing
// cuda: to make sure extra allocated threads doing nothing
if(idxImage >=nImages || idxX >= i_int_size || idxY >= i_int_size) return;
// decide the center shift
int2 shift = centerShift[idxImage];

@@ -144,12 +154,19 @@ __global__ void cuSincInterpolation_kernel(const int nImages,
}
}
imagesOut[idxOut] = intpData/r_sincwgt;
//printf("test int kernel %d %d %f %f %f\n", outx, outy, intpData, r_sincwgt, imagesOut[idxOut]);
}

/**
* Execute sinc interpolation
* @param[in] imagesIn input images
* @param[out] imagesOut output images
* @param[in] centerShift the shift of interpolation center
* @param[in] rawOversamplingFactor the multiplier of the centerShift
* @note rawOversamplingFactor is for the centerShift, not the signal oversampling factor
*/

void cuSincOverSamplerR2R::execute(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut,
cuArrays<int2> *centerShift, int oversamplingFactor)
cuArrays<int2> *centerShift, int rawOversamplingFactor)
{
const int nImages = imagesIn->count;
const int inNX = imagesIn->height;

@@ -172,7 +189,7 @@ void cuSincOverSamplerR2R::execute(cuArrays<float> *imagesIn, cuArrays<float> *i
cuSincInterpolation_kernel<<<blockspergrid, threadsperblock, 0, stream>>>(nImages,
imagesIn->devData, inNX, inNY,
imagesOut->devData, outNX, outNY,
centerShift->devData, oversamplingFactor,
centerShift->devData, rawOversamplingFactor,
r_filter, i_covs, i_decfactor, i_intplength, i_int_startX, i_int_startY, i_int_size);
getLastCudaError("cuSincInterpolation_kernel");
}

@@ -1,32 +1,48 @@
/*
* cuSincOverSampler.h
* oversampling with sinc interpolation method
* @file cuSincOverSampler.h
* @brief A class performs sinc interpolation/oversampling
*
* Oversample a given 2d signal by i_covs factor.
* Only signals within (-i_sincwindow, i_sincwindow) are oversampled
* The interpolation zone may also be shifted, if the max location is not at the center.
*
* The sinc interpolation is based on the formula
* $$x(t) = \sum_{n=-\infty}^{\infty} x_n f( \Omega_c t-n )$$
* with $f(x) = \text{sinc}(x)$, or a complex filter
* such as the sinc(x) convolved with a Hamming window used here.
* In practice, a finite length of n (i_intplength) is used for interpolation.
*
* @note most parameters are currently hardwired; you need to change
* the source code below if you need to adjust the parameters.
*/

// code guard
#ifndef __CUSINCOVERSAMPLER_H
#define __CUSINCOVERSAMPLER_H

// dependencies
#include "cuArrays.h"
#include "cudaUtil.h"

#define PI 3.141592654f
#ifndef PI
#define PI 3.14159265359f
#endif

class cuSincOverSamplerR2R
{
private:
static const int i_sincwindow = 2;
// the oversampling is only performed within \pm i_sincwindow*i_covs around the peak
static const int i_weight = 1; // weight for cos() pedestal
///< the oversampling is only performed within \pm i_sincwindow*i_covs around the peak
static const int i_weight = 1; ///< weight for cos() pedestal
const float r_pedestal = 0.0f; ///< height of pedestal
const float r_beta = 0.75f; ///< a low-band pass
const float r_relfiltlen = 6.0f; ///< relative filter length

const float r_pedestal = 0.0f; // height of pedestal
const float r_beta = 0.75f; // a low-band pass
const float r_relfiltlen = 6.0f; // relative filter length
static const int i_decfactor = 4096; ///< max decimals between original grid to set up the sinc kernel

static const int i_decfactor = 4096; // decimals between original grid to set up the sinc kernel

int i_covs; // oversampling factor
int i_intplength; // actual filter length
int i_filtercoef; // length of the sinc kernel i_intplength*i_decfactor+1
int i_covs; ///< oversampling factor
int i_intplength; ///< actual filter length = r_relfiltlen/r_beta
int i_filtercoef; ///< length of the sinc kernel i_intplength*i_decfactor+1

float * r_filter; // sinc kernel with size i_filtercoef

@@ -35,17 +51,15 @@ class cuSincOverSamplerR2R
public:
// constructor
cuSincOverSamplerR2R(const int i_covs_, cudaStream_t stream_);
// local methods
void setStream(cudaStream_t stream_);
// set up sinc interpolation coefficients
void cuSetupSincKernel();
// call interface
// execute interface
void execute(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut, cuArrays<int2> *center, int oversamplingFactor);

// destructor
~cuSincOverSamplerR2R();
};

#endif // _CUSINCOVERSAMPLER_H

// end of file
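
The interpolation sum quoted in the header comment above can be exercised on the host with a plain truncated sinc, which is enough to see the idea. The sketch below is deliberately simpler than cuSincInterpolation_kernel (no Hamming window, no decimated coefficient table, no center shift), and every name in it is local to the example rather than part of the library.

// --- illustrative sketch only, not part of this commit ---
#include <cmath>
#include <cstdio>

static const float PI_F = 3.14159265359f;

static float sincf(float x)
{
    if (std::fabs(x) < 1e-6f) return 1.0f;
    return std::sin(PI_F * x) / (PI_F * x);
}

int main()
{
    // a band-limited signal sampled on the integer grid
    const int n = 64;
    float x[n];
    for (int i = 0; i < n; i++) x[i] = std::cos(2.0f * PI_F * 0.05f * i);

    // interpolate at a fractional position t with a finite window (2*L+1 taps)
    const float t = 20.25f;
    const int L = 8;
    const int t0 = (int)std::floor(t);
    float value = 0.0f;
    for (int k = -L; k <= L; k++)
    {
        int idx = t0 + k;
        if (idx >= 0 && idx < n) value += x[idx] * sincf(t - (float)idx);
    }
    printf("interpolated %.4f, exact %.4f\n", value, std::cos(2.0f * PI_F * 0.05f * t));
    return 0;
}
// --- end of illustrative sketch ---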

@@ -1,13 +1,10 @@
/**
* cudaError.h
* Purpose: check various errors in cuda/cufft calls
* Lijun Zhu
* Last modified 09/07/2017
* @file cudaError.h
* @brief Define error checking in cuda calls
*
**/

////////////////////////////////////////////////////////////////////////////////
// These are CUDA Helper functions for initialization and error checking

// code guard
#ifndef _CUDAERROR_CUH
#define _CUDAERROR_CUH

@@ -34,22 +31,20 @@
template<typename T >
void check(T result, char const *const func, const char *const file, int const line)
{
#ifdef CUDA_ERROR_CHECK
if (result)
{
fprintf(stderr, "CUDA error at %s:%d code=%d(%s) \n",
file, line, static_cast<unsigned int>(result), func);
DEVICE_RESET
// Make sure we call CUDA Device Reset before exiting
exit(EXIT_FAILURE);
}
#endif
}

// This will output the proper error string when calling cudaGetLastError
inline void __getLastCudaError(const char *errorMessage, const char *file, const int line)
{
#ifdef CUDA_ERROR_CHECK
cudaError_t err = cudaGetLastError();

if (cudaSuccess != err)

@@ -59,12 +54,17 @@ inline void __getLastCudaError(const char *errorMessage, const char *file, const
DEVICE_RESET
exit(EXIT_FAILURE);
}
#endif
}

// This will output the proper CUDA error strings in the event that a CUDA host call returns an error
#ifdef CUDA_ERROR_CHECK
#define checkCudaErrors(val) check ( (val), #val, __FILE__, __LINE__ )
#define cufft_Error(val) check ( (val), #val, __FILE__, __LINE__ )
#define getLastCudaError(var) __getLastCudaError (var, __FILE__, __LINE__)
#else
#define checkCudaErrors(val) val
#define cufft_Error(val) val
#define getLastCudaError(val)
#endif //CUDA_ERROR_CHECK

#endif //__CUDAERROR_CUH
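
A minimal usage sketch of the wrappers defined above: checkCudaErrors wraps runtime calls that return a status, while getLastCudaError is used after kernel launches, which return no status. The kernel, buffer name, and sizes below are placeholders written only for this example.

// --- illustrative sketch only, not part of this commit ---
#include <cuda_runtime.h>
#include "cudaError.h"

__global__ void fillOnes(float *buf, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if (i < n) buf[i] = 1.0f;
}

int main()
{
    const int n = 1024;
    float *d_buf = nullptr;
    checkCudaErrors(cudaMalloc((void **)&d_buf, n * sizeof(float)));  // wraps the returned cudaError_t
    fillOnes<<<(n + 127) / 128, 128>>>(d_buf, n);                     // a launch returns no status ...
    getLastCudaError("fillOnes kernel error\n");                      // ... so query the last error instead
    checkCudaErrors(cudaFree(d_buf));
    return 0;
}
// --- end of illustrative sketch ---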

@@ -1,11 +1,10 @@
/**
* cudaUtil.h
* Purpose: various cuda related parameters and utilities
* @file cudaUtil.h
* @brief Various cuda related parameters and utilities
*
* Some routines are adapted from Nvidia CUDA samples/common/inc/help_cuda.h
* Copyright 1993-2013 NVIDIA Corporation. All rights reserved.
*
*
**/

#ifndef __CUDAUTIL_H

@@ -54,7 +53,7 @@ inline int ftoi(float value)
return (value >= 0 ? (int)(value + 0.5) : (int)(value - 0.5));
}

// compute the next integer in power of 2
inline int nextpower2(int value)
{
int r=1;

@@ -62,153 +61,6 @@ inline int nextpower2(int value)
return r;
}

// Beginning of GPU Architecture definitions
inline int _ConvertSMVer2Cores(int major, int minor)
{
// Defines for GPU Architecture types (using the SM version to determine the # of cores per SM
typedef struct
{
int SM; // 0xMm (hexidecimal notation), M = SM Major version, and m = SM minor version
int Cores;
} sSMtoCores;

sSMtoCores nGpuArchCoresPerSM[] =
{
{ 0x20, 32 }, // Fermi Generation (SM 2.0) GF100 class
{ 0x21, 48 }, // Fermi Generation (SM 2.1) GF10x class
{ 0x30, 192}, // Kepler Generation (SM 3.0) GK10x class
{ 0x32, 192}, // Kepler Generation (SM 3.2) GK10x class
{ 0x35, 192}, // Kepler Generation (SM 3.5) GK11x class
{ 0x37, 192}, // Kepler Generation (SM 3.7) GK21x class
{ 0x50, 128}, // Maxwell Generation (SM 5.0) GM10x class
{ 0x52, 128}, // Maxwell Generation (SM 5.2) GM20x class
{ 0x53, 128}, // Maxwell Generation (SM 5.3) GM20x class
{ 0x60, 64 }, // Pascal Generation (SM 6.0) GP100 class
{ 0x61, 128}, // Pascal Generation (SM 6.1) GP10x class
{ 0x62, 128}, // Pascal Generation (SM 6.2) GP10x class
{ -1, -1 }
};

int index = 0;

while (nGpuArchCoresPerSM[index].SM != -1)
{
if (nGpuArchCoresPerSM[index].SM == ((major << 4) + minor))
{
return nGpuArchCoresPerSM[index].Cores;
}

index++;
}

// If we don't find the values, we default use the previous one to run properly
printf("MapSMtoCores for SM %d.%d is undefined. Default to use %d Cores/SM\n", major, minor, nGpuArchCoresPerSM[index-1].Cores);
return nGpuArchCoresPerSM[index-1].Cores;
}
// end of GPU Architecture definitions

#ifdef __CUDA_RUNTIME_H__

// This function returns the best GPU (with maximum GFLOPS)
inline int gpuGetMaxGflopsDeviceId()
{
int current_device = 0, sm_per_multiproc = 0;
int max_perf_device = 0;
int device_count = 0, best_SM_arch = 0;
int devices_prohibited = 0;

unsigned long long max_compute_perf = 0;
cudaDeviceProp deviceProp;
cudaGetDeviceCount(&device_count);

checkCudaErrors(cudaGetDeviceCount(&device_count));

if (device_count == 0)
{
fprintf(stderr, "gpuGetMaxGflopsDeviceId() CUDA error: no devices supporting CUDA.\n");
exit(EXIT_FAILURE);
}

// Find the best major SM Architecture GPU device
while (current_device < device_count)
{
cudaGetDeviceProperties(&deviceProp, current_device);

// If this GPU is not running on Compute Mode prohibited, then we can add it to the list
if (deviceProp.computeMode != cudaComputeModeProhibited)
{
if (deviceProp.major > 0 && deviceProp.major < 9999)
{
best_SM_arch = MAX(best_SM_arch, deviceProp.major);
}
}
else
{
devices_prohibited++;
}

current_device++;
}

if (devices_prohibited == device_count)
{
fprintf(stderr, "gpuGetMaxGflopsDeviceId() CUDA error: all devices have compute mode prohibited.\n");
exit(EXIT_FAILURE);
}

// Find the best CUDA capable GPU device
current_device = 0;

while (current_device < device_count)
{
cudaGetDeviceProperties(&deviceProp, current_device);

// If this GPU is not running on Compute Mode prohibited, then we can add it to the list
if (deviceProp.computeMode != cudaComputeModeProhibited)
{
if (deviceProp.major == 9999 && deviceProp.minor == 9999)
{
sm_per_multiproc = 1;
}
else
{
sm_per_multiproc = _ConvertSMVer2Cores(deviceProp.major, deviceProp.minor);
}

unsigned long long compute_perf = (unsigned long long) deviceProp.multiProcessorCount * sm_per_multiproc * deviceProp.clockRate;

//fprintf(stderr, "Device %d has performamce %llu.\n", current_device, compute_perf);

if (compute_perf > max_compute_perf)
{
/* Let the GPU with max flops win! --LJ
// If we find GPU with SM major > 2, search only these
if (best_SM_arch > 2)
{
// If our device==best_SM_arch, choose this, or else pass
if (deviceProp.major == best_SM_arch)
{
max_compute_perf = compute_perf;
max_perf_device = current_device;
}
}
else
{
max_compute_perf = compute_perf;
max_perf_device = current_device;
}
*/
max_compute_perf = compute_perf;
max_perf_device = current_device;

}
}

++current_device;
}

return max_perf_device;
}

// General GPU Device CUDA Initialization
inline int gpuDeviceInit(int devID)

@@ -225,8 +77,7 @@ inline int gpuDeviceInit(int devID)
if (devID < 0 || devID > device_count-1)
{
fprintf(stderr, "gpuDeviceInit() Device %d is not a valid GPU device. \n", devID);
fprintf(stderr, "gpuDeviceInit() Finding the GPU with max GFlops instead ...\n");
devID = gpuGetMaxGflopsDeviceId();
exit(EXIT_FAILURE);
}

checkCudaErrors(cudaSetDevice(devID));

@@ -235,7 +86,6 @@ inline int gpuDeviceInit(int devID)
return devID;
}

// This function lists all available GPUs
inline void gpuDeviceList()
{

@@ -267,9 +117,7 @@ inline void gpuDeviceList()
}
current_device++;
}
fprintf(stderr, "Device %d has the max Gflops\n", gpuGetMaxGflopsDeviceId());
}
#endif

#endif //__CUDAUTIL_H
//end of file
@@ -1,17 +1,22 @@
#ifndef __DEBUG_H
#define __DEBUG_H
/**
* @file debug.h
* @brief Define flags to control the debugging
*
* CUAMPCOR_DEBUG is used to output debugging information and intermediate results,
* disabled when NDEBUG macro is defined.
* CUDA_ERROR_CHECK is always enabled, to check CUDA routine errors
*
*/

#include <iostream>
#include <assert.h>
#include <stdio.h>
// code guard
#ifndef __CUAMPCOR_DEBUG_H
#define __CUAMPCOR_DEBUG_H

#ifndef NDEBUG
#define CUAMPCOR_DEBUG
#define debugmsg(msg) fprintf(stderr, msg)
#else
#define debugmsg(msg)
#endif //NDEBUG

#define CUDA_ERROR_CHECK

#endif //__DEBUG_H
#endif //__CUAMPCOR_DEBUG_H
//end of file
@@ -1,6 +1,6 @@
/*
* float2.h
* define operators and functions on float2 (cuComplex) datatype
* @file float2.h
* @brief Define operators and functions on float2 (cuComplex) datatype
*
*/

@@ -11,14 +11,13 @@

inline __host__ __device__ void zero(float2 &a) { a.x = 0.0f; a.y = 0.0f; }

//negate
// negative
inline __host__ __device__ float2 operator-(float2 &a)
{
return make_float2(-a.x, -a.y);
}

//conjugate
// complex conjugate
inline __host__ __device__ float2 conjugate(float2 a)
{
return make_float2(a.x, -a.y);

@@ -98,22 +97,8 @@ inline __host__ __device__ float2 complexMul(float2 a, float2 b)
inline __host__ __device__ float2 complexMulConj(float2 a, float2 b)
{
return make_float2(a.x*b.x + a.y*b.y, a.y*b.x - a.x*b.y);
//return a*conjugate(b)
}

// division
/*
* inline __host__ __device__ float2 operator/(float2 a, float2 b)
{
return make_float2(a.x / b.x, a.y / b.y);
}
inline __host__ __device__ void operator/=(float2 &a, float2 b)
{
a.x /= b.x;
a.y /= b.y;
}
*
* */
inline __host__ __device__ float2 operator/(float2 a, float b)
{
return make_float2(a.x / b, a.y / b);

@@ -134,14 +119,11 @@ inline __host__ __device__ float complexArg(float2 a)
return atan2f(a.y, a.x);
}

// make a complex number from phase
inline __host__ __device__ float2 complexExp(float arg)
{
return make_float2(cosf(arg), sinf(arg));
}

#endif

#endif //__FLOAT2_H
// end of file
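
A quick host-side sanity check for a few of the helpers defined above. It uses only functions whose definitions appear in this diff (conjugate, complexMulConj, operator/ by a real, complexExp), with made-up input values, and assumes the file is compiled with nvcc so that make_float2 and the __host__ __device__ qualifiers are available.

// --- illustrative sketch only, not part of this commit ---
#include <cstdio>
#include <cuda_runtime.h>
#include "float2.h"

int main()
{
    float2 a = make_float2(1.0f, 2.0f);    // 1 + 2i
    float2 b = make_float2(3.0f, -1.0f);   // 3 - i
    float2 c = conjugate(b);               // 3 + i
    float2 d = complexMulConj(a, b);       // a * conj(b) = (1+2i)(3+i) = 1 + 7i
    float2 e = complexExp(0.0f);           // cos(0) + i sin(0) = 1 + 0i
    float2 f = d / 2.0f;                   // scale by a real number: 0.5 + 3.5i
    printf("%g%+gi  %g%+gi  %g%+gi  %g%+gi\n", c.x, c.y, d.x, d.y, e.x, e.y, f.x, f.y);
    return 0;
}
// --- end of illustrative sketch ---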

@@ -1,5 +1,5 @@
#
# Implementation: python setup_cuAmpcor.py build_ext --inplace
# Implementation: python setup.py build_ext --inplace
# Generates PyCuAmpcor.xxx.so (where xxx is just some local sys-arch information).
# Note you need to run your makefile *FIRST* to generate the cuAmpcor.o object.
#

@@ -11,6 +11,7 @@ from Cython.Build import cythonize
import numpy

setup( name = 'PyCuAmpcor',
version = '2.0.0',
ext_modules = cythonize(Extension(
"PyCuAmpcor",
sources=['PyCuAmpcor.pyx'],