PyCuAmpcor updates:

* added a README.md for installation/user guide/procedures
* modified the cuDenseOffsets.py
    * expose more options from the CUDA/C++ program
    * add an option for varying gross offset input
    * clarify the parameter definitions
* removed old SlcImage implementation and cublas dependence
* modified cuSincOversampler
    * to be consistent with cpu version
    * fix an issue when the extraction of the search window is not around the center
* added a debug mode to output intermediate results
* enable cuda error checking for both Debug/Release build types
* corrected the code to extract raw images when the correlation surface peak is close to edges

* Move utf-8 decoding step inside cython extension

The cython setters take python strings, but the getters return python bytes, so this makes the types match up. I went with regular python strings for the interface since they are more common at the python level, so the encoding/decoding is now an implementation detail of the cython extension.

Contributed by lijun99, rtburns-jpl, vbrancat, mzzhong
2020-11-12 15:02:44 -08:00 · 2020-11-12 15:02:44 -08:00 · a393282b69
parent ab5a867d4b
commit a393282b69
25 changed files with 1034 additions and 635 deletions
--- a/contrib/PyCuAmpcor/CMakeLists.txt
+++ b/contrib/PyCuAmpcor/CMakeLists.txt
@ -1,7 +1,6 @@
 # Early exit if prereqs not available
 if(NOT TARGET GDAL::GDAL
 OR NOT TARGET Python::NumPy
-OR NOT TARGET CUDA::cublas
 OR NOT TARGET CUDA::cufft
   )
    return()
@ -14,7 +13,6 @@ cython_add_module(PyCuAmpcor
    src/PyCuAmpcor.pyx
    src/GDALImage.cu
    src/SConscript
-    src/SlcImage.cu
    src/cuAmpcorChunk.cu
    src/cuAmpcorController.cu
    src/cuAmpcorParameter.cu
@ -35,7 +33,6 @@ target_include_directories(PyCuAmpcor PRIVATE
    )
 target_link_libraries(PyCuAmpcor PRIVATE
    CUDA::cufft
-    CUDA::cublas
    GDAL::GDAL
    Python::NumPy
    )
--- a/contrib/PyCuAmpcor/README.md
+++ b/contrib/PyCuAmpcor/README.md
@ -0,0 +1,442 @@
+# PyCuAmpcor - Amplitude Cross-Correlation with GPU
+
+## Contents
+
+  * [1. Introduction](#1-introduction)
+  * [2. Installation](#2-installation)
+  * [3. User Guide](#3-user-guide)
+  * [4. List of Parameters](#4-list-of-parameters)
+  * [5. List of Procedures](#5-list-of-procedures)
+  * [6. Additional Notes](#6-additional-notes)
+
+## 1. Introduction
+
+Ampcor (Amplitude cross-correlation) in InSAR processing offers an estimate of spatial displacements (offsets) with the feature tracking method (also called speckle tracking or pixel tracking). The offsets are in units of a pixel, or of a sub-pixel with additional oversampling.
+
+In practice, we
+
+  * choose a rectangular window, $R(x,y)$, from the reference image, serving as the template;
+
+  * choose a series of windows of the same size, $S(x+u, y+v)$, from the search image around the same location but offset by $(u,v)$;
+
+  * perform cross-correlation between the search windows and the reference window, to obtain the normalized correlation surface $c(u,v)$;
+
+  * find the maximum of $c(u,v)$; its location, $(u_m,v_m)$, provides an estimate of the offset.
+
+A detailed formulation can be found, e.g., by J. P. Lewis with [the frequency domain approach](http://scribblethink.org/Work/nvisionInterface/nip.html).
+
+PyCuAmpcor follows the same procedure as the FORTRAN code, ampcor.F, in ROIPAC. In order to optimize the performance on GPU, some implementations are slightly different. In the [list of procedures](#5-list-of-procedures), we show the detailed steps of PyCuAmpcor, as well as its differences from ROIPAC.
+
+## 2. Installation
+
+### 2.1 Installation with ISCE2
+
+PyCuAmpcor is included in [ISCE2](https://github.com/isce-framework/isce2), and can be compiled/installed by CMake or SCons, together with ISCE2. An installation guide can be found at [isce-framework](https://github.com/isce-framework/isce2#building-isce).
+
+Some special notices for PyCuAmpcor:
+
+* PyCuAmpcor now uses the GDAL VRT driver to read image files. The memory-map accelerated I/O is only supported by GDAL version >=3.1.0. Earlier versions of GDAL are supported, but run slower.
+
+* PyCuAmpcor offers a debug mode which outputs intermediate results. For end users, you may disable the debug mode by
+
+    * CMake, use the Release build type *-DCMAKE_BUILD_TYPE=Release*
+    * SCons, it is disabled by default with the -DNDEBUG flag in SConscript
+
+* PyCuAmpcor requires GPUs with CUDA support and compute-capabilities >=2.0. You may (must in some cases, e.g., sm_35 with CUDA) specify the targeted compute capability by
+
+   * CMake, add the flag *-DCMAKE_CUDA_FLAGS="-arch=sm_60"*, sm_35 for K40/80, sm_60 for P100, sm_70 for V100.
+   * SCons, modify the *scons_tools/cuda.py* file by adding *-arch=sm_60* to *env['ENABLESHAREDNVCCFLAG']*.
+
+### 2.2 Standalone Installation
+
+You may also install PyCuAmpcor as a standalone package.
+
+```bash
+    # go to PyCuAmpcor source directory
+    cd contrib/PyCuAmpcor/src
+    # edit Makefile to provide the correct gdal include path and gpu architecture to NVCCFLAGS
+    # call make to compile
+    make
+    # install 
+    python3 setup.py install  
+ ```
+
+## 3. User Guide
+
+The main procedures of PyCuAmpcor are implemented with CUDA/C++. A Python interface to configure and run PyCuAmpcor is offered. Sample python scripts are provided in *contrib/PyCuAmpcor/examples* directory.
+
+### 3.1 cuDenseOffsets.py
+
+*cuDenseOffsets.py*, as also included in InSAR processing stacks, serves as a general purpose script to run PyCuAmpcor. It uses *argparse* to pass parameters, either from a command line
+
+```bash
+cuDenseOffsets.py -r 20151120.slc.full -s 20151214.slc.full --outprefix ./20151120_20151214/offset --ww 64 --wh 64 --oo 32 --kw 300 --kh 100 --nwac 32 --nwdc 1 --sw  20 --sh 20 --gpuid 2
+ ```
+
+ or by a shell script
+
+ ```
+#!/bin/bash
+reference=./merged/SLC/20151120/20151120.slc.full # reference image name 
+secondary=./merged/SLC/20151214/20151214.slc.full # secondary image name
+ww=64  # template window width
+wh=64  # template window height
+sw=20   # (half) search range along width
+sh=20   # (half) search range along height
+kw=300   # skip between windows along width
+kh=100   # skip between windows along height
+mm=0   # margin to be neglected 
+gross=0  # whether to use a varying gross offset
+azshift=0 # constant gross offset along height/azimuth 
+rgshift=0 # constant gross offset along width/range
+deramp=0 # 0 for mag (TOPS), 1 for complex  
+oo=32  # correlation surface oversampling factor 
+outprefix=./merged/20151120_20151214/offset  # output prefix
+outsuffix=_ww64_wh64   # output suffix
+gpuid=0   # GPU device ID
+nstreams=2 # number of CUDA streams
+usemmap=1 # whether to use memory-map i/o
+mmapsize=8 # buffer size in GB for memory map
+nwac=32 # number of windows in a batch along width
+nwdc=1  # number of windows in a batch along height
+
+rm $outprefix$outsuffix*
+cuDenseOffsets.py --reference $reference --secondary $secondary --ww $ww --wh $wh --sw $sw --sh $sh --mm $mm --kw $kw --kh $kh --gross $gross --rr $rgshift --aa $azshift --oo $oo --deramp $deramp --outprefix $outprefix --outsuffix $outsuffix --gpuid $gpuid  --usemmap $usemmap --mmapsize $mmapsize --nwac $nwac --nwdc $nwdc 
+ ```
+
+In the above script, the computation starts from the (mm+sh, mm+sw) pixel in the reference image, takes a series of template windows of size (wh, ww) with a skip of (kh, kw), cross-correlates them with the corresponding windows in the secondary image, and iterates until the end of the images. The output offset fields are stored in *outprefix+outsuffix+'bip'*, which is in BIP format, i.e., each pixel has two bands of float32 data, (offsetDown, offsetAcross). The total number of pixels is given by the total number of windows (numberWindowDown, numberWindowAcross), which is computed by the script and also saved to the xml file.
+
+If you are interested in a particular region instead of the whole image, you may specify the location of the starting pixel (in reference image) and the number of windows desired by adding
+
+```
+--startpixelac $startPixelAcross --startpixeldw $startPixelDown --nwa $numberOfWindowsAcross --nwd $numberOfWindowsDown
+```
+This option is also helpful for debugging.
+
+PyCuAmpcor supports two types of gross offset fields,
+* static (--gross=0), i.e., a constant shift between reference and secondary images. The static gross offsets can be passed by *--rr $rgshift --aa $azshift*. Note that the margin as well as the starting pixel may be adjusted.
+* dynamic (--gross=1), i.e., the shifts between reference windows and secondary windows vary with location. This helps to reduce the search range if you have prior knowledge of the estimated offset fields, e.g., the velocity model of glaciers. You may prepare a BIP input file of the varying gross offsets (same format as the output offset fields), and use the option *--gross-file $grossOffsetFilename*. If you need the coordinates of reference windows, you may run *cuDenseOffsets.py* first to find out the location of the starting pixel and the total number of windows. The coordinate for the starting pixel of the (iDown, iAcross) window will be (startPixelDown+iDown\*skipDown, startPixelAcross+iAcross\*skipAcross).
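+
+As a worked example, take the values from the sample shell script above (mm=0, sh=sw=20, kh=100, kw=300), so that the starting pixel is (20, 20), and pick an arbitrary window index (iDown, iAcross)=(2, 3), chosen here only for illustration. The starting pixel of that reference window is then
+
+$$(20 + 2\times 100,\; 20 + 3\times 300) = (220,\; 920)$$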
+
+### 3.2 Customized Python Scripts
+
+If you need more control of the computation, you may follow the examples to create your own Python script. The general steps are
+* create a PyCuAmpcor instance
+```python
+# if installed with ISCE2
+from isce.contrib.PyCuAmpcor.PyCuAmpcor import PyCuAmpcor
+# if standalone
+from PyCuAmpcor import PyCuAmpcor
+# create an instance
+objOffset = PyCuAmpcor()
+```
+
+* set various parameters, e.g., (see a [list of configurable parameters](#4-list-of-parameters) below)
+```python
+objOffset.referenceImageName="20151120.slc.full.vrt"
+...
+objOffset.windowSizeWidth = 64
+...
+```
+
+* ask CUDA/C++ to check/initialize parameters
+```python
+objOffset.setupParams()
+```
+
+* set up the starting pixel(s) and gross offsets
+```python
+objOffset.referenceStartPixelDownStatic = objOffset.halfSearchRangeDown 
+objOffset.referenceStartPixelAcrossStatic = objOffset.halfSearchRangeAcross
+# if static gross offset
+objOffset.setConstantGrossOffset(0, 0)
+# if dynamic gross offset, computed and stored in vD, vA 
+objOffset.setVaryingGrossOffset(vD, vA)
+# check whether all windows are within the image range
+objOffset.checkPixelInImageRange() 
+```
+
+* and finally, run PyCuAmpcor
+```python
+objOffset.runAmpcor()
+```
+
+## 4. List of Parameters
+
+**Image Parameters**
+
+| PyCuAmpcor           | Notes                     |
+| :---                 | :----                     |
+| referenceImageName   | The file name of the reference/template image |
+| referenceImageHeight | The height of the reference image |
+| referenceImageWidth  | The width of the reference image |
+| secondaryImageName   | The file name of the secondary/search image   |
+| secondaryImageHeight | The height of the secondary image |
+| secondaryImageWidth  | The width of the secondary image |
+| grossOffsetImageName | The output file name for gross offsets  |
+| offsetImageName      | The output file name for dense offsets  |
+| snrImageName         | The output file name for signal-noise-ratio of the correlation |
+| covImageName         | The output file name for variance of the correlation surface |
+
+PyCuAmpcor now exclusively uses the GDAL driver to read images; only single-precision binary data are supported. (Image heights/widths are still required as inputs; they are mainly for dimension checking. We will update later to read them with the GDAL driver.) Multi-band is not currently supported, but can be added if desired.
+
+The offset output is arranged in BIP format, with each pixel holding (azimuth offset, range offset). In addition to a static gross offset (i.e., a constant for all search windows), PyCuAmpcor supports varying gross offsets as inputs (e.g., for glaciers, users can compute the gross offsets with the velocity model for different locations and use them as inputs for PyCuAmpcor). See 3.1 for details.
+
+The offsetImage only outputs the (dense) offset values computed from the cross-correlations. Users need to add offsetImage and grossOffsetImage to obtain the total offsets.
+
+
+The dimension/direction names used in PyCuAmpcor are:
+* the inner-most dimension x(i): row, height, down, azimuth, along the track.
+* the outer-most dimension y(j): column, width, across, range, along the sight.
+* C/C++/Python use row-major indexing: a[i][j] -> a[i*WIDTH+j]
+* FORTRAN/BLAS/CUBLAS use column-major indexing: a[i][j]->a[i+j*LENGTH]
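+
+For illustration, consider a hypothetical array with LENGTH=3 rows and WIDTH=4 columns (sizes chosen only for this example). The element a[1][2] is then stored at the linear index
+
+$$\text{row-major: } 1\times 4 + 2 = 6, \qquad \text{column-major: } 1 + 2\times 3 = 7$$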
+
+Note that ampcor.F in general uses y for rows and x for columns which is opposite to PyCuAmpcor.
+
+Note also PyCuAmpcor parameters refer to the names used by the PyCuAmpcor Python class. They may be different from those used in C/C++/CUDA, or the cuDenseOffsets.py args.
+
+**Process Parameters**
+
+| PyCuAmpcor           | Notes                     |
+| :---                 | :----                     |
+| devID                | The CUDA GPU to be used for computation, usually=0, or users can use the CUDA_VISIBLE_DEVICES=n environment variable to choose the GPU |
+| nStreams | The number of CUDA streams to be used, recommended=2, to overlap the CUDA kernels with data copying; more streams require more memory, which isn't always better |
+| useMmap              | Whether to use memory map cached file I/O, recommended=1, supported by GDAL vrt driver (needs >=3.1.0) and GeoTIFF |
+| mmapSize             | The cache size used for memory map, in units of GB. The larger the better, but it should not exceed 1/4 of the total physical memory. |
+| numberWindowDownInChunk |  The number of windows processed in a batch/chunk, along lines |
+| numberWindowAcrossInChunk | The number of windows processed in a batch/chunk, along columns |
+
+Many windows are processed together, as a batch called a chunk, to maximize the usage of GPU cores. The total number of windows in a chunk is limited by the GPU memory. We recommend
+numberWindowDownInChunk=1, numberWindowAcrossInChunk=10, for a window size=64.
+
+
+**Search Parameters**
+
+| PyCuAmpcor           | Notes    |
+| :---                 | :----                     |
+| skipSampleDown       | The skip in pixels for neighboring windows along height |
+| skipSampleAcross     | The skip in pixels for neighboring windows along width |
+| numberWindowDown     | the number of windows along height |
+| numberWindowAcross   | the number of windows along width  |
+| referenceStartPixelDownStatic | the starting pixel location of the first reference window - along height component |
+|referenceStartPixelAcrossStatic | the starting pixel location of the first reference window - along width component |
+
+The C/C++/CUDA program accepts inputs with the total number of windows (numberWindowDown, numberWindowAcross) and the starting pixels of each reference window. The purpose is to establish multiple-threads/streams processing. Therefore, users are required to provide/compute these inputs, with tools available from PyCuAmpcor python class. The cuDenseOffsets.py script also does the job.
+
+We provide some examples below, assuming a PyCuAmpcor class object is created as
+
+```python
+    objOffset = PyCuAmpcor()
+```
+
+**To compute the total number of windows**
+
+We use the line direction as an example, assuming parameters as
+
+```
+   margin # the number of pixels to neglect at edges
+   halfSearchRangeDown # the half of the search range
+   windowSizeHeight # the size of the reference window for feature tracking
+   skipSampleDown # the skip in pixels between two reference windows
+   referenceImageHeight # the reference image height, usually the same as the secondary image height
+```
+
+and the number of windows may be computed along lines as
+
+```python
+   objOffset.numberWindowDown = (referenceImageHeight-2*margin-2*halfSearchRangeDown-windowSizeHeight) // skipSampleDown
+```
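+
+As a worked example with purely illustrative values (not taken from any real dataset), take referenceImageHeight=10000, margin=0, halfSearchRangeDown=20, windowSizeHeight=64 and skipSampleDown=100. The formula above then gives
+
+$$\text{numberWindowDown} = \left\lfloor \frac{10000 - 2\times 0 - 2\times 20 - 64}{100} \right\rfloor = \left\lfloor \frac{9896}{100} \right\rfloor = 98$$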
+
+If there is a gross offset, you may also need to subtract that when computing the number of windows.
+
+The output offset fields will be of size (numberWindowDown, numberWindowAcross). The total number of windows numberWindows = numberWindowDown\*numberWindowAcross.
+
+**To compute the starting pixels of reference/secondary windows**
+
+The starting pixel for the first reference window is usually set as
+
+```python
+   objOffset.referenceStartPixelDownStatic = margin + halfSearchRangeDown
+   objOffset.referenceStartPixelAcrossStatic = margin + halfSearchRangeAcross
+```
+
+You may also choose other values, e.g., for a particular region of the image, or a certain location for debugging purposes.
+
+
+With a constant gross offset, call
+
+```python
+   objOffset.setConstantGrossOffset(grossOffsetDown, grossOffsetAcross)
+```
+
+to set the starting pixels of all reference and secondary windows.
+
+The starting pixel for the secondary window will be (referenceStartPixelDownStatic-halfSearchRangeDown+grossOffsetDown, referenceStartPixelAcrossStatic-halfSearchRangeAcross+grossOffsetAcross).
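+
+For instance, with hypothetical values referenceStartPixelDownStatic=referenceStartPixelAcrossStatic=70, halfSearchRangeDown=halfSearchRangeAcross=20, and a constant gross offset (grossOffsetDown, grossOffsetAcross)=(5, -3), the first secondary window would start at
+
+$$(70 - 20 + 5,\; 70 - 20 - 3) = (55,\; 47)$$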
+
+If you choose a varying gross offset, you may use two numpy arrays to pass the information to PyCuAmpcor, e.g.,
+
+```python
+    objOffset.referenceStartPixelDownStatic = objOffset.halfSearchRangeDown + margin
+    objOffset.referenceStartPixelAcrossStatic = objOffset.halfSearchRangeAcross + margin
+    vD = np.random.randint(0, 10, size =objOffset.numberWindows, dtype=np.int32)
+    vA = np.random.randint(0, 1, size = objOffset.numberWindows, dtype=np.int32)
+    objOffset.setVaryingGrossOffset(vD, vA)
+```
+
+to set all the starting pixels for reference/secondary windows.
+
+Sometimes, adding a large gross offset may cause the windows near the edge to be out of range of the original image. To avoid memory access errors, call
+
+```python
+   objOffset.checkPixelInImageRange()
+```
+
+to verify. If an out-of-range error is reported, you may consider increasing the margin or reducing the number of windows.
+
+## 5. List of Procedures
+
+The following procedures apply to one pair of reference/secondary windows, which are iterated through the whole image.
+
+### 5.1 Read a window from Reference/Secondary images
+
+* Load a window of size (windowSizeHeight, windowSizeWidth) from a starting pixel from the reference image
+
+* Load a larger chip of size (windowSizeHeight+2\*halfSearchRangeDown, windowSizeWidth+2\*halfSearchRangeAcross) from the secondary image, the starting position is shifted by (-halfSearchRangeDown, -halfSearchRangeAcross) from the starting position of the reference image (may also be shifted additionally by the gross offset). The secondary chip can be viewed as a set of windows of the same size as the reference window, but shifted in locations varied within the search range.
+
+**Parameters**
+
+| PyCuAmpcor          | CUDA variable       | ampcor.F equivalent   | Notes                     |
+| :---                | :---                | :----                 | :---                      |
+| windowSizeHeight    | windowSizeHeightRaw | i_wsyi                |Reference window height     |
+| windowSizeWidth     | windowSizeWidthRaw  | i_wsxi                |Reference window width      |
+| halfSearchRangeDown | halfSearchRangeDownRaw | i_srchy            | half of the search range along lines |
+| halfSearchRangeAcross | halfSearchRangeAcrossRaw | i_srchx            | half of the search range along columns |
+
+
+**Difference to ROIPAC**
+No major difference
+
+
+### 5.2 Perform cross-correlation and obtain an offset in units of the pixel size
+
+* Take amplitudes (real) of the signals (complex or real) in reference/secondary windows
+* Compute the normalized correlation surface between reference and secondary windows: the resulting correlation surface is of size (2\*halfSearchRangeDown+1, 2\*halfSearchRangeAcross+1); two cross-correlation methods are offered, time domain or frequency domain algorithms.
+* Find the location of the maximum/peak in correlation surface.
+* Around the peak position, extract a smaller window from the correlation surface for statistics, such as signal-noise-ratio (SNR), variance.
+
+This step provides an initial estimate of the offset, usually with a large search range. In the following, we will zoom in around the peak, and oversample the windows with a smaller search range.
+
+
+**Parameters**
+
+| PyCuAmpcor          | CUDA variable       | ampcor.F equivalent   | Notes                     |
+| :---                | :---                | :----                 | :---                      |
+| algorithm           | algorithm           | N/A                   |  the cross-correlation computation method 0=Freq 1=time   |
+| corrStatWindowSize  | corrStatWindowSize  | 21               | the size of correlation surface around the peak position used for statistics, may be adjusted   |
+
+
+**Difference to ROIPAC**
+
+* ROIPAC only offers the time-domain algorithm. The frequency-domain algorithm is faster and is set as default in PyCuAmpcor.
+* ROIPAC proceeds from here only for windows with a *good* match. To maintain parallelism, PyCuAmpcor proceeds anyway, leaving the *filtering* to users in post-processing.
+
+
+### 5.3 Extract a smaller window from the secondary window for oversampling
+
+* From the secondary window, we extract a smaller window of size (windowSizeHeightRaw+2\*halfZoomWindowSizeRaw, windowSizeWidthRaw+2\*halfZoomWindowSizeRaw) with the center determined by the peak position. If the peak position, e.g., along height, is OffsetInit (taking values in \[0, 2\*halfSearchRangeDownRaw\]), the starting position to extract will be OffsetInit+halfSearchRangeDownRaw-halfZoomWindowSizeRaw.
+
+**Parameters**
+
+| PyCuAmpcor          | CUDA variable       | ampcor.F equivalent   | Notes                     |
+| :---                | :---                | :----                 | :---                      |
+| N/A                 | halfZoomWindowSizeRaw  | i_srchp(p)=4       |  The smaller search range to zoom-in. In PyCuAmpcor, it is determined by zoomWindowSize/(2\*rawDataOversamplingFactor) |
+
+**Difference to ROIPAC**
+
+ROIPAC extracts the secondary window centered at the correlation surface peak. If the peak is located near the edge, zeros are padded where the extraction zone exceeds the window range. In PyCuAmpcor, the extraction center may be shifted away from the peak to ensure that all pixels are within the range of the original window.
+
+
+### 5.4 Oversampling reference and (extracted) secondary windows
+
+* oversample both the reference and the (extracted) secondary windows by a factor of 2, which is to avoid aliasing in the complex multiplication of the SAR images. The oversampling is performed with FFT (zero padding), same as in ROIPAC.
+* A deramping procedure is in general required for complex signals before oversampling, to shift the band center to 0. The procedure is only designed to remove a linear phase ramp. It doesn't work for InSAR TOPS mode, whose ramp goes quadratic. Instead, the amplitudes are taken before oversampling.
+* the amplitudes (real) are then taken for each pixel of the complex signals in reference and secondary windows.
+
+**Parameters**
+
+| PyCuAmpcor          | CUDA variable       | ampcor.F equivalent   | Notes                     |
+| :---                | :---                | :----                 | :---                      |
+| rawDataOversamplingFactor | rawDataOversamplingFactor | i_ovs=2   | the oversampling factor for reference and secondary windows, use 2 for InSAR SLCs. |
+| derampMethod        | derampMethod        | 1 or no effect on TOPS | 0=mag for TOPS, 1=deramping (default), else=skip deramping. |
+
+
+**Difference to ROIPAC**
+
+ROIPAC enlarges both windows to a size which is a power of 2, which is ideal for FFT. PyCuAmpcor uses their original sizes for FFT.
+
+ROIPAC always performs deramping with Method 1, which obtains the ramp by averaging the phase difference between neighboring pixels. For TOPS mode, users need to specify 'mag' as the image *datatype* such that the amplitudes are taken before oversampling. Therefore, deramping has no effect. In PyCuAmpcor, derampMethod=0 is equivalent to *datatype='mag'*, taking amplitudes but skipping deramping. derampMethod=1 always performs deramping, regardless of whether the image datatype is 'complex' or 'real'.
+
+### 5.5 Cross-Correlate the oversampled reference and secondary windows
+
+* cross-correlate the oversampled reference and secondary windows.
+* other procedures are needed to obtain the normalized cross-correlation surface, such as calculating and subtracting the mean values.
+* the resulting correlation surface is of size (2\*halfZoomWindowSizeRaw\*rawDataOversamplingFactor+1, 2\*halfZoomWindowSizeRaw\*rawDataOversamplingFactor+1). We cut the last row and column to make it an even sequence, or the size 2\*halfZoomWindowSizeRaw\*rawDataOversamplingFactor=zoomWindowSize.
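+
+As a worked example using the values quoted in this document (zoomWindowSize=16 for consistency with ROIPAC, rawDataOversamplingFactor=2), the half zoom-in range and the correlation surface size before trimming are
+
+$$\text{halfZoomWindowSizeRaw} = \frac{16}{2\times 2} = 4, \qquad 2\times 4\times 2 + 1 = 17$$
+
+so the 17 x 17 surface is trimmed to 16 x 16 after the last row and column are cut.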
+
+**Parameters**
+
+| PyCuAmpcor          | CUDA variable       | ampcor.F equivalent   | Notes                     |
+| :---                | :---                | :----                 | :---                      |
+| corrSurfaceZoomInWindow | zoomWindowSize  | i_cw   | The size of correlation surface of the (anti-aliasing) oversampled reference/secondary windows, also used to set halfZoomWindowSizeRaw. Set it to 16 to be consistent with ROIPAC. |
+
+**Difference to ROIPAC**
+
+In ROIPAC, an extra resizing step is performed on the correlation surface, from (2\*halfZoomWindowSizeRaw\*rawDataOversamplingFactor+1, 2\*halfZoomWindowSizeRaw\*rawDataOversamplingFactor+1) to (i_cw, i_cw), centered at the peak (in ROIPAC, the peak seeking is incorporated in the correlation module, while it is separate in PyCuAmpcor). i_cw is a user-configurable variable; it could be smaller or bigger than 2\*i_srchp\*i_ovs+1=17 (fixed), leading to extraction or enlargement by padding 0s. This procedure is not performed in PyCuAmpcor, as it makes little difference in the next oversampling procedure.
+
+### 5.6 Oversample the correlation surface and find the peak position
+
+* oversample the (real) correlation surface by a factor oversamplingFactor, so that the resulting surface is of size (zoomWindowSize\*oversamplingFactor, zoomWindowSize\*oversamplingFactor). Two oversampling methods are offered, oversamplingMethod=0 (FFT, default), =1 (sinc).
+* find the peak position in the oversampled correlation surface, OffsetZoomIn, in range zoomWindowSize\*oversamplingFactor.
+* calculate the final offset, from OffsetInit (which is the starting position of secondary window extraction in 5.3),
+
+   offset = (OffsetInit-halfSearchRange)+OffsetZoomIn/(oversamplingFactor\*rawDataOversamplingFactor)
+
+Note that this offset does not include the pre-defined gross offset. Users need to add them together if necessary.
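+
+Plugging purely illustrative numbers into the formula as written above (OffsetInit=25, halfSearchRange=20, OffsetZoomIn=37, oversamplingFactor=32, rawDataOversamplingFactor=2; none of these come from a real run) gives
+
+$$\text{offset} = (25 - 20) + \frac{37}{32\times 2} = 5 + 0.578125 = 5.578125$$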
+
+
+**Parameters**
+
+| PyCuAmpcor          | CUDA variable       | ampcor.F equivalent   | Notes                     |
+| :---                | :---                | :----                 | :---                      |
+| corrSurfaceOverSamplingFactor | oversamplingFactor  | i_covs   | The oversampling factor for the correlation surface |
+| corrSurfaceOverSamplingMethod | oversamplingMethod | i_sinc_fourier=i_sinc | The oversampling method 0=FFT, 1=sinc. |
+
+**Difference to ROIPAC**
+
+ROIPAC by default uses the sinc interpolator (one needs to change the FORTRAN code to use FFT). There is no difference with the sinc interpolator, while for FFT, ROIPAC always enlarges the window to a power of 2.
+
+
+## 6. Additional Notes
+
+### 6.1 Sinc Oversampler
+
+The sinc oversampler/interpolator may be selected to oversample the correlation surface with *--corr-osm=1* in *cuDenseOffsets.py*, or *objOffset.corrSurfaceOverSamplingMethod=1*.
+
+The sinc interpolating formula is defined as
+
+ $$x(t) = \sum_{n=-\infty}^{\infty} x_n f( \Omega_c t-n )$$
+
+
+with $f(x) = \text{sinc}(x)$ or a more complex filter, such as sinc(x) convolved with a Hamming window, as used in ampcor.
+
+```
+   parameter(MAXDECFACTOR=4096) ! maximum lags in interpolation kernels
+   r_fintp(0:MAXINTLGH) ! interpolation kernel values  
+   i_decfactor = 4096 ! Range migration decimation Factor 
+   parameter (MAXINTKERLGH=256) !maximum interpolation kernel length 
+   MAXINTLGH=MAXINTKERLGH*MAXDECFACTOR ! maximum interpolation kernel array size 
+   i_weight = 1 
+   r_pedestal = 0.0 
+   r_beta = .75 
+   r_relfiltlen = 6.0 
+   r_fintp(0:MAXINTLGH)       
+```
+
+Note that these parameters are hardwired; you need to change the source code to change these parameters.
--- a/contrib/PyCuAmpcor/examples/GeoTiffSample.py
+++ b/contrib/PyCuAmpcor/examples/GeoTiffSample.py
@ -61,3 +61,4 @@ def main():

 if __name__ == '__main__':

+    main()
--- a/contrib/PyCuAmpcor/examples/cuDenseOffsets.py
+++ b/contrib/PyCuAmpcor/examples/cuDenseOffsets.py
@ -14,8 +14,8 @@ from contrib.PyCuAmpcor.PyCuAmpcor import PyCuAmpcor


 EXAMPLE = '''example
-  cuDenseOffsets.py -m ./merged/SLC/20151120/20151120.slc.full -s ./merged/SLC/20151214/20151214.slc.full
-      --referencexml ./reference/IW1.xml --outprefix ./merged/offsets/20151120_20151214/offset
+  cuDenseOffsets.py -r ./merged/SLC/20151120/20151120.slc.full -s ./merged/SLC/20151214/20151214.slc.full
+      --outprefix ./merged/offsets/20151120_20151214/offset
      --ww 256 --wh 256 --oo 32 --kw 300 --kh 100 --nwac 100 --nwdc 1 --sw 8 --sh 8 --gpuid 2
 '''

@ -29,77 +29,96 @@ def createParser():
    parser = argparse.ArgumentParser(description='Generate offset field between two Sentinel slc',
                                     formatter_class=argparse.RawTextHelpFormatter,
                                     epilog=EXAMPLE)
-    parser.add_argument('-m','--reference', type=str, dest='reference', required=True,
+
+    # input/output
+    parser.add_argument('-r','--reference', type=str, dest='reference', required=True,
                        help='Reference image')
    parser.add_argument('-s', '--secondary',type=str, dest='secondary', required=True,
                        help='Secondary image')
-    parser.add_argument('-l', '--lat',type=str, dest='lat', required=False,
-                        help='Latitude')
-    parser.add_argument('-L', '--lon',type=str, dest='lon', required=False,
-                        help='Longitude')
-    parser.add_argument('--los',type=str, dest='los', required=False,
-                        help='Line of Sight')
-    parser.add_argument('-x', '--referencexml',type=str, dest='referencexml', required=False,
-                        help='Reference Image XML File')

    parser.add_argument('--op','--outprefix','--output-prefix', type=str, dest='outprefix',
                        default='offset', required=True,
                        help='Output prefix, default: offset.')
    parser.add_argument('--os','--outsuffix', type=str, dest='outsuffix', default='',
                        help='Output suffix, default:.')
+
+    # window size settings
    parser.add_argument('--ww', type=int, dest='winwidth', default=64,
                        help='Window width (default: %(default)s).')
    parser.add_argument('--wh', type=int, dest='winhgt', default=64,
                        help='Window height (default: %(default)s).')
-
-    parser.add_argument('--sw', type=int, dest='srcwidth', default=20, choices=range(8, 33),
-                        help='Search window width (default: %(default)s).')
-    parser.add_argument('--sh', type=int, dest='srchgt', default=20, choices=range(8, 33),
-                        help='Search window height (default: %(default)s).')
-    parser.add_argument('--mm', type=int, dest='margin', default=50,
-                        help='Margin (default: %(default)s).')
-
+    parser.add_argument('--sw', type=int, dest='srcwidth', default=20,
+                        help='Half search range along width, (default: %(default)s, recommend: 4-32).')
+    parser.add_argument('--sh', type=int, dest='srchgt', default=20,
+                        help='Half search range along height (default: %(default)s, recommend: 4-32).')
    parser.add_argument('--kw', type=int, dest='skipwidth', default=64,
                        help='Skip across (default: %(default)s).')
    parser.add_argument('--kh', type=int, dest='skiphgt', default=64,
                        help='Skip down (default: %(default)s).')

+    # determine the number of windows
+    # either specify the starting pixel and the number of windows,
+    # or, by setting them to -1, let the script compute these parameters
+    parser.add_argument('--mm', type=int, dest='margin', default=0,
+                        help='Margin (default: %(default)s).')
+    parser.add_argument('--nwa', type=int, dest='numWinAcross', default=-1,
+                        help='Number of windows across (default: %(default)s to be auto-determined).')
+    parser.add_argument('--nwd', type=int, dest='numWinDown', default=-1,
+                        help='Number of windows down (default: %(default)s to be auto-determined).')
+    parser.add_argument('--startpixelac', dest='startpixelac', type=int, default=-1,
+                        help='Starting pixel across of the reference image (default: %(default)s to be determined by margin and search range).')
+    parser.add_argument('--startpixeldw', dest='startpixeldw', type=int, default=-1,
+                        help='Starting Pixel down of the reference image (default: %(default)s).')
+
+    # cross-correlation algorithm
+    parser.add_argument('--alg', '--algorithm', dest='algorithm', type=int, default=0,
+                        help='cross-correlation algorithm (0 = frequency domain, 1 = time domain) (default: %(default)s).')
    parser.add_argument('--raw-osf','--raw-over-samp-factor', type=int, dest='raw_oversample',
                        default=2, choices=range(2,5),
-                        help='raw data oversampling factor (default: %(default)s).')
+                        help='anti-aliasing oversampling factor, equivalent to i_ovs in RIOPAC (default: %(default)s).')
+    parser.add_argument('--drmp', '--deramp', dest='deramp', type=int, default=0,
+                        help='deramp method (0: mag for TOPS, 1:complex with linear ramp) (default: %(default)s).')

+    # gross offset
    gross = parser.add_argument_group('Initial gross offset')
    gross.add_argument('-g','--gross', type=int, dest='gross', default=0,
-                       help='Use gross offset or not')
+                       help='Use varying gross offset or not')
    gross.add_argument('--aa', type=int, dest='azshift', default=0,
                       help='Gross azimuth offset (default: %(default)s).')
    gross.add_argument('--rr', type=int, dest='rgshift', default=0,
                       help='Gross range offset (default: %(default)s).')
+    gross.add_argument('--gf', '--gross-file', type=str, dest='gross_offset_file',
+                       help='Varying gross offset input file')

    corr = parser.add_argument_group('Correlation surface')
-    corr.add_argument('--corr-win-size', type=int, dest='corr_win_size', default=-1,
-                      help='Zoom-in window size of the correlation surface for oversampling (default: %(default)s).')
+    corr.add_argument('--corr-stat-size', type=int, dest='corr_stat_win_size', default=21,
+                      help='Zoom-in window size of the correlation surface for statistics (snr/variance) (default: %(default)s).')
+    corr.add_argument('--corr-srch-size', type=int, dest='corr_srch_size', default=4,
+                      help='(half) Zoom-in window size of the correlation surface for oversampling, ' \
+                      'equivalent to i_srcp in RIOPAC (default: %(default)s).')
    corr.add_argument('--corr-osf', '--oo', '--corr-over-samp-factor', type=int, dest='corr_oversample', default=32,
                      help = 'Oversampling factor of the zoom-in correlation surface (default: %(default)s).')
+    corr.add_argument('--corr-osm', '--corr-over-samp-method', type=int, dest='corr_oversamplemethod', default=0,
+                      help = 'Oversampling method for the correlation surface 0=fft, 1=sinc (default: %(default)s).')

-    parser.add_argument('--nwa', type=int, dest='numWinAcross', default=-1,
-                        help='Number of window across (default: %(default)s).')
-    parser.add_argument('--nwd', type=int, dest='numWinDown', default=-1,
-                        help='Number of window down (default: %(default)s).')
+    # gpu settings
+    proc = parser.add_argument_group('Processing parameters')
+    proc.add_argument('--gpuid', '--gid', '--gpu-id', dest='gpuid', type=int, default=-1,
+                        help='GPU ID (default: %(default)s to auto decide).')
+    proc.add_argument('--nstreams', dest='nstreams', type=int, default=2,
+                        help='Number of cuda streams (default: %(default)s).')
+    proc.add_argument('--usemmap', dest='usemmap', type=int, default=1,
+                        help='Whether to use memory map for loading image files (default: %(default)s).')
+    proc.add_argument('--mmapsize', dest='mmapsize', type=int, default=8,
+                        help='The memory map buffer size in GB (default: %(default)s).')
+    proc.add_argument('--nwac', type=int, dest='numWinAcrossInChunk', default=10,
+                        help='Number of window across in a chunk/batch (default: %(default)s).')
+    proc.add_argument('--nwdc', type=int, dest='numWinDownInChunk', default=1,
+                        help='Number of window down in a chunk/batch (default: %(default)s).')

-    parser.add_argument('--nwac', type=int, dest='numWinAcrossInChunk', default=1,
-                        help='Number of window across in chunk (default: %(default)s).')
-    parser.add_argument('--nwdc', type=int, dest='numWinDownInChunk', default=1,
-                        help='Number of window down in chunk (default: %(default)s).')
-    parser.add_argument('-r', '--redo', dest='redo', action='store_true',
+    proc.add_argument('--redo', dest='redo', action='store_true',
                        help='To redo by force (ignore the existing offset fields).')

-    parser.add_argument('--drmp', '--deramp', dest='deramp', type=int, default=0,
-                        help='deramp method (0: mag, 1: complex) (default: %(default)s).')
-
-    parser.add_argument('--gpuid', '--gid', '--gpu-id', dest='gpuid', type=int, default=-1,
-                        help='GPU ID (default: %(default)s).')
-
    return parser


@ -108,9 +127,13 @@ def cmdLineParse(iargs = None):
    inps =  parser.parse_args(args=iargs)

    # check oversampled window size
-    if (inps.winwidth + 2 * inps.srcwidth) * inps.raw_oversample > 1024:
-        msg = 'input oversampled window size in the across/range direction '
-        msg += 'exceeds the current implementaion limit of 1024!'
+    if (inps.winwidth + 2 * inps.srcwidth ) * inps.raw_oversample > 1024:
+        msg = 'The oversampled window width, ' \
+              'as computed by (winwidth+2*srcwidth)*raw_oversample, ' \
+              'exceeds the current implementation limit of 1,024. ' \
+              f'Please reduce winwidth: {inps.winwidth}, ' \
+              f'srcwidth: {inps.srcwidth}, ' \
+              f'or raw_oversample: {inps.raw_oversample}.'
        raise ValueError(msg)

    return inps
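The size check in `cmdLineParse` can be read as a small standalone rule. The following is a minimal sketch of that rule only; the helper names `oversampled_window_width` and `check_window_width` and the constant `OVERSAMPLED_WIDTH_LIMIT` are hypothetical and do not appear in the script, while the formula and the limit of 1024 are taken from the hunk above.

```python
# Hypothetical helpers illustrating the check in cmdLineParse;
# these names are assumptions, not part of cuDenseOffsets.py.
OVERSAMPLED_WIDTH_LIMIT = 1024  # implementation limit quoted in the error message


def oversampled_window_width(winwidth, srcwidth, raw_oversample):
    """Width of the search window after raw-data oversampling."""
    return (winwidth + 2 * srcwidth) * raw_oversample


def check_window_width(winwidth, srcwidth, raw_oversample):
    """Return the oversampled width, or raise ValueError if it exceeds the limit."""
    width = oversampled_window_width(winwidth, srcwidth, raw_oversample)
    if width > OVERSAMPLED_WIDTH_LIMIT:
        raise ValueError(
            f'oversampled window width {width} exceeds {OVERSAMPLED_WIDTH_LIMIT}; '
            f'reduce winwidth: {winwidth}, srcwidth: {srcwidth}, '
            f'or raw_oversample: {raw_oversample}.')
    return width


# The script defaults (--ww 64, --sw 20, --raw-osf 2) are well within the limit.
print(check_window_width(64, 20, 2))
```

With the defaults the oversampled width is (64 + 2*20)*2 = 208, whereas a combination such as `--ww 256 --sw 32 --raw-osf 4` gives 1280 and would be rejected.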
@ -136,11 +159,12 @@ def estimateOffsetField(reference, secondary, inps=None):
    width = sar.getWidth()
    length = sar.getLength()

+    # create a PyCuAmpcor instance
    objOffset = PyCuAmpcor()

-    objOffset.algorithm = 0
-    objOffset.deviceID = inps.gpuid  # -1:let system find the best GPU
-    objOffset.nStreams = 2 #cudaStreams
+    objOffset.algorithm = inps.algorithm
+    objOffset.deviceID = inps.gpuid
+    objOffset.nStreams = inps.nstreams #cudaStreams
    objOffset.derampMethod = inps.deramp
    print('deramp method (0 for magnitude, 1 for complex): ', objOffset.derampMethod)

@ -155,49 +179,52 @@ def estimateOffsetField(reference, secondary, inps=None):
    print("image length:",length)
    print("image width:",width)

-    objOffset.numberWindowDown = (length-2*inps.margin-2*inps.srchgt-inps.winhgt)//inps.skiphgt
-    objOffset.numberWindowAcross = (width-2*inps.margin-2*inps.srcwidth-inps.winwidth)//inps.skipwidth
+    # if using gross offset, adjust the margin
+    margin = max(inps.margin, abs(inps.azshift), abs(inps.rgshift))

-    if (inps.numWinDown != -1):
-        objOffset.numberWindowDown = inps.numWinDown
-    if (inps.numWinAcross != -1):
-        objOffset.numberWindowAcross = inps.numWinAcross
-    print("offset field length: ",objOffset.numberWindowDown)
-    print("offset field width: ",objOffset.numberWindowAcross)
+    # determine the number of windows down and across
+    # that's also the size of the output offset field
+    objOffset.numberWindowDown = inps.numWinDown if inps.numWinDown > 0 \
+        else (length-2*margin-2*inps.srchgt-inps.winhgt)//inps.skiphgt
+    objOffset.numberWindowAcross = inps.numWinAcross if inps.numWinAcross > 0 \
+        else (width-2*margin-2*inps.srcwidth-inps.winwidth)//inps.skipwidth
+    print('the number of windows: {} by {}'.format(objOffset.numberWindowDown, objOffset.numberWindowAcross))

    # window size
    objOffset.windowSizeHeight = inps.winhgt
    objOffset.windowSizeWidth = inps.winwidth
-    print('cross correlation window size: {} by {}'.format(objOffset.windowSizeHeight, objOffset.windowSizeWidth))
+    print('window size for cross-correlation: {} by {}'.format(objOffset.windowSizeHeight, objOffset.windowSizeWidth))

    # search range
    objOffset.halfSearchRangeDown = inps.srchgt
    objOffset.halfSearchRangeAcross = inps.srcwidth
-    print('half search range: {} by {}'.format(inps.srchgt, inps.srcwidth))
+    print('initial search range: {} by {}'.format(inps.srchgt, inps.srcwidth))

    # starting pixel
+    objOffset.referenceStartPixelDownStatic = inps.startpixeldw if inps.startpixeldw != -1 \
+        else margin + objOffset.halfSearchRangeDown    # use margin + halfSearchRange instead
+    objOffset.referenceStartPixelAcrossStatic = inps.startpixelac if inps.startpixelac != -1 \
+        else margin + objOffset.halfSearchRangeAcross
+
+    print('the first pixel in reference image is: ({}, {})'.format(
+        objOffset.referenceStartPixelDownStatic, objOffset.referenceStartPixelAcrossStatic))

-    objOffset.referenceStartPixelDownStatic = inps.margin
-    objOffset.referenceStartPixelAcrossStatic = inps.margin
- 
    # skip size
-    
    objOffset.skipSampleDown = inps.skiphgt
    objOffset.skipSampleAcross = inps.skipwidth
    print('search step: {} by {}'.format(inps.skiphgt, inps.skipwidth))

    # oversample raw data (SLC)
    objOffset.rawDataOversamplingFactor = inps.raw_oversample
-    print('raw data oversampling factor:', inps.raw_oversample)

    # correlation surface
-    if inps.corr_win_size == -1:
-        corr_win_size_orig = min(inps.srchgt, inps.srcwidth) * inps.raw_oversample + 1
-        inps.corr_win_size = np.power(2, int(np.log2(corr_win_size_orig)))
-        objOffset.corrSurfaceZoomInWindow = inps.corr_win_size
-        print('correlation surface zoom-in window size:', inps.corr_win_size)
+    objOffset.corrStatWindowSize = inps.corr_stat_win_size

-    objOffset.corrSufaceOverSamplingMethod = 0
+    corr_win_size = 2*inps.corr_srch_size*inps.raw_oversample
+    objOffset.corrSurfaceZoomInWindow = corr_win_size
+    print('correlation surface zoom-in window size:', corr_win_size)
+
+    objOffset.corrSurfaceOverSamplingMethod = inps.corr_oversamplemethod
    objOffset.corrSurfaceOverSamplingFactor = inps.corr_oversample
    print('correlation surface oversampling factor:', inps.corr_oversample)
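The window bookkeeping in this hunk (margin, number of windows, starting pixel, zoom-in window) can be summarized as plain arithmetic. The sketch below restates those formulas outside of `PyCuAmpcor`; the function names `window_layout` and `zoom_in_window_size` are hypothetical, and the image size used in the example is an arbitrary illustration rather than a value from the commit.

```python
# Hypothetical restatement of the arithmetic in estimateOffsetField;
# function names and the example image size are assumptions.

def window_layout(length, width, winhgt=64, winwidth=64, srchgt=20, srcwidth=20,
                  skiphgt=64, skipwidth=64, margin=0, azshift=0, rgshift=0):
    """Return (windows down, windows across, (start down, start across))."""
    # a constant gross offset enlarges the margin so windows stay inside the image
    margin = max(margin, abs(azshift), abs(rgshift))
    n_down = (length - 2 * margin - 2 * srchgt - winhgt) // skiphgt
    n_across = (width - 2 * margin - 2 * srcwidth - winwidth) // skipwidth
    # first reference pixel: margin plus the half search range
    start = (margin + srchgt, margin + srcwidth)
    return n_down, n_across, start


def zoom_in_window_size(corr_srch_size=4, raw_oversample=2):
    """Zoom-in window on the correlation surface, as set from --corr-srch-size."""
    return 2 * corr_srch_size * raw_oversample


print(window_layout(1000, 2000))
print(window_layout(1000, 2000, azshift=-30))
print(zoom_in_window_size())
```

For a 1000 by 2000 image with the default options this gives 14 by 29 windows starting at pixel (20, 20); a gross azimuth shift of -30 raises the effective margin to 30, which moves the start to (50, 50) and drops one row of windows.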

@ -211,37 +238,38 @@ def estimateOffsetField(reference, secondary, inps=None):
    print("snr: ",objOffset.snrImageName)
    print("cov: ",objOffset.covImageName)

-    offsetImageName = objOffset.offsetImageName.decode('utf8')
-    grossOffsetImageName = objOffset.grossOffsetImageName.decode('utf8')
-    snrImageName = objOffset.snrImageName.decode('utf8')
-    covImageName = objOffset.covImageName.decode('utf8')
+    offsetImageName = objOffset.offsetImageName
+    grossOffsetImageName = objOffset.grossOffsetImageName
+    snrImageName = objOffset.snrImageName
+    covImageName = objOffset.covImageName

-    print(offsetImageName)
-    print(inps.redo)
    if os.path.exists(offsetImageName) and not inps.redo:
-        print('offsetfield file exists')
+        print('offsetfield file {} exists while the redo flag is {}.'.format(offsetImageName, inps.redo))
        return 0

    # generic control
    objOffset.numberWindowDownInChunk = inps.numWinDownInChunk
    objOffset.numberWindowAcrossInChunk = inps.numWinAcrossInChunk
-    objOffset.useMmap = 0
-    objOffset.mmapSize = 8
+    objOffset.useMmap = inps.usemmap
+    objOffset.mmapSize = inps.mmapsize
+
+    # setup and check parameters
    objOffset.setupParams()

    ## Set Gross Offset ###
-    if inps.gross == 0:
-        print("Set constant grossOffset")
-        print("By default, the gross offsets are zero")
-        print("You can override the default values here")
-        objOffset.setConstantGrossOffset(0, 0)
+    if inps.gross == 0: # use static grossOffset
+        print('Set constant grossOffset ({}, {})'.format(inps.azshift, inps.rgshift))
+        objOffset.setConstantGrossOffset(inps.azshift, inps.rgshift)

-    else:
-        print("Set varying grossOffset")
-        print("By default, the gross offsets are zero")
-        print("You can override the default grossDown and grossAcross arrays here")
-        objOffset.setVaryingGrossOffset(np.zeros(shape=grossDown.shape,dtype=np.int32),
-                                        np.zeros(shape=grossAcross.shape,dtype=np.int32))
+    else: # use varying offset
+        print("Set varying grossOffset from file {}".format(inps.gross_offset_file))
+        grossOffset = np.fromfile(inps.gross_offset_file, dtype=np.int32)
+        numberWindows = objOffset.numberWindowDown*objOffset.numberWindowAcross
+        if grossOffset.size != 2*numberWindows :
+            print('The input gross offsets do not match the number of windows {} by {} in int32 type'.format(objOffset.numberWindowDown, objOffset.numberWindowAcross))
+            return 0
+        grossOffset = grossOffset.reshape(numberWindows, 2)
+        objOffset.setVaryingGrossOffset(grossOffset[:,0], grossOffset[:,1])

    # check
    objOffset.checkPixelInImageRange()
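The varying gross offset branch reads a flat `int32` file and indexes it as per-window (down, across) pairs. One detail worth keeping in mind is that `numpy.ndarray.reshape` returns a new array (usually a view) and does not modify the array in place, so the result has to be assigned before column indexing. The sketch below shows the split under the assumption that the file stores interleaved pairs, which is what the `[:,0]` / `[:,1]` indexing implies; the helper name `split_gross_offsets` is hypothetical.

```python
import numpy as np


def split_gross_offsets(flat, number_window_down, number_window_across):
    """Split interleaved (down, across) int32 offsets into two 1-D arrays.

    Hypothetical helper; the interleaved layout is an assumption inferred
    from how the script indexes the reshaped array.
    """
    flat = np.asarray(flat, dtype=np.int32)
    number_windows = number_window_down * number_window_across
    if flat.size != 2 * number_windows:
        raise ValueError(
            'expected {} int32 values for {} by {} windows, got {}'.format(
                2 * number_windows, number_window_down, number_window_across, flat.size))
    # reshape returns a new array object, so keep the result
    pairs = flat.reshape(number_windows, 2)
    # contiguous copies are safer to hand to a compiled extension
    return np.ascontiguousarray(pairs[:, 0]), np.ascontiguousarray(pairs[:, 1])


# Example with 2 x 3 windows; in the script the flat array would come from
# np.fromfile(inps.gross_offset_file, dtype=np.int32).
down, across = split_gross_offsets(np.arange(12, dtype=np.int32), 2, 3)
print(down, across)
```

Raising an exception on a size mismatch is a design choice of this sketch; the script itself prints a message and returns.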
--- a/contrib/PyCuAmpcor/src/GDALImage.cu
+++ b/contrib/PyCuAmpcor/src/GDALImage.cu
@ -6,7 +6,6 @@
 #include <unistd.h>
 #include <fcntl.h>
 #include <assert.h>
-#include <cublas_v2.h>
 #include "cudaError.h"
 #include <errno.h>
 #include <unistd.h>
@ -107,11 +106,6 @@ void GDALImage::loadToDevice(void *dArray, size_t h_offset, size_t w_offset, siz
    char * startPtr = (char *)_memPtr ;
    startPtr += tileStartOffset;

-    // @note
-    // We assume down/across directions as rows/cols. Therefore, SLC mmap and device array are both row major.
-    // cuBlas assumes both source and target arrays are column major.
-    // To use cublasSetMatrix, we need to switch w_tile/h_tile for rows/cols
-    // checkCudaErrors(cublasSetMatrixAsync(w_tile, h_tile, sizeof(float2), startPtr, width, dArray, w_tile, stream));
    if (_useMmap)
        checkCudaErrors(cudaMemcpy2DAsync(dArray, w_tile*_pixelSize, startPtr, _width*_pixelSize,
                                      w_tile*_pixelSize, h_tile, cudaMemcpyHostToDevice,stream));
--- a/contrib/PyCuAmpcor/src/Makefile
+++ b/contrib/PyCuAmpcor/src/Makefile
@ -1,12 +1,13 @@
 PROJECT = CUAMPCOR

-LDFLAGS =  -lcuda -lcudart -lcufft -lcublas
+LDFLAGS =  -lcuda -lcudart -lcufft -lgdal
 CXXFLAGS = -std=c++11 -fpermissive -fPIC -shared
 NVCCFLAGS = -std=c++11 -ccbin g++ -m64 \
    -gencode arch=compute_35,code=sm_35 \
 	-gencode arch=compute_60,code=sm_60 \
    -Xcompiler -fPIC -shared -Wno-deprecated-gpu-targets \
-    -ftz=false -prec-div=true -prec-sqrt=true
+    -ftz=false -prec-div=true -prec-sqrt=true \
+    -I/usr/include/gdal

 CXX=g++
 NVCC=nvcc
--- a/contrib/PyCuAmpcor/src/PyCuAmpcor.pyx
+++ b/contrib/PyCuAmpcor/src/PyCuAmpcor.pyx
@ -43,10 +43,12 @@ cdef extern from "cuAmpcorParameter.h":
        int skipSampleDownRaw    			## Skip size between neighboring windows in Down direction (original size)
        int skipSampleAcrossRaw  			## Skip size between neighboring windows in across direction (original size)

+        int corrStatWindowSize                   ## Size of the raw correlation surface extracted for statistics
+
        ## Zoom in region near location of max correlation
        int zoomWindowSize       			## Zoom-in window size in correlation surface (same for down and across directions)
        int oversamplingFactor   			## Oversampling factor for interpolating correlation surface
-        int oversamplingMethod
+        int oversamplingMethod              ## Correlation surface oversampling method 0=fft, 1=sinc

        float thresholdSNR       			## Threshold of Signal noise ratio to remove noisy data

@ -217,6 +219,13 @@ cdef class PyCuAmpcor(object):
    def rawDataOversamplingFactor(self, int a):
        self.c_cuAmpcor.param.rawDataOversamplingFactor = a
    @property
+    def corrStatWindowSize(self):
+        """Size of correlation surface extracted for statistics"""
+        return self.c_cuAmpcor.param.corrStatWindowSize
+    @corrStatWindowSize.setter
+    def corrStatWindowSize(self, int a):
+        self.c_cuAmpcor.param.corrStatWindowSize = a
+    @property
    def corrSurfaceZoomInWindow(self):
        """Zoom-In Window Size for correlation surface"""
        return self.c_cuAmpcor.param.zoomWindowSize
@ -231,11 +240,11 @@ cdef class PyCuAmpcor(object):
    def corrSurfaceOverSamplingFactor(self, int a):
        self.c_cuAmpcor.param.oversamplingFactor = a
    @property
-    def corrSufaceOverSamplingMethod(self):
+    def corrSurfaceOverSamplingMethod(self):
        """Oversampling method for correlation surface(0=fft,1=sinc)"""
        return self.c_cuAmpcor.param.oversamplingMethod
-    @corrSufaceOverSamplingMethod.setter
-    def corrSufaceOverSamplingMethod(self, int a):
+    @corrSurfaceOverSamplingMethod.setter
+    def corrSurfaceOverSamplingMethod(self, int a):
        self.c_cuAmpcor.param.oversamplingMethod = a
    @property
    def referenceImageName(self):
@ -322,27 +331,27 @@ cdef class PyCuAmpcor(object):
    ## gross offsets
    @property
    def grossOffsetImageName(self):
-        return self.c_cuAmpcor.param.grossOffsetImageName
+        return self.c_cuAmpcor.param.grossOffsetImageName.decode("utf-8")
    @grossOffsetImageName.setter
    def grossOffsetImageName(self, str a):
        self.c_cuAmpcor.param.grossOffsetImageName = <string> a.encode()
    @property
    def offsetImageName(self):
-        return self.c_cuAmpcor.param.offsetImageName
+        return self.c_cuAmpcor.param.offsetImageName.decode("utf-8")
    @offsetImageName.setter
    def offsetImageName(self, str a):
        self.c_cuAmpcor.param.offsetImageName = <string> a.encode()

    @property
    def snrImageName(self):
-        return self.c_cuAmpcor.param.snrImageName
+        return self.c_cuAmpcor.param.snrImageName.decode("utf-8")
    @snrImageName.setter
    def snrImageName(self, str a):
        self.c_cuAmpcor.param.snrImageName = <string> a.encode()

    @property
    def covImageName(self):
-        return self.c_cuAmpcor.param.covImageName
+        return self.c_cuAmpcor.param.covImageName.decode("utf-8")
    @covImageName.setter
    def covImageName(self, str a):
        self.c_cuAmpcor.param.covImageName = <string> a.encode()
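The getter/setter change above makes the file name properties symmetric: Python `str` goes in, Python `str` comes out, and the byte representation stays internal. A pure-Python sketch of the same pattern is given below; the class `NameHolder` and its `_raw` attribute are hypothetical stand-ins for the Cython wrapper and the underlying C++ `std::string`, not part of the extension.

```python
class NameHolder:
    """Hypothetical stand-in for the Cython wrapper around a std::string field."""

    def __init__(self):
        self._raw = b''  # plays the role of the C++ std::string (raw bytes)

    @property
    def offsetImageName(self):
        # decode on the way out so callers always receive str
        return self._raw.decode('utf-8')

    @offsetImageName.setter
    def offsetImageName(self, a):
        # str.encode() defaults to UTF-8, matching the decode above
        self._raw = a.encode()


holder = NameHolder()
holder.offsetImageName = 'offset.bip'
print(type(holder.offsetImageName).__name__, holder.offsetImageName)
```

With this arrangement a caller such as `os.path.exists(objOffset.offsetImageName)` no longer needs its own `.decode('utf8')` step.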
--- a/contrib/PyCuAmpcor/src/SConscript
+++ b/contrib/PyCuAmpcor/src/SConscript
@ -1,5 +1,6 @@
 #!/usr/bin/env python
 import sys
+import subprocess

 Import('envPyCuAmpcor')
 package = envPyCuAmpcor['PACKAGE']
@ -16,6 +17,10 @@ listFiles = ['GDALImage.cu', 'cuArrays.cu', 'cuArraysCopy.cu',

 lib = envPyCuAmpcor.SharedLibrary(target = 'PyCuAmpcor', source= listFiles, SHLIBPREFIX='')

+# add gdal include path
+gdal_cflags = subprocess.check_output('gdal-config --cflags', shell=True)[:-1].decode('utf-8')
+envPyCuAmpcor.Append(ENABLESHAREDNVCCFLAG = ' -DNDEBUG ' + gdal_cflags)
+
 envPyCuAmpcor.Install(build,lib)
 envPyCuAmpcor.Alias('install', build)

--- a/contrib/PyCuAmpcor/src/SlcImage.cu
+++ b/contrib/PyCuAmpcor/src/SlcImage.cu
@ -1,177 +0,0 @@
-#include "SlcImage.h"
-#include <iostream>
-
-#include <stdio.h>
-#include <stdlib.h>
-#include <sys/types.h>
-#include <sys/stat.h>
-#include <unistd.h>
-#include <fcntl.h>
-#include <sys/mman.h>
-#include <cuComplex.h>
-#include <assert.h>
-#include <cublas_v2.h>
-#include "cudaError.h"
-#include <errno.h>
-#include <unistd.h>
-
-SlcImage::SlcImage() {
-    fileid = -1;
-    is_mapped = 0;
-    is_opened = 0;
-    height = 0;
-    width = 0;
-}
-  
- 
-SlcImage::SlcImage(std::string fn, size_t h, size_t w) {
-    filename = fn;
-    width = w;
-    height = h;
-    is_mapped = 0;
-    is_opened = 0;
-    openFile();
-    buffersize = filesize;
-    offset = 0l; 
-    openFile();
-    setupMmap();
-}
-
-SlcImage::SlcImage(std::string fn, size_t h, size_t w, size_t bsize) {
-    filename = fn;
-    width = w;
-    height = h;
-    is_mapped = 0;
-    is_opened = 0;
-    buffersize = bsize*(1l<<30); //1G as a unit
-    offset = 0l;
-    openFile();
-    //std::cout << "buffer and file sizes" << buffersize << " " << filesize << std::endl;
-    setupMmap();
-}
-
-void SlcImage::setBufferSize(size_t sizeInG)
-{
-    buffersize = sizeInG*(1l<<30);
-}
-
-void SlcImage::openFile()
-{
-    if(!is_opened){
-        fileid = open(filename.c_str(), O_RDONLY, 0);
-        if(fileid == -1) 
-            {
-            fprintf(stderr, "Error opening file %s\n", filename.c_str());
-            exit(EXIT_FAILURE);
-        }
-    }
-    struct stat st;
-    stat(filename.c_str(), &st);
-    filesize = st.st_size;
-    //lseek(fileid,filesize-1,SEEK_SET);
-    is_opened = 1;
-}
-
-void SlcImage::closeFile()
-{
-    if(is_opened)
-        {
-        close(fileid);
-        is_opened = 0;
-    }
-}
-/*
-  void SlcImage::setupMmap()
-{
-    if(!is_mapped) {
-        float2 *fmmap = (float2 *)mmap(NULL, filesize, PROT_READ, MAP_SHARED, fileid, 0);
-        assert (fmmap != MAP_FAILED);
-        mmapPtr =  fmmap;
-        is_mapped = 1;
-    }
-}*/
-
-void SlcImage::setupMmap()
-{
-
-    if(is_opened) {
-        if(!is_mapped) {
-            void * fmmap;
-            if((fmmap=mmap((caddr_t)0, buffersize, PROT_READ, MAP_SHARED, fileid, offset)) == MAP_FAILED)
-            {
-                fprintf(stderr, "mmap error: %d %d\n", fileid, errno);
-                exit(1);
-            }	       
-            mmapPtr = (float2 *)fmmap;
-            is_mapped = 1;
-        }
-    }
-    else {
-        fprintf(stderr, "error! file is not opened");
-        exit(1);}
-    //fprintf(stderr, "debug mmap setup %ld, %ld\n", offset, buffersize);
-    //fprintf(stderr, "starting mmap pixel %f %f\n", mmapPtr[0].x, mmapPtr[0].y);
-}
-
-void SlcImage::mUnMap()
-{
-    if(is_mapped) {
-        if(munmap((void *)mmapPtr, buffersize) == -1)
-        {
-            fprintf(stderr, "munmap error: %d\n", fileid);
-        } 
-        is_mapped = 0; 
-    }
-}
-
-
-/// load a tile of data h_tile x w_tile from CPU (mmap) to GPU
-/// @param dArray pointer for array in device memory
-/// @param h_offset Down/Height offset
-/// @param w_offset Across/Width offset
-/// @param h_tile Down/Height tile size
-/// @param w_tile Across/Width tile size
-/// @param stream CUDA stream for copying
-void SlcImage::loadToDevice(float2 *dArray, size_t h_offset, size_t w_offset, size_t h_tile, size_t w_tile, cudaStream_t stream)
-{
-    size_t tileStartAddress = (h_offset*width + w_offset)*sizeof(float2); 
-    size_t tileLastAddress = tileStartAddress + (h_tile*width + w_tile)*sizeof(float2); 
-    size_t pagesize = getpagesize();
-     
-    if(tileStartAddress  < offset || tileLastAddress > offset + buffersize )
-    {
-        size_t temp = tileStartAddress/pagesize;
-        offset = temp*pagesize;
-        mUnMap();
-        setupMmap(); 
-    }
-    
-    float2 *startPtr = mmapPtr ;
-    startPtr += (tileStartAddress - offset)/sizeof(float2);
-    
-    // @note 
-    // We assume down/across directions as rows/cols. Therefore, SLC mmap and device array are both row major. 
-    // cuBlas assumes both source and target arrays are column major. 
-    // To use cublasSetMatrix, we need to switch w_tile/h_tile for rows/cols  
-    // checkCudaErrors(cublasSetMatrixAsync(w_tile, h_tile, sizeof(float2), startPtr, width, dArray, w_tile, stream)); 
-    
-    checkCudaErrors(cudaMemcpy2DAsync(dArray, w_tile*sizeof(float2), startPtr, width*sizeof(float2), 
-                                      w_tile*sizeof(float2), h_tile, cudaMemcpyHostToDevice,stream)); 
-}
-
-SlcImage::~SlcImage()
-{
-    mUnMap();
-    closeFile();
-}
-  	  
-
-void SlcImage::testData()
-{
-    float2 *test;
-    test =(float2 *)malloc(10*sizeof(float2));
-    mempcpy(test, mmapPtr+1000000l, 10*sizeof(float2));
-    for(int i=0; i<10; i++)
-        std::cout << test[i].x << " " << test[i].y << ",";
-    std::cout << std::endl;
-}
--- a/contrib/PyCuAmpcor/src/SlcImage.h
+++ b/contrib/PyCuAmpcor/src/SlcImage.h
@ -1,64 +0,0 @@
-// -*- c++ -*- 
-#ifndef __SLCIMAGE_H
-#define __SLCIMAGE_H
-
-#include <cublas_v2.h>
-#include <string>
-
-class SlcImage{
-private:
-    std::string filename;
-    int fileid;
-    size_t filesize;
-    size_t height;
-    size_t width;
-    
-    bool is_mapped;
-    bool is_opened;
-    float2* mmapPtr;  
-    size_t buffersize;
-    size_t offset;
-    
-public:  
-    SlcImage();
-    
-    SlcImage(std::string fn, size_t h, size_t w);
-    SlcImage(std::string fn, size_t h, size_t w, size_t bsize);
-    void openFile();
-    void closeFile();
-    void setupMmap();
-    void mUnMap();
-    void setBufferSize(size_t size);
-    
-    float2* getmmapPtr()
-    {
-        return(mmapPtr);
-    }
-    
-    size_t getFileSize()
-    {
-        return (filesize);
-    }
-    
-    size_t getHeight() {
-        return (height);
-    }
-    
-    size_t getWidth()
-    {
-        return (width);
-    }
-    
-    bool getMmapStatus() 
-    {
-        return(is_mapped);
-    }
-    
-    //tested
-    void loadToDevice(float2 *dArray, size_t h_offset, size_t w_offset, size_t h_tile, size_t w_tile, cudaStream_t stream);
-    ~SlcImage();
-    void testData();
-    
-};
-
-#endif //__SLCIMAGE_H
--- a/contrib/PyCuAmpcor/src/cuAmpcorChunk.cu
+++ b/contrib/PyCuAmpcor/src/cuAmpcorChunk.cu
@ -17,74 +17,143 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
    //std::cout << "load reference chunk ok\n";

    cuArraysAbs(c_referenceBatchRaw, r_referenceBatchRaw, stream);
+
+#ifdef CUAMPCOR_DEBUG
+    // dump the raw reference image(s)
+    c_referenceBatchRaw->outputToFile("c_referenceBatchRaw", stream);
+    r_referenceBatchRaw->outputToFile("r_referenceBatchRaw", stream);
+#endif
+
    cuArraysSubtractMean(r_referenceBatchRaw, stream);
+
+#ifdef CUAMPCOR_DEBUG
+    // dump the raw reference image(s) with mean subtracted
+    r_referenceBatchRaw->outputToFile("r_referenceBatchRawSubMean", stream);
+#endif
+
    // load secondary image chunk
    loadSecondaryChunk();
    cuArraysAbs(c_secondaryBatchRaw, r_secondaryBatchRaw, stream);

-    //std::cout << "load secondary chunk ok\n";
+#ifdef CUAMPCOR_DEBUG
+    // dump the raw secondary image(s)
+    c_secondaryBatchRaw->outputToFile("c_secondaryBatchRaw", stream);
+    r_secondaryBatchRaw->outputToFile("r_secondaryBatchRaw", stream);
+#endif

+    //std::cout << "load secondary chunk ok\n";

    //cross correlation for non-oversampled data
    if(param->algorithm == 0) {
        cuCorrFreqDomain->execute(r_referenceBatchRaw, r_secondaryBatchRaw, r_corrBatchRaw);
-    }
-    else {
+    } else {
        cuCorrTimeDomain(r_referenceBatchRaw, r_secondaryBatchRaw, r_corrBatchRaw, stream); //time domain cross correlation
    }
+
+#ifdef CUAMPCOR_DEBUG
+    // dump the un-normalized correlation surface
+    r_corrBatchRaw->outputToFile("r_corrBatchRawUnNorm", stream);
+#endif
+
    cuCorrNormalize(r_referenceBatchRaw, r_secondaryBatchRaw, r_corrBatchRaw, stream);

+#ifdef CUAMPCOR_DEBUG
+    // dump the normalized correlation surface
+    r_corrBatchRaw->outputToFile("r_corrBatchRaw", stream);
+#endif

    // find the maximum location of non-oversampled correlation
    // 41 x 41, if halfsearchrange=20
    //cuArraysMaxloc2D(r_corrBatchRaw, offsetInit, stream);
    cuArraysMaxloc2D(r_corrBatchRaw, offsetInit, r_maxval, stream);

-    offsetInit->outputToFile("offsetInit1", stream);
-
    // Estimation of statistics
    // Author: Minyan Zhong
    // Extraction of correlation surface around the peak
    cuArraysCopyExtractCorr(r_corrBatchRaw, r_corrBatchRawZoomIn, i_corrBatchZoomInValid, offsetInit, stream);

-    cudaDeviceSynchronize();
-
-    // debug: output the intermediate results
-    r_maxval->outputToFile("r_maxval",stream);
-    r_corrBatchRaw->outputToFile("r_corrBatchRaw",stream);
-    r_corrBatchRawZoomIn->outputToFile("r_corrBatchRawZoomIn",stream);
-    i_corrBatchZoomInValid->outputToFile("i_corrBatchZoomInValid",stream);
+    //cudaDeviceSynchronize();

    // Summation of correlation and data point values
    cuArraysSumCorr(r_corrBatchRawZoomIn, i_corrBatchZoomInValid, r_corrBatchSum, i_corrBatchValidCount, stream);

+#ifdef CUAMPCOR_DEBUG
+    i_corrBatchZoomInValid->outputToFile("i_corrBatchZoomInValid", stream);
+    r_corrBatchSum->outputToFile("r_corrBatchSum", stream);
+    // snr and cov will be outputted anyway
+#endif
+
    // SNR
    cuEstimateSnr(r_corrBatchSum, i_corrBatchValidCount, r_maxval, r_snrValue, stream);

    // Variance
-    // cuEstimateVariance(r_corrBatchRaw, offsetInit, r_maxval, r_covValue, stream);
+    cuEstimateVariance(r_corrBatchRaw, offsetInit, r_maxval, r_covValue, stream);
+
+#ifdef CUAMPCOR_DEBUG
+    // debug: output the intermediate results
+
+    // std::cout << "Offset from first search:\n";
+    // offsetInit->debuginfo(stream);
+    // dump the results
+    offsetInit->outputToFile("i_offsetInit", stream);
+    r_maxval->outputToFile("r_maxval", stream);
+    r_corrBatchRawZoomIn->outputToFile("r_corrBatchRawStatZoomIn", stream);
+    i_corrBatchZoomInValid->outputToFile("i_corrBatchStatZoomInValid", stream);
+    //r_snr, r_cov will be always saved to files
+#endif

    // Using the approximate estimation to adjust secondary image (half search window size becomes only 4 pixels)
-    //offsetInit->debuginfo(stream);
+    // offsetInit->debuginfo(stream);
    // determine the starting pixel to extract secondary images around the max location
    cuDetermineSecondaryExtractOffset(offsetInit,
+        maxLocShift,
        param->halfSearchRangeDownRaw, // old range
        param->halfSearchRangeAcrossRaw,
        param->halfZoomWindowSizeRaw,  // new range
        param->halfZoomWindowSizeRaw,
        stream);
-    //offsetInit->debuginfo(stream);
+
+#ifdef CUAMPCOR_DEBUG
+    // std::cout << "max location adjusted if close to boundary\n";
+    // offsetInit->debuginfo(stream);
+    // std::cout << "and the shift of the center\n";
+    // maxLocShift->debuginfo(stream);
+    offsetInit->outputToFile("i_offsetInitAdjusted", stream);
+    maxLocShift->outputToFile("i_maxLocShift", stream);
+#endif
+
    // oversample reference
    // (deramping now included in oversampler)
    referenceBatchOverSampler->execute(c_referenceBatchRaw, c_referenceBatchOverSampled, param->derampMethod);
    cuArraysAbs(c_referenceBatchOverSampled, r_referenceBatchOverSampled, stream);
+
+#ifdef CUAMPCOR_DEBUG
+    // dump the oversampled reference image(s)
+    c_referenceBatchOverSampled->outputToFile("c_referenceBatchOverSampled", stream);
+    r_referenceBatchOverSampled->outputToFile("r_referenceBatchOverSampled", stream);
+#endif
+
+    // subtract the mean value
    cuArraysSubtractMean(r_referenceBatchOverSampled, stream);

+#ifdef CUAMPCOR_DEBUG
+    // dump the oversampled reference image(s) with mean subtracted
+    r_referenceBatchOverSampled->outputToFile("r_referenceBatchOverSampledSubMean",stream);
+#endif
+
    // extract secondary and oversample
    cuArraysCopyExtract(c_secondaryBatchRaw, c_secondaryBatchZoomIn, offsetInit, stream);
    secondaryBatchOverSampler->execute(c_secondaryBatchZoomIn, c_secondaryBatchOverSampled, param->derampMethod);
    cuArraysAbs(c_secondaryBatchOverSampled, r_secondaryBatchOverSampled, stream);

+#ifdef CUAMPCOR_DEBUG
+    // dump the extracted raw secondary image
+    c_secondaryBatchZoomIn->outputToFile("c_secondaryBatchZoomIn", stream);
+    // dump the oversampled secondary image(s)
+    c_secondaryBatchOverSampled->outputToFile("c_secondaryBatchOverSampled", stream);
+    r_secondaryBatchOverSampled->outputToFile("r_secondaryBatchOverSampled", stream);
+#endif
+
    // correlate oversampled images
    if(param->algorithm == 0) {
        cuCorrFreqDomain_OverSampled->execute(r_referenceBatchOverSampled, r_secondaryBatchOverSampled, r_corrBatchZoomIn);
@ -92,47 +161,77 @@ void cuAmpcorChunk::run(int idxDown_, int idxAcross_)
    else {
        cuCorrTimeDomain(r_referenceBatchOverSampled, r_secondaryBatchOverSampled, r_corrBatchZoomIn, stream);
    }
+
+#ifdef CUAMPCOR_DEBUG
+    // dump the oversampled correlation surface (un-normalized)
+    r_corrBatchZoomIn->outputToFile("r_corrBatchZoomInUnNorm", stream);
+#endif
+
+    // normalize the correlation surface
    cuCorrNormalize(r_referenceBatchOverSampled, r_secondaryBatchOverSampled, r_corrBatchZoomIn, stream);

+#ifdef CUAMPCOR_DEBUG
    //std::cout << "debug correlation oversample\n";
    //std::cout << r_referenceBatchOverSampled->height << " " << r_referenceBatchOverSampled->width << "\n";
    //std::cout << r_secondaryBatchOverSampled->height << " " << r_secondaryBatchOverSampled->width << "\n";
    //std::cout << r_corrBatchZoomIn->height << " " << r_corrBatchZoomIn->width << "\n";
+    // dump the oversampled correlation surface (normalized)
+    r_corrBatchZoomIn->outputToFile("r_corrBatchZoomIn", stream);
+#endif

-    // oversample the correlation surface
+    // remove the last row and col to get even sequences (for sinc oversampler)
    cuArraysCopyExtract(r_corrBatchZoomIn, r_corrBatchZoomInAdjust, make_int2(0,0), stream);

+#ifdef CUAMPCOR_DEBUG
    //std::cout << "debug oversampling " << r_corrBatchZoomInAdjust << " " << r_corrBatchZoomInOverSampled << "\n";
+    // dump the adjusted correlation surface
+    r_corrBatchZoomInAdjust->outputToFile("r_corrBatchZoomInAdjust", stream);
+#endif

+    // oversample the correlation surface
    if(param->oversamplingMethod) {
-        corrSincOverSampler->execute(r_corrBatchZoomInAdjust, r_corrBatchZoomInOverSampled);
+        // sinc interpolator only computes (-i_sincwindow, i_sincwindow)*oversamplingfactor
+        // we need the max loc as the center if shifted
+        corrSincOverSampler->execute(r_corrBatchZoomInAdjust, r_corrBatchZoomInOverSampled,
+            maxLocShift, param->oversamplingFactor*param->rawDataOversamplingFactor
+            );
    }
    else {
        corrOverSampler->execute(r_corrBatchZoomInAdjust, r_corrBatchZoomInOverSampled);
    }

-    //find the max again
+#ifdef CUAMPCOR_DEBUG
+    // dump the oversampled correlation surface
+    r_corrBatchZoomInOverSampled->outputToFile("r_corrBatchZoomInOverSampled", stream);
+#endif

+    //find the max again
    cuArraysMaxloc2D(r_corrBatchZoomInOverSampled, offsetZoomIn, corrMaxValue, stream);

+#ifdef CUAMPCOR_DEBUG
+    // dump the max location on oversampled correlation surface
+    offsetZoomIn->outputToFile("i_offsetZoomIn", stream);
+    corrMaxValue->outputToFile("r_maxvalZoomInOversampled", stream);
+#endif
+
    // determine the final offset from non-oversampled (pixel) and oversampled (sub-pixel)
+    // = (Init-HalfsearchRange) + ZoomIn/(2*ovs)
    cuSubPixelOffset(offsetInit, offsetZoomIn, offsetFinal,
        param->oversamplingFactor, param->rawDataOversamplingFactor,
        param->halfSearchRangeDownRaw, param->halfSearchRangeAcrossRaw,
        param->halfZoomWindowSizeRaw, param->halfZoomWindowSizeRaw,
        stream);
-    //offsetInit->debuginfo(stream);
-    //offsetZoomIn->debuginfo(stream);
-    //offsetFinal->debuginfo(stream);
+
+// #ifdef CUAMPCOR_DEBUG
+    // std::cout << "Offsets: Oversampled and Final\n";
+    // offsetZoomIn->debuginfo(stream);
+    // offsetFinal->debuginfo(stream);
+// #endif

    // Do insertion.
    // Offsetfields.
    cuArraysCopyInsert(offsetFinal, offsetImage, idxDown_*param->numberWindowDownInChunk, idxAcross_*param->numberWindowAcrossInChunk,stream);

-    // Debugging matrix.
-    cuArraysCopyInsert(r_corrBatchSum, floatImage1, idxDown_*param->numberWindowDownInChunk, idxAcross_*param->numberWindowAcrossInChunk,stream);
-    cuArraysCopyInsert(i_corrBatchValidCount, intImage1, idxDown_*param->numberWindowDownInChunk, idxAcross_*param->numberWindowAcrossInChunk,stream);
-
    // Old: save max correlation coefficients.
    //cuArraysCopyInsert(corrMaxValue, snrImage, idxDown_*param->numberWindowDownInChunk, idxAcross_*param->numberWindowAcrossInChunk,stream);
    // New: save SNR
@ -301,7 +400,8 @@ void cuAmpcorChunk::loadSecondaryChunk()
 }

 cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, GDALImage *secondary_,
-    cuArrays<float2> *offsetImage_, cuArrays<float> *snrImage_, cuArrays<float3> *covImage_, cuArrays<int> *intImage1_, cuArrays<float> *floatImage1_, cudaStream_t stream_)
+    cuArrays<float2> *offsetImage_, cuArrays<float> *snrImage_, cuArrays<float3> *covImage_,
+    cudaStream_t stream_)

 {
    param = param_;
@ -311,9 +411,6 @@ cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, G
    snrImage = snrImage_;
    covImage = covImage_;

-    intImage1 = intImage1_;
-    floatImage1 = floatImage1_;
-
    stream = stream_;

    // std::cout << "debug Chunk creator " << param->maxReferenceChunkHeight << " " << param->maxReferenceChunkWidth << "\n";
@ -422,13 +519,17 @@ cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, G
    offsetFinal = new cuArrays<float2> (param->numberWindowDownInChunk, param->numberWindowAcrossInChunk);
    offsetFinal->allocate();

+    maxLocShift = new cuArrays<int2> (param->numberWindowDownInChunk, param->numberWindowAcrossInChunk);
+    maxLocShift->allocate();
+
    corrMaxValue = new cuArrays<float> (param->numberWindowDownInChunk, param->numberWindowAcrossInChunk);
    corrMaxValue->allocate();


    // new arrays due to snr estimation
-    std::cout<< "corrRawZoomInHeight: " << param->corrRawZoomInHeight << "\n";
-    std::cout<< "corrRawZoomInWidth: " << param->corrRawZoomInWidth << "\n";
+    // std::cout<< "corrRawZoomInHeight: " << param->corrRawZoomInHeight << "\n";
+    // std::cout<< "corrRawZoomInWidth: " << param->corrRawZoomInWidth << "\n";
+    std::cout << "Size of corr_surface used for statistics: " << param->corrRawZoomInHeight << " x " << param->corrRawZoomInWidth << "\n";

    r_corrBatchRawZoomIn = new cuArrays<float> (
 			param->corrRawZoomInHeight,
@ -474,7 +575,7 @@ cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, G
    // end of new arrays

    if(param->oversamplingMethod) {
-        corrSincOverSampler = new cuSincOverSamplerR2R(param->zoomWindowSize, param->oversamplingFactor, stream);
+        corrSincOverSampler = new cuSincOverSamplerR2R(param->oversamplingFactor, stream);
    }
    else {
        corrOverSampler= new cuOverSamplerR2R(param->zoomWindowSize, param->zoomWindowSize,
@ -495,8 +596,9 @@ cuAmpcorChunk::cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, G
    }


-
-    debugmsg("all objects in chunk are created ...\n");
+#ifdef CUAMPCOR_DEBUG
+    std::cout << "all objects in chunk are created ...\n";
+#endif

 }
 cuAmpcorChunk::~cuAmpcorChunk()
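
The final-offset step in `cuAmpcorChunk::run` combines the integer-pixel offset with the sub-pixel peak location, as summarized by the comment `= (Init-HalfsearchRange) + ZoomIn/(2*ovs)`. Below is a minimal host-side sketch of that arithmetic, mirroring the per-element expression in `cuSubPixelOffset_kernel`. The names `Offset2f`, `assumedRatio`, and `finalOffset` are hypothetical, and the ratio `1/(oversamplingFactor*rawDataOversamplingFactor)` is an assumption inferred from that comment, not confirmed by the code shown in this diff.

```cpp
#include <cassert>

// Hypothetical host-side mirror of the per-window arithmetic; not part of PyCuAmpcor.
struct Offset2f { float x; float y; };

// Assumption: the scaling ratio is 1 / (oversamplingFactor * rawDataOversamplingFactor).
inline float assumedRatio(int oversamplingFactor, int rawDataOversamplingFactor)
{
    assert(oversamplingFactor > 0 && rawDataOversamplingFactor > 0);
    return 1.0f / static_cast<float>(oversamplingFactor * rawDataOversamplingFactor);
}

// final = ratio * zoomIn + init - halfSearchRange, applied per axis
inline Offset2f finalOffset(int initX, int initY,
                            int zoomX, int zoomY,
                            float osRatio,
                            int halfRangeX, int halfRangeY)
{
    Offset2f out;
    out.x = osRatio * static_cast<float>(zoomX) + static_cast<float>(initX - halfRangeX);
    out.y = osRatio * static_cast<float>(zoomY) + static_cast<float>(initY - halfRangeY);
    return out;
}
```

With the default factors (16 and 2), one step on the oversampled correlation surface would correspond to 1/32 of a raw pixel under this assumption.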
--- a/contrib/PyCuAmpcor/src/cuAmpcorChunk.h
+++ b/contrib/PyCuAmpcor/src/cuAmpcorChunk.h
@ -31,10 +31,6 @@ private:
 	cuArrays<float> *snrImage;
 	cuArrays<float3> *covImage;

-	// added for test
-    cuArrays<int> *intImage1;
-    cuArrays<float> *floatImage1;
-
    // gpu buffer
 	cuArrays<float2> * c_referenceChunkRaw, * c_secondaryChunkRaw;
 	cuArrays<float> * r_referenceChunkRaw, * r_secondaryChunkRaw;
@ -61,6 +57,7 @@ private:
 	cuArrays<int2> *offsetInit;
 	cuArrays<int2> *offsetZoomIn;
 	cuArrays<float2> *offsetFinal;
+	cuArrays<int2> *maxLocShift; //record the maxloc from the extract center
    cuArrays<float> *corrMaxValue;


@ -85,7 +82,7 @@ public:
 	void setIndex(int idxDown_, int idxAcross_);

 	cuAmpcorChunk(cuAmpcorParameter *param_, GDALImage *reference_, GDALImage *secondary_, cuArrays<float2> *offsetImage_,
-	            cuArrays<float> *snrImage_, cuArrays<float3> *covImage_, cuArrays<int> *intImage1_, cuArrays<float> *floatImage1_, cudaStream_t stream_);
+	            cuArrays<float> *snrImage_, cuArrays<float3> *covImage_, cudaStream_t stream_);


    void loadReferenceChunk();
--- a/contrib/PyCuAmpcor/src/cuAmpcorController.cu
+++ b/contrib/PyCuAmpcor/src/cuAmpcorController.cu
@ -27,13 +27,13 @@ void cuAmpcorController::runAmpcor() {
    cuArrays<float3> *covImage, *covImageRun;

    // For debugging.
-    cuArrays<int> *intImage1;
-    cuArrays<float> *floatImage1;
+    // cuArrays<int> *corrValidCountImage;
+    // cuArrays<float> *corrSumImage;

    int nWindowsDownRun = param->numberChunkDown * param->numberWindowDownInChunk;
    int nWindowsAcrossRun = param->numberChunkAcross * param->numberWindowAcrossInChunk;

-    std::cout << "Debug " << nWindowsDownRun << " " << param->numberWindowDown << "\n";
+    //std::cout << "The number of windows to be processed (might be bigger) " << nWindowsDownRun << " x " << param->numberWindowDown << "\n";

    offsetImageRun = new cuArrays<float2>(nWindowsDownRun, nWindowsAcrossRun);
    offsetImageRun->allocate();
@ -44,14 +44,6 @@ void cuAmpcorController::runAmpcor() {
    covImageRun = new cuArrays<float3>(nWindowsDownRun, nWindowsAcrossRun);
    covImageRun->allocate();

-    // intImage 1 and floatImage 1 are added for debugging issues
-
-    intImage1 = new cuArrays<int>(nWindowsDownRun, nWindowsAcrossRun);
-    intImage1->allocate();
-
-    floatImage1 = new cuArrays<float>(nWindowsDownRun, nWindowsAcrossRun);
-    floatImage1->allocate();
-
    // Offsetfields.
    offsetImage = new cuArrays<float2>(param->numberWindowDown, param->numberWindowAcross);
    offsetImage->allocate();
@ -69,7 +61,8 @@ void cuAmpcorController::runAmpcor() {
    for(int ist=0; ist<param->nStreams; ist++)
    {
        cudaStreamCreate(&streams[ist]);
-        chunk[ist]= new cuAmpcorChunk(param, referenceImage, secondaryImage, offsetImageRun, snrImageRun, covImageRun, intImage1, floatImage1, streams[ist]);
+        chunk[ist]= new cuAmpcorChunk(param, referenceImage, secondaryImage, offsetImageRun, snrImageRun, covImageRun,
+            streams[ist]);

    }

@ -106,9 +99,6 @@ void cuAmpcorController::runAmpcor() {
    snrImage->outputToFile(param->snrImageName, streams[0]);
    covImage->outputToFile(param->covImageName, streams[0]);

-    // Output debugging arrays.
-    intImage1->outputToFile("intImage1", streams[0]);
-    floatImage1->outputToFile("floatImage1", streams[0]);

    outputGrossOffsets();

@ -117,9 +107,6 @@ void cuAmpcorController::runAmpcor() {
    delete snrImage;
    delete covImage;

-    delete intImage1;
-    delete floatImage1;
-
    delete offsetImageRun;
    delete snrImageRun;
    delete covImageRun;
--- a/contrib/PyCuAmpcor/src/cuAmpcorParameter.cu
+++ b/contrib/PyCuAmpcor/src/cuAmpcorParameter.cu
@ -32,7 +32,7 @@ cuAmpcorParameter::cuAmpcorParameter()
    skipSampleAcrossRaw = 64;
    skipSampleDownRaw = 64;
    rawDataOversamplingFactor = 2;
-    zoomWindowSize = 8;
+    zoomWindowSize = 16;
    oversamplingFactor = 16;
    oversamplingMethod = 0;

@ -54,8 +54,7 @@ cuAmpcorParameter::cuAmpcorParameter()
    referenceStartPixelDown0 = 0;
    referenceStartPixelAcross0 = 0;

-    corrRawZoomInHeight = 17; // 8*2+1
-    corrRawZoomInWidth = 17;
+    corrStatWindowSize = 21; // 10*2+1 as in RIOPAC

    useMmap = 1; // use mmap
    mmapSizeInGB = 1;
@ -68,7 +67,19 @@ cuAmpcorParameter::cuAmpcorParameter()

 void cuAmpcorParameter::setupParameters()
 {
-    zoomWindowSize *= rawDataOversamplingFactor; //8 * 2
+    // Size to extract the raw correlation surface for snr/cov
+    corrRawZoomInHeight = std::min(corrStatWindowSize, 2*halfSearchRangeDownRaw+1);
+    corrRawZoomInWidth = std::min(corrStatWindowSize, 2*halfSearchRangeAcrossRaw+1);
+
+    // Size to extract the resampled correlation surface for oversampling
+    // users should use 16 for zoomWindowSize, no need to multiply by 2
+    // zoomWindowSize *= rawDataOversamplingFactor; //8 * 2
+    // to check the search range
+    int corrSurfaceActualSize =
+        std::min(halfSearchRangeAcrossRaw, halfSearchRangeDownRaw)*
+        2*rawDataOversamplingFactor;
+    zoomWindowSize = std::min(zoomWindowSize, corrSurfaceActualSize);
+
    halfZoomWindowSizeRaw = zoomWindowSize/(2*rawDataOversamplingFactor); // 8*2/(2*2) = 4

    windowSizeWidth = windowSizeWidthRaw*rawDataOversamplingFactor;  //
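
The window-size logic added to `setupParameters()` can be read in isolation. The following is a small standalone sketch of the same clamping rules, assuming the parameters are plain integers; the struct `ZoomSizes` and the function `clampZoomSizes` are hypothetical names used only for illustration.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the derived sizes; not part of PyCuAmpcor.
struct ZoomSizes {
    int corrRawZoomInHeight;   // rows of the raw correlation surface used for snr/cov
    int corrRawZoomInWidth;    // columns of the raw correlation surface used for snr/cov
    int zoomWindowSize;        // size of the surface passed to the oversampler
    int halfZoomWindowSizeRaw; // half zoom window expressed in raw pixels
};

inline ZoomSizes clampZoomSizes(int corrStatWindowSize, int zoomWindowSize,
                                int halfSearchRangeDownRaw, int halfSearchRangeAcrossRaw,
                                int rawDataOversamplingFactor)
{
    assert(rawDataOversamplingFactor > 0);
    ZoomSizes s;
    // the statistics window cannot exceed the raw correlation surface (2*half+1)
    s.corrRawZoomInHeight = std::min(corrStatWindowSize, 2 * halfSearchRangeDownRaw + 1);
    s.corrRawZoomInWidth = std::min(corrStatWindowSize, 2 * halfSearchRangeAcrossRaw + 1);
    // the zoom window cannot exceed the oversampled search range
    int corrSurfaceActualSize =
        std::min(halfSearchRangeAcrossRaw, halfSearchRangeDownRaw) * 2 * rawDataOversamplingFactor;
    s.zoomWindowSize = std::min(zoomWindowSize, corrSurfaceActualSize);
    s.halfZoomWindowSizeRaw = s.zoomWindowSize / (2 * rawDataOversamplingFactor);
    return s;
}
```

For example, with the new defaults (`corrStatWindowSize = 21`, `zoomWindowSize = 16`, `rawDataOversamplingFactor = 2`) and a half search range of 20 in both directions, nothing is clamped; with a half search range of 3, the statistics window shrinks to 7 and the zoom window to 12.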
--- a/contrib/PyCuAmpcor/src/cuAmpcorParameter.h
+++ b/contrib/PyCuAmpcor/src/cuAmpcorParameter.h
@ -50,7 +50,8 @@ public:
    int searchWindowSizeHeightRawZoomIn;
    int searchWindowSizeWidthRawZoomIn;

-    int corrRawZoomInHeight;  // window to estimate snr
+    int corrStatWindowSize;     /// window to estimate snr
+    int corrRawZoomInHeight;
    int corrRawZoomInWidth;

    // chip or window size after oversampling
--- a/contrib/PyCuAmpcor/src/cuAmpcorUtil.h
+++ b/contrib/PyCuAmpcor/src/cuAmpcorUtil.h
@ -6,7 +6,7 @@


 #ifndef __CUAMPCORUTIL_H
-#define __CUMAPCORUTIL_H
+#define __CUAMPCORUTIL_H

 #include "cuArrays.h"
 #include "cuAmpcorParameter.h"
@ -72,7 +72,7 @@ void cuSubPixelOffset(cuArrays<int2> *offsetInit, cuArrays<int2> *offsetZoomIn,
                      cudaStream_t stream);

 void cuDetermineInterpZone(cuArrays<int2> *maxloc, cuArrays<int2> *zoomInOffset, cuArrays<float> *corrOrig, cuArrays<float> *corrZoomIn, cudaStream_t stream);
-void cuDetermineSecondaryExtractOffset(cuArrays<int2> *maxLoc, int xOldRange, int yOldRange, int xNewRange, int yNewRange, cudaStream_t stream);
+void cuDetermineSecondaryExtractOffset(cuArrays<int2> *maxLoc, cuArrays<int2> *maxLocShift, int xOldRange, int yOldRange, int xNewRange, int yNewRange, cudaStream_t stream);

 //in cuCorrTimeDomain.cu: cross correlation in time domain
 void cuCorrTimeDomain(cuArrays<float> *templates, cuArrays<float> *images, cuArrays<float> *results, cudaStream_t stream);
--- a/contrib/PyCuAmpcor/src/cuArrays.cu
+++ b/contrib/PyCuAmpcor/src/cuArrays.cu
@ -1,14 +1,14 @@

 #include "cuArrays.h"
 #include "cudaError.h"
-	
+
 	template <typename T>
 	void cuArrays<T>::allocate()
 	{
 		checkCudaErrors(cudaMalloc((void **)&devData, getByteSize()));
-        is_allocated = 1; 
+        is_allocated = 1;
 	}
-	
+
    template <typename T>
    void cuArrays<T>::allocateHost()
    {
@ -16,41 +16,41 @@
        //checkCudaErrors(cudaMallocHost((void **)&hostData, getByteSize()));
        is_allocatedHost = 1;
    }
-    
+
 	template <typename T>
 	void cuArrays<T>::deallocate()
 	{
 		checkCudaErrors(cudaFree(devData));
-        is_allocated = 0; 
+        is_allocated = 0;
 	}
-	
+
    template <typename T>
 	void cuArrays<T>::deallocateHost()
 	{
 		//checkCudaErrors(cudaFreeHost(hostData));
        free(hostData);
-        is_allocatedHost = 0; 
+        is_allocatedHost = 0;
 	}
-    
+
    template <typename T>
 	void cuArrays<T>::copyToHost(cudaStream_t stream)
 	{
        //std::cout << "debug copy " << is_allocatedHost << " " << is_allocated  << " " << getByteSize() << "\n";
 		checkCudaErrors(cudaMemcpyAsync(hostData, devData, getByteSize(), cudaMemcpyDeviceToHost, stream));
 	}
-    
+
    template <typename T>
    void cuArrays<T>::copyToDevice(cudaStream_t stream)
 	{
 		checkCudaErrors(cudaMemcpyAsync(devData, hostData, getByteSize(), cudaMemcpyHostToDevice, stream));
 	}
-    
+
    template <typename T>
    void cuArrays<T>::setZero(cudaStream_t stream)
    {
        checkCudaErrors(cudaMemsetAsync(devData, 0, getByteSize(), stream));
    }
-    
+
 	template<>
 	void cuArrays<float2>::debuginfo(cudaStream_t stream) {
 		//std::cout << height << " " << width << " " << count << std::endl;
@ -58,41 +58,41 @@
        if( !is_allocatedHost)
    		allocateHost();
        copyToHost(stream);
-    
+
        //cudaStreamSynchronize(stream);
        //std::cout << "debug debuginfo " << size << " " << count << " " << stream << "\n";

-		int range = min(10, size*count);
+		int range = std::min(10, size*count);
 	
 		for(int i=0; i<range; i++)
-			std::cout << "(" <<hostData[i].x << " ," << hostData[i].y << ")" ;
+			std::cout << "(" <<hostData[i].x << ", " << hostData[i].y << ")" ;
 		std::cout << std::endl;
        if(size*count>range) {
            for(int i=size*count-range; i<size*count; i++)
-                std::cout << "(" <<hostData[i].x << " ," << hostData[i].y << ")" ;
+                std::cout << "(" <<hostData[i].x << ", " << hostData[i].y << ")" ;
            std::cout << std::endl;
        }
 	}
-	
-    	
+
+
 	template<>
 	void cuArrays<int2>::debuginfo(cudaStream_t stream) {
 		//std::cout << height << " " << width << " " << count << std::endl;
        if( !is_allocatedHost)
    		allocateHost();
        copyToHost(stream);
-		int range = min(10, size*count);
+		int range = std::min(10, size*count);
 	
 		for(int i=0; i<range; i++)
-			std::cout << "(" <<hostData[i].x << " ," << hostData[i].y << ")" ;
+			std::cout << "(" <<hostData[i].x << ", " << hostData[i].y << ")" ;
 		std::cout << std::endl;
 		if(size*count>range) {
            for(int i=size*count-range; i<size*count; i++)
-                std::cout << "(" <<hostData[i].x << " ," << hostData[i].y << ")" ;
+                std::cout << "(" <<hostData[i].x << ", " << hostData[i].y << ")" ;
            std::cout << std::endl;
        }
 	}
-    
+
 	template <>
 	void cuArrays<float>::debuginfo(cudaStream_t stream) {
 		std::cout << height << " " << width << " " << count << std::endl;
@ -100,7 +100,7 @@
    		allocateHost();
        copyToHost(stream);
 		
-		int range = min(10, size*count);
+		int range = std::min(10, size*count);
 	
 		for(int i=0; i<range; i++)
 			std::cout << "(" <<hostData[i]  << ")" ;
@ -111,7 +111,7 @@
            std::cout << std::endl;
        }
 	}
-	
+
 	template<typename T>
 	void cuArrays<T>::outputToFile(std::string fn, cudaStream_t stream)
 	{
@ -124,12 +124,12 @@
    template <typename T>
    void cuArrays<T>::outputHostToFile(std::string fn)
 	{
-		std::ofstream file;  
+		std::ofstream file;
 		file.open(fn.c_str(),  std::ios_base::binary);
 		file.write((char *)hostData, getByteSize());
 		file.close();
 	}
-    
+
 	/*
 	template<>
 	void cuArrays<float>::outputToFile(std::string fn, cudaStream_t stream)
@ -137,19 +137,19 @@
 		float *data;
 		data = (float *)malloc(size*count*sizeof(float));
 		cudaMemcpyAsync(data, devData, size*count*sizeof(float), cudaMemcpyDeviceToHost, stream);
-		std::ofstream file;  
+		std::ofstream file;
 		file.open(fn.c_str(),  std::ios_base::binary);
 		file.write((char *)data, size*count*sizeof(float));
 		file.close();
 	}*/
-	
+
 	template<>
 	void cuArrays<float2>::outputToFile(std::string fn, cudaStream_t stream)
 	{
 		float *data;
 		data = (float *)malloc(size*count*sizeof(float2));
 		checkCudaErrors(cudaMemcpyAsync(data, devData, size*count*sizeof(float2), cudaMemcpyDeviceToHost, stream));
-		std::ofstream file;  
+		std::ofstream file;
 		file.open(fn.c_str(),  std::ios_base::binary);
 		file.write((char *)data, size*count*sizeof(float2));
 		file.close();
@ -161,12 +161,12 @@
 		float *data;
 		data = (float *)malloc(size*count*sizeof(float3));
 		checkCudaErrors(cudaMemcpyAsync(data, devData, size*count*sizeof(float3), cudaMemcpyDeviceToHost, stream));
-		std::ofstream file;  
+		std::ofstream file;
 		file.open(fn.c_str(),  std::ios_base::binary);
 		file.write((char *)data, size*count*sizeof(float3));
 		file.close();
 	}
-	
+
 	template class cuArrays<float>;
 	template class cuArrays<float2>;
    template class cuArrays<float3>;
--- a/contrib/PyCuAmpcor/src/cuArraysCopy.cu
+++ b/contrib/PyCuAmpcor/src/cuArraysCopy.cu
@ -233,33 +233,37 @@ __global__ void cuArraysCopyExtractVaryingOffsetCorr(const float *imageIn, const
     const int2 *maxloc)
 {

-        int idxImage = blockIdx.z;
+    // get the image index
+    int idxImage = blockIdx.z;

-        // One thread per out point. Find the coordinates within the current image.
-        int outx = threadIdx.x + blockDim.x*blockIdx.x;
-        int outy = threadIdx.y + blockDim.y*blockIdx.y;
+    // One thread per out point. Find the coordinates within the current image.
+    int outx = threadIdx.x + blockDim.x*blockIdx.x;
+    int outy = threadIdx.y + blockDim.y*blockIdx.y;

-        // Find the correponding input.
+    // check whether thread is within output image range
+    if (outx < outNX && outy < outNY)
+    {
+        // Find the corresponding input.
        int inx = outx + maxloc[idxImage].x - outNX/2;
        int iny = outy + maxloc[idxImage].y - outNY/2;

-        if (outx < outNX && outy < outNY)
+        // Find the location in flattened array.
+        int idxOut = ( blockIdx.z * outNX + outx ) * outNY + outy;
+        int idxIn = ( blockIdx.z * inNX + inx ) * inNY + iny;
+
+        // check whether inside of the input image
+        if (inx>=0 && iny>=0 && inx<inNX && iny<inNY)
        {
-                // Find the location in full array.
-                int idxOut = ( blockIdx.z * outNX + outx ) * outNY + outy;
-
-                int idxIn = ( blockIdx.z * inNX + inx ) * inNY + iny;
-
-            if (inx>=0 && iny>=0 && inx<inNX && iny<inNY) {
-
-                    imageOut[idxOut] = imageIn[idxIn];
-                    imageValid[idxOut] = 1;
-                }
-            else {
-                    imageOut[idxOut] = 0.0f;
-                    imageValid[idxOut] = 0;
-            }
+            // inside the boundary, copy over and mark the pixel as valid (1)
+            imageOut[idxOut] = imageIn[idxIn];
+            imageValid[idxOut] = 1;
        }
+        else {
+            // outside, set it to 0 and mark the pixel as invalid (0)
+            imageOut[idxOut] = 0.0f;
+            imageValid[idxOut] = 0;
+        }
+    }
 }

 /* copy a tile of images to another image, with starting pixels offsets accounting for boundary
@ -268,16 +272,16 @@ __global__ void cuArraysCopyExtractVaryingOffsetCorr(const float *imageIn, const
 */
 void cuArraysCopyExtractCorr(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut, cuArrays<int> *imagesValid, cuArrays<int2> *maxloc, cudaStream_t stream)
 {
-        //assert(imagesIn->height >= imagesOut && inNY >= outNY);
-        const int nthreads = 16;
+    //assert(imagesIn->height >= imagesOut && inNY >= outNY);
+    const int nthreads = 16;

-        dim3 threadsperblock(nthreads, nthreads,1);
+    dim3 threadsperblock(nthreads, nthreads,1);

-        dim3 blockspergrid(IDIVUP(imagesOut->height,nthreads), IDIVUP(imagesOut->width,nthreads), imagesOut->count);
+    dim3 blockspergrid(IDIVUP(imagesOut->height,nthreads), IDIVUP(imagesOut->width,nthreads), imagesOut->count);

-        cuArraysCopyExtractVaryingOffsetCorr<<<blockspergrid, threadsperblock,0, stream>>>(imagesIn->devData, imagesIn->height, imagesIn->width,
-            imagesOut->devData, imagesOut->height, imagesOut->width, imagesValid->devData, imagesOut->count, maxloc->devData);
-        getLastCudaError("cuArraysCopyExtract error");
+    cuArraysCopyExtractVaryingOffsetCorr<<<blockspergrid, threadsperblock,0, stream>>>(imagesIn->devData, imagesIn->height, imagesIn->width,
+        imagesOut->devData, imagesOut->height, imagesOut->width, imagesValid->devData, imagesOut->count, maxloc->devData);
+    getLastCudaError("cuArraysCopyExtract error");
 }

 // end of correlation surface extraction (Minyan Zhong)
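
For checking the debug output of the restructured kernel, a CPU reference can be handy. The sketch below reproduces the indexing of `cuArraysCopyExtractVaryingOffsetCorr` for a single image on the host, assuming the same row-major layout (`x * NY + y`). The names `ExtractResult` and `extractAroundMax` are hypothetical and not part of the source.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical result holder: extracted tile plus a validity mask (1 = copied, 0 = padded).
struct ExtractResult {
    std::vector<float> data;
    std::vector<int> valid;
};

// Extract an outNX x outNY tile centered on (maxX, maxY) from an inNX x inNY image.
// Pixels that fall outside the input are set to 0 and marked invalid.
inline ExtractResult extractAroundMax(const std::vector<float>& in, int inNX, int inNY,
                                      int outNX, int outNY, int maxX, int maxY)
{
    assert(in.size() == static_cast<std::size_t>(inNX) * inNY);
    ExtractResult r;
    r.data.assign(static_cast<std::size_t>(outNX) * outNY, 0.0f);
    r.valid.assign(static_cast<std::size_t>(outNX) * outNY, 0);
    for (int outx = 0; outx < outNX; ++outx) {
        for (int outy = 0; outy < outNY; ++outy) {
            // corresponding input pixel, with the tile centered on the max location
            int inx = outx + maxX - outNX / 2;
            int iny = outy + maxY - outNY / 2;
            int idxOut = outx * outNY + outy;
            if (inx >= 0 && iny >= 0 && inx < inNX && iny < inNY) {
                r.data[idxOut] = in[inx * inNY + iny];
                r.valid[idxOut] = 1;
            }
        }
    }
    return r;
}
```

The tile and the mask can then be compared with the `r_corrBatchRawStatZoomIn` and `i_corrBatchStatZoomInValid` files written in debug mode, one window at a time.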
--- a/contrib/PyCuAmpcor/src/cuDeramp.cu
+++ b/contrib/PyCuAmpcor/src/cuDeramp.cu
@ -277,9 +277,6 @@ void cpuDerampMethod3(cuArrays<float2> *imagesD, cudaStream_t stream)
            }         
        }
        //phaseDiffY /=  height*(width-1);
-        
-       
-        
         if (complexAbs(phaseDiffY) < 1.e-5) {
            phaseY = 0.0;
        }
@ -331,14 +328,17 @@ void cpuDerampMethod3(cuArrays<float2> *imagesD, cudaStream_t stream)
        
 void cuDeramp(int method, cuArrays<float2> *images, cudaStream_t stream)
 {
+    // methods 2-3 are for test purposes only, removed for release
+    // note method 0 is designed for TOPSAR: not only deramping is skipped,
+    //    the amplitude is taken before oversampling
    switch(method) {
-    case 3:
-        cpuDerampMethod3(images, stream);    
+    //case 3:
+    //    cpuDerampMethod3(images, stream);
    case 1:
        cuDerampMethod1(images, stream);
        break;
-    case 2:
-        cuDerampMethod2(images, stream);
+    //case 2:
+    //    cuDerampMethod2(images, stream);
        break;
    default:
        break;
--- a/contrib/PyCuAmpcor/src/cuOffset.cu
+++ b/contrib/PyCuAmpcor/src/cuOffset.cu
@ -1,13 +1,13 @@
 /*
 * maxlocation.cu
 * Purpose: find the location of maximum for a batch of images/vectors
- *          this uses the reduction algorithm similar to summations  
- *  
+ *          this uses the reduction algorithm similar to summations
+ *
 * Author : Lijun Zhu
 *          Seismo Lab, Caltech
- * Version 1.0 10/01/16  
-*/ 
-	
+ * Version 1.0 10/01/16
+*/
+
 #include "cuAmpcorUtil.h"
 #include <cfloat>

@ -26,30 +26,30 @@ __device__ float atomicMaxf(float* address, float val)


 // compare two elements
-inline static __device__ void maxPairReduce(volatile float* maxval, volatile int* maxloc, 
+inline static __device__ void maxPairReduce(volatile float* maxval, volatile int* maxloc,
      size_t gid, size_t strideid)
 {
 	if(maxval[gid] < maxval[strideid]) {
 		maxval[gid] = maxval[strideid];
 		maxloc[gid] = maxloc[strideid];
 	}
-}   
+}

-// max reduction kernel, save the results to shared memory 
+// max reduction kernel, save the results to shared memory
 template<const int BLOCKSIZE>
-__device__ void max_reduction(const float* const images, 
+__device__ void max_reduction(const float* const images,
    const size_t imageSize,
-    const size_t nImages, 
-    volatile float* shval, 
+    const size_t nImages,
+    volatile float* shval,
    volatile int* shloc)
 {
 	int tid = threadIdx.x;
-    shval[tid] = -FLT_MAX; 
+    shval[tid] = -FLT_MAX;
 	int imageStart = blockIdx.x*imageSize;
 	int imagePixel;

-	// reduction for elements with i, i+BLOCKSIZE, i+2*BLOCKSIZE ... 
-	// 
+	// reduction for elements with i, i+BLOCKSIZE, i+2*BLOCKSIZE ...
+	//
 	for(int gid = tid; gid < imageSize; gid+=blockDim.x)
 	{
 		imagePixel = imageStart+gid;
@ -59,13 +59,13 @@ __device__ void max_reduction(const float* const images,
 		}
 	}
    __syncthreads();
-    
+
    //reduction within a block
    if (BLOCKSIZE >=1024){ if (tid < 512) { maxPairReduce(shval, shloc, tid, tid + 512); } __syncthreads(); }
    if (BLOCKSIZE >=512) { if (tid < 256) { maxPairReduce(shval, shloc, tid, tid + 256); } __syncthreads(); }
    if (BLOCKSIZE >=256) { if (tid < 128) { maxPairReduce(shval, shloc, tid, tid + 128); } __syncthreads(); }
    if (BLOCKSIZE >=128) { if (tid < 64 ) { maxPairReduce(shval, shloc, tid, tid + 64 ); } __syncthreads(); }
-    //reduction within a warp 
+    //reduction within a warp
    if (tid < 32)
    {
 		maxPairReduce(shval, shloc, tid, tid + 32);
@ -83,22 +83,22 @@ template <const int BLOCKSIZE>
 __global__ void  cuMaxValLoc_kernel( const float* const images, float *maxval, int* maxloc, const size_t imageSize, const size_t nImages)
 {
    __shared__ float shval[BLOCKSIZE];
-    __shared__ int shloc[BLOCKSIZE];    
-    int bid = blockIdx.x; 
+    __shared__ int shloc[BLOCKSIZE];
+    int bid = blockIdx.x;
    if(bid >= nImages) return;
-    
+
    max_reduction<BLOCKSIZE>(images, imageSize, nImages, shval, shloc);
-    
+
    if (threadIdx.x == 0) {
        maxloc[bid] = shloc[0];
        maxval[bid] = shval[0];
-    }      
+    }
 }

 void cuArraysMaxValandLoc(cuArrays<float> *images, cuArrays<float> *maxval, cuArrays<int> *maxloc, cudaStream_t stream)
 {
    const size_t imageSize = images->size;
-    const size_t nImages = images->count; 
+    const size_t nImages = images->count;
    dim3 threadsperblock(NTHREADS);
    dim3 blockspergrid(nImages);
    cuMaxValLoc_kernel<NTHREADS><<<blockspergrid, threadsperblock, 0, stream>>>
@ -106,29 +106,29 @@ void cuArraysMaxValandLoc(cuArrays<float> *images, cuArrays<float> *maxval, cuAr
    getLastCudaError("cudaKernel fine max location error\n");
 }

-//kernel and function for 1D array, find max location only 
+//kernel and function for 1D array, find max location only
 template <const int BLOCKSIZE>
 __global__ void  cudaKernel_maxloc(const float* const images, int* maxloc,
                                   const size_t imageSize, const size_t nImages)
 {
    __shared__ float shval[BLOCKSIZE];
    __shared__ int shloc[BLOCKSIZE];
-    
-    int bid = blockIdx.x; 
+
+    int bid = blockIdx.x;
    if(bid >=nImages) return;
-    
+
    max_reduction<BLOCKSIZE>(images, imageSize, nImages, shval, shloc);
-    
+
    if (threadIdx.x == 0) {
        maxloc[bid] = shloc[0];
    }
 }

-void cuArraysMaxLoc(cuArrays<float> *images, cuArrays<int> *maxloc, cudaStream_t stream) 
+void cuArraysMaxLoc(cuArrays<float> *images, cuArrays<int> *maxloc, cudaStream_t stream)
 {
    int imageSize = images->size;
    int nImages = maxloc->size;
-    
+
    cudaKernel_maxloc<NTHREADS><<<nImages, NTHREADS,0, stream>>>
        (images->devData, maxloc->devData, imageSize, nImages);
    getLastCudaError("cudaKernel find max location 1D error\n");
@ -140,21 +140,21 @@ __global__ void  cudaKernel_maxloc2D(const float* const images, int2* maxloc, fl
 {
    __shared__ float shval[BLOCKSIZE];
    __shared__ int shloc[BLOCKSIZE];
-    
-    int bid = blockIdx.x; 
+
+    int bid = blockIdx.x;
    if(bid >= nImages) return;
-    
+
    const int imageSize = imageNX * imageNY;
    max_reduction<BLOCKSIZE>(images, imageSize, nImages, shval, shloc);
-    
+
    if (threadIdx.x == 0) {
-        maxloc[bid] = make_int2(shloc[0]/imageNY, shloc[0]%imageNY); 
+        maxloc[bid] = make_int2(shloc[0]/imageNY, shloc[0]%imageNY);
        maxval[bid] = shval[0];
    }
 }

 void cuArraysMaxloc2D(cuArrays<float> *images, cuArrays<int2> *maxloc,
-                      cuArrays<float> *maxval, cudaStream_t stream) 
+                      cuArrays<float> *maxval, cudaStream_t stream)
 {
    cudaKernel_maxloc2D<NTHREADS><<<images->count, NTHREADS, 0, stream>>>
        (images->devData, maxloc->devData, maxval->devData, images->height, images->width, images->count);
@ -167,21 +167,21 @@ __global__ void  cudaKernel_maxloc2D(const float* const images, int2* maxloc, co
 {
    __shared__ float shval[BLOCKSIZE];
    __shared__ int shloc[BLOCKSIZE];
-    
-    int bid = blockIdx.x; 
+
+    int bid = blockIdx.x;
    if(bid >= nImages) return;
-        
+
    const int imageSize = imageNX * imageNY;
    max_reduction<BLOCKSIZE>(images, imageSize, nImages, shval, shloc);
-    
+
    if (threadIdx.x == 0) {
        int xloc = shloc[0]/imageNY;
        int yloc = shloc[0]%imageNY;
-        maxloc[bid] = make_int2(xloc, yloc); 
+        maxloc[bid] = make_int2(xloc, yloc);
    }
 }

-void cuArraysMaxloc2D(cuArrays<float> *images, cuArrays<int2> *maxloc, cudaStream_t stream) 
+void cuArraysMaxloc2D(cuArrays<float> *images, cuArrays<int2> *maxloc, cudaStream_t stream)
 {
    cudaKernel_maxloc2D<NTHREADS><<<images->count, NTHREADS, 0, stream>>>
        (images->devData, maxloc->devData, images->height, images->width, images->count);
@ -201,15 +201,15 @@ __global__ void cuSubPixelOffset_kernel(const int2 *offsetInit, const int2 *offs
    if (idx >= size) return;
    offsetFinal[idx].x = OSratio*(offsetZoomIn[idx].x ) + offsetInit[idx].x  - xoffset;
    offsetFinal[idx].y = OSratio*(offsetZoomIn[idx].y ) + offsetInit[idx].y - yoffset;
-}  
+}


 /// determine the final offset value
-/// @param[in] 
+/// @param[in]

-void cuSubPixelOffset(cuArrays<int2> *offsetInit, cuArrays<int2> *offsetZoomIn, cuArrays<float2> *offsetFinal, 
+void cuSubPixelOffset(cuArrays<int2> *offsetInit, cuArrays<int2> *offsetZoomIn, cuArrays<float2> *offsetFinal,
                      int OverSampleRatioZoomin, int OverSampleRatioRaw,
-                      int xHalfRangeInit,  int yHalfRangeInit, 
+                      int xHalfRangeInit,  int yHalfRangeInit,
                      int xHalfRangeZoomIn, int yHalfRangeZoomIn,
                      cudaStream_t stream)
 {
@ -218,14 +218,14 @@ void cuSubPixelOffset(cuArrays<int2> *offsetInit, cuArrays<int2> *offsetZoomIn,
    float xoffset = xHalfRangeInit ;
    float yoffset = yHalfRangeInit ;
    //std::cout << "subpixel" << xoffset << " " << yoffset << " ratio " << OSratio << std::endl;
-    
+
    cuSubPixelOffset_kernel<<<IDIVUP(size, NTHREADS), NTHREADS, 0, stream>>>
-        (offsetInit->devData, offsetZoomIn->devData, 
+        (offsetInit->devData, offsetZoomIn->devData,
         offsetFinal->devData, OSratio, xoffset, yoffset, size);
    getLastCudaError("cuSubPixelOffset_kernel");
    //offsetInit->debuginfo(stream);
    //offsetZoomIn->debuginfo(stream);
-    
+
 }

 static inline __device__ int dev_padStart(const size_t padDim, const size_t imageDim, const size_t maxloc)
@ -236,10 +236,10 @@ static inline __device__ int dev_padStart(const size_t padDim, const size_t imag
    else if(maxloc > imageDim-halfPadSize-1) start = imageDim-padDim-1;
    return start;
 }
-	 	
+
 //cuda kernel for cuda_determineInterpZone
 __global__ void cudaKernel_determineInterpZone(const int2* maxloc, const size_t nImages,
-                                               const size_t imageNX,  const size_t imageNY, 
+                                               const size_t imageNX,  const size_t imageNY,
                                               const size_t padNX, const size_t padNY,  int2* padOffset)
 {
    int imageIndex = threadIdx.x + blockDim.x *blockIdx.x; //image index
@ -247,18 +247,18 @@ __global__ void cudaKernel_determineInterpZone(const int2* maxloc, const size_t
        padOffset[imageIndex].x = dev_padStart(padNX, imageNX, maxloc[imageIndex].x);
        padOffset[imageIndex].y = dev_padStart(padNY, imageNY, maxloc[imageIndex].y);
    }
-} 
+}

 /*
 * determine the interpolation area (pad) from the max location and the padSize
- *    the pad will be (maxloc-padSize/2, maxloc+padSize/2-1)  
- * @param[in] maxloc[nImages]   
- * @param[in] padSize   
- * @param[in] imageSize 
+ *    the pad will be (maxloc-padSize/2, maxloc+padSize/2-1)
+ * @param[in] maxloc[nImages]
+ * @param[in] padSize
+ * @param[in] imageSize
 * @param[in] nImages
 * @param[out] padStart[nImages] return values of maxloc-padSize/2
 */
-void cuDetermineInterpZone(cuArrays<int2> *maxloc, cuArrays<int2> *zoomInOffset, cuArrays<float> *corrOrig, cuArrays<float> *corrZoomIn, cudaStream_t stream) 
+void cuDetermineInterpZone(cuArrays<int2> *maxloc, cuArrays<int2> *zoomInOffset, cuArrays<float> *corrOrig, cuArrays<float> *corrZoomIn, cudaStream_t stream)
 {
 	int threadsperblock=NTHREADS;
 	int blockspergrid=IDIVUP(corrOrig->count, threadsperblock);
@ -267,59 +267,91 @@ void cuDetermineInterpZone(cuArrays<int2> *maxloc, cuArrays<int2> *zoomInOffset,
 }


-static inline __device__ int dev_adjustOffset(const size_t newRange, const size_t oldRange, const size_t maxloc)
+static inline __device__ int2 dev_adjustOffset(
+    const int oldRange, const int newRange, const int maxloc)
 {
-    int maxloc_cor = maxloc;
-    if(maxloc_cor < newRange) {maxloc_cor = oldRange;}
-    else if(maxloc_cor > 2*oldRange-newRange) {maxloc_cor = oldRange;} 
-    int start = maxloc_cor - newRange;
-    return start;
+    // determine the starting point around the maxloc
+    // oldRange is the half search window size, e.g., = 32
+    // newRange is the half extract size, e.g., = 4
+    // maxloc is in range [0, 64]
+    // we want to extract \pm 4 centered at maxloc
+    // Examples:
+    // 1. maxloc = 40: we set start=maxloc-newRange=36, and extract [36,44), shift=0
+    // 2. maxloc = 2, start=-2: we set start=0, shift=-2,
+    //   (shift means the max is -2 from the extracted center 4)
+    // 3. maxloc =64, start=60: set start=56, shift = 4
+    //   (shift means the max is 4 from the extracted center 60).
+
+    // shift the max location by -newRange to find the start
+    int start = maxloc - newRange;
+    // if start is within the range, the max location will be in the center
+    int shift = 0;
+    // right boundary
+    int rbound = 2*(oldRange-newRange);
+    if(start<0)     // if exceeding the limit on the left
+    {
+        // set start at 0 and record the shift of center
+        shift = -start;
+        start = 0;
+    }
+    else if(start > rbound ) // if exceeding the limit on the right
+    {
+        // set start at rbound and record the shift of center
+        shift = start-rbound;
+        start = rbound;
+    }
+    return make_int2(start, shift);
 }

-__global__ void cudaKernel_determineSecondaryExtractOffset(int2 * maxloc, 
+__global__ void cudaKernel_determineSecondaryExtractOffset(int2 * maxLoc, int2 *shift,
    const size_t nImages, int xOldRange, int yOldRange, int xNewRange, int yNewRange)
 {
    int imageIndex = threadIdx.x + blockDim.x *blockIdx.x; //image index
-	if (imageIndex < nImages) 
+	if (imageIndex < nImages)
 	{
-        maxloc[imageIndex].x = dev_adjustOffset(xNewRange, xOldRange, maxloc[imageIndex].x);
-        maxloc[imageIndex].y = dev_adjustOffset(yNewRange, yOldRange, maxloc[imageIndex].y);
+	    // get the starting pixel (stored back to maxloc) and shift
+        int2 result = dev_adjustOffset(xOldRange, xNewRange, maxLoc[imageIndex].x);
+        maxLoc[imageIndex].x = result.x;
+        shift[imageIndex].x = result.y;
+        result = dev_adjustOffset(yOldRange, yNewRange, maxLoc[imageIndex].y);
+        maxLoc[imageIndex].y = result.x;
+        shift[imageIndex].y = result.y;
 	}
 }

 ///@param[in] xOldRange, yOldRange are (half) search ranges in first step
 ///@param[in] x
-void cuDetermineSecondaryExtractOffset(cuArrays<int2> *maxLoc, 
-    int xOldRange, int yOldRange, int xNewRange, int yNewRange, cudaStream_t stream) 
+void cuDetermineSecondaryExtractOffset(cuArrays<int2> *maxLoc, cuArrays<int2> *maxLocShift,
+    int xOldRange, int yOldRange, int xNewRange, int yNewRange, cudaStream_t stream)
 {
 	int threadsperblock=NTHREADS;
 	int blockspergrid=IDIVUP(maxLoc->size, threadsperblock);
 	cudaKernel_determineSecondaryExtractOffset<<<blockspergrid, threadsperblock, 0, stream>>>
-	    (maxLoc->devData, maxLoc->size, xOldRange, yOldRange, xNewRange, yNewRange);
+	    (maxLoc->devData, maxLocShift->devData, maxLoc->size, xOldRange, yOldRange, xNewRange, yNewRange);
 }




-__global__ void cudaKernel_maxlocPlusZoominOffset(float *offset, const int * padStart, const int * maxlocUpSample, 
+__global__ void cudaKernel_maxlocPlusZoominOffset(float *offset, const int * padStart, const int * maxlocUpSample,
        const size_t nImages, float zoomInRatioX, float zoomInRatioY)
 {
 	int imageIndex = threadIdx.x + blockDim.x *blockIdx.x; //image index
-	if (imageIndex < nImages) 
+	if (imageIndex < nImages)
 	{
 		int index=2*imageIndex;
 		offset[index] = padStart[index] + maxlocUpSample[index] * zoomInRatioX;
 		index++;
 		offset[index] = padStart[index] + maxlocUpSample[index] * zoomInRatioY;
 	}
-} 
+}

-void cuda_maxlocPlusZoominOffset(float *offset, const int * padStart, const int * maxlocUpSample, 
+void cuda_maxlocPlusZoominOffset(float *offset, const int * padStart, const int * maxlocUpSample,
        const size_t nImages, float zoomInRatioX, float zoomInRatioY)
 {
 	int threadsperblock=NTHREADS;
 	int blockspergrid = IDIVUP(nImages, threadsperblock);
-	cudaKernel_maxlocPlusZoominOffset<<<blockspergrid,threadsperblock>>>(offset, padStart, maxlocUpSample, 
+	cudaKernel_maxlocPlusZoominOffset<<<blockspergrid,threadsperblock>>>(offset, padStart, maxlocUpSample,
        nImages, zoomInRatioX, zoomInRatioY);
 }
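
The boundary handling in `dev_adjustOffset` above can be checked on the host. The following is a minimal sketch, assuming plain C++ with no CUDA dependency; the `StartShift` struct and the `adjustOffsetHost` and `checkAdjustOffsetExamples` names are hypothetical stand-ins (the device function returns an `int2`). It reproduces the branches exactly as written in the diff, so for the left-edge example (`maxloc = 2`) it yields `shift = 2`, whereas the inline comment lists `-2`.

```cpp
#include <cassert>

// Hypothetical host-side stand-in for the int2 returned by dev_adjustOffset.
struct StartShift {
    int start;  // first pixel of the extracted window
    int shift;  // recorded displacement of the peak from the window center
};

// Mirrors the arithmetic of dev_adjustOffset as written in the diff.
// oldRange: half search window size, newRange: half extract size,
// maxloc: peak location, expected in [0, 2*oldRange].
inline StartShift adjustOffsetHost(int oldRange, int newRange, int maxloc)
{
    int start = maxloc - newRange;                 // center the window on the peak
    int shift = 0;                                 // no shift when the window fits
    const int rbound = 2 * (oldRange - newRange);  // largest valid start
    if (start < 0) {                // window would cross the left edge
        shift = -start;
        start = 0;
    } else if (start > rbound) {    // window would cross the right edge
        shift = start - rbound;
        start = rbound;
    }
    return {start, shift};
}

// Walks through the three cases from the code comment (oldRange=32, newRange=4).
inline void checkAdjustOffsetExamples()
{
    StartShift mid = adjustOffsetHost(32, 4, 40);
    assert(mid.start == 36 && mid.shift == 0);
    StartShift left = adjustOffsetHost(32, 4, 2);
    assert(left.start == 0 && left.shift == 2);
    StartShift right = adjustOffsetHost(32, 4, 64);
    assert(right.start == 56 && right.shift == 4);
}
```

Calling `checkAdjustOffsetExamples()` from a test harness exercises the centered, left-edge, and right-edge cases with the values used in the comment.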

--- a/contrib/PyCuAmpcor/src/cuSincOverSampler.cu
+++ b/contrib/PyCuAmpcor/src/cuSincOverSampler.cu
@ -1,4 +1,4 @@
-/* 
+/*
 * cuSincOverSampler.cu
 */
 #include "cuArrays.h"
@ -8,17 +8,12 @@
 #include "cudaError.h"
 #include "cuAmpcorUtil.h"

-cuSincOverSamplerR2R::cuSincOverSamplerR2R(const int i_intplength_, const int i_covs_, cudaStream_t stream_)
- : i_intplength(i_intplength_), i_covs(i_covs_)
+cuSincOverSamplerR2R::cuSincOverSamplerR2R(const int i_covs_, cudaStream_t stream_)
+ : i_covs(i_covs_)
 {
    setStream(stream_);
-    //i_intplength = int(r_relfiltlen/r_beta);
-    r_relfiltlen = r_beta * i_intplength;
+    i_intplength = int(r_relfiltlen/r_beta+0.5f);
    i_filtercoef = i_intplength*i_decfactor;
-    r_wgthgt = (1.0f - r_pedestal)/2.0f;
-    r_soff = (i_filtercoef)/2.0f;
-    r_soff_inverse = 1.0f/r_soff;
-    r_decfactor_inverse = 1.0f/i_decfactor;
    checkCudaErrors(cudaMalloc((void **)&r_filter, (i_filtercoef+1)*sizeof(float)));
    cuSetupSincKernel();
 }
@ -28,23 +23,22 @@ void cuSincOverSamplerR2R::setStream(cudaStream_t stream_)
    stream = stream_;
 }

-cuSincOverSamplerR2R::~cuSincOverSamplerR2R() 
+cuSincOverSamplerR2R::~cuSincOverSamplerR2R()
 {
    checkCudaErrors(cudaFree(r_filter));
 }


-__global__ void cuSetupSincKernel_kernel(float *r_filter_, const int i_filtercoef_, 
+__global__ void cuSetupSincKernel_kernel(float *r_filter_, const int i_filtercoef_,
    const float r_soff_, const float r_wgthgt_, const int i_weight_,
-    const float r_soff_inverse_, const float r_beta_, const float r_decfactor_inverse_,
-    const float r_relfiltlen_inverse_)
+    const float r_soff_inverse_, const float r_beta_, const float r_decfactor_inverse_)
 {
    int i = threadIdx.x + blockDim.x*blockIdx.x;
    if(i > i_filtercoef_) return;
    float r_wa = i - r_soff_;
    float r_wgt = (1.0f - r_wgthgt_) + r_wgthgt_*cos(PI*r_wa*r_soff_inverse_);
    float r_s = r_wa*r_beta_*r_decfactor_inverse_*PI;
-    float r_fct; 
+    float r_fct;
    if(r_s != 0.0f) {
        r_fct = sin(r_s)/r_s;
    }
@ -57,68 +51,96 @@ __global__ void cuSetupSincKernel_kernel(float *r_filter_, const int i_filtercoe
    else {
        r_filter_[i] = r_fct;
    }
-    //printf("kernel %d %f\n", i, r_filter_[i]);
 }


 void cuSincOverSamplerR2R::cuSetupSincKernel()
 {
    const int nthreads = 128;
-    const int nblocks = IDIVUP(i_filtercoef, nthreads);
-    float r_relfiltlen_inverse = 1.0f/r_relfiltlen; 
+    const int nblocks = IDIVUP(i_filtercoef+1, nthreads);
+
+    // compute some commonly used constants first
+    float r_wgthgt =  (1.0f - r_pedestal)/2.0f;
+    float r_soff = (i_filtercoef-1.0f)/2.0f;
+    float r_soff_inverse = 1.0f/r_soff;
+    float r_decfactor_inverse = 1.0f/i_decfactor;
+
    cuSetupSincKernel_kernel<<<nblocks, nthreads, 0, stream>>> (
-        r_filter, i_filtercoef, r_soff, r_wgthgt, i_weight, 
-        r_soff_inverse, r_beta, r_decfactor_inverse, r_relfiltlen_inverse);
+        r_filter, i_filtercoef, r_soff, r_wgthgt, i_weight,
+        r_soff_inverse, r_beta, r_decfactor_inverse);
    getLastCudaError("cuSetupSincKernel_kernel");
 }

-__global__ void cuSincInterpolation_kernel(const int nImages, 
+__global__ void cuSincInterpolation_kernel(const int nImages,
    const float * imagesIn, const int inNX, const int inNY,
-    float * imagesOut, const int outNX, const int outNY, 
-    const float * r_filter_, const int i_covs_, const int i_decfactor_, const int i_intplength_, 
+    float * imagesOut, const int outNX, const int outNY,
+    int2 *centerShift, int factor,
+    const float * r_filter_, const int i_covs_, const int i_decfactor_, const int i_intplength_,
    const int i_startX, const int i_startY, const int i_int_size)
 {
+    // get image index
    int idxImage = blockIdx.z;
-    int idxX = threadIdx.x + blockDim.x*blockIdx.x; 
+    // get the xy threads for output image pixel indices
+    int idxX = threadIdx.x + blockDim.x*blockIdx.x;
    int idxY = threadIdx.y + blockDim.y*blockIdx.y;
+    // cuda: to make sure extra allocated threads do nothing
    if(idxImage >=nImages || idxX >= i_int_size || idxY >= i_int_size) return;
-    int outx = idxX + i_startX;
-    int outy = idxY + i_startY;
+    // decide the center shift
+    int2 shift = centerShift[idxImage];
+    // determine the output pixel indices
+    int outx = idxX + i_startX + shift.x*factor;
+    if (outx >= outNX) outx-=outNX;
+    int outy = idxY + i_startY +  shift.y*factor;
+    if (outy >= outNY) outy-=outNY;
+    // flattened to 1d
    int idxOut = idxImage*outNX*outNY + outx*outNY + outy;
-    
+
+    // index in input grids
    float r_xout = (float)outx/i_covs_;
+    // integer part
    int i_xout = int(r_xout);
+    // fractional part
    float r_xfrac = r_xout - i_xout;
+    // fractional part in terms of the interpolation kernel grids
    int i_xfrac = int(r_xfrac*i_decfactor_);
-    
+
+    // same procedure for y
    float r_yout = (float)outy/i_covs_;
    int i_yout = int(r_yout);
    float r_yfrac = r_yout - i_yout;
    int i_yfrac = int(r_yfrac*i_decfactor_);
-    
-    float intpData = 0.0f;
-    float r_sincwgt = 0.0f;
-    float r_sinc_coef;
-    
-    for(int i=0; i < inNX; i++) {
-        int i_xindex = i_xout - i + i_intplength_/2;
-        if(i_xindex < 0) i_xindex+= i_intplength_;
-        if(i_xindex >= i_intplength_) i_xindex-=i_intplength_;  
-        float r_xsinc_coef = r_filter_[i_xindex*i_decfactor_+i_xfrac];
-        
-        for(int j=0; j< inNY; j++) {
-            int i_yindex = i_yout - j + i_intplength_/2;
-            if(i_yindex < 0) i_yindex+= i_intplength_;
-            if(i_yindex >= i_intplength_) i_yindex-=i_intplength_;  
-            float r_ysinc_coef = r_filter_[i_yindex*i_decfactor_+i_yfrac];
+
+    // temp variables
+    float intpData = 0.0f; // interpolated value
+    float r_sincwgt = 0.0f; // total filter weight
+    float r_sinc_coef; // filter weight
+
+    // iterate over lines of input image
+    // i=0 -> -i_intplength/2
+    for(int i=0; i < i_intplength_; i++) {
+        // find the corresponding pixel in input(unsampled) image
+
+        int inx = i_xout - i + i_intplength_/2;
+
+        if(inx < 0) inx+= inNX;
+        if(inx >= inNX) inx-= inNX;
+
+        float r_xsinc_coef = r_filter_[i*i_decfactor_+i_xfrac];
+
+        for(int j=0; j< i_intplength_; j++) {
+            // find the corresponding pixel in input(unsampled) image
+            int iny = i_yout - j + i_intplength_/2;
+            if(iny < 0) iny += inNY;
+            if(iny >= inNY) iny -= inNY;
+
+            float r_ysinc_coef = r_filter_[j*i_decfactor_+i_yfrac];
+            // multiply the factors from xy
            r_sinc_coef = r_xsinc_coef*r_ysinc_coef;
+            // add to total sinc weight
            r_sincwgt += r_sinc_coef;
-            intpData += imagesIn[idxImage*inNX*inNY+i*inNY+j]*r_sinc_coef;
-            /*
-              if(outx == 0 && outy == 1) {
-                printf("intp kernel %d %d %d %d %d %d %d %f\n", i, j, i_xindex, i_yindex, i_xindex*i_decfactor_+i_xfrac,
-                   i_yindex*i_decfactor_+i_yfrac, idxImage*inNX*inNY+i*inNY+j, r_sinc_coef);
-              }*/
+            // multiply by the original signal and add to results
+            intpData += imagesIn[idxImage*inNX*inNY+inx*inNY+iny]*r_sinc_coef;
+
        }
    }
    imagesOut[idxOut] = intpData/r_sincwgt;
@ -126,27 +148,31 @@ __global__ void cuSincInterpolation_kernel(const int nImages,
 }


-void cuSincOverSamplerR2R::execute(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut)
+void cuSincOverSamplerR2R::execute(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut,
+    cuArrays<int2> *centerShift, int oversamplingFactor)
 {
    const int nImages = imagesIn->count;
    const int inNX = imagesIn->height;
    const int inNY = imagesIn->width;
-    const int outNX = imagesOut->height; 
+    const int outNX = imagesOut->height;
    const int outNY = imagesOut->width;
-    
-    const int i_int_range = i_sincwindow * i_covs; 
+
+    // only compute the oversampled signals within a window
+    const int i_int_range = i_sincwindow * i_covs;
+    // set the start pixel, will be shifted by centerShift*oversamplingFactor (from raw image)
    const int i_int_startX = outNX/2 - i_int_range;
    const int i_int_startY = outNY/2 - i_int_range;
    const int i_int_size = 2*i_int_range + 1;
-      
+    // preset all pixels in out image to 0
    imagesOut->setZero(stream);
-    
+
    static const int nthreads = 16;
    dim3 threadsperblock(nthreads, nthreads, 1);
    dim3 blockspergrid (IDIVUP(i_int_size, nthreads), IDIVUP(i_int_size, nthreads), nImages);
-    cuSincInterpolation_kernel<<<blockspergrid, threadsperblock, 0, stream>>>(nImages, 
+    cuSincInterpolation_kernel<<<blockspergrid, threadsperblock, 0, stream>>>(nImages,
        imagesIn->devData, inNX, inNY,
        imagesOut->devData, outNX, outNY,
+        centerShift->devData, oversamplingFactor,
        r_filter, i_covs, i_decfactor, i_intplength, i_int_startX, i_int_startY, i_int_size);
    getLastCudaError("cuSincInterpolation_kernel");
 }
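
The index arithmetic in `cuSincInterpolation_kernel` splits each oversampled output index into an input-grid pixel and a sub-pixel slot in the tabulated sinc kernel. Below is a minimal host-side sketch of that step, assuming non-negative output indices; `KernelIndex`, `splitOutputIndex`, `filterTableIndex`, and `checkSplitExamples` are hypothetical names introduced for illustration, while the arithmetic follows the kernel lines in the diff.

```cpp
#include <cassert>

// Hypothetical container for the two parts of an output index.
struct KernelIndex {
    int whole;  // pixel index on the input (unsampled) grid
    int frac;   // sub-pixel position in units of 1/i_decfactor
};

// Same steps as the kernel: divide by the oversampling factor, take the
// integer part, then scale the fractional part to the kernel table grid.
inline KernelIndex splitOutputIndex(int out, int i_covs, int i_decfactor)
{
    float r_out = static_cast<float>(out) / i_covs;
    int i_out = static_cast<int>(r_out);            // truncation; out >= 0 assumed
    float r_frac = r_out - i_out;
    int i_frac = static_cast<int>(r_frac * i_decfactor);
    return {i_out, i_frac};
}

// Position in the filter table for a given tap, as in
// r_filter_[i*i_decfactor_ + i_xfrac].
inline int filterTableIndex(int tap, int i_frac, int i_decfactor)
{
    return tap * i_decfactor + i_frac;
}

// Example: output pixel 5 at oversampling factor 2 lies halfway
// between input pixels 2 and 3.
inline void checkSplitExamples()
{
    KernelIndex k = splitOutputIndex(5, 2, 4096);
    assert(k.whole == 2 && k.frac == 2048);
    assert(filterTableIndex(3, k.frac, 4096) == 14336);
}
```

With `i_decfactor = 4096`, each filter tap owns a block of 4096 table entries, and the fractional index selects the entry within that block.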
--- a/contrib/PyCuAmpcor/src/cuSincOverSampler.h
+++ b/contrib/PyCuAmpcor/src/cuSincOverSampler.h
@ -1,11 +1,11 @@
-/* 
- * cuSincOverSampler.h 
+/*
+ * cuSincOverSampler.h
 * oversampling with sinc interpolation method
 */

 #ifndef __CUSINCOVERSAMPLER_H
 #define __CUSINCOVERSAMPLER_H
- 
+
 #include "cuArrays.h"
 #include "cudaUtil.h"

@ -15,29 +15,33 @@ class cuSincOverSamplerR2R
 {
 private:
    static const int i_sincwindow = 2;
-    static const int i_decfactor = 4096; // division between orignal pixels
+    // the oversampling is only performed within \pm i_sincwindow*i_covs around the peak
    static const int i_weight = 1;       // weight for cos() pedestal
-    const float r_pedestal = 0.0f; // height of pedestal  
-    const float r_beta = 0.75f;     // factor r_relfiltlen/i_intplength 
-    
-    int i_covs;
-    int i_intplength;
-    float r_relfiltlen;  
-    int i_filtercoef;
-    float r_wgthgt;
-    float r_soff;
-    float r_soff_inverse;
-    float r_decfactor_inverse;
-    
+
+    const float r_pedestal = 0.0f;       // height of pedestal
+    const float r_beta = 0.75f;          // a low-band pass
+    const float r_relfiltlen = 6.0f;     // relative filter length
+
+    static const int i_decfactor = 4096; // subdivisions between original grid points used to tabulate the sinc kernel
+
+    int i_covs;         // oversampling factor
+    int i_intplength;   // actual filter length
+    int i_filtercoef;   // sinc kernel length, i_intplength*i_decfactor
+
+    float * r_filter;   // sinc kernel table with i_filtercoef+1 entries
+
    cudaStream_t stream;
-    float * r_filter; 
-    
+
 public:
-    cuSincOverSamplerR2R(const int i_intplength_, const int i_covs_, cudaStream_t stream_);
+    // constructor
+    cuSincOverSamplerR2R(const int i_covs_, cudaStream_t stream_);
+    // local methods
    void setStream(cudaStream_t stream_);
    void cuSetupSincKernel();
-    void execute(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut);	
-    ~cuSincOverSamplerR2R(); 
+    // call interface
+    void execute(cuArrays<float> *imagesIn, cuArrays<float> *imagesOut, cuArrays<int2> *center, int oversamplingFactor);
+
+    ~cuSincOverSamplerR2R();
 };

 #endif // _CUSINCOVERSAMPLER_H
--- a/contrib/PyCuAmpcor/src/cudaError.h
+++ b/contrib/PyCuAmpcor/src/cudaError.h
@ -1,6 +1,6 @@
 /**
 * cudaError.h
- * Purpose: check various errors in cuda/cufft/cublas calls
+ * Purpose: check various errors in cuda/cufft calls
 * Lijun Zhu
 * Last modified 09/07/2017
 **/
@ -65,7 +65,6 @@ inline void __getLastCudaError(const char *errorMessage, const char *file, const
 // This will output the proper CUDA error strings in the event that a CUDA host call returns an error
 #define checkCudaErrors(val)  check ( (val), #val, __FILE__, __LINE__ )
 #define cufft_Error(val)     check ( (val), #val, __FILE__, __LINE__ )
-#define cublas_Error(val)    check ( (val), #val, __FILE__, __LINE__ )
 #define getLastCudaError(var)   __getLastCudaError (var, __FILE__, __LINE__)

 #endif //__CUDAERROR_CUH
--- a/contrib/PyCuAmpcor/src/debug.h
+++ b/contrib/PyCuAmpcor/src/debug.h
@ -1,17 +1,17 @@
 #ifndef __DEBUG_H
 #define __DEBUG_H

-#pragma once
-
 #include <iostream>
 #include <assert.h>
+#include <stdio.h>

-#define _DEBUG_ 1
+#ifndef NDEBUG
+#define CUAMPCOR_DEBUG
+#define debugmsg(msg) fprintf(stderr, msg)
+#else
+#define debugmsg(msg)
+#endif //NDEBUG

 #define CUDA_ERROR_CHECK

-#define debugmsg(msg) if(_DEBUG_) fprintf(stderr, msg)
-
-//__CUDA_ARCH__
-
-#endif 
+#endif //__DEBUG_H
--- a/contrib/PyCuAmpcor/src/setup.py
+++ b/contrib/PyCuAmpcor/src/setup.py
@ -22,6 +22,6 @@ setup(  name = 'PyCuAmpcor',
                       'cuSincOverSampler.o', 'cuDeramp.o','cuAmpcorController.o','cuEstimateStats.o'],
        extra_link_args=['-L/usr/local/cuda/lib64',
                        '-L/usr/lib64/nvidia',
-                        '-lcuda','-lcudart','-lcufft','-lcublas','-lgdal'], # REPLACE FIRST PATH WITH YOUR PATH TO YOUR CUDA LIBRARIES
+                        '-lcuda','-lcudart','-lcufft','-lgdal'], # REPLACE FIRST PATH WITH YOUR PATH TO YOUR CUDA LIBRARIES
        language='c++'
    )))