---
title: 'Auto-annotation API'
linkTitle: 'Auto-annotation API'
weight: 6
---

## Overview

This layer provides functionality that allows you to automate the process
of annotating a CVAT dataset by delegating this process (or parts of it)
to a program running on a machine under your control.

To make use of this delegation, you must implement an "auto-annotation function",
or "AA function" for short. This is a Python object that implements one of the
protocols defined by this layer. The particular protocol implemented determines
which part of the annotation process the AA function is able to automate.

An AA function may be used in one of the following modes:

- Immediate mode.
  This involves annotating a specific CVAT task by passing the AA function to a driver,
  along with the identifier of the task and optional additional parameters.
  This may be done either:

  - programmatically (consult the "Auto-annotation driver" section below); or

  - via the CVAT CLI (consult the description of the `task auto-annotate` command
    in {{< ilink "/docs/api_sdk/cli" "the CLI documentation" >}}).

- Agent mode.
  This involves registering the AA function with the CVAT server
  (creating a resource on the server known as a "native function")
  and then running one or more agent processes.
  This makes the AA function usable from the CVAT UI.

  CVAT users can choose to use the native function as the model when using CVAT's AI tools.
  When they do, the agents detect this and process their requests
  by calling appropriate methods on the corresponding AA function.

  Depending on how you create the native function,
  it will be accessible either only to you or to your whole organization.

  For more details, consult the descriptions of the `function create-native`
  and `function run-agent` commands
  in {{< ilink "/docs/api_sdk/cli" "the CLI documentation" >}}.

This SDK layer can be divided into several parts:

- The interface, containing the protocols that an AA function must implement,
  as well as helpers for use by such functions.
  Consult the "Auto-annotation interface" section below.

- The driver, containing functionality to annotate a CVAT dataset using an AA function.
  Consult the "Auto-annotation driver" section below.

- Predefined AA functions based on torchvision.
  Consult the "Predefined AA functions" section below.

## Example

An AA function may be implemented in any way that is appropriate for your use case.
However, a typical AA function will be based on a machine learning model
and consist of the following basic elements:

- Code to load the ML model.

- A specification defining which protocol the AA function implements,
  as well as static properties of the AA function
  (such as a description of the annotations that the AA function can produce).

- Code to convert data from SDK data structures to a format the ML model can understand.

- Code to run the ML model.

- Code to convert resulting annotations to SDK data structures.

The following code snippet shows an example AA function implementation
(specifically, a detection function), as well as code that creates an instance
of the function and uses it for auto-annotation.
```python
import PIL.Image
import torchvision.models

from cvat_sdk import make_client
import cvat_sdk.models as models
import cvat_sdk.auto_annotation as cvataa


class TorchvisionDetectionFunction:
    def __init__(self, model_name: str, weights_name: str, **kwargs) -> None:
        # load the ML model
        weights_enum = torchvision.models.get_model_weights(model_name)
        self._weights = weights_enum[weights_name]
        self._transforms = self._weights.transforms()
        self._model = torchvision.models.get_model(model_name, weights=self._weights, **kwargs)
        self._model.eval()

    @property
    def spec(self) -> cvataa.DetectionFunctionSpec:
        # describe the annotations
        return cvataa.DetectionFunctionSpec(
            labels=[
                cvataa.label_spec(cat, i, type="rectangle")
                for i, cat in enumerate(self._weights.meta["categories"])
                if cat != "N/A"
            ]
        )

    def detect(
        self, context: cvataa.DetectionFunctionContext, image: PIL.Image.Image
    ) -> list[models.LabeledShapeRequest]:
        # determine the threshold for filtering results
        conf_threshold = context.conf_threshold or 0

        # convert the input into a form the model can understand
        transformed_image = [self._transforms(image)]

        # run the ML model
        results = self._model(transformed_image)

        # convert the results into the form the SDK requires
        return [
            cvataa.rectangle(label.item(), [x.item() for x in box])
            for result in results
            for box, label, score in zip(result["boxes"], result["labels"], result["scores"])
            if score >= conf_threshold
        ]


# log into the CVAT server
with make_client(host="http://localhost", credentials=("user", "password")) as client:
    # create a function that uses Faster R-CNN
    func = TorchvisionDetectionFunction("fasterrcnn_resnet50_fpn_v2", "DEFAULT", box_score_thresh=0.5)

    # annotate task 12345 using the function
    cvataa.annotate_task(client, 12345, func)
```

## Auto-annotation interface

This part of the auto-annotation layer defines the protocols that an AA function must implement.

### Detection function protocol

A detection function is a type of AA function that accepts an image
and returns a list of shapes found in that image.

A detection function can be used in the following ways:

- In immediate mode, the AA function is run for every image in a given CVAT task,
  and the resulting lists of shapes are combined and uploaded to CVAT.

- In agent mode, the AA function can be used from the CVAT UI to annotate
  either a complete task (similarly to immediate mode) or a single frame in a task.

A detection function must have two attributes, `spec` and `detect`.

`spec` must contain the AA function's specification,
which is an instance of `DetectionFunctionSpec`.

`DetectionFunctionSpec` must be initialized with a sequence of `PatchedLabelRequest` objects
that represent the labels that the AA function knows about.
See the docstring of `DetectionFunctionSpec` for more information
on the constraints that these objects must follow.
A `BadFunctionError` will be raised if any constraint violations are detected.

`detect` must be a function/method accepting two parameters:

- `context` (`DetectionFunctionContext`).
  Contains invocation parameters and information about the current image.
  The following fields are available:

  - `frame_name` (`str`). The file name of the frame on the CVAT server.

  - `conf_threshold` (`float | None`). The confidence threshold that the function
    should use to filter objects. If `None`, the function may apply
    a default threshold at its discretion.

- `image` (`PIL.Image.Image`). Contains image data.
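For illustration, here is a minimal sketch of a detection function that uses the
`conf_threshold` field described above. It does not run any model; it simply reports
one fixed rectangle per frame, and all names other than the protocol attributes
(`spec`, `detect`) are illustrative.

```python
import PIL.Image

import cvat_sdk.auto_annotation as cvataa
import cvat_sdk.models as models


class FixedBoxFunction:
    # A single "object" label that detections will refer to by ID 0.
    spec = cvataa.DetectionFunctionSpec(
        labels=[cvataa.label_spec("object", 0, type="rectangle")],
    )

    def detect(
        self, context: cvataa.DetectionFunctionContext, image: PIL.Image.Image
    ) -> list[models.LabeledShapeRequest]:
        # Fall back to a default threshold when the caller doesn't supply one.
        conf_threshold = context.conf_threshold if context.conf_threshold is not None else 0.5

        # A real function would run a model here; this sketch pretends it found
        # one box covering the top-left quarter of the image, with score 0.9.
        score = 0.9
        if score < conf_threshold:
            return []

        return [cvataa.rectangle(0, [0, 0, image.width / 2, image.height / 2])]
```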
`detect` must return a sequence of `LabeledImageRequest` and/or `LabeledShapeRequest` objects,
representing tags/shapes found in the image.
See the docstring of `DetectionFunctionSpec` for more information
on the constraints that these objects must follow.

The same AA function may be used with any dataset that contains labels
with the same names as those in the AA function's specification.
The way it works is that the driver matches labels between the spec and the dataset,
and replaces the label IDs in the tag and shape objects
with those defined in the dataset.

For example, suppose the AA function's spec defines the following labels:

| Name  | ID |
| ----- | -- |
| `bat` | 0  |
| `rat` | 1  |

And the dataset defines the following labels:

| Name  | ID  |
| ----- | --- |
| `bat` | 100 |
| `cat` | 101 |
| `rat` | 102 |

Then suppose `detect` returns a shape with `label_id` equal to 1.
The driver will see that it refers to the `rat` label, and replace it with 102,
since that's the ID this label has in the dataset.

The same logic is used for sublabel and attribute IDs.

#### Helper factory functions

The CVAT API model types used in the detection function protocol are
somewhat unwieldy to work with, so it's recommended to use
the helper factory functions provided by this layer.
These helpers instantiate an object of their corresponding model type,
passing their arguments to the model constructor
and sometimes setting some attributes to fixed values.

The following helpers are available for building specifications:

| Name                      | Model type            | Fixed attributes                                       |
| ------------------------- | --------------------- | ------------------------------------------------------ |
| `label_spec`              | `PatchedLabelRequest` | -                                                      |
| `skeleton_label_spec`     | `PatchedLabelRequest` | `type="skeleton"`                                      |
| `keypoint_spec`           | `SublabelRequest`     | `type="points"`                                        |
| `attribute_spec`          | `AttributeRequest`    | `mutable=False`                                        |
| `checkbox_attribute_spec` | `AttributeRequest`    | `mutable=False`, `input_type="checkbox"`, `values=[]`  |
| `number_attribute_spec`   | `AttributeRequest`    | `mutable=False`, `input_type="number"`                 |
| `radio_attribute_spec`    | `AttributeRequest`    | `mutable=False`, `input_type="radio"`                  |
| `select_attribute_spec`   | `AttributeRequest`    | `mutable=False`, `input_type="select"`                 |
| `text_attribute_spec`     | `AttributeRequest`    | `mutable=False`, `input_type="text"`, `values=[]`      |

For `number_attribute_spec`, it's recommended to use
the `cvat_sdk.attributes.number_attribute_values` function to create the `values` argument,
since this function will enforce the constraints expected
for attribute specs of this type. For example:

```python
cvataa.number_attribute_spec("size", 1, number_attribute_values(0, 10))
```

The following helpers are available for use in `detect`:

| Name        | Model type               | Fixed attributes              |
| ----------- | ------------------------ | ----------------------------- |
| `tag`       | `LabeledImageRequest`    | `frame=0`                     |
| `shape`     | `LabeledShapeRequest`    | `frame=0`                     |
| `mask`      | `LabeledShapeRequest`    | `frame=0`, `type="mask"`      |
| `polygon`   | `LabeledShapeRequest`    | `frame=0`, `type="polygon"`   |
| `rectangle` | `LabeledShapeRequest`    | `frame=0`, `type="rectangle"` |
| `skeleton`  | `LabeledShapeRequest`    | `frame=0`, `type="skeleton"`  |
| `keypoint`  | `SubLabeledShapeRequest` | `frame=0`, `type="points"`    |
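For example, a skeleton label and a matching skeleton shape might be built with these
helpers roughly as follows. This is an illustrative sketch: the label names, IDs, and
coordinates are placeholders, and it assumes that `skeleton_label_spec` takes its
sublabels as the third argument and that `skeleton` takes its elements via the
`elements` keyword.

```python
# In the spec: a "cat" skeleton label with two keypoint sublabels.
cvataa.skeleton_label_spec(
    "cat", 0,
    [
        cvataa.keypoint_spec("head", 1),
        cvataa.keypoint_spec("tail", 2),
    ],
)

# In detect: a skeleton shape whose keypoints refer to the sublabel IDs above.
cvataa.skeleton(
    0,
    elements=[
        cvataa.keypoint(1, [10.0, 20.0]),
        cvataa.keypoint(2, [30.0, 40.0]),
    ],
)
```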
For `mask`, it is recommended to create the points list
using the `cvat_sdk.masks.encode_mask` function,
which will convert a bitmap into a list in the format that CVAT expects.
For example:

```python
cvataa.mask(my_label, encode_mask(
    my_mask,  # boolean 2D array, same size as the input image
    [x1, y1, x2, y2],  # top left and bottom right coordinates of the mask
))
```

To create shapes with attributes,
it's recommended to use the `cvat_sdk.attributes.attribute_vals_from_dict` function,
which returns a list of objects that can be passed to an `attributes` argument:

```python
cvataa.rectangle(
    my_label, [x1, y1, x2, y2],
    attributes=attribute_vals_from_dict({my_attr1: val1, my_attr2: val2}),
)
```

### Tracking function protocol

A tracking function is a type of AA function that analyzes an image
with one or more shapes on it,
and then predicts the positions of those shapes on subsequent images.

A tracking function can only be used in agent mode.
When used with a tracking function, an agent will use it
to process requests from the AI tracking tools in the CVAT UI.

{{% alert title="Warning" color="warning" %}}
Currently, only one agent should be run for each tracking function.
If multiple agents for one tracking function are run at the same time,
CVAT users may experience intermittent "Tracking state not found" errors
when using the function.
{{% /alert %}}

A tracking function must have three attributes: `spec`, `init_tracking_state`, and `track`.
It may also optionally have a `preprocess_image` attribute.

`spec` must contain the AA function's specification,
which is an instance of `TrackingFunctionSpec`.
This specification must be initialized with a single `supported_shape_types` parameter,
defining which types of shapes the AA function is able to track. For example:

```python
spec = cvataa.TrackingFunctionSpec(supported_shape_types=["rectangle"])
```

`init_tracking_state` must be a function accepting the following parameters:

- `context` (`TrackingFunctionShapeContext`).
  An object with information about the shape being tracked. See details below.

- `pp_image` (type varies). A preprocessed image.
  Consult the description of `preprocess_image` for more details.

- `shape` (`TrackableShape`). A shape within the preprocessed image.
  `TrackableShape` is a minimal version of the `LabeledShape` SDK model,
  containing only the `type` and `points` fields.
  The shape's `type` is guaranteed to be one of the types
  listed in the `supported_shape_types` field of the spec.

`init_tracking_state` must analyze the shape and create a state object
containing any information that the AA function will need
to predict its location on a subsequent image.
It must then return this object.
`init_tracking_state` must not modify either `pp_image` or `shape`.

`track` must be a function accepting the following parameters:

- `context` (`TrackingFunctionShapeContext`).
  An object with information about the shape being tracked. See details below.

- `pp_image` (type varies). A preprocessed image.
  Consult the description of `preprocess_image` for more details.
  This image will have the same dimensions as those of the image
  used to create the `state` object.

- `state` (type varies). The object returned by a previous call to `init_tracking_state`.

`track` must locate the shape that was used to create the `state` object
on the new preprocessed image.
If it is able to do that, it must return its prediction as a new `TrackableShape` object.
This object must have the same value of `type` as the original shape.
If `track` is unable to locate the shape, it must return `None`.

`track` may modify `state` as needed to improve prediction accuracy on subsequent frames.
It must not modify `pp_image`.
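To illustrate the shape of the protocol, here is a minimal sketch of a tracking function.
It is not a useful tracker: it simply predicts that every shape stays at its original
position. It assumes that `TrackableShape` can be imported from `cvat_sdk.auto_annotation`
and constructed with `type` and `points` keyword arguments.

```python
import cvat_sdk.auto_annotation as cvataa


class StaticTrackingFunction:
    # Declare which shape types this function is able to track.
    spec = cvataa.TrackingFunctionSpec(supported_shape_types=["rectangle"])

    def init_tracking_state(self, context, pp_image, shape):
        # Remember the original shape. A real tracker would extract image
        # features around the shape here instead.
        return {"type": shape.type, "points": list(shape.points)}

    def track(self, context, pp_image, state):
        # Predict that the shape stays exactly where it was.
        # TrackableShape construction with keyword arguments is assumed here.
        return cvataa.TrackableShape(type=state["type"], points=state["points"])
```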
A `TrackingFunctionShapeContext` object passed to both `init_tracking_state` and `track`
will have the following field:

- `original_shape_type` (`str`). The type of the shape being tracked.
  In `init_tracking_state`, this is the same as `shape.type`.
  In `track`, this is the type of the shape that `state` was created from.

`preprocess_image`, if implemented, must accept the following parameters:

- `context` (`TrackingFunctionContext`).
  This is currently a dummy object and should be ignored.
  In future versions, this may contain additional information.

- `image` (`PIL.Image.Image`). An image that will be used to either start or continue tracking.

`preprocess_image` must perform any analysis on the image
that the function can perform independently of the shapes being tracked,
and return an object representing the results of that analysis.
This object will be passed as `pp_image` to `init_tracking_state` and `track`.

If `preprocess_image` is not implemented, then the `pp_image` object will be the original image.
In other words, the default implementation is:

```python
def preprocess_image(context, image):
    return image
```

## Auto-annotation driver

The `annotate_task` function uses a detection function to annotate a CVAT task.
It must be called as follows:

```python
annotate_task(<client>, <task ID>, <AA function>, <optional arguments...>)
```

The supplied client will be used to make all API calls.

By default, new annotations will be appended to the old ones.
Use `clear_existing=True` to remove old annotations instead.

If a detection function declares a label that has no matching label in the task,
then by default, a `BadFunctionError` is raised, and auto-annotation is aborted.
If you use `allow_unmatched_label=True`, then such labels will be ignored,
and any shapes referring to them will be dropped.
The same logic applies to sublabels and attributes.

It's possible to pass a custom confidence threshold to the function
via the `conf_threshold` parameter.

`annotate_task` will raise a `BadFunctionError` exception
if it detects that the function violated the detection function protocol.

## Predefined AA functions

This layer includes several predefined detection functions.
You can use them as-is, or as a base on which to build your own.

These AA functions use models from the
[torchvision](https://pytorch.org/vision/stable/index.html) library.
To use them, install CVAT SDK with the `pytorch` extra:

```
$ pip install "cvat-sdk[pytorch]"
```

Each function is implemented as a dedicated module
to allow usage via the CLI `auto-annotate` command.

Usage from Python:

```python
from cvat_sdk.auto_annotation.functions.torchvision_<...> import create as create_torchvision

annotate_task(<client>, <task ID>, create_torchvision(<model name>, ...))
```

Usage from the CLI:

```bash
cvat-cli auto-annotate "<task ID>" \
    --function-module "cvat_sdk.auto_annotation.functions.torchvision_<...>" \
    -p model_name=str:"<model name>" ...
```

The `create` function in each module accepts the following parameters:

- `model_name` (`str`) - the name of the model, such as `fasterrcnn_resnet50_fpn_v2`.
  This parameter is required.
- `weights_name` (`str`) - the name of a weights enum value for the model,
  such as `COCO_V1`. Defaults to `DEFAULT`.

It also accepts arbitrary additional parameters,
which are passed directly to the model constructor.
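For example, a task could be annotated with the predefined detection function like this
(a sketch; the host, credentials, task ID, and model settings are placeholders,
and `box_score_thresh` is a torchvision Faster R-CNN constructor parameter
passed through to the model):

```python
from cvat_sdk import make_client
from cvat_sdk.auto_annotation import annotate_task
from cvat_sdk.auto_annotation.functions.torchvision_detection import create as create_torchvision

with make_client(host="http://localhost", credentials=("user", "password")) as client:
    # Annotate task 12345 with Faster R-CNN, keeping only confident detections.
    annotate_task(
        client,
        12345,
        create_torchvision("fasterrcnn_resnet50_fpn_v2", box_score_thresh=0.5),
    )
```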
The following sections describe each available function.

### `cvat_sdk.auto_annotation.functions.torchvision_classification`

This AA function uses torchvision's classification models. It produces tag annotations.

For each frame, the function will output one tag whose label has the highest probability,
as long as that probability is greater than or equal to the input confidence threshold.
If it is lower, the function will output nothing.

### `cvat_sdk.auto_annotation.functions.torchvision_detection`

This AA function uses torchvision's object detection models. It produces rectangle annotations.

### `cvat_sdk.auto_annotation.functions.torchvision_instance_segmentation`

This AA function uses torchvision's instance segmentation models.
It produces mask or polygon annotations (depending on the value of `conv_mask_to_poly`).

### `cvat_sdk.auto_annotation.functions.torchvision_keypoint_detection`

This AA function uses torchvision's keypoint detection models. It produces skeleton annotations.
Keypoints which the model marks as invisible will be marked as occluded in CVAT.

### SAM2 tracking function

For users who want to implement SAM2-based tracking, the CVAT repository includes
a ready-to-use SAM2 tracking function at `ai-models/tracker/sam2/func.py`.
This function implements the tracking function protocol described above
and can be used with the CLI commands for creating native functions and running agents.
For detailed setup and usage instructions, see the
{{< ilink "/docs/enterprise/segment-anything-2-tracker" "SAM2 Tracker documentation" >}}.