---
title: 'Auto-annotation API'
linkTitle: 'Auto-annotation API'
weight: 6
---

## Overview

This layer provides functionality that allows you to automate the process
of annotating a CVAT dataset by delegating this process (or parts of it)
to a program running on a machine under your control.

To make use of this delegation, you must implement an "auto-annotation function",
or "AA function" for short. This is a Python object that implements one of the
protocols defined by this layer. The particular protocol implemented determines
which part of the annotation process the AA function is able to automate.

An AA function may be used in one of the following modes:

- Immediate mode.
  This involves annotating a specific CVAT task by passing the AA function to a driver,
  along with the identifier of the task and optional additional parameters.
  This may be done either:

  - programmatically (consult the "Auto-annotation driver" section below); or

  - via the CVAT CLI (consult the description of the `task auto-annotate` command
    in {{< ilink "/docs/api_sdk/cli" "the CLI documentation" >}}).

- Agent mode.
  This involves registering the AA function with the CVAT server
  (creating a resource on the server known as a "native function")
  and then running one or more agent processes.
  This makes the AA function usable from the CVAT UI.

  CVAT users can choose to use the native function as the model when using CVAT's AI tools.
  When they do, the agents detect this and process their requests
  by calling appropriate methods on the corresponding AA function.

  Depending on how you create the native function,
  it will be accessible either only to you or to your whole organization.

  For more details, consult the descriptions of the `function create-native`
  and `function run-agent` commands
  in {{< ilink "/docs/api_sdk/cli" "the CLI documentation" >}}.

This SDK layer can be divided into several parts:

- The interface, containing the protocols that an AA function must implement,
  as well as helpers for use by such functions.
  Consult the "Auto-annotation interface" section below.

- The driver, containing functionality to annotate a CVAT dataset using an AA function.
  Consult the "Auto-annotation driver" section below.

- Predefined AA functions based on torchvision.
  Consult the "Predefined AA functions" section below.

## Example

An AA function may be implemented in any way that is appropriate for your use case.
However, a typical AA function will be based on a machine learning model
and consist of the following basic elements:

- Code to load the ML model.

- A specification defining which protocol the AA function implements,
  as well as static properties of the AA function
  (such as a description of the annotations that the AA function can produce).

- Code to convert data from SDK data structures to a format the ML model can understand.

- Code to run the ML model.

- Code to convert resulting annotations to SDK data structures.

The following code snippet shows an example AA function implementation
(specifically, a detection function), as well as code that creates an instance
of the function and uses it for auto-annotation.
```python
import PIL.Image
import torchvision.models

from cvat_sdk import make_client
import cvat_sdk.models as models
import cvat_sdk.auto_annotation as cvataa


class TorchvisionDetectionFunction:
    def __init__(self, model_name: str, weights_name: str, **kwargs) -> None:
        # load the ML model
        weights_enum = torchvision.models.get_model_weights(model_name)
        self._weights = weights_enum[weights_name]
        self._transforms = self._weights.transforms()
        self._model = torchvision.models.get_model(model_name, weights=self._weights, **kwargs)
        self._model.eval()

    @property
    def spec(self) -> cvataa.DetectionFunctionSpec:
        # describe the annotations
        return cvataa.DetectionFunctionSpec(
            labels=[
                cvataa.label_spec(cat, i, type="rectangle")
                for i, cat in enumerate(self._weights.meta["categories"])
                if cat != "N/A"
            ]
        )

    def detect(
        self, context: cvataa.DetectionFunctionContext, image: PIL.Image.Image
    ) -> list[models.LabeledShapeRequest]:
        # determine the threshold for filtering results
        conf_threshold = context.conf_threshold or 0

        # convert the input into a form the model can understand
        transformed_image = [self._transforms(image)]

        # run the ML model
        results = self._model(transformed_image)

        # convert the results into the form the SDK requires
        return [
            cvataa.rectangle(label.item(), [x.item() for x in box])
            for result in results
            for box, label, score in zip(result["boxes"], result["labels"], result["scores"])
            if score >= conf_threshold
        ]


# log into the CVAT server
with make_client(host="http://localhost", credentials=("user", "password")) as client:
    # create a function that uses Faster R-CNN
    func = TorchvisionDetectionFunction("fasterrcnn_resnet50_fpn_v2", "DEFAULT", box_score_thresh=0.5)

    # annotate task 12345 using the function
    cvataa.annotate_task(client, 12345, func)
```

## Auto-annotation interface

This part of the auto-annotation layer defines the protocols that an AA function must implement.

### Detection function protocol

A detection function is a type of AA function that accepts an image
and returns a list of shapes found in that image.

A detection function can be used in the following ways:

- In immediate mode, the AA function is run for every image in a given CVAT task,
  and the resulting lists of shapes are combined and uploaded to CVAT.

- In agent mode, the AA function can be used from the CVAT UI to annotate
  either a complete task (similarly to immediate mode) or a single frame in a task.

A detection function must have two attributes, `spec` and `detect`.

`spec` must contain the AA function's specification,
which is an instance of `DetectionFunctionSpec`.

`DetectionFunctionSpec` must be initialized with a sequence of `PatchedLabelRequest` objects
that represent the labels that the AA function knows about.
See the docstring of `DetectionFunctionSpec` for more information
on the constraints that these objects must follow.
A `BadFunctionError` will be raised if any constraint violations are detected.

`detect` must be a function/method accepting two parameters:

- `context` (`DetectionFunctionContext`).
  Contains invocation parameters and information about the current image.
  The following fields are available:

  - `frame_name` (`str`). The file name of the frame on the CVAT server.

  - `conf_threshold` (`float | None`). The confidence threshold that the function
    should use to filter objects. If `None`, the function may apply
    a default threshold at its discretion.

- `image` (`PIL.Image.Image`). Contains image data.
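For illustration, here is a minimal sketch of a detection function that uses the
`conf_threshold` field described above. It does not run any model; it simply reports
one fixed rectangle per frame, and all names other than the protocol attributes
(`spec`, `detect`) are illustrative.

```python
import PIL.Image

import cvat_sdk.auto_annotation as cvataa
import cvat_sdk.models as models


class FixedBoxFunction:
    # A single "object" label that detections will refer to by ID 0.
    spec = cvataa.DetectionFunctionSpec(
        labels=[cvataa.label_spec("object", 0, type="rectangle")],
    )

    def detect(
        self, context: cvataa.DetectionFunctionContext, image: PIL.Image.Image
    ) -> list[models.LabeledShapeRequest]:
        # Fall back to a default threshold when the caller doesn't supply one.
        conf_threshold = context.conf_threshold if context.conf_threshold is not None else 0.5

        # A real function would run a model here; this sketch pretends it found
        # one box covering the top-left quarter of the image, with score 0.9.
        score = 0.9
        if score < conf_threshold:
            return []

        return [cvataa.rectangle(0, [0, 0, image.width / 2, image.height / 2])]
```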
`detect` must return a sequence of `LabeledImageRequest` and/or `LabeledShapeRequest` objects,
representing tags/shapes found in the image.
See the docstring of `DetectionFunctionSpec` for more information
on the constraints that these objects must follow.

The same AA function may be used with any dataset that contains labels
with the same names as those in the AA function's specification.
The way it works is that the driver matches labels between the spec and the dataset,
and replaces the label IDs in the tag and shape objects
with those defined in the dataset.

For example, suppose the AA function's spec defines the following labels:

| Name  | ID |
| ----- | -- |
| `bat` | 0  |
| `rat` | 1  |

And the dataset defines the following labels:

| Name  | ID  |
| ----- | --- |
| `bat` | 100 |
| `cat` | 101 |
| `rat` | 102 |

Then suppose `detect` returns a shape with `label_id` equal to 1.
The driver will see that it refers to the `rat` label, and replace it with 102,
since that's the ID this label has in the dataset.

The same logic is used for sublabel and attribute IDs.

#### Helper factory functions

The CVAT API model types used in the detection function protocol are
somewhat unwieldy to work with, so it's recommended to use
the helper factory functions provided by this layer.
These helpers instantiate an object of their corresponding model type,
passing their arguments to the model constructor
and sometimes setting some attributes to fixed values.

The following helpers are available for building specifications:

| Name                      | Model type            | Fixed attributes                                       |
| ------------------------- | --------------------- | ------------------------------------------------------ |
| `label_spec`              | `PatchedLabelRequest` | -                                                      |
| `skeleton_label_spec`     | `PatchedLabelRequest` | `type="skeleton"`                                      |
| `keypoint_spec`           | `SublabelRequest`     | `type="points"`                                        |
| `attribute_spec`          | `AttributeRequest`    | `mutable=False`                                        |
| `checkbox_attribute_spec` | `AttributeRequest`    | `mutable=False`, `input_type="checkbox"`, `values=[]`  |
| `number_attribute_spec`   | `AttributeRequest`    | `mutable=False`, `input_type="number"`                 |
| `radio_attribute_spec`    | `AttributeRequest`    | `mutable=False`, `input_type="radio"`                  |
| `select_attribute_spec`   | `AttributeRequest`    | `mutable=False`, `input_type="select"`                 |
| `text_attribute_spec`     | `AttributeRequest`    | `mutable=False`, `input_type="text"`, `values=[]`      |

For `number_attribute_spec`, it's recommended to use
the `cvat_sdk.attributes.number_attribute_values` function to create the `values` argument,
since this function will enforce the constraints expected
for attribute specs of this type. For example:

```python
cvataa.number_attribute_spec("size", 1, number_attribute_values(0, 10))
```

The following helpers are available for use in `detect`:

| Name        | Model type               | Fixed attributes              |
| ----------- | ------------------------ | ----------------------------- |
| `tag`       | `LabeledImageRequest`    | `frame=0`                     |
| `shape`     | `LabeledShapeRequest`    | `frame=0`                     |
| `mask`      | `LabeledShapeRequest`    | `frame=0`, `type="mask"`      |
| `polygon`   | `LabeledShapeRequest`    | `frame=0`, `type="polygon"`   |
| `rectangle` | `LabeledShapeRequest`    | `frame=0`, `type="rectangle"` |
| `skeleton`  | `LabeledShapeRequest`    | `frame=0`, `type="skeleton"`  |
| `keypoint`  | `SubLabeledShapeRequest` | `frame=0`, `type="points"`    |
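For example, a skeleton label and a matching skeleton shape might be built with these
helpers roughly as follows. This is an illustrative sketch: the label names, IDs, and
coordinates are placeholders, and it assumes that `skeleton_label_spec` takes its
sublabels as the third argument and that `skeleton` takes its elements via the
`elements` keyword.

```python
# In the spec: a "cat" skeleton label with two keypoint sublabels.
cvataa.skeleton_label_spec(
    "cat", 0,
    [
        cvataa.keypoint_spec("head", 1),
        cvataa.keypoint_spec("tail", 2),
    ],
)

# In detect: a skeleton shape whose keypoints refer to the sublabel IDs above.
cvataa.skeleton(
    0,
    elements=[
        cvataa.keypoint(1, [10.0, 20.0]),
        cvataa.keypoint(2, [30.0, 40.0]),
    ],
)
```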
For `mask`, it is recommended to create the points list
using the `cvat_sdk.masks.encode_mask` function,
which will convert a bitmap into a list in the format that CVAT expects.
For example:

```python
cvataa.mask(my_label, encode_mask(
    my_mask,  # boolean 2D array, same size as the input image
    [x1, y1, x2, y2],  # top left and bottom right coordinates of the mask
))
```

To create shapes with attributes,
it's recommended to use the `cvat_sdk.attributes.attribute_vals_from_dict` function,
which returns a list of objects that can be passed to an `attributes` argument:

```python
cvataa.rectangle(
    my_label, [x1, y1, x2, y2],
    attributes=attribute_vals_from_dict({my_attr1: val1, my_attr2: val2}),
)
```

### Tracking function protocol

A tracking function is a type of AA function that analyzes an image
with one or more shapes on it,
and then predicts the positions of those shapes on subsequent images.

A tracking function can only be used in agent mode.
When used with a tracking function, an agent will use it
to process requests from the AI tracking tools in the CVAT UI.

{{% alert title="Warning" color="warning" %}}
Currently, only one agent should be run for each tracking function.
If multiple agents for one tracking function are run at the same time,
CVAT users may experience intermittent "Tracking state not found" errors
when using the function.
{{% /alert %}}

A tracking function must have three attributes: `spec`, `init_tracking_state`, and `track`.
It may also optionally have a `preprocess_image` attribute.

`spec` must contain the AA function's specification,
which is an instance of `TrackingFunctionSpec`.
This specification must be initialized with a single `supported_shape_types` parameter,
defining which types of shapes the AA function is able to track. For example:

```python
spec = cvataa.TrackingFunctionSpec(supported_shape_types=["rectangle"])
```

`init_tracking_state` must be a function accepting the following parameters:

- `context` (`TrackingFunctionShapeContext`).
  An object with information about the shape being tracked. See details below.

- `pp_image` (type varies). A preprocessed image.
  Consult the description of `preprocess_image` for more details.

- `shape` (`TrackableShape`). A shape within the preprocessed image.
  `TrackableShape` is a minimal version of the `LabeledShape` SDK model,
  containing only the `type` and `points` fields.
  The shape's `type` is guaranteed to be one of the types
  listed in the `supported_shape_types` field of the spec.

`init_tracking_state` must analyze the shape and create a state object
containing any information that the AA function will need
to predict its location on a subsequent image.
It must then return this object.
`init_tracking_state` must not modify either `pp_image` or `shape`.

`track` must be a function accepting the following parameters:

- `context` (`TrackingFunctionShapeContext`).
  An object with information about the shape being tracked. See details below.

- `pp_image` (type varies). A preprocessed image.
  Consult the description of `preprocess_image` for more details.
  This image will have the same dimensions as those of the image
  used to create the `state` object.

- `state` (type varies). The object returned by a previous call to `init_tracking_state`.

`track` must locate the shape that was used to create the `state` object
on the new preprocessed image.
If it is able to do that, it must return its prediction as a new `TrackableShape` object.
This object must have the same value of `type` as the original shape.
If `track` is unable to locate the shape, it must return `None`.

`track` may modify `state` as needed to improve prediction accuracy on subsequent frames.
It must not modify `pp_image`.
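To illustrate the shape of the protocol, here is a minimal sketch of a tracking function.
It is not a useful tracker: it simply predicts that every shape stays at its original
position. It assumes that `TrackableShape` can be imported from `cvat_sdk.auto_annotation`
and constructed with `type` and `points` keyword arguments.

```python
import cvat_sdk.auto_annotation as cvataa


class StaticTrackingFunction:
    # Declare which shape types this function is able to track.
    spec = cvataa.TrackingFunctionSpec(supported_shape_types=["rectangle"])

    def init_tracking_state(self, context, pp_image, shape):
        # Remember the original shape. A real tracker would extract image
        # features around the shape here instead.
        return {"type": shape.type, "points": list(shape.points)}

    def track(self, context, pp_image, state):
        # Predict that the shape stays exactly where it was.
        # TrackableShape construction with keyword arguments is assumed here.
        return cvataa.TrackableShape(type=state["type"], points=state["points"])
```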
A `TrackingFunctionShapeContext` object passed to both `init_tracking_state` and `track`
will have the following field:

- `original_shape_type` (`str`). The type of the shape being tracked.
  In `init_tracking_state`, this is the same as `shape.type`.
  In `track`, this is the type of the shape that `state` was created from.

`preprocess_image`, if implemented, must accept the following parameters:

- `context` (`TrackingFunctionContext`).
  This is currently a dummy object and should be ignored.
  In future versions, this may contain additional information.

- `image` (`PIL.Image.Image`). An image that will be used to either start or continue tracking.

`preprocess_image` must perform any analysis on the image
that the function can perform independently of the shapes being tracked,
and return an object representing the results of that analysis.
This object will be passed as `pp_image` to `init_tracking_state` and `track`.

If `preprocess_image` is not implemented, then the `pp_image` object will be the original image.
In other words, the default implementation is:

```python
def preprocess_image(context, image):
    return image
```

## Auto-annotation driver

The `annotate_task` function uses a detection function to annotate a CVAT task.
It must be called as follows:

```python
annotate_task(<client>, <task ID>, <AA function>, <optional arguments...>)
```

The supplied client will be used to make all API calls.

By default, new annotations will be appended to the old ones.
Use `clear_existing=True` to remove old annotations instead.

If a detection function declares a label that has no matching label in the task,
then by default, a `BadFunctionError` is raised, and auto-annotation is aborted.
If you use `allow_unmatched_label=True`, then such labels will be ignored,
and any shapes referring to them will be dropped.
The same logic applies to sublabels and attributes.

It's possible to pass a custom confidence threshold to the function
via the `conf_threshold` parameter.

`annotate_task` will raise a `BadFunctionError` exception
if it detects that the function violated the detection function protocol.

## Predefined AA functions

This layer includes several predefined detection functions.
You can use them as-is, or as a base on which to build your own.

These AA functions use models from the
[torchvision](https://pytorch.org/vision/stable/index.html) library.
To use them, install CVAT SDK with the `pytorch` extra:

```
$ pip install "cvat-sdk[pytorch]"
```

Each function is implemented as a dedicated module
to allow usage via the CLI `auto-annotate` command.

Usage from Python:

```python
from cvat_sdk.auto_annotation.functions.torchvision_<...> import create as create_torchvision

annotate_task(<client>, <task ID>, create_torchvision(<model name>, ...))
```

Usage from the CLI:

```bash
cvat-cli auto-annotate "<task ID>" \
    --function-module "cvat_sdk.auto_annotation.functions.torchvision_<...>" \
    -p model_name=str:"<model name>" ...
```

The `create` function in each module accepts the following parameters:

- `model_name` (`str`) - the name of the model, such as `fasterrcnn_resnet50_fpn_v2`.
  This parameter is required.
- `weights_name` (`str`) - the name of a weights enum value for the model,
  such as `COCO_V1`. Defaults to `DEFAULT`.

It also accepts arbitrary additional parameters,
which are passed directly to the model constructor.
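For example, a task could be annotated with the predefined detection function like this
(a sketch; the host, credentials, task ID, and model settings are placeholders,
and `box_score_thresh` is a torchvision Faster R-CNN constructor parameter
passed through to the model):

```python
from cvat_sdk import make_client
from cvat_sdk.auto_annotation import annotate_task
from cvat_sdk.auto_annotation.functions.torchvision_detection import create as create_torchvision

with make_client(host="http://localhost", credentials=("user", "password")) as client:
    # Annotate task 12345 with Faster R-CNN, keeping only confident detections.
    annotate_task(
        client,
        12345,
        create_torchvision("fasterrcnn_resnet50_fpn_v2", box_score_thresh=0.5),
    )
```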
The following sections describe each available function.

### `cvat_sdk.auto_annotation.functions.torchvision_classification`

This AA function uses torchvision's classification models. It produces tag annotations.

For each frame, the function will output one tag whose label has the highest probability,
as long as that probability is greater than or equal to the input confidence threshold.
If it is lower, the function will output nothing.

### `cvat_sdk.auto_annotation.functions.torchvision_detection`

This AA function uses torchvision's object detection models. It produces rectangle annotations.

### `cvat_sdk.auto_annotation.functions.torchvision_instance_segmentation`

This AA function uses torchvision's instance segmentation models.
It produces mask or polygon annotations (depending on the value of `conv_mask_to_poly`).

### `cvat_sdk.auto_annotation.functions.torchvision_keypoint_detection`

This AA function uses torchvision's keypoint detection models. It produces skeleton annotations.
Keypoints which the model marks as invisible will be marked as occluded in CVAT.

### SAM2 tracking function

For users who want to implement SAM2-based tracking, the CVAT repository includes
a ready-to-use SAM2 tracking function at `ai-models/tracker/sam2/func.py`.
This function implements the tracking function protocol described above
and can be used with the CLI commands for creating native functions and running agents.
For detailed setup and usage instructions, see the
{{< ilink "/docs/enterprise/segment-anything-2-tracker" "SAM2 Tracker documentation" >}}.