Create a Dictionary Classifier

Audience: Data Governors

Content Summary: In addition to built-in classifiers, Sensitive Data Discovery can use custom classifiers to discover and apply tags to sensitive data. This page details how to create a custom dictionary classifier. For specific details and examples of other classifiers, see the Create a Custom Column Name Regex Classifier or Create a Custom Regex Classifier tutorials.

Use Case: Custom Dictionary Classifier

Scenario: Your data includes the names of the rooms where employees' desks are located across your organization. Although these locations may be considered sensitive in particular datasets, Immuta's built-in classifiers would not detect them.

A custom dictionary classifier allows you to create your own detectors that enable Immuta's Sensitive Data Discovery to match a list of room names to values in the dataset. The tutorial below uses this scenario to illustrate creating this classifier.
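To make the idea of matching a list of room names against column values concrete, here is a minimal Python sketch. It is not Immuta's implementation: the function name `dictionary_confidence` is hypothetical, and the sketch assumes that "confidence" is simply the fraction of sampled column values found in the dictionary, which this page does not confirm.

```python
def dictionary_confidence(column_values, dictionary, case_sensitive=False):
    """Return the fraction of column values that appear in the dictionary.

    ASSUMPTION: this fraction stands in for "detection confidence".
    Immuta's real computation is not described on this page.
    """
    if not column_values:
        return 0.0

    def normalize(text):
        # Fold case only when the comparison is case-insensitive.
        return text if case_sensitive else text.casefold()

    entries = {normalize(entry) for entry in dictionary}
    hits = sum(1 for value in column_values if normalize(value) in entries)
    return hits / len(column_values)


rooms = ["Research Lab", "Blue Room", "Purple Room"]
column = ["blue room", "Research Lab", "Lobby", "Purple Room", "PURPLE ROOM"]

confidence = dictionary_confidence(column, rooms, case_sensitive=False)
should_tag = confidence >= 0.6  # compare against a minConfidence-style threshold
```

With case-insensitive matching, four of the five sample values match, so this toy threshold check passes. With `case_sensitive=True`, only two match and it does not.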

Attributes of the Custom Dictionary Classifier

Attributes of all custom classifiers are provided on the Sensitive Data Discovery API page. However, attributes specific to the custom dictionary classifier are outlined in the table below.

Attribute	Description
name	`string` Unique, request-friendly classifier name.
displayName	`string` Unique, human-readable classifier name.
description	`string` The classifier description.
type	`string` The type of classifier: `dictionary`.
config	`object` Includes `config.minConfidence`, `config.tags`, `config.values`, and `config.caseSensitive` (defaults to `false`). *Attributes marked with an asterisk are children of `config` and are described below.
minConfidence*	`number` The minimum detection confidence required for tags to be applied, expressed as a decimal (for example, `0.6`).
tags*	`array[string]` The names of the tags to apply to the data source. Note: All tags must start with `Discovered.`.
values*	`array[string]` The list of words to include in the dictionary.
caseSensitive*	`boolean` Indicates whether or not `values` are case sensitive. Defaults to `false`.
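The attribute table can be turned into a quick local check before a payload is sent. The sketch below is illustrative only: `validate_dictionary_classifier` is a hypothetical helper, not part of Immuta, and it assumes every attribute except `caseSensitive` is required, which the table does not state explicitly.

```python
def validate_dictionary_classifier(payload):
    """Return a list of problems found in a dictionary classifier payload."""
    problems = []
    for key in ("name", "displayName", "description"):
        if not isinstance(payload.get(key), str):
            problems.append(f"{key} must be a string")
    if payload.get("type") != "dictionary":
        problems.append("type must be 'dictionary'")

    config = payload.get("config")
    if not isinstance(config, dict):
        problems.append("config must be an object")
        return problems

    min_confidence = config.get("minConfidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(min_confidence, bool) or not isinstance(min_confidence, (int, float)):
        problems.append("config.minConfidence must be a number")

    tags = config.get("tags")
    if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
        problems.append("config.tags must be an array of strings")
    else:
        for tag in tags:
            if not tag.startswith("Discovered."):
                problems.append(f"tag {tag!r} must start with 'Discovered.'")

    values = config.get("values")
    if not isinstance(values, list) or not all(isinstance(v, str) for v in values):
        problems.append("config.values must be an array of strings")

    # caseSensitive is optional and defaults to false.
    if not isinstance(config.get("caseSensitive", False), bool):
        problems.append("config.caseSensitive must be a boolean")
    return problems


example = {
    "name": "EMPLOYEE_DESK_LOCATION_CLASSIFIER",
    "displayName": "Employee Desk Location Classifier",
    "description": "Detects employee desk locations.",
    "type": "dictionary",
    "config": {
        "values": ["Research Lab", "Blue Room", "Purple Room"],
        "minConfidence": 0.6,
        "tags": ["Discovered.desk-location"],
    },
}
problems = validate_dictionary_classifier(example)
```

An empty list means the payload passed these local checks. The Immuta API remains the authority on what it accepts.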

Create a Custom Dictionary Classifier

Generate your API key on the API Keys tab on your profile page and save the API key somewhere secure. You will include this API key in the authorization header when you make a request to the Immuta API or use it to configure your instance with the Immuta CLI.

Save the custom dictionary classifier payload in a .json file, such as the example-payload.json file used in the commands below. The dictionary in this payload contains the values Research Lab, Blue Room, and Purple Room.

{
  "name": "EMPLOYEE_DESK_LOCATION_CLASSIFIER",
  "displayName": "Employee Desk Location Classifier",
  "description": "This classifier detects when an employee's desk location appears in a dataset.",
  "type": "dictionary",
  "config": {
    "values": ["Research Lab", "Blue Room", "Purple Room"],
    "caseSensitive": false,
    "minConfidence": 0.6,
    "tags": ["Discovered.desk-location"]
  }
}
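If the room names come from another system, the payload file can be generated instead of written by hand. The following is a minimal sketch under that assumption. The helper names `build_payload` and `save_payload` are hypothetical, and the field values mirror the example payload above.

```python
import json
from pathlib import Path


def build_payload(room_names, min_confidence=0.6):
    """Build a dictionary classifier payload from a list of room names."""
    return {
        "name": "EMPLOYEE_DESK_LOCATION_CLASSIFIER",
        "displayName": "Employee Desk Location Classifier",
        "description": "This classifier detects when an employee's desk "
                       "location appears in a dataset.",
        "type": "dictionary",
        "config": {
            "values": list(room_names),
            "caseSensitive": False,
            "minConfidence": min_confidence,
            "tags": ["Discovered.desk-location"],
        },
    }


def save_payload(path, payload):
    """Write the payload as indented JSON so it can be passed to the CLI or curl."""
    Path(path).write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")


payload = build_payload(["Research Lab", "Blue Room", "Purple Room"])
# save_payload("example-payload.json", payload)  # uncomment to write the file
```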

Create the classifier using one of these methods:

Immuta CLI

immuta api sdd/classifier -X POST --input ./example-payload.json

HTTP API

curl \
    --request POST \
    --header "Content-Type: application/json" \
    --header "Authorization: 12345678900000" \
    --data @example-payload.json \
    https://your-immuta-url.immuta.com/sdd/classifier
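The same request can be made from a script. The sketch below mirrors the curl command using only Python's standard library. The function names are hypothetical, the URL and API key are the placeholders from the command above, and the sketch assumes the response body is JSON, as in the example response on this page.

```python
import json
import urllib.request


def build_create_request(base_url, api_key, payload):
    """Build (but do not send) the POST request for the sdd/classifier endpoint."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/sdd/classifier",
        data=json.dumps(payload).encode("utf-8"),
        method="POST",
        headers={
            "Content-Type": "application/json",
            "Authorization": api_key,  # same header the curl example sends
        },
    )


def create_classifier(base_url, api_key, payload):
    """Send the request and return the parsed JSON response."""
    request = build_create_request(base_url, api_key, payload)
    # urlopen raises urllib.error.HTTPError for 4xx and 5xx responses.
    with urllib.request.urlopen(request) as response:
        return json.load(response)


request = build_create_request(
    "https://your-immuta-url.immuta.com",  # placeholder instance URL
    "12345678900000",                      # placeholder API key
    {"name": "EMPLOYEE_DESK_LOCATION_CLASSIFIER", "type": "dictionary"},
)
```

Separating request construction from sending makes it possible to inspect the URL, headers, and body before anything reaches the server.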

If the request is successful, you will receive a response that contains details about the classifier.

{
  "createdBy": {
    "id": 1,
    "name": "John",
    "email": "john@example.com"
  },
  "name": "EMPLOYEE_DESK_LOCATION_CLASSIFIER",
  "displayName": "Employee Desk Location Classifier",
  "description": "This classifier detects when an employee's desk location appears in a dataset.",
  "type": "dictionary",
  "config": {
    "tags": [
      "Discovered.desk-location"
    ],
    "values": [
      "Research Lab",
      "Blue Room",
      "Purple Room"
    ],
    "caseSensitive": false,
    "minConfidence": 0.6
  },
  "id": 68,
  "createdAt": "2021-10-20T17:57:51.696Z",
  "updatedAt": "2021-10-20T17:57:51.696Z"
}
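A script that creates classifiers may want to confirm that the response echoes what was sent and to record the new `id`. The following is a minimal sketch. `check_created_classifier` is a hypothetical helper, and it compares only the fields present in the request, on the assumption that the server may add fields of its own, such as `createdBy` and timestamps.

```python
def check_created_classifier(sent, response):
    """Return (classifier id, list of fields whose values differ from the request)."""
    mismatched = [
        key
        for key in ("name", "displayName", "description", "type")
        if response.get(key) != sent.get(key)
    ]
    returned_config = response.get("config", {})
    # Compare only the config keys that were sent; defaults added by the
    # server (an assumption) are not treated as mismatches.
    for key, value in sent.get("config", {}).items():
        if returned_config.get(key) != value:
            mismatched.append(f"config.{key}")
    return response.get("id"), mismatched


sent = {
    "name": "EMPLOYEE_DESK_LOCATION_CLASSIFIER",
    "displayName": "Employee Desk Location Classifier",
    "description": "Detects employee desk locations.",
    "type": "dictionary",
    "config": {
        "values": ["Research Lab", "Blue Room", "Purple Room"],
        "minConfidence": 0.6,
        "tags": ["Discovered.desk-location"],
    },
}
# A response shaped like the example above: request fields plus server-added ones.
response = {
    **sent,
    "config": {**sent["config"], "caseSensitive": False},
    "id": 68,
    "createdAt": "2021-10-20T17:57:51.696Z",
}
classifier_id, mismatched = check_created_classifier(sent, response)
```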

What's Next

Continue to one of the following tutorials:

Run Sensitive Data Discovery on Data Sources: Trigger SDD to run on specified data sources.
Create a Template: Although only Data Governors can create classifiers, Data Owners can add classifiers to templates and apply those templates to their data sources, overriding minConfidence or tags for the classifiers within the template.