14.10 Sensitive Information Classification Library

Before the system can manage internal documents, administrators must define text-based rules in the Sensitive Information Classification Library. The system uses these rules to match internal documents automatically and classify them. Combined with the permissions the administrator sets for each document category, this lets the system control and log how documents are shared and used according to their classification.

When setting text-based rules, administrators configure two kinds of entries: Feature Rules and Information Categories. A Feature Rule defines a specific document-matching criterion, while an Information Category groups Feature Rules together to identify and classify documents.

Select "Category Management -> Sensitive Information Classification Library" to open the library window.

Operation: Description
New: Select the root node of Information Categories or Feature Rules, then choose "Action -> New" or click the New button on the toolbar to create a new Information Category or Feature Rule.
Search: Select "Action -> Search" or click the Search button on the toolbar to locate specific Feature Rules or Information Categories. Fuzzy search is supported.
Show Hidden Feature Rules: Feature Rules imported from the Keyword Extraction Tool are hidden by default. Select "Action -> Show Hidden Feature Rules" to view them.
Import: Select "Action -> Import" and choose a classification library file to import previously exported Information Categories or Feature Rules.
Export: Select "Action -> Export" to export specific Information Categories or Feature Rules. Export options include:
  • Export All: Exports the entire library of Information Categories and Feature Rules.
  • Export Information Categories: Exports the specified Information Categories along with the Feature Rules applied to them. Feature Rules not linked to these categories are not exported.
  • Export Feature Rules: Exports only the specified Feature Rules.

Information Category Settings

Operation: Description
Information Category Name: A custom name for the information category, defined by the administrator. Names must be unique.
Category Level:
  • Defines the classification level of the information category, corresponding to the sensitivity level used by the Document Label feature. This option appears only if the Document Label module is installed.
  • By default, the level is unset, meaning no document label is applied. Administrators can select an appropriate level as needed, allowing management policies to link sensitive information categories with document label levels.
  • In the left-hand view of the Sensitive Information Classification Library, you can switch between Category View and Level View. In Level View, right-click an information category and select "Set Category Level" to modify its level.
Notes: Optional notes or a description for the information category.
Rule Group: The Feature Rules included in the information category; multiple rules can be selected.
Rule Weight: The weight of a Feature Rule within the information category. The default is 100, and administrators can set any integer from 0 to 100. A document matches the information category only when the sum of the weights of its matched Feature Rules reaches or exceeds 100.
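
The weight rule can be illustrated with a minimal Python sketch. It is not the product's implementation: the WeightedRule class, the category_matches function, and the rule names are hypothetical, and the only behavior taken from this section is that a category matches when the weights of its matched Feature Rules sum to at least 100.

```python
from dataclasses import dataclass

MATCH_THRESHOLD = 100  # a category matches when matched weights sum to >= 100


@dataclass
class WeightedRule:
    name: str          # Feature Rule name (hypothetical examples below)
    weight: int = 100  # integer from 0 to 100; 100 is the default

    def __post_init__(self):
        if not 0 <= self.weight <= 100:
            raise ValueError("weight must be an integer from 0 to 100")


def category_matches(rules, matched_rule_names):
    """Return True if the matched rules' weights reach the threshold."""
    total = sum(r.weight for r in rules if r.name in matched_rule_names)
    return total >= MATCH_THRESHOLD


# Hypothetical category with three Feature Rules.
rules = [
    WeightedRule("contract keywords", 60),
    WeightedRule("office file type", 40),
    WeightedRule("customer id pattern", 50),
]

print(category_matches(rules, {"contract keywords"}))
print(category_matches(rules, {"contract keywords", "office file type"}))
```

With these example weights, a single rule at weight 60 is not enough on its own, while 60 + 40 reaches the threshold. Leaving a rule at the default weight of 100 means that rule alone is sufficient for the category to match.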

Feature Rule Settings

Operation: Description
Feature Rule Name: A custom name for the feature rule, defined by the administrator. Names cannot start with "@" and must be unique.
Type: Specifies the type of content the feature rule identifies: File Name, File Type, File Size, File Content, or File Properties.
  • File Name: Supports keyword or regular expression patterns. Matches the target file's name or storage path against Include Content, ignoring anything listed in Exclude Content.
  • File Type: Performs a fuzzy match on the file header content against Include Content, ignoring anything listed in Exclude Content.
  • File Size: Matches target files whose size equals the file size specified in Include Content.
  • File Content: Supports keyword or regular expression patterns. Matches text within the document against Include Content, ignoring anything listed in Exclude Content.
  • File Properties: Applicable only to Office documents. Set the property name, data type, and property value to match document attributes such as creation time, modification time, and author.
Content Scope:
  • For supported file types, you can restrict which part of the file is scanned. The default is Entire Content, but you can also choose Header/Footer Only or Body Only.
  • Currently, header/footer and body-level scanning is supported for the following file types: doc, docx, xls, xlsx, ppt, pptx, wps, et, dps.
  • If a rule is set to scan only the header/footer or only the body, files of unsupported types are treated as non-matching.
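
The scope restriction can be sketched as follows. This is only an illustration: the function name and the scope labels are hypothetical, and the sketch assumes the file type is taken from the file extension, which may differ from how the product determines it.

```python
from pathlib import PurePath

# File types listed above as supporting header/footer and body-level scanning.
SCOPED_TYPES = {"doc", "docx", "xls", "xlsx", "ppt", "pptx", "wps", "et", "dps"}

# Hypothetical labels for the three Content Scope options.
SCOPES = {"entire", "header_footer", "body"}


def scope_is_scannable(file_name, scope):
    """Return False when a restricted scope is applied to an unsupported type.

    A False result means the file is treated as non-matching for the rule.
    """
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    if scope == "entire":
        return True
    extension = PurePath(file_name).suffix.lower().lstrip(".")
    return extension in SCOPED_TYPES


print(scope_is_scannable("report.docx", "header_footer"))
print(scope_is_scannable("notes.txt", "body"))
```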
Deduplication:
  • When enabled, text that appears multiple times in a document is counted only once when matching this feature rule.
  • When disabled, each occurrence is counted separately.
Case Sensitivity: When enabled, matching of English text in the Include Content field is case-sensitive.
Hit Count: Specifies the minimum total number of occurrences of the text in Include Content required for a document to match this feature rule. Valid values are integers from 1 to 10,000. A document matches the rule only if the total number of occurrences meets or exceeds this value.
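
The interaction of Deduplication, Case Sensitivity, and Hit Count can be sketched in Python. The function names are hypothetical, and the sketch assumes keyword entries, non-overlapping occurrence counting, and deduplication applied per keyword; the product may count differently.

```python
def count_hits(text, keywords, deduplicate=False, case_sensitive=False):
    """Total occurrences of the keywords in text (assumed counting model)."""
    if not case_sensitive:
        text = text.lower()
        keywords = [k.lower() for k in keywords]
    # Drop empty and repeated entries so each keyword is counted once.
    unique_keywords = [k for k in dict.fromkeys(keywords) if k]
    total = 0
    for keyword in unique_keywords:
        occurrences = text.count(keyword)  # non-overlapping occurrences
        if deduplicate:
            occurrences = min(occurrences, 1)
        total += occurrences
    return total


def rule_matches(text, keywords, hit_count=1, **options):
    """A document matches when its total hits meet or exceed hit_count."""
    if not 1 <= hit_count <= 10000:
        raise ValueError("hit_count must be an integer from 1 to 10,000")
    return count_hits(text, keywords, **options) >= hit_count


sample = "Confidential contract. This CONTRACT is confidential."
words = ["contract", "confidential"]

print(count_hits(sample, words))                       # 4
print(count_hits(sample, words, deduplicate=True))     # 2
print(count_hits(sample, words, case_sensitive=True))  # 2
```

In this sample, a rule with a Hit Count of 3 matches when Deduplication is off (four hits) but not when it is on (two hits).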
Content Classification: Specifies how the entries in Include Content and Exclude Content are interpreted, as Keyword or Regular Expression:
  • Keyword: Matches text literally as entered in Include Content or Exclude Content.
  • Regular Expression: Matches text using the regular expressions entered in Include Content or Exclude Content.
Include Content: Defines the content used to match documents. Multiple entries are supported, separated by commas or line breaks.
  • Supports plain text or regular expressions.
  • For keywords, entries can be separated by commas or semicolons.
  • For regular expressions, entries must be separated by semicolons.
The format varies depending on the Type selected for the feature rule:
File Name: Content set here is matched against the target document's file name and storage path.
  • Keyword:
      • Supports wildcards.
      • Use \ as the path separator.
      • If the content does not include \, only the file name is matched, not the path.
      • If the content includes \, both the file name and path are matched.
  • Regular Expression:
      • Use \\ as the path separator.
      • If the content does not include \\, only the file name is matched.
      • If the content includes \\, both the file name and path are matched.
  • Example: For a document located at D:\Company Confidential\Contract Documents\1025415\Sales Contract 2019.docx, the following Include Content settings give these results:
      1. Contract Documents → No match
      2. Company Confidential\ → Match
      3. Sales Contract → Match
      4. \d{7} → No match
      5. \\\d{7} → Match
      6. \d{4} → Match
    Entries 1 and 4 do not match because they contain no path separator, so only the file name Sales Contract 2019.docx is checked.
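
The file name rules above can be checked with a short Python sketch. It is an approximation, not the product's matcher: the function names are hypothetical, keywords are treated as substring matches, and the wildcards are assumed to be * and ?, which this section does not specify.

```python
import re
from pathlib import PureWindowsPath


def keyword_to_regex(keyword):
    """Translate a keyword into a regex (assumed wildcards: * and ?)."""
    parts = []
    for ch in keyword:
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "".join(parts)


def file_name_rule_matches(full_path, content, is_regex):
    """Match against the file name, or the full path if a separator is used."""
    file_name = PureWindowsPath(full_path).name
    if is_regex:
        uses_path = "\\\\" in content  # regex path separator: two backslashes
        pattern = content
    else:
        uses_path = "\\" in content    # keyword path separator: one backslash
        pattern = keyword_to_regex(content)
    target = full_path if uses_path else file_name
    return re.search(pattern, target) is not None


PATH = r"D:\Company Confidential\Contract Documents\1025415\Sales Contract 2019.docx"

# (Include Content entry, is it a regular expression?)
EXAMPLES = [
    ("Contract Documents", False),
    ("Company Confidential\\", False),
    ("Sales Contract", False),
    (r"\d{7}", True),
    (r"\\\d{7}", True),
    (r"\d{4}", True),
]

for content, is_regex in EXAMPLES:
    matched = file_name_rule_matches(PATH, content, is_regex)
    print(f"{content!r:28} -> {'Match' if matched else 'No match'}")
```

Under these assumptions the six entries evaluate to No match, Match, Match, No match, Match, Match, which agrees with the rules stated for keywords and regular expressions.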
File Type: Content set here is matched against the target document's file type.
  • If the predefined file types in the feature rule library do not meet your needs, you can specify a custom file type using the following format:
      • Offset | File Header Signature (for example, 10|87828101).
      • Both the offset and the file header signature are expressed in hexadecimal.
      • A positive offset indicates the position from the start of the file, while a negative offset indicates the position from the end of the file.
      • If the offset is 0, it can be omitted, and only the file header signature needs to be set (e.g., FFD8FF).
  • How to obtain the offset and file header signature:
      1. Install and open UltraEdit.exe.
      2. Drag the target file into the software.
      3. From the menu, select Edit -> Hex Functions -> Hex Edit.
  • In general, it is recommended to open multiple files of the same type, compare their hex content, and select the portion that consistently appears as the file header signature, along with its corresponding offset, to improve file type matching accuracy.
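
The custom file type format can be illustrated with a small Python sketch that parses an Offset | File Header Signature entry and checks it against a file's bytes. The function names are hypothetical, the sample data is made up for the demonstration, and the comparison is an exact byte match, whereas the product describes its File Type matching as fuzzy.

```python
def parse_signature(spec):
    """Split 'offset|signature' (both hexadecimal) into (int, bytes).

    The offset is optional and defaults to 0, as in 'FFD8FF'.
    """
    offset_text, separator, signature_text = spec.rpartition("|")
    offset = int(offset_text, 16) if separator else 0
    return offset, bytes.fromhex(signature_text)


def has_signature(data, spec):
    """Check whether data contains the signature at the given offset.

    A negative offset is counted from the end of the data.
    """
    offset, signature = parse_signature(spec)
    start = offset if offset >= 0 else len(data) + offset
    if start < 0:
        return False
    return data[start:start + len(signature)] == signature


# Made-up sample: 0x10 (16) filler bytes, then the signature bytes.
sample = bytes(16) + bytes.fromhex("87828101") + b"payload"

print(parse_signature("10|87828101"))         # (16, b'\x87\x82\x81\x01')
print(has_signature(sample, "10|87828101"))   # True
print(has_signature(b"\xff\xd8\xff\xe0", "FFD8FF"))  # True
```

To test a real file, read its bytes first (for example with pathlib.Path(...).read_bytes()) and pass them as data.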
File Content: Content set here is matched against the target document's text. Support for Keyword and Regular Expression is the same as for File Name.
Exclude Content: Specifies content that should be ignored during matching. The rules follow the same format as Include Content, and Exclude Content takes priority over Include Content.

The feature rule library already includes predefined File Type rules for commonly used file types. These predefined rules are displayed in blue and cannot be deleted or modified.

The predefined library includes the following file types:

Adobe Illustrator files, Altium Protel files, AutoCAD files, AnySecura encrypted files, Office files, PDF files, Photoshop files, Pro/ENGINEER files, SOLIDWORKS files, Visual Studio files, video files, image files, compressed files, and audio files.

Some types have hidden subtypes by default. To view them, select "Action -> Show Hidden Feature Rules."

When setting up an information category, users can either use existing predefined feature rules or right-click a feature rule and select "New Feature Identification" to create custom feature rules.