TesseractAction

Performs OCR on one or multiple images using Tesseract.

Returns either the whole text of the parsed document(s) or, if an X-OCR-Regex header is present, one of the following: the text matching a specified regular expression, the text matching multiple named regular expressions, or whether the text matches a specified regular expression.
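The three regex modes described above could be sketched roughly as follows. This is a minimal illustration that operates on already extracted text only. The function names are hypothetical, the format of the X-OCR-Regex header is not covered, and it is an assumption that "matching" means the first match found anywhere in the text.

```kotlin
// Hypothetical helpers, not part of the plugin's API.

// Mode 1 (assumed): return the first substring that matches the pattern, or null.
fun firstMatch(text: String, pattern: String): String? =
    Regex(pattern).find(text)?.value

// Mode 2 (assumed): evaluate several named patterns and return one result per name.
fun namedMatches(text: String, patterns: Map<String, String>): Map<String, String?> =
    patterns.mapValues { (_, pattern) -> Regex(pattern).find(text)?.value }

// Mode 3 (assumed): report whether the pattern occurs anywhere in the text.
fun containsMatch(text: String, pattern: String): Boolean =
    Regex(pattern).containsMatchIn(text)
```

With the text "Invoice 2024-001 total 42.50 EUR", firstMatch with the pattern \d{4}-\d{3} would yield 2024-001, and containsMatch with the pattern EUR would yield true.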

Formcycle upload fields that use CodBi's Media.MultipleDownload to upload more than one image are supported. The returned JSON contains one property per transmitted file, named after the file and holding the text found in it.

Plugin-Properties

  • AI_Tesseract_Languages Optional specification of the languages Tesseract should recognize, given as three-letter language codes (defaults to deu). Multiple languages may be separated by a + (e.g. deu + eng).

  • AI_Tesseract_PoolSize Number of Tesseract instances that are concurrently available (see sizePool).

  • AI_Tesseract_MaxCPUPercent CPU usage threshold (%); OCR requests are blocked when it is exceeded (default: 101.0, effectively disabled).

  • AI_Tesseract_MaxRAMPercent RAM usage threshold (%); OCR requests are blocked when it is exceeded (default: 101.0, effectively disabled).
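To illustrate why a default of 101.0 effectively disables the thresholds, the check could look roughly like the sketch below. LoadLimits and shouldBlockOcr are hypothetical names, and it is an assumption that "exceeded" means strictly greater than the configured value; how the plugin actually measures CPU and RAM usage is not shown.

```kotlin
// Hypothetical helper, not part of the plugin's API.
data class LoadLimits(
    val maxCpuPercent: Double = 101.0, // documented default of AI_Tesseract_MaxCPUPercent
    val maxRamPercent: Double = 101.0  // documented default of AI_Tesseract_MaxRAMPercent
)

// Returns true when an OCR request should be blocked.
// Assumption: a threshold counts as exceeded only when usage is strictly greater than it.
fun shouldBlockOcr(
    cpuPercent: Double,
    ramPercent: Double,
    limits: LoadLimits = LoadLimits()
): Boolean =
    cpuPercent > limits.maxCpuPercent || ramPercent > limits.maxRamPercent
```

Because usage cannot rise above 100 percent, the default limits never trigger; a limit such as 90.0 would block requests while the load is above that value.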

URLs needed for proper initialization:

The Maven repository's URL may be changed using the AI_Tesseract_MavenRepository plugin property.

Domains to whitelist

  • repo1.maven.org

  • github.com

  • raw.githubusercontent.com

  • api.github.com

  • objects.githubusercontent.com

GDPR (DSGVO), EU AI Act & Technical Advantages vs. a Dedicated AI Server Approach

  • No separate AI server setup (fewer systems to secure and audit).

  • Reduced data transfer: processing stays within the plugin runtime.

  • Simpler compliance scope: fewer endpoints and lower operational overhead.

  • Lower latency and fewer network dependencies for OCR execution.

  • Easier data minimization: fewer data copies and storage locations.

  • Clearer accountability boundaries for processor/controller roles.

  • Simplified breach response: No separate AI server to manage in case of incidents.

  • Easier implementation of data subject rights (access, deletion) without coordinating with a separate AI service.

  • Plugin does not store image data or OCR results persistently, minimizing data retention concerns.

  • Simplest handling of deletion requests: data is never stored, not even in server backups, so no deletion is necessary.

Note On Removal

Once OCR has been activated, the DLL in use is locked in memory, which makes it impossible to delete the plugin's files from the server. This is a technical limitation of the Tesseract library, not a CodBi-specific issue. To remove the plugin after activation, first disable the plugin and then reboot the server. After that, the plugin can be deleted.

Constructors

constructor()

Types

object Companion

Companion for static members.

Functions

open override fun execute(params: IPluginServletActionParams): IPluginServletActionRetVal

If activated by the CodBi plugin property Active_AI containing OCR, uses the AI's janitor to store images that have an ID (transmitted in the X-OCR-Image-ID header) and extracts all text from the images that were transmitted or are referenced via X-OCR-Image-ID.

open override fun getDisplayName(p0: Locale): String
open override fun getName(): String

Specifies the name of this IPluginServletAction.

open override fun initialize(configData: IPluginInitializeData)

Initializes this plugin if the CodBi plugin property Active_AI contains OCR (case-insensitive). By determining the pluginRoot, it tells the execute method where to store the temporary images. Furthermore, the native libraries appropriate for the server's OS are extracted from the JAR and copied onto the server's drive; they are then cloned so that the copies in use are not locked by possible previous initializations of the plugin. This servlet checks whether the models for the languages specified via the CodBi plugin property AI_Tesseract_Languages (e.g. deu+ita+eng or just deu) are already present within the plugin's local resources and, if not, downloads the model for each missing language automatically. If the property is not set, deu is assumed.
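The model check described above could be sketched as follows. This is only an illustration: missingModels is a hypothetical function, the directory layout is an assumption, and the download step is left out. The file naming follows Tesseract's usual convention of one <language>.traineddata file per language.

```kotlin
import java.io.File

// Hypothetical helper, not part of the plugin's API.
// Returns the language codes whose model file is not yet present in modelDir.
fun missingModels(languageSpec: String?, modelDir: File): List<String> {
    // Fall back to "deu" when the property is not set, as documented.
    val languages = (languageSpec?.takeIf { it.isNotBlank() } ?: "deu")
        .split('+')
        .map { it.trim() }
        .filter { it.isNotEmpty() }
    // Assumption: models are stored flat in modelDir as <language>.traineddata.
    return languages.filter { !File(modelDir, "$it.traineddata").isFile }
}
```

For a directory that already contains deu.traineddata, a specification of deu+eng would report only eng as missing.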

open fun initPlugin()
open fun install(p0: IPluginInstallData)
open override fun shutdown(shutdownData: IPluginShutdownData?)

Shuts down the pool and releases all Tesseract handles.

open fun shutdown()
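A fixed-size pool whose shutdown releases every handle might be structured like the sketch below. HandlePool is a hypothetical class built on java.util.concurrent and is not the plugin's actual pool; the handle type is reduced to AutoCloseable for illustration.

```kotlin
import java.util.concurrent.ArrayBlockingQueue
import java.util.concurrent.TimeUnit

// Hypothetical pool, not the plugin's implementation.
class HandlePool<T : AutoCloseable>(size: Int, factory: () -> T) {
    private val idle = ArrayBlockingQueue<T>(size)

    @Volatile
    private var closed = false

    init {
        // Create all handles up front, analogous to a configured pool size.
        repeat(size) { idle.add(factory()) }
    }

    // Borrows a handle, runs the block, and returns the handle afterwards.
    // Returns null when no handle became free within the timeout.
    fun <R> withHandle(timeoutMs: Long, block: (T) -> R): R? {
        check(!closed) { "pool is shut down" }
        val handle = idle.poll(timeoutMs, TimeUnit.MILLISECONDS) ?: return null
        try {
            return block(handle)
        } finally {
            idle.put(handle)
            // If shutdown happened while the handle was borrowed, release it now.
            if (closed) drain()
        }
    }

    // Marks the pool as closed and releases all idle handles.
    fun shutdown() {
        closed = true
        drain()
    }

    private fun drain() {
        while (true) {
            val handle = idle.poll() ?: break
            handle.close()
        }
    }
}
```

Handles that are still borrowed when shutdown is called are closed as soon as they are returned, so each handle is released exactly once.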

Initiates a task that removes unused images that have expired (msExpirationIDedImages) from the cache (cacheIDedImages).
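The expiry-based cleanup could work along the lines of the following sketch. ExpiringImageCache is a hypothetical class; it is an assumption that expiry is measured from the time an image was stored, and the periodic scheduling of the cleanup is left out.

```kotlin
import java.util.concurrent.ConcurrentHashMap

// Hypothetical cache, not the plugin's implementation.
class ExpiringImageCache(
    private val expirationMs: Long,
    // Injectable clock so the expiry logic can be exercised without waiting.
    private val clock: () -> Long = System::currentTimeMillis
) {
    private class Entry(val bytes: ByteArray, val storedAt: Long)

    private val entries = ConcurrentHashMap<String, Entry>()

    fun put(id: String, bytes: ByteArray) {
        entries[id] = Entry(bytes, clock())
    }

    fun get(id: String): ByteArray? = entries[id]?.bytes

    // Removes every entry whose age has reached expirationMs; returns the number removed.
    fun removeExpired(): Int {
        val now = clock()
        var removed = 0
        for ((id, entry) in entries) {
            // remove(key, value) only succeeds if the entry was not replaced in the meantime.
            if (now - entry.storedAt >= expirationMs && entries.remove(id, entry)) {
                removed++
            }
        }
        return removed
    }
}
```

A background task, for example one run by a ScheduledExecutorService, could call removeExpired at a fixed interval.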

open fun uninstall(p0: IPluginUninstallData)
open override fun validateConfigurationData(configData: IPluginValidationData): IPluginInitializeValidationResult?

Wipes the local data needed to run Tesseract if Active_AI does not contain OCR. Furthermore, if AI_Tesseract_Languages is set, it is checked for compliance with ^[a-z]{3}(\s\+\s[a-z]{3})*$.
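A validation of the language specification could be sketched as shown below. The function names are hypothetical. The pattern used here is an assumption: it makes the whitespace around + optional so that both forms appearing in this documentation (deu + eng and deu+ita+eng) are accepted, which may differ from the pattern the plugin actually applies.

```kotlin
// Assumed pattern: three lowercase letters, optionally followed by further
// three-letter codes joined with '+', with optional whitespace around the '+'.
val languageSpecPattern = Regex("""^[a-z]{3}(\s*\+\s*[a-z]{3})*$""")

// Hypothetical helper, not part of the plugin's API.
fun isValidLanguageSpec(value: String): Boolean = languageSpecPattern.matches(value)

// Splits a valid specification into its language codes; falls back to "deu" when unset.
fun parseLanguages(value: String?): List<String> {
    if (value.isNullOrBlank()) return listOf("deu")
    require(isValidLanguageSpec(value)) { "Invalid language specification: $value" }
    return value.split('+').map { it.trim() }
}
```

Under this assumed pattern, deu and deu + eng are accepted, while de, DEU and deu+ are rejected.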