Tesseract Action
Performs OCR on one or multiple images using Tesseract.
Returns the whole text of the parsed document(s) by default. If an X-OCR-Regex header is present, it returns instead one of the following: the text matching a specified regular expression, the text matching multiple named regular expressions, or whether the text matches a specified regular expression.
Formcycle upload fields that use CodBi's Media.MultipleDownload to upload more than one image are supported. The returned JSON holds one property per transmitted file, named after the file and containing the text found in it.
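The three regex modes and the per-file JSON shape described above can be pictured with a small sketch. It is illustrative only: the function names are hypothetical, the sample file names and texts are made up, and how the X-OCR-Regex header encodes the mode is not covered here.

```python
import json
import re


def extract_match(text, pattern):
    """Return the first substring of text that matches pattern, or None."""
    found = re.search(pattern, text)
    return found.group(0) if found else None


def extract_named(text, named_patterns):
    """Map each name to the first match of its pattern (None if no match)."""
    return {name: extract_match(text, p) for name, p in named_patterns.items()}


def is_match(text, pattern):
    """Report whether pattern occurs anywhere in text."""
    return re.search(pattern, text) is not None


def build_response(text_by_file):
    """Serialize the result with one JSON property per transmitted file name."""
    return json.dumps(text_by_file, ensure_ascii=False)


# Hypothetical OCR results for two uploaded files.
ocr = {
    "invoice.png": "Invoice No. 4711 Total: 99,50 EUR",
    "note.jpg": "Thank you",
}
print(build_response(ocr))
print(extract_named(ocr["invoice.png"], {"number": r"\d{4}", "total": r"\d+,\d{2}"}))
```

Whether the plugin returns the first match or all matches for a pattern is an assumption of this sketch.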
Plugin-Properties
AI_Tesseract_Languages Optional. Three-letter codes of the languages Tesseract shall be able to recognize (defaults to deu). Multiple languages may be separated by a + (e.g. deu + eng).
AI_Tesseract_PoolSize Number of Tesseract instances that are concurrently available (see sizePool).
AI_Tesseract_MaxCPUPercent CPU usage threshold (%). OCR requests are blocked when it is exceeded (default: 101.0, effectively disabled).
AI_Tesseract_MaxRAMPercent RAM usage threshold (%). OCR requests are blocked when it is exceeded (default: 101.0, effectively disabled).
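To see why a default of 101.0 effectively disables the load check, consider this minimal sketch. The function name is hypothetical, and it assumes that "exceeded" means strictly greater than the threshold; the plugin's actual comparison may differ.

```python
def should_block(cpu_percent, ram_percent, max_cpu=101.0, max_ram=101.0):
    """Return True when CPU or RAM usage exceeds its threshold.

    Usage is expressed in percent (0 to 100), so the default threshold
    of 101.0 can never be exceeded and no request is blocked.
    """
    return cpu_percent > max_cpu or ram_percent > max_ram


# With the defaults, even a fully loaded server does not block OCR requests.
print(should_block(100.0, 100.0))
# With AI_Tesseract_MaxCPUPercent set to 80, a CPU load of 85 % blocks them.
print(should_block(85.0, 40.0, max_cpu=80.0))
```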
URLs needed for proper initialization:
The Maven repository's URL may be changed using the AI_Tesseract_MavenRepository plugin property.
Domains to whitelist
repo1.maven.org
github.com
raw.githubusercontent.com
api.github.com
objects.githubusercontent.com
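Before activating OCR, it can help to confirm from the server that the domains listed above are reachable. The following is a minimal sketch, assuming the downloads use HTTPS on port 443 (not stated here); the function name and the injectable connect parameter are illustrative choices. If AI_Tesseract_MavenRepository points elsewhere, that host would take the place of repo1.maven.org.

```python
import socket

DOMAINS = [
    "repo1.maven.org",
    "github.com",
    "raw.githubusercontent.com",
    "api.github.com",
    "objects.githubusercontent.com",
]


def check_domains(domains, port=443, timeout=5.0, connect=socket.create_connection):
    """Try a TCP connection to each domain and report which ones succeed.

    Port 443 is an assumption. The connect parameter can be replaced by a
    stub, which keeps the function usable without network access.
    """
    results = {}
    for host in domains:
        try:
            connection = connect((host, port), timeout)
            connection.close()
            results[host] = True
        except OSError:
            # Covers DNS failures, refused connections, and timeouts.
            results[host] = False
    return results
```

Calling check_domains(DOMAINS) on the server returns a dictionary that maps each domain to True or False; any False entry points to a firewall or proxy rule that still needs adjusting.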
GDPR (DSGVO), EU AI Act & Technical Advantages vs. Dedicated AI Server Approach
No separate AI server setup (fewer systems to secure and audit).
Reduced data transfer: processing stays within the plugin runtime.
Simpler compliance scope: fewer endpoints and lower operational overhead.
Lower latency and fewer network dependencies for OCR execution.
Easier data minimization: fewer data copies and storage locations.
Clearer accountability boundaries for processor/controller roles.
Simplified breach response: No separate AI server to manage in case of incidents.
Easier implementation of data subject rights (access, deletion) without coordinating with a separate AI service.
Plugin does not store image data or OCR results persistently, minimizing data retention concerns.
Simplest response to deletion requests: data is never stored, not even in server backups, so there is nothing to delete.
Note On Removal
Once OCR has been activated, the DLL in use is locked into memory, which makes it impossible to delete the plugin's files from the server. This is a technical limitation of the Tesseract library, not a CodBi-specific issue. To remove the plugin after activation, first disable the plugin, then reboot the server. After that, you can delete the plugin.
Functions
If activated by the CodBi plugin property Active_AI containing OCR, uses AI's janitor to store images that have an ID (transmitted in the X-OCR-Image-ID header) and extracts all text from the images that were transmitted or referenced via X-OCR-Image-ID.
Initializes this plugin if the CodBi plugin property Active_AI contains OCR (case-insensitive). Determining the pluginRoot tells the execute method where to store the temporary images. Furthermore, the native libraries appropriate for the server's OS are extracted from the JAR and copied onto the server's drive. They are then cloned so that the copies in use are not locked by possible previous initializations of the plugin. This servlet checks whether the models for the languages specified via the CodBi plugin property AI_Tesseract_Languages (e.g. deu+ita+eng or just deu) are already present in the plugin's local resources and, if not, downloads the model for each missing language automatically. If the property is not set, deu is assumed.
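The model check described above can be sketched as follows. This is not the plugin's code: the function names are hypothetical, and the sketch only assumes Tesseract's usual convention of naming model files <language>.traineddata. Where the plugin keeps these files is not stated here.

```python
def parse_languages(spec=None):
    """Split a language specification such as 'deu+ita+eng' into codes.

    Falls back to ['deu'] when the property is not set or is empty.
    """
    if not spec or not spec.strip():
        return ["deu"]
    return [code.strip() for code in spec.split("+") if code.strip()]


def missing_models(spec, present_files):
    """Return the language codes whose model file is not present yet."""
    present = set(present_files)
    return [
        lang
        for lang in parse_languages(spec)
        if f"{lang}.traineddata" not in present
    ]


# Only deu is available locally, so ita and eng would have to be downloaded.
print(missing_models("deu+ita+eng", ["deu.traineddata"]))
```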
Initiates a task that removes unused images that have expired (msExpirationIDedImages) from the cache (cacheIDedImages).
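A cleanup task of this kind can be pictured with the sketch below. The class and method names are hypothetical. It assumes that an image counts as unused once it has not been stored or read for longer than the expiration time given in milliseconds, which may differ from how the plugin tracks usage.

```python
import time


class IdedImageCache:
    """Keeps images by ID and drops those unused for too long."""

    def __init__(self, ms_expiration, clock=time.monotonic):
        self._ms_expiration = ms_expiration
        self._clock = clock  # returns seconds; injectable for testing
        self._entries = {}  # image ID -> (image bytes, time of last use)

    def put(self, image_id, data):
        self._entries[image_id] = (data, self._clock())

    def get(self, image_id):
        entry = self._entries.get(image_id)
        if entry is None:
            return None
        data, _ = entry
        # Reading an image counts as use and restarts its expiration time.
        self._entries[image_id] = (data, self._clock())
        return data

    def purge_expired(self):
        """Remove expired entries and return their IDs."""
        now = self._clock()
        expired = [
            image_id
            for image_id, (_, last_used) in self._entries.items()
            if (now - last_used) * 1000.0 > self._ms_expiration
        ]
        for image_id in expired:
            del self._entries[image_id]
        return expired
```

A background task would call purge_expired() at a fixed interval; the interval itself is not specified here.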
Wipes the local data needed to run Tesseract if Active_AI does not contain OCR. Furthermore, AI_Tesseract_Languages, if set, is checked for compliance with ^[a-z]{3}(\s\+\s[a-z]{3})*$.
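The compliance check can be tried out with a short sketch. It reads the pattern with its character classes written out as [a-z]{3}, which is an assumption about the intended expression; the function name is hypothetical.

```python
import re

# Assumed reading of the pattern: three lowercase letters, optionally
# followed by further codes, each introduced by " + ".
LANGUAGES_PATTERN = re.compile(r"^[a-z]{3}(\s\+\s[a-z]{3})*$")


def is_valid_language_spec(spec):
    """Check a value of AI_Tesseract_Languages against the assumed pattern."""
    return LANGUAGES_PATTERN.match(spec) is not None


for candidate in ["deu", "deu + eng", "deu+eng", "DEU", "de"]:
    print(candidate, is_valid_language_spec(candidate))
```

Under this reading, each + needs exactly one whitespace character on both sides, so deu + eng passes while deu+eng does not. Uppercase codes and codes with fewer or more than three letters are rejected as well.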