Companion
Properties
Stores the last MAX_HISTORY_SIZE inference durations (in ms) per model-type key. Used by estimateWaitMs to compute average inference time per model type.
Shared semaphore that limits concurrent local AI inferences across all modules (LLAMA, Tesseract, Whisper). Configured via AI_LLAMA_ENGINE_MaxConcurrent (default 2).
Whether the queue-position badge is enabled globally. Configured via AI_QueueBadge.
Tracks every request that is waiting for or currently holds the inference semaphore. Streaming threads register before acquiring; retry-based clients (synchronous LLAMA, Tesseract) register on their first failed tryAcquire. Tickets are removed when inference completes, in the finally block after release. The map value is the creation timestamp for a waiting ticket, or Long.MAX_VALUE for a running inference (which makes it immune to stale cleanup).
Maps ticket UUID → model-type key (e.g. "llama-thinking", "llama-fast", "tesseract"). Registered alongside queueTickets so estimateWaitMs can look up which model types are ahead in the queue and calculate an approximate wait time.
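The ticket lifecycle described above can be sketched as follows. All names here (runWithTicket, inferenceSemaphore, queueTickets, ticketModelTypes) are hypothetical stand-ins for the documented properties, not the actual implementation:

```kotlin
import java.util.UUID
import java.util.concurrent.ConcurrentHashMap
import java.util.concurrent.Semaphore

// Hypothetical stand-ins for the properties described above.
val inferenceSemaphore = Semaphore(2)                    // AI_LLAMA_ENGINE_MaxConcurrent default
val queueTickets = ConcurrentHashMap<UUID, Long>()       // ticket -> enqueue time, Long.MAX_VALUE while running
val ticketModelTypes = ConcurrentHashMap<UUID, String>() // ticket -> model-type key

fun <T> runWithTicket(modelType: String, inference: () -> T): T {
    val ticket = UUID.randomUUID()
    queueTickets[ticket] = System.currentTimeMillis()    // waiting: creation timestamp
    ticketModelTypes[ticket] = modelType
    inferenceSemaphore.acquire()
    try {
        queueTickets[ticket] = Long.MAX_VALUE            // running: immune to stale cleanup
        return inference()
    } finally {
        inferenceSemaphore.release()
        queueTickets.remove(ticket)                      // removed after release, as documented
        ticketModelTypes.remove(ticket)
    }
}
```

Registering the ticket before acquiring the semaphore is what lets estimateWaitMs see requests that are still queued, not just those already running.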
Functions
Removes waiting tickets older than 30 s, which are assumed to belong to clients that abandoned the queue. Tickets for running inferences (value Long.MAX_VALUE) are never removed.
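A minimal sketch of that sweep, assuming the ticket map shape described under Properties (the helper name and constant are illustrative):

```kotlin
import java.util.UUID

const val STALE_TICKET_MS = 30_000L  // assumed 30 s threshold

fun cleanupStaleTickets(tickets: MutableMap<UUID, Long>, now: Long = System.currentTimeMillis()) {
    tickets.entries.removeIf { (_, createdAt) ->
        // Running tickets hold Long.MAX_VALUE and so never match the age check.
        createdAt != Long.MAX_VALUE && now - createdAt > STALE_TICKET_MS
    }
}

fun main() {
    val tickets = mutableMapOf(
        UUID.randomUUID() to 0L,             // enqueued long ago: stale, removed
        UUID.randomUUID() to Long.MAX_VALUE  // running: kept
    )
    cleanupStaleTickets(tickets, now = 40_000L)
    println(tickets.size)  // 1
}
```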
Estimates the total wait time (in ms) for a ticket by summing the average inference duration of every ticket ahead of it in the queue. Only tickets whose model type has recorded history contribute to the estimate. Returns null if no estimate is possible (no history for any of the queued model types).
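The estimate described above could look roughly like this. This is a hedged sketch under the assumption that "ahead in the queue" means running tickets plus waiting tickets enqueued earlier; all parameter names are illustrative:

```kotlin
import java.util.UUID

fun estimateWaitMs(
    ticket: UUID,
    queueTickets: Map<UUID, Long>,            // ticket -> enqueue time, Long.MAX_VALUE while running
    ticketModelTypes: Map<UUID, String>,      // ticket -> model-type key
    durationHistory: Map<String, List<Long>>  // model-type key -> recent durations (ms)
): Long? {
    val enqueuedAt = queueTickets[ticket] ?: return null
    // "Ahead" = running tickets plus waiting tickets enqueued no later than ours.
    val ahead = queueTickets.filterKeys { it != ticket }
        .filterValues { it == Long.MAX_VALUE || it <= enqueuedAt }
        .keys
    // Only tickets whose model type has recorded history contribute.
    val perTicket = ahead.mapNotNull { id ->
        durationHistory[ticketModelTypes[id]]
            ?.takeIf { it.isNotEmpty() }
            ?.average()?.toLong()
    }
    return if (perTicket.isEmpty()) null else perTicket.sum()
}
```

Returning null rather than 0 when no history exists lets callers distinguish "no wait" from "no idea", e.g. to suppress the queue-position badge.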
Records the duration of a completed inference for a given model type.
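Combined with the MAX_HISTORY_SIZE cap noted under Properties, the recording step amounts to a bounded ring of recent durations. A sketch, assuming a value of 20 for the cap (the real constant may differ):

```kotlin
const val MAX_HISTORY_SIZE = 20  // assumed value

// Append the newest duration and trim the oldest so at most MAX_HISTORY_SIZE remain.
fun recordDuration(
    history: MutableMap<String, ArrayDeque<Long>>,  // model-type key -> recent durations (ms)
    modelType: String,
    durationMs: Long
) {
    val ring = history.getOrPut(modelType) { ArrayDeque() }
    ring.addLast(durationMs)
    while (ring.size > MAX_HISTORY_SIZE) ring.removeFirst()
}
```

Capping the history keeps the average responsive to recent model or hardware changes instead of being dragged by old outliers.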
Replaces the shared inference semaphore with a fresh semaphore configured with the new concurrency limit.
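One way to make such a replacement safe is to hold the semaphore in a volatile field, so new requests see the new limit immediately while in-flight requests release into the old instance they acquired. This sketch is an assumption about the mechanism, not the actual implementation:

```kotlin
import java.util.concurrent.Semaphore

class InferenceLimiter(initialLimit: Int) {
    @Volatile
    var semaphore: Semaphore = Semaphore(initialLimit)
        private set

    fun setMaxConcurrent(limit: Int) {
        // In-flight requests still release into the old instance, which is
        // garbage-collected once the last holder finishes.
        semaphore = Semaphore(limit)
    }
}
```

For this to be correct, each request must acquire and release the same Semaphore instance (e.g. capture it in a local before acquiring) rather than re-reading the field at release time.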