Patch: Release large-DICOM semaphore lock early (for better performance)

While tracking down performance bottlenecks in our Orthanc deployment on fly.io using Postgres for index and Tigris as an S3 storage backend, I found one issue:

In ServerContext.cpp in the DicomCacheLocker constructor, there is logic that is supposed to limit cache-locking to (currently hard-coded) at most one cache-entry that exceeds 50 MB (currently hard-coded as well). This uses a semaphore locker.

The problem is that the semaphore lock is acquired ahead of time (to be sure) and only released once the DICOM data is completely retrieved (ServerContext:ReadDicom()has finished).

ReadDicom() in our context has two stages:

  1. lookup of the storage attachment from Postgres
  2. retrieval of the payload from S3

Both involve network interactions and can be slow depending on network condition, service load, etc.

In fact, after step 1, the uncompressed size of the payload is already known and the semaphore lock could typically be released before the payload is retrieved. This would allow for better parallelism when serving (“small”) image retrievals.

We actually found this issue when looking at verbose logs and saw that HTTP processing would stop after permissions checks (advanced authorization plugin) and before doing Postgres index lookups unexpectedly until a different thread had finished S3 transfers.

I am attaching a patch that passes in the semaphore locker to ReadDicom() such that the big/small check and lock release can be done there once the index data has been retrieved.

As I am not yet allowed to upload attachments, please see here: Orthanc patch: Release large-dicom semaphore-lock directly after reading index · GitHub

I’d be glad if this (or a similar amendment) could find its way into Orthanc.

We still have performance bottlenecks but they are somewhere else. Our observation is that Orthanc serves image data a lot slower (plus ~ 300ms) when using a Postgres index instead of the local sqlite index even though raw Postgres latency is at about 3-4 ms.

Hi @tvogel

Thanks for your patch, I have included it in this commit.

Best regards,

Alain.

Great, thanks a lot!

Ah, any idea what the cause for our Postgres being that much slower could be? I have read Scalability of Orthanc — Orthanc Book documentation and had hoped our 3-4 ms Postgres round-trip to be sufficient but there must be some additional problem.

Logs logs logs :wink: and a description of your setup, stats and what you are doing …

Good news: It turned out not to be a problem in Orthanc but instead was an “unexpected feature” of our S3 provider (Tigris data): When you switch a single-region bucket from one region to another, only new objects are allocated in the new region - old objects stay where they were created. I had expected them to be migrated (at least after access from the target region).

That led to the misleading conclusion that Postgres was the reason while in fact, the reason was that the objects referred in the Postgres index “happened” to be located in Chicago while the ones referred in the SQLite index were from much closer Frankfurt. This was because of the order of events while populating the indexes.

Clearing everything out and setting it up consistently again made the timing differences go away.

1 Like