Performance issue when adding a lot of jobs in the queue

Hello there!

We encountered a strange issue recently.

We are using Orthanc to manage a decentralized network of DICOM storages. We rely heavily on jobs to move DICOM files around the network, with the TransferAccelerator plugin between Orthanc instances and with standard DICOM operations with the other DICOM storages.

One of our users triggered an operation that leads to the creation of 750 jobs on one Orthanc instance.

We thought it wouldn't be an issue, because our configuration allows 4 parallel jobs. We assumed that processing those 750 jobs would simply take some time, since there is a queue system in place.

Instead, the server was set on fire: about 48 GB of RAM were consumed (that's almost 100% on that machine…) and the CPU was very busy too. A lot of jobs failed, we think because of these performance issues…

One important note is that we rely a lot on the job API to keep track of progress: we call that API once per job, so with that many jobs, that's a lot of requests…
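Roughly, our tracking looks like the following sketch (the base URL and the job id list are just for illustration):

import requests

# Simplified sketch of our polling: one GET /jobs/{id} per job;
# "Progress" goes from 0 to 100 (hypothetical base URL)
job_ids = ["some-job-id"]  # the ids returned when the jobs were created

for job_id in job_ids:
    job = requests.get(f"http://orthanc:8042/jobs/{job_id}").json()
    print(job_id, job["State"], job["Progress"])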

We are not sure what caused those performance issues. Is there some specific processing applied to a job whenever it is posted to the server? Or could the job API itself be the root cause?

I know that we are heavy Orthanc users and that our case is not a common one… If anything is unclear, we can give more details :wink:.

Thanks
Stéphane

Hi Stéphane,

I played a bit with a high number of jobs, using this script that creates 1000 DicomStore jobs or 1000 Transfer jobs.

import threading
import time
from orthanc_api_client import OrthancApiClient

o = OrthancApiClient("http://localhost:8043")


jobs_count = 1000
jobs_ids = []

poller_threads_count = 20
poller_threads = []

# Create the jobs: uncomment the first line to create DicomStore jobs
# instead of Transfer jobs
for i in range(0, jobs_count):
    # job = o.modalities.send_async(target_modality="service", resources_ids=["737c0c8d-ea890b4d-e36a43bb-fb8c8d41-aa0ed0a8"])
    job = o.transfers.send_async(target_peer="win-service", resource_type="study", resources_ids=["737c0c8d-ea890b4d-e36a43bb-fb8c8d41-aa0ed0a8"])

    print(f"{i} created job")
    jobs_ids.append(job.info.orthanc_id)

def poll_job(thread_id: str):
    # Each poller thread loops forever over all the jobs,
    # issuing one GET /jobs/{id} every 100 ms
    print(f"{thread_id} started poller thread")
    while True:
        for job_id in jobs_ids:
            time.sleep(0.1)
            o.jobs.get_json(job_id)
        print(f"{thread_id} ... {len(jobs_ids)}")

# Start the poller threads, then let them run for a while
for i in range(0, poller_threads_count):
    t = threading.Thread(target=poll_job, args=(str(i), ))
    t.start()

time.sleep(500)

When creating Transfer jobs, the memory usage stays stable around 1 GB and the CPU load around 1100% (equivalent to 11 cores running at 100%; I have 12 cores). This is what I expect, since the TransferJobs use multiple threads and 4 jobs are running together.

When creating DicomStore jobs, the memory usage goes higher, around 3 GB, and the CPU load stays around 200%.

The DicomStore jobs actually consume a lot more space than the TransferJobs because, right now, the TransferJobs are not serialized in the registry at all (I have added a TODO for that).

This makes a huge difference in this case, where I'm sending a study of 6000 instances: serializing the job registry with 1000 DicomStore jobs takes more than 10 seconds, and the content is 300 MB once serialized, on top of the original native objects. This can explain, at least partially, the 3 GB memory consumption.

So, right now, I was unable to reproduce a memory load as high as yours. Can you clarify what kind of jobs were being created (DicomStore or TransferJobs)? Were you handling particularly large studies with tens of thousands of instances?

Any chance you can provide a reproducible setup?

Anyway, it seems the current implementation of the job registry is not suitable for handling a very large number of jobs, and we should probably think of implementing it with a dedicated table in the DB; that would also allow filtering the jobs based on their status.
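In the meantime, any filtering by status has to happen on the client side; a minimal sketch with the same orthanc_api_client as above (assuming its generic get_json helper; /jobs?expand returns the full content of every job in the registry):

from orthanc_api_client import OrthancApiClient

o = OrthancApiClient("http://localhost:8043")

# /jobs?expand returns every job with its current state,
# so filtering by status must be done by the client
jobs = o.get_json("jobs?expand")
failed = [j for j in jobs if j["State"] == "Failure"]
print(f"{len(failed)} failed jobs out of {len(jobs)}")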

Best regards,

Alain.

Hello Alain!

Sorry for the very late response… We haven’t encountered this issue since the day I created this thread…

However, we were creating DicomStore jobs to retrieve data from a DICOM storage into our Orthanc server. There were also TransferJobs, but in our case most of them were DicomStore jobs (I know that's not a very precise data point ^^).

I think the main problem on our side is that we regularly check the state of the Orthanc jobs through the API. So I guess this implies serializing the job details every time we check the progress…

Since my last message, have you looked into the database change that could help manage a large number of jobs?

Thanks
Stéphane

I don't think the serialization is the issue. The job registry's main data structure stays in memory and is serialized to disk every 10 seconds, not every time you check the progress. Of course, if you check every 1 ms, even converting the in-memory object to JSON to build the response can become CPU intensive.
Note that, as an alternative to polling, you may be able to implement Lua callbacks (OnJobSuccess/OnJobFailure) or Python callbacks (OnChange + JOB_FAILURE/JOB_SUCCESS).
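With the Python plugin, a minimal sketch could look like this (the webhook URL and payload are hypothetical; for job events, the resource id passed to the callback is the job id):

import orthanc
import requests

# Hypothetical endpoint of your application
WEBHOOK_URL = "http://our-app/orthanc-job-events"

def OnChange(changeType, level, resourceId):
    # For job events, resourceId contains the job id
    if changeType == orthanc.ChangeType.JOB_SUCCESS:
        requests.post(WEBHOOK_URL, json={"job_id": resourceId, "state": "Success"})
    elif changeType == orthanc.ChangeType.JOB_FAILURE:
        requests.post(WEBHOOK_URL, json={"job_id": resourceId, "state": "Failure"})

orthanc.RegisterOnChangeCallback(OnChange)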

Not yet! We are working on the DB, but not yet on that topic. However, more and more people are requesting it, so I hope it will reach the top of the priority list in the coming months.

Thank you for your answer!

For sure, we are not polling every 1 ms.

Yes, we are thinking about adding a webhook to the application instead of polling; this will be a lot lighter.