Performance issue when adding a lot of jobs in the queue

Hello there!

We encountered a strange issue recently.

We are using Orthanc to manage a decentralized network of DICOM storages. We rely heavily on jobs to move DICOM files around the network, using the TransferAccelerator plugin between Orthanc instances and standard DICOM operations with the other DICOM storages.

One of our users triggered an operation that led to the creation of 750 jobs on one Orthanc instance.

We thought it wouldn't be an issue because our configuration allows 4 parallel jobs; processing those 750 jobs would just take some time, since there is a queue system in place.

Instead, the server was set on fire: about 48 GB of RAM were consumed (that's almost 100% on that machine…) and the CPU was very busy too. A lot of jobs failed, we believe because of these performance issues…

One important note: we rely heavily on the jobs API to keep track of progress. We poll this API for each job, so with that many jobs…
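
To give an idea, this is roughly what we do for every job (a simplified sketch, not our actual code; the URL and polling interval are illustrative):

import time
import requests

ORTHANC = "http://localhost:8042"

def track_job(job_id: str):
    # Poll the standard /jobs/{id} route until the job reaches a final state.
    while True:
        job = requests.get(f"{ORTHANC}/jobs/{job_id}").json()
        print(f"{job_id}: {job['State']} ({job.get('Progress', 0)}%)")
        if job["State"] in ("Success", "Failure"):
            return job
        time.sleep(1)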

We are not sure what caused these performance issues. Is there some specific processing applied to a job whenever it is posted to the server? Or could the jobs API be the root cause?

I know we are heavy Orthanc users and that our case is not a common one… If anything is unclear, we can give more details :wink:.

Thanks
Stéphane

Hi Stéphane,

I played a bit with a high number of jobs, using the script below, which creates 1000 DicomStore jobs or 1000 Transfer jobs and then polls their status from 20 threads.

import threading
import time

from orthanc_api_client import OrthancApiClient

o = OrthancApiClient("http://localhost:8043")

jobs_count = 1000
jobs_ids = []

poller_threads_count = 20
poller_threads = []

# Create the jobs (uncomment the first line instead to create DicomStore jobs).
for i in range(0, jobs_count):
    # job = o.modalities.send_async(target_modality="service", resources_ids=["737c0c8d-ea890b4d-e36a43bb-fb8c8d41-aa0ed0a8"])
    job = o.transfers.send_async(target_peer="win-service", resource_type="study", resources_ids=["737c0c8d-ea890b4d-e36a43bb-fb8c8d41-aa0ed0a8"])

    print(f"{i} created job")
    jobs_ids.append(job.info.orthanc_id)

# Each poller thread loops forever, querying the status of every job to
# simulate a client that polls the jobs API heavily.
def poll_jobs(thread_id: str):
    print(f"{thread_id} started poller thread")
    while True:
        for job_id in jobs_ids:
            time.sleep(0.1)
            o.jobs.get_json(job_id)
        print(f"{thread_id} ... {len(jobs_ids)}")

for i in range(0, poller_threads_count):
    t = threading.Thread(target=poll_jobs, args=(str(i),))
    t.start()
    poller_threads.append(t)

# Keep the main thread alive while the pollers run.
time.sleep(500)

When creating Transfer jobs, the memory usage stays stable around 1 GB and the CPU load around 1100% (equivalent to 11 cores running at 100% - I have 12 cores). This is what I expect, since each Transfer job uses multiple threads and 4 jobs are running together.

When creating DicomStore jobs, the memory usage goes higher, around 3 GB, and the CPU load stays around 200%.

The DicomStore jobs actually consume a lot more memory than the Transfer jobs because, right now, the Transfer jobs are not serialized in the registry at all (I have added a TODO for that).

This makes a huge difference in this case, where I'm sending a study of 6000 instances: serializing the job registry with 1000 DicomStore jobs takes more than 10 seconds, and the serialized content weighs 300 MB on top of the original native objects. This can explain, at least partially, the 3 GB memory consumption.
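
As a side note, if you want to check how large the serialized registry is on your side, you can look at the GlobalProperties table of the SQLite index. This is just a rough sketch: it assumes the default SQLite back-end, that property 5 is the jobs registry (please double-check against your Orthanc version), and a Debian-style path.

import sqlite3

# Adapt the path to your StorageDirectory; the SQLite index file is named "index".
conn = sqlite3.connect("/var/lib/orthanc/db-v6/index")
row = conn.execute(
    "SELECT LENGTH(value) FROM GlobalProperties WHERE property = 5"
).fetchone()
if row and row[0] is not None:
    print(f"Serialized jobs registry: {row[0] / (1024 * 1024):.1f} MB")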

So, for now, I have been unable to reproduce a memory load as high as yours. Can you clarify what kind of jobs were being created (DicomStore or Transfer jobs)? Were you handling particularly large studies with tens of thousands of instances?

Any chance you can provide a reproducible setup?

Anyway, it seems the current implementation of the job registry is not suitable for handling a very large number of jobs, and we should probably think about implementing it with a dedicated table in the database - that would also allow filtering jobs by their status, as sketched below.
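
For the sake of discussion, a dedicated table could look roughly like this (a purely hypothetical schema sketched with SQLite, not how Orthanc currently stores jobs):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Jobs(
        id TEXT PRIMARY KEY,
        type TEXT,        -- e.g. DicomStore, PushTransfer, ...
        state TEXT,       -- Pending, Running, Success, Failure, ...
        priority INTEGER,
        serialized TEXT   -- per-job payload, deserialized only on demand
    )""")
conn.execute("CREATE INDEX JobsByState ON Jobs(state)")

# Filtering by status becomes a cheap indexed query instead of
# deserializing the whole registry at once:
running = conn.execute(
    "SELECT id FROM Jobs WHERE state = 'Running'").fetchall()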

Best regards,

Alain.