Performance issue when adding a lot of jobs in the queue

Hello there!

We encountered a strange issue recently.

We are using Orthanc to manage a decentralized network of DICOM storages. We rely heavily on jobs to move DICOM files around the network, with the TransferAccelerator plugin between Orthanc instances and with standard DICOM operations with the other DICOM storages.

One of our users triggered an operation that leads to the creation of 750 jobs on one Orthanc instance.

We thought it wouldn't be an issue, because our configuration allows 4 parallel jobs. We assumed that processing those 750 jobs would simply take some time, since there is a queue system in place.

Instead, the server was set on fire: about 48 GB of RAM were consumed (that's almost 100% on that machine…) and the CPU was very busy too. A lot of jobs failed, we think because of these performance issues…

One important note is that we rely a lot on the job API to keep track of progress: we call that API once per job, so with that many jobs, that's a lot of requests…
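Roughly, our tracking looks like the following sketch (the base URL and the job id list are just for illustration):

import requests

# Simplified sketch of our polling: one GET /jobs/{id} per job;
# "Progress" goes from 0 to 100 (hypothetical base URL)
job_ids = ["some-job-id"]  # the ids returned when the jobs were created

for job_id in job_ids:
    job = requests.get(f"http://orthanc:8042/jobs/{job_id}").json()
    print(job_id, job["State"], job["Progress"])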

We are not sure what caused those performance issues. Is there some specific processing applied to a job whenever it is posted to the server? Or could the job API itself be the root cause?

I know that we are heavy Orthanc users and that our case is not a common one… If anything is unclear, we can give more details :wink:.

Thanks
Stéphane

Hi Stéphane,

I played a bit with a high number of jobs, using this script that creates 1000 DicomStore jobs or 1000 Transfer jobs.

import threading
import time
from orthanc_api_client import OrthancApiClient

o = OrthancApiClient("http://localhost:8043")


jobs_count = 1000
jobs_ids = []

poller_threads_count = 20
poller_threads = []

# Create the jobs: uncomment the first line to create DicomStore jobs
# instead of Transfer jobs
for i in range(0, jobs_count):
    # job = o.modalities.send_async(target_modality="service", resources_ids=["737c0c8d-ea890b4d-e36a43bb-fb8c8d41-aa0ed0a8"])
    job = o.transfers.send_async(target_peer="win-service", resource_type="study", resources_ids=["737c0c8d-ea890b4d-e36a43bb-fb8c8d41-aa0ed0a8"])

    print(f"{i} created job")
    jobs_ids.append(job.info.orthanc_id)

def poll_job(thread_id: str):
    # Each poller thread loops forever over all the jobs,
    # issuing one GET /jobs/{id} every 100 ms
    print(f"{thread_id} started poller thread")
    while True:
        for job_id in jobs_ids:
            time.sleep(0.1)
            o.jobs.get_json(job_id)
        print(f"{thread_id} ... {len(jobs_ids)}")

# Start the poller threads, then let them run for a while
for i in range(0, poller_threads_count):
    t = threading.Thread(target=poll_job, args=(str(i), ))
    t.start()

time.sleep(500)

When creating Transfer jobs, the memory usage stays stable around 1 GB and the CPU load around 1100% (equivalent to 11 cores running at 100%; I have 12 cores). This is what I expect, since the TransferJobs use multiple threads and 4 jobs are running together.

When creating DicomStore jobs, the memory usage goes higher, around 3 GB, and the CPU load stays around 200%.

The DicomStore jobs actually consume a lot more space than the TransferJobs because, right now, the TransferJobs are not serialized in the registry at all (I have added a TODO for that).

This makes a huge difference in this case, where I'm sending a study of 6000 instances: serializing the job registry with 1000 DicomStore jobs takes more than 10 seconds, and the content is 300 MB once serialized, on top of the original native objects. This can explain, at least partially, the 3 GB memory consumption.

So, right now, I was unable to reproduce a memory load as high as yours. Can you clarify what kind of jobs were being created (DicomStore or TransferJobs)? Were you handling particularly large studies with tens of thousands of instances?

Any chance you can provide a reproducible setup?

Anyway, it seems the current implementation of the job registry is not suitable for handling a very large number of jobs, and we should probably think of implementing it with a dedicated table in the DB; that would also allow filtering the jobs based on their status.
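In the meantime, any filtering by status has to happen on the client side; a minimal sketch with the same orthanc_api_client as above (assuming its generic get_json helper; /jobs?expand returns the full content of every job in the registry):

from orthanc_api_client import OrthancApiClient

o = OrthancApiClient("http://localhost:8043")

# /jobs?expand returns every job with its current state,
# so filtering by status must be done by the client
jobs = o.get_json("jobs?expand")
failed = [j for j in jobs if j["State"] == "Failure"]
print(f"{len(failed)} failed jobs out of {len(jobs)}")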

Best regards,

Alain.

Hello Alain!

Sorry for the very late response… We haven’t encountered this issue since the day I created this thread…

However, we were creating DicomStore jobs to retrieve data from a DICOM storage into our Orthanc server. There were also TransferJobs, but in our case most of them were DicomStore jobs (I know that's not a very precise data point ^^).

I think the main problem on our side is that we regularly check the state of the Orthanc jobs through the API. So I guess this implies serializing the job details every time we check the progress…

Since my last message, have you looked into the database change that could help manage a large number of jobs?

Thanks
Stéphane

I don't think the serialization is the issue. The job registry's main data structure stays in memory and is serialized to disk every 10 seconds, not every time you check the progress. Of course, if you check every 1 ms, even converting the in-memory object to JSON to build the response can become CPU intensive.
Note that, as an alternative to polling, you may be able to implement Lua callbacks (OnJobSuccess/OnJobFailure) or Python callbacks (OnChange + JOB_FAILURE/JOB_SUCCESS).
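With the Python plugin, a minimal sketch could look like this (the webhook URL and payload are hypothetical; for job events, the resource id passed to the callback is the job id):

import orthanc
import requests

# Hypothetical endpoint of your application
WEBHOOK_URL = "http://our-app/orthanc-job-events"

def OnChange(changeType, level, resourceId):
    # For job events, resourceId contains the job id
    if changeType == orthanc.ChangeType.JOB_SUCCESS:
        requests.post(WEBHOOK_URL, json={"job_id": resourceId, "state": "Success"})
    elif changeType == orthanc.ChangeType.JOB_FAILURE:
        requests.post(WEBHOOK_URL, json={"job_id": resourceId, "state": "Failure"})

orthanc.RegisterOnChangeCallback(OnChange)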

Not yet! We are working on the DB, but not yet on that topic. However, more and more people are requesting it, so I hope it will reach the top of the priority list in the coming months.

Thank you for your answer!

For sure, we are not polling every 1 ms.

Yes, we are thinking about adding a webhook to the application instead of polling; this will be a lot lighter.