Frozen Jobs and REST API not responding after stuck C-MOVE

Hello,

We have encountered an issue and I am hoping to get some help looking into it. Unfortunately I’m not yet able to reproduce it locally, and my access to the instance where the issue occurred is limited. But here is what we know:

Set-up:

  1. We initiated a C-MOVE SCU request via the REST API (Orthanc configured for Synchronous C-MOVEs). The C-MOVE is to move a study from a remote PACS to Orthanc.
  2. Eventually the application that initiated the C-MOVE through the API times out and closes the connection.
  3. The C-MOVE job remains in a ‘running’ state indefinitely after this.
  4. We never see any corresponding incoming C-STORE requests. (I am unsure yet if we are getting any sort of C-MOVE-RSP from the PACS).

I suspect the lack of incoming C-STORE requests is due to a networking or configuration issue on the PACS end. However, here is where we are running into an issue with Orthanc:

  1. The job remains indefinitely.
  2. Attempting to cancel or pause the job via the API returns a 200 OK, but the job remains in a running state. (This persists across reboots.) I did find another description of this issue: Unable to Pause or Cancel running jobs (DICOM MOVE SCU)
  3. Attempting to use the /reset endpoint to restart Orthanc hangs. Orthanc begins shutting down but freezes before completing.
  4. Certain other REST API requests never respond. Initiating a C-ECHO through the API seems to work, but a C-FIND does not. Crucially, though, the trace-level logs show that the DICOM operations succeed: the C-FIND occurs, but the HTTP request never returns a response.
  5. The orthanc_jobs_running and orthanc_rest_api_active_requests metrics keep increasing as I make more requests; the jobs and API requests appear to be stuck in a running state.

Orthanc Version 1.12.7

Here is relevant config

{
  "DicomAet": "${DICOM_AET}",
  "Plugins": ["/usr/share/orthanc/plugins", "/usr/local/share/orthanc/plugins"],
  "UserMetadata": {
    "DeletionDate": 3030,
  },

  "StorageDirectory": "/var/lib/orthanc/db",
  "MaximumStorageSize": 51200,
  "MaximumStorageCacheSize": 6144,
  "MediaArchiveSize": 10,

  "DicomThreadsCount": 16,
  "ConcurrentJobs": 0, // (Unlimited)
  "StorageAccessOnFind": "Never", // Fastest setting - uses DB Index whenever possible

  "RemoteAccessAllowed": true,
  "AuthenticationEnabled": false,

  "DatabaseServerIdentifier": "Orthanc1",
  "DicomModalitiesInDatabase": true,
  "PostgreSQL": {
    "EnableIndex": true,
    "EnableStorage": false,
    "Host": "${HOST}",
    "Port": 5432,
    "Database": "orthanc",
    "Username": "${POSTGRES_USER}",
    "Password": "${POSTGRES_PASSWORD}",
    "EnableSsl": false,
    "MaximumConnectionRetries": 10,
    "ConnectionRetryInterval": 5,
    "IndexConnectionsCount": 50,
    "EnableVerboseLogs": false
  }
}

I will update if able to provide a minimal reproducible example. In the meantime any insight into how to mitigate or debug this issue would be appreciated.

Hi,

First, note that "ConcurrentJobs": 0 does not mean unlimited: a value of 0 indicates that Orthanc should use all the available CPU logical cores.

Since you are sharing only the relevant configuration: are you sure you have not configured DICOM SCU/SCP timeouts?

Imagine you have only one concurrent job and very large DICOM timeouts: a DICOM job could then block the job engine for a very long time while waiting for DICOM messages.
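For instance, making the job limit and the timeouts explicit would look like this (values purely illustrative):

```json
{
  "ConcurrentJobs": 4,      // explicit limit instead of 0 (= number of CPU logical cores)
  "DicomScuTimeout": 30,    // seconds; a value of 0 disables the timeout
  "DicomScpTimeout": 30
}
```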

/reset probably waits for the current jobs to complete or, at least, to yield after a step.

BTW, it would be very helpful to have a reproducible setup (or at least full logs).

Best regards,

Alain.

Thank you for your input. The config above is almost complete (we have it split into several files). The other file does not contain any timeout config, but I will double check to make sure that we are not setting that anywhere else.

I am still working to reproduce this, but so far I haven’t been able to mimic the production PACS locally, so I haven’t been able to trigger the same behavior. However, your input has given me an idea of how I might. I’ll report back when I have more information.

Hello @alainmazy, I have some more information for you.

Timeout Configuration

I confirmed that we are not setting DicomScuTimeout. However, it appears that the Timeout property was set to 0 for the modality. We did not explicitly set this when configuring the modality, so perhaps it was given a default value in the database? We did try setting it to 30 but did not see a change in behavior. However, we already had stuck jobs at that time, so it’s possible we were encountering a different issue. I will try setting it to 30 again today once there are no stuck jobs.

Can you confirm what a value of 0 means in this context? I know that for DicomScuTimeout a value of 0 indicates no timeout. In this case, however, does 0 mean no timeout, or does it mean “do not override DicomScuTimeout”? If it does not mean no timeout, but it is nevertheless set to 0 when no value is passed at modality creation, I would consider that unexpected behavior: I would expect that not passing a value would leave the default in place.
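For reference, this is roughly how we are setting the Timeout (a sketch: the modality alias, AET, and host are placeholders, and my assumption is that the extended modality format accepted by PUT /modalities/{id} includes a Timeout field that overrides DicomScuTimeout):

```python
# Sketch of updating a modality with an explicit Timeout via
# PUT /modalities/{id}. The alias, AET, and host are placeholders; "Timeout"
# overriding DicomScuTimeout is my understanding, not something I verified.
import json

ORTHANC = "http://localhost:8042"  # placeholder

def build_modality_update(alias, aet, host, port, timeout_seconds):
    url = f"{ORTHANC}/modalities/{alias}"
    body = {"AET": aet, "Host": host, "Port": port, "Timeout": timeout_seconds}
    return url, json.dumps(body)

url, payload = build_modality_update("pacs", "REMOTE_PACS", "10.0.0.5", 104, 30)
print(url)
```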

Stuck C-MOVE

We were able to reproduce a stuck C-MOVE. Here are the steps that got there:

  1. Issue a C-FIND (success)
  2. Use answers/$id/retrieve to start a C-MOVE
  3. We see the successful association and then a C-MOVE-RQ. There is no response. The API request never returns and I don’t see anything more in the logs on the Orthanc side.
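In code, the sequence we issue looks roughly like this (a sketch: the query ID comes from the C-FIND response in step 1, and all AETs and UIDs below are placeholders):

```python
# The reproduction steps above as the two REST calls we issue. The query ID
# is returned by the C-FIND (step 1); the answer index selects one match.
# All AETs/UIDs here are placeholders.
def build_repro_requests(modality, target_aet, study_uid, query_id, answer_index=0):
    # Step 1: C-FIND via POST /modalities/{id}/query
    find = (
        f"/modalities/{modality}/query",
        {"Level": "Study", "Query": {"StudyInstanceUID": study_uid}},
    )
    # Step 2: C-MOVE via POST /queries/{qid}/answers/{index}/retrieve --
    # this is the request that hangs for us.
    retrieve = (
        f"/queries/{query_id}/answers/{answer_index}/retrieve",
        {"TargetAet": target_aet, "Synchronous": True},
    )
    return find, retrieve

find, retrieve = build_repro_requests("pacs", "ORTHANC", "1.2.840.0.0.0", "some-query-id")
print(find[0], retrieve[0])
```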

REST API and Jobs Frozen

After this, we begin seeing the “stuck” behavior from Orthanc. Some API requests return successfully. Others trigger DICOM operations but never return. Still others don’t even trigger a DICOM operation.

Here is a summary of the behavior we see:

  • /modalities/$id/query shows a successful DICOM request and response, but the REST API request never returns a response
  • subsequent retrieves (C-MOVEs) do not even send a DICOM request
  • unable to cancel, pause, or delete jobs – they remain in a ‘Running’ state
  • using the /reset endpoint to restart Orthanc does not work. It logs “Orthanc is stopping” and a few services stop, but then it hangs. We resorted to restarting the Docker container.
  • upon restarting, the jobs are still in a Running state. However, we are able to try another C-MOVE and see the association and request once; subsequent requests then revert to the fully “stuck” behavior.

I am in the process of gathering and cleaning up the logs. We still have not been able to reproduce locally and I don’t have direct access to the server where this is occurring. I will post relevant logs ASAP. Please let me know if there are logs that would be most helpful.