Accelerated Transfer locks Jobs

Hi Team,

I am trying to track down an issue that is happening on different servers on a semi-regular basis. We use the osimis/orthanc Docker image, currently the latest (23.7.1), to receive scans from CT or X-ray devices. These scans are then forwarded to an Orthanc peer using the Accelerated Transfers plugin.

What is happening is that when a push transfer is started, the POST to /transfers/send times out, and then any further request to /jobs also times out.

The only way I’ve been able to recover from this is to restart Orthanc, at which point the transfer is started again by sending a POST to /transfers/send and the scan is then transferred successfully.
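For reference, the request we re-send after the restart is just a POST to /transfers/send. Here is a minimal Python sketch of it; the peer name, study ID and credentials are placeholders, and the body fields are the ones I understand the transfers plugin to expect, so check them against the plugin documentation for your version.

import requests  # third-party: pip install requests

# Placeholder values: adjust the peer name, resource ID and credentials to your setup.
body = {
    "Resources": [{"Level": "Study", "ID": "<orthanc-study-id>"}],
    "Compression": "gzip",
    "Peer": "PEER",  # must match an entry in OrthancPeers
}

r = requests.post(
    "http://localhost:8042/transfers/send",
    json=body,
    auth=("user", "password"),
    timeout=300,  # seconds; in the failure described above, this is the call that times out
)
r.raise_for_status()
print(r.text)  # should contain details of the newly submitted transfer job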

I’ve been able to capture this with trace logging enabled, but there’s nothing obvious in the logs.

I’ve added two logs to this gist: Orthanc Accelerated Transfers Frozen Transfer · GitHub

In the log of the unsuccessful transfer, the last entry before Orthanc appears to become unresponsive is “New job submitted with priority 0: e72c0d9c-ab83-402b-b3c3-4fbe3f587e24” (line 81 @ 08:52:26.889859). After that there are no more log entries until I manually check the health of the Orthanc system @ 08:54:11.814665 by calling /system, which works, and /jobs, which times out.
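That manual health check is easy to script. Below is a minimal sketch of the probe I do by hand; the host, credentials and timeout are placeholders, and it simply flags the frozen state where /system still answers but /jobs no longer does.

import requests  # third-party: pip install requests

BASE = "http://localhost:8042"   # placeholder: adjust to your host/port
AUTH = ("user", "password")      # placeholder credentials
TIMEOUT = 10                     # seconds before we consider a route unresponsive

def probe(route):
    # Returns True if the route answers with a 2xx status within TIMEOUT seconds.
    try:
        r = requests.get(BASE + route, auth=AUTH, timeout=TIMEOUT)
        r.raise_for_status()
        return True
    except requests.RequestException:
        return False

system_ok = probe("/system")
jobs_ok = probe("/jobs")

if system_ok and not jobs_ok:
    # Matches the symptom above: the core REST API answers but the jobs engine does not.
    print("/jobs is unresponsive - Orthanc probably needs a restart")
else:
    print("system:", system_ok, "jobs:", jobs_ok)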

I’m currently rolling back to an older release to see if the issue is related to a newer release. Unfortunately, I can’t replicate the issue.

If there is any further advice on how I can pinpoint the issue, it would be greatly appreciated!

The config file for the system that is experiencing this issue is:

{
  "Transfers": {
    "MaxHttpRetries": 2,
    "BucketSize": 20000
  },
  "RegisteredUsers": {
  },
  "PostgreSQL": {
    "Database": "...",
    "Username": "...",
    "Host": "...",
    "EnableIndex": true,
    "Password": "...",
    "EnableStorage": false,
    "Port": 5432,
    "EnableSsl": false,
    "Lock": false
  },
  "PythonScript": "/var/lib/orthanc/python/orthanc.py",
  "OrthancPeers": {
    "PEER": [
      "http://x.x.x.x:8042/"
    ]
  },
  "DicomModalities": {
    "MODALITY": [
      "AET",
      "x.x.x.x",
      104
    ]
  },
  "StableAge": 30,
  "AuthenticationEnabled": true,
  "DicomPort": 104,
  "HttpTimeout": 300,
  "OverwriteInstances": true,
  "Name": "Name",
  "MaximumStorageSize": 40920,
  "StorageDirectory": "/var/lib/orthanc/db",
  "RemoteAccessAllowed": true,
  "HttpsCACertificates": "/etc/ssl/certs/ca-certificates.crt",
  "Plugins": [
    "/run/orthanc/plugins",
    "/usr/share/orthanc/plugins"
  ],
  "DicomWeb": {
    "Enable": true
  },
  "Gdcm": {
    "Throttling": 4,
    "RestrictTransferSyntaxes": [
      "1.2.840.10008.1.2.4.90",
      "1.2.840.10008.1.2.4.91",
      "1.2.840.10008.1.2.4.92",
      "1.2.840.10008.1.2.4.93"
    ]
  },
  "OrthancExplorer2": {
    "Enable": true,
    "IsDefaultOrthancUI": false
  }
}

Hi James,

We have actually observed the same locks recently with DicomWebStore jobs, and we have a setup where this happens quite often. I was working on that today! Of course, I have not been able to reproduce it on my dev machine :frowning:

I’m pretty sure these are similar issues, since the last two job-related lines in our logs and in yours are identical.

This really looks like a deadlock related to the JobsRegistry mutex but so far, I have not found anything suspicious.

I’m not sure this bug has been introduced recently - the JobsRegistry has not undergone many changes in the past few months.

I’ll keep you posted.

Alain.

Hi Alain, thanks for that. Let me know if I can help, happy to install a debug build if it helps too.
James

Hi James,

I have just built a debug version of the osimis/orthanc Docker image that will dump the core as soon as the /jobs route is no longer responsive. Of course, I have not been able to test it so far.
I have just deployed it on the site where we have had this kind of issue. Now I’m waiting … the last problem occurred 2 days ago, so this might take some time … I’ll keep you posted next week.

So, it would be nice if you could also give it a try on your side to record an event.

Here are the full changes:

And, to use it:

# To test:
mkdir -p /tmp/cores
docker run -v /tmp/cores:/cores -p 8044:8042 osimis/orthanc:debug-script

# To force the generation of a core:
curl -v http://orthanc:orthanc@localhost:8044/generate-core

# To analyse the core:
docker run -v /tmp/cores:/cores -it --entrypoint=bash osimis/orthanc:debug-script
# then, inside the container:
gdb /usr/local/bin/Orthanc /cores/core.1
# and, at the gdb prompt - this should hopefully give enough information about
# where the executable is stuck:
thread apply all bt
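For the curious, the /generate-core route is implemented as a small Orthanc Python plugin script. The sketch below only illustrates the idea (it is not the exact script shipped in the debug image, and the output path and log message are merely examples): it runs gcore on the Orthanc process and returns the tool's output.

import os
import subprocess

import orthanc  # module provided by the Orthanc Python plugin


def on_generate_core(output, uri, **request):
    # Dump a core of the running Orthanc process (PID 1 inside the container)
    # into the /cores volume, so that it can be analysed later with gdb.
    orthanc.LogWarning('Generating core')
    result = subprocess.run(
        ['gcore', '-o', '/cores/core', str(os.getpid())],
        capture_output=True, text=True)
    output.AnswerBuffer(result.stdout + result.stderr, 'text/plain')


orthanc.RegisterRestCallback('/generate-core', on_generate_core)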

Thanks for your help !

Alain.

I haven’t been able to dump the core, but I have been able to replicate the issue on my local machine. I will post my replication setup tomorrow.

Hi James,

Could you share your replication setup? I haven’t been able to record a core dump on my side yet.

Thanks

Alain.

Sorry - I ran out of time today. I will see what I can do tomorrow. Interestingly, I did quickly try the debug-script container and experienced the lockup, but when I went to do a core dump I got a 404 error. Is /generate-core the right endpoint?

Hi Alain,

I worked out why generate-core wasn’t working: our Python script overwrote the test one. I have that working now; however, I have come across another issue. When I try to generate a core dump (either through /generate-core or manually through a shell) I get the following:

root@85ecc978bc70:/# ps -e
    PID TTY          TIME CMD
      1 ?        00:00:00 Orthanc
     79 pts/0    00:00:00 bash
     85 pts/0    00:00:00 ps
root@85ecc978bc70:/# gcore 1
ptrace: Operation not permitted.
You can't do that without a process to debug.
The program is not being run.
gcore: failed to create core.1

or, through the /generate-core endpoint:

orthanc-1  | Generating core
orthanc-1  | ptrace: Operation not permitted.
orthanc-1  | You can't do that without a process to debug.
orthanc-1  | The program is not being run.
orthanc-1  | gcore: failed to create core.1

James

Hi James,

It seems that you must run the container in “privileged” mode with this option in your docker-compose:

image: osimis/orthanc:debug-script
privileged: true
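
Combined with the /cores volume from the docker run example above, a minimal compose service could look like this (the service name and host path are only examples):

services:
  orthanc:
    image: osimis/orthanc:debug-script
    privileged: true            # required so that gcore/gdb can ptrace the Orthanc process
    ports:
      - "8044:8042"
    volumes:
      - /tmp/cores:/cores       # where the core file will be written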

Thanks again for your help !

Alain.

Hi James,

Thanks to your core.dump file, I could understand what was going wrong and, hopefully, I have fixed it.

I have pushed osimis/orthanc:mainline-2023.11.07 with the fix (this is a release version - it is not able to generate a core file). If you could give it a try on your system, that would be great !

Best regards,

Alain.

Oh fantastic!! I will try it out this evening.
James

Hi James,

Just realized that, since we are currently migrating the Mercurial server, these images were not built with the latest code :frowning:. I will update our build process and build the images again.

Sorry for the inconvenience.

Alain.

No dramas - just let me know when they’re good to test

Here it is: osimis/orthanc:mainline-2023.11.09

Hi Alain,
I installed 2023.11.09 on the unit where the issue would regularly occur. It has been 2 days now and things are looking good. I will let you know if it reoccurs. Thanks so much for your work in solving this!
James

As an update, we’ve had no issues with this mainline version for the past 10 days. Looking very good! Thanks again!
James

Hi Alain,

I’m experiencing a problem with similar symptoms using version 1.12.1, and it does not happen with version 1.11.3. I believe this fix may be the solution. When will a stable version with that fix be released?

Best regards.

We’ll try to make a release in the coming weeks.

Alain

Ok, thanks Alain :slight_smile:

Best regards