Some CT Studies fail to transfer between peers with Transfers Accelerator

yomarbuzz · February 13, 2024, 2:31pm

Hi,

I’m setting up a long-term sync job from a local Orthanc (A) to a remote Orthanc (B).

I’ve checked the json file that was tracking the jobs and I’ve noticed that most of the studies failing are CT (86 CT, 1 XA, 4 US, 3 SR).

When I try to send the study through the UI, one of two things happens:

1) I get the following error on initiating the transfer, and the transfer job never starts:

orthanc-users | E0213 13:45:46.323983 HTTP-25 PluginsManager.cpp:153] Unsupported return MIME type: application/dicom+json, multipart/related; type=application/octet-stream; transfer-syntax=*, will return DICOM+JSON

2) The transfer bar remains grey for a while (10-20 seconds) and I get no logs. Then the transfer job starts and it seems like all images are transferred to the remote, but I get a 504 Gateway Timeout at the commit stage and the job fails:

orthanc-users | sender-transfer-id: 8b93bb2e-9490-47c8-8ba5-4df0d5e74a65
orthanc-users | Content-Length: 0
orthanc-users | Content-Type: application/x-www-form-urlencoded
orthanc-users |
orthanc-users | E0213 14:09:01.167454 HTTP-17 HttpOutput.cpp:78] This HTTP answer has not sent the proper number of bytes in its body
orthanc-users | E0213 14:09:01.226502 HTTP-13 HttpOutput.cpp:78] This HTTP answer has not sent the proper number of bytes in its body
orthanc-users | E0213 14:09:08.675124 HTTP-19 PluginsManager.cpp:153] Unsupported return MIME type: application/dicom+json, multipart/related; type=application/octet-stream; transfer-syntax=*, will return DICOM+JSON
orthanc-users | < HTTP/1.1 504 Gateway Time-out
orthanc-users | < Server: awselb/2.0
orthanc-users | < Date: Tue, 13 Feb 2024 14:10:25 GMT
orthanc-users | < Content-Type: text/html
orthanc-users | < Content-Length: 132
orthanc-users | < Connection: keep-alive
orthanc-users | <
orthanc-users | * Connection #0 to host site.example.com left intact
orthanc-users | E0213 14:10:25.570201 JOBS-WORKER-0 HttpClient.cpp:1100] Error in HTTP request, received HTTP status 504 (Gateway Timeout) after POST request on: https://site.example.com/transfers/push/af11af69-17a5-44df-8d39-ba4d3994d36a/commit
orthanc-users | E0213 14:10:25.574843 JOBS-WORKER-0 PluginsManager.cpp:188] Exception while invoking plugin service 8006: Error in the network protocol
orthanc-users | E0213 14:10:25.574960 JOBS-WORKER-0 PluginsManager.cpp:153] Cannot commit push transaction on remote peer: OrthancRemote

I have set:

ORTHANC__HTTP_KEEP_ALIVE=true
ORTHANC__HTTP_TIMEOUT=3000
ORTHANC__HTTP_REQUEST_TIMEOUT=3000
ORTHANC__HTTP_TCP_NODELAY=false

On both remote and local. Any help would be appreciated. Sample file which has failed in both ways.

On a related note, it would be great if anyone could point me to a sample of Python/Lua that does the following:

1) Robustly forwards all new studies through the transfers accelerator to a remote Orthanc peer, minimizes transfer failures, implements retries and ensures that all studies are sent in full.
2) Gets existing studies from an Orthanc sorted latest to earliest, and attempts to forward them in the same order, in the manner described in 1). This should be stateful and resume on restart.

I’ve tried implementing both on my own but I’d prefer a time-tested solution.

Kind Regards,

Yomarbuzz

alainmazy · February 14, 2024, 3:14pm

Hi,

No problems when testing with this setup and your files.

Please adapt this sample to make it reproduce your issue and come back to us.

BR,

Alain

yomarbuzz · February 14, 2024, 8:03pm

Hi Alain,

Thanks for your response, unfortunately my setup is a lot more complex, I will try to reproduce.

Can you give me additional guidance on the “Unsupported return MIME type ... will return DICOM+JSON” error? What could trigger this at the job creation / commit stage? Any ideas on how to get more verbose output on what is being sent/received?

yomarbuzz · February 18, 2024, 10:08pm

The issue was a low timeout value on the load balancer; it was not Orthanc-related.

Answering my own question on study forwarding / pushing old studies newest to oldest to a remote orthanc peer in case anyone needs it:

You will need to enable the transfers accelerator plugin on both local and remote, and add the remote peer to the “OrthancPeers” configuration:

"OrthancPeers" : {
  "OrthancRemote" : {
    "Url" : "https://remote.url/",
    "HttpHeaders" : { "api-key" : "xyz" }
  }
},

Study Forwarding:

Enable the Python plugin and mount the following script.

Add a PEER environment variable set to the peer name from the config above (OrthancRemote), or modify the script to use the name directly. The script will attempt to forward stable studies to the remote and retry on failures.
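
A minimal sketch of the push-and-retry step, written as a standalone script against the REST API rather than as the Python plugin script itself. It assumes the transfers accelerator exposes POST /transfers/send and answers with a job description containing an "ID" field; ORTHANC_URL, build_send_body, post_json and send_study are hypothetical names chosen for illustration, and the retry count and delay are arbitrary.

```python
import json
import os
import time
import urllib.request

# Assumption: local Orthanc without authentication on the default port.
ORTHANC_URL = os.environ.get("ORTHANC_URL", "http://localhost:8042")
# Peer name as declared under "OrthancPeers" in the configuration.
PEER = os.environ.get("PEER", "OrthancRemote")


def build_send_body(study_id, peer, compression="gzip"):
    """Request body for POST /transfers/send (push mode)."""
    return {
        "Resources": [{"Level": "Study", "ID": study_id}],
        "Compression": compression,
        "Peer": peer,
    }


def post_json(path, body):
    """POST a JSON body to the local Orthanc and decode the JSON answer."""
    request = urllib.request.Request(
        ORTHANC_URL + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.loads(response.read().decode("utf-8"))


def send_study(study_id, peer=PEER, post=post_json, attempts=3, delay=5.0):
    """Queue a transfer job, retrying when the request itself fails.

    Returns the Orthanc job ID, or None if every attempt failed.
    """
    for attempt in range(1, attempts + 1):
        try:
            answer = post("/transfers/send", build_send_body(study_id, peer))
            return answer["ID"]
        except (OSError, KeyError, ValueError) as error:
            print(f"attempt {attempt}/{attempts} for {study_id} failed: {error}")
            if attempt < attempts:
                time.sleep(delay)
    return None
```

This only retries the request that creates the job. The job itself can still fail later (for example at the commit stage), so its state needs to be checked separately through /jobs/{id}.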

Forwarding existing studies to remote in newest to oldest order + tracking state in file:

Script to generate study list from local orthanc

If you re-run this, it appends studies created since the last run to the list without changing UploadStatus. Studies are saved in studies.json.
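
A rough sketch of how such a list could be merged on re-run, assuming studies.json holds a JSON list of objects with "ID", "StudyDate" and "UploadStatus" keys (only UploadStatus and the NotStarted/Ongoing values come from the description above; the other fields and the function names are assumptions). The fetched entries are assumed to look like the output of GET /studies?expand. Unlike a plain append, this sketch re-sorts the whole list so it stays newest first.

```python
import json
from pathlib import Path

STATE_FILE = Path("studies.json")


def merge_studies(existing, fetched):
    """Add newly seen studies to the tracked list, newest first.

    existing: entries already in studies.json (their UploadStatus is kept).
    fetched:  entries shaped like the output of GET /studies?expand.
    """
    tracked = {entry["ID"]: entry for entry in existing}
    for study in fetched:
        if study["ID"] not in tracked:
            tracked[study["ID"]] = {
                "ID": study["ID"],
                "StudyDate": study.get("MainDicomTags", {}).get("StudyDate", ""),
                "UploadStatus": "NotStarted",
            }
    # DICOM dates are YYYYMMDD strings, so a reverse string sort is newest
    # first; studies without a date end up at the end of the list.
    return sorted(tracked.values(), key=lambda e: e["StudyDate"], reverse=True)


def load_state(path=STATE_FILE):
    """Read the tracked list, or return an empty list on the first run."""
    return json.loads(path.read_text()) if path.exists() else []


def save_state(entries, path=STATE_FILE):
    """Write the tracked list back to disk."""
    path.write_text(json.dumps(entries, indent=2))
```

A run would then be roughly save_state(merge_studies(load_state(), fetched)), where fetched is whatever the local Orthanc returns for its study list.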

Script that handles uploads with retries and updates upload state in studies.json

You can run these scripts directly with Python. Put both scripts in the same folder. If you have to stop them or you encounter errors, edit studies.json: change all studies with UploadStatus Ongoing to NotStarted and re-run.
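
The manual reset can also be scripted. A small sketch, again assuming studies.json is a JSON list of objects carrying an "UploadStatus" key; reset_ongoing and reset_file are hypothetical helper names.

```python
import json
from pathlib import Path


def reset_ongoing(entries):
    """Return a copy of the entries with Ongoing uploads marked NotStarted."""
    return [
        {**entry, "UploadStatus": "NotStarted"}
        if entry.get("UploadStatus") == "Ongoing"
        else entry
        for entry in entries
    ]


def reset_file(path="studies.json"):
    """Rewrite the state file in place and return how many entries changed."""
    state_file = Path(path)
    entries = json.loads(state_file.read_text())
    updated = reset_ongoing(entries)
    state_file.write_text(json.dumps(updated, indent=2))
    return sum(1 for old, new in zip(entries, updated) if old != new)
```

Calling reset_file() before restarting the upload script puts interrupted studies back in the queue while leaving the other entries untouched.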

BNOEAFK · March 15, 2024, 4:12am

I’m experiencing this on “larger” studies - smaller ones appear to be just fine and I can’t help but think I’m experiencing the same issues.

Without divulging any security related information, was there anything “special” in your load balancers that you had to update to extend the timeout? (I’m using AWS Application Load Balancers, so am keen to know if they’re potentially at fault in my own test environment)…

I’d appreciate any feedback @yomarbuzz

yomarbuzz · March 19, 2024, 12:17am

Hi,

The issue was actually not resolved; increasing the ALB timeouts may only have reduced the incidence of errors.

I’ve tested this extensively but I wasn’t able to pinpoint what was causing it. I can confirm that this happens for large studies only: the transfer sends all files but fails at the commit stage with a timeout. My best guess is that this is caused by an interaction with the S3 plugin, i.e. the transfers plugin requests a storage commit for the files that have been sent to the remote Orthanc, the remote Orthanc has received them in temporary storage but has not yet fully uploaded them to S3, so it can’t return a storage commit in time.

I’ve gone through every network component and increased timeouts to 1 hour, so this doesn’t seem like an HTTP issue.

I’ve had to switch to using the transfers plugin in pull mode, using a Cloudflare tunnel (cloudflared) to proxy my local Orthanc to a public URL. This has worked so far, but I don’t think storage commitment is implemented for pull mode, so I’m writing custom scripts to robustly keep track of the transfer jobs and retry if needed.
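
For the job tracking part, a minimal sketch of what such a script might look like. It assumes job status is read from GET /jobs/{id} on the Orthanc that runs the pull, with a "State" field that ends in "Success" or "Failure". The HTTP calls are injected as callables (start_pull and get_job are hypothetical names), so the retry logic stays independent of how the requests are made.

```python
import time

# States after which Orthanc no longer changes the job on its own.
FINAL_STATES = {"Success", "Failure"}


def wait_for_job(get_job, job_id, poll_seconds=5.0, timeout=3600.0):
    """Poll a job until it reaches a final state or the timeout expires.

    get_job: callable returning the decoded JSON of GET /jobs/{id}.
    Returns the final State string, or "Timeout".
    """
    deadline = time.monotonic() + timeout
    while True:
        state = get_job(job_id).get("State")
        if state in FINAL_STATES:
            return state
        if time.monotonic() >= deadline:
            return "Timeout"
        time.sleep(poll_seconds)


def pull_with_retries(start_pull, get_job, study_id, attempts=3, **wait_options):
    """Start a pull job and restart it until it succeeds or attempts run out.

    start_pull: callable taking a study ID and returning a new job ID,
                e.g. by POSTing to /transfers/pull on the receiving Orthanc.
    """
    for _ in range(attempts):
        job_id = start_pull(study_id)
        if wait_for_job(get_job, job_id, **wait_options) == "Success":
            return True
    return False
```

A "Success" state only says the job finished on that Orthanc. Comparing the instance count of the study on both sides afterwards would be a reasonable extra check that it arrived in full.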

If anyone has good working samples of python + transfers accelerator please share.