I’ve noticed on many of our systems (including automated tests) that we’re getting a lot of “Socket Hangup” errors when external systems access Orthanc through the HTTP API. I’ve added retries to our code for when this occurs, and this dramatically improves reliability. I can’t see any errors in the Orthanc logs. Is there anything I can do to trace why this might be occurring?
Do you mean that some of your HTTP clients fail to connect to Orthanc?
From my experience, this happens very often when a client has a polling interval very close to the KeepAlive configuration of Orthanc. E.g., if a script polls Orthanc every 1 second and KeepAliveTimeout is 1 second (the default), it seems that, at some point, the HTTP client sends a request on the existing connection, without reopening it, while Orthanc is closing the socket. However, I have never investigated this in detail.
Any hints are welcome, but I probably won’t investigate in the near future.
Interesting… That reflects what I see. I have a script that polls for job completion status every 500 ms, and if the KeepAlive timeout is 1 second, then it is quite probable that a request will coincide with the moment Orthanc closes the connection. I have also disabled KeepAlive for our automated tests, which seems to be improving the reliability.
Anecdotally, I never observed this up until the last 4-6 weeks. Very scientific I know!?
Oh, I think I discovered why I’ve only noticed it recently. We use Node for the vast bulk of our work, and we’ve recently been upgrading to the latest LTS (20). In Node 19, the defaults changed so that HTTP KeepAlive is now on by default (see the “Node.js 19 is now available!” announcement on the Node.js site). In earlier versions (18 etc., which we’ve been using up until very recently) it was off by default. This would explain why I’ve only noticed the socket disconnects very recently.
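For anyone who wants the pre-19 behaviour back for a specific client, here is a minimal sketch that passes an explicit agent with keep-alive turned off. The helper name getJson and the Orthanc URL in the usage note are only illustrative assumptions, not something from our code base, and I haven’t tried this exact snippet against Orthanc.

```javascript
const http = require('node:http');

// Agent that does not reuse sockets: each request opens a new connection,
// which matches the default behaviour before Node 19.
const noKeepAliveAgent = new http.Agent({ keepAlive: false });

// Hypothetical helper: GET a URL and resolve with the status code and raw body.
function getJson(url, agent = noKeepAliveAgent) {
  return new Promise((resolve, reject) => {
    const req = http.get(url, { agent }, (res) => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', (chunk) => { body += chunk; });
      res.on('end', () => resolve({ status: res.statusCode, body }));
    });
    // A "socket hang up" surfaces here as an 'error' event on the request.
    req.on('error', reject);
  });
}
```

Usage would be something like getJson('http://localhost:8042/jobs'), assuming Orthanc listens on its usual HTTP port. The trade-off is one TCP handshake per request, which is probably fine for a 500 ms poll.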
With more research, it turns out there is a known race condition that can occur with KeepAlive (see the Stack Overflow question “How do browsers handle HTTP keepalive race condition?”). According to the KeepAlive RFC, a 408 response should be sent when closing the connection, but it is up to the client what to do in this situation. Chrome’s behaviour has changed. In short, either disable KeepAlive or, on the client side, catch the Socket Hangup and retry.
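The retry option can be sketched roughly as follows. In Node, a “socket hang up” error carries the code ECONNRESET, so that is what the sketch checks for. The helper name withRetry, the retry count and the delay are assumptions for illustration; this is not the exact code we run.

```javascript
// Hypothetical helper: re-run an async operation when the connection was
// dropped under it (Node reports "socket hang up" with code ECONNRESET).
async function withRetry(operation, { retries = 3, delayMs = 100 } = {}) {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (err) {
      const isHangup = Boolean(err) &&
        (err.code === 'ECONNRESET' || err.code === 'EPIPE');
      // Give up on unrelated errors or once the retry budget is spent.
      if (!isHangup || attempt >= retries) throw err;
      // Wait a little longer before each new attempt.
      await new Promise((resolve) => setTimeout(resolve, delayMs * (attempt + 1)));
    }
  }
}
```

This is only safe to apply blindly to idempotent requests such as the GET used for polling job status; a POST that reached the server before the socket closed could otherwise be executed twice.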