Hello,
Quite often, I run into a case where pulling up the Web viewer
page and clicking on "All patients" pulls up the standard patient
browser, but with NO patients listed, even when I know there are
patients on that Orthanc. I've generally interpreted this as some
sort of timeout/caching issue.
This is a good hunch (we've encountered timeout issues at the reverse
proxy level multiple times, more information below). Could you confirm
it first?
* Use your browser developer tools to check the exact response
(including status code and headers) from the Apache reverse-proxy, or
use another HTTP client with more diagnostics (e.g. a CLI client like
cURL). Perhaps share the results if you need help.
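For instance, a minimal check with cURL might look like this (the URL is the one from your own example; the credentials are a placeholder to replace with yours, and the flags are just a suggestion):

```shell
# Show the full request/response exchange (-v), including the
# status line and headers, and allow a generous 5-minute window
# before cURL itself gives up.
curl -v --max-time 300 -u user:password https://myorthanc/patients
```

Comparing the status code you get through the proxy with the one you get when querying Orthanc directly (bypassing Apache) is usually the quickest way to tell the two apart.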
* Check the logs of the Apache reverse-proxy, and possibly increase the
Apache log level to get the details you want. In your case, you're
looking for messages that Apache timed out waiting for a response from
Orthanc (I'm not sure about the terminology in Apache actually, in
nginx for example they call the target hosts "upstreams"). Again, feel
free to share the results if you want.
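As a sketch, on Apache 2.4 you can raise the log level for the proxy modules only, which avoids flooding the error log with unrelated messages (directive names are stock Apache 2.4; adapt to your configuration layout):

```apache
# Keep the global level at "warn", but trace proxy activity in detail.
LogLevel warn proxy:trace2 proxy_http:trace2
```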
* If you can find a setting or module in Apache to capture responses
from the target host (i.e. Orthanc), enable it (though only on 4xx and
5xx responses, for example, lest you hit your storage heavily).
These are the easy checks you can do immediately. If you have trouble
reproducing, try bigger and bigger studies; maybe produce a synthetic
one if needed.
If you want to go the extra mile or simply can't find much, consider:
* Capturing network traces with tcpdump/windump or with Wireshark
directly. This is especially useful if you still can't reproduce:
tcpdump (and, I assume, WinDump) has options for a "rolling capture" of
a fixed total size, meaning you can keep it running indefinitely while
allocating it, say, a few gigabytes. When the problem occurs, you can
interrupt the capture at your leisure and inspect it later. This
records all network interactions on the server side and helps pinpoint
the exact origin of the failure.
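For reference, a rolling capture with tcpdump might look like the following (the interface name, port, and sizes are assumptions to adapt; 8042 is Orthanc's default HTTP port):

```shell
# Rotate across 20 files of ~100 MB each (about 2 GB total),
# keeping only traffic to/from Orthanc's HTTP port.
tcpdump -i eth0 -w orthanc-capture -C 100 -W 20 'port 8042'
```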
* Looking at network throughput charts, especially per-connection (TCP)
charts if you can get them. Be on the lookout for "roller coasters".
* Looking at storage I/O. Look for: either "roller coasters" or
evidence of long periods of saturation.
Some context on your hunch: many reverse proxies reset ("debounce")
their read-timeout counters upon observing bytes arriving from the
peer. However, just because the peer (i.e. Orthanc) isn't sending data
yet doesn't mean it isn't hard at work preparing the response. So you
might sometimes see Orthanc working on a response for a long while, the
reverse proxy abandoning the request and telling the client "I think it
failed" (often via a 502 "Bad Gateway" or 504 "Gateway Timeout"
response), and then, when Orthanc is finally ready to send its answer,
nobody is listening anymore.
Normally we see this with the preparation of very large responses, but
depending on hardware resources and contention on the host there's no
reason it can't happen in other contexts. Examples: (1) the viewer
generates many small requests and the host is simply overloaded in some
way, so a few threads are starved of resources and cannot make
progress; (2) there are many patients and they are being sorted, which
blocks the pipeline while the list is prepared.
If you end up increasing the read timeouts of your reverse proxy to
whatever seems appropriate in your scenario, please consider reporting
it, as there might be opportunities to improve performance for that
scenario. In general, we don't consider it normal for end users to have
to increase arbitrary timeouts like that: it's usually a sign of
something scaling badly (e.g. "the more patients there are, the longer
it takes for the first byte of the response to arrive, and the higher I
need to set my read timeout").
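For the record, with Apache's mod_proxy the read timeout can be raised either globally or per mapping; a sketch (the path, target address, and value are assumptions for illustration):

```apache
# Wait up to 10 minutes for the backend to start answering.
ProxyTimeout 600
# Or per mapping, using the "timeout" connection parameter:
ProxyPass /orthanc/ http://127.0.0.1:8042/ timeout=600
```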
I can force a "refresh" by temporarily calling the API directly
(e.g. https://myorthanc/patients). That seems to do the trick so that
when you return to the patient browser menu, all the patients
magically reappear.
I would imagine the frontend issues the same request, so the fact that
it sometimes works and sometimes doesn't is indeed indicative of a
hot/cold cache scenario, as you suggested. It could simply be the
system cache.
Hope this helps,