Orthanc 1.5.4 stops responding under load, disconnects from PostgreSQL 9.6.11 DB

Dear orthanc-users!

We’re performance testing Orthanc by bombarding it with increasingly heavy loads of cases incoming over the transfers plugin. This particular test involved only some 10 studies from a single source, yet it managed to take down the receiving (transfer push) Orthanc instance: Orthanc stopped responding to further calls of any kind.
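
For context, the pushes go through the transfers accelerator plugin's REST API; a minimal sketch of such a call looks like the following (the peer name, credentials and port are placeholders rather than our actual setup):

# Sketch: push every local study to a remote peer through the transfers plugin.
# "remote-peer" and the credentials below are placeholders.
import requests

ORTHANC = "http://localhost:8042"
AUTH = ("orthanc", "orthanc")

study_ids = requests.get(f"{ORTHANC}/studies", auth=AUTH).json()
body = {
    "Resources": [{"Level": "Study", "ID": sid} for sid in study_ids],
    "Compression": "gzip",
    "Peer": "remote-peer",
}
r = requests.post(f"{ORTHANC}/transfers/send", json=body, auth=AUTH)
print(r.status_code, r.text)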

Investigating a bit further, we saw that Orthanc suddenly dropped the DB connection under load; see the graphs below. There is no mention of any error in the Orthanc log file, nor of the fact that the DB connection was dropped. Orthanc did NOT restart the DB connection on its own for the next 20 minutes; finally, a manual server restart restored the service at 10:30.

AWS RDS PostgreSQL 9.6.11 DB
image.png

Orthanc 1.5.4 on Docker on Ubuntu 18.04 LTS
image.png

Nothing remarkable in terms of load; the server should handle this fine. The CPU spike at 10:30 is the start of the load.
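
For what it's worth, a trivial watchdog along the following lines (just a sketch, assuming the default HTTP port 8042 and placeholder credentials; we did not have this in place during the test) would pinpoint the exact moment Orthanc stops answering:

# Sketch: poll Orthanc every 10 seconds and log when it stops answering.
import time
import requests

ORTHANC = "http://localhost:8042"
AUTH = ("orthanc", "orthanc")  # placeholder credentials

while True:
    try:
        r = requests.get(f"{ORTHANC}/system", auth=AUTH, timeout=5)
        r.raise_for_status()
        print(time.strftime("%H:%M:%S"), "OK, version", r.json()["Version"])
    except requests.RequestException as exc:
        print(time.strftime("%H:%M:%S"), "NOT RESPONDING:", exc)
    time.sleep(10)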

Asking the DB what happened, we find the following:


2019-03-22 09:20:19 UTC::@:[25134]:LOG: checkpoint complete: wrote 743 buffers (2.9%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=277.216 s, sync=0.020 s, total=280.965 s; sync files=48, longest=0.001 s, average=0.000 s; distance=15467 kB, estimate=27859 kB
2019-03-22 09:20:38 UTC::@:[25134]:LOG: checkpoint starting: time
2019-03-22 09:24:40 UTC::@:[3992]:LOG: database system was interrupted; last known up at 2019-03-22 09:20:19 UTC
2019-03-22 09:24:40 UTC::@:[3992]:LOG: database system was not properly shut down; automatic recovery in progress
2019-03-22 09:24:40 UTC::@:[3992]:LOG: redo starts at 130/5401EC70
2019-03-22 09:24:40 UTC::@:[3992]:LOG: invalid record length at 130/551A12D8: wanted 24, got 0
2019-03-22 09:24:40 UTC::@:[3992]:LOG: redo done at 130/551A12A0
2019-03-22 09:24:40 UTC::@:[3992]:LOG: last completed transaction was at log time 2019-03-22 09:20:37.543816+00
2019-03-22 09:24:40 UTC::@:[3992]:LOG: checkpoint starting: end-of-recovery immediate
2019-03-22 09:24:40 UTC::@:[3992]:LOG: checkpoint complete: wrote 246 buffers (1.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.005 s, sync=0.004 s, total=0.020 s; sync files=22, longest=0.004 s, average=0.000 s; distance=17929 kB, estimate=17929 kB
2019-03-22 09:24:40 UTC::@:[3992]:LOG: MultiXact member wraparound protections are now enabled
2019-03-22 09:24:40 UTC::@:[3940]:LOG: database system is ready to accept connections
2019-03-22 09:24:40 UTC::@:[3996]:LOG: autovacuum launcher started
2019-03-22 09:29:40 UTC::@:[3993]:LOG: checkpoint starting: time
2019-03-22 09:29:41 UTC::@:[3993]:LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 transaction log file(s) added, 0 removed, 1 recycled; write=0.342 s, sync=0.002 s, total=0.356 s; sync files=3, longest=0.002 s, average=0.000 s; distance=11 kB, estimate=11 kB

As you can see, something happened in the DB (we don’t know what); however, it recovered by itself in just 5 minutes, after which Orthanc did NOT reconnect.
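
One way to check from the DB side whether Orthanc has re-established its connection is to look at pg_stat_activity, along these lines (a sketch; the host, database name and credentials are placeholders):

# Sketch: list the client connections currently open against the Orthanc database.
import psycopg2

conn = psycopg2.connect(host="my-rds-endpoint", dbname="orthanc",
                        user="orthanc", password="secret")  # placeholders
with conn.cursor() as cur:
    cur.execute("SELECT client_addr, state, backend_start "
                "FROM pg_stat_activity WHERE datname = %s", ("orthanc",))
    for row in cur.fetchall():
        print(row)
conn.close()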

My questions are:

  1. Has anyone seen similar behaviour, especially under load?
  2. Why does Orthanc not reconnect automatically to the DB once it is back up? Is there a configurable parameter for the retry interval and the number of reconnection attempts? (See the config sketch after this list.)
  3. Why was the DB interrupted in the first place? Was this related to Orthanc and the load it was hammering onto the DB, or to something else?
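
For reference, the PostgreSQL block in our Orthanc configuration looks roughly like the sketch below (placeholder values, not our real settings); we have not spotted any retry- or reconnect-related option in the documentation for this plugin version, hence question 2:

  "PostgreSQL" : {
    "EnableIndex" : true,
    "EnableStorage" : false,
    "Host" : "my-rds-endpoint",
    "Port" : 5432,
    "Database" : "orthanc",
    "Username" : "orthanc",
    "Password" : "secret",
    "Lock" : true
  }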

Any help we can get with this is of course greatly appreciated. We’re going through a performance test of Orthanc in order to understand how well it scales, and this is the first troublesome issue we’ve run into.

thank you,
Pär
www.cmrad.com

Hello,

You must upgrade to Orthanc 1.5.6 and associated PostgreSQL plugins 3.2:
https://groups.google.com/d/msg/orthanc-users/LBil4zk8ZiY/QeCEExRwAwAJ

These releases solve issues with growing memory consumption within the PostgreSQL client, which can lead to swapping (i.e. performance issues) and eventually to Orthanc crashing.
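
As a quick sanity check after upgrading, you can look at the version numbers reported by the REST API, along these lines (a minimal sketch, assuming the default port and placeholder credentials):

# Sketch: print the Orthanc core version and the version of every loaded plugin.
import requests

ORTHANC = "http://localhost:8042"
AUTH = ("orthanc", "orthanc")  # placeholder credentials

print("Orthanc", requests.get(f"{ORTHANC}/system", auth=AUTH).json()["Version"])
for name in requests.get(f"{ORTHANC}/plugins", auth=AUTH).json():
    info = requests.get(f"{ORTHANC}/plugins/{name}", auth=AUTH).json()
    print(name, info.get("Version"))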

Sébastien-

Hey Sébastien, thank you for that. Your recommendation seems to have worked magic on the Postgres DB. We’ll continue performance testing the system, but if you don’t hear back from me, this was most likely the issue. You can see from the attached graph of memory and swap used by the DB that the upgrade released heaps of memory.

thanks again,
Pär
www.cmrad.com

orthancDBmem.png