Memory issues on server with lots of data and high load

a.sallai · February 7, 2024, 3:35pm

Hi everyone!

I’ll try to post updates with more specifics, but I just want to put this out there early, maybe someone knows what to do.
We are running the jodogne/orthanc-python:1.12.2 Docker image with lots of data and under high load. Over the past few weeks we have noticed memory-related errors followed by restarts of our Docker container.

It doesn’t seem to be specific to any particular data: we uploaded the same studies to a dev instance and couldn’t reproduce it.

It also doesn’t seem to be related to RAM issues, since we didn’t see any problems in our monitoring.

We saw these 2 kinds of errors in the logs:

2024-02-05T09:03:49.476043+01:00 sirtuxford sirtuxford-orthanc[2946221]: malloc(): unsorted double linked list corrupted
...
...
2024-02-06T13:20:28.031706+01:00 sirtuxford sirtuxford-orthanc[2946221]: free(): corrupted unsorted chunks
...
...
2024-02-07T07:42:09.808245+01:00 sirtuxford sirtuxford-orthanc[2946221]: free(): corrupted unsorted chunks
...
...
2024-02-07T13:11:23.394106+01:00 sirtuxford sirtuxford-orthanc[2946221]: malloc(): unsorted double linked list corrupted

And corresponding journal entries, for example:

Feb 07 13:11:23 sirtuxford kernel: traps: Orthanc[3162777] general protection fault ip:7f78904c7611 sp:7f784b7bd4c0 error:0 in libc-2.28.so[7f78904c7000+147000]

I’ll try to post more info as we investigate, suggestions are greatly appreciated!

a.sallai · February 7, 2024, 4:40pm

We saw only 5 instances of these errors across September, October and November, but since the 12th of January 2024 they have become much more common, happening a few times almost every day. That might coincide with the 1.12.2 release, so we’ll try upgrading to 1.12.3 first and then downgrade if that doesn’t help.

a.sallai · February 8, 2024, 12:26am

I also noticed that for some studies, our OnStableStudy Lua event handler wasn’t called. Yesterday this happened at least 4 times, all within a couple of minutes of a restart caused by the above problems.
I checked all of these studies and the REST API (/studies/{id}) says they are all stable:

{
   "ID" : "d34ee41a-97ae3864-27050d9e-747fe1ff-fbaa6772",
   "IsStable" : true,
   ...
}
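For anyone who wants to script the same check, here is a minimal sketch. It assumes only that /studies/{id} returns a JSON object with an "IsStable" field, as in the response above. The `fetch` parameter is a hypothetical callable that performs the GET and returns the response body. Inside the Python plugin it could presumably wrap `orthanc.RestApiGet`, but that wiring is an assumption and is not shown.

```python
import json
from typing import Callable, Iterable, List


def unstable_studies(study_ids: Iterable[str],
                     fetch: Callable[[str], str]) -> List[str]:
    """Return the study IDs whose /studies/{id} body does not report IsStable = true.

    `fetch` is a hypothetical helper: it takes a REST path and returns the
    JSON body as text (or bytes).
    """
    pending = []
    for study_id in study_ids:
        study = json.loads(fetch("/studies/" + study_id))
        # A missing "IsStable" field is treated as "not stable yet".
        if not study.get("IsStable", False):
            pending.append(study_id)
    return pending
```

Running this over the studies received shortly before a restart shows which ones Orthanc now reports as stable. It cannot show whether OnStableStudy ever fired for them.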

alainmazy · February 9, 2024, 8:15am

Hi,

If an Orthanc is stopped before the studies are marked as stable, Orthanc will mark them as stable at the next restart.

HTH,

Alain

a.sallai · February 9, 2024, 11:25am

Thanks!

Shouldn’t we trigger OnStableStudy when that happens?

alainmazy · February 12, 2024, 8:48am

Yes, indeed we should, but it’s not straightforward to make it work correctly.
It is actually already in our TODO list:

* Right now, some Stable events never occurs (e.g. when Orthanc is restarted before the event is triggered).
  Since these events are used to e.g. generate dicom-web cache (or update it !), we should try
  to make sure these events always happen.
  - Generate the events when setting IsStable=true when starting an Orthanc (ok for SQLite) ?
  - Also consider the use case of an Orthanc cluster that is being scaled-down just after one Orthanc instance
    has received a few instances -> we can not only check for missing stable events at startup since no Orthanc will start.  
    We would need to maintain the list of "unstable" resources in DB instead of memory only.
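To illustrate the last point, here is a rough sketch of what keeping the "unstable" list in a database could look like. This is not Orthanc's implementation or schema: the `unstable` table, the `UnstableRegistry` class and the `on_stable` callback are all hypothetical names, and SQLite is used only because it keeps the example self-contained.

```python
import sqlite3
from typing import Callable, List


class UnstableRegistry:
    """Hypothetical bookkeeping for resources that are not stable yet."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS unstable (resource_id TEXT PRIMARY KEY)")
        conn.commit()

    def mark_unstable(self, resource_id: str) -> None:
        # Called whenever a resource receives a new instance.
        self.conn.execute(
            "INSERT OR IGNORE INTO unstable VALUES (?)", (resource_id,))
        self.conn.commit()

    def mark_stable(self, resource_id: str,
                    on_stable: Callable[[str], None]) -> None:
        # Fire the handler first and forget the resource afterwards, so a
        # crash in between replays the event at startup instead of losing it.
        on_stable(resource_id)
        self.conn.execute(
            "DELETE FROM unstable WHERE resource_id = ?", (resource_id,))
        self.conn.commit()

    def replay_pending(self, on_stable: Callable[[str], None]) -> List[str]:
        # At startup, anything still listed never got its stable event.
        rows = self.conn.execute(
            "SELECT resource_id FROM unstable ORDER BY resource_id").fetchall()
        for (resource_id,) in rows:
            self.mark_stable(resource_id, on_stable)
        return [row[0] for row in rows]
```

Firing the handler before deleting the row means an event can be delivered twice after a crash, so handlers would need to tolerate duplicates. In the cluster case, any surviving node could call `replay_pending`, provided the table lives in the shared database.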

Sylvain · February 12, 2024, 9:24am

FYI, the memory problem is due to Orthanc/DCMTK storing instances in memory before writing them to disk. Orthanc can also transcode in memory before writing to disk, so you may end up using twice the amount of memory.

A few workarounds:

* increase the host memory
* reduce ConcurrentJobs (default seems to be 2)
* reduce DicomThreadsCount (default seems to be 4)
* disable incoming transcoding to compressed data?
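As a rough sketch, the configuration-side workarounds could be applied to an already-parsed Orthanc configuration like this. The values 1 and 2 are arbitrary illustrations, not recommendations. The sketch also assumes that incoming transcoding is controlled by the "IngestTranscoding" option and that removing the option disables it. `low_memory_overrides` is a made-up helper name.

```python
import json


def low_memory_overrides(config: dict) -> dict:
    """Return a copy of a parsed Orthanc configuration with fewer concurrent workers."""
    patched = dict(config)
    patched["ConcurrentJobs"] = 1      # illustrative value, below the apparent default of 2
    patched["DicomThreadsCount"] = 2   # illustrative value, below the apparent default of 4
    # Assumption: dropping this option turns off transcoding of incoming instances.
    patched.pop("IngestTranscoding", None)
    return patched


if __name__ == "__main__":
    current = {"ConcurrentJobs": 2, "IngestTranscoding": "1.2.840.10008.1.2.4.70"}
    print(json.dumps(low_memory_overrides(current), indent=2))
```

The function returns a new dictionary rather than modifying its input, which makes the before/after difference easy to inspect before writing anything back to the configuration file.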

a.sallai · February 14, 2024, 3:33pm

I appreciate the suggestion, but no, this problem is not caused by insufficient amounts of memory.

We narrowed it down to the event handlers in our Lua script: in the OnStableStudy event handler we gather a couple of values from the study to send them to another system, and there we do this:

ParseJson(RestApiGet("/instances/" .. firstInstanceId .. "/tags"))

The study in question has interesting, deeply nested data elements, and it seems that the way Lua tries to parse them results in memory errors.
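To get a feel for how deep "deeply nested" is, one option is to measure the nesting depth of the /tags response outside of Lua. The sketch below is a diagnostic aid only: it does not fix anything and does not prove that depth is the trigger. `max_depth` is a hypothetical helper name, and the function works on any JSON document that has already been parsed.

```python
import json


def max_depth(value) -> int:
    """Return the deepest level of dict/list nesting in a parsed JSON value.

    Scalars count as depth 0 and an empty container as depth 1. The walk is
    iterative, so the function itself does not recurse on nested input.
    """
    deepest = 0
    stack = [(value, 1)]
    while stack:
        node, depth = stack.pop()
        if isinstance(node, dict):
            children = list(node.values())
        elif isinstance(node, list):
            children = node
        else:
            continue  # scalars do not add a nesting level
        deepest = max(deepest, depth)
        stack.extend((child, depth + 1) for child in children)
    return deepest


if __name__ == "__main__":
    sample = json.loads('{"a": {"Value": [{"b": {"Value": "x"}}]}}')
    print(max_depth(sample))
```

Comparing this number between a study that triggers the crash and one that does not could help confirm or rule out nesting depth as the relevant factor.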

Later I’ll post links to some core dumps and an example study that reproduces the problem often, though not always! I’m not sure what causes the heisenbug behavior; my guess is ASLR. Sometimes it wouldn’t reproduce for 5-10 consecutive runs, but when I killed Orthanc and restarted it, the next try reproduced it immediately.