I have been using 1.12.4 for many months without problems.
Recently, Orthanc has been crashing on random studies stored in S3.
These are the last verbose log messages before the restart:
I1218 11:49:05.456958 HTTP-22 AWS S3 Storage:/StoragePlugin.cpp:259] AWS S3 Storage: read whole attachment 2013cb3e-2539-45ca-a58f-8ec98c2ba267 (36.46MB in 4.44s = 68.96Mbps)
I1218 11:49:05.567547 HTTP-35 StorageCache.cpp:128] Read attachment "4ab33e95-12d3-4d1f-af77-4b88871dbd61" with content type 1 from cache
W1218 11:49:08.857127 MAIN main.cpp:2059] Orthanc version: 1.12.4
Any suggestions on where to start troubleshooting?
You should check the memory usage, e.g. with docker stats, at the time Orthanc is crashing. What is the RAM size of your host?
Do not hesitate to look further back in the logs than the last 3 lines before the crash. Maybe you actually have 50 HTTP clients each requesting a 35 MB file from S3.
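If it helps, here is a minimal sketch of how memory usage could be watched over time with docker stats; the container name orthanc is only a placeholder for whatever your setup uses:

```sh
# Print a memory snapshot of the "orthanc" container every 5 seconds
# ("orthanc" is a placeholder container name; adjust to your setup).
while true; do
  docker stats --no-stream --format "{{.Name}}: mem {{.MemUsage}} ({{.MemPerc}})" orthanc
  sleep 5
done
```

Correlating the last sample before the crash with the total RAM of the host should already tell you a lot.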
In this scenario, there is only 1 user requesting a single study. I thought the same study could generate the crash, but it turns out that's not the case (I posted a Google link to the study earlier in this thread). The logs before the last 3 lines are just more calls to S3 for instances of the same study.
Using docker stats: is it normal for Orthanc memory usage to grow and not shrink when a user views one study after another and then closes the viewer window?
There are caches, and the memory is not always reclaimed directly.
Are you under the impression that your container could be OOM-killed?
You might want to give it a try on a dev machine with more memory, to see if that helps. There are also settings that help lower Orthanc memory usage, by the way.
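As a rough sketch only (the option names below are from memory, so please double-check them against the configuration documentation for your Orthanc version, and the values are arbitrary examples rather than recommendations):

```json
{
  // Fewer HTTP worker threads means fewer attachments being loaded concurrently
  "HttpThreadsCount" : 10,

  // Cap the in-memory storage cache (value in MB)
  "MaximumStorageCacheSize" : 64
}
```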
Orthanc memory usage strongly depends on your use case. Orthanc can happily run on a Raspberry Pi with 4 GB of RAM, or it can use a huge PostgreSQL database and manage terabytes of DICOM data without issues, depending on what you're asking it to do.
Setting MALLOC_ARENA_MAX is a good compromise between thread contention and memory usage, but it is more of a second-level setting: it will not help much if your setup is too constrained in the first place.
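For reference, MALLOC_ARENA_MAX is a glibc environment variable, so it can simply be passed to the container; a minimal sketch, with your_orthanc_image as a placeholder:

```sh
# Cap the number of glibc malloc arenas to reduce memory fragmentation
# ("your_orthanc_image" is a placeholder image name).
docker run -e MALLOC_ARENA_MAX=2 your_orthanc_image
```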
You can use a few different methods to find out whether your container is killed because of memory usage (a small consolidated check is also sketched after this list):
You can check the Docker container exit code (docker ps -a --no-trunc): 137 is a telltale sign of an OOM-killed container.
You can check the kernel ring buffer for OOM messages: dmesg | grep -i "oom"
You can use docker stats while your container is running and try to correlate the container exiting with a memory metric reaching a certain point.
You can limit the memory used by your container with the --memory and --memory-swap flags and check whether this makes the container crash faster. For instance: docker run -it --memory=1g --memory-swap=2g your_orthanc_image
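As the consolidated check mentioned above, something like this covers the first two points in one go (orthanc is a placeholder container name; docker inspect also exposes an OOMKilled flag in the container state):

```sh
# Exit code and OOMKilled flag of the stopped container
# ("orthanc" is a placeholder container name; adjust to your setup).
docker inspect --format 'exit={{.State.ExitCode}} oom_killed={{.State.OOMKilled}}' orthanc

# Kernel ring buffer entries mentioning the OOM killer
dmesg | grep -i "oom"
```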
Please let us know if this is still an issue and whether you need help or advice.
In the meantime, the crash happened again. It appears to happen while Orthanc is retrieving a file from S3.
I0103 12:59:59.379233 7f9fcc7c06c0 AWS S3 Storage:/StoragePlugin.cpp:159] AWS S3 Storage: reading range of attachment 868d300d-fc46-41b9-b27a-b057d4135dc3 of type 1
is the last entry before Orthanc stops responding.
The earlier entries for the same attachment are:
I0103 12:58:20.548129 DICOM-2 AWS S3 Storage:/StoragePlugin.cpp:110] AWS S3 Storage: creating attachment 868d300d-fc46-41b9-b27a-b057d4135dc3 of type 1
I0103 12:58:20.684867 DICOM-2 AWS S3 Storage:/StoragePlugin.cpp:134] AWS S3 Storage: created attachment 868d300d-fc46-41b9-b27a-b057d4135dc3 (8.03MB in 136.75ms = 492.86Mbps)
Have you tried enabling all relevant logs so that we can really see what the very last operation is before this crash?
You might want to enable the S3 logs with the EnableAwsSdkLogs option (in the S3 plugin configuration block), as well as all the other Orthanc logs.
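A minimal sketch of what that could look like, assuming the usual AwsS3Storage section name from the sample configuration (keep your existing bucket and credentials settings; this only adds the logging flag, and Orthanc configuration files accept comments):

```json
{
  "AwsS3Storage" : {
    // ... your existing BucketName / Region / credentials settings ...
    "EnableAwsSdkLogs" : true
  }
}
```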
Could you check what the exit code is when Orthanc is stopped like that?
Also, you might want to perform the same kind of operations with smaller studies, then bigger ones, and check whether the size is somehow correlated with how soon this crash happens.
Also, if it’s not too cumbersome based on your setup, you might want to compare the behavior of Orthanc inside a container with it running directly on your host (for instance, the LSB binaries).
The problem is intermittent, so I'm hesitant to turn on too much logging.
Logging is currently set to verbose.
If the S3 logs aren't too heavy, I will turn those on too.
S3 log volume depends on the number of studies you are handling and on the specific modalities you are using. For instance, some fMRI studies contain tens of thousands of very small instances; in that case, logging every object stored in S3 (1 object == 1 instance) will be very verbose. But many modalities have a much smaller number of instances per study. In your example, the instances seem to be on the bigger side (8 MB), so S3 log volume should be less of a problem (don't trust me on that and make some tests!).
In general, it would take a significant amount of logging to actually slow your system down, at least if you switch to stdout logging as suggested (it's quite common for containers to be verbose). I don't know how well writing logs to a shared folder is optimized by the container runtime.
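If the logs do go to stdout, following them and keeping only the S3-related lines is straightforward; a small sketch, with orthanc again as a placeholder container name:

```sh
# Follow the container's stdout/stderr and keep only S3-related lines
# ("orthanc" is a placeholder container name).
docker logs -f orthanc 2>&1 | grep -i "S3 Storage"
```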