OnChange Callbacks and CPU loads

I have a moderately complex setup running in Docker containers.

At this point, I am not looking for answers to my current problem, but suggestions for how I might debug and approach the issue.

Currently, I am seeing CPU usage gradually increasing over a few days after receiving and/or anonymizing some moderately sized exams (10-20k images). The activity does not correspond to any ongoing script activity, user activity, or entries in the logs.

This is with Orthanc versions 1.12.0 and 1.11.3, both running the Python plugin 4.0.

Running strace on the active Orthanc process produces repeating entries every 100ms similar to:

clock_nanosleep(CLOCK_REALTIME, 0, {tv_sec=0, tv_nsec=100000000}, NULL) = 0

Given the complexity of my setup, there are several possible avenues I could explore to track down this CPU usage. At this point, I am wondering about the OnChange callback I wrote and registered using the Python plugin.

I only make use of the STABLE_STUDY trigger, skipping any processing for all other triggers. I would not think that any other events passing through my code would create a CPU spike.
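
The callback is essentially just an early-return filter, something along these lines (a simplified sketch rather than the real script; ProcessStableStudy stands in for the actual study handler):

import orthanc

def OnChange(changeType, level, resourceId):
    # Everything except STABLE_STUDY returns immediately, so the per-event
    # cost should be negligible even when a 10-20k instance study arrives.
    if changeType != orthanc.ChangeType.STABLE_STUDY:
        return
    ProcessStableStudy(resourceId)  # stand-in for the actual study handler

orthanc.RegisterOnChangeCallback(OnChange)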

I do wonder about events generated at the NEW_INSTANCE level. Is it possible that the NEW_INSTANCE events for these 10-20k instance studies arrive faster than they can be acknowledged by my “ignore anything but stable study triggers” OnChange routine? This would imply that the system is able to write new instances to disk faster than it can process the events they generate.

Are these signals on the back end generated and sent for processing immediately, or queued in some way? I mean before the signal even reaches my own OnChange routine. Is there any kind of rate limiting on how quickly events can arrive?

I do not even know if the strace output is helpful. I am going to turn on the detailed/debug info in the logs to try to see if anything in the log corresponds to the CPU usage. If that doesn’t work, I will need to build a very simple test case setup and try to duplicate the problem.

Thanks for suggestions,
John.

On other Orthanc instances where no anonymizations are performed, but I still have an OnChange hook for studies, I see a low CPU burn (5-10%) at rest.

Digging into the strace output, it looks like there’s a fair amount of thread and message management going on, with calls to futex, poll, madvise and restart_syscall. There are a lot of “resuming interrupted” type messages. It makes me wonder whether the Docker setup simply cannot keep up with the background threading/messaging in the Orthanc event engine.

The odd thing is that the host has more than enough CPU capacity to spare.

I do not see this behavior on instances running older Orthanc versions (1.9.7). These are running mostly the same code base with OnChange hooks, but do not exhibit this sort of low CPU burn when not obviously receiving or processing DICOM.

Still investigating…

Hi John,

Thanks for all these investigations. Right now, I’m unable to explain why the CPU usage at rest would have increased between 1.9.7 and 1.11.3. We have added a memory housekeeping thread in 1.11.3. Maybe try going back to 1.11.2 to check whether it changes anything…

Concerning the events, they are indeed queued to be processed later on in the ChangeThread. With some logging, you should be able to detect how late they are being executed compared to the ingestion time.
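
For instance, something along these lines in the Python plugin would show, for each NEW_INSTANCE event, the gap between the ReceptionDate metadata of the instance and the wall-clock time at which the ChangeThread delivers the event (a rough sketch, assuming the instance still exists when the event is processed):

import time
import orthanc

def OnChange(changeType, level, resourceId):
    # Rough sketch: compare the moment the ChangeThread hands us a
    # NEW_INSTANCE event with the ReceptionDate metadata of that instance.
    if changeType == orthanc.ChangeType.NEW_INSTANCE:
        received = orthanc.RestApiGet('/instances/%s/metadata/ReceptionDate' % resourceId)
        if isinstance(received, bytes):
            received = received.decode()
        orthanc.LogWarning('NEW_INSTANCE %s received at %s, dequeued at %s'
                           % (resourceId, received, time.strftime('%Y%m%dT%H%M%S')))

orthanc.RegisterOnChangeCallback(OnChange)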

Concerning the sleep(100ms), there are quite a few places where this happens in Orthanc, notably in the main thread.

Best regards,

Alain.

Thanks for the info on the threading/events.

I found one of my Orthancs running idle at 40% CPU less than 24 hours after restarting. So I turned on the remaining VERBOSE logging options and restarted.

One thing that I discovered, though I do not know if it matters to my particular problem, is that the latest Osimis Docker images have WebDAV enabled by default. I was seeing WebDAV cleanup operations every minute or so in the log when I knew I had not turned it on. So I now explicitly turn it off, and the WebDAV entries have dropped out of the log.

I am going to come back to this Orthanc in a few hours to see if its CPU has climbed from the more typical idle of 0.5 to 1.0%.

John.

I notice a gradual increase in CPU at idle with time, though I cannot determine if that would continue because the CPU spikes (understandably) during active anonymization jobs.

Note that when I say “anonymization” here, I am talking about my Lua/Python driven anonymization scripts and not just the basic Orthanc anonymization call.

It’s after the anonymization completes and there are no more entries going to the log that the newer 1.11.3 continues to burn CPU. The log files indicate no activity, even with all the VERBOSE flags turned on. But the CPU remains high.

I thought perhaps some sort of post-anonymization activity like storage compression might be going on. I do not have StorageCompression turned on explicitly. I confirmed that the /tmp/Config.json file that the Osimis Docker image prepares does not even have a StorageCompression setting. Could StorageCompression be ON by default if I do not explicitly turn it OFF?

I can see there are quite a few threads still “hanging around” after an anonymization. htop indicates that two threads consume the bulk of the CPU usage between them with a couple dozen other threads mostly idling.

Could this be an issue with Lua? I ask because my Lua script is moderately large and includes recursive routines for crawling nested DICOM sequences. Is it possible that some stack management of memory from the recursion is still unwinding even after the entire script is complete?

I do notice that the high idle CPU usage usually corresponds to Orthancs of mine that have requested a lot of memory from the system. As a counter-argument to this being a docker problem, the postgres containers I run in parallel with Orthanc use similar amounts of memory without idling at high CPU.

Restarting any of the hot idling containers returns the CPU usage to more typical background levels at idle, which then grow slowly with time until the next anonymization call creates another spike in CPU and subsequent hot idling.

Still digging.

John.

Hi John,

Storage compression is disabled by default and is, anyway, performed synchronously.

In 1.11.3, we have introduced threading to implement anonymization and modification, but, of course, the threads should go away at the end of the job. Are you interrupting/pausing anonymization jobs?

Your Lua/Python scripts are quite complex for an outsider, so it would be nice if you could simplify them as much as possible.

Best regards,

Alain.

I am not interrupting or pausing jobs explicitly.

I agree my Lua script is too complex for casual debugging. I don’t see that changing soon because the desired result is rather complex. I understand, and don’t expect help there.

That said, I anonymize at the study level, processing all instances for a given study. There are successive calls to modify and anonymize, since I need to keep some fields, remove others, and modify still others on top of the basic anonymization performed by anonymize.
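
Stripped of all the project-specific logic, the sequence is roughly the following (an illustrative sketch only, shown through the Python plugin's RestApi helpers rather than my Lua; the tag names and values are placeholders, not the ones I actually use):

import json
import orthanc

def AnonymizeStudy(studyId):
    # Study-level anonymization first; the call only returns once the new
    # (anonymized) study has been written, since the route is synchronous.
    answer = json.loads(orthanc.RestApiPost(
        '/studies/%s/anonymize' % studyId,
        json.dumps({
            'Keep': ['StudyDescription', 'SeriesDescription'],  # placeholder tags
            'Replace': {'PatientName': 'ANONYMIZED'},            # placeholder value
        })))
    anonymizedId = answer['ID']

    # Then a follow-up modification of the anonymized copy, again synchronous.
    orthanc.RestApiPost('/studies/%s/modify' % anonymizedId,
                        json.dumps({'Remove': ['InstitutionName']}))  # placeholder tag
    return anonymizedId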

I missed that 1.11.3 introduced threading for modify and anonymize. Those are still synchronous, yes? My script is not monitoring execution status - at least not the way I would monitor the jobs service within Orthanc. I assume that the process is complete when the RestApi call to modify/anonymize returns.

I suppose it’s possible I could be generating thousands of threads with my anonymize/modify operations and not know it. That said, the output files do appear to be anonymized as I expect. Hopefully, I’m not (manually) sending the files off to their destination before anonymization is complete.

I ran a test anonymization this morning and within a few hours the CPU was at 25%. That’s after I deleted all images off the Orthanc. So it’s busy doing something.

I’m going to dig into your new documentation on the threading of modify/anonymize to see if I need to change anything.

John.

Disabling my OnChange callback hook altogether has no effect.

Stopping (not destroying) and restarting the docker Orthanc container brings the CPU back down to normal idle levels.

In the past, I had the vague impression that stop/starting the Orthanc process reset the stability measurements so that any formerly not stable studies/series/images would become stable after the restart. It was based on that impression that I thought the hot idling might have to do with the stability messaging queue - since it seems to go away with reboot.

But now, I have turned off my own OnChange hook and I still see the CPU idle hot after I run one of my Lua anonymizations.

Other than strace, does anyone have any other ideas for exploring behind the scenes to try to figure out what the process is doing with 25% of the CPU? As I mentioned earlier, strace only really shows a lot of thread signaling.

John.

The idling seems to scale with the size of the study being anonymized.

Is there any kind of prioritization going on with the threads behind anonymize/modify? I am wondering if threads are simply not fully completing and/or are locked in a battle of politeness:

1: “No, you first”
2: “No, you”
1: “I insist”
2: “No, no, I couldn’t. You must.”

I left the various thread counts in the configuration.json file at their default values. Would the threading associated with api calls from Lua/Python to anonymize/modify be governed by the HttpThreadsCount thread count? I’m guessing the HTTPThreads should have nothing to do with the Lua/Python RestAPI calls, which themselves are probably built directly into the Orthanc back end.

Suppose my RestApi calls to anonymize/modify from Lua/Python were indeed going through the HTTP server, could there be an interaction here between the HTTP_TIMEOUT setting and my anonymize/modify calls? As though I have the timeout too high and the threads will complete once the HTTP timeout is reached?

Apologies, but I am grasping at straws now.

John.

Yes, they are synchronous.

In the past, I had the vague impression that stop/starting the Orthanc process reset the stability measurements so that any formerly not stable studies/series/images would become stable after the restart. It was based on that impression that I thought the hot idling might have to do with the stability messaging queue - since it seems to go away with reboot.

Indeed, after a reboot, all unstable studies become stable, and the STABLE_STUDY events for these studies are actually never generated.

But now, I have turned off my own OnChange hook and I still see the CPU idle hot after I run one of my Lua anonymizations.

How do you trigger your anonymization then? With tools/execute-script? Does this issue still happen if you simply always anonymize the same study through a simple Lua script that calls the anonymize route? That would allow us to start investigating…

I left the various thread counts in the configuration.json file at their default values. Would the threading associated with api calls from Lua/Python to anonymize/modify be governed by the HttpThreadsCount thread count? I’m guessing the HTTPThreads should have nothing to do with the Lua/Python RestAPI calls, which themselves are probably built directly into the Orthanc back end.

Your understanding is correct.

Other than strace, does anyone have any other ideas for exploring behind the scenes to try to figure out what the process is doing with 25% of the CPU? As I mentioned earlier, strace only really shows a lot of thread signaling.

Could you just monitor the thread count and see if there are idle threads piling up? If Orthanc were still actively performing something like a modification/anonymization, you would see logs in verbose mode while Orthanc stores files in the storage.
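
For example, a small helper along these lines could log the thread count of the Orthanc process once a minute (Linux only; the PID is something you would look up first, e.g. with docker top on the container):

#!/usr/bin/env python3
# Print the number of threads of a given process once a minute,
# by listing the /proc/<pid>/task directory (Linux only).
import os
import sys
import time

pid = int(sys.argv[1])  # PID of the Orthanc process, looked up beforehand
while True:
    count = len(os.listdir('/proc/%d/task' % pid))
    print('%s  threads=%d' % (time.strftime('%H:%M:%S'), count), flush=True)
    time.sleep(60)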

Best regards,

Alain.

In some cases, the anonymization is triggered by an OnChange hook in the Python plugin, which itself then calls execute-script to invoke a Lua anonymization function. In other cases, I manually trigger the anonymization through a web form that calls a special URL (again in the Python plugin), with Python again calling execute-script to invoke the Lua.

In either case (automatic or manual) I see the subsequent jump in CPU usage after the Lua script completes.

I’ve got one Orthanc idling hot at 15% CPU with no activity in the logs. The logs indicate the anonymization lua script finished successfully yesterday. With verbose turned on, I do not see any messages in the logs during these high CPU usage times.

I ran a script to count the threads for the dozen or so different Orthancs I have running in docker. All have 65-66 threads with CPU usage idling at 15% for one Orthanc and less than 1% for all the others.

Note that the accompanying Postgres servers do not show similar CPU usage when idling. So, it does not appear to be any housekeeping that would involve updating the sql database. DICOM files are stored on disk - not in the SQL database.

When I get a chance, I will start reverting to older versions of Orthanc and testing until I get to a version where I don’t see the hot idle CPU behavior.

Thanks for your suggestions,
John.

See my reply above.

Update

I tried invoking an old-style, standard RestAPI studies/*/anonymize call and observed the same high idle following the completion of the anonymization. In this case, the only portion of my own scripts that would be invoked is the IncomingHTTPFilter in Python, where I perform some simple access control.

In a second test, I turned off my IncomingHTTPFilter so that I am invoking the anonymize process directly via Orthanc Explorer. None of my code is intervening (apart from being loaded and made available by the Orthanc backend).

In both cases, the RestAPI anonymization creates a CPU spike in both the Orthanc and accompanying postgres. Once complete, the CPU settles down to 2-3%, but not less than 1%.

If I subsequently delete the anonymized study and the original study, the Orthanc briefly spikes and settles down to 13-15% CPU usage while idle. I see some minor ongoing CPU usage by the postgres, less than 1% but higher than the other idling Postgres.

The postgres CPU eventually settles down to less than 0.05% while the Orthanc CPU remains at 13-15% with no apparent activity - no incoming data and no resident images to manage.

John.

Hi John,

Ok, let’s try to reproduce the issue.

I suggest starting from the docker-compose below. It runs two Orthancs, versions 23.5.1 and 22.12.2.
On both Orthancs, I upload a 600-instance study, anonymize it, and then delete the anonymized study.
After that, here is what docker stats shows:
[screenshot: docker stats output]

Could you try to reproduce your issue starting from this setup?

Thanks

Alain

version: "3"
services:
  orthanc-a:
    image: osimis/orthanc:23.5.1
    depends_on: [orthanc-index-a]
    restart: unless-stopped
    ports: ["4242:4242", "8042:8042"]
    volumes: ["orthanc-storage-a:/var/lib/orthanc/db"]
    environment:
      VERBOSE_STARTUP: "true"
      VERBOSE_ENABLED: "true"
      ORTHANC__AUTHENTICATION_ENABLED: "false"
      ORTHANC__POSTGRESQL__HOST: "orthanc-index-a"

  orthanc-index-a:
    image: postgres:14
    restart: unless-stopped
    ports: ["5432:5432"]
    volumes: ["orthanc-index-a:/var/lib/postgresql/data"]
    environment:
      POSTGRES_HOST_AUTH_METHOD: "trust"

  orthanc-b:
    image: osimis/orthanc:22.12.2
    restart: unless-stopped
    depends_on: [orthanc-index-b]
    ports: ["8043:8042"]
    volumes: ["orthanc-storage-b:/var/lib/orthanc/db"]
    environment:
      VERBOSE_STARTUP: "true"
      VERBOSE_ENABLED: "true"
      ORTHANC__AUTHENTICATION_ENABLED: "false"
      ORTHANC__POSTGRESQL__HOST: "orthanc-index-b"

  orthanc-index-b:
    image: postgres:14
    restart: unless-stopped
    volumes: ["orthanc-index-b:/var/lib/postgresql/data"]
    environment:
      POSTGRES_HOST_AUTH_METHOD: "trust"

volumes:
  orthanc-storage-a:
  orthanc-index-a:
  orthanc-storage-b:
  orthanc-index-b:


I ran a test anonymization with 768 images. After the anonymization, Orthanc A settled down to 2-3% CPU. After I deleted the images (leaving Orthanc A empty), it bumped up to 9-10% CPU usage and remains there many minutes later.

[screenshot: docker stats output]

I am invoking the anonymization from Orthanc Explorer in the browser. I can try from the command line with curl.

Update: I tried doing everything with curl from the command line and observed the same behavior. 2-3% CPU when idling after the anonymization, 9-10% idle after deleting the studies. I am going to try on another host altogether and see what I find.
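
For reference, the command-line test boils down to roughly this sequence (sketched here in Python instead of curl; the base URL is Orthanc A from Alain's compose file and the file path is a placeholder):

import glob
import requests

BASE = 'http://localhost:8042'  # Orthanc A from the compose file above

# 1. Upload every instance of the test study.
for path in glob.glob('/path/to/test-study/*.dcm'):
    with open(path, 'rb') as f:
        requests.post(BASE + '/instances', data=f.read()).raise_for_status()

# 2. Anonymize the whole study (default, synchronous anonymization).
studyId = requests.get(BASE + '/studies').json()[0]  # assumes a single study is loaded
anonymized = requests.post(BASE + '/studies/%s/anonymize' % studyId, json={}).json()

# 3. Delete the anonymized copy and the original, then watch "docker stats".
requests.delete(BASE + '/studies/%s' % anonymized['ID'])
requests.delete(BASE + '/studies/%s' % studyId)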

John.

On a different host, I performed the same test (768 images) with your docker setup.

After anonymization, the Orthanc A settled into 1-2% CPU, not much more than the background CPU usage of other orthancs on the same host.

After deleting both the anonymized and original studies, Orthanc A rose to 4-5% CPU usage and remains there minutes later.

Note: I am anonymizing (and deleting) at the study level if that makes any difference.

In the Postgres database, I can see there are various tables recording things like changes (and deletions?). Are all the SQL tables filled synchronously from the Orthanc side, or are any SQL operations queued on the Orthanc side? Is it possible that I am encountering a race condition when I delete the images before all their metadata has synced with the SQL database? Does deletion take precedence over a backlog of metadata waiting to be recorded in the SQL database?

John.

Since this may become important, here are some stats on the machines where I am running docker versions of Orthanc:

Main system (where I first observed the issue)
CentOS Linux release 7.9.2009 (Core)
Docker version 24.0.1, build 6802122
3.10.0-1160.90.1.el7.x86_64

Second system
CentOS Linux release 7.9.2009 (Core)
Docker version 23.0.1, build a5ee5b1
3.10.0-1160.83.1.el7.x86_64

If I get a chance, I will test Alain’s simple Docker setup on one or more Fedora, Ubuntu, Mac or Windows hosts.

John.

I mentioned earlier that when I run strace on one of these cases, I see a lot of back end futex calls. There are messages regarding “unfinished” stuff as well as futex timeouts and resumptions.

I’m wondering if this might be related to a very old bug in futex_wait(), even though I am running an up-to-date CentOS distribution:

https://groups.google.com/g/mechanical-sympathy/c/QbmpZxp6C64

I have not had a chance yet to test Alain’s simple docker test on a different OS host.

John.

This more recent post regarding race conditions with threads and futex signaling suggests it might be a hardware issue:

I might try figuring out if I can force Docker onto one CPU, or run it in such a way as to test whether this is an underlying hardware issue.

John.

Hmm. Given my rudimentary understanding of how CentOS manages updates, the kernel I have really is of the circa-3.10 generation, but with layers of Red Hat backports patched on. I wonder if they never backported the fix for the futex_wait problem, which according to my earlier link was not addressed until 3.14 or so.

I just ran Alain’s simple Docker setup on a Fedora 37 system. After anonymization, the CPU settles down to normal levels. That’s good! If I subsequently delete both the original and anonymized studies, the CPU usage consistently runs 1.5 to 2.0%, certainly not as much as I was seeing on my main hosts, but not the normal <1%.

strace shows similar futex unfinished/wake/resumed messages.

Fedora 37
6.2.14-200.fc37.x86_64
Docker 23.0.5 build bc4487a