fastest way to ingest a LOT of data

I’m hoping to load about 18 TB of images (~22k studies) into Orthanc. I was wondering if anyone has advice on fast, or at least faster, ways to do this.

I did some measuring. With a very small test study (just one series in fact):

  • dcmsend or storescu - 30 seconds
  • curl (post, with expect header set) - 25 seconds
  • curl, using GNU parallel, -j8 - 16 seconds
  • DICOMweb from OsiriX MD - 22 seconds

Parallel curl is clearly a winner here (except that it makes checking for and recovering from failures rather harder).

Is there anything I can do to MASSIVELY increase the speed?

Thanks!

Hello,

First of all, make sure to use the PostgreSQL index plugin:
https://book.orthanc-server.com/faq/scalability.html#recommended-setup-for-best-performance
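Concretely, this means loading the PostgreSQL plugin and pointing the index at your database in the configuration file. Something along these lines (the plugin path, host, port, and credentials are placeholders to adapt):

{
  "Plugins" : [ "/usr/share/orthanc/plugins" ],
  "PostgreSQL" : {
    "EnableIndex" : true,
    "EnableStorage" : false,
    "Host" : "localhost",
    "Port" : 5432,
    "Database" : "orthanc",
    "Username" : "orthanc",
    "Password" : "orthanc"
  }
}

Keeping "EnableStorage" at false leaves the DICOM files on the filesystem and only moves the index to PostgreSQL, which is the setup recommended in the link above.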

Another built-in possibility to import data would be to try with WebDAV:
https://book.orthanc-server.com/users/webdav.html
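With WebDAV, you can mount Orthanc as a network drive and copy DICOM files into its "uploads" folder; since WebDAV is plain HTTP underneath, the same folder is also reachable with an HTTP PUT. A rough sketch with Python (default port and credentials assumed):

import requests

# Hypothetical example: PUT one DICOM file into the WebDAV "uploads" folder.
with open('file.dcm', 'rb') as f:
    r = requests.put('http://localhost:8042/webdav/uploads/file.dcm',
                     data=f, auth=('orthanc', 'orthanc'))
r.raise_for_status()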

Note that all these methods use either HTTP or DICOM, which degrades performance because of frequent network handshakes. You could try HTTP clients that use keep-alive connections (Java, or “requests” in Python).
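With “requests”, the key is to reuse a single Session object, so that the underlying TCP connection stays open across uploads. A minimal sketch (the folder and credentials are placeholders):

import glob
import requests

# A single Session keeps the TCP connection alive across uploads,
# avoiding one handshake per instance.
session = requests.Session()
session.auth = ('orthanc', 'orthanc')  # placeholder credentials

for path in glob.glob('/incoming/*.dcm'):  # placeholder folder
    with open(path, 'rb') as f:
        r = session.post('http://localhost:8042/instances', data=f.read())
    r.raise_for_status()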

I presume that the best performance would come from locally mounting the drive containing the DICOM files, then using a thread pool in Python to run multiple REST uploads in parallel against localhost. Running on localhost should make the handshake time negligible. A Python sample demonstrating the use of such a thread pool is available:
https://hg.orthanc-server.com/orthanc/file/tip/OrthancServer/Resources/Samples/Python/HighPerformanceAutoRouting.py
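In the same spirit as that sample, here is a stripped-down sketch of such a thread pool, using “requests” with one keep-alive session per worker (folder, credentials, and worker count are placeholders to adapt):

import glob
import threading
import requests
from concurrent.futures import ThreadPoolExecutor

local = threading.local()

def upload(path):
    # One keep-alive Session per worker thread.
    if not hasattr(local, 'session'):
        local.session = requests.Session()
        local.session.auth = ('orthanc', 'orthanc')  # placeholder credentials
    with open(path, 'rb') as f:
        r = local.session.post('http://localhost:8042/instances', data=f.read())
    r.raise_for_status()

with ThreadPoolExecutor(max_workers=8) as pool:
    list(pool.map(upload, glob.glob('/incoming/*.dcm')))  # placeholder folder

Failed uploads surface as exceptions out of pool.map, which makes logging and retrying failures easier than with parallel curl.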

You might also want to try a Python plugin so as to bypass any network connection, and directly “talk” to the core of Orthanc:
https://book.orthanc-server.com/plugins/python.html

Such a Python plugin would use Python’s “threading” module to create a pool of threads that read DICOM files from a folder, then call “orthanc.RestApiPost(‘/instances’, dicom)” for each of those DICOM files:
https://book.orthanc-server.com/plugins/python.html#listening-to-changes
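A rough sketch of what such a plugin could look like (the incoming folder is a placeholder, and this assumes your Orthanc build allows concurrent calls into the core from Python threads):

import glob
import threading
from concurrent.futures import ThreadPoolExecutor

import orthanc  # module provided by the Orthanc Python plugin

def Ingest():
    def Upload(path):
        with open(path, 'rb') as f:
            # No HTTP involved: this hands the file straight to the Orthanc core.
            orthanc.RestApiPost('/instances', f.read())
    with ThreadPoolExecutor(max_workers=8) as pool:
        list(pool.map(Upload, glob.glob('/incoming/*.dcm')))  # placeholder folder

def OnChange(changeType, level, resource):
    # Kick off the ingestion once Orthanc has started, without blocking the callback.
    if changeType == orthanc.ChangeType.ORTHANC_STARTED:
        threading.Thread(target=Ingest).start()

orthanc.RegisterOnChangeCallback(OnChange)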

Sébastien-

For outside DICOM directories, like referral CDs and outside studies downloaded from other sources, I have used TomoVision’s free utility Dicomanager.
I have been able to send to both DCM4CHEE and Orthanc.

I can’t say I’ve timed it, but it seems to transfer films pretty darned fast, with their DICOM tags intact.

Might be worth a try… it is built for mass transfer with no coding, as long as the source is a DICOM directory or just a bunch of DICOM files in a known location.

I would put an Orthanc instance using PostgreSQL on the same physical machine as the source files for speed, and see what happens.

Just did a test with the TomoVision utility from a slow machine, across my internal network, to a DCM4CHEE virtual machine: less than 3 seconds per film in a 9-film series. I will try, if I have time, to send to my Orthanc instance (I’m just running the default SQLite).

OK, just confirmed… source on a 15-year-old Pentium dual-core computer, to Orthanc running on a Raspberry Pi 3B+ with a USB hard drive. Talk about a torture test… 3 seconds per film using Dicomanager.
These are high-quality direct digital X-rays from a veterinary DR machine by Sedecal.

So if you can arrange your source in DICOM directories (or it already is), you’d probably have to divvy up the batches: Dicomanager will recursively search a root directory for you, but the whole thing would likely throw memory errors before it could load the entire batch. Still, at least in this small test, 3 seconds per film on some pretty limited hardware is a pretty good mark, I think.

You might also want to try a Python plugin so as to bypass any network connection, and directly “talk” to the core of Orthanc:
https://book.orthanc-server.com/plugins/python.html

I just tried this method - I had it ingest a directory full of files (serially, no threading).

import datetime
from glob import glob

# Runs inside the Orthanc Python plugin, which provides the "orthanc" module.
starttime = datetime.datetime.now()
for fname in glob('/incoming/*dcm'):
    with open(fname, 'rb') as f:
        orthanc.RestApiPost('/instances', f.read())
    print(fname)
print('elapsed:', datetime.datetime.now() - starttime)

Interestingly, it was almost exactly the same speed as dcmsend from another host, so I suspect the simple approach (parallel curl) is about as fast as it’s reasonably going to get.

Fortunately, that works out to only about 3-4 weeks of ingestion time, which is totally reasonable for a project of this size.

Thanks.

Relatedly, before I start on this project: is it even reasonable to be using Orthanc for this volume of images? When I search this group, I see people talking about 1 TB and 5 TB instances, a few as big as 10 TB, but not really any bigger.

I’m planning on starting with about 20 TB and expect that to grow by ~ 3-4 TB a year. Is that a problem for Orthanc?

The backend storage will be an iXsystems TrueNAS mounted via NFS over 40 GbE. The PostgreSQL database will be stored locally on SSD (RAID 1). How large should I expect that index to get?

Hello,

As far as I’m concerned, all I can do is refer you to the following FAQ in the Orthanc Book:
https://book.orthanc-server.com/faq/scalability.html

Sébastien-

It all depends on your hardware. I have never had issues with 20 TB+ instances of Orthanc, as long as you use PostgreSQL for indexing.

Great. We’ll be throwing plenty of hardware at it, using PostgreSQL, and following the scaling instructions in the Book!