What motivated my patch is crashing on large non-dicom files: bugs.orthanc-server com/show_bug.cgi?id=217
Improvement 1: Currently indexer reads all files into memory and tries to find their instanceid and hence determines if they’re dicom. My patch uses DCMTK to open files without copying them to memory and determine that they’re really valid DICOM. This means that a 20GB file in the folder would no longer cause the indexer to crash as we now determine that this file is not DICOM early.
Improvement 2: I chose to check for DICOM validity using DCMTK instead of checking the file signature, because it would otherwise be too easy for a malicious one to prepend a 20GB file with the correct signature hence causing copying the whole file into memory and crashing.
Improvement 3: My patch makes finding the instance number not require reading the whole file into memory, hence reducing the amount of code that requires the full file in memory. With my patch, the only time we read the whole file is for the line OrthancPlugins::RestApiPost(upload, "/instances", dicom, false);. So my patch brings orthanc-indexer closer to working without reading the file in full.
Dear orthanc developers, wouldn’t it be beneficial to have a function in the OrthancPlugins namespace that has the same api of malloc, plus filepath as another param, that "mmap"s a file so that in all these cases when we keep the whole file in memory for passing it around, we would actually use not so much RAM and not crash when we don’t have so much RAM? This would save much RAM
Thanks for your contribution! However, even if we appreciate it, we cannot accept your patch as such, as it introduces a dependency of the orthanc-indexer plugin on the DCMTK library, which would make the production of the plugin much more complex.
The proper solution would be to use to the Orthanc::DicomStreamReader class that is part of the Orthanc framework and that doesn’t depend on DCMTK. It would also be important to add a configuration option to disable this behavior, as in some rare situations, DICOM datasets might be stored without their associated DICOM File Meta Information header.
I will have a look at this, but this is not a high-priority task.
DicomStreamReader would not work well as there’s a note in the file you linked saying “this class won’t work well inside sequences” whereas we do currently seek patientid inside as well so it would be backward incompatible.
My better idea is this patch github com/CGDogan/orthanc-indexer/commit/657200e2b4cd90b31b0ea03cc5b1c8aa91de0531 but it’s not ready for testing. It adds a wrapper that mmaps files if possible otherwise reads them. Would you accept this after it’s complete and I get CLA confirmation?
DicomStreamReader will work just fine, as content of sequences must not be explored to get the patient ID.
As far as the CLA is concerned, I have not been affiliated with Osimis for 2 years, so I don’t get a copy of CLA anymore. Furthermore, the orthanc-indexer plugin is from UCLouvain, so I guess the CLA would be meaningless. You should discuss this with Osimis.
For the CLA part, OK, but for the first part, my new patch is nevertheless more important than my previous patch and/or whether we would use DicomStreamReader. This is because at least at some points we do need the whole file in the memory as a byte array, this way or another. So it’d be excellent if you could comment on whether you would accept a patch introducing a class that allows reading files using mmap if supported, which is what I linked.
The class DicomStreadReader can work on any C++ stream, which includes the standard std::ifstream. The latter stream doesn’t load the file in memory, and is entirely cross-platform (contrarily to mmap). So, I would not accept a patch involving mmap. Again, as stated before, I’ll have a look at the issue, but not immediately.
It makes perfect sense that the indexer is a lower priority part. A patch in the upstream isn’t a priority for me either as we needed a fork anyway. But when you do start working on it, whenever in the future, could you please check out this boost::mapped_file patch? I’ve proofread it countless times so it should be ready for you whenever you’d like. github com/CGDogan/orthanc-indexer/commit/eb3b5fa8a40a6b592e9573aaa69c12f263bdc193 (to get the .patch file, add “.patch” to the end of the URL) I’ve seen an enormous boost in efficiency of the plugin with this patch, it reduces the number of redundant copies by the plugin by 50%. My scope of work is changing and I won’t be submitting more patches (my local fork isn’t too relevant) so keep up the good work and regards.
If you like the patch, Orthanc users will certainly see a visible benefit from a similar patch you could prepare for FilesystemStorage.cpp . Noteworthy drop in memory & electricity usage in orthanc. In fact, I would like to even propose a RegisterStorageArea3 since this is what would make Orthanc really efficient whether or not someone already uses the PostgreSQL plugin but the order of priorities is up to you as always.