Checking for duplicated instances with the same instance number

Hello, I'm having an issue with a Siemens CT scanner: when it sends the study a second time to Orthanc, it changes the instance UID, and Orthanc doesn't delete the duplicated instances.

The CT scanner keeps

  • 0020,0013 (InstanceNumber): 1

the same, but

  • 0002,0003 (MediaStorageSOPInstanceUID): 1.3.12.2.1107.5.1.4.86707.30000022032323040628900014391

changes each time the study is sent. Is there a way to reject the instances, or to delete the previous instances, when receiving the study? Ideally it would be done with the Python plugin.

Hello,

You could listen to incoming instances in the Python plugin using the orthanc.ChangeType.NEW_INSTANCE change: Python plugin for Orthanc — Orthanc Book documentation

Then, for each incoming instance, make a /tools/find request against a combination of the SeriesInstanceUID and InstanceNumber tags (using the orthanc.RestApiPost() function in Python): REST API of Orthanc — Orthanc Book documentation

If the result is non-empty and matches the SOPInstanceUID of the incoming instance, either delete the new instance or the previous instance (using the orthanc.RestApiDelete() function in Python): REST API of Orthanc — Orthanc Book documentation
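Put together, a minimal sketch of this approach could look as follows (untested; it keeps the instance that has just been received and deletes any older instance sharing the same SeriesInstanceUID and InstanceNumber):

import json
import orthanc

def OnChange(changeType, level, resourceId):
    # React only to newly received instances
    if changeType == orthanc.ChangeType.NEW_INSTANCE:
        tags = json.loads(orthanc.RestApiGet('/instances/%s/simplified-tags' % resourceId))

        # Look for instances sharing the same SeriesInstanceUID and InstanceNumber
        query = {
            'Level': 'Instance',
            'Query': {
                'SeriesInstanceUID': tags['SeriesInstanceUID'],
                'InstanceNumber': tags['InstanceNumber']
            }
        }
        hits = json.loads(orthanc.RestApiPost('/tools/find', json.dumps(query)))

        # Delete every match except the instance that has just been received
        for hit in hits:
            if hit != resourceId:
                orthanc.RestApiDelete('/instances/%s' % hit)

orthanc.RegisterOnChangeCallback(OnChange)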

HTH,
Sébastien-

Hi Jodogne, thanks for the reply and the information, I had missed ChangeType.NEW_INSTANCE.
Is there somewhere I can find a list of the Orthanc Python plugin functions, other than the Python plugin page?

I did manage to make a script to fix one of the studies I had issues with, but I will implement this for future use. Is there a performance difference between using /tools/find and a specific custom REST callback with the Python plugin?

Here is the script I wrote to fix one study at a time:

import requests

baseUrl = "#yourOrthancIp"
auth = ("#yourauthmethod")

# Fetch the study to clean, by its Orthanc identifier
r_study = requests.get(baseUrl + "/studies/#studyId", auth=auth)

duplicated_instances = []

for series in r_study.json()["Series"]:
    r_s = requests.get(baseUrl + "/series/" + series, auth=auth)
    print("checking series: " + series)

    # InstanceNumber restarts in every series, so track the numbers seen per series
    instances_list = []

    for instance in r_s.json()["Instances"]:
        print(instance)

        r_i = requests.get(baseUrl + "/instances/" + instance + "/simplified-tags", auth=auth)
        if r_i.json()['InstanceNumber'] in instances_list:
            # An instance with this InstanceNumber already exists in the series: delete the duplicate
            duplicated_instances.append(instance)
            requests.delete(baseUrl + "/instances/" + instance, auth=auth)
            print("instance already in list: " + instance + ", deleting")
        else:
            instances_list.append(r_i.json()['InstanceNumber'])
            print(r_i.json()['InstanceNumber'])

Hi Francisco,

Here is how to list the available API
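For instance, a small introspection snippet along these lines (a sketch relying only on standard Python introspection, to be run from inside a Python plugin) prints every function and class exposed by the orthanc module, together with its documentation string:

import inspect
import orthanc

# Dump the name and docstring of every function and class in the "orthanc" module
for name, obj in inspect.getmembers(orthanc):
    if inspect.isroutine(obj) or inspect.isclass(obj):
        print('%s: %s' % (name, inspect.getdoc(obj)))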

HTH,

Alain

Hi Jodogne, after some work and testing I managed to come up with two scripts: one to fix the files already duplicated on a server, and the other to filter out the incoming C-STORE instances. So far, the fix script scanned 2350 instances in about 30 seconds on a local Orthanc install. I need to fix a server with about 13 million instances, and I believe around half of them are duplicates. Is there a better way to do this? I don't want to consume all of Orthanc's resources deleting the images, nor do I want to be waiting until next year to delete them all. Is this possible with peering?

By the way, the server has been performing really well. We are using a PostgreSQL database on NVMe, and the storage is on 6x 6 TB drives in RAID-Z2; we get around 300-600 MB/s. It could be better, but I can't seem to find where the overhead is.
Here are the scripts.

Duplicated_Instances_fix.py

import datetime
import time
import requests
import json

from requests.auth import HTTPBasicAuth

baseUrl = "http://127.0.0.1:8042"
auth = HTTPBasicAuth('orthanc', 'orthanc')

r_studies = requests.get(baseUrl + "/studies", auth=auth)

scanned_instances_count = 0
scanned_series_count = 0
scanned_studies_count = 0
deleted_instances_count = 0
start_time = datetime.datetime.now()

print(datetime.datetime.now())

for study in r_studies.json():
    study_start_time = datetime.datetime.now()
    print("removing duplicated instances for study: %s" % study)
    r_study = requests.get(baseUrl + "/studies/" + study, auth=auth)

    for series in r_study.json()["Series"]:
        deleted_instances = []
        series_start_time = datetime.datetime.now()

        r_series = requests.get(baseUrl + "/series/" + series, auth=auth)
        print("removing duplicated instances for series: ", series)
        print(datetime.datetime.now())

        for instance in r_series.json()["Instances"]:

            if instance in deleted_instances:
                # This instance was already removed as a duplicate earlier in the loop
                print("instance already deleted, skipping")
                continue

            time.sleep(0.001)
            r_instance = requests.get(baseUrl + "/instances/" + instance + "/simplified-tags", auth=auth)
            data = {
                "Level": "Instance",
                "Query": {
                    "SeriesInstanceUID": r_instance.json()["SeriesInstanceUID"],
                    "InstanceNumber": r_instance.json()["InstanceNumber"]

                }
            }

            r_find = requests.post(baseUrl + "/tools/find", json=data, auth=auth)
            find = r_find.json()

            while len(find) > 1:
                print("duplicates in series found")

                r_delete = requests.delete(baseUrl + "/instances/" + find[-1], auth=auth)
                print("deleted duplicate: ", find[-1])
                deleted_instances.append(find[-1])
                deleted_instances_count += 1
                del find[-1]

            scanned_instances_count += 1

        scanned_series_count += 1

        series_end_time = datetime.datetime.now()
        series_delta_time = series_end_time - series_start_time
        print("Time to scan and delete duplicates on this series: ", series_delta_time)

    scanned_studies_count += 1
    study_end_time = datetime.datetime.now()
    study_delta = study_end_time - study_start_time
    print("Time to scan and delete duplicates on this study:", study_delta)


end_time = datetime.datetime.now()
total_time = end_time - start_time

print("Deleted a total of ", deleted_instances_count, "Instances")
print("Scanned a total of ", scanned_studies_count, " studies, ", scanned_series_count, " series, ", scanned_instances_count, " instances")
print("in ", total_time, "S")


Incoming_duplicate_check.py

import json
import orthanc

def FilterIncomingCStoreInstance(receivedDicom):
    instance_json = json.loads(receivedDicom.GetInstanceSimplifiedJson())
    data = {
        "Level": "Instance",
        "Query": {
            "SeriesInstanceUID": instance_json["SeriesInstanceUID"],
            "InstanceNumber": instance_json["InstanceNumber"]

        }
    }
    query = json.loads(orthanc.RestApiPost('/tools/find', json.dumps(data)))
    if len(query):
        # Delete the previously stored instance so that only the incoming copy is kept
        orthanc.RestApiDelete('/instances/%s' % query[0])
        print("deleted already stored instance: %s" % query[0])

    return 0


orthanc.RegisterIncomingCStoreInstanceFilter(FilterIncomingCStoreInstance)

Hi Alain, thanks for the info. I had not seen this part of the documentation; thanks to it, I was able to get the script working.

Hi Francisco,

Instead of using /tools/find instance by instance, you can, for each study, call http://localhost:8042/studies/.../instances?expand. You will then have all the main DICOM tags from all the instances, together with their ParentSeries, so you can build the list of instances to delete and feed it to /tools/bulk-delete.
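A minimal sketch of that approach, assuming duplicates are defined as in the scripts above (same ParentSeries and same InstanceNumber), could look like this:

import requests
from requests.auth import HTTPBasicAuth

baseUrl = "http://localhost:8042"
auth = HTTPBasicAuth('orthanc', 'orthanc')

for study in requests.get(baseUrl + "/studies", auth=auth).json():
    # One request per study returns the main DICOM tags of all its instances
    instances = requests.get(baseUrl + "/studies/" + study + "/instances?expand", auth=auth).json()

    seen = set()
    to_delete = []
    for instance in instances:
        key = (instance['ParentSeries'], instance['MainDicomTags'].get('InstanceNumber'))
        if key in seen:
            # Duplicate within the same series
            to_delete.append(instance['ID'])
        else:
            seen.add(key)

    if to_delete:
        # Remove all the duplicates of this study in a single call
        requests.post(baseUrl + "/tools/bulk-delete", json={"Resources": to_delete}, auth=auth)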

HTH,

Alain

Hi Alain, that seems better, as the number of REST calls is going to decrease dramatically. But I have to check the sizes of some of the studies, as there is a chance we have studies with duplicates of over 15k images.

Hi Alain,

I tried the bulk-delete method and it worked really well: one study went from around 30 s to just under 5 s. Does the bulk-delete endpoint create a job, or does it wait for the deletion of all resources before it returns a response?

Here is the script. I used pandas as a way to make querying the duplicates from the studies easier; it could maybe use some more optimization.
Duplicate_bulk_delete.py

import datetime
import requests
import pandas
from requests.auth import HTTPBasicAuth

baseUrl = "http://127.0.0.1:8042"
auth = HTTPBasicAuth('orthanc', 'orthanc')

r_studies = requests.get(baseUrl + "/studies", auth=auth)

scanned_instances_count = 0
scanned_studies_count = 0
deleted_instances_count = 0
start_time = datetime.datetime.now()

print(datetime.datetime.now())
for study in r_studies.json():

    duplicated_instances = []
    # One request per study returns the main DICOM tags of all its instances
    r_study = requests.get(baseUrl + "/studies/" + study + "/instances", auth=auth)
    instanceDataFrame = pandas.json_normalize(r_study.json())

    duplicated = instanceDataFrame.duplicated(subset=['ParentSeries', 'MainDicomTags.InstanceNumber'])

    for index, value in duplicated.items():

        if value:
            # This row has the same ParentSeries and InstanceNumber as an earlier one
            duplicated_instances.append(instanceDataFrame.at[index, 'ID'])
            deleted_instances_count += 1

    data = {"Resources": duplicated_instances}

    if duplicated_instances:
        r_bulkdelete = requests.post(baseUrl + "/tools/bulk-delete", json=data, auth=auth)
        if r_bulkdelete.status_code == 200:
            print("Instances deleted")

    scanned_instances_count += duplicated.index.size
    scanned_studies_count += 1


end_time = datetime.datetime.now()
total_time = end_time - start_time

print("Deleted a total of ", deleted_instances_count, "Instances")
print("Scanned a total of ", scanned_studies_count, " studies, ", scanned_instances_count, " instances")
print("in ", total_time, "S")

Output from testing on a local Orthanc instance with 13k images and an SQLite database:

2023-05-26 10:06:01.963088
Instances deleted
Deleted a total of  2363 Instances
Scanned a total of  5  studies,  12201  instances
in  0:00:03.421359 S

It does not create a job → once it returns its response, all resources are deleted.