How to create a custom Kedro DataSet

4 minute read

In the following article, i will show how to create a custom dataset for Kedro, an open source Python library for Production-Ready Machine Learning Code.

The custom dataset will be used to read DICOM files and produce image and CSV datasets which are native Kedro datasets.

This article is linked to my article on how to create a kedro project and my project on pneumothorax classifier, you will find more informations on the DICOM dataset there.

To be able to read and extract data from DICOM files, we will use the pydicom library.

DICOMDataSet class

This is the whole custom dataset class we are going to create. I will explain each part. One remark though, we are not going to implement the save part of the dataset class, as it will be not used in our project and it is not very common to write DICOM files in a ML project.

import numpy as np
import pandas as pd

import pydicom

# PIL is the package from Pillow
from PIL import Image


class DICOMDataSet(AbstractDataSet):
    def __init__(self, filepath: str):
        """Creates a new instance of DICOMDataSet to load / save image data for given filepath.

        Args:
            filepath: The location of the DICOM file to load / save data.
        """
        
        # parse the path and protocol (e.g. file, http, s3, etc.)
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

    def _load(self) -> (pd.DataFrame,np.ndarray):
        """Loads data from the DICOM file.

        Returns:
            Metadata from the DICOM file as a pandas Dataframe,
            Image data  as a numpy array
        """
        # using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
        load_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(load_path) as f:
            ds = pydicom.dcmread(f)

            df = pd.DataFrame.from_records([(el.name,el.value) for el in ds if el.name not in ['Pixel Data', 'File Meta Information Version']])
            df = df.T
            df.columns = df.iloc[0]
            df = df.iloc[1:]
            pixel_array = ds.pixel_array
            return (df,pixel_array)
    

    def _save(self, data: np.ndarray) -> None:
        """Saves image data to the specified filepath"""
        return None

    
    def _describe(self) -> Dict[str, Any]:
        """Returns a dict that describes the attributes of the dataset.
        """
        return dict(
            filepath=self._filepath,
            protocol=self._protocol
        )

init of instance

def __init__(self, filepath: str):
        """Creates a new instance of DICOMDataSet to load / save image data for given filepath.

        Args:
            filepath: The location of the DICOM file to load / save data.
        """
        
        # parse the path and protocol (e.g. file, http, s3, etc.)
        protocol, path = get_protocol_and_path(filepath)
        self._protocol = protocol
        self._filepath = PurePosixPath(path)
        self._fs = fsspec.filesystem(self._protocol)

This part is quite the same for all custom datasets, we store the filepath and the protocol (e.g. file, http, s3, etc.) used to access the dataset.

Load method

def _load(self) -> (pd.DataFrame,np.ndarray):
        """Loads data from the DICOM file.

        Returns:
            Metadata from the DICOM file as a pandas Dataframe,
            Image data  as a numpy array
        """
        # using get_filepath_str ensures that the protocol and path are appended correctly for different filesystems
        load_path = get_filepath_str(self._filepath, self._protocol)
        with self._fs.open(load_path) as f:
            ds = pydicom.dcmread(f)

            df = pd.DataFrame.from_records([(el.name,el.value) for el in ds if el.name not in ['Pixel Data', 'File Meta Information Version']])
            df = df.T
            df.columns = df.iloc[0]
            df = df.iloc[1:]
            pixel_array = ds.pixel_array
            return (df,pixel_array)

This is definitively the most important part of the class. The load method will be used each time we access the dataset. We will use the pydicom library to read the .dcm files.

ds = pydicom.dcmread(f)

the dcmread method from pydicom library, will be used to read the contents of the dcm file and create a pydicom.dataset.FileDataset class. see pydicom dcmread.

The pydicom.dataset.FileDataset is iterable and we can retrieve each dcm metadata value and the pixel_array that corresponds to the image of xray stored.

df = pd.DataFrame.from_records([(el.name,el.value) for el in ds if el.name not in ['Pixel Data', 'File Meta Information Version']])
            df = df.T
            df.columns = df.iloc[0]
            df = df.iloc[1:]
            pixel_array = ds.pixel_array
            return (df,pixel_array)

We will retrieve all metadata info in a pandas DataFrame, and the xray image in the form of a numpy.ndarray, and then return them a a tuple.

Describe method

def _describe(self) -> Dict[str, Any]:
        """Returns a dict that describes the attributes of the dataset.
        """
        return dict(
            filepath=self._filepath,
            protocol=self._protocol
        )

This method is mandatory and will be called each time we try to execute a .head() method on a kedro dataset. Here we just return a dictionary containing the protocol and the filepath of the dataset on disk.

Add dataset to kedro data catalog

When you add a custom dataset ( same thing for regular ones) you have 2 options : defining only 1 file or a directory. For our dcm files option 2 is the way to go as we have more than 12k dcm files to extract.

Read a single file

To read a single dcm file, just define the type and filepath.

dicom_single:
    type: pneumothorax.io.datasets.dicom_dataset.DICOMDataSet
    filepath: data/01_raw/dicom-images-train/1.2.276.0.7230010.3.1.4.8323329.300.1517875162.258081.dcm

Read whole directory

To read a whole directory of files, set the type to PartitionedDataSet. Define dataset parameter to the DICOMDataSet type and set path to the directory containing our dcm files. filename_suffix will be used to identify all the dcm files.

dicom_train:
    type: PartitionedDataSet
    dataset: pneumothorax.io.datasets.dicom_dataset.DICOMDataSet
    path: data/01_raw/dicom-images-train
    filename_suffix: ".dcm"

Categories:

Updated: