Preprocessing the Dataset

Introduction

In this notebook you will create your own dataset based solely on the CAMELYON16 data set (~700 gigabytes), as all positive WSIs come with XML files containing the coordinates of the tumorous regions. Subsequent notebooks will use the data set you create now.

Once you have finished this series of notebooks, you can enhance your implementation to also include the 50 positive WSIs of the CAMELYON17 training data set, which also come with XML files.

The purpose of the preprocessing is the following:

If we had enough RAM to store the whole data set, we would just load it once at the beginning of the training. But this is not the case. Reading the different WSI files in their compressed TIFF format every single time we assemble a new training batch is very time consuming. So storing tiles with a fixed zoom level and a fixed size, cropped and labeled, in one single file will save us a lot of time.

Requirements

Python-Modules

# Python Standard Library
import random
import sys
import os

# External Modules
import numpy as np
import h5py
from datetime import datetime
from skimage.filters import threshold_otsu
from matplotlib import pyplot as plt


from preprocessing.datamodel import SlideManager
from preprocessing.processing import split_negative_slide, split_positive_slide, create_tumor_mask, rgb2gray
from preprocessing.util import TileMap

Data

The data used in this notebook are from the CAMELYON data sets, which are freely available on the CAMELYON data page.

The whole data sets have the following sizes:

  • CAMELYON16 (~715 GiB)
  • CAMELYON17 (~2.8 TiB)

For this notebook to work, the following file structure (for CAMELYON16) must be present inside the data folder:

data
├── CAMELYON16
│   ├── training
│   │   ├── lesion_annotations
│   │   │   └── tumor_001.xml - tumor_110.xml
│   │   ├── normal
│   │   │   └── normal_001.tif - normal_160.tif
│   │   └── tumor
│   │       └── tumor_001.tif - tumor_110.tif
│   └── test
│       ├── lesion_annotations
│       │   └── test_001.xml - test_130.xml
│       └── images
│           └── test_001.tif - test_130.tif
│
└── CAMELYON17
    └── training
        ├── center_0
        │   └── patient_000_node_0.tif - patient_019_node_4.tif
        ├── center_1
        │   └── patient_020_node_0.tif - patient_039_node_4.tif
        ├── center_2
        │   └── patient_040_node_0.tif - patient_059_node_4.tif
        ├── center_3
        │   └── patient_060_node_0.tif - patient_079_node_4.tif
        ├── center_4
        │   └── patient_080_node_0.tif - patient_099_node_4.tif
        ├── lesion_annotations
        │   └── patient_004_node_4.xml - patient_099_node_4.xml
        └── stage_labels.csv

Note: The SlideManager class is case-sensitive with respect to file names, especially when mapping annotations to tumor slides, so be consistent in your file naming.

Task:

If you have not done so already, download all remaining data of the CAMELYON16 data set and store it in the folder structure shown above.
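
Once the data is in place, a quick sanity check like the following can confirm that the folder structure is recognized. This is only a sketch: it relies on the SlideManager attributes used later in this notebook (slides, negative_slides, annotated_slides), and you need to adjust the path to your setup.

### Optional sanity check (sketch) -- adjust the path to your CAMELYON16 folder
from preprocessing.datamodel import SlideManager

mgr_check = SlideManager(cam16_dir='/path-to-CAMELYON16-folder/CAMELYON16/')

print('total slides:    ', len(mgr_check.slides))            # expected: 270 (160 normal + 110 tumor)
print('negative slides: ', len(mgr_check.negative_slides))   # expected: 160
print('annotated slides:', len(mgr_check.annotated_slides))  # expected: 110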

Dataset Generation

In this notebook, we will use parts of the data-handling-usage-guide.ipynb to create our own dataset. You have two options. We suggest working through the whole application scenario with option (B) first.

Either Option

  • Process all files from the CAMELYON16 data set
  • No overlap for negative tiles
  • Minimum of 20% tissue in tiles for normal slides
  • Minimum of 60% tumorous tissue in tiles for positive slides (see the conceptual sketch after this list)
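
The two percentage thresholds are "percentage of interest" (poi) values: a tile is only kept if at least that fraction of its pixels lies on tissue (negative slides) or on annotated tumor tissue (positive slides). Conceptually, and only as an illustration (this is not the library's actual code), the check amounts to a mask-coverage test:

import numpy as np

def tile_has_enough_interest(mask, poi_threshold):
    """mask: boolean array, True where tissue (or tumor) was detected."""
    return mask.mean() >= poi_threshold

tile_mask = np.zeros((256, 256), dtype=bool)
tile_mask[:64, :] = True                           # 25% of the tile covered
print(tile_has_enough_interest(tile_mask, 0.20))   # True  -> tile would be kept
print(tile_has_enough_interest(tile_mask, 0.60))   # False -> tile would be discarded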

Option A

  • Slide zoom level 0 (0-9, 0 being the highest zoom)
  • Tile size of 312x312 pixels
  • 156 pixel overlap for tumorous (positive) tiles, since they are scarce
  • We save up to 1000 tiles per slide
  • Processing in this notebook will take approximately 60 hours [*]
  • Classifying the tiles of the WSIs of the test set will later take ~1 hour per WSI [*]

Option B

  • Slide zoom level 3 (0-9, 0 being the highest zoom)
  • Tile size of 256x256 pixels
  • 128 pixel overlap for tumorous (positive) tiles, since they are scarce
  • We only save up to 100 tiles per slide
  • Processing in this notebook will take approximately 5 hours [*]
  • Training of the CNN in the next notebook will take ~10 hours [*]
  • Classifying the tiles of the WSIs of the test set will later take ~10 minutes per WSI [*]

Remark:

  • [*] Tested on a Xeon E3-1231 v3 @ 3.8 GHz, 16 GB DDR3 @ 1666 MHz, with the data set stored on a magnetic hard drive
  • If you have the possibility to store the CAMELYON16 data set on an SSD, do so.

Most importantly, we will save all tiles from all WSIs into a single HDF5 file. This is crucial because when accessing the data later for training, most of the time is spent opening files. Additionally, training works better when a single batch (e.g. 100 tiles) is as heterogeneous as the original data. So when we want to read 100 random tiles, we ideally want 100 tiles from 100 different slides, and we do not want to open 100 different files to do so.

Background Information:

Depending on the staining process and the slide scanner, the slides can differ quite a lot in color. Therefore, a batch containing 100 tiles from only one slide will most likely prevent the CNN from generalizing well.
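
As an illustration, here is a sketch of how a later training step could draw such a heterogeneous batch from the single HDF5 file created at the end of this notebook (one dataset per WSI, keyed by slide name). The file name in the commented-out call is only an example; use whatever file you actually generate below.

import random
import h5py
import numpy as np

def random_heterogeneous_batch(h5_path, batch_size=100):
    """Draw batch_size tiles, each from a randomly chosen slide."""
    with h5py.File(h5_path, 'r') as h5:
        keys = [k for k in h5.keys() if h5[k].shape[0] > 0]
        tiles = []
        for key in random.choices(keys, k=batch_size):
            idx = random.randrange(h5[key].shape[0])
            tiles.append(h5[key][idx])
    return np.stack(tiles)

# batch = random_heterogeneous_batch('/path-to-output/all_wsis_256x256_poi0.2_poiTumor0.6_level3.hdf5')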

### EDIT THIS CELL:
### Assign the path to your CAMELYON16 data
CAM_BASE_DIR = '/path-to-CAMELYON16-folder/'

# example: absolute path on Linux
CAM_BASE_DIR = '/media/klaus/2612FE3171F55111/'
### Do not edit this cell
CAM16_DIR = CAM_BASE_DIR + 'CAMELYON16/'
GENERATED_DATA = CAM_BASE_DIR + 'output/'

# example: the output path may of course be different
GENERATED_DATA = '/media/klaus/Toshiba/CAM16_output/'
mgr = SlideManager(cam16_dir=CAM16_DIR)
n_slides = len(mgr.slides)
### Execute this cell for option A

level = 0
tile_size = 312
max_tiles_per_slide = 1000
### Execute this cell for option B

level = 3
tile_size = 256
max_tiles_per_slide = 100
### Execute this cell

poi = 0.20        ### 20% of a negative tile must contain tissue (in contrast to slide background)
poi_tumor = 0.60  ### 60% of a positive tile must contain metastases
### to not get too few positive tiles, we use an overlap of half the tile size
overlap_tumor = tile_size // 2
### we have enough normal tissue, so negative tiles are less of a problem
overlap = 0.0

Hint:

  • As mentioned above, the next two blocks will take a lot of time, depending on the chosen option. Before starting to preprocess the full data set, it might help to process just a few slides, e.g. two normal and two tumor, to test whether everything works as expected (see the sketch after this list).
  • In some rare cases Jupyter Notebook can become unstable when running for hours. It might be a good idea to run the Python program from a shell instead. To do so, export the notebook as a Python program: go to File --> Download as --> Python.
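
As a sketch of such a test run, the following only looks at the first few tiles of the first two tumor slides (the same idea works for mgr.negative_slides). It assumes the level, tile_size, overlap_tumor and poi_tumor values set above.

### Optional test run (sketch): peek at the first tiles of two tumor slides
import itertools

for slide in mgr.annotated_slides[:2]:
    tile_iter = split_positive_slide(slide, level=level, tile_size=tile_size,
                                     overlap=overlap_tumor, poi_threshold=poi_tumor)
    first_tiles = list(itertools.islice(tile_iter, 5))   # only consume the first few tiles
    if first_tiles:
        print(slide.name, '->', len(first_tiles), 'tiles inspected, first tile shape:',
              np.asarray(first_tiles[0][0]).shape)
    else:
        print(slide.name, '-> no tiles found, check level / poi settings')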

Create Individual Files per WSI

To make this process resumable if anything fails, we will first create one HDF5 file for each WSI. This way, if anything goes wrong, e.g. a power failure or the Python kernel dying, you can just delete the very last file, which will most likely be corrupted, and resume the process by re-executing the cells.
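
Note that the loops below open each per-WSI file with mode "w-", which fails if the file already exists, so re-executing a cell after a crash will simply skip slides that were already processed. If you prefer an explicit check, a small sketch like the following (reusing the file-name pattern of the loops below) works as well:

def already_processed(slide_name):
    filename = '{}/{}_{}x{}_poi{}_poiTumor{}_level{}.hdf5'.format(
        GENERATED_DATA, slide_name, tile_size, tile_size, poi, poi_tumor, level)
    return os.path.exists(filename)

print(already_processed(mgr.annotated_slides[0].name))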

tiles_pos = 0

for i in range(len(mgr.annotated_slides)):
    try: 
        
        filename = '{}/{}_{}x{}_poi{}_poiTumor{}_level{}.hdf5'.format(GENERATED_DATA, mgr.annotated_slides[i].name, tile_size, tile_size, 
                                                       poi, poi_tumor, level)
        # 'w-' creates file, fails if exists
        h5 = h5py.File(filename, "w-", libver='latest')
        
        # create a new and unconsumed tile iterator
        tile_iter = split_positive_slide(mgr.annotated_slides[i], level=level,
                                         tile_size=tile_size, overlap=overlap_tumor,
                                         poi_threshold=poi_tumor) 

        tiles_batch = []
        for tile, bounds in tile_iter:
            if len(tiles_batch) % 10 == 0: print('positive slide #:', i, 'tiles so far:', len(tiles_batch))
            if len(tiles_batch) >= max_tiles_per_slide: break  # stop once the per-slide limit is reached
            tiles_batch.append(tile)

        # creating a data set in the file
        dset = h5.create_dataset(mgr.annotated_slides[i].name, 
                                 (len(tiles_batch), tile_size, tile_size, 3), 
                                 dtype=np.uint8,
                                 data=np.array(tiles_batch),
                                 compression=0)   
        h5.close()

        tiles_pos += len(tiles_batch)
        print(datetime.now(), i, '/', len(mgr.annotated_slides), '  tiles  ', len(tiles_batch))
        print('pos tiles total: ', tiles_pos)

    except Exception:
        print('slide nr {}/{} failed'.format(i, len(mgr.annotated_slides)))
        print(sys.exc_info()[0])
tiles_neg = 0

for i in range(len(mgr.negative_slides)): 
    try:
        filename = '{}/{}_{}x{}_poi{}_poiTumor{}_level{}.hdf5'.format(GENERATED_DATA, mgr.negative_slides[i].name, tile_size, tile_size, 
                                                       poi, poi_tumor, level)
        # 'w-' creates file, fails if exists
        h5 = h5py.File(filename, "w-", libver='latest')
        
        # load the slide into numpy array
        arr = np.asarray(mgr.negative_slides[i].get_full_slide(level=4))

        # convert it to gray scale
        arr_gray = rgb2gray(arr)

        # calculate otsu threshold
        threshold = threshold_otsu(arr_gray)

        # create a new and unconsumed tile iterator
        # because we have so many negative slides we do not use overlap
        tile_iter = split_negative_slide(mgr.negative_slides[i], level=level,
                                         otsu_threshold=threshold,
                                         tile_size=tile_size, overlap=overlap,
                                         poi_threshold=poi)

        tiles_batch = []
        for tile, bounds in tile_iter:
            if len(tiles_batch) % 10 == 0: print('neg slide:', i, 'tiles so far:', len(tiles_batch))
            if len(tiles_batch) >= max_tiles_per_slide: break  # stop once the per-slide limit is reached
            tiles_batch.append(tile)

        # creating a data set in the file
        dset = h5.create_dataset(mgr.negative_slides[i].name, 
                                 (len(tiles_batch), tile_size, tile_size, 3), 
                                 dtype=np.uint8,
                                 data=np.array(tiles_batch),
                                 compression=0)
        h5.close()
        
        tiles_neg += len(tiles_batch)
        print(datetime.now(), i, '/', len(mgr.negative_slides), '  tiles  ', len(tiles_batch))
        print('neg tiles total: ', tiles_neg)
        
    except Exception:
        print('slide nr {}/{} failed'.format(i, len(mgr.negative_slides)))
        print(sys.exc_info()[0])
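
Before merging, it can be worth a quick look at the per-WSI files to spot slides that produced zero tiles. A minimal check (sketch):

### Optional check (sketch): tile counts in the per-WSI files
total_tiles = 0
for f in sorted(os.listdir(GENERATED_DATA)):
    if f.startswith('normal_') or f.startswith('tumor_'):
        with h5py.File(GENERATED_DATA + f, 'r') as h5:
            for key in h5.keys():
                if h5[key].shape[0] == 0:
                    print('no tiles stored for', key)
                total_tiles += h5[key].shape[0]
print('tiles in per-WSI files:', total_tiles)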

Create Single File

Now we will create one new, final HDF5 file that contains all tiles of all the WSIs we just processed. The benefit of this is to further reduce reading time: opening a file takes some time, and this way we only need to open one single file.

single_file = '{}/all_wsis_{}x{}_poi{}_poiTumor{}_level{}.hdf5'.format(GENERATED_DATA, tile_size, tile_size, 
                                                       poi, poi_tumor, level)
h5_single = h5py.File(single_file, 'w')

for f in os.listdir(GENERATED_DATA):
    if f.startswith('normal_') or f.startswith('tumor_'):
        filename = GENERATED_DATA + f
        with h5py.File(filename, 'r') as h5:
            for key in h5.keys():
                print('processing: "{}", shape: {}'.format(key, h5[key].shape))
                if h5[key].shape[0] > 0: ### don't create dsets for WSIs with 0 tiles
                    dset = h5_single.create_dataset(key, 
                        h5[key].shape, 
                        dtype=np.uint8,
                        data=h5[key][:],
                        compression=0) 
            
h5_single.close()
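
A quick look at the merged file (sketch, using the matplotlib import from above): list a few datasets and display a single tile.

with h5py.File(single_file, 'r') as h5:
    keys = sorted(h5.keys())
    print('number of WSIs with tiles:', len(keys))
    print('example datasets:', keys[:3])
    example_tile = h5[keys[0]][0]

plt.imshow(example_tile)
plt.title(keys[0])
plt.show()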

Summary and Outlook

The next step is to train a neural network with the preprocessed data to be able to classify and predict unseen tiles.

If you are curious about how the preprocessing library used here works and how to use OpenSlide, take a look at the source code; it should not be too hard to understand. Note:

  • For negative slides: we use Otsu thresholding to distinguish between slide background and tissue (see the sketch below)
  • For positive slides: we just use the XML files, which include polygons for the metastatic regions
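
As an illustration of the first point, the following sketch repeats the Otsu step from the negative-slide loop above and visualizes the resulting tissue mask for one strongly downsampled slide. The chosen level (6) and the direction of the comparison (tissue darker than the bright slide background) are assumptions for this illustration, not taken from the library.

### Illustration (sketch): Otsu-based tissue mask for one negative slide
slide = mgr.negative_slides[0]
arr = np.asarray(slide.get_full_slide(level=6))   # strongly downsampled overview image
arr_gray = rgb2gray(arr)
threshold = threshold_otsu(arr_gray)
tissue_mask = arr_gray < threshold                # assumption: tissue is darker than the background

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(arr)
axes[0].set_title(slide.name)
axes[1].imshow(tissue_mask, cmap='gray')
axes[1].set_title('tissue mask (Otsu)')
plt.show()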

Licenses

Notebook License (CC-BY-SA 4.0)

The following license applies to the complete notebook, including code cells. It does however not apply to any referenced external media (e.g., images).

create-custom-dataset
by Klaus Strohmenger
is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Based on a work at https://gitlab.com/deep.TEACHING.

Code License (MIT)

The following license only applies to code cells of the notebook.

Copyright 2018 Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.