Tiled image processing

Tiled image processing#

In this notebook we will process a big dataset that has been saved in zarr format to count cells in individual tiles using dask and zarr. The underlying principles will be explained in the next sections.

import zarr
import dask.array as da
import numpy as np
import stackview

For demonstration purposes, we use a dataset that is provided by Theresa Suckert, OncoRay, University Hospital Carl Gustav Carus, TU Dresden. The dataset is licensed License: CC-BY 4.0. We are using a cropped version here that was resaved a 8-bit image to be able to provide it with the notebook. You find the full size 16-bit image in CZI file format online. The biological background is explained in Suckert et al. 2020, where we also applied a similar workflow.

When working with big data, you will likely have an image stored in the right format to begin with. For demonstration purposes, we save here a test image into the zarr format, which is commonly used for handling big image data.

Loading the zarr-backed image data#

For processing images tile-by-tile, the data must be stored in a compatible file format in advance, for example as zarr file.

zarr_image = zarr.open("data/P1_H_C3H_M004_17-cropped.zarr", mode="r")
zarr_image.info

Type               : Array
Zarr format        : 3
Data type          : UInt8()
Fill value         : 0
Shape              : (2000, 5000)
Chunk shape        : (100, 100)
Order              : C
Read-only          : True
Store type         : LocalStore
Filters            : ()
Serializer         : BytesCodec(endian=None)
Compressors        : (ZstdCodec(level=0, checksum=False),)
No. bytes          : 10000000 (9.5M)

Such a zarr dataset can be converted to a Dask stack:

da.from_zarr(zarr_image)

	Array	Chunk
Bytes	9.54 MiB	9.77 kiB
Shape	(2000, 5000)	(100, 100)
Dask graph	1000 chunks in 2 graph layers
Data type	uint8 numpy.ndarray

We can also apply image processing to this tiled dataset directly.

Counting nuclei#

For counting the nuclei, we setup a simple image processing workflow. It returns an image with a single pixel containing the number of nuclei in the given input image. These single pixels will be assembled to a pixel count map; an image with much less pixels than the original image, but with the advantage that we can look at it - it’s no big data anymore.cle.exclude_labels_with_map_values_within_range

import numpy as np
from skimage import filters, segmentation, measure, morphology
from scipy import ndimage

def count_nuclei(image, threshold=15):
    """
    Label objects in a binary image and produce a pixel-count-map image.
    """
    #print("processing image of shape", image.shape)
    # Count nuclei including those which touch the image border
    # Replace voronoi_otsu_labeling with equivalent scikit-image approach
    blurred = filters.gaussian(np.asarray(image), sigma=3.5, preserve_range=True)
    binary = blurred > threshold
    
    # Generate seeds for watershed
    distance = ndimage.distance_transform_edt(binary)
    local_maxima = morphology.local_maxima(distance, footprint=morphology.disk(1))
    seeds = measure.label(local_maxima)
    
    # Perform watershed segmentation
    labels = segmentation.watershed(-distance, seeds, mask=binary)
    
    # Relabel to ensure sequential numbering
    nuclei_count = labels.max()
    
    # Count nuclei excluding those which touch the image border
    labels_without_borders = segmentation.clear_border(labels.copy())
    
    # Relabel again
    labels_without_borders = measure.label(labels_without_borders > 0)
    nuclei_count_excluding_borders = labels_without_borders.max()
    
    # Both nuclei-count including and excluding nuclei at image borders 
    # are no good approximation. We should exclude the nuclei only on 
    # half of the borders to get a good estimate.
    # Alternatively, we just take the average of both counts.
    result = np.asarray([[(nuclei_count + nuclei_count_excluding_borders) / 2]])
    
    return result

For testing, we apply the function to a subset of the data.

image_crop = np.asarray(zarr_image[1000:1500, 1000:1500])

stackview.insight(image_crop)

shape	(500, 500)
dtype	uint8
size	244.1 kB
min	0
max	255

count_nuclei(image_crop)

array([[172.]])

Next, we setup a delayed result by mapping a function on blocks of the data.

tile_map = da.map_blocks(count_nuclei, da.from_zarr(zarr_image))

tile_map

	Array	Chunk
Bytes	76.29 MiB	78.12 kiB
Shape	(2000, 5000)	(100, 100)
Dask graph	1000 chunks in 3 graph layers
Data type	float64 numpy.ndarray

As this result image is orders of magnitude smaller then the original, we can compute the whole result map. This is because the count nuclei function reduces the data massively. Each block is turned into a single number.

result = tile_map.compute()

result.shape

(20, 50)

The sum of intensity of all pixels in this image corresponds to the counted number of nuclei in the original image.

result.sum()

np.float64(5545.0)

Again, as the result map is small, we can just visualize it.

stackview.insight(result)

shape	(20, 50)
dtype	float64
size	7.8 kB
min	0.0
max	20.5

Sanity check#

Applying the count nuclei function to the whole zarr image allows us to check for the error of the computation above.

Note that this would not work with big data.

count_nuclei(np.asarray(zarr_image))

array([[5423.5]])

Exercise#

Use the function da.from_zarr(zarr_image).rechunk((500, 500)) to apply cound_nuclei to differently chunked data. Depending on the chunk size, the error of the method seems different. why?