Cloud Hosted Data Optimization Knowledge Base

ESIP Cloud Computing Cluster

This repository captures optimization practices for data in the cloud.

Why Optimize Data for Cloud Access?

Increasingly, organizations with large data holdings are turning to cloud-hosted storage to improve capacity, scalability, and access to computing resources near the data. These data are often stored in object stores or other storage that offers horizontally scalable access over a network. This demands attention from data providers seeking to maximize the value of their data, because the performance characteristics of cloud-hosted stores can differ greatly from those of their predecessors. Further, the cost profile of cloud hosting differs from that of on-premises hosting, such that unnecessary data requests, data transfer, and compute time all contribute directly to the overall cost of hosting the data.

In short, well-optimized data in a cloud environment are typically less expensive to host, more usable, and faster to access than poorly optimized data.

Resources for Cloud Data Optimization

The following resources are not directly related to cloud data but may offer relevant optimization efforts, background, and takeaways:

Factors Influencing Optimization Decisions

Performance optimization involves reducing the time to locate, retrieve, decode, and prepare data values for analysis.

To optimize for performance, the author should:

Given the above, the answer to how data should be optimized will always be “It depends.” The primary factors it depends on are:

While the above optimizes for performance, these concerns need to be balanced against organizational requirements, regulatory requirements, and other factors such as creation, hosting, and transfer cost.
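
As a rough way to reason about these performance trade-offs, a first-order cost model can relate the number of requests and the bytes transferred to total access time. The sketch below is illustrative only and is not from the source material; the latency and throughput values are assumptions, not measurements of any particular store.

```python
# A minimal sketch (assumed values, not benchmarks) of a first-order access-time model:
# total time ≈ number_of_requests * per-request latency + bytes_read / throughput.

def estimated_read_seconds(n_requests: int, bytes_read: float,
                           latency_s: float = 0.1,         # assumed ~100 ms per object-store request
                           throughput_bps: float = 50e6):  # assumed ~50 MB/s effective throughput
    """First-order estimate of wall-clock time to read data from an object store."""
    return n_requests * latency_s + bytes_read / throughput_bps

# Reading 100 MB as 1,000 tiny requests vs. 10 larger requests:
print(estimated_read_seconds(1000, 100e6))  # latency-dominated: ~102 s
print(estimated_read_seconds(10, 100e6))    # throughput-dominated: ~3 s
```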

Optimization Practices

Chunking

For many file formats, including HDF, NetCDF, and Zarr, chunks are the smallest unit of data within a file that can be read at once. Reading a chunk incurs a latency cost, so chunks should not be too small, or performance will suffer from many round trips. Reading a chunk also incurs a data transfer and decoding cost, so chunks should not be too large, or software may transfer and process many extraneous bytes for a particular use case and may need to keep those bytes in slower memory locations, compounding the problem. This creates a fundamental tension: use cases that need only a small amount of data along a dimension, such as a time series at a point in space, suffer from a large chunk size in that dimension, while use cases that need a large amount of data along that dimension, such as a spatial analysis at a point in time, suffer from a small chunk size in that dimension.

The optimal chunk shape varies with expected use cases, but it also varies with the latency and throughput of the data store. For data in cloud storage, these characteristics can differ dramatically from those of a local disk. Chunk sizing for cloud stores therefore needs careful consideration and cannot simply rely on non-cloud rules of thumb.
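
As an illustration of this tension, the hedged sketch below writes the same hypothetical dataset to Zarr twice with different chunk shapes: one favoring time-series reads at a point, the other favoring per-time-step spatial reads. The variable name, grid size, chunk shapes, and store paths are assumptions made for the example, not recommendations from this document.

```python
import numpy as np
import xarray as xr

# Hypothetical daily, 1-degree global field standing in for real model output.
data = np.random.rand(365, 180, 360).astype("float32")
ds = xr.Dataset({"temperature": (("time", "lat", "lon"), data)})

# Chunking that favors time-series extraction at a point: long in time, small in space.
ds.chunk({"time": 365, "lat": 30, "lon": 30}).to_zarr("timeseries_chunks.zarr", mode="w")

# Chunking that favors per-time-step spatial analysis: one time step, full spatial extent.
ds.chunk({"time": 1, "lat": 180, "lon": 360}).to_zarr("spatial_chunks.zarr", mode="w")
```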

Chunk Size

A chunk size should be large enough to reduce the number of tasks that parallel schedulers like Dask must track (which adds overhead), but small enough that many chunks can fit in memory at once. The Pangeo project has recommended a chunk size of about 100 MB, a figure that originated in the Dask Best Practices. The Zarr Tutorial recommends a chunk size of at least 1 MB. The Amazon S3 Best Practices note that a typical byte-range request is 8–16 MB. Taken together, chunk sizes on the order of 10 MB to 100 MB appear well suited to cloud usage.
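
A quick sanity check is to compute the uncompressed size of a candidate chunk from its shape and data type and compare it against these guidelines. The sketch below is a minimal example; the chunk shape shown is an arbitrary assumption.

```python
import numpy as np

def chunk_nbytes(chunk_shape, dtype):
    """Uncompressed size in bytes of one chunk with the given shape and dtype."""
    return int(np.prod(chunk_shape)) * np.dtype(dtype).itemsize

# A hypothetical (100, 360, 720) float32 chunk of a (time, lat, lon) variable:
size = chunk_nbytes((100, 360, 720), "float32")
print(f"{size / 1e6:.0f} MB per chunk")  # ~104 MB, near the ~100 MB Pangeo guidance
```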

Antipatterns

A community survey of the ESIP Cloud Computing Cluster noted the following antipatterns in cloud data:

Moving forward

Please contribute to this document if you have input. We welcome pull requests.

Additionally, the community has noted the following specific needs for input or experimentation:

  1. Identifying commonalities across communities, organizations, and source data formats
  2. Performance analysis on a variety of data organizations, analyses, and data structure types
  3. Chunking and compression options in the context of scalable data access to model output
  4. Data on how optimization decisions vary between access clients such as remote users, Dask clusters, and Spark clusters