Antipatterns (i.e., what doesn’t work)

A community survey of the ESIP Cloud Computing Cluster noted the following antipatterns in cloud data:

  • Large granule sizes with no means to subset those granules, requiring full-file transfers.

  • Storing data uncompressed, which can improve access performance but can double storage costs.

  • Lift and shift, which fails to make use of the cloud’s elasticity and highly scalable storage while also ignoring changes in access performance.

  • Large central file mounts, which create scalability and performance issues.

  • Persistent processes, especially inelastic ones. Scientific workloads tend to have highly variable demand and require attention to scaling in order to provide burst performance without inordinate cost.

  • Requiring many requests in order to read a file, e.g. due to chunk sizes that are too small relative to the overall access size, which creates significant network chatter (see the request-count arithmetic after this list).

  • Web-based portals instead of direct programmatic data access, which limit automated activity and scalability while returning too many results.

  • Failing to cache for repeat access, both client- and server-side; this matters particularly for service outputs that are expensive to produce and need to be idempotent (see the caching sketch after this list).

  • Putting workflows on the client, which creates significant back-and-forth network traffic.

  • Allowing unconstrained spatial queries at full detail, which can return arbitrarily large results.

  • Moving data to cold storage (“glaciering”) without a defined thawing process and clear ownership of the associated costs.

  • Extracting time series from thousands of one-time-step files, which forces opening every file just to read a few values per step (see the time-series sketch after this list).

  • Hierarchical data tree walks (e.g., over netCDF/HDF groups), which tend to be slow and prone to runtime errors.

  • Scattering metadata throughout the dataset instead of consolidating it in one place.

  • Making one-time chunking-scheme, variable, or projection decisions on behalf of all end users; these decisions should instead be configurable for different users and different use cases.

  • Data without metadata and metadata APIs: data is in the cloud, but people can’t understand it without downloading it.

  • Storing related files in zip, tar, or similar formats not well suited to partial file reads and spatiotemporal access.

  • Serving data with no provision for partial-file access, such as HTTP range requests (see the byte-range sketch after this list).

  • Compressing data with unconventional (non-default, difficult to find/install) compressors.

  • Everyone (data producers and data users alike) needing time to learn new tools and data access methods.
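
A few of these points are easiest to see in code. First, the network-chatter problem with small chunks is simple arithmetic: the number of requests is roughly the access size divided by the chunk size. A minimal sketch, with made-up sizes:

```python
# Hypothetical sizes: a 1 GiB subset read from a chunked store.
subset_bytes = 1 * 2**30        # 1 GiB requested
small_chunk = 64 * 2**10        # 64 KiB chunks: far too small
large_chunk = 10 * 2**20        # ~10 MiB chunks

print(subset_bytes // small_chunk)  # 16384 separate GET requests
print(subset_bytes // large_chunk)  # 102 requests for the same read
```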
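
For the client-side half of the caching point, one option is fsspec’s “filecache” wrapper, which keeps repeat reads off the network. The URL and cache directory below are hypothetical:

```python
import fsspec

url = "https://example.com/data/granule.nc"  # hypothetical URL

# "filecache" wraps a remote filesystem and stores a local copy of each
# file it touches, so repeat access is served from disk, not the network.
fs = fsspec.filesystem(
    "filecache",
    target_protocol="https",
    cache_storage="/tmp/demo-cache",  # hypothetical local cache directory
)

with fs.open(url, "rb") as f:
    first = f.read(8192)   # first access downloads and caches the file
with fs.open(url, "rb") as f:
    again = f.read(8192)   # repeat access hits the local cache
```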
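
The one-time-step-files point is a cost problem: every file must be opened to read a few values. A sketch with xarray, where the file names, variable name, and Zarr store location are all hypothetical:

```python
import xarray as xr

# Antipattern: one file per time step means every file must be opened
# and read, even when only a single grid cell's history is wanted.
paths = [f"step-{i:04d}.nc" for i in range(1000)]   # hypothetical files
ds = xr.open_mfdataset(paths, combine="by_coords")  # touches all 1000
point = ds["temperature"].sel(lat=40.0, lon=-105.0, method="nearest")

# Preferred: a single store chunked so one point's full time series
# spans only a few chunks, e.g. a Zarr store with time contiguous
# within each chunk.
ds = xr.open_zarr("s3://example-bucket/temperature.zarr")  # hypothetical
point = ds["temperature"].sel(lat=40.0, lon=-105.0, method="nearest").load()
```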
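
Finally, partial-file access in practice usually means HTTP range requests. A minimal sketch with the requests library, assuming a hypothetical URL on a server that honors Range headers (as cloud object stores do):

```python
import requests

url = "https://example.com/data/granule.nc"  # hypothetical granule URL

# Antipattern: transfer the whole granule just to inspect a small piece.
whole_file = requests.get(url).content  # downloads every byte

# Preferred: request only the bytes needed via an HTTP Range header,
# the mechanism underlying partial-file access in cloud object stores.
resp = requests.get(url, headers={"Range": "bytes=0-8191"})  # first 8 KiB
print(resp.status_code)  # 206 Partial Content when ranges are supported
header_bytes = resp.content
```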