Data-as-a-Product: A data science framework for data collaborations

Abstract

For data science teams, data preparation requires a substantial investment of time, data science expertise, and subject matter proficiency. However, as the name implies, data preparation is typically viewed merely as a means to an end, which encourages the creation of expensive but often single-use and fragile elements in data analysis workflows. Rather than seeing data preparation as an obstacle to be removed, we propose a framework that recognizes the time and expertise invested in it and seeks to maximize the value that can be derived from it. The Data-as-a-Product (DaaP) framework views analysis-ready data as a multi-purpose, modularly built product that should lend itself to collaborative development and maintenance, and it aims to remove the barriers to version tracking and to that collaboration. Specifically, the framework, which is implemented entirely in R, enables joint versioning of code and data based on Git, standardizes metadata capture, tracks the R packages used, and encourages best practices such as adherence to functional programming and the use of data testing. Collectively, the patterns established by the DaaP framework can help data science teams transition from developing expensive, single-use “wrangled” datasets to building maintainable, version-controlled, and extensible data products that can serve as reliable components of their data analysis workflows.

Type
Publication
Presented at 2021 Conference