Data Debt

posted by Mosaic Data Science

Software Debt

In 2011 Chris Sterling published the very instructive book Managing Software Debt:  Building for Inevitable Change.[i] The book generalizes the concept of technical debt to account for a variety of similar classes of software-development process debt.  Besides technical debt, Mr. Sterling describes quality debt, configuration-management debt, design debt, and platform-experience debt.  These various forms of debt together constitute software debt, which

accumulates when the focus is on immediate completion and [technical quality] is neglected… Each time a team ignores the small but growing issues and thinks that this neglect will not affect the outcome, the team members are being dishonest with stakeholders and ultimately with themselves.[ii]

The definition of each subtype of software debt is particular to its domain.  For example, design debt occurs when technically robust design is delayed until the cost of adding features to the existing code base exceeds the cost of rebuilding the application from scratch.[iii]  Note that improper design is not always delayed design.  Excessive architecture can increase the cost of adding features as easily as deficient or absent architecture.  Hence Sterling writes about an agile team having “just enough . . . design to get started,” with architectural work product arriving just in time and being developed in an evolutionary fashion.[iv]  Design debt occurs when the design isn’t done right (at the technical level) the first time around.

The Common Pattern

Doing something wrong (again, at the technical level) the first time is the pattern shared by all of the subtypes of software debt.  A software-development team or team member decides (perhaps very consciously) not to do something right the first time, in order to reduce the immediate expense of the activity.  The side effect is that the team must incur higher costs later to clean up the mess.

Costs can be negative (negative costs are benefits).  A side effect of software debt can be delaying the arrival of benefits.  This usually diminishes the net present value of these benefits, so they are worth less when they arrive than they would be if they arrived at the earliest possible time.
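The effect of delay on a benefit's value can be sketched with a simple discounting calculation.  The figures below (a benefit of 100,000, a 10% annual discount rate, a one-year delay) are hypothetical and chosen only for illustration.

```python
def present_value(amount: float, annual_rate: float, years_delayed: float) -> float:
    """Discount a future amount back to today at a fixed annual rate."""
    return amount / (1.0 + annual_rate) ** years_delayed


# Hypothetical figures: a benefit worth 100,000 on arrival, discounted at 10% a year.
benefit = 100_000.0
rate = 0.10

on_time = present_value(benefit, rate, years_delayed=0)
one_year_late = present_value(benefit, rate, years_delayed=1)

print(f"Delivered now:         {on_time:,.0f}")
print(f"Delivered a year late: {one_year_late:,.0f}")
print(f"Value lost to delay:   {on_time - one_year_late:,.0f}")
```

The same function applies to costs:  a cleanup cost pushed into the future is also discounted, which is why deferring a cost is sometimes the right call.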

Measured Debt Assumption

The software-debt pattern has two variants.  Measured debt assumption occurs when the team incurs a debt judiciously as part of a broader resource-allocation decision.  One can often identify instances of measured debt assumption because, as part of the decision, the team also commits to incurring the delayed cost (paying off the debt) at a specific point in the future.

Passive Debt Assumption

The alternative is passive debt assumption, where the team incurs the debt without acknowledging or accounting for the decision.  Often in such cases the delayed cost is out of all proportion with the present cost, so that delaying the cost is poor judgment.  Thus passive debt assumption is an antipattern, while measured debt assumption is not.

Incurring Software Debt is an Economic Decision

The key lesson about software debt is that each instance reflects an economic (not merely technical) decision.  One should make all such decisions methodically, accounting for all of the costs and benefits of the alternatives, and choosing the most economical alternative.  The debt metaphor is useful because it highlights a ubiquitous form of suboptimal decision making in software development.  It also helps us frame software-debt decisions in ways that underscore management’s responsibility to own and make important economic decisions.
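One way to make such a decision methodical is to put every alternative on the same footing and compare discounted totals.  The sketch below is a minimal illustration, not a full cost-benefit model:  the Alternative class, the three options, and all of the numbers are hypothetical, and benefits can be entered as negative costs.

```python
from dataclasses import dataclass


@dataclass
class Alternative:
    name: str
    cost_now: float         # spent immediately
    cost_later: float       # cleanup cost paid when the debt comes due
    years_until_due: float  # how long the cleanup is deferred


def discounted_total_cost(alt: Alternative, annual_rate: float) -> float:
    """Immediate cost plus the deferred cost discounted to today."""
    return alt.cost_now + alt.cost_later / (1.0 + annual_rate) ** alt.years_until_due


def most_economical(alternatives: list[Alternative], annual_rate: float) -> Alternative:
    """Pick the alternative with the lowest discounted total cost."""
    return min(alternatives, key=lambda a: discounted_total_cost(a, annual_rate))


# Hypothetical numbers for illustration only.
options = [
    Alternative("do it right now", cost_now=50_000, cost_later=0, years_until_due=0),
    Alternative("measured debt", cost_now=30_000, cost_later=20_000, years_until_due=1),
    Alternative("passive debt", cost_now=20_000, cost_later=120_000, years_until_due=3),
]

best = most_economical(options, annual_rate=0.10)
for alt in options:
    print(f"{alt.name:16s} {discounted_total_cost(alt, 0.10):>10,.0f}")
print("most economical:", best.name)
```

With these made-up figures the measured debt is cheapest, while the passive debt (cheap today, very expensive later) is by far the worst choice.  Different inputs would give a different ranking, which is the point of running the numbers.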

 

Data Debt

This article adds a new subtype to Mr. Sterling’s list:  data debt.  Data debt occurs when data is improperly handled at the technical level with the intention of postponing certain costs, even though the postponed costs will be higher, or the postponed benefits will be lower.  The remainder of this document describes some important types of data debt.  The types are presented as dichotomies, with each member of a pair functioning as the dual of the other.

Urgent OLTP or Important Analytics

The rise of the chief data officer reflects the convergence of several trends.  One of them is the recognition that data science increasingly has a clear and compelling strategic role.  Another is that the traditional information technology (IT) organization (including the chief information officer) typically focuses most of its resources on operational computing:  email, online transaction processing (OLTP) applications, the Web site, etc.  Traditional IT organizations often also lack the technical expertise to create and operate an effective business intelligence (BI) function, much less an advanced analytics function.  (And when IT is simply resource starved, even OLTP suffers.)

Outdated OLTP Technology

OLTP applications evolve steadily.  A common short-term cost-cutting measure in IT is to delay upgrading OLTP applications to the current version.  Management often justifies the decision by acknowledging only the delayed marginal functional benefits of the current version.  In truth, old OLTP applications can also fail to collect transactional data in a form that supports good customer service, or that makes the data easily usable for master data management (MDM) or business intelligence (BI).  In extreme cases obsolete OLTP applications written in Cobol or Fortran are nursed along, sometimes for decades.

The opposite case occurs (less frequently) when the IT organization devotes so much of its resources to building the perfect MDM or BI system that the business must continue to use outmoded OLTP applications.  The irony is that the outmoded OLTP often provides low-quality input data to the MDM or BI, requiring significantly more ETL work than would be required if the OLTP were updated to a technology that guaranteed data quality upon input.  Thus the excessive focus on MDM or BI delays delivery of standardization and analytical benefits.

I observed the outdated OLTP antipattern while consulting at a leisure-services corporation.  The IT organization had delayed migrating off its legacy Cobol-based OLTP application for about two decades.  By the time I arrived (well after the year 2000), most of the IT organization was devoted to keeping the OLTP application running, with a small engineering team trying to develop a replacement application incrementally, using SOA technology as an interface between the new and old applications.  The organization had only recently managed to offer its customer base a Web site that would let customers search for services online.  Customers still had to complete transactions over the telephone.  The IT organization hired me to catalog the data assets in the Cobol file system, because only one IT employee understood where each entity type was stored, and she was in poor health.  There was no BI function.  The corporation would have to spend roughly double its historical IT budget for several years to escape the vicious cycle of IT obsolescence it had brought upon itself.  Its poor online functionality had clearly cost it a great deal of revenue and market share, lost to more forward-looking competitors.  The primitive data-storage back end meant that substantial BI capability would have to be postponed until the OLTP technology was updated enough to capture and store data in a more modern data layer.

Delayed Analytics

Delayed analytics is by far the more common antipattern of the urgent-or-important duo.  Typically OLTP excellence is considered urgent, while analytics are important but not urgent.  As a result, operational IT lives up to high expectations, while the less salient but equally strategic analytics develop haphazardly.  For example, a retailer’s Web site may offer robust transaction-processing support, but product pricing is still handled manually, even though pricing and revenue optimization analytics could increase revenues by 10-15%, which for some retail businesses would double profits.  In such cases the problem is not that the OLTP system cannot provide good historical data (quite the contrary).  Other relevant costs include having to store multiple years of history in the OLTP system.  That history, together with the reporting activity that shifts onto the OLTP system in the meantime, can slow down an OLTP server and can require more OLTP hardware to keep application performance satisfactory until the analytics are built.  The alternative is losing the previous years’ data.
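The claim that a 10-15% revenue increase can double profits follows from simple arithmetic for a low-margin business.  The sketch below assumes (and these are assumptions, not figures from any particular retailer) a 10% net margin and a revenue gain that comes entirely from better pricing, so that costs stay fixed.

```python
def profit_after_price_uplift(revenue: float, net_margin: float, uplift: float) -> float:
    """Profit if revenue rises by `uplift` while costs stay fixed (an assumption)."""
    costs = revenue * (1.0 - net_margin)
    return revenue * (1.0 + uplift) - costs


# Hypothetical retailer: 1,000,000 in revenue at a 10% net margin.
revenue = 1_000_000.0
net_margin = 0.10

before = revenue * net_margin
after = profit_after_price_uplift(revenue, net_margin, uplift=0.10)

print(f"profit before: {before:,.0f}")
print(f"profit after:  {after:,.0f}")
print(f"ratio:         {after / before:.2f}")
```

Under these assumptions the extra revenue falls straight through to profit, so a 10% uplift on a 10% margin doubles it.  If the extra revenue came from higher volume instead, costs would rise too and the effect would be smaller.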

 

Data Reuse and Data Specificity

Historically MDM and BI have been the most common opportunities for data reuse.  Thus we now have MDM servers, service buses, data virtualization, and data warehousing, all efforts (in part) to reuse and repurpose data.  In the last few years organizations have learned to think of data as an asset, so they now seek multiple revenue-generating or cost-reducing applications for historical data, especially big data.  The most recent result is the concept of a data lake, the unstructured/noSQL update of the enterprise data warehouse (EDW).

Data Reuse

A lesson that even many data scientists have been slow to learn is that analytical models can be very particular about how a variable is operationalized (defined and measured).  For example, there is no single concept of cost that can serve all accounting or analytical purposes.  Rather, one can define a dozen or so cost concepts, and compute their values from more fundamental accounting variables.  The question then becomes when to define a variable and collect its values.  The same observation is less obvious, but equally important, regarding “softer” social-science variables such as preferences and affinities.

Failure to Standardize

MDM and EDW are both common use cases for data standardization.  The OLTP antipattern occurs when each OLTP application stores its own versions of key entity types (such as customer and product).  The BI antipattern occurs when each subject-matter area in a data warehouse maintains its own dimensions for common entity types.  (In such cases you don’t really have an “enterprise” data warehouse, because you lack shared, conformed dimensions.)  In the extreme case, each functional area of the business maintains its own data mart.  Failure to standardize is even easier with “unstructured” noSQL technologies that lower the cost of dumping raw source data into a datastore.  It may be easier in the short term to dump into a noSQL store a new copy of a shared entity type, along with other data required for analysis, than to share a pre-existing copy.
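A shared, conformed dimension can be sketched in a few lines.  The example below uses an in-memory SQLite database; the table names (dim_customer, fact_sales, fact_support) and the sample rows are hypothetical.  Two subject areas reference one customer dimension, so their measures can be combined without reconciling separate customer lists.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One shared (conformed) customer dimension ...
    CREATE TABLE dim_customer (customer_key INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- ... referenced by fact tables from two different subject areas.
    CREATE TABLE fact_sales   (customer_key INTEGER REFERENCES dim_customer, amount REAL);
    CREATE TABLE fact_support (customer_key INTEGER REFERENCES dim_customer, tickets INTEGER);

    INSERT INTO dim_customer VALUES (1, 'Acme'), (2, 'Globex');
    INSERT INTO fact_sales   VALUES (1, 500.0), (1, 250.0), (2, 100.0);
    INSERT INTO fact_support VALUES (1, 3), (2, 7);
""")

# Aggregate each fact table separately, then join both through the shared dimension.
rows = conn.execute("""
    SELECT c.name, s.total_sales, t.total_tickets
    FROM dim_customer AS c
    JOIN (SELECT customer_key, SUM(amount) AS total_sales
          FROM fact_sales GROUP BY customer_key) AS s
      ON s.customer_key = c.customer_key
    JOIN (SELECT customer_key, SUM(tickets) AS total_tickets
          FROM fact_support GROUP BY customer_key) AS t
      ON t.customer_key = c.customer_key
    ORDER BY c.name
""").fetchall()

for name, total_sales, total_tickets in rows:
    print(name, total_sales, total_tickets)
```

Had each subject area kept its own customer table, the final query would first need a mapping between the two sets of customer keys, which is exactly the cleanup cost that failure to standardize defers.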

Data in Search of a Problem

The opposite pattern occurs when, for example, an MDM system is constructed to standardize a variety of entity types in the absence of clear use cases for those entity types, or adequate organizational support for those use cases.  In such cases the MDM data can languish on the MDM server for months or years, until it becomes clear to everyone involved that the data lacks a supported use case, and the MDM data or system is decommissioned.

Another case of data in search of a problem is a noSQL “data lake” that stores every possible bit of data generated by a corporation, regardless of the existence, plausibility, or economic value of known use cases.  Here resources that might have been devoted more productively to completing an analytical project are instead spent on laying the groundwork for as many projects as possible, without completing any of them.  Just-in-time data (like just-in-time architecture) is a far preferable alternative.

Storage Platform Dogmatics

Every technology has its devotees.  And, as the proverb goes, sometimes folks swing a relational hammer because that’s the tool they know, not because they need to pound a nail.  Others use the newest noSQL datastore they can find—because it’s new, not because it’s right. 

All Data is Relational

Even before the advent of noSQL stores, important alternatives to relational databases existed.  Among these are XML files, binary files, and key-value stores (such as the venerable Berkeley DB).  Each has its use cases and remains relevant, especially in OLTP.  For example, key-value stores are used to accelerate frequent lookup operations on high-traffic Web sites.

noSQL in Search of a Problem

Hadoop is by far the most common noSQL datastore, and its advocates are now trying to extend it with SQL interfaces, in an effort to sell Hadoop as a general-purpose replacement for the relational database.  Hadoop’s original use case is distributed two-stage embarrassing parallelism, which recently has been extended quite naturally to n stages.  However, it is hard to see how Hadoop zealots will avoid re-inventing the relational database in the process of trying to make SQL’s square peg fit Hadoop’s round hole.  Efforts to treat Cassandra, MongoDB, and other noSQL datastores as general-purpose databases are similarly problematic.  (A few years ago the author witnessed a large-scale financial OLTP application developed on Cassandra fail, because the datastore was ill-suited to the use case.)  Every noSQL store trades away certain key properties of a relational database to improve performance for a specific use case.  (MongoDB focuses on document storage; Neo4j specializes in graphs.  Many of them trade away ACID guarantees for the conveniences of scalability on commodity hardware.  Etc.)  They are, by design, specialized.

Data Security

The extremes in data security are easy to describe, and easy to criticize, especially in hindsight.  It is harder to decide dispassionately how much data security is enough.  A key consideration is that the cost of a security breach may far exceed the economic value of the data itself.  For example, a stolen credit-card number is reportedly worth about $20 on the black market.  But the loss of goodwill with the real owner of the credit card may be worth much more.  In the data-security arena, avoidance of catastrophic loss is probably a better class of objective function than expected value.
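The difference between the two objective functions can be made concrete with a small sketch.  Everything in it is hypothetical:  the Posture class, the two security postures, and the probabilities and loss figures are invented to show how an expected-value criterion and a worst-case criterion can disagree.

```python
from dataclasses import dataclass


@dataclass
class Posture:
    name: str
    annual_cost: float                    # what the controls cost to run
    scenarios: list[tuple[float, float]]  # (probability, loss) pairs; probabilities sum to 1


def expected_total(p: Posture) -> float:
    """Running cost plus the probability-weighted breach loss."""
    return p.annual_cost + sum(prob * loss for prob, loss in p.scenarios)


def worst_case_total(p: Posture) -> float:
    """Running cost plus the largest loss in any scenario."""
    return p.annual_cost + max(loss for _, loss in p.scenarios)


# Hypothetical postures and figures, for illustration only.
postures = [
    Posture("token security", 10_000, [(0.99, 0), (0.01, 5_000_000)]),
    Posture("encrypt and audit", 80_000, [(0.999, 0), (0.001, 500_000)]),
]

by_expected = min(postures, key=expected_total)
by_worst_case = min(postures, key=worst_case_total)

print("lowest expected cost:  ", by_expected.name)
print("lowest worst-case cost:", by_worst_case.name)
```

With these invented numbers the expected-value criterion favors token security, because the breach is rare, while the worst-case criterion favors the stronger posture, because the rare breach is ruinous.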

Store and Pray

The store and pray antipattern occurs when data having significant value to an unauthorized user is stored without more than token attention to data security.  For example, social-security numbers (or entire consumer identities) may be stored in plain text within a database, even though such data’s value on the black market rivals that of a stolen credit-card number.
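As one small illustration of moving beyond plain text, the sketch below stores a keyed digest of a social-security number instead of the number itself.  This only suits cases where the application needs to match a number it is given, not to read the stored number back (that would call for encryption, which is not shown here).  The function name ssn_token and the key handling are assumptions for the example; it uses only the Python standard library.

```python
import hashlib
import hmac
import secrets
import string


def ssn_token(ssn: str, key: bytes) -> str:
    """Keyed SHA-256 digest of a normalized SSN; the raw number is never stored."""
    normalized = "".join(ch for ch in ssn if ch in string.digits)
    return hmac.new(key, normalized.encode("ascii"), hashlib.sha256).hexdigest()


# Assumption: in practice the key would be kept outside the database,
# for example in a secrets manager.  Here it is generated on the spot.
key = secrets.token_bytes(32)

# 078-05-1120 is a well-known invalid sample number, used here as test data.
stored = ssn_token("078-05-1120", key)

# A later lookup with different formatting still matches the stored token.
matches = hmac.compare_digest(stored, ssn_token("078051120", key))
print("stored token length:", len(stored))
print("lookup matches:", matches)
```

The key matters:  because there are fewer than a billion possible numbers, an unkeyed hash could be reversed by trying them all.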

The Data Fortress

The data-fortress pattern occurs when every possible security measure is applied, regardless of whether the resulting inconvenience to data users is disproportionate.  IT organizations that maintain strong separation between end users and technologists encourage this pattern.

Data Architecture

Finally, data-architecture debt (properly a case of design debt) can be the source of substantial data debt.

Amateur Hour

Errors of data representation (whatever the datastore) can degrade an application’s performance, scalability, and/or storage-space requirements by several orders of magnitude, and can make data unnecessarily hard to manipulate.  Data-architecture mistakes are particularly expensive because they can take many months, even years, to refactor, in the enterprise context.[v]  They are also expensive because often a great deal of code depends on the physical representation of a single entity type.  Data-architecture debt accrues when application developers erroneously suppose that they can design physical storage models as well as a data architect, or that they can treat the database as a “black box”[vi]—and these assumptions bear out for the first few months of application development, until the accrual of data and code brings to light substantial data-architecture oversights.  I recall a case where a single entity-type attribute was stored in nearly 50 different relational-database columns, having a half-dozen different types among them.  By the time this oversight came to light, it was practically impossible to refactor all of the representations into a “single source of truth.”
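An oversight like the one just described is easier to catch early with a periodic scan of the schema.  The sketch below is one hedged way to do that in SQLite:  it flags any column name that is declared with more than one type across tables.  It assumes the copies of an attribute share a column name, which was not necessarily true in the case I describe, and the tables and the postal_code column are hypothetical.

```python
import sqlite3
from collections import defaultdict


def columns_by_name(conn: sqlite3.Connection) -> dict[str, set[tuple[str, str]]]:
    """Map each column name to the (table, declared type) pairs where it appears."""
    found: dict[str, set[tuple[str, str]]] = defaultdict(set)
    tables = [row[0] for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table'")]
    for table in tables:
        # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
        for _cid, name, decl_type, *_rest in conn.execute(f'PRAGMA table_info("{table}")'):
            found[name.lower()].add((table, decl_type.upper()))
    return found


def inconsistent_columns(conn: sqlite3.Connection) -> dict[str, set[tuple[str, str]]]:
    """Keep only the column names that are declared with more than one type."""
    return {name: places for name, places in columns_by_name(conn).items()
            if len({decl_type for _table, decl_type in places}) > 1}


# Hypothetical schema in which one attribute has drifted into three types.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, postal_code INTEGER);
    CREATE TABLE orders    (id INTEGER, postal_code TEXT);
    CREATE TABLE shipments (id INTEGER, postal_code VARCHAR(10));
""")

problems = inconsistent_columns(conn)
for name, places in problems.items():
    print(name, sorted(places))
```

A report like this does not refactor anything, but it makes the debt visible while it is still a handful of columns rather than fifty.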

Architecture Worship

The less common (and usually less fatal) data-architecture debt occurs when a data architect is excessively idealistic in applying architectural forms.  I have seen this sort of idealism lead to database tables containing three rows, in the interest of achieving perfect third normal form.  Formal excess sometimes makes a physical data architecture difficult to query, and it can lead to performance problems.  Even so, it is relatively easy to remedy.


[i] Chris Sterling, Managing Software Debt:  Building for Inevitable Change (Addison-Wesley 2011).

[ii] Sterling, ibid., p. 2.

[iii] Sterling, ibid., p. 3.

[iv] Sterling, ibid., pp. xxi-xxii.

[v] Scott W. Ambler and Pramod J. Sadalage, Refactoring Databases:  Evolutionary Database Design (Addison-Wesley 2006), p. 34.

[vi] Thomas Kyte, “The Black Box Approach,” Expert Oracle Database Architecture, 2nd Ed. (Apress 2010), pp. 3-11.
