Data Architecture 101, Part 1:
posted by Mosaic Data Science
This post is the first in a series on relational database architecture and tuning. It’s a mature subject, but we continue to encounter programmers and data scientists who have limited exposure to the material. This blog aims to become a “nutshell” treatment of the subject, so those of you who work with data in a relational database management system (RDBMS) can quickly learn how to make the best possible use of a database.
Database design involves three kinds of models: business, logical, and physical. A business model is an informal or quasi-formal conceptual model (such as an enterprise ontology) that captures and describes classes of things that matter to the business. Ideally business analysts document the business model. The other two kinds of models are usually the work product of a data architect. The rest of this article (and series) is about them.
A logical model formalizes a business model, translating business categories into formal entity types (roughly speaking, that’s a class to you object-oriented programmers). An entity type has a name, a set of attributes, and a set of relations with other entity types. (Sorry, object-oriented types: no methods, and inheritance is nothing special—just one relation among many.) These relations have cardinalities, which say how many entities (instances) of one entity type can have a given relation with entities of another type. Cardinality also says whether the relation is mandatory on each side. For example, customer and salesperson could be entity types. One salesperson could have zero or more customers, while each customer has exactly one salesperson.
Entity-Relation (ER) diagrams are undirected graphs whose nodes represent entity types, and whose arcs represent relations among the entity types. The arcs have special ends that denote cardinality. See Wikipedia for details.
Strictly speaking a logical model should be expressed in a specific kind of structure called third normal form (3NF). To understand database normalization, you have to understand several, eh, key concepts. A primary key is a unique identifier for an entity. A natural primary key is a unique identifier that comes from outside of a database. (Someone enters it manually, or a data-integration process imports it.) Note that a natural key often is, but need not be, intelligible to a human being. A natural key for an entity type in one system can be the surrogate key for the same entity type, in another (source) system. What makes the key “natural” is whether it’s internally generated. For example, your social security number is your surrogate key in the Social Security Administration’s database. But it would be a natural key in any other database; and as a number, it has no particular meaning to a human being. It’s just a unique identifier that comes from somewhere else.
As an aside, a single entity type may have several possible natural keys. For example, an employee entity type might be uniquely identified by full name or social security number. In subsequent posts we’ll revisit this issue.
A surrogate primary key is a primary key generated by the database, usually a sequence-generated integer. A foreign key is a reference in one table to a primary key (usually surrogate) in another table. Finally, a join is a query that matches values in one column to values in another column. One very common kind of join matches a foreign key to the surrogate primary key it references.
Returning to logical models and normal forms: normalization basically means storing a value in just one place, assigning it a surrogate key there, and representing the value elsewhere with foreign keys that reference the surrogate key. Denormalization does just the opposite; it stores a value repeatedly. The more you normalize, the more joining you have to do. But the more you normalize, the fewer opportunities exist for data-quality problems. (For example, if an employee’s last name is only stored in one place, you only have to correct a misspelled last name in once place.) Also, the more you normalize, the less space you use, both on disk and during I/O. Those are the essential trade-offs that motivate the architectural patterns defining the various classes of relational database.
Note that the trade-offs among join overhead, data quality, and I/O are related to, but not quite the same as, the old programming heuristic that trades space for speed. In analytical contexts we generally avoid normalizing, thereby using more space, in order to achieve more speed. In transaction-processing contexts, we mostly normalize to achieve more speed. In part, which way we go depends on what kind of speed we’re after. Analytical speed involves traversing a great deal of data, sometimes computing on and outputting a lot of it in the process. Transactional speed is mostly about performing a single create, read, update, or delete (CRUD) operation.
We said above that a logical model should be in 3NF. This form approximates the degree of normalization common in online transaction processing (OLTP) database applications (the sort of application that processes things like retail sales, expense reports, and time cards). The architecture of such databases eliminates most common forms of denormalization.
The last type of data model is a physical model. (Note that we have defined a logical model as a 3NF physical model. That’s important: a logical model is basically a canonical physical model.) A physical model uses various normalization and/or denormalization techniques to transform a logical model into a form suitable for a specific use case defined mainly by its input/output (I/O) and data-quality requirements. For example, master data management (MDM) servers have very stringent data-quality requirements (because their job is to be the source of record for a few widely shared entity types), and they mostly receive and send one entity at a time. So, if the MDM server’s peak throughput requirements (in entities per second) are low enough that the server can do a lot of joining and still keep up with its I/O load, the MDM server’s physical model might be even more normalized than 3NF. At the other extreme, a data cube completely denormalizes its data, even avoiding joins; it is a single massive table containing nothing but values. Online analytical processing (OLAP) tools are designed to sift through cubes very quickly while applying various kinds of “slice and dice” operations over a substantial subset of the data, for example averaging a dollar amount over a time period and geography.
In our next post we’ll present all of the major relational architectures, describing them in terms of their (de)normalization, join, and I/O patterns. For now, remember that normalization favors transaction processing, while denormalization favors analytical processing. Much of the craft of data architecture lies in understanding when, how, and why to (de)normalize.