Data Architecture 101, Part 5: Indexes
posted by Mosaic Data Science
Indexes have two main purposes in relational databases. First, they can improve query performance. Second, they can implement data-integrity constraints. (For example, you can create a unique index to enforce a uniqueness constraint.) This article focuses on the former purpose, in the BI/analytics (not OLTP) context. Throughout, we use Oracle indexes as examples. Oracle’s indexing capabilities generally lead the market, so if you understand how to use indexes in an Oracle database, it’s easy to transfer that knowledge to other (less capable) RDBMS platforms. For example, SQL Server clustered tables approximate Oracle index-organized tables.
Remember that indexes are only one of many performance-optimization tools in the data-architecture toolkit. A common design mistake is to treat indexes as performance bandaids, created ad hoc long after a schema has been designed, to resolve performance problems that often could have been avoided by more careful schema design using a well-informed, integrated combination of design techniques such as denormalization, partitioning, and materialization, as well as indexing. Such ad hoc indexing can eventually cause its own performance problems. (For example, excessive indexing can slow down ETL by requiring a rebuild of many indexes as part of the ETL cycle.)
Oracle has several types of index: b*trees, bitmaps, function-based indexes, geospatial indexes, and domain indexes. While bitmaps are the most common in the analytics context, all of them can be useful. Below we survey analytics use cases for each type. Bear in mind that the analytics context implies the following data-access patterns:
- Queries over large active sets predominate.
- Daily or real-time ETL inserts and updates records.
- Data is never deleted. At most, old records are taken offline in batches, often by swapping old partitions out of the database once a month.
A b*tree index is a tree structured to support very rapid lookup of a single record, or of a small set or range of records. B*trees are height balanced; the leaf nodes are all at the same depth from the root. Even large b*tree indexes usually have a height of two or three nodes, so the number of I/O operations required to look up a single leaf node is very small. A leaf node contains the ROWID of a single record (row) in a table. A ROWID is a physical address on disk, so once the RDBMS has a record’s ROWID, it can read the row very quickly (because it knows exactly where to find it, physically speaking).
A b*tree can index a single column, or it can concatenate several columns. Consider the notional table
create table product(
You might create a single-column index like this:
create unique index product_i_npk on product(name);
(This index might enforce uniqueness of the name column as a natural primary key for products, as well as improving single-row product lookups by product name.) You might instead create a concatenated index like so:
create index product_i1 on product(name, category);
(Notice that we left out the ‘unique’ keyword this time. Had we included it, the combination of values would have to be unique—but the same name could be used within different categories, and vice versa.) Oracle can use the concatenated index to search on either name or category alone, as well as using it to search for combinations of names and categories. A search on category (the index’s trailing edge) is called an index skip scan. In OLTP contexts, where minimizing index count usually matters more, you might create a concatenated index that performs double or triple duty. In such cases, you would make sure the index’s leading edge is the most frequently used of the indexed columns. But in analytics contexts, you’re more likely to make sure that the leading edge is the most discriminative column among those indexed.
You can also create reverse-key indexes, which are b*trees with reversed keys. These are useful to reduce I/O contention for disk blocks that contain leaf blocks of indexes on sequential values such as sequence-generated surrogate primary keys or timestamps. Reverse-key indexes spread insertions across the index’s blocks.
Finally, you can create indexes that store keys in descending sort order. Such descending indexes are occasionally useful to support queries that sort some columns in ascending order and others in descending order.
Besides standalone b*trees, you can create index-organized tables. These are tables that are physically stored as b*tree indexes. (The alternative is a heap-organized table.) They avoid duplicating data in a table and an index on the table, when all reasonably anticipated query-execution plans will simply read the index. Index-organized tables have more DML overhead, but can be read faster, when the read uses the index structure to find matching rows. For example, if you have a table representing child objects in a parent-child relation (say employees reporting to the same manager), and you want to find all of a parent’s children quickly, you could index-organize the child object’s table by a foreign-key column pointing to the parent’s surrogate primary key.
An important statistic describing a b*tree index’s physical organization is its clustering factor. This statistic’s lower limit is the number of physical blocks in the index. Its upper limit is the number of indexed rows. A b*tree with a clustering factor near its lower limit has index blocks that mostly point to rows in the same data (table) blocks. This makes fetching rows within a range of index values very efficient. A b*tree with a clustering factor near its upper limit has index blocks that mostly point to rows in different data blocks. This increases I/O for range scans. Only one b*tree index on a given table can have a good clustering factor. In some cases you may want to use an index-organized table to make sure your index is organized to support efficient range scanning. Or you may want to organize the table physically with a b*tree cluster (search http://asktom.oracle.com/ for ‘cluster index’ to learn how to do that). And remember that when a b*tree index has a high clustering factor, the Oracle optimizer may decide not to use the index for range scans involving many key values, because doing so would involve lots of block I/O.
The traditional wisdom is that B*tree indexes are very common in OLTP, but have far fewer uses in analytics. This is a simplification. B*trees have two main use cases:
- You want to access a small percentage of a table. What “small percentage” means depends on whether the table is narrow (few columns) or wide (many columns). In the former case, the percentage might be two or three percent; in the latter, up to 25%. This case covers an important sub-case: you want to retrieve many rows, but you need to return the first row quickly. (In Oracle you can optimize the time required to return all rows, or to return the first row.) This is common for lookup tables, including dimensions in star schemas.
- You only want to access values of indexed columns. (Note that in an index-organized table, this isn’t much of a restriction! This makes index-organized tables especially useful in some analytics use cases. However, you cannot create a bitmap-join index on an index-organized table, so generally you should not organize dimension tables as indexes.) In this case you can use the index to read as many rows as you need, even the whole table.
Note that a b*tree index on a single column is a type of key-value datastore. Because you can pin an index in memory, a b*tree index can perform key-value lookups extremely quickly. If your application code also resides in the database (in Java or PL/SQL, say), it’s unlikely that you’ll get better performance from using an external noSQL key-value store than you would simply using a b*tree index. Beware novelty for novelty’s sake!
Bitmap indexes are by far the most common index type in analytics. A bitmap index uses very fast logical alternation (or) and conjunction (and) operations on bitmaps (binary strings) to determine which rows in a table match a set of query constraints (also represented as a bitmap). Oracle continues to improve bitmap performance, so you can now use bitmaps for low- or medium-cardinality data (columns having few distinct values, compared to the table’s row count).
The main use case for bitmaps is indexing foreign-key columns in fact tables. Recall that these columns point to the surrogate primary key columns in dimension tables. A dimension table contains the possible values of a single category used to classify facts. These categories usually have small sets of possible values, making the corresponding foreign-key columns in fact tables perfect candidates for bitmap indexing. In Oracle you must so index every foreign-key column in a fact table, for the Oracle optimizer to use the star transformation (an optimization strategy that reduces all fact-dimension joins to bitmap lookups, making queries against star schemas extremely fast).
You can also define bitmap-join indexes on joins between two tables. The join must be to a unique-valued column (such as a surrogate primary key) in the second table. In principle these let you have a level of normalization in your schema, but avoid join overhead at runtime. The main analytics use case is a bitmap-join index between a fact table and a dimension table. Oracle documentation claims bitmap-join indexes can improve performance by an order of magnitude (presumably compared to ordinary bitmap indexes). We personally have not seen this effect, at least in contexts where the star transformation is enabled. However, you should experiment with both approaches, and use whichever yields the best performance.
Bitmap indexes are extremely useful in constructing basic production-rule engines or pattern-matching algorithms, when a rule or pattern’s conditions are discretized (defined on categories). In this case a rule or pattern has the same form as a fact in a fact table. Each time an event arrives, you can query the fact table for rules or patterns that match the event. Bitmap operations make such queries fast enough to support high throughput (especially if you pin the rule/pattern table and its bitmap indexes in memory).
A function-based index indexes the argument and value of a function. Common applications include string operations such as case transformation (to support case-insensitive matching). In the absence of such indexes, where-clause conditions involving a function on a column make for slow queries, because the SQL engine must compute the function’s value at run time, for each row evaluated. Materializing the function value in an index trades space for speed.
In analytics environments, function-based indexes might be defined on measure columns of fact tables, on attribute columns of dimension tables, or similar columns in OLAP cubes. In the dimension case, consider materializing the function in a separate column within the dimension table itself.
A spatial index supports queries that express a variety of common relationships between spatial objects (termed geometries in Oracle). Oracle’s current implementation of spatial indexing uses R-tree indexes. (See http://en.wikipedia.org/wiki/R-tree for details.) The spatial relationships that R-tree indexes support include finding all geometries within a boundary and finding all geometries within a certain distance of a given geometry. In Oracle you can combine spatial constraints with ordinary where-clause conditions to express many common business questions, such as “What is the total of car-insurance liability settlements that we paid last quarter for accidents occurring in the greater Denver metropolitan area?” Oracle Locator is a set of basic spatial capabilities (including indexing) available in the Standard Edition of the Oracle database. More arcane capabilities are available in the Spatial Option.
Oracle exposes an API that lets you define your own index type. The API lets you tell the Oracle optimizer how to estimate related runtime overhead, and how to compute and store relevant performance statistics. For example, you might want to store XML representations of truck routes in the database, and create an index type that indexes routes by the roads they include. Such domain indexes extend the concept of a function-based index by supporting arbitrary operations, such as the inclusion relationship in the truck-route example. (See http://docs.oracle.com/cd/B28359_01/appdev.111/b28425/dom_idx.htm for details.) Domain-specific indexing can improve query performance when you need to express domain-specific query logic that as a practical matter cannot be expressed using elementary arithmetic operators on primitive datatypes (numbers, dates, strings, etc.), or using function-based indexes. Text, sound, and image indexing are examples.