There are basically three things you want to show people to get them to understand indexes. Say you sell an internet police service and have a table of URLs you're crawling on behalf of a table of clients: each client gets a client_id, and each URL references the client_id of the client who wants it policed. Here are the things you want to say:
1. Binary search. It'd be really handy to have the URLs stored in a different order: grouped by client_id first, with each client's URLs alphabetical within the group, so you can binary-search straight to any client's rows. Conceptually, the index just contains the same rows in a different order.
2. The read vs. write trade-off. Your data fits in memory if there's less than about 10 TB of it (possibly up to 100 TB), so the big concern is latency. What an index does is require you to write the same information twice in order to read it dramatically faster: a logarithmic-time lookup instead of a linear scan. So depending on how often you write versus how often you read, an index makes more or less sense.
3. High fan-out trees. Once you have explained that the index is just the same data in a different order, you have planted a sort of ticking time bomb of misunderstanding. The problem is, a beginner programmer’s notion of a freely indexable list that binary search works on is an array. How do we insert into an array? With all the work of array copies! So you get people who think that random ID columns should never be indexed, because every write into the middle of the array has to shove half the data one cell to the right. Unacceptable! So it really helps to say, “now we can store this as a binary tree and make one comparison per level of the tree.” Pause for understanding that the “list” can be stored as such a tree. “But modern computers have a nice property: right after you access some memory, the memory near that location is loaded into a cache, which is what makes arrays so fast. Suppose that instead of a binary tree’s 2 child pointers, each covering 50% of the data, a node holds an array of 100 pointers, each covering 1% of the data. You get something like a 6x speedup from the cache, because your tree is one-sixth the depth of the binary tree. And that’s basically what a B-tree is: you use a higher fan-out in your sort tree to exploit the cache.”
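Point 1 can be sketched in a few lines. This is a hypothetical illustration (the rows and client_ids are made up, not from the original example):

```python
import bisect

# Point 1 as code: the "index" is just the same rows in a different order,
# sorted by (client_id, url), so binary search finds any client's URLs.
rows = [
    (7, "example.com/a"),
    (2, "foo.com"),
    (7, "bar.com"),
    (2, "baz.org"),
    (5, "qux.net"),
]

index = sorted(rows)  # same data, different order

def urls_for_client(client_id):
    """Binary-search to the first row for client_id, then scan forward."""
    lo = bisect.bisect_left(index, (client_id,))
    out = []
    for cid, url in index[lo:]:
        if cid != client_id:
            break
        out.append(url)
    return out

print(urls_for_client(7))  # ['bar.com', 'example.com/a']
```

The `(client_id,)` probe works because a shorter tuple with an equal prefix sorts before any longer one, so `bisect_left` lands on the first row for that client.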
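Point 2, the write-twice-to-read-fast trade-off, can be sketched like this (again hypothetical; a real database keeps the index in a B-tree per point 3, not a Python list, so its index writes are cheaper than `insort`'s shifting):

```python
import bisect

# Every INSERT now writes twice (the row plus the index entry); every
# indexed SELECT does a binary search instead of scanning the whole table.
table = []   # rows in arrival order (the "heap")
index = []   # (client_id, row_number) pairs, kept sorted

def insert(client_id, url):
    table.append((client_id, url))                     # write #1: the row
    bisect.insort(index, (client_id, len(table) - 1))  # write #2: the index

def select(client_id):
    """All rows for client_id, found via binary search on the index."""
    lo = bisect.bisect_left(index, (client_id, -1))
    out = []
    for cid, rownum in index[lo:]:
        if cid != client_id:
            break
        out.append(table[rownum])
    return out

insert(2, "foo.com")
insert(7, "bar.com")
insert(2, "baz.org")
print(select(2))  # [(2, 'foo.com'), (2, 'baz.org')]
```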
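And point 3 in numbers: the depth comparison behind the "roughly one-sixth the depth" claim, assuming full trees over a billion keys (the billion is my figure, picked for illustration):

```python
import math

# Depth of a full search tree over n keys:
# fan-out 2 (binary tree) vs. fan-out 100 (B-tree-style node).
n = 1_000_000_000
binary_depth = math.ceil(math.log(n, 2))    # ~30 levels
wide_depth = math.ceil(math.log(n, 100))    # ~5 levels
print(binary_depth, wide_depth)  # 30 5
```

Five cache-friendly levels versus thirty pointer-chasing ones is the whole pitch for high fan-out.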
This feels a bit reductive. Database indexes definitely have more to consider than simple lookup tables. I’d say that mental model breaks down in practice.
It is, but with all respect, that’s not a useful answer (or even an accurate one; see below). Take the top-voted answer for contrast: I could probably implement a database index scheme of some sort based on that explanation. It might not be a good one, but it would work. Parent comment’s answer? I wouldn’t know where to start.
Additionally, parent’s answer says what an index is, but the question was “how does it work?”, which it doesn’t actually answer. The top SO answer does a great job with that, I think. Another answer a few down also goes into some of the downsides (it slows down writes, for one). The whole page is worth a skim if you’re not a DBA but have to fiddle with DBs on occasion.