In this episode my friend Vikas Popuri and I chat about Google’s paper comparing ML models to traditional DB indexes.
Background:
- Google used learned indexes , machine learning models, to access data and compared these to B-Tree, Hash, and Bloom Filter indices
- Trained a model using multiple stages where the earlier stages could approximate a location and later stages would work with a subset to improve accuracy. Each stage could choose a different model to advance the search further.
- FYI, the diagram below looks like a decision tree, but it is not. Each stage/model could have different distributions and could repeat the model used above or below.
- They achieved access time and space savings across the board, even without using GPUs or TPUs (Tensor Processing Units)
- “Retraining the model” – the tests were performed on a static data set, so no retraining or index maintenance was required.
Observations / Questions:
- Used Tensorflow with Python as the front end — apparently a lot of initial overhead with this as a test stack.
- B-Tree indexes to some extent are a model, especially if they don’t store every key and instead store the first key in a page.
- The paper made some rudimentary assumptions, such as using a random hash function.
- What if the data is not static? How long would it take to retrain the model vs. maintain an index?
- What if data profiling caused you to index certain attributes and not others?
- What are the best practices with this newer approach
- The power of being able to use different models at different stages is intriguing. You could also potentially maintain traditional indexes as a backup / failsafe that would upper bound to the performance of a B-Tree.
- Load times – The folks from Google commented that they could retrain a simple model on a 200M data set in “just [a] few seconds if implemented in C++”
- Recursive question: do you need an optimizer to optimize the optimization path?
- Room for improvement:
- GPUs/TPUs
- Incorporating common queries into the model to know what questions people are asking
Music
Deep Sky Blue by Graphiqs Groove via FreeMusicArchive.org
Sources:
- http://learningsys.org/nips17/assets/papers/paper_22.pdf
- http://blog.ezyang.com/2017/12/a-machine-learning-approach-to-database-indexes-alex-beutel/
- https://towardsdatascience.com/what-if-i-told-you-database-indexes-could-be-learned-6cf8f59bff94
- https://www.arxiv-vanity.com/papers/1712.01208/
- https://www.oreilly.com/ideas/how-machine-learning-will-accelerate-data-management-systems
Podcast: Play in new window | Download