Differences between Cloudera Oryx and Apache Mahout
- An operational ML system eventually needs to do 3 broad things:
  - Build models at scale, offline
  - Update models in near real time
  - Query models in real time
- Most tools, such as Mahout or MLlib, only build models at scale.
- Oryx tries to do all 3, and model building is not its main focus.
- Therefore it is really intended as a complement to any Hadoop-based model-building system.
- As a result, its model building is MapReduce-based, and it implements its own algorithms instead of using Mahout's, to improve on perceived problems.
- The project, which is open source, is designed more as 3 complete apps than as a platform for extension.
- It implements only:
  - ALS for recommendation
  - k-means for clustering
  - Random decision forests for classification and regression
- The major difference is fewer algorithms, but complete apps that include incremental update and serving. The algorithms themselves are not really the difference, since Oryx is not a new library.
- The next version is built on Spark and Kafka, and becomes more of a generic lambda architecture for ML that happens to include entire apps too.
- It is a kind of Summingbird for ML on Spark. It currently has no algorithm implementations at all, which makes it even more different from Mahout or MLlib.
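
The three responsibilities listed above (build offline, update in near real time, query in real time) can be pictured with a minimal sketch. This is not Oryx code and does not use its API: the function names, the `Event` type, and the toy "model" (a simple item popularity count rather than ALS, k-means, or decision forests) are all assumptions chosen only to show how the three layers relate.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical interaction record: (user_id, item_id)
Event = Tuple[str, str]


def build_model(history: Iterable[Event]) -> Counter:
    """Batch layer: rebuild the whole model offline from all historical events."""
    return Counter(item for _user, item in history)


def update_model(model: Counter, event: Event) -> None:
    """Speed layer: fold one new event into the current model, in place."""
    _user, item = event
    model[item] += 1


def query_top(model: Counter, n: int) -> List[str]:
    """Serving layer: answer a query from the current in-memory model."""
    return [item for item, _count in model.most_common(n)]


history = [("u1", "a"), ("u2", "a"), ("u3", "b")]
model = build_model(history)        # offline build (done at scale in a real system)
update_model(model, ("u4", "b"))    # incremental updates between full rebuilds
update_model(model, ("u5", "b"))
top = query_top(model, 1)           # real-time query against the updated model
```

In a real deployment each layer would be a separate process, with something like a message queue carrying events and model updates between them; here they are plain functions sharing one in-memory object to keep the sketch short.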