Blog

RumbleML, a declarative machine learning framework

Can Berker Cikis, 17 June 2020

Introduction

Machine learning is an essential cornerstone of contemporary computer science. Its influence spans various industries, and it tackles a widening range of complex problems that require increasingly sophisticated solutions. The field's immense popularity is reflected in the growing number of machine learning engineers and data scientists joining its ranks.

Unfortunately, throwing more resources at a problem is rarely the optimal solution. Specifically, we cannot simply rely on having bigger teams of engineers to solve complex problems. Instead, better tooling and improved methodologies can drive machine learning forward in a much more reliable and feasible manner. To this end, I will introduce the RumbleML machine learning framework in this article.

RumbleML is designed to address the two main shortcomings of present-day machine learning tools that hinder productivity:

Data Independence

Wikipedia, the holy grail of internet knowledge, contains one of the best definitions of data independence I have come across:

“[Data independence] refers to the immunity of user applications to changes made in the definition and organization of data.”

In other words, users can focus solely on their application semantics while relying on the system to handle the complexity of the underlying data operations. The definition continues as follows:

“Application programs should not, ideally, be exposed to details of data representation and storage.”

This layering is best achieved by resorting to declarative paradigms. A declarative language is a language in which users specify what they want rather than how they want it. A properly realized declarative design yields good abstractions that make the system much easier to use. This, in turn, enables users to be more productive and to write code with fewer bugs.

The trend of resorting to declarative paradigms instead of imperative ones has been steadily gaining traction in modern languages and frameworks. For example, multi-paradigm languages such as Scala and JavaScript are becoming increasingly prominent. Additionally, the web technologies that display this article in your browser are fueled by the declarative nature of HTML and, potentially, by declarative web frameworks such as React.js. Yet, when we look at the domain of machine learning, most frameworks expose APIs in imperative languages. This limits productivity, as users must handle algorithm semantics, underlying data representations, and runtime execution all at once. Switching to declarative paradigms for machine learning frameworks could therefore go a long way toward enhancing user productivity.
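The contrast between the two paradigms can be illustrated with a tiny, purely generic sketch (plain Python, unrelated to Rumble): the imperative version spells out how to iterate and accumulate, while the declarative version only states what result is wanted.

```python
# Illustrative contrast (not Rumble code): the same computation written
# imperatively and declaratively in Python.

records = [{"name": "a", "score": 3}, {"name": "b", "score": 7}, {"name": "c", "score": 5}]

# Imperative: the user spells out *how* -- iteration, state, accumulation.
high = []
for r in records:
    if r["score"] >= 5:
        high.append(r["name"])

# Declarative: the user states *what* -- a comprehension the runtime is
# free to evaluate however it likes.
high_declarative = [r["name"] for r in records if r["score"] >= 5]

assert high == high_declarative == ["b", "c"]
```

A declarative system can reorder, parallelize, or otherwise optimize the second form without the user changing a line, which is exactly the layering that data independence asks for.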

Scalability

Another shortcoming of popular ML frameworks is their reliance on additional tools to provide scalability. The need for additional tools increases the burden on users, who are forced to handle a larger tech stack during development and maintenance. These additional tools also commonly expose imperative APIs that further aggravate the data independence limitations. For example, Apache Spark is a fantastic framework that provides scalability via distributed processing. It offers out-of-the-box distributed scalability for arbitrary user programs; in other words, custom machine learning algorithms can be made scalable by using Spark. However, this can easily turn into a daunting task, as Spark offers APIs only in imperative languages, which require substantial effort to express algorithm semantics.

The SparkML library, which is built into the Spark framework, offers a large variety of already implemented machine learning algorithms. If the desired use case fits one of these available algorithms, the implementation effort can be greatly reduced. Furthermore, if the data is already structured, applying machine learning becomes almost trivial.

Unfortunately, in real-world applications, data is hardly ever in a state that is ripe for applying machine learning. Data cleaning and pre-processing operations are vital to the majority of machine learning pipelines. This is where the lack of data independence caused by imperative APIs becomes a real pain point. In the presence of unstructured data, which can be heterogeneous and nested, the best solution Spark can offer is casting everything down to strings. This is far from ideal: all type information gets lost in the process, and the system offers the user little to no help in picking up the pieces.

Enter Rumble

The Rumble engine was introduced to address the shortcomings of Spark with regard to heterogeneous data and the lack of data independence. Rumble interfaces Spark with the functional, declarative query language JSONiq. JSONiq inherits 95% of its features from XQuery while leaving out the peculiar and hard-to-understand bits. The language is fully composable and Turing-complete. The Rumble engine automatically maps user queries written in JSONiq to Spark execution plans. This gives users the scalability of Spark at no additional implementation cost, together with full support for heterogeneous data processing.

Fig1. Spark + JSONiq = Rumble

The most prominent feature of JSONiq is the FLWOR expression, which provides the expressiveness of SQL's SELECT-FROM-WHERE statements in the context of heterogeneous data. The querying and data manipulation capabilities of FLWOR expressions are briefly demonstrated with a simplified example below. (The original example is taken from an earlier blog post on Rumble, which can be found in the references.) Imagine we have the following "person" data set containing name and year data, where the year data is both heterogeneous and nested.

{ "Name" : "Peter", "Year" : [ 2015, 2014 ] }
{ "Name" : "John", "Year" : [ 2013, "2018" ] }	// heterogeneous
{ "Name" : "Helen", "Year" : [ 2012, [ 2017, 2019 ] ] }	// nested

The following query aggregates the year data; JSONiq's FLWOR expression immensely simplifies the processing of this heterogeneous dataset. Performing the same task in a language that is neither declarative nor heterogeneity-aware immediately turns into low-level programming, as the user is forced to handle iteration and type information manually.

Query:

for $person in json-file("people.json")
let $years := flatten($person.Year)
return { "Name" : $person.Name, "NumberOfYears" : count($years) }

Result:

{ "Name" : "Peter", "NumberOfYears" : 2 }
{ "Name" : "John", "NumberOfYears" : 2 }
{ "Name" : "Helen", "NumberOfYears" : 3 }
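For contrast, here is roughly what the same task looks like in plain imperative code (a Python sketch, not part of Rumble): the recursion over nested arrays, which JSONiq's built-in flatten() handles for free, has to be written by hand.

```python
import json

# Imperative sketch of the same aggregation: nesting and heterogeneity
# must be handled manually by the programmer.
lines = [
    '{ "Name" : "Peter", "Year" : [ 2015, 2014 ] }',
    '{ "Name" : "John", "Year" : [ 2013, "2018" ] }',
    '{ "Name" : "Helen", "Year" : [ 2012, [ 2017, 2019 ] ] }',
]

def flatten(value):
    """Recursively unbox nested lists into a flat list, mimicking JSONiq's flatten()."""
    if isinstance(value, list):
        return [leaf for item in value for leaf in flatten(item)]
    return [value]

results = []
for line in lines:
    person = json.loads(line)
    years = flatten(person["Year"])
    results.append({"Name": person["Name"], "NumberOfYears": len(years)})
```

Even in this toy case, the user ends up re-implementing iteration and array unboxing; the declarative query delegates all of that to the engine.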

Standing on the Shoulders of Giants: RumbleML

The Rumble engine recently took another big step forward. SparkML, which was mentioned earlier, is quite extensive, as it is backed by a strong open-source community of academics and professionals. Since Rumble was already capable of interfacing with Spark, the engine was extended to interface with the SparkML library as well. Through these efforts, the RumbleML framework was conceived. The existing user productivity and automatic optimization benefits of Rumble are now complemented by the ready-to-use algorithms of RumbleML. With its initial release, RumbleML covers over 80% of the functionality offered by SparkML, and development efforts to increase coverage and further optimize performance are ongoing.

Fig2. Rumble + ML = RumbleML

At a high level, a machine learning query in Rumble consists of three main steps that are common across many similar frameworks: data loading, model training, and model usage (omitting evaluation steps). A pipeline of this nature is demonstrated in pseudo-code below:

let $train_data := load-data(...)       // structured data ready to be used in ML
let $test_data := load-data(...)
let $model := train-model($train_data)  // model training is a higher order function that returns a model function 
return $model($test_data)               // apply the model function to get predictions

The concrete implementation and usage of RumbleML revolve around mapping the SparkML concepts to Rumble. The core SparkML concepts of “estimator” and “transformer” are seamlessly mapped into function items of the JSONiq data model. Training sets, test sets, and validation sets, which contain features and labels, are exposed through JSONiq sequences of object items: the keys of these objects are the features and labels. As such, RumbleML simply delegates computation to SparkML without re-inventing the wheel.

Transformers

A transformer is a function item that maps a sequence of objects to a sequence of objects. It is an abstraction that either performs a feature transformation or generates predictions based on trained models. For example:

Fig3. Tokenizer operation

Fig4. KMeans operation
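The transformer abstraction itself can be sketched outside of Rumble. The following Python snippet is purely illustrative (the whitespace tokenizer is a stand-in, not SparkML's Tokenizer): a transformer is simply a function from a sequence of objects to a sequence of objects.

```python
# Illustrative sketch of the transformer abstraction: a function mapping
# a sequence of objects (dicts) to a sequence of objects. The whitespace
# tokenizer below is a stand-in, not SparkML's actual Tokenizer.

def tokenizer(objects):
    """Add a 'tokens' key holding the whitespace-split 'sentence' of each object."""
    return [{**obj, "tokens": obj["sentence"].split()} for obj in objects]

data = [{"id": 0, "sentence": "hello rumble ml"}]
transformed = tokenizer(data)
# transformed[0]["tokens"] == ["hello", "rumble", "ml"]
```

Because a transformer has this uniform shape, transformers compose naturally: the output sequence of one is a valid input sequence for the next.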

Estimators

An estimator is a function item that maps a sequence of objects to a transformer. Since it is a function that returns another function, an estimator is a higher-order function. Estimators abstract the concept of a machine learning algorithm that fits, or trains, on data. For example, a learning algorithm such as KMeans is implemented as an estimator. Calling this estimator on data trains a KMeansModel, which is a model and hence a transformer.
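The higher-order-function pattern behind estimators can also be sketched in a few lines of plain Python (illustrative only; the "learning" here is a trivial mean, not a SparkML algorithm): the estimator consumes training data and returns a model, which is itself a transformer.

```python
# Illustrative sketch (not RumbleML code): an estimator is a higher-order
# function that consumes training data and returns a transformer.

def mean_estimator(train_objects):
    """'Fit' by computing the mean of 'x'; return the trained model (a transformer)."""
    mean = sum(obj["x"] for obj in train_objects) / len(train_objects)

    def model(test_objects):
        # The returned transformer tags each object with its deviation
        # from the mean learned at training time.
        return [{**obj, "deviation": obj["x"] - mean} for obj in test_objects]

    return model

model = mean_estimator([{"x": 1.0}, {"x": 3.0}])   # learned mean is 2.0
predictions = model([{"x": 5.0}])
# predictions == [{"x": 5.0, "deviation": 3.0}]
```

The training data is captured in the returned closure, which is exactly how a trained KMeansModel carries its learned cluster centers into prediction time.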

With these concrete concepts in place, the earlier pseudo-code can be written in a form that is directly executable in Rumble.

let $train_data := structured-json-file("/path/to/training/data")
let $test_data := structured-json-file("/path/to/test/data")
let $estimator := get-estimator("KMeans")
let $model := $estimator($train_data, { "k": 2 })   // Train KMeans with k=2
return $model($test_data)                           // apply model

Putting it all together: RumbleML provides out-of-the-box machine learning algorithms, while Rumble offers the fully composable querying capabilities of JSONiq to handle the pre-processing of semi-structured data.

let $raw_data := json-file("people.json")
let $preprocessed_train_data :=
  for $person in $raw_data
  let $years := flatten($person.Year)
  return { "Name" : $person.Name, "NumberOfYears" : count($years) }
let $preprocessed_train_data := annotate(
  $preprocessed_train_data,
  { "Name" : "string", "NumberOfYears" : "integer" }
)
let $estimator := get-estimator("KMeans")
let $model := $estimator($preprocessed_train_data, { "k": 2 })   // Train KMeans with k=2
let $test_data := structured-json-file("people_test.json")
return $model($test_data) 

Conclusion

To wrap up, RumbleML is a new machine learning framework that aims to provide superior usability and scalability compared to alternative ML frameworks, thanks to its declarative nature and its use of Apache Spark.

Why should you try Rumble & RumbleML?

I strongly recommend trying Rumble for yourself by following the simple steps on our get-started page.

If you have any recommendations or run into any trouble, please feel free to create an issue in our open-source repository on GitHub.

References

Written by Can Berker Cikis