One of the textbook examples for parallel processing is building a standard inverted index in the context of information retrieval. In this post, we will show how Rumble, a JSONiq engine running on Spark, can be used to build such an index in only a few lines of code. This demonstrates that, even if the input is not JSON but a collection of strings (lines of text), JSONiq can handle it and easily create an index in the JSON format.
For the purpose of this post, we will use the complete text of Sherlock Holmes books, which can be found and downloaded here. The code, however, can be used with any input collection.
There is nothing special about this file; it is just plain text organized in many short lines:
THE COMPLETE SHERLOCK HOLMES
Arthur Conan Doyle
Table of contents
A Study In Scarlet
The Sign of the Four
The Adventures of Sherlock Holmes
A Scandal in Bohemia
The Red-Headed League
A Case of Identity
The Boscombe Valley Mystery
The Five Orange Pips
The Man with the Twisted Lip
The Adventure of the Blue Carbuncle
The Adventure of the Speckled Band
The Adventure of the Engineer's Thumb
The Adventure of the Noble Bachelor
The Adventure of the Beryl Coronet
The Adventure of the Copper Beeches
...
In general, the approach described here will also work with a large number of files: Spark can partition the lines not only across files, but also within a single file.
Rumble introduces the json-file() function to read any number of JSON Lines files as a sequence of objects.
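As a quick illustration, and assuming a directory of JSON Lines files at a placeholder path, reading them is a one-liner:

(: the path is a placeholder; this yields one object per JSON line, read in parallel :)
json-file("/tmp/some-json-lines-data")

We will come back to json-file() at the end of this post, when we query the index we have built.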
However, the data model of JSONiq supports sequences of any items: the function text-file() reads any number of text files and returns a sequence of strings, one for each line. Internally, Rumble actually manipulates a Spark RDD of string items.
An idempotent query that returns the entire collection would be
text-file("sherlock.txt", 100)
Here, we use 100 partitions. Or, more generally, to read an entire directory:
text-file("/home/ghislain/corpus/*", 100)
The following query is equivalent but explicitly binds each line to a variable:
for $line in text-file("sherlock.txt", 100)
return $line
The above queries (assuming they are stored in the file query.jq) can be run locally from the command line with:
spark-submit spark-rumble-1.1.jar \
--query-path ./query.jq \
--output-path ./output-directory
as well as on a cluster by submitting the appropriate parameters to spark-submit.
Omitting --output-path in a local execution will simply print the results to the standard output, which can be best for experimenting. Only the first 1000 items will be shown to avoid a crash, but this can be changed with --result-size.
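For example, to experiment locally while printing up to 10,000 items (the limit chosen here is purely illustrative), one could run something along these lines:

# the value 10000 is arbitrary and only serves as an example
spark-submit spark-rumble-1.1.jar \
--query-path ./query.jq \
--result-size 10000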
The output of all the queries above will be in the typical format of a MapReduce or Spark job: a directory filled with part files (part-00000, part-00001, and so on).
This is what we will start with.
The next step in building a standard inverted index is to tokenize the lines. Tokenization in information retrieval is a core topic in itself: it may involve lemmatization, stemming, etc. Here, we will keep it simple and start by simply tokenizing on spaces.
In order to do so, we just need to add a second for clause that iterates over an invocation of the tokenize() function, which returns a sequence of strings:
for $line in text-file("sherlock.txt", 100)
for $token in tokenize($line, " ")
where $token ne ""
return $token
We added the where clause to filter out any empty strings (when a line starts with a space, for example):
THE
COMPLETE
SHERLOCK
HOLMES
Arthur
Conan
Doyle
Table
of
contents
A
Study
In
Scarlet
The
Sign
of
the
Four
The
Adventures
of
Sherlock
Holmes
A
Scandal
in
Bohemia
The
Red-Headed
League
A
Case
of
Identity
...
There are of course duplicates; we could eliminate them with a distinct-values call (this will be executed in parallel as well):
distinct-values(
for $line in text-file("sherlock.txt", 100)
for $token in tokenize($line, " ")
where $token ne ""
return $token
)
However, since we want to associate each unique term with a postings list, a group by clause is a better idea. An order by clause will also sort the terms alphabetically, like so:
for $line in text-file("sherlock.txt", 100)
for $token in tokenize($line, " ")
where $token ne ""
group by $token
order by $token ascending
return $token
Since we didn’t do any cleanup, we get a lot of special characters.
"
"'"A
"'"Ah,
"'"And
"'"But
"'"Done!"
"'"Gone!
"'"Ha,
"'"He
"'"Hullo,
"'"I
"'"I'd
"'"I'll
"'"I'm
"'"I've
"'"It's
"'"Just
"'"Mr.
"'"No."
"'"Oh,
...
But we can easily adapt the tokenize() function call to get rid of the most common special characters:
for $line in text-file("sherlock.txt", 100)
for $token in tokenize($line, "[ \"'()&!\\-*,:\\.;/]")
where $token ne ""
group by $token
order by $token ascending
return $token
This now looks better: numbers come first.
000
1
10
100
1000
104
109
10s
10th
11
117
117th
11th
12
126b
127
129
12s
12th
...
The next step is to build the postings lists (here, we treat each line as a document, so there is a posting for each term appearing in a line). We need to assign a doc-id to each line.
This can be done with the count clause. Also, in the return clause, we now need to return a more structured construct with a nested array of doc-ids. The syntax is that of JSON, but with the possibility to dynamically compute values with nested JSONiq expressions.
for $line in text-file("sherlock.txt", 100)
count $docid
for $token in tokenize($line, "[ \"'()&!\\-*,:\\.;/]")
where $token ne ""
group by $token
order by $token ascending
return {
"Term" : $token,
"Postings" : [ distinct-values( $docid ) ]
}
Et voilà :
{ "Term" : "000", "Postings" : [ 11552, 11631, 11634, 13921, 19725 ] }
{ "Term" : "1", "Postings" : [ 615, 3497, 3582, 9657, 22687, 33863, 40408, 62344, 76758 ] }
{ "Term" : "10", "Postings" : [ 629, 13928, 27179, 46790, 69917, 76367 ] }
{ "Term" : "100", "Postings" : [ 66036 ] }
{ "Term" : "1000", "Postings" : [ 20669, 22683 ] }
{ "Term" : "104", "Postings" : [ 67353 ] }
{ "Term" : "109", "Postings" : [ 51994 ] }
{ "Term" : "10s", "Postings" : [ 14806, 17882 ] }
{ "Term" : "10th", "Postings" : [ 14104 ] }
{ "Term" : "11", "Postings" : [ 630, 2129, 12617, 18052, 17974, 21106, 27179, 62403, 64203 ] }
{ "Term" : "117", "Postings" : [ 16194, 16199 ] }
{ "Term" : "117th", "Postings" : [ 27249, 27691 ] }
{ "Term" : "11th", "Postings" : [ 73290 ] }
{ "Term" : "12", "Postings" : [ 631, 35064 ] }
{ "Term" : "126b", "Postings" : [ 24033, 24088, 24075 ] }
{ "Term" : "127", "Postings" : [ 51993 ] }
{ "Term" : "129", "Postings" : [ 2088 ] }
{ "Term" : "12s", "Postings" : [ 16207 ] }
{ "Term" : "12th", "Postings" : [ 14105 ] }
{ "Term" : "13", "Postings" : [ 1828, 1887, 1898, 35202, 51993, 62841, 62843, 62921, 63570, 64750, 69239 ] }
...
This is it. If we output this result to a specific path, say, an “index” folder, we have built our index in the JSON Lines format.
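Concretely, assuming the index-building query above is stored in a file called build-index.jq (a name chosen just for this sketch), the index could be materialized with a command along these lines:

# build-index.jq is a hypothetical name for the query shown above
spark-submit spark-rumble-1.1.jar \
--query-path ./build-index.jq \
--output-path /tmp/index

The /tmp/index path matches the one used in the lookup query below.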
Now we can use JSONiq to query the index. For example, to search for “Crow”, we look up its postings list and then scan the document for the lines given by those postings. Note that, in the query below, $line = $postings is a general comparison: it is true whenever the current line number appears anywhere in the postings sequence.
let $postings := json-file("/tmp/index")[$$.Term eq "Crow"].Postings[]
for $document in text-file("sherlock.txt")
count $line
where $line = $postings
return $document
And here we go:
to the Crow Hill, a huge business which was in strong hands which had
not only over the killing of the manager and engineer of the Crow
It should be absolutely clear to the reader that, of course, the best performance would be achieved with additional data structures, such as a B+-tree or hash table to organize the terms, as well as direct access to any line of text by its number: scanning at query time defeats the purpose of having an index in the first place.
However, building a standard inverted index is almost a textbook example of how to use JSONiq, because it involves many of the different FLWOR clauses as well as a mix between a textual model and a semi-structured model.
We hope this post showed how JSONiq makes this kind of task simpler than using PySpark or DataFrames. The key feature of the Rumble project is to maintain data independence and have the developer think in terms of only one thing: sequences of items, even if the data is heterogeneous and/or nested.