
Building an inverted index on a large text collection with JSONiq

Ghislain Fourny, 21 August 2019

One of the textbook examples for parallel processing is building a standard inverted index in the context of information retrieval. In this post, we will show how Rumble, a JSONiq engine running on Spark, can be used to build such an index in only a few lines of code. This demonstrates that, even if the input is not JSON but a collection of strings (lines of text), JSONiq can handle it and easily create an index in the JSON format.

The Corpus

For the purpose of this post, we will use the complete text of the Sherlock Holmes books, which can be found and downloaded online. The code, however, can be used with any input collection.

There is nothing special about this file; it is just text organized into a large number of short lines:





                          THE COMPLETE SHERLOCK HOLMES

                               Arthur Conan Doyle



                                Table of contents

               A Study In Scarlet

               The Sign of the Four

                  The Adventures of Sherlock Holmes
               A Scandal in Bohemia
               The Red-Headed League
               A Case of Identity
               The Boscombe Valley Mystery
               The Five Orange Pips
               The Man with the Twisted Lip
               The Adventure of the Blue Carbuncle
               The Adventure of the Speckled Band
               The Adventure of the Engineer's Thumb
               The Adventure of the Noble Bachelor
               The Adventure of the Beryl Coronet
               The Adventure of the Copper Beeches
...               

In general, the approach described here will also work with a large number of files. Spark can partition the lines across files, but also within a single file.

Reading a large collection of strings

Rumble introduces the json-file() function to read any number of JSON Lines files as a sequence of objects.
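
For example, assuming a hypothetical JSON Lines file called people.json in which each line is an object with a name field, each line is parsed into one object that can be navigated directly:

for $person in json-file("people.json")
return $person.name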

However, the JSONiq data model supports sequences of arbitrary items – the function text-file() reads any number of text files and returns a sequence of strings: one for each line.

Internally, Rumble actually manipulates a Spark RDD of string items.
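
For instance, a simple aggregation such as counting all lines is pushed down to Spark and executed in parallel (a minimal sketch):

count(text-file("sherlock.txt", 100))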

An idempotent query that returns the entire collection would be

text-file("sherlock.txt", 100)

where we use 100 partitions. More generally, to read an entire directory:

text-file("/home/ghislain/corpus/*", 100)

The following query is equivalent but explicitly binds each line to a variable:

for $line in text-file("sherlock.txt", 100)
return $line

The above queries (assuming they are saved in the file query.jq) can be run locally on the CLI with:

spark-submit spark-rumble-1.1.jar \
    --query-path ./query.jq \
    --output-path ./output-directory

as well as on a cluster by submitting the appropriate parameters to spark-submit.
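
As a sketch, a submission to a YARN cluster might look as follows; --master and --deploy-mode are standard spark-submit parameters, and the HDFS paths are illustrative assumptions:

spark-submit --master yarn --deploy-mode client \
    spark-rumble-1.1.jar \
    --query-path hdfs:///user/ghislain/query.jq \
    --output-path hdfs:///user/ghislain/output-directory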

Omitting --output-path on a local execution will simply print the results on the standard output, which is convenient for experimenting. Only the top 1000 items will be shown to avoid a crash, but this limit can be changed with --result-size.
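
For example, the following sketch runs the query locally and prints only the first ten items to the standard output:

spark-submit spark-rumble-1.1.jar \
    --query-path ./query.jq \
    --result-size 10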

The output of all queries above will be in the typical format of a MapReduce or Spark job: a directory filled with part-iiiii files.

This is what we will start with.

Tokenizing

The next step in building a standard inverted index is to tokenize the lines. Tokenization in information retrieval is a core topic in itself: it may involve lemmatization, stemming, etc. Here, we will keep it simple and start by tokenizing on spaces alone.

In order to do so, we just need to add a second for clause that iterates over an invocation of the tokenize() function, which returns a sequence of strings:

for $line in text-file("sherlock.txt", 100)
for $token in tokenize($line, " ")
where $token ne ""
return $token

We added the where clause to filter out any empty strings (when a line starts with a space, for example):

THE
COMPLETE
SHERLOCK
HOLMES
Arthur
Conan
Doyle
Table
of
contents
A
Study
In
Scarlet
The
Sign
of
the
Four
The
Adventures
of
Sherlock
Holmes
A
Scandal
in
Bohemia
The
Red-Headed
League
A
Case
of
Identity
...

Getting the distinct terms

There are of course duplicates; we could eliminate them with a distinct-values call (this will be executed in parallel as well):

distinct-values(
  for $line in text-file("sherlock.txt", 100)
  for $token in tokenize($line, " ")
  where $token ne ""
  return $token
)

However, since we want to associate each unique term with a postings list, a group by clause is a better idea. An order by clause additionally sorts the terms alphabetically, like so:

for $line in text-file("sherlock.txt", 100)
for $token in tokenize($line, " ")
where $token ne ""
group by $token
order by $token ascending  
return $token

Since we didn’t do any cleanup, we get a lot of special characters.

"
"'"A
"'"Ah,
"'"And
"'"But
"'"Done!"
"'"Gone!
"'"Ha,
"'"He
"'"Hullo,
"'"I
"'"I'd
"'"I'll
"'"I'm
"'"I've
"'"It's
"'"Just
"'"Mr.
"'"No."
"'"Oh,
...

But we can easily adapt the tokenize() function call to get rid of the most common special characters:

for $line in text-file("sherlock.txt", 100)
for $token in tokenize($line, "[ \"'()&!\\-*,:\\.;/]")
where $token ne ""
group by $token
order by $token ascending
return $token

The output now looks better: numbers come first.

000
1
10
100
1000
104
109
10s
10th
11
117
117th
11th
12
126b
127
129
12s
12th
...

Postings lists

The next step is to build the postings lists (here we assume one posting for each term appearing in a line). We need to assign a doc-id to each line.

This can be done with the count clause. Also, in the return clause, we now need to return a more structured construct with a nested array of doc-ids. The syntax is that of JSON, but with the possibility to dynamically compute values with nested JSONiq expressions.

for $line in text-file("sherlock.txt", 100)
count $docid
for $token in tokenize($line, "[ \"'()&!\\-*,:\\.;/]")
where $token ne ""
group by $token
order by $token ascending
return {
  "Term" : $token,
  "Postings" : [ distinct-values( $docid ) ]
}

Et voilà:

{ "Term" : "000", "Postings" : [ 11552, 11631, 11634, 13921, 19725 ] }
{ "Term" : "1", "Postings" : [ 615, 3497, 3582, 9657, 22687, 33863, 40408, 62344, 76758 ] }
{ "Term" : "10", "Postings" : [ 629, 13928, 27179, 46790, 69917, 76367 ] }
{ "Term" : "100", "Postings" : [ 66036 ] }
{ "Term" : "1000", "Postings" : [ 20669, 22683 ] }
{ "Term" : "104", "Postings" : [ 67353 ] }
{ "Term" : "109", "Postings" : [ 51994 ] }
{ "Term" : "10s", "Postings" : [ 14806, 17882 ] }
{ "Term" : "10th", "Postings" : [ 14104 ] }
{ "Term" : "11", "Postings" : [ 630, 2129, 12617, 18052, 17974, 21106, 27179, 62403, 64203 ] }
{ "Term" : "117", "Postings" : [ 16194, 16199 ] }
{ "Term" : "117th", "Postings" : [ 27249, 27691 ] }
{ "Term" : "11th", "Postings" : [ 73290 ] }
{ "Term" : "12", "Postings" : [ 631, 35064 ] }
{ "Term" : "126b", "Postings" : [ 24033, 24088, 24075 ] }
{ "Term" : "127", "Postings" : [ 51993 ] }
{ "Term" : "129", "Postings" : [ 2088 ] }
{ "Term" : "12s", "Postings" : [ 16207 ] }
{ "Term" : "12th", "Postings" : [ 14105 ] }
{ "Term" : "13", "Postings" : [ 1828, 1887, 1898, 35202, 51993, 62841, 62843, 62921, 63570, 64750, 69239 ] }
...

This is it. If we output this result to a specific path, say, an “index” folder, we have built our index in the JSON Lines format.
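
Concretely, this could look as follows (a sketch; build-index.jq is a hypothetical file containing the indexing query above, and /tmp/index is the path used in the next section):

spark-submit spark-rumble-1.1.jar \
    --query-path ./build-index.jq \
    --output-path /tmp/index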

Querying the index

Now we can use JSONiq to query the index. For example, if we want to query for “Crow”, we look up the postings list and scan the document for the lines given by these postings.

let $postings := json-file("/tmp/index")[$$.Term eq "Crow"].Postings[]
for $line in text-file("sherlock.txt")
count $docid
where $docid = $postings
return $line

And here we go:

     to the Crow Hill, a huge business which was in strong hands which had
     not only over the killing of the manager and engineer of the Crow

It should be absolutely clear to the reader that, of course, the best performance would be achieved with additional data structures, such as a B+-tree or a hash table to organize the terms, as well as direct access to any line of text by its number: scanning the entire collection at query time defeats the purpose of having an index in the first place.

However, building a standard inverted index is almost a textbook example of how to use JSONiq, because it involves many of the different FLWOR clauses as well as a mix between a textual model and a semi-structured model.

JSONiq and data independence

We hope this post showed how JSONiq makes this kind of task simpler than using PySpark or DataFrames. The key feature of the Rumble project is to maintain data independence and have the developer think in terms of just one thing: sequences of items, even if the data is heterogeneous and/or nested.

Written by Ghislain Fourny