Index data and yield better search results with Apache Solr

May 3, 2022

Intro 

In recent years, tools that help you build a better, faster and more consistent search engine for your own application (compared to, say, using LIKE in your SQL queries or doing simple String.contains() searches) have become extremely popular across all platforms. You may be familiar with Elasticsearch, even if only through its central role in the widely used ELK (Elasticsearch, Logstash, Kibana) stack. This article focuses on an alternative – Apache Solr (simply because I’m much more familiar with it), but the same principles and most of the same tools apply to Elasticsearch as well.

Both Apache Solr and Elasticsearch are built on the same open-source engine that does all the magic – Lucene. Lucene is lightweight, flexible and very efficient at indexing, matching and ranking (classically with tf-idf scoring), and Solr makes it easy to set up and operate. You can read more about it here.

What is Apache Solr

We start with briefly defining what Solr actually is:

  • An interface above the Lucene engine.
  • Provider of easy-to-use APIs to interact with the underlying indexes/engine
  • A full solution that includes ZooKeeper and SolrCloud infrastructure, a web admin interface, etc.
  • Scalable and maintainable
  • Convenient and easy to set up and use 
  • Supporting multiple query languages

What Apache Solr isn’t

Some things I’ve seen people often get wrong over the years (and of course, made a few of these mistakes myself before knowing better) are very important to list. I believe everyone thinking about incorporating a search engine in their application should really consider these and try to avoid them. 

Apache Solr is NOT : 

  • A non-relational database. No! Get that out of your head! It’s the most common mistake made. You can’t get away with using only Apache Solr to hold all of your data. Tools like Solr were always meant to hold the parts of your data that you want to search through quickly, not your full dataset. Many operations are either unsupported or awkward to use (e.g. joining a couple of “tables” or having any kind of parent/child relationship between “objects”). Solr doesn’t support transactions. It is NOT how it’s meant to be used.
  • Instantly available. If you already have a relatively large database that you want to quickly search through, indexing it with Solr may take a little while. Not too long, but you still need to consider that.
  • Efficient, if not set up right. This is very important: Solr has the bad habit of showing you exactly what you’re asking for, which will be a huge problem for you if it is not set up correctly. Let’s go over how to use it effectively!

Basics of a Solr system

Here, we’ll go over the basics that make a Solr system function:

Documents 

Your data points are called “documents” in Lucene-based systems. 

A document is roughly what a single row is in a relational database: an instance of a certain “class”, an object of a certain type. Documents hold all of your data and are made up of a number of fields (and some meta-information).

Each field has its own name, a type and an optional default value.

Note that the type is often a custom one, because the field type is how we tell Solr how to index that specific field.

Each field also carries settings for what Solr should do with it. For example, you can mark a field as “large” so that it is loaded lazily, or as “indexed”, which asks Solr to index it. Otherwise, you can just store it without processing it in any way (but then, obviously, you won’t be able to search through that field).

The document structure is defined in the Solr schema, usually an XML file, and it’s nothing to write home about – Solr can even update the schema dynamically on its own!
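As a rough illustration of what such field definitions look like, here is a minimal Python sketch that builds the JSON body for Solr's Schema API "add-field" command. The collection name "people" and the field list are assumptions made up for this article's person example; "text_general" and "pint" are field types from Solr's default configset, and your own schema may name them differently. Nothing is sent over the network here.

```python
import json

# Hypothetical field definitions for the "person" example used in this article.
# "text_general" and "pint" come from Solr's default configset; adjust to your schema.
FIELDS = [
    {"name": "name", "type": "text_general", "indexed": True, "stored": True},
    {"name": "age", "type": "pint", "indexed": True, "stored": True},
    {"name": "address", "type": "text_general", "indexed": True, "stored": True},
]


def build_add_field_payload(fields):
    """Return the JSON body for a Schema API "add-field" command."""
    return json.dumps({"add-field": fields}, indent=2)


if __name__ == "__main__":
    # Assumed target: POST this body to http://localhost:8983/solr/people/schema
    print(build_add_field_payload(FIELDS))
```

Setting "indexed" to False on a field would keep it stored but not searchable, which is the trade-off described above.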

Index 

The index is all the data Solr contains, processed so that it can perform very fast, efficient searches through all of it. The index is updated every time you add new documents, and they become searchable once the change is committed – don’t worry, Solr is really good at that!

Queries 

Queries are when you perform a search and expect to find a list of documents (or only some of their fields). Of course, queries can be very different, and we’ll get into that later! Solr provides you with an API to make queries!

How to improve searches

Let’s first talk about how we can get better results. 

There are actually two main points where we can change things around : 

  1. Index time – we can change how Solr indexes the data we pass to it. 
  2. Query time – similarly, change how Solr interprets what queries we ask it to run.

These can work together or separately, depending on what we are trying to achieve.

Let’s go over some of the tools available to us: 

We start off with a very simple document: a person with a few fields – a name, an age and an address. Imagine I pass Apache Solr something like:

{
  "name": "Jonathan Petersone",
  "age": 24,
  "address": "London, William’s Str.  #49"
}
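For reference, passing this document to Solr could look roughly like the Python sketch below. It only builds the HTTP request against Solr's JSON update endpoint and does not send it. The host, port and the collection name "people" are assumptions for illustration.

```python
import json
import urllib.request


def build_index_request(base_url, collection, docs):
    """Build (but do not send) a POST request that indexes the given documents."""
    # commit=true makes the new documents searchable right away.
    url = f"{base_url}/solr/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


person = {
    "name": "Jonathan Petersone",
    "age": 24,
    "address": "London, William’s Str.  #49",
}

# "http://localhost:8983" and "people" are hypothetical values.
request = build_index_request("http://localhost:8983", "people", [person])
print(request.get_method(), request.full_url)
# urllib.request.urlopen(request) would send it to a running Solr instance.
```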

Let’s see what I can ask Solr to do with this data, by thinking about some things I would like to be able to search for:

Search by name: makes sense if I get a full, exact match. But what if I’m not sure how to spell Petersone? Maybe Jonathan is spelled “Jonatan”?

Search by age: quite straightforward, really. I may play a little with ranges, but I probably want to be able to just search for the numbers.

Search by address: Now, this is where Solr will really shine. So many things can go wrong here, as people are often just not sure about the spelling of an address, and want to really widen the possibilities of finding matches.

So, during Index and Query time, Solr can apply a few different processors to the data. 

These include: 

  • Analyzers – analyze the data and generate a token stream. Can be broken down into:
      • Tokenizers – break data into little chunks which help with searches, called “tokens”
      • Filters – modify, add or remove these tokens.

What does this all mean? 

Tokenizers will make your data more “searchable”. Apache Solr supports many tokenizers out of the box (and of course, you can always add your own), but the most commonly used one is the Standard Tokenizer. It splits your text on whitespace, commas and various other special symbols, and produces something like this:

“London, William’s Str.  #49” => “London”, “William’s”, “Str”, “49”. 

See how these are actually a lot more meaningful than what we had in the beginning? Why, you may ask? Well, for one, that comma was removed. Since we’re very likely to have commas in many of the addresses we enter, we don’t really want to search by it, because “London, William’s Str.  #49” and “Sofia, Tsarigradsko Shosse 931” would both end up as results if we search by something containing a comma, which is a bit useless. The same thing goes for the “#” sign, so these are just flat out removed.
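To make the splitting concrete, here is a small Python sketch that imitates this behaviour with a regular expression. It is only a rough approximation for this example, not Lucene's actual Standard Tokenizer (which follows Unicode text segmentation rules); the pattern and function name are my own.

```python
import re

# Runs of letters/digits, optionally joined by an apostrophe (straight or curly),
# so "William’s" stays one token while commas, dots and "#" are dropped.
TOKEN_PATTERN = re.compile(r"[^\W_]+(?:['’][^\W_]+)*")


def tokenize(text):
    """Split text into tokens, discarding punctuation and whitespace."""
    return TOKEN_PATTERN.findall(text)


print(tokenize("London, William’s Str.  #49"))
# ['London', 'William’s', 'Str', '49']
```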

We now have some data, but we can do better, right? 

Next come filters.

Filters take these tokens and start improving them. From the data above, something I can see really helping out is getting rid of the possessive “’s” at the end of our street name. We achieve this by using Solr’s Classic Filter (it is designed to follow the Classic Tokenizer; after the Standard Tokenizer, the English Possessive Filter does the same job) – it removes the “’s” from the token, so we can search better by not having to spell it out:

“William’s” becomes “William”, and this is how it’s indexed. This is also an example of something we need to do at query time as well: in order to match, the search input must go through the same processing so that it doesn’t contain these possessive forms either.

Now we’re at: “London”, “William”, “Str”, “49”. 

Another thing that’s pointless to search by is “Str” – again, most addresses would have that anyway, and it’s just extra text being indexed. 

Here, we rely on the help of stop words (in Solr, the Stop Filter) – simply put, you supply Solr with a list of words that it just ignores. We can give it “Street, Str, St” and then, once this filter has run, we have: “London”, “William”, “49”. We took the data, processed it, and now it looks like we’ve extracted its core information, so we know what the user wants to see.
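The two filter steps so far can be sketched in Python as plain functions over a token list. This is an illustration of the idea, not Solr's implementation; the function names are hypothetical, and I assume stop words are matched case-insensitively.

```python
def strip_possessive(tokens):
    """Remove a trailing possessive 's (straight or curly apostrophe)."""
    result = []
    for token in tokens:
        if token.lower().endswith(("'s", "’s")):
            token = token[:-2]
        result.append(token)
    return result


# The stop word list from the text, lower-cased for case-insensitive matching.
STOP_WORDS = {"street", "str", "st"}


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop word list."""
    return [token for token in tokens if token.lower() not in stop_words]


tokens = ["London", "William’s", "Str", "49"]
tokens = remove_stop_words(strip_possessive(tokens))
print(tokens)  # ['London', 'William', '49']
```

Running a user's query through the same two functions is what keeps index-time and query-time tokens comparable.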

A couple of other very useful filters to consider are : 

MappingCharFilter – maps characters in the input to other characters (as a char filter, it runs on the raw text before the tokenizer). Something commonly done is replacing non-ASCII characters with their ASCII equivalents, for example Ñ becomes a good old N. Solr also ships an ASCII Folding Filter that does this for tokens.

Lower Case Filter, Remove Duplicates Filter – well, these kind of speak for themselves. Simply add them to your field, and you no longer have to worry about those issues while searching.
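Here is a minimal Python sketch of what these normalization steps do to a token list. It is an approximation under stated assumptions: accents are stripped via Unicode decomposition (which covers fewer characters than Solr's ASCII folding), and duplicates are removed across the whole list, whereas Solr's Remove Duplicates Filter only drops identical tokens at the same position.

```python
import unicodedata


def strip_accents(token):
    """Replace accented letters with their base letter, e.g. "Ñ" -> "N"."""
    decomposed = unicodedata.normalize("NFKD", token)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(tokens):
    """Strip accents, lower-case, and drop repeated tokens (keeping order)."""
    seen = set()
    result = []
    for token in tokens:
        folded = strip_accents(token).lower()
        if folded not in seen:
            seen.add(folded)
            result.append(folded)
    return result


print(normalize(["Ñandú", "LONDON", "London"]))  # ['nandu', 'london']
```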

Synonym filters – once again, easy to understand, but the mechanics here are important. Let’s say we want to tackle the “Str.” problem a different way – we can supply a Synonym filter with a list of synonyms – like this : 

Str – Street, Str, St 

Now, when Apache Solr is indexing this piece of data, it does this: 

 “London”, “William”, “Str”, “49” =>  “London”, “William”, “Str”, “Street”, “St”, “49”. 

We just added entirely new tokens! Now, if the query contains any of these, we still get a match, and that’s very useful! The downside of these is that they will still need supplying. But worry not, the internet has many commonly used synonyms readily available!
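The expansion above can be sketched in a few lines of Python. This only illustrates the mechanics; the synonym group is the one from the text, and the function name is hypothetical.

```python
# Each inner list is a group of terms that should match one another.
SYNONYM_GROUPS = [["Street", "Str", "St"]]


def expand_synonyms(tokens, groups=SYNONYM_GROUPS):
    """Keep every token and add the other members of its synonym group."""
    expanded = []
    for token in tokens:
        expanded.append(token)
        for group in groups:
            if token in group:
                expanded.extend(term for term in group if term != token)
    return expanded


print(expand_synonyms(["London", "William", "Str", "49"]))
# ['London', 'William', 'Str', 'Street', 'St', '49']
```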

Let’s get back to the name of our person – “Jonathan Petersone”. 

How can we improve matching that when searching?

Well, maybe you’re thinking – Synonyms? Can we add “Jonatan” and “Peterson” and so on… sure, we could, why not? But can we cover every name possible? Maybe, but it’s still a lot of work… Here is where Phonetic filters come to help! 

Algorithms like Beider-Morse or Double Metaphone transform our data into its “phonetic” form, so it’s easier to match different spellings. Let me illustrate (the exact codes depend on the algorithm):

Jonathan Petersone => JONTN PETRSON. 

The idea is that different inputs with the same pronunciation end up as the same tokens. Again, it is important to note that we need to do this at both index and query time, since matching JONTN against the raw input “Jonathan” won’t work.
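To show the principle, here is a deliberately simplified Python sketch. It is a toy key I made up for illustration, NOT Beider-Morse or Double Metaphone, and its output differs from what those algorithms (or the example above) produce. It only demonstrates that different spellings can collapse to one key.

```python
def toy_phonetic_key(word):
    """Toy phonetic key: merge a few digraphs, keep the first letter,
    then drop vowels and immediately repeated consonants."""
    word = word.upper().replace("TH", "T").replace("PH", "F")
    if not word:
        return ""
    key = word[0]
    for ch in word[1:]:
        if ch in "AEIOUY":
            continue  # vowels after the first letter are ignored
        if ch == key[-1]:
            continue  # collapse doubled consonants
        key += ch
    return key


for name in ["Jonathan", "Jonatan", "Petersone", "Peterson"]:
    print(name, "=>", toy_phonetic_key(name))
```

Both spellings of each name produce the same key, which is why the same transformation has to be applied to the query before matching.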

Another way to tackle this is to use the Levenshtein (edit) distance between the search term and the indexed terms, which is the idea behind fuzzy searches. Read up on it, as it’s a very interesting subject!
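For the curious, the Levenshtein distance counts the minimum number of single-character insertions, deletions and substitutions needed to turn one string into another. A compact Python sketch of the classic dynamic-programming version (not how Lucene implements fuzzy matching internally) looks like this:

```python
def levenshtein(a, b):
    """Return the edit distance between strings a and b."""
    # previous[j] holds the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


print(levenshtein("Petersone", "Peterson"))  # 1
print(levenshtein("Jonathan", "Jonatan"))    # 1
```

In Solr's standard query syntax, a fuzzy search is written by appending a tilde to a term, for example name:Petersone~1 to allow one edit.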

Solr provides MANY other filters and options, but the basics stay the same – transform and expand your tokens to yield better results! 

How to improve searches and ranking 

Ranking in Apache Solr usually happens out of the box, based on the total “score” a match has. This will depend on many factors, including how you index and execute your queries. 
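To give a feeling for how such a score comes about, here is a small Python sketch of classic tf-idf scoring over already-tokenized documents. It is a simplified textbook formula, not Solr's actual scoring (recent Solr versions default to BM25, a refinement of the same idea), and the sample documents are made up.

```python
import math


def tf_idf_score(query_tokens, doc_tokens, all_docs):
    """Sum tf * idf over the query terms that occur in the document."""
    score = 0.0
    for term in query_tokens:
        term_frequency = doc_tokens.count(term)
        doc_frequency = sum(1 for doc in all_docs if term in doc)
        if term_frequency == 0 or doc_frequency == 0:
            continue
        # Rare terms (low doc_frequency) contribute more than common ones.
        score += term_frequency * math.log(1 + len(all_docs) / doc_frequency)
    return score


# Hypothetical, already analyzed address documents.
docs = [
    ["london", "william", "49"],
    ["london", "baker", "221"],
    ["sofia", "tsarigradsko", "931"],
]
query = ["london", "william"]
ranked = sorted(docs, key=lambda doc: tf_idf_score(query, doc, docs), reverse=True)
print(ranked[0])  # ['london', 'william', '49']
```

The first document wins because it matches both terms, and "william" is rarer than "london" across the collection.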

Searches, on the other hand, support many additional options:

  • Boolean operators – of course, you can use AND, NOT or OR when running your query.
  • Spell checks – yes! Solr can even suggest corrections for misspelled search terms!
  • Suggester – what? Even give me suggestions? Yes, based on the already indexed data, Solr can actually tell you what you may be able to find.
  • Adding weights (boosting) – you want to search for “London, William’s Street” but would like results to be ordered more towards other William’s Street addresses rather than other London addresses? Entirely possible, by giving terms a weight: address:(London^2 William’s^10), easy!
  • Faceting – group your results into categories, with a count for each!
  • Ranges – we haven’t forgotten that our person had an age! Simply do age:[20 TO 30] and boom, we’ve got a match!
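Putting a few of these options together, the Python sketch below builds a request URL for Solr's select endpoint that combines boosts, a Boolean operator and a range. The host, port and collection name "people" are assumptions for illustration, and nothing is actually sent.

```python
from urllib.parse import urlencode


def build_select_url(base_url, collection, query, rows=10):
    """Build the URL for a search request; the query is URL-encoded."""
    params = {"q": query, "rows": rows}
    return f"{base_url}/solr/{collection}/select?{urlencode(params)}"


# Boost "William" over "London" and restrict matches to ages 20 through 30.
query = "address:(London^2 William^10) AND age:[20 TO 30]"

# "http://localhost:8983" and "people" are hypothetical values.
url = build_select_url("http://localhost:8983", "people", query)
print(url)
```

Note that the query uses “William” rather than “William’s”, on the assumption that the query text is analyzed with the same possessive-stripping filter as the indexed data.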

Conclusion 

We’ve just barely dipped our toes into what search engines have to offer, but experimenting with searches is a great way to figure out what works best for you, your users and your goals. Don’t forget to check out https://solr.apache.org/guide/8_11/ to read all about the many other tokenizers, filters and query options Solr provides!

Dimitar Stanev

Senior Java Developer at Dreamix
