Elasticsearch vs Solr

Posted by krahulreddy on 18 June 2020 in English.

Comparison:

For indexing the Nominatim data, we have two major contenders: Solr and Elasticsearch. Both are based on the Apache Lucene library and provide a wide range of search options. As part of my GSoC project, I compared the two and set up a small project to compare the functionality they offer.

We have listed a few requirements for this project:

Required

Indexing: Name, Postcode fields.
Handle multiple names (e.g. the way OSMNames does)
Handle postcodes
Scoring based on importance (0-1).
Handle data update
Store but do not index: Type, Class, id
Avoid copy fields

Desirable

Normalization
Tokenization for suggestion improvement
Consider browser defaults for language
Typo tolerance

The following table is a brief description of the results of the comparison of Solr and Elasticsearch for our requirements. This table also contains brief information about how different parts will be implemented.

Sl. no	Requirement	Support available in
		Solr	Elasticsearch
Required
1	Handle multiple names	Name field has language tags and other tags like `housename`. Maintain a hierarchy for the tags. Maintain a single extra field for all other tags. This is similar to how OSMNames handles indexing names. Index with fields that support suggestions.
		SuggestComponent in solrconfig.xml is a good completion suggester.	search_as_you_type field or completion suggester in the mapping
2	Handle postcodes	Normal Indexing of strings. Similar to the name indexing.
3	Scoring based on importance	Basic scoring: `sort=importance desc, score desc`. Function queries allow sorting like `sort=sum(score,importance) desc` (for the second option, float values should be stored with `multiValued="false"` in the schema).	Use the importance score along with `_score`. Implementation in the example project set up: #L87
4	Update the index (Avoid complete reindexing)	Select the places which are to be updated similar to photon, and then update.
5	Store but do not index: Type, Class, id	The schema can be adjusted with `stored="true" indexed="false"`	In the mapping, the store property can be set to true and the index property to false.
		OSM id/place id can be used as the indexing id.
6	Avoid copy fields	Make sure no unnecessary copy field tags exist in schema	Not sure if any copy fields are made automatically in ES.
Desirable
7	Normalization	To be used for handling different languages and terms like ‘the’ and ‘and’. Languages can be indexed separately (as photon does, but with language-specific analyzers), with the extra tags (limited based on memory requirements) kept in an alternative_names field. Custom analyzers can be used to handle special cases like those in tokenstringreplacements.inc.
		Solr provides filters for the languages listed here. solr.MappingCharFilterFactory lets us provide our own char mapping.	A similar list of language filters is available here. Filters can be used to remove simple punctuation. There are a few built-in char filters that can be used.
8	Tokenization	solr.NGramTokenizerFactory can be used for n-gram tokenization. Other tokenizers are available too, in both ES and Solr.	Using n-gram tokenizers for text fields provides partial searching and searching with offset capabilities. There are many other tokenizers available for experimentation on text fields. This might increase the size of the DB.
10	Typo tolerance	Fuzzy searches can introduce typo tolerance, but the fuzziness should not be too high. ES recommends keeping the value at AUTO. Setting it higher might introduce unwanted results.
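
To make the Elasticsearch column more concrete, here is a minimal sketch of how the "Required" rows and the typo-tolerance row might translate into request bodies. It is not the code from the example project. The field names (`name`, `postcode`, `importance`, `osm_type`, `osm_class`, `place_id`) and the helper names are assumptions made for illustration; only `search_as_you_type`, `store`, `index`, `function_score`, `field_value_factor`, `boost_mode` and `fuzziness` are Elasticsearch names.

```python
def build_index_body():
    """Index body sketch for rows 1-5 of the table (hypothetical field names)."""
    # Row 5: keep the value retrievable, but do not make it searchable.
    stored_only = {"type": "keyword", "index": False, "store": True}
    return {
        "mappings": {
            "properties": {
                # Rows 1 and 2: names and postcodes indexed for suggestions.
                "name": {"type": "search_as_you_type"},
                "postcode": {"type": "search_as_you_type"},
                # Row 3: importance in the range 0-1, used at query time.
                "importance": {"type": "float"},
                "osm_type": dict(stored_only),
                "osm_class": dict(stored_only),
                "place_id": dict(stored_only),
            }
        }
    }


def build_search_body(text):
    """Search body sketch: fuzzy match on `name`, importance added to _score."""
    return {
        "query": {
            "function_score": {
                # Row 10: fuzziness kept at AUTO.
                "query": {"match": {"name": {"query": text, "fuzziness": "AUTO"}}},
                # Row 3: read the importance value from the document...
                "field_value_factor": {"field": "importance", "missing": 0},
                # ...and add it to the text relevance score.
                "boost_mode": "sum",
            }
        }
    }
```

With the 7.x Python client, bodies like these could presumably be passed to `indices.create(index=..., body=...)` and `search(index=..., body=...)`. Using `boost_mode` set to `sum` mirrors the `sum(score,importance)` idea from the Solr column; whether summing or multiplying ranks places better would need testing on real data.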

Verdict:

The comparison shows that both Solr and Elasticsearch support all the features we need to fulfill our requirements. One point of difference is the API support. Elasticsearch API has more

Sarah, Marc and I discussed factors other than those mentioned above. One of them was Ubuntu package availability.

python3-pysolr, the Python package for Solr, has support for version 3.8-all, while the currently available version of Solr is 8.5.

python3-elasticsearch has support for Elasticsearch 7-all.

Considering all these factors, we have decided to stick to Elasticsearch for the project!

Bonus:

Debouncing will be implemented to improve performance. This reduces the network calls made. For suggestions, instead of making an API call after each keystroke, we make the call only if two events are a certain time period apart. This is standard practice and will decrease the total number of API calls.
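
The debouncing rule can be checked offline with a small simulation. This is only a sketch of the idea, not the front-end code: in a browser the same effect would normally come from a timer that is reset on every keystroke. The function name `debounced_calls`, the `wait_ms` parameter and the timestamps are all made up for illustration.

```python
def debounced_calls(events, wait_ms):
    """Return the inputs that would trigger an API call.

    `events` is a list of (timestamp_ms, text) keystroke events in time order.
    A call is made for an event only if no further keystroke arrives within
    `wait_ms` of it; the final event always triggers a call once the wait
    period has passed.
    """
    calls = []
    for i, (timestamp, text) in enumerate(events):
        is_last = i == len(events) - 1
        if is_last or events[i + 1][0] - timestamp >= wait_ms:
            calls.append(text)
    return calls


# Six keystrokes while typing "berlin", with a pause after "ber".
keystrokes = [
    (0, "b"), (100, "be"), (150, "ber"),
    (600, "berl"), (650, "berli"), (700, "berlin"),
]
calls = debounced_calls(keystrokes, wait_ms=300)
print(calls)  # ['ber', 'berlin']
```

In this made-up trace, six keystrokes result in two API calls instead of six.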

A short summary of the results from https://github.com/krahulreddy/Search:

Elasticsearch:

search_as_you_type fields are now searched with ngram matching. This gives results when at least one word is matched.

However, the order of words is still not handled. For example:

 Indexed value: United States of America

 Search term:    "States of" ⇒ gives results	
                 "America" ⇒ gives results
                 "America States" ⇒ No results
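
One option that may be worth trying for the word-order case: the Elasticsearch documentation describes querying a search_as_you_type field with a `multi_match` query of type `bool_prefix` that targets the root field and its `._2gram` and `._3gram` subfields, and states that this can match the query terms in any order. The sketch below only builds such a request body. The helper name `build_prefix_query` is hypothetical, and whether this fixes the "America States" case in the example project has not been verified here.

```python
def build_prefix_query(field, text):
    """Request body for a bool_prefix search on a search_as_you_type field."""
    return {
        "query": {
            "multi_match": {
                "query": text,
                "type": "bool_prefix",
                # Root field plus the shingle subfields that
                # search_as_you_type creates automatically.
                "fields": [field, f"{field}._2gram", f"{field}._3gram"],
            }
        }
    }


body = build_prefix_query("name", "America States")
```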

completion_suggester uses ngram analysis along with the completion suggester. It returns very good results with partial input (even with a single letter), but still lacks tokenization and typo tolerance.
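
For the missing typo tolerance, the completion suggester itself accepts a `fuzzy` option. The following is a minimal sketch of such a request body, assuming a field mapped with type `completion`; the field name `name_suggest`, the suggestion name `place_suggest` and the helper name are hypothetical. How this interacts with the custom ngram analysis used in the example project has not been tested.

```python
def build_suggest_body(field, prefix):
    """Request body for a fuzzy completion suggestion on `field`."""
    return {
        "suggest": {
            "place_suggest": {  # arbitrary label for this suggestion
                "prefix": prefix,
                "completion": {
                    "field": field,
                    # Allow small typos in the prefix typed so far.
                    "fuzzy": {"fuzziness": "AUTO"},
                },
            }
        }
    }


suggest_body = build_suggest_body("name_suggest", "Untied Sta")
```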