Thursday, August 9, 2018

Why Does Sitecore Recommend Solr for Site Search?


Before we look at what Solr is and why Sitecore recommends it for search, let us cover some basics about the different types of databases available.

SQL | NoSQL
--- | ---
Table-based databases | Document, key-value, wide-column or graph based databases
Have a predefined schema | Have a dynamic schema
Vertically scalable. Increasing load is handled by adding CPU, RAM etc. to a single server. | Horizontally scalable. We can add more servers easily to handle heavy traffic.
Good fit for complex queries. | Not a good fit for complex queries.
Based on ACID properties (Atomicity, Consistency, Isolation and Durability) | Based on Brewer's CAP theorem (Consistency, Availability and Partition tolerance)
Good fit for transactional applications. | Not stable enough for complex transactional applications.
MS-SQL, Oracle, MySQL, SQLite, Postgres | MongoDB, Redis, BigTable, RavenDB, Cassandra, HBase, Neo4j and CouchDB


Full text search with SQL:

Pros: For smaller databases with limited content and a fixed schema, the full text search built into SQL databases is ideal.

Cons: Since SQL databases are not horizontally scalable, the search has to run on the limited resources (CPU, RAM etc.) of a single server, which can be a huge setback for the application as content and traffic grow.

Full text search with NoSQL:

Pros: NoSQL should be used when the schema (data requirements) isn’t clear at the outset, or when we are dealing with massive amounts of unstructured data. Think of non-relational databases as folders that assemble related information of all types. For example, if a blog application used a NoSQL database, each document could store all the data for one blog post: social likes, photos, text, metrics, links etc.

Cons: But storing and retrieving data in bulk like this requires extra processing effort and more storage than highly organized SQL data.

Full Text Search Engines:

Full text search engines excel at quickly and effectively searching large volumes of unstructured text and returning documents ranked by how well they match the user’s query. They can also quickly facet, or categorize, data or search results based on the values of specific fields.
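
To make faceting concrete, here is a minimal sketch in Python. It is not how any particular search engine implements facets; the result documents and the "category" field are made-up examples. It only shows the idea of counting search results per field value.

```python
from collections import Counter

# Hypothetical search results; the fields are assumptions for illustration.
results = [
    {"title": "Solr basics", "category": "Search"},
    {"title": "Lucene internals", "category": "Search"},
    {"title": "Sitecore publishing", "category": "CMS"},
]

def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    return Counter(doc[field] for doc in results if field in doc)

counts = facet_counts(results, "category")
print(dict(counts))
```

Each value with its count can then be shown as a drill-down link next to the search results.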

Examples: Lucene, Solr, Elasticsearch, KinoSearch, Sphinx, Xapian.

Lucene: Lucene achieves fast search responses because, instead of searching the text directly, it searches an index.

This is like retrieving pages in a book related to a keyword by scanning the index at the back of a book, as opposed to searching every word of every page of the book.
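
The book-index analogy can be sketched as a tiny inverted index in Python. This is a simplified illustration, not Lucene's actual data structure or API; the sample documents and the function names build_index and search are assumptions made for this example.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            index[word].add(doc_id)
    return index

def search(index, term):
    """Look the term up in the index instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))

# Made-up documents for illustration.
docs = {
    1: "Sitecore recommends Solr for search",
    2: "Lucene is a search library",
    3: "Solr is built on Lucene",
}

index = build_index(docs)
print(search(index, "Solr"))  # [1, 3]
```

The index is built once, and every later query is a single dictionary lookup, which is why the cost of a search does not grow with the total amount of text.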

Advantages of Lucene:

  1. Indexing: Indexing is the process of crawling through content and storing it on disk in a searchable form. Many search engines only support batch indexing; once they create an index for a set of documents, adding new documents becomes difficult without reindexing all the documents. Lucene supports both incremental and batch indexing, so documents can easily be added to an existing index.
  2. Data Sources: Many search engines can only index files or webpages. Lucene can also index data from a database, or from a single file that contains multiple virtual documents, such as a ZIP archive.
  3. Content Tagging: Lucene supports content tagging by treating documents as collections of fields, and supports queries that specify which fields to search.
  4. Stemming: Reducing a word to its root form is called stemming. Often, a user wants a query for one word to match other similar words. For example, a query for “jump” should probably also match the words “jumped”, “jumper” and “jumps”.
  5. Stop-word processing: Common words such as “a”, “and”, “the” etc. add little value to a search index. Lucene provides the StopAnalyzer class, which eliminates stop words from the input stream.
  6. Query features: Lucene supports full Boolean queries. It can also search multiple indexes at once and merge the results to give a meaningful relevance score.
  7. Concurrency: Lucene allows users to search an index transactionally, even if another user is simultaneously updating the index.
  8. Non-English support: Lucene allows us to perform language-specific filtering.
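
The stemming and stop-word steps from the list above can be sketched as a toy analyzer in Python. This is only an illustration under simplifying assumptions: the stop-word set and suffix list are made up for the example, and the naive suffix stripping is far cruder than the stemmers Lucene actually ships.

```python
# Assumed, deliberately tiny stop-word set and suffix list.
STOP_WORDS = {"a", "and", "the"}
SUFFIXES = ("ed", "er", "s")

def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def analyze(text):
    """Lowercase, drop stop words, then stem the remaining tokens."""
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOP_WORDS]

print(analyze("The jumper jumped and jumps"))  # ['jump', 'jump', 'jump']
```

Because both the indexed text and the query would pass through the same analysis, a query for “jump” ends up matching all three variants.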

Disadvantages of Lucene:

  1. Scaling: Lucene works perfectly well in a single-server environment. In a multi-server environment, however, the indexes have to be copied to each server, which is error-prone.
  2. No UI: We have to depend on external tools like Luke to analyse indexes and run queries during development.





Lucene Vs Solr

A simple way to conceptualize the relationship between Solr and Lucene is that of a car and its engine. You can’t drive an engine, but you can drive a car. Similarly, Lucene is a programming library that you can’t use as-is, whereas Solr is a complete application that can be used out of the box.

Solr = Lucene + additional features

Solr is an application or HTTP wrapper on top of Lucene. Content from the Lucene index can be retrieved using an HTTP GET query in XML, JSON, or binary formats.

Extra features in addition to Lucene
  1. Faceted Search: Dynamically clusters search results into drill-down categories.
  2. Built-in Sorting: Automatic features to sort search results by a variety of characteristics.
  3. Web Admin Interface: The Solr application provides a web UI to run queries and inspect indexed data in various formats, so we need not depend on external tools like Luke.
  4. Hit Highlighting: Shows a snippet of a document in the search results that surrounds the search terms.
  5. HTTP query: Pass a number of optional request parameters to the request handler to control what information is returned.
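
As a rough sketch of how these features map to an HTTP GET request, the Python snippet below builds a Solr select URL using common request parameters (q, wt, rows, sort, facet, facet.field, hl, hl.fl). The host, the core name "sitecore_web_index" and the field names "title" and "category" are assumptions for illustration; substitute the values from your own setup. The snippet only builds and prints the URL and does not contact a server.

```python
from urllib.parse import urlencode

def build_solr_url(base_url, core, params):
    """Build a select URL for the given core from (name, value) pairs."""
    return f"{base_url}/{core}/select?{urlencode(params)}"

# Hypothetical query: field names and values are assumptions.
params = [
    ("q", "title:solr"),          # search the title field
    ("wt", "json"),               # response format
    ("rows", 10),                 # number of results to return
    ("sort", "score desc"),       # built-in sorting
    ("facet", "true"),            # faceted search
    ("facet.field", "category"),
    ("hl", "true"),               # hit highlighting
    ("hl.fl", "title"),
]

url = build_solr_url("http://localhost:8983/solr", "sitecore_web_index", params)
print(url)
```

Pasting such a URL into a browser, or running the same query from the web admin interface, is a quick way to check what an index contains during development.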


Conclusion: Sitecore strongly recommends going with Solr when we have a scaled, multi-server environment. Lucene is suitable only for development or single-server environments.