Before we understand what Solr is and why Sitecore recommends it for search, let us cover some basics about the different types of databases we have.
| SQL | NoSQL |
| --- | --- |
| Table based databases | Document based databases |
| Have predefined schema | Dynamic schema |
| Vertically scalable. Manage increasing load by increasing CPU, RAM etc. on a single server. | Horizontally scalable. We can easily add a few more servers to handle large traffic. |
| Good fit for complex queries. | Not a good fit for complex queries. |
| Based on ACID properties (Atomicity, Consistency, Isolation and Durability) | Based on Brewer's CAP theorem (Consistency, Availability and Partition tolerance) |
| Good fit for transactional type applications. | Not stable enough for complex transactional applications. |
| MS-SQL, Oracle, MySQL, SQLite, Postgres | MongoDB, Redis, BigTable, RavenDB, Cassandra, HBase, Neo4j and CouchDB |
Full text search of SQL:
Pros: For smaller databases with limited content and a fixed schema, SQL full text search is ideal.
Cons: Because SQL databases are not horizontally scalable, running search on the limited resources (CPU, RAM etc.) of a single server can be a huge setback for the application.
Full text search with NoSQL:
Pros: NoSQL should be used when the schema (data requirements) isn’t clear at the outset or when we are dealing with massive amounts of unstructured data. Think of non-relational databases as folders that gather related information of all types. For example, if a blog application used a NoSQL database, each document could store all the data for one blog post: social likes, photos, text, metrics, links etc.
Cons: But storing and retrieving data in bulk like this requires extra processing effort and more storage than highly organized SQL data.
Full Text Search Engines:
Full text search engines excel at quickly and effectively searching large volumes of unstructured text and returning documents ranked by how well they match the user’s query. They can also quickly facet, that is categorize, data or search results based on specific values of specific fields.
Example: Lucene, Solr, Elasticsearch, KinoSearch, Sphinx, Xapian.
Lucene: Lucene achieves fast search responses because, instead of searching the text directly, it searches an index.
This is like finding the pages of a book related to a keyword by scanning the index at the back of the book, as opposed to reading every word of every page.
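The book-index analogy can be sketched in a few lines of Python. This is only a minimal illustration of an inverted index, not Lucene's actual data structure; the function names `build_inverted_index` and `search` and the sample documents are made up for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, term):
    """Look the term up in the index instead of scanning every document."""
    return sorted(index.get(term.lower(), set()))


# Hypothetical sample documents, keyed by document id.
docs = {
    1: "Sitecore recommends Solr for search",
    2: "Lucene is a search library",
    3: "Solr is built on Lucene",
}
index = build_inverted_index(docs)
print(search(index, "solr"))    # [1, 3]
print(search(index, "search"))  # [1, 2]
```

The index is built once, and every later query becomes a single dictionary lookup rather than a pass over all the text.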
Advantages of Lucene:
- Indexing: Indexing is the process of crawling through content and storing the extracted terms on disk. Many search engines only support batch indexing; once they create an index for a set of documents, adding new documents becomes difficult without reindexing all the documents. Lucene supports both incremental and batch indexing, so documents can easily be added to an existing index.
- Data Sources: Many search engines can only index files or webpages. Lucene can in addition index data from a database, or from a single file that contains multiple virtual documents, such as a ZIP archive.
- Content Tagging: Lucene supports content tagging by treating documents as collections of fields, and supports queries that specify which fields to search.
- Stemming: Reducing a word to its root form is called Stemming. Often, a user desires a query for one word to match other similar words. For example, a query for “jump” should probably also match words “jumped”, “jumper”, or “jumps”.
- Stop-word processing: Common words such as “a”, “and”, “the” etc. add little value to a search index. Lucene provides a StopAnalyzer class which removes stop words from the input stream.
- Query features: Lucene supports full Boolean queries and can also search multiple indexes at once, merging the results to give a meaningful relevance score.
- Concurrency: Lucene allows users to search an index transactionally, even if another user is simultaneously updating the index.
- Non-English support: Lucene allows us to perform language-specific filtering.
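Stemming and stop-word processing from the list above can be illustrated with a toy analysis step. This is a rough sketch, not how Lucene's analyzers are implemented: the stop-word list, the suffix rules and the function names `stem` and `analyze` are all assumptions made for this example, and real stemmers use far more careful rules.

```python
# Illustrative stop-word list, not the list Lucene ships with.
STOP_WORDS = {"a", "an", "and", "the", "of", "to"}
# Toy suffix rules chosen so that the "jump" example works.
SUFFIXES = ("ed", "er", "s")


def stem(word):
    """Strip one known suffix, keeping at least three characters of the word."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase, split on whitespace, drop stop words, then stem each token."""
    tokens = text.lower().split()
    return [stem(token) for token in tokens if token not in STOP_WORDS]


print(analyze("The jumper jumped and jumps"))  # ['jump', 'jump', 'jump']
```

Because the same analysis is applied to both the indexed text and the query, a search for “jump” ends up matching documents that contain “jumped”, “jumper” or “jumps”.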
Disadvantages of Lucene:
- Scaling: Lucene works perfectly fine in a single-server environment. In a multi-server environment, however, the indexes have to be copied to each server, which is error prone.
- No UI: We have to depend on external tools like Luke to analyse indexes and run queries during development.
Lucene vs Solr
A simple way to conceptualize the relationship between Solr and Lucene is that of a car and its engine. You can’t drive an engine, but you can drive a car. Similarly, Lucene is a programming library which you can’t use as-is, whereas Solr is a complete application which can be used out of the box.
Solr = Lucene + additional features
Solr is an application, or HTTP wrapper, on top of Lucene. Content from the Lucene index can be retrieved with an HTTP GET query, and the results are returned in XML, JSON, or binary formats.
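As a rough illustration, the snippet below only builds such a GET URL and does not send it. The `q`, `wt` and `rows` parameters are standard Solr query parameters, but the host, the port (8983 is Solr's usual default), the core name `sitecore_web_index` and the helper `build_select_url` are assumptions for this example and will differ per installation.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, response_format="json", rows=10):
    """Build a GET URL for Solr's select handler without sending it."""
    params = {"q": query, "wt": response_format, "rows": rows}
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host and core name; adjust to the actual Solr instance.
url = build_select_url("http://localhost:8983", "sitecore_web_index", "title:solr")
print(url)
```

Changing `response_format` to `"xml"` asks Solr for the same results as XML instead of JSON.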
Extra features in addition to Lucene
- Faceted Search: Dynamically clusters search results into drill-down categories.
- Built-in Sorting: Automatic features to sort search results by a variety of characteristics.
- Web Admin Interface: Solr provides a web UI to run queries and check indexed data in various formats, so we need not depend on external tools like Luke.
- Hit Highlighting: Shows a snippet of a document in the search results that surrounds the search terms.
- HTTP query: Pass a number of optional request parameters to the request handler to control what information is returned.
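To make faceting more concrete, here is a small sketch of what facet counts are: the number of results per value of a chosen field. Solr computes these from its index; this plain-Python version is only meant to show the idea, and the `template` field, the sample results and the function name `facet_counts` are invented for this example.

```python
from collections import Counter


def facet_counts(results, field):
    """Count how many results carry each value of the given field."""
    return Counter(doc[field] for doc in results if field in doc)


# Hypothetical search results with a "template" field to facet on.
results = [
    {"title": "Solr basics", "template": "Article"},
    {"title": "Lucene internals", "template": "Article"},
    {"title": "Search FAQ", "template": "FAQ"},
]
counts = facet_counts(results, "template")
print(counts.most_common())  # [('Article', 2), ('FAQ', 1)]
```

A search page would typically show these counts as drill-down links, for example “Article (2)” and “FAQ (1)”.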
Conclusion: Sitecore strongly recommends going with Solr when we have a scaled (multi-server) environment. Lucene should only be used for a development or single-server environment.