Post-Hoc Index Design: From Regex to PEG
In September of last year, the team faced one of its first challenges in responding to the growing user base of Hosted Chef. The on-call engineer at the time received an alert that the queue used to store Chef objects waiting to be indexed for search was backed up. Soon we received alerts that calls to the search API were failing and we posted an outage notice. Restoring search functionality required us to compromise on search freshness and we began investigating the incident's root cause.
Chef's search feature is powered by Lucene via Solr. Every time an object is created or updated in Chef (e.g. a node, a role, or a data bag) the object is converted to JSON and put on a queue (we use RabbitMQ) where it will be picked up for indexing. The indexer expands the JSON into a set of key/value pairs (deep structure is flattened by concatenating keys with underscore) and sends the data to Solr for indexing.
The search outage was the result of Lucene doing a merge of its index that took minutes. Analysis of our Lucene index revealed that we had well over 900,000 fields in our index. If you look hard, you can find benchmarks on Lucene using 30 fields. Our system was using Lucene wrong.
In addition to mapping each flattened key into a field, the indexer was "expanding" all compound keys by inserting an 'X' to allow a kind of wild card searching. For example, the JSON:
{"a": {"b": {"c": 1}}}
would have a flattened key of a_b_c which would get expanded
to:
X_b_c
a_X_c
a_b_X
This was a non-optimal approach to say the least -- yet Lucene actually made it work for quite some time.
We needed to change how Chef objects were indexed to avoid creating an
unbounded number of fields in the index. The solution we came up with
was to concatenate the flattened key value pairs with a delimiter (we
chose __=__) and put all of the concatenated pairs into a single
content field in the search index separated by whitespace. Besides
reducing the number of fields in the index, this approach eliminated
the need for the "X" expansion since Lucene wild card queries using
"*" could be used instead; this saved considerable time and memory
during the indexing preprocessing phase.
Experiments with simulated node data were run to validate the new approach. Two 10,000 node save test scenarios were run: the first was optimistic, in which two random keys were added to a node being saved, and the second pessimistic, in which 50 random keys were added for each save. The results showed a decrease in average commit time of 4x and 23x, respectively.
This left us with two problems. We needed to reindex our data and all queries would fail once we did; querying the new index required a new query syntax.
The reindexing was solved with a well practiced deployment plan and a small amount of scheduled maintenance. Dealing with the change in the query syntax was somewhat more challenging.
Chef's query syntax directly exposes the Lucene syntax and, as a
result, the initial index layout. Chef object keys map to fields in
the query. For example, to search for all nodes with the "web" role,
the user query would be role:web which will be passed along to
Lucene unchanged. Lucene will interpret this as a search for documents
with "web" in the "role" field. The equivalent query for the new
"no-fields" index looks like this: content:role__=__web. In order to
deploy the new indexing scheme, we needed to be able to convert user
queries into the new format.
We were under operational duress at the time and needed a fast-as-possible solution. By examining system logs, we determined that we could cover the types of queries actually in use with a few regular expressions. If you're ever looking for a definition of "quick and dirty", this is it. To help verify the solution, we wrote a test harness that populated an organization with a known set of objects and randomly generated queries with Boolean operators in such a way that the result set was computable. This helped increase our confidence in our duct tape solution.
Having introduced a series of regular expressions into our query
handling code, we now had an additional problem. The regular
expression based solution was a stop-gap measure and didn't support
many aspects of the Lucene query syntax. For example, we failed to
handle the ! and - operators (see Lucene syntax). The query
-role:blah was transformed into content:-role__=__blah when it
should have been -content:role__=__blah. We considered adding
more regular expression based replacement rules to the already
somewhat convoluted logic, but it became clear that we'd need a
fair bit more logic to be able to identify the leading dash and ignore
the embedded dash in a query like tags:b||(-role:web-server tags:a)
(which should be transformed to content:tags__=__b ||
(-content:role__=__web-server content:tags__=__a).
In order to restore support for more of Lucene's flexible syntax and to end up with a maintainable query handling system that isn't a laundry list of regular expressions, we began work on code to parse Lucene's query syntax. The initial implementation used a Parsing Expression Grammar (PEG) created with Treetop. Here's the grammar file we used. The result was a query handling system that we have full control over and that provides a layer of abstraction over Lucene's query syntax. A side-benefit of the full-parse approach is that we can now validate user queries much earlier.
Since deploying the reorganized index, Lucene's commit times have remained short and consistent even in the face of significant growth of the index. We've ported the Hosted Chef search API endpoint to Erlang and ported the Lucene PEG to Erlang using neotoma. We have been able to address a few further edge cases of the Lucene query syntax in a structured and robust fashion (no regular expressions required).
-
http://seancribbs.com/ Sean Cribbs
-
http://twitter.com/lehuy20 Le Huy
-
http://userprimary.net/ Seth Falcon
-
http://userprimary.net/ Seth Falcon
-
http://jtimberman.myopenid.com/ jtimberman
-
http://jtimberman.myopenid.com/ jtimberman