Jan 20

Post-Hoc Index Design: From Regex to PEG

By Seth Falcon View Comments
Share and Enjoy:
  • Facebook
  • Twitter
  • Reddit
  • HackerNews
  • Slashdot

In September of last year, the team faced one of its first challenges in responding to the growing user base of Hosted Chef. The on-call engineer at the time received an alert that the queue used to store Chef objects waiting to be indexed for search was backed up. Soon we received alerts that calls to the search API were failing and we posted an outage notice. Restoring search functionality required us to compromise on search freshness and we began investigating the incident's root cause.

Chef's search feature is powered by Lucene via Solr. Every time an object is created or updated in Chef (e.g. a node, a role, or a data bag) the object is converted to JSON and put on a queue (we use RabbitMQ) where it will be picked up for indexing. The indexer expands the JSON into a set of key/value pairs (deep structure is flattened by concatenating keys with underscore) and sends the data to Solr for indexing.

The search outage was the result of Lucene doing a merge of its index that took minutes. Analysis of our Lucene index revealed that we had well over 900,000 fields in our index. If you look hard, you can find benchmarks on Lucene using 30 fields. Our system was using Lucene wrong.

In addition to mapping each flattened key into a field, the indexer was "expanding" all compound keys by inserting an 'X' to allow a kind of wild card searching. For example, the JSON:

{"a": {"b": {"c": 1}}}

would have a flattened key of a_b_c which would get expanded to:

X_b_c
a_X_c
a_b_X

This was a non-optimal approach to say the least -- yet Lucene actually made it work for quite some time.

We needed to change how Chef objects were indexed to avoid creating an unbounded number of fields in the index. The solution we came up with was to concatenate the flattened key value pairs with a delimiter (we chose __=__) and put all of the concatenated pairs into a single content field in the search index separated by whitespace. Besides reducing the number of fields in the index, this approach eliminated the need for the "X" expansion since Lucene wild card queries using "*" could be used instead; this saved considerable time and memory during the indexing preprocessing phase.

Experiments with simulated node data were run to validate the new approach. Two 10,000 node save test scenarios were run: the first was optimistic, in which two random keys were added to a node being saved, and the second pessimistic, in which 50 random keys were added for each save. The results showed a decrease in average commit time of 4x and 23x, respectively.

This left us with two problems. We needed to reindex our data and all queries would fail once we did; querying the new index required a new query syntax.

The reindexing was solved with a well practiced deployment plan and a small amount of scheduled maintenance. Dealing with the change in the query syntax was somewhat more challenging.

Chef's query syntax directly exposes the Lucene syntax and, as a result, the initial index layout. Chef object keys map to fields in the query. For example, to search for all nodes with the "web" role, the user query would be role:web which will be passed along to Lucene unchanged. Lucene will interpret this as a search for documents with "web" in the "role" field. The equivalent query for the new "no-fields" index looks like this: content:role__=__web. In order to deploy the new indexing scheme, we needed to be able to convert user queries into the new format.

We were under operational duress at the time and needed a fast-as-possible solution. By examining system logs, we determined that we could cover the types of queries actually in use with a few regular expressions. If you're ever looking for a definition of "quick and dirty", this is it. To help verify the solution, we wrote a test harness that populated an organization with a known set of objects and randomly generated queries with Boolean operators in such a way that the result set was computable. This helped increase our confidence in our duct tape solution.

Having introduced a series of regular expressions into our query handling code, we now had an additional problem. The regular expression based solution was a stop-gap measure and didn't support many aspects of the Lucene query syntax. For example, we failed to handle the ! and - operators (see Lucene syntax). The query -role:blah was transformed into content:-role__=__blah when it should have been -content:role__=__blah. We considered adding more regular expression based replacement rules to the already somewhat convoluted logic, but it became clear that we'd need a fair bit more logic to be able to identify the leading dash and ignore the embedded dash in a query like tags:b||(-role:web-server tags:a) (which should be transformed to content:tags__=__b || (-content:role__=__web-server content:tags__=__a).

In order to restore support for more of Lucene's flexible syntax and to end up with a maintainable query handling system that isn't a laundry list of regular expressions, we began work on code to parse Lucene's query syntax. The initial implementation used a Parsing Expression Grammar (PEG) created with Treetop. Here's the grammar file we used. The result was a query handling system that we have full control over and that provides a layer of abstraction over Lucene's query syntax. A side-benefit of the full-parse approach is that we can now validate user queries much earlier.

Since deploying the reorganized index, Lucene's commit times have remained short and consistent even in the face of significant growth of the index. We've ported the Hosted Chef search API endpoint to Erlang and ported the Lucene PEG to Erlang using neotoma. We have been able to address a few further edge cases of the Lucene query syntax in a structured and robust fashion (no regular expressions required).

Share and Enjoy:
  • Facebook
  • Twitter
  • Reddit
  • HackerNews
  • Slashdot
  • http://seancribbs.com/ Sean Cribbs

    Seth, this was really interesting and informative. How did the neotoma implementation compare to a traditional yecc? Do you intend to open-source any of the work?

  • http://twitter.com/lehuy20 Le Huy

    I myself prefer to get rid of Solr, Lucene, RabbitMQ  from chef server, it just cost too much and bring a too little value. How many people actually use search query from chef-server-webui , from knife? How many recipes does need search api? I guess not many, perhaps you have data collected from users, you know better.  I myself do not use it. Can we remove search features or at least switch it off easily ?

  • http://userprimary.net/ Seth Falcon

    We haven’t made a direct comparison to yecc. Since we had the PEG grammar in Treetop syntax in-hand, porting it to neotoma was very straight forward and seems to be working and performing well for our needs. I actually found the resulting grammar in neotoma to be cleaner, although it had the benefit of being my second implementation :)

    My understanding is that this work will eventually be open, but I’m not sure of the timeline.

  • http://userprimary.net/ Seth Falcon

    It turns out that we do have some data on the use of search in Chef. Looking at Opscode Hosted Chef traffic, search is used by many of our customers. Based on our in-house use of Chef, my impression is that search provides a lot of value when building out new infrastructure, orchestrating complicated deploys, as well as managing servers with knife.

    You might take a look at the following pages on the wiki for some details and examples of how to use search:
    http://wiki.opscode.com/display/chef/Introduction+to+Search+and+Data+Bags
    http://wiki.opscode.com/display/chef/Search#Search-UsingSearchinRecipes

    Since I gather you are running the open source Chef Server, you can investigate what would be involved to remove it or add an “off” switch and perhaps bring your finding to the Chef users mailing list (though I don’t think a proposal to remove search entirely would be well received).

  • http://jtimberman.myopenid.com/ jtimberman

    Search is *very* popular. It is one of the killer features of the Chef server.  We use it in many of our public cookbooks, such as the “nagios” cookbook. All the back end servers that run Opscode Hosted Chef are automated with Chef, and most (if not all) of them use search for finding who they’re supposed to connect to, such as the SOLR servers finding the RabbitMQ servers.

  • http://jtimberman.myopenid.com/ jtimberman

    Search is *very* popular. It is one of the killer features of the Chef server.  We use it in many of our public cookbooks, such as the “nagios” cookbook. All the back end servers that run Opscode Hosted Chef are automated with Chef, and most (if not all) of them use search for finding who they’re supposed to connect to, such as the SOLR servers finding the RabbitMQ servers.

blog comments powered by Disqus