Darrin's Tech Ramblings

Amazon AWS CloudSearch Hierarchies


AWS CloudSearch

I’ve been working with Amazon’s AWS CloudSearch product for about a year now, and it continues to be a challenge in minimalism. The product seemingly has only a few features. The trick is to figure out how to do more than just a few things with it…

Handling Hierarchical data:

We have a 3-tier hierarchy of Geography data for each Activity stored in the search domain. Unfortunately, there is no native support for anything like hierarchical data in AWS-CS.  So, we need to create our own custom solution.  There are two parts to the solution: (a) structuring the data in AWS and (b) post-processing the search results into a nested (hierarchical) array.

Given an Activity that takes place in 2 different Geo’s as follows:

  • Phoenix
    -> Scottsdale
      -> Downtown Scottsdale
  • Phoenix
    -> Mesa
      -> Apache Wells

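Create 3 literal facet fields with the following content:

  • geo_level_1: Phoenix
  • geo_level_2: Phoenix>Scottsdale|Phoenix>Mesa
  • geo_level_3: Phoenix>Scottsdale>Downtown Scottsdale|Phoenix>Mesa>Apache Wells

NOTE: AWS uses |‘s as delimiters between faceted fields, and we chose to use >‘s between the hierarchy levels of the geographies.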
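As a rough sketch of how those field values could be derived from an Activity's geo paths: the function name `build_geo_fields` and the list-of-lists input shape are assumptions made for illustration, not part of any AWS API. Each field value carries the full path down to its level, joined with >.

```python
def build_geo_fields(geo_paths, sep=">"):
    """Return {"geo_level_N": [values]} where each value is the full path to level N."""
    fields = {}
    for path in geo_paths:
        for depth in range(1, len(path) + 1):
            name = f"geo_level_{depth}"
            value = sep.join(path[:depth])
            values = fields.setdefault(name, [])
            if value not in values:  # a shared parent such as "Phoenix" is stored once
                values.append(value)
    return fields


# Hypothetical input: one list of names per geography the Activity takes place in.
activity_geos = [
    ["Phoenix", "Scottsdale", "Downtown Scottsdale"],
    ["Phoenix", "Mesa", "Apache Wells"],
]
geo_fields = build_geo_fields(activity_geos)
for name, values in geo_fields.items():
    print(name, "|".join(values))  # joined with | only for display
```

How the multiple values are actually submitted (a delimited string or a list) depends on the CloudSearch API version and document format you are using.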

In the search results from AWS, the above 3 facet fields will be represented as facet data “something like” this:

  • geo_level_1:
    • value=”Phoenix”    count=12
  • geo_level_2:
    • value=”Phoenix>Scottsdale”  count=4
    • value=”Phoenix>Mesa” count=5
  • geo_level_3:
    • value=”Phoenix>Scottsdale>Downtown Scottsdale” count=3
    • value=”Phoenix>Mesa>Apache Wells” count=2

Since the level n data contains the level n-1 parent values in it, it’s a simple matter to convert this “in code” to a proper multi-dimensional array like this…

  • geo:
    • [0]
      • value=”Phoenix”
      • count=12
      • primary_key=”Phoenix”
      • [0]
        • value=”Scottsdale”
        • count=4
        • primary_key=”Phoenix>Scottsdale”
        • [0]
          • value=”Downtown Scottsdale”
          • count=3
          • primary_key=”Phoenix>Scottsdale>Downtown Scottsdale”
      • [1]
        • value=”Mesa”
        • count=5
        • primary_key=”Phoenix>Mesa”
        • [0]
          • value=”Apache Wells”
          • count=2
          • primary_key=”Phoenix>Mesa>Apache Wells”

The above array makes presenting this information on a web page in a visually intuitive way super simple.  Notice that we’ve changed the ‘value’ field to hold just the last node/level of the geo name, since this is what we want to display to the end users.  And, we’ve introduced a new field called primary_key that can be used as an ‘exact match’ filter value for drilling to that specific geo.
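A minimal sketch of that conversion, assuming the facet data has already been pulled out of the search response into a plain dict of `{"value", "count"}` buckets per field. That input shape, the function name `nest_geo_facets`, and the `children` key (used here in place of the bare numeric indexes shown above) are all assumptions for illustration.

```python
def nest_geo_facets(facets, sep=">", max_level=3):
    """Turn flat per-level facet buckets into a nested list of geo nodes."""
    roots = []
    by_key = {}  # primary_key -> node, so each child can find its parent
    for level in range(1, max_level + 1):
        for bucket in facets.get(f"geo_level_{level}", []):
            key = bucket["value"]
            # Everything before the last separator is the parent's primary_key.
            parent_key, _, name = key.rpartition(sep)
            node = {
                "value": name,        # last node only, for display
                "count": bucket["count"],
                "primary_key": key,   # full path, for exact-match filtering
                "children": [],
            }
            by_key[key] = node
            if level == 1:
                roots.append(node)
            elif parent_key in by_key:
                by_key[parent_key]["children"].append(node)
            # A bucket whose parent was not returned is silently skipped.
    return roots


facets = {
    "geo_level_1": [{"value": "Phoenix", "count": 12}],
    "geo_level_2": [
        {"value": "Phoenix>Scottsdale", "count": 4},
        {"value": "Phoenix>Mesa", "count": 5},
    ],
    "geo_level_3": [
        {"value": "Phoenix>Scottsdale>Downtown Scottsdale", "count": 3},
        {"value": "Phoenix>Mesa>Apache Wells", "count": 2},
    ],
}
geo = nest_geo_facets(facets)
```

Processing the levels in order means every parent node exists before its children are attached, so a single pass per level is enough.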

When filtering on the geography, the calling program will pass the ‘primary_key’ to the search function. The search function counts the >‘s in the key to determine whether the filter is for the geo_level_1, 2, or 3 field. It then constructs a match expression like (and geo_level_3:’Phoenix>Mesa>Apache Wells’). Since the geo fields are defined as ‘literal’ we don’t need to worry about CloudSearch inadvertently doing any partial matching. We always pass a 100% exactly matching value for these fields.
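The level detection can be sketched as follows. The function name `geo_filter_expression` is hypothetical, and the backslash escaping of single quotes is an assumption you should check against the query syntax of the CloudSearch API version you are on.

```python
def geo_filter_expression(primary_key, sep=">", max_level=3):
    """Build an exact-match filter for the geo field implied by the key's depth."""
    # Zero separators means level 1, one means level 2, and so on.
    level = primary_key.count(sep) + 1
    if not 1 <= level <= max_level:
        raise ValueError(f"unexpected geo depth {level} for {primary_key!r}")
    # Assumed escaping rule: backslash-escape backslashes and single quotes.
    escaped = primary_key.replace("\\", "\\\\").replace("'", "\\'")
    return f"(and geo_level_{level}:'{escaped}')"


print(geo_filter_expression("Phoenix>Mesa>Apache Wells"))
print(geo_filter_expression("Phoenix"))
```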

This approach provides a complete, end-to-end solution for dealing with displaying and filtering hierarchical data in AWS CloudSearch.

Searching and Filtering on the same faceted data:

In the above scenario, we created a way to handle displaying and filtering hierarchical facet data.  But what if we want to search that data too?  There are two blockers to that idea: (1) facet fields cannot ‘also’ be searchable and (2) literal fields require exact matching to be returned.  The solution is to create another field that is (a) searchable and (b) defined as text (assuming you want partial matching to return results!).  The other “trick” is to define the new field to use the existing field as its source.  In this way, no additional work is required to maintain the new field’s data.   If you know that all your geo’s are 3 levels deep, then you can just use geo_level_3 as a source, since it already contains the superset of names.  Otherwise, fields for additional levels will also need to be created.  Alternatively, you can specify all 3 fields (geo_level_1, geo_level_2, and geo_level_3) as source fields for the new searchable field.  The values for each of the source fields will be concatenated into the target field.
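To make the concatenation concrete, here is a purely local simulation (it does not call CloudSearch, and the simple word-splitting tokenizer is an assumption, not CloudSearch's actual text analysis). It compares the terms that end up in the searchable field when it is sourced from all three levels versus from geo_level_3 alone.

```python
import re
from collections import Counter


def token_counts(doc_fields, source_fields):
    """Concatenate the chosen source fields and count the resulting word tokens."""
    combined = " ".join(
        value for name in source_fields for value in doc_fields.get(name, [])
    )
    # Splitting on non-word characters also breaks the paths apart at ">".
    return Counter(re.findall(r"\w+", combined.lower()))


doc_fields = {
    "geo_level_1": ["Phoenix"],
    "geo_level_2": ["Phoenix>Scottsdale", "Phoenix>Mesa"],
    "geo_level_3": [
        "Phoenix>Scottsdale>Downtown Scottsdale",
        "Phoenix>Mesa>Apache Wells",
    ],
}
all_levels = token_counts(doc_fields, ["geo_level_1", "geo_level_2", "geo_level_3"])
deepest_only = token_counts(doc_fields, ["geo_level_3"])
print(all_levels["phoenix"], deepest_only["phoenix"])
```

Both variants contain every geo name, but sourcing from all three levels repeats the parent names more often.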

If you do need to include multiple geo level fields in your search field, be aware that this will cause values like ‘Phoenix’ to appear more than once in the data.  As a result, when someone searches for “gondola rides in Phoenix”, the text_relevance score will be weighted more toward the geo_search field than other fields (i.e. wherever ‘gondola’ and ‘rides’ were found).  This can be handled using relative field weighting in your rank expression… but that’s a topic for another day 🙂   If you can’t wait, here’s Amazon’s documentation about that

… http://docs.aws.amazon.com/cloudsearch/latest/developerguide/fwranking.html