AWS CloudSearch
I’ve been working with Amazon’s AWS CloudSearch product for about a year now, and it continues to be a challenge in minimalism. The product seemingly has only a few features. The trick is to figure out how to do more than just a few things with it.
This article is a quick introduction to CloudSearch, including some of its limitations and the workarounds I have found for them. My background is heavy on relational databases, which actually made it harder for me to grok CloudSearch. Hopefully, this article will help others get up to speed quickly.
What CloudSearch is not
- It is not a relational DB — it has no tables, no joins, no data ‘structure’.
What CloudSearch is
- A flat domain or repository (i.e. a single “table”) of documents (i.e. “records”)
- Each document is made up of many fields (i.e. columns)
- The fields are automatically indexed in a MASSIVE way so queries are really fast
CloudSearch Schema
- Field Types — there are only 3 data types in a document
- text — data that can be searched for partial matches
- literal — data that can be searched for exact (complete) matches only
- uint — unsigned integer, can be filtered on, searched, and returned as a facet
- Field Attributes
- search — can the field be searched?
- facet — should faceted results be generated for this field?
- result — should the contents of this field be returned in the results?
- Ranking is the ability to sort results.
- Rankings can be ‘defined’ to Amazon for pre-indexing
- or they can be provided on-the-fly
- Example of a complex ‘test’ ranking built on-the-fly (in PHP):
    // Weights for the individual ranking signals
    $loc = 1.5;
    $type = 1.0;
    $activity = 0.7; // defined but not used in the expression below
    $title = 1.0;
    $reviewQuality = 1.0;
    $ratingPoints = 0.9;
    // Register the expression under the name "test"; {$var} is interpolated by PHP
    $this->addExpression("test", "(((((rating_avg*{$reviewQuality} / 10 ) + (rating_points*{$ratingPoints}) + cs.text_relevance({weights:{title:{$title}, geo:{$loc}, type:{$type}, default_weight:0.5})))/2 + ((cs.text_relevance({weights:{title:{$title}, geo:{$loc}, type:{$type}, default_weight:0.5}))/2)) * 2000)");
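The on-the-fly expression above repeats the text-relevance term twice, which makes it easy to mistype. As a minimal sketch (in Python rather than PHP, and with `build_rank_expression` being a hypothetical helper, not part of any CloudSearch SDK), the same structure can be assembled from smaller pieces:

```python
def build_rank_expression(loc=1.5, type_=1.0, title=1.0,
                          review_quality=1.0, rating_points=0.9):
    """Assemble a rank expression with the same structure as the PHP example."""
    # The text-relevance term appears twice, so build it once and reuse it.
    # The literal braces live in plain strings; only the middle part is an f-string.
    relevance = (
        "cs.text_relevance({weights:{"
        f"title:{title}, geo:{loc}, type:{type_}, default_weight:0.5"
        "}})"
    )
    # Rating part: scaled average rating plus weighted rating points.
    ratings = f"(rating_avg*{review_quality} / 10) + (rating_points*{rating_points})"
    # (ratings + relevance)/2 + relevance/2, scaled by 2000 as in the original.
    return f"((({ratings} + {relevance})/2 + ({relevance}/2)) * 2000)"


expression = build_rank_expression()
print(expression)
```

Building the string from named parts keeps the parentheses balanced and makes it obvious that text relevance is counted twice relative to the rating terms.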
What are Facets
- Facets are literal or uint fields used to return ‘group’ counts based on the field values.
- Text fields cannot be used for facets.
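To make the idea of ‘group’ counts concrete, here is a small Python sketch that computes the same kind of counts locally over a list of documents. It only illustrates what a facet result means; it says nothing about how CloudSearch computes facets internally, and the sample documents and the `facet_counts` helper are made up for the example.

```python
from collections import Counter


def facet_counts(documents, field):
    """Count how many documents carry each value of a literal or uint field."""
    counts = Counter()
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field contribute nothing
        # A field may hold one value or several; treat both the same way.
        values = value if isinstance(value, list) else [value]
        counts.update(values)
    return dict(counts)


# Hypothetical sample documents
docs = [
    {"title": "Star Wars", "genre": "sci-fi", "year": 1977},
    {"title": "The Empire Strikes Back", "genre": "sci-fi", "year": 1980},
    {"title": "Forrest Gump", "genre": "drama", "year": 1994},
]

print(facet_counts(docs, "genre"))
```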
Two ways to search
- General Query: use q=<search terms> in the URL and it will search all the ‘searchable’ fields looking for <search terms> somewhere in those fields.
- If the field is a literal it must match exactly.
- Boolean Query: use bq=<field_name>:<search term> in the URL and it will search that particular field.
- This is how facet drill down filtering is generally accomplished
- Example: bq=(or title:'star' (not title:'wars'))
- If the field is a literal it must match exactly.
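Both query styles are just URL parameters, so the main practical concern is encoding them correctly. The sketch below builds the two kinds of URLs in Python. The endpoint is a placeholder (your domain's search endpoint will differ), and `general_query` and `boolean_query` are hypothetical helper names, not part of an SDK.

```python
from urllib.parse import urlencode

# Placeholder endpoint; substitute your own domain's search endpoint.
SEARCH_ENDPOINT = "https://search-example.invalid/search"


def general_query(terms):
    """Build a q= URL that searches all searchable fields."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'q': terms})}"


def boolean_query(expression):
    """Build a bq= URL from an already-formed boolean expression."""
    return f"{SEARCH_ENDPOINT}?{urlencode({'bq': expression})}"


print(general_query("star wars"))
print(boolean_query("(or title:'star' (not title:'wars'))"))
```

Letting `urlencode` handle the parentheses, quotes, and colons avoids hand-escaping mistakes in the boolean expression.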
Loading data
- To load data, send a JSON-encoded associative array of arrays.
- CloudSearch will always completely replace (never update) pre-existing documents. This means we cannot update single fields within a document. All fields must be provided each time.
- By default, the name of the key in the array determines which field in the index is populated.
- There is, however, a way to tell CloudSearch to look for a field's data under a different key name.
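As a rough illustration of what such a payload can look like, the Python sketch below builds a batch of "add" operations. The envelope keys (`type`, `id`, `version`, `lang`, `fields`) reflect my understanding of the document batch format at the time and should be treated as an assumption; check the current documentation before relying on them. `build_add_batch` and the sample record are hypothetical.

```python
import json


def build_add_batch(records, version, lang="en"):
    """Turn {doc_id: {field: value}} into a JSON batch of full-document adds."""
    batch = []
    for doc_id, fields in records.items():
        batch.append({
            "type": "add",       # assumed operation name
            "id": doc_id,
            "version": version,  # assumed: newer versions replace older ones
            "lang": lang,
            # Every field must be present: documents are replaced, not patched.
            "fields": fields,
        })
    return json.dumps(batch)


# Hypothetical record; the keys inside "fields" map to index field names.
records = {"tt0001": {"title": "Star Wars", "year": 1977}}
payload = build_add_batch(records, version=1)
print(payload)
```

Because a document is always replaced as a whole, the code that builds `fields` needs access to the complete record, not just the values that changed.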
Limitations, gotchas, and how to get around them
- Only unsigned integer numbers are supported.
- This means a ticket_price of $10.75 must be stored as 1075.
- You will need to post-process the results and convert it back to 10.75.
- Max of 100 values in a field
- This means we can’t put a year's worth of dates in a date field.
- The solution is to have a date field for each month of the year (or each quarter) and post process the results into a combined array of values.
- When filtering on a facet field, if the field type is ‘text’ then partial matches are acceptable. For example, if the facet = “New York City” and the search is for “York New”, it would match. This is great for “searching” but usually NOT what we want when filtering!
- To drill into New York City, we need exact matching, so the field type must be literal. But for general searches we want partial matching, so we need another field of type ‘text’. The solution is to have 2 fields (one text; the other literal) and to assign them the same source field.
- When obsoleting/renaming fields, there is always a race condition during the release process. The old field is needed for a small amount of time and the new field is also needed at the same time.
- The solution is to create a new field and tell it to get its data from two different sources: (1) the old field's data source and (2) the new field's data source. Also modify the old field so that it gets its data from these two sources as well. For example:
- old_field => old_field + new_field
- new_field => old_field + new_field
- The above provides forward and backward compatibility for code that queries as well as updates the Index.
- The obsolete field can be removed after the new field is populated and all the old code references are no longer used.
- Deleted documents are sometimes returned if no search criteria are provided. For example, you can do “searches” with “no search criteria” in order to get high-level facet counts. Having deleted records included in these counts is a problem.
- The solution is to set a default_value of 0 on a field that would never normally have a zero value. An internal document_id value works well for this. Then, whenever you do an ‘empty’ search, include a filter of document_id != 0.
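The first two workarounds in this list (storing prices as unsigned integers, and splitting dates across per-month fields) both need a little code on the way in and on the way out. Here is a minimal Python sketch of that pre- and post-processing. The `event_dates_` field prefix, the YYYYMMDD integer encoding of dates, and all function names are assumptions made for the example.

```python
from datetime import date
from decimal import Decimal


def price_to_uint(price):
    """Convert a price such as '10.75' to whole cents for a uint field."""
    return int((Decimal(price) * 100).to_integral_value())


def uint_to_price(cents):
    """Convert stored cents back to a decimal price when reading results."""
    return Decimal(cents) / 100


def split_dates_by_month(dates, prefix="event_dates_"):
    """Spread dates over one field per month to stay under the per-field limit."""
    fields = {}
    for d in dates:
        key = f"{prefix}{d.month:02d}"  # e.g. event_dates_01 for January
        # Dates are encoded as YYYYMMDD integers so they fit a uint field.
        fields.setdefault(key, []).append(int(d.strftime("%Y%m%d")))
    return fields


def merge_month_fields(fields, prefix="event_dates_"):
    """Recombine the per-month fields of a result into one sorted list."""
    merged = []
    for key, values in fields.items():
        if key.startswith(prefix):
            merged.extend(values)
    return sorted(merged)


monthly = split_dates_by_month(
    [date(2013, 1, 5), date(2013, 3, 1), date(2013, 1, 20)]
)
print(price_to_uint("10.75"), uint_to_price(1075))
print(monthly)
print(merge_month_fields(monthly))
```

Using `Decimal` instead of floats avoids rounding surprises when converting prices to and from cents.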