Steps for implementing semantic search on a text corpus

  • Clean and process corpus
  • [optional] Enrich the data so that there are features in the data that assist better matches to queries.
  • Embed the corpus with a sentence embedding model
  • Index the embeddings using an approximate nearest neighbor (ANN) method such as NGT, FAISS, or HNSW
  • Embed query using the same model
  • Search index using query embedding
  • [optional] Rerank the hits using a cross encoder
  • Serve the top k hits
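A minimal sketch of the core steps above (clean, embed, index, embed the query, search, serve the top k), under loud assumptions: `toy_embed` is a hashed bag-of-words stand-in for a real sentence embedding model, and the "index" is a plain list searched by brute force rather than an ANN index. All function names here are hypothetical. It only shows the shape of the pipeline, not the quality you would get from a real model.

```python
import hashlib
import math
import re

DIM = 256  # toy embedding size, chosen arbitrarily for this sketch


def clean(text):
    """Clean step: lowercase and collapse whitespace."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def toy_embed(text):
    """Stand-in for a sentence embedding model: hashed bag-of-words, L2-normalised.

    This only captures word overlap; it exists so the sketch runs without a model.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z']+", clean(text)):
        bucket = int(hashlib.md5(token.encode("utf-8")).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def build_index(corpus):
    """Embed every document and keep (doc, vector) pairs.

    A real system would hand the vectors to an ANN library instead.
    """
    return [(doc, toy_embed(doc)) for doc in corpus]


def search(index, query, k=3):
    """Embed the query with the same function, score by cosine similarity, return top k."""
    q = toy_embed(query)
    # Vectors are unit length, so the dot product is the cosine similarity.
    scored = [(sum(a * b for a, b in zip(q, vec)), doc) for doc, vec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


corpus = [
    "The sun sets over the quiet harbour",
    "My phone glows in the dark, a second moon",
    "We drank bitter coffee in the cafe where we met",
]
index = build_index(corpus)
hits = search(index, "bitter coffee in a cafe", k=2)
for score, doc in hits:
    print(f"{score:.3f}  {doc}")
```

The important constraint the sketch preserves is that the query and the corpus go through the same embedding function. Swapping in a real model and an ANN index changes `toy_embed` and `build_index`, while `search` keeps the same outline.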

Training Notes.

When it comes to poetry, users desire recommendations, not just matches. Instead of finding poems that merely contain the query’s terms or related words, we seek instances where the themes, emotions, or concepts in the query are explored within the poems. This is a nuanced challenge.

If I search for “our relationship with technology”, I don’t want poems that merely contain the words “relationship” and/or “technology” or related terms; I want poems in which that relationship is actually explored.

Features that are relevant to the recommendations are things like

  • Settings: physical settings like sunrise, sunset, or dusk; emotional settings like break-ups, winning the lottery, or death
  • Themes: the emotive or philosophical theme, e.g. dealing with death, the impermanence of life, nature
  • Tone: humor, sarcasm, bitterness
  • Style: acrostic, free verse, etc.

To improve retrieval, I used vLLM and Wizard Vicuna to enrich the documents: I generated a list of emotions, themes, and tones for each poem and appended it to the poem before embedding, so that some descriptive text is also available to the retrieval. I think this has improved retrieval a lot. Still, it’s tricky to coax a model into coherently answering “this poem is an instance of” in order to enrich the document.
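A small sketch of the appending step, assuming the model's output has already been parsed into three lists. The function name `enrich`, the field labels, and the layout of the appended block are illustrative assumptions, not the format actually used, and the labels in the example are made up.

```python
def enrich(poem, emotions, themes, tones):
    """Append generated descriptors to the poem so they are embedded with it."""
    lines = [poem.strip(), ""]
    for label, values in (("Emotions", emotions), ("Themes", themes), ("Tones", tones)):
        if values:  # skip a field the model returned nothing for
            lines.append(f"{label}: {', '.join(values)}")
    return "\n".join(lines).rstrip()


poem = "I scroll past your name\nand the screen goes dim."
# Hypothetical labels, as an LLM might return them for this poem.
doc = enrich(poem, emotions=["longing"], themes=["technology", "loss"], tones=["wistful"])
print(doc)
```

The enriched string, not the bare poem, is what gets passed to the embedding model, so a query about "loss" can match a poem that never uses the word.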

When I was first reading about semantic search, I thought it would be search that understood my query and then found what matched it. But what search of any kind really does (as long as both the query and the measure of fit are based on words) is optimize the similarity between the query and the text. To me this is just a variation of lexical search: embedding-based retrieval is better at lexical semantics than at actually parsing the query and extracting its meaning.

By lexical semantics I mean the structural and functional relationships between words and their meanings that are encoded in text corpora. In linguistics these are represented as semantic networks or ontologies. The word color is a semantic neighbor of the words red, green, and yellow. Confusingly, they are not lexical neighbors.

What would be cool is if you could search for ‘bitter poem about love set in a cafe’. In this case, if we had a way to identify the different components of the query and then run separate searches, or some chaining of searches, that could improve things, e.g.

  • search tone: bitter
  • search theme: love | loss
  • search setting: cafe

But as you can imagine, this creates too many moving parts to manage.
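To make the idea concrete, here is a rough sketch of the combining step only. It assumes the hard parts are already done: the query has been split into facets, and each poem already carries facet labels (for example from an enrichment pass). The data layout, the function name `facet_search`, and the scoring rule (count how many facets match) are all assumptions made for illustration.

```python
def facet_search(poems, facets, k=3):
    """Rank poems by how many query facets they satisfy.

    `facets` maps a facet name to a set of acceptable values, so
    {"theme": {"love", "loss"}} matches either theme.
    """
    scored = []
    for poem in poems:
        matched = sum(
            1
            for name, wanted in facets.items()
            if wanted & set(poem["facets"].get(name, ()))
        )
        if matched:  # drop poems that match no facet at all
            scored.append((matched, poem["title"]))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Hypothetical poems with hypothetical facet labels.
poems = [
    {"title": "Last Espresso", "facets": {"tone": ["bitter"], "theme": ["loss"], "setting": ["cafe"]}},
    {"title": "Morning Song", "facets": {"tone": ["joyful"], "theme": ["love"], "setting": ["sunrise"]}},
    {"title": "Ledger", "facets": {"tone": ["bitter"], "theme": ["money"], "setting": ["office"]}},
]
query_facets = {"tone": {"bitter"}, "theme": {"love", "loss"}, "setting": {"cafe"}}
results = facet_search(poems, query_facets)
print(results)
```

Even in this toy form the moving parts are visible: a query parser, a labelling step per facet, and a rule for combining facet matches, each of which can fail independently.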

A shortcut would be to use an LLM to do query expansion, and that is the approach most current papers and large organizations seem to be exploring. It’s just that this is so resource intensive, and generally needs cloud computing.

Could this situation be improved by using different types of embeddings, e.g. embeddings designed for topic modelling, or for identifying and retaining topic and ontological models of the text? I think so, but I do not have enough empirical evidence.

As a thinking tool, semantic search is a definite improvement over lexical search, but I think there are still many problems left to solve here.

Converting search into recommendations

  1. Add some way of taking in user preferences, either upfront or for each search result
  2. Use these to create (or match) features for your corpus
  3. Update the search and ranking method to use these features
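One possible reading of the third step, sketched under assumptions: search hits arrive with a similarity score in [0, 1] and a list of features, user preferences are a set of liked features, and the two are blended linearly. The function name, the blending formula, and the default `weight` of 0.3 are illustrative choices, not a recommended setting.

```python
def rerank_with_preferences(hits, preferences, weight=0.3):
    """Blend each hit's search score with its overlap with the user's preferred features.

    hits: list of (score, title, features) with score in [0, 1]
    preferences: set of features the user has said they like
    weight: how much the preference overlap counts (an arbitrary choice here)
    """
    reranked = []
    for score, title, features in hits:
        # Fraction of the user's preferences that this hit satisfies.
        overlap = len(preferences & set(features)) / len(preferences) if preferences else 0.0
        reranked.append(((1 - weight) * score + weight * overlap, title))
    reranked.sort(key=lambda pair: pair[0], reverse=True)
    return reranked


# Hypothetical search hits: (similarity score, title, features).
hits = [
    (0.80, "Ledger", ["bitter", "money"]),
    (0.75, "Last Espresso", ["bitter", "loss", "cafe"]),
]
ranked = rerank_with_preferences(hits, preferences={"loss", "cafe"})
print(ranked)
```

With no stated preferences the function falls back to the order of the scaled search scores, so the plain search behaviour is the default and preferences only nudge the ranking.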

More information about design patterns in recommendation systems can be found here: Yan, Ziyou. (Jun 2021). System Design for Recommendations and Search. eugeneyan.com.