Four Opinionated Heuristics For RAG in Medicine

Published

November 20, 2024

She’d been a kitchen maid and now she was subjecting the Book to critical analysis and talking to a religious icon. That sort of thing led to friction. The presence of those seeking the truth is infinitely to be preferred to the presence of those who think they’ve found it. – Polly Perks, Monstrous Regiment

The heuristics are:

  1. Low Stakes: Use vanilla RAG
    • Easy to implement and looks cool
  2. High Stakes: Use hybrid retrieval ± recommender systems
    • Minimize generation [1]
    • Focus on accurate information retrieval, then generate (a minimal sketch follows this list)
  3. Complex Reasoning or High Accuracy Needed: Use knowledge graphs or graph embeddings
    • Combine with hybrid retrieval, recommender systems, and minimal generation
    • Helps reduce hallucinations and improve accuracy
  4. Always Show Sources: Cite alongside generated data
    • Addresses concerns of hallucinations, omissions, and bias
    • Promotes transparency and verifiability
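
For heuristic 2 (and the source-citing in heuristic 4), here is a minimal sketch of hybrid retrieval: a lexical ranking and a dense ranking fused with reciprocal rank fusion, with retrieved passages returned alongside their source ids. It assumes the rank_bm25 package; the corpus, the ids, and the stand-in dense scorer are invented for illustration, and a real system would use an actual embedding model (and likely a reranker).

```python
# Hybrid retrieval sketch: lexical (BM25) + dense scores fused with reciprocal
# rank fusion (RRF). Passages come back with their source ids so the generation
# step can cite them (heuristic 4). The "dense" scorer below is a stand-in so
# the example stays self-contained; swap in a real embedding model in practice.

from rank_bm25 import BM25Okapi

corpus = {
    "pmid:111": "Metformin is a first-line therapy for type 2 diabetes.",
    "pmid:222": "GLP-1 receptor agonists promote weight loss in obese patients.",
    "pmid:333": "Statins reduce LDL cholesterol and cardiovascular risk.",
}

doc_ids = list(corpus)
bm25 = BM25Okapi([corpus[d].lower().split() for d in doc_ids])

def dense_scores(query: str) -> dict[str, float]:
    # Stand-in for cosine similarity between query and passage embeddings.
    q_terms = set(query.lower().split())
    return {d: len(q_terms & set(corpus[d].lower().split())) for d in doc_ids}

def rrf(rankings: list[list[str]], k: int = 60) -> dict[str, float]:
    # Reciprocal rank fusion: sum of 1 / (k + rank) across the rankings.
    fused: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            fused[doc] = fused.get(doc, 0.0) + 1.0 / (k + rank)
    return fused

def retrieve(query: str, top_n: int = 2) -> list[tuple[str, str]]:
    lexical = dict(zip(doc_ids, bm25.get_scores(query.lower().split())))
    dense = dense_scores(query)
    rankings = [sorted(d, key=d.get, reverse=True) for d in (lexical, dense)]
    fused = rrf(rankings)
    best = sorted(fused, key=fused.get, reverse=True)[:top_n]
    return [(d, corpus[d]) for d in best]

for doc_id, passage in retrieve("first line treatment for type 2 diabetes"):
    print(f"[{doc_id}] {passage}")  # show sources alongside whatever you generate
```

In practice, only the top passages (with their ids) go to whatever minimal generation step you run; the Pinecone and Neo4j guides in the references cover the production details.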

On low stakes

The stakes are (relatively) low when

  1. No serious problem arises from error:
    • This can be because the use itself is low stakes [2]
    • Or because you have mechanisms in place to verify information
  2. The user can easily identify errors
    • Via UI design,
    • Via expertise, or
    • Via directly examining and comparing the data

On using RAG

The most touted uses of RAG in medicine are question answering and paper or information summarization.

The state of the art in medical question answering, on closed, carefully selected datasets and using models that in all likelihood have had access to the test data, is around 80% accuracy. That means the probability of getting at least one wrong answer in 10 questions is around 89%. The median numbers are a whole lot worse. If you can’t afford the latest model to answer your questions or do the generation, imagine the possibility of error (or do the math, your call).
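
If you would rather see the arithmetic, here it is, assuming errors are roughly independent across questions:

```python
# Back-of-the-envelope check, assuming independent errors and 80% per-question accuracy.
accuracy, n_questions = 0.80, 10
p_at_least_one_error = 1 - accuracy ** n_questions
print(f"{p_at_least_one_error:.0%}")  # ~89%
```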

Stats for summarization are a whole lot worse, and a whole lot tougher to quantify. But if you want quick summaries to help decide what to focus on or what to read, this is probably pretty good.

Another low-stakes situation is when the data is small. Summarizing a couple of paragraphs is likely to produce fewer errors than summarizing a paper, or multiple papers at once. Context lengths are not created equal.

RAG that is generation heavy is going to be about as good as vanilla generation, which excels at boilerplate, ideation, and rephrasing the immensely common; responds well to carefully worded and structured prompts; is more accurate when generation is chunked; and is great for iterative interactions.
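
To make “chunked generation” concrete, here is a sketch of splitting text into small pieces and summarizing each one separately. The summarize_chunk function is a placeholder for whatever model call you actually use, and a word count stands in for real tokenization.

```python
# Chunked generation sketch: split a long document into pieces that each fit a
# small budget, summarize each piece, then stitch the summaries together.
# summarize_chunk is a placeholder for a real model call.

def chunk_text(text: str, max_words: int = 300) -> list[str]:
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

def summarize_chunk(chunk: str) -> str:
    # Placeholder: call your model of choice here and return its summary.
    return chunk[:120] + "..."

def summarize(text: str) -> str:
    return "\n".join(summarize_chunk(c) for c in chunk_text(text))

print(summarize("A long paper goes here. " * 200))
```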

The evidence for the awesomeness of graph embeddings and knowledge graphs in the LM context is not all that clear, but they have been shown to reduce hallucinations, and that is good enough for me.
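
To make the knowledge-graph option concrete, here is a toy sketch: facts live as labelled edges, retrieval is a neighborhood lookup, and generation is kept minimal by only verbalizing what was retrieved, sources attached. The networkx graph, the entities, and the source ids are all invented for illustration; a real deployment would use a graph database or graph embeddings, combined with hybrid retrieval as in heuristic 3.

```python
# Toy knowledge-graph retrieval: store facts as labelled edges, pull the triples
# around an entity, and keep generation minimal by only verbalizing what was
# retrieved. Entities, relations, and sources here are illustrative only.
import networkx as nx

kg = nx.MultiDiGraph()
kg.add_edge("metformin", "type 2 diabetes", relation="first_line_treatment_for", source="pmid:111")
kg.add_edge("metformin", "lactic acidosis", relation="rare_adverse_effect", source="pmid:444")
kg.add_edge("metformin", "severe renal impairment", relation="contraindicated_in", source="pmid:555")

def retrieve_triples(entity: str) -> list[tuple[str, str, str, str]]:
    # Every outgoing edge for the entity, along with its source.
    return [
        (entity, data["relation"], neighbor, data["source"])
        for _, neighbor, data in kg.out_edges(entity, data=True)
    ]

for head, relation, tail, source in retrieve_triples("metformin"):
    # "Minimal generation": verbalize the retrieved triple and cite it.
    print(f"{head} {relation.replace('_', ' ')} {tail} [{source}]")
```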

The above heuristics are for deployment in the wild, not for personal or research use.

References

Hybrid retrieval:

  1. Pinecone’s guide to search
  2. Neo4J’s guide to RAG
  3. Splade

Xiong, G., Jin, Q., Lu, Z., & Zhang, A. (2024). Benchmarking retrieval-augmented generation for medicine. arXiv preprint arXiv:2402.13178

Xiong, G., Jin, Q., Wang, X., Zhang, M., Lu, Z., & Zhang, A. (2024). Improving retrieval-augmented generation in medicine with iterative follow-up questions. arXiv preprint arXiv:2408.00727

Turpin, M., Michael, J., Perez, E., & Bowman, S. (2024). Language models don’t always say what they think: unfaithful explanations in chain-of-thought prompting. Advances in Neural Information Processing Systems, 36

Chen, J., Chen, L., Huang, H., & Zhou, T. (2023). When do you need chain-of-thought prompting for ChatGPT? arXiv preprint arXiv:2304.03262

Footnotes

  1. Chain-of-thought prompting is probably useless here; it can justify things in an authoritative-sounding way, which can confuse and waylay the user.↩︎

  2. I know, circular reasoning, but low stakes are easier to identify than define.↩︎