Case Studies
Aug 8, 2024

An automatic ICD-10 medical dataset

ICD-10 extraction is a common and expensive process in healthcare. This article explores an innovative approach to generating synthetic ICD-10 medical coding data using large language models (LLMs). By creating realistic, contextually accurate synthetic transcripts of doctor-patient interactions, we can fine-tune models on the intricacies of different codes and power better extraction systems.

Project Overview

Deciphering medical data is often a tedious and challenging task, typically requiring a specialist such as an MD to manually label thousands of rows. However, LLMs can significantly reduce the need for expert human labeling. The key insight is that much of this labeling does not depend on fresh human judgment; it applies knowledge that is already well documented.

While we might never replicate the instinctual insights of a radiologist honed over 20 years of practice, we can teach AI to recognize the symptoms of long COVID or the diagnostic criteria for clinical depression. By leveraging well-encoded knowledge in medical domains, we can transform this knowledge into rows of example data. This helps train models to accurately extract procedures from clinical notes or identify symptoms from patient records.

We started by defining every ICD-10 medical code and gathering examples for each one. We then generated thousands of examples as they would appear in appointment transcripts, including both positive and negative instances. The resulting solution matches MDs in accuracy while scaling to volumes that manual labeling cannot reach.

Methodology

The core of our solution involved creating synthetic transcripts of doctor-patient interactions, each meticulously labeled with the appropriate ICD-10 codes. Here’s how we did it:

Knowledge Ingestion

We gathered a wide range of medical texts and resources, including lists of ICD-10 codes and diagnostic textbooks, to ensure comprehensive coverage of medical scenarios.
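
As a rough sketch of this step, ingestion can be as simple as loading the published code list into structured records and attaching reference excerpts to each code. The CSV layout, field names, and helper functions below are assumptions for illustration, not artifacts from the project.

```python
import csv
from dataclasses import dataclass, field

@dataclass
class ICD10Entry:
    code: str                                        # e.g. "S52.515A"
    description: str                                 # official long description
    notes: list[str] = field(default_factory=list)   # diagnostic criteria, textbook excerpts

def load_code_set(path: str) -> dict[str, ICD10Entry]:
    """Read a code/description CSV into a dict keyed by ICD-10 code."""
    entries: dict[str, ICD10Entry] = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            entries[row["code"]] = ICD10Entry(row["code"], row["description"])
    return entries

def attach_reference_text(entries: dict[str, ICD10Entry], code: str, excerpt: str) -> None:
    """Attach a textbook or diagnostic-criteria excerpt to a code for later prompting."""
    entries[code].notes.append(excerpt)
```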

Ground Truth Creation

Rather than extracting information from unstructured data, we created a new dataset from scratch based on the knowledge we gathered. Textbooks, diagnostic criteria, and operating procedures provided the information needed to create nearly perfect labeled transcripts of patients receiving care for any ICD-10 code.
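
A minimal sketch of how such labeled transcripts could be generated from the ingested knowledge follows, reusing the ICD10Entry records from the ingestion sketch. The `call_llm` stub stands in for whichever LLM client the pipeline actually uses, and the prompt templates are illustrative, not production prompts.

```python
def call_llm(prompt: str) -> str:
    """Placeholder for the LLM client actually used in the pipeline."""
    raise NotImplementedError("wire this up to your model provider of choice")

POSITIVE_TEMPLATE = (
    "Write a realistic doctor-patient encounter note. The presentation must "
    "satisfy the criteria for ICD-10 code {code} ({description}).\n"
    "Reference material:\n{notes}\n"
    "Return only the clinical note."
)

NEGATIVE_TEMPLATE = (
    "Write a realistic clinical note for a presentation that is similar to, "
    "but does NOT meet, the criteria for ICD-10 code {code} ({description}). "
    "Return only the clinical note."
)

def generate_examples(entry: ICD10Entry, n_positive: int = 5, n_negative: int = 5) -> list[dict]:
    """Produce labeled transcript rows (positive and negative) for one ICD-10 code."""
    rows = []
    for _ in range(n_positive):
        prompt = POSITIVE_TEMPLATE.format(
            code=entry.code, description=entry.description, notes="\n".join(entry.notes))
        rows.append({"transcript": call_llm(prompt), "code": entry.code, "label": "positive"})
    for _ in range(n_negative):
        prompt = NEGATIVE_TEMPLATE.format(code=entry.code, description=entry.description)
        rows.append({"transcript": call_llm(prompt), "code": entry.code, "label": "negative"})
    return rows
```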

Style and Distribution Matching

We transformed our data to mirror real-world examples of patient transcripts in tone, length, and style. This step ensured our data looked exactly like it would in real-world scenarios while maintaining the accuracy gained from generating it from scratch.
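
One way to implement this, assuming a small set of de-identified real notes is available for style only: compute surface statistics from the real notes, then ask the model to rewrite each synthetic note to match, reusing the `call_llm` placeholder from the previous sketch.

```python
import statistics

def style_profile(real_notes: list[str]) -> dict:
    """Summarize surface style (length, sectioning) without keeping clinical content."""
    lengths = [len(n.split()) for n in real_notes]
    return {
        "target_words": int(statistics.median(lengths)),
        "header_rate": sum(("Hx:" in n or "Plan:" in n) for n in real_notes) / len(real_notes),
    }

REWRITE_TEMPLATE = (
    "Rewrite this synthetic clinical note so it reads like a real chart note: "
    "about {target_words} words, terse shorthand, {headers} section headers "
    "(Hx, Exam, Dx, Plan). Keep every clinical fact unchanged.\n\n{note}"
)

def match_style(note: str, profile: dict) -> str:
    """Rewrite a synthetic note to mirror the tone, length, and layout of real notes."""
    headers = "with" if profile["header_rate"] > 0.5 else "without"
    return call_llm(REWRITE_TEMPLATE.format(
        target_words=profile["target_words"], headers=headers, note=note))
```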

Challenges and Solutions

We encountered several key technical challenges along the way:

  • Maintaining Medical Accuracy: Ensuring that synthetic data was as accurate and useful as real data required rigorous validation and continuous refinement of our pipeline (a minimal validation loop is sketched after this list). Properly ingesting available knowledge was crucial here.
  • Balancing Realism and Anonymity: Our data needed to be realistic enough for training purposes while ensuring no actual patient data was used. We achieved this by carefully curating training materials to transfer only style and format from real data, not actual content.
  • Model Bias: To prevent generating biased data, we incorporated bias detection and mitigation strategies throughout the development process.
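
To make the first bullet concrete, the sketch below shows one way such a validation loop could look; it is an assumption, not the project's actual validator. Every generated positive row is independently re-checked before it enters the training set, and codes left with thin coverage are flagged for regeneration.

```python
from collections import Counter

def verify_code_supported(transcript: str, code: str) -> bool:
    """Placeholder check: return True only if the note clearly supports the code.
    In practice this could be an LLM judge prompted with the diagnostic criteria."""
    raise NotImplementedError

def filter_and_audit(rows: list[dict], min_per_code: int = 3) -> list[dict]:
    """Drop positive rows that fail validation, then report codes with thin coverage."""
    kept = [
        r for r in rows
        if r["label"] == "negative" or verify_code_supported(r["transcript"], r["code"])
    ]
    per_code = Counter(r["code"] for r in kept if r["label"] == "positive")
    thin = sorted(c for c, n in per_code.items() if n < min_per_code)
    if thin:
        print(f"Regenerate positive examples for under-covered codes: {thin}")
    return kept
```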

Future Implications

The success of this project opens the door to numerous future applications:

  • Automatic ICD-10 Datasets: Every time CMS updates the ICD-10 code set (twice a year), we can programmatically generate rows of data that help a model embed that knowledge (see the sketch after this list). This makes training data easily accessible for any procedure, reducing the skew and performance bias toward the most common codes.
  • Medical Uses Outside of Billing: Many fields in medicine face barriers due to the lack of proper training data. From chronic illness diagnostic criteria to DSM classifications, we can turn any unstructured knowledge into data that a model can use.
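
A rough sketch of the automatic-dataset idea, reusing `load_code_set` and `generate_examples` from the earlier sketches (the file names are illustrative): diff the new release against the old one and regenerate data only for codes that were added or reworded.

```python
def codes_needing_data(old: dict[str, ICD10Entry], new: dict[str, ICD10Entry]) -> list[str]:
    """Return codes that are new in this release or whose descriptions changed."""
    changed = []
    for code, entry in new.items():
        previous = old.get(code)
        if previous is None or previous.description != entry.description:
            changed.append(code)
    return changed

# Hypothetical usage: regenerate rows only where the code set moved.
# old_set = load_code_set("icd10_2023.csv")
# new_set = load_code_set("icd10_2024.csv")
# for code in codes_needing_data(old_set, new_set):
#     dataset.extend(generate_examples(new_set[code]))
```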

Sample Data

Here’s a sample of the synthetic data generated during the project, demonstrating realistic artifacts and common issues encountered in real data interpretation:

Doctor's Transcript:

Date: 10/11/2023

Presenting problem: L wrist pain

Hx: Pt tripped over curb, landed on outstretched hand. Immediate pain/swelling. No LOC.

Exam: L wrist swollen, tender over radial styloid. No deformity. ROM limited by pain.

Imaging: X-ray shows nondisplaced fracture of radial styloid.

Dx: Nondisplaced fracture of L radial styloid process

Plan: Splint, pain management, ortho follow-up in 1 week. Educated patient on RICE protocol.

Expected ICD-10 Code:

S52.515A - Nondisplaced fracture of left radial styloid process, initial encounter for closed fracture.

There are a few key issues highlighted in this example:

  1. Inconsistent Abbreviations: Medical notes rely on shorthand such as "Pt" (patient), "Hx" (history), "Dx" (diagnosis), and "LOC" (loss of consciousness), and the same term may be abbreviated differently from note to note.
  2. Varied Data Formats: Real medical data can come in many formats, each with unique artifacts. This includes speech-to-text transcripts, OCR of handwritten notes, text converted from HTML, and data with headers or custom formats, all adding complexity and noise.
  3. Complexity and Noise: Notes are often sparsely detailed and convoluted. Our pipelines reproduce these issues when generating data, so models learn to handle incomplete information, transcription errors, and varying levels of detail.

This sample shows how our synthetic data includes realistic artifacts such as abbreviations, shorthand, and variability in data format, common in real-world medical data. This level of detail ensures the data is robust and valuable for training advanced medical models.
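
As an illustration of how artifacts like these might be injected into otherwise clean synthetic notes, the sketch below randomly abbreviates common terms and drops a small fraction of characters. The substitution table and noise rates are assumptions, not the project's actual settings.

```python
import random

# Illustrative shorthand table; a real pipeline would use a much larger,
# specialty-specific mapping.
ABBREVIATIONS = {
    "patient": "Pt",
    "history": "Hx",
    "diagnosis": "Dx",
    "fracture": "fx",
    "left": "L",
    "right": "R",
}

def add_artifacts(note: str, abbrev_rate: float = 0.7, drop_rate: float = 0.01,
                  seed: int | None = None) -> str:
    """Abbreviate terms at random and drop characters to mimic OCR/transcription noise."""
    rng = random.Random(seed)
    words = []
    for word in note.split():
        bare = word.strip(".,:;").lower()
        if bare in ABBREVIATIONS and rng.random() < abbrev_rate:
            short = ABBREVIATIONS[bare]
            word = word.replace(bare, short).replace(bare.capitalize(), short)
        words.append(word)
    noisy = " ".join(words)
    # Randomly drop ~1% of characters to imitate OCR of handwritten notes.
    return "".join(ch for ch in noisy if rng.random() > drop_rate)
```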

Get a demo

Learn how you can use better data to power training and evaluation today.
