Open-ended survey coder with LLM

In quantitative research, open-ended questions are the uncomfortable ingredient. Everything else in the survey produces numbers directly: 1-to-10 scales, multiple choice, net promoter scores. But open-ended questions produce free text, and free text doesn’t fit into a regression or a crosstab.

To convert them into data, there’s a process called coding: someone reads all the responses, groups the ones that say the same thing, assigns a number to each group, and replaces the text with that number in the database. “Hellmanns”, “the yellow lid one”, and “hellmans (sic)” all become code 5, which corresponds to Hellmann’s.

It’s tedious, repetitive, and prone to consistency errors. When you’re dealing with thousands of responses to multiple questions across several parallel projects, it’s also expensive. I automated it.

The problem in all its complexity

Before writing a single line of code, I had to understand the problem properly. There are two fundamentally different situations:

When a codebook already exists. In longitudinal studies, panels, or multi-wave projects, someone already defined the categories in a previous round. There’s an Excel file — the codebook, or LDC — with a list of numeric keys and their canonical values. The task here is imputation: take each new response and map it to the right key. The main challenge is orthographic variability. “Cocca-cola”, “coca cola”, “CocaCola”, and “the black one” are all key 7.

When no codebook exists. In new studies or questions that have never been asked before, the categories need to be invented from scratch. The analyst doesn’t know in advance what they’ll find. First discover what categories emerge from the data, then assign each response to one, then generate a reusable codebook for future waves.

Both flows share infrastructure but have completely different logic.

A problem nobody talks about: the multi-value response

There’s a third problem that doesn’t appear in coding textbooks but shows up constantly in real data.

Imagine a question: “What qualities should a good football player have?” One respondent answers: “Speed, technique, and grit”. Another: “Attitude and commitment to the team”.

The first mentioned three things. The second mentioned two. But in the data file, both responses occupy a single cell. If we code that cell as a unit, we lose mentions. In a brand image or product attribute study, those lost mentions can change the results.

The solution is ad-hoc expansion: detect when a response contains multiple real mentions, split them, and distribute them into additional columns. The original column (Q23) keeps the first mention (top of mind), and new columns are created (Q23adhoc2, Q23adhoc3, etc.) for the rest. Detection can’t rely on regex alone — the final decision goes to the LLM.

The architecture

Multi-project and per-project configuration

The tool wasn’t designed for a single study. Each project has its own directory structure with data/, LDC/, REINFORCEMENTS/, and outputs/. Subprojects are detected automatically. If files follow the pattern PROJECT1.xlsx, PROJECT2.xlsx, it also asks for the wave number. Each project can have its own special codes configured in a central JSON.

The imputation flow

Reinforcements are an incremental learning mechanism. Every time the LLM maps a new variation (“cocca-cola” → Coca-Cola), that relationship is saved in a JSON file. Over time, the system needs fewer API calls because most variations are already cached.

The codification flow (no codebook)

The two-phase separation is deliberate. If done in a single step, the LLM tends to create too many specific categories or groups inconsistently across batches. By fixing the list first, the second step has a constant response space.

Frequencies as a quality signal

The LLM alone doesn’t always detect well what deserves its own category. If there are 500 distinct responses, “ICBC” might appear 5 times with 5 different spellings. Each one looks like a rare case; together they’re a relevant category. The solution was passing normalized frequencies alongside the values, and using them in the interactive review where the analyst sees the percentage each category represents.

Resilience: checkpoints

Processing a thousand responses means dozens of API calls. If anything fails midway, the state is saved in a Parquet file after each question, along with a JSON cache of already-classified values. The tool looks for existing checkpoints at startup and asks whether to continue from there.

Interactivity isn’t a defect, it’s the product

Full automation isn’t the goal. The goal is for the analyst to trust the result enough to defend it to the client. Before classifying, they can edit proposed categories. After classifying, they can review each category and move misassigned values. Before saving, they see a complete distribution.

What the LLM does is eliminate mechanical work. What the analyst does is make the decisions that require domain judgment.

Production results

A dataset with 1,500 responses to 4 open-ended questions that previously took 4 to 6 hours now takes 25 to 45 minutes: 15-25 of automatic processing and 10-20 of interactive review. Orthographic variant detection is more robust than human detection for non-obvious typos, and there’s no consistency drift across long sessions.

The code is on my GitHub.