Creating data dictionary drafts using Claude
For years I have been manually creating data dictionaries before building data collection instruments so that I can plan how I want to name variables, code values, assign variable types, and so forth. I’ve talked about this process in my book, in blog posts, and pretty much anywhere I can. It’s often a very manual process, but I find the time well spent because the quality of collected data greatly improves when this process is used.
I’ve been slow to integrate AI into this process because I appreciate the exploratory nature of building these manually. However, I recently decided to start experimenting with how I might integrate AI in a useful way, and as I’ve been testing it out, I’m finding that it is a huge time saver. I can have AI do a lot of the heavy lifting that doesn’t require much thought (e.g., manually entering item wording and value response options), but then still allow myself the ability to think through how I want my final data set to look for a specific project (i.e., choosing variable names, choosing value codes, thinking through necessary transformations).
This blog post walks through an example of the types of documentation and prompts you can provide to Claude to get a working draft of a data dictionary that can be used to then guide the development of your data collection tool (in an instrument like Qualtrics or Google Forms), and can also be used to guide your data cleaning and validation process after data is collected, as well as shared alongside your final data set for future interpretation.
Shortly after I started testing out this process, I came across a blog post from Britt DeVries where she demonstrates how she used ChatGPT 4.0 to do something similar. Her blog post is especially interesting because it is geared towards people who collect data using REDCap. Using REDCap you can create data collection instruments in one of two ways.
- The first way is the traditional one where you build a form using the GUI (hopefully following the guidelines that you have laid out for yourself in a data dictionary - naming variables the way you want them rather than using default names, and so forth).
- However, one of the nice features of REDCap is that you can also create (or modify) a form by uploading a completed data dictionary. Here, REDCap provides a CSV Data Dictionary template that must be completed in a very specific way. Once you complete the template, you can upload it and REDCap will create a draft of your instrument from that information.
With that said, creating a draft of a REDCap data dictionary using AI is really powerful because it then makes the process of creating a data collection instrument that much quicker. I also wonder if using the standardized REDCap data dictionary template allows AI to more accurately understand how to input information into columns. You’ll notice in Britt’s post that without prompting it to do so, ChatGPT added the value labels and codes using the REDCap comma convention (0, Female | 1, Male) as opposed to using something like equal signs.
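For context, a row of a REDCap data dictionary CSV looks roughly like this (columns trimmed for readability and the item is hypothetical; see REDCap’s own template for the full set of required column headers):

```
variable_name,form_name,field_type,field_label,choices
gender,demographics,radio,"What is your gender?","0, Female | 1, Male"
```

Note the choices column using the comma convention between code and label, with a pipe between response options.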
However, as someone who occasionally uses REDCap but mostly uses other tools, I am interested in creating a data dictionary that can be used more generally. Sometimes I may want to create a data dictionary to guide my process of building a survey in a tool like Qualtrics, other times I may be building a data dictionary to guide my cleaning process of an already collected data set.
In this post I am interested in seeing if Claude can build a data dictionary using a template that I provide for my current needs, rather than the standardized REDCap one.
The task
For the purpose of this blog post, I created a sample teacher survey draft using a variety of question types to gauge how well AI could interpret each one. I included a few questions to understand who is completing the survey and to gather background information on the participant (including open text, multiple choice, and select-all items), and then I included items from the Perceived Stress Scale, an open-access scale that is free to use. That scale involves both reverse scoring and summary scores, and I wanted to see how Claude would handle each.
Generating a data dictionary
Before starting a session, I weighed two options for asking Claude to create this data dictionary.
- I could give very specific prompts about how I want the data dictionary to be organized (e.g., how to name variables, code variables, etc)
- Or I could provide an example of a completed data dictionary and ask Claude to model it, like the one you see on the second tab of this template
I chose the first option for this test (Data Dictionary Template). Note that in hindsight I would’ve provided a CSV formatted template, rather than XLSX, for better interoperability.
Starting a fresh chat with Claude Sonnet 4.6, I provided the following prompt:
> I have a survey in Word format and a template data dictionary in Excel. Can you draft a data dictionary for the survey using the template? For the “name” column, use a snake case naming convention, start all variables with “t_” to denote this is a teacher survey, and use naming conventions that clearly convey what the variable represents. If the variable comes from a scale, use a short abbreviation for the scale and the item number. For the “values” column use the pipe followed by a space as the delimiter between response options. Can you name the data dictionary “tch_svy_data_dictionary”?
With my simple prompt, Claude produced a really great first draft of a data dictionary in no time.
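To make those conventions concrete, a row in the resulting dictionary might look something like this (a hypothetical item, with columns simplified; my actual template has more columns, and the exact code/label format for the “values” column was left to Claude):

```
name,label,type,values
t_gender,"What is your gender?",numeric,"1 = Female | 2 = Male | 3 = Gender not listed here"
```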
At this point I could probably go ahead and use this to guide my process of creating a survey form in a tool like Qualtrics. This dictionary would guide how I name variables, code values and so forth. However, because I tend to use the data dictionary to guide my entire data management process (including future cleaning), I also want to add in additional recodes and sum scores that will be created after data is collected (during the cleaning process). So I provided Claude with documentation on the Perceived Stress Scale so that it could add in the reverse coded and total score items that will be created after the data is collected.
> Great! Using this Perceived Stress Scale Documentation, can you add in additional variables for the reverse coded items? Name those items by adding “_r” to the end of the variable names and add information in the “transformations” column. Can you put those variables below all of the other variables? And then can you also add a variable for the total sum score that is calculated using the 10 items from the scale?
Again, it did an excellent job filling in that information.
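The transformations documented in the dictionary will eventually be implemented during cleaning. As a sketch of what that looks like (hypothetical column names following the “t_” plus scale abbreviation convention from my prompt, and assuming the standard PSS-10 0–4 coding, where the four positively worded items are reverse scored):

```python
import pandas as pd

# PSS-10 items are coded 0-4; items 4, 5, 7, and 8 are reverse scored
reverse_items = [4, 5, 7, 8]

# Toy data standing in for collected survey responses
df = pd.DataFrame({f"t_pss_{i}": [0, 2, 4] for i in range(1, 11)})

# Reverse code: on a 0-4 scale, the reversed value is 4 minus the original
for i in reverse_items:
    df[f"t_pss_{i}_r"] = 4 - df[f"t_pss_{i}"]

# Total score: sum the 6 original items and the 4 reverse coded items
score_cols = [
    f"t_pss_{i}_r" if i in reverse_items else f"t_pss_{i}"
    for i in range(1, 11)
]
df["t_pss_total"] = df[score_cols].sum(axis=1)
```

Having these steps spelled out in the “transformations” column means any future cleaning script can be written (and checked) directly against the dictionary.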
Last, I wanted to confirm that the way the Perceived Stress Scale items were worded and coded in the survey draft matched the PSS documentation. Many times someone provides me with a survey draft and they either don’t realize they’ve miscoded something or that an item’s wording is incorrect. Or sometimes they purposefully revise items to meet the needs of their project, but I’m not aware of that and would like to note it in the data dictionary.
> Looks good. Using that same PSS documentation, can you confirm that the PSS items in the survey are correctly worded? And that the response options are accurate and coded correctly?
In the response, Claude caught that Items 3 and 9 were worded slightly differently across the two documents. It also noted that the anchors differed: “In the past month” vs. “In the last month”. These issues were very minor, so I decided to leave the survey as is.
However, Claude did miss that other items were also worded slightly differently (Items 4 and 6). So even if you feel good about the state of the data dictionary, it is still very important to validate things yourself so you can catch anything Claude got wrong or missed.
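This kind of wording check can also be partly scripted, which is a useful backstop when AI misses a discrepancy. A minimal sketch using Python’s standard library (the item wordings below are illustrative, paraphrasing the anchor difference Claude flagged):

```python
import difflib

# Hypothetical item wordings pulled from the survey draft and the PSS documentation
survey = "In the past month, how often have you felt nervous and stressed?"
docs = "In the last month, how often have you felt nervous and stressed?"

# Overall similarity between the two wordings
ratio = difflib.SequenceMatcher(None, survey, docs).ratio()
print(f"similarity: {ratio:.2f}")

# Word-level diff: lines starting with "-" or "+" flag the mismatched words
for d in difflib.ndiff(survey.split(), docs.split()):
    if d[0] in "+-":
        print(d)
```

Running a comparison like this over every scale item (rather than trusting a single AI pass) would have caught the Item 4 and 6 differences as well.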
Cleaning up the dictionary
While I could have continued adding prompts to clean up the dictionary, at this point it was a very good first draft and with a few manual changes, I had it exactly where I wanted it. Besides just updating the fonts and spacing, I made a few additional changes. I added in unique study IDs that would eventually be added in to replace identifying information (participant and school names), and I updated a few other small things. All changes are highlighted in yellow.
Considerations
Ultimately, I think this is a huge time saver. As someone who has always manually entered this information, even just having Claude enter the labels and values alone can save me tons of time. I think to get the best possible outcome, it’s important to:
- Provide really thorough prompts (including information about how you want the dictionary styled, what are your conventions, and so forth). I’m certain my prompt could be greatly improved.
  - Again, you could instead provide an example, or even an existing Style Guide if you have one
- Uploading documentation/citations for existing measures is a great way to validate any manual drafts of instruments that you’ve created or been given (e.g., see what is different across the forms, what is missing, etc.)
- Always validate that what AI has provided is accurate. This includes checking labels, values, and transformations for all items in the data dictionary.
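Some of that validation can be scripted against your own conventions. As a sketch (hypothetical rows and column names; this only checks the naming convention from my prompt, not labels or values):

```python
import re

# Toy dictionary rows; "TeacherAge" intentionally violates the convention
rows = [
    {"name": "t_survey_date", "type": "date"},
    {"name": "t_gender", "type": "numeric"},
    {"name": "TeacherAge", "type": "numeric"},
]

# Convention from the prompt: snake_case, starting with the "t_" prefix
name_ok = re.compile(r"^t_[a-z0-9]+(_[a-z0-9]+)*$")

problems = [r["name"] for r in rows if not name_ok.match(r["name"])]
print(problems)  # → ['TeacherAge']
```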
Also make sure to consider any legal and ethical consequences of using AI. A few things that come to mind in my work: make sure you have permission to provide information to an LLM (e.g., consider proprietary information/copyright issues for some measures and scales), and never provide identifiable information in this process.
Creating a Qualtrics import file
Similar to REDCap, you can import a file into Qualtrics to build a draft of your data collection instrument, as opposed to using the GUI. However, I would not say Qualtrics intends for you to create surveys that way. The import files are not nearly as user friendly to create (see their documentation on creating the TXT, Word, or QSF files). And from what I understand, the import feature is not intended for building surveys from scratch, but rather as a way to export an existing Qualtrics instrument and share it with someone who can then import it for their own project.
However, with that said, I thought it would be worth a try to ask Claude to create a Qualtrics import file using my cleaned up data dictionary. I started an entirely new discussion for this:
> Can you create a TXT import file for Qualtrics from this data dictionary? Only include items where the “origin” column says “Teacher Survey”. Make sure all of the variables are named the same as the data dictionary and items are recoded to match the data dictionary. Also add any validation to make sure that items are restricted to their variable type. For example, t_survey_date is a date, t_age is an integer, t_email is an email. Last, please make sure that t_gender_other is provided as a text box below the “Gender not listed here” option in the t_gender question, not a separate item.
From there it provided me a TXT file. When I attempted to import that file into Qualtrics, I received an error. After telling Claude about the error message, I went through two more iterations of the TXT file; on the third try it created a usable file. Claude also informed me that the validation features I wanted were not possible using the TXT format, so I would need to add those manually in Qualtrics.
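For reference, Qualtrics’ advanced TXT import format uses bracketed tags to declare question types, variable IDs, and choices. A minimal sketch (the items here are hypothetical, not the actual file Claude produced; check Qualtrics’ import documentation for the full tag set):

```
[[AdvancedFormat]]

[[Question:TE]]
[[ID:t_first_name]]
What is your first name?

[[Question:MC:SingleAnswer]]
[[ID:t_gender]]
What is your gender?
[[Choices]]
Female
Male
Gender not listed here
```

The `[[ID:...]]` tags are what let the imported survey keep the variable names from the data dictionary.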
I imported the file and it created a workable draft of a survey. While it was nowhere near perfect (e.g., it left an empty question at the beginning of the survey, it used paragraph boxes for short answer questions, and it provided a fake text box for t_gender_other that you can’t actually type into), it did get several things right, including the correct variable names and value codes for response options. All in all, it was a great working draft that, even with the editing that is needed, has the potential to save you time.
*(Side-by-side screenshots: the TXT import file and the resulting Qualtrics form.)*
Side note: In my session, Claude did mention that advanced features, such as validation, are available in the QSF format, but that the QSF format is much more complicated to manually create. I tried many times, in other sessions, to create the QSF file (including all of the features I wanted), and every file Claude created failed to import.
All materials used in this example can be accessed in this GitHub Repository.
- Claude chat transcript to create data dictionary
- Teacher Survey Draft
- Data Dictionary Template
- PSS Documentation
- All versions of the data dictionary created throughout the session. I renamed each iteration of the data dictionary output with a version number because Claude kept saving over every version.
- Claude chat transcript to create Qualtrics import file
- Qualtrics TXT Import file
- Link to the Qualtrics survey draft
Huge thank you to Britt DeVries whose 2024 blog post heavily influenced how I organized this post. And also thank you to Sara Hart for having several exchanges with me about this data dictionary/instrument creation process while I was testing it all out and for being my brilliant thought partner!
References:
Anthropic. (2025). Claude Sonnet 4.5 [Large language model]. https://claude.ai
Cohen, S., Kamarck, T., & Mermelstein, R. (1983). A Global Measure of Perceived Stress. Journal of Health and Social Behavior, 24(4), 385–396. https://doi.org/10.2307/2136404
DeVries, B. (2024, November 23). Creating REDCap data dictionaries using ChatGPT 4.0. Medium. https://medium.com/@brittdev31/creating-redcap-data-dictionaries-using-chatgpt-4-0-1a67c5a934f4
Citation
@online{lewis2026,
author = {Lewis, Crystal},
title = {Creating Data Dictionary Drafts Using {Claude}},
date = {2026-04-03},
url = {https://cghlewis.com/blog/claude_dictionary/},
langid = {en}
}





