An Introduction to Zelma

Our work to bring you accessible school assessment data

Sep 03, 2024

Welcome to the State Assessment Data Repository. In this post, we’ll give an overview of this project (for those of you who are new here). The next posts will share initial insights as states release their 2024 data, so stay tuned.

Very broadly, the goal of this project is to make state standardized assessment data accessible to as many users – parents, researchers, policymakers – as possible. We are doing that by cleaning and standardizing the raw data from individual states, and by providing an AI tool for visualization.

We’ll explain more below, but if you want to dive right in you can visit the tool at Zelma.ai and see and download all of the raw data on our Codebook & Downloads page.

A (Very!) Brief Background on State Assessment Data

In the U.S., the Every Student Succeeds Act (ESSA) requires that state departments of education administer academic assessments each year in English language arts (ELA), math, and science. ELA and math assessments are given to students in each of grades 3 through 8 (and at least once in high school), with fewer grade levels required for science. These tests are used for federal accountability and for states to track academic progress. Each state gives its own test, and each state reports performance outcomes at the state, district, and (sometimes) school level, overall and by student groups such as race/ethnicity.

State-level standardized testing in the U.S. is the most comprehensive view we have of student performance – it illustrates differences across schools, districts, and demographic groups. Because these tests are taken every year and by nearly all students, they provide an important piece of information about how our students are doing academically.

The problem is that the data files states post on their websites are often messy, incomplete, or hard to access. States might change their reporting format from the prior year, or post the results in dozens of different spreadsheets or tabs. It is not always clear if or how the assessment has changed from the year before. And sometimes information is simply not available from states, such as outcomes by race/ethnicity or outcomes by grade level (or both!).

Zelma: The Big Picture

To make better use of the incredible amount of school testing data available, and to overcome the data access issues described above, we created Zelma.

Zelma compiles all publicly available state assessment data from state departments of education for students in each of grades 3 through 8, for all available subjects (including ELA, math, science, social studies, and others), and for all available student groups (including race/ethnicity, gender, economic status, and others). Our goal is to make this data as transparent as possible, so that we can make better decisions for our kids.

Currently, Zelma includes data through Spring 2023 for all states and DC. In December, we will update the dataset with Spring 2024 data that states have made available up until that time.

Zelma Step 1: Data Cleaning

The first step of the Zelma project is to download, clean and document the raw data provided by states. More specifically, this involves the following process:

Data Access. We access/download the assessment data from state websites and data portals. We download all available spreadsheets and data files that states post, or download as many files as needed from data dashboards, if the state does not have an easy way to download all of the data at one time.

Data Documentation. We carefully review every data file from every state and every year. We document what variables and value labels are included in the files, and identify any gaps. We also document whether or not the state has changed any of their subject-area assessments compared to the prior year (sometimes this is an entirely new assessment, while other times the name remains the same but the state has changed how it defines proficiency, typically due to updated standards).

Data Cleaning and Review. For each state, we map the data onto our project’s standardized file format so that the data for all years and states have the same set of variable names, in the same file format. We integrate all years, subjects, and demographic data available. So while we cannot make the tests comparable across states (the tests themselves are different), we can make each state’s data easier to work with over time. The data files undergo a review process to ensure accuracy and consistency within each file and across years.

Data Requests. Many states do not publish data components that we believe should be publicly available – such as outcomes by race/ethnicity, outcomes by grade level, or the number of students who were tested. For all of the “data gaps” that we identify as part of the Data Documentation process, we submit public data requests or FOIA requests for this information. Note: None of the data received from states includes student-level information. All data follow the state’s data suppression guidelines.

NCES Components. Once we have the pieces we need, we incorporate key district and school characteristics from the National Center for Education Statistics (NCES), such as the IDs that are assigned to every public district and school in the U.S.

Posting Clean, Downloadable Files. If you’re a researcher and you want to use these data, we’ve got you covered with clean, downloadable files for all states. The Zelma dataset will be updated each December and June with new data, including the most recent spring assessment data available as well as new data received from data requests since the prior version. An exciting new feature is that users can now access the data via API as well!
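For readers who like to see things in code, here is a minimal sketch of what the standardization and NCES steps above might look like for a single row of state data. It is only an illustration: the column names, the `COLUMN_MAP` and `NCES_DISTRICT_IDS` lookups, the "XX" state, and the ID value are all hypothetical placeholders, not Zelma's actual codebook or pipeline.

```python
# Hypothetical mapping from one state's raw column headers to standardized
# variable names. These names are illustrative, not Zelma's actual codebook.
COLUMN_MAP = {
    "District Name": "district_name",
    "Tested Grade": "grade",
    "Test Subject": "subject",
    "Pct Proficient": "proficient_pct",
}

# Hypothetical NCES lookup keyed by (state, district name). The ID is a placeholder.
NCES_DISTRICT_IDS = {
    ("XX", "Example District"): "0000001",
}


def standardize_row(raw_row, state, year):
    """Rename raw columns to the standardized names and tag the row with state and year."""
    row = {"state": state, "year": year}
    for raw_name, value in raw_row.items():
        std_name = COLUMN_MAP.get(raw_name.strip())
        if std_name is not None:  # columns outside the standard format are dropped
            row[std_name] = value.strip()
    return row


def attach_nces_id(row):
    """Add the NCES district ID, or None when the district is not in the lookup."""
    key = (row["state"], row.get("district_name"))
    return {**row, "nces_district_id": NCES_DISTRICT_IDS.get(key)}


# One made-up raw row, with stray whitespace and an extra column.
raw = {
    "District Name": "Example District ",
    "Tested Grade": "3",
    "Test Subject": "ELA",
    "Pct Proficient": "41.2",
    "Notes": "",
}
clean = attach_nces_id(standardize_row(raw, state="XX", year=2023))
```

Because every state's file ends up with the same variable names, the same downstream code can read any state and year. Districts that fail to match the lookup come back with `None` rather than a guessed ID, so they can be reviewed by hand.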

Zelma Step 2: An AI Tool

As clean as the data files are, we realized that to make insights from these data truly accessible, we needed to do more than provide data files. We needed a way for everyone to more easily interact with the data, especially if they don’t have time to use Excel formulas or write statistical code for analyses. We wanted to help bring the data to life for a broad range of users, including parents, education leaders, school board members, and policymakers.

This is where Zelma.ai comes in. We partnered with Novy.ai to build a ChatGPT-powered AI tool that can generate data insights. Zelma is designed to write code to query the database, based on your questions. If you ask, for example, “What are the top 5 school districts in Mississippi in ELA in 2023?”, Zelma will return a graph of the results.

Zelma can write SQL code to query only the test score database. That means it can’t make up test scores or proficiency rates, and it can’t incorporate other data, such as graduation rates or community employment rates. Underneath every figure, we have posted the SQL code that Zelma is using to show you the results, so that you can see exactly what it is using to generate each figure or table.
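To give a feel for what that SQL might look like, here is a small sketch of a query for the Mississippi question above, run against a toy in-memory database. Everything in it is an assumption for illustration: the `district_scores` table, its column names, and the choice of proficiency rate as the ranking measure are hypothetical, and the rows are invented placeholders rather than real results. The SQL Zelma actually writes will differ.

```python
import sqlite3

# Toy in-memory database with a hypothetical schema (not Zelma's actual one).
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE district_scores (
           state TEXT, district_name TEXT, subject TEXT,
           year INTEGER, proficient_pct REAL)"""
)

# Placeholder rows with invented values, only so the query has something to return.
conn.executemany(
    "INSERT INTO district_scores VALUES (?, ?, ?, ?, ?)",
    [
        ("MS", "District A", "ELA", 2023, 61.0),
        ("MS", "District B", "ELA", 2023, 48.5),
        ("MS", "District C", "ELA", 2023, 72.5),
        ("MS", "District D", "ELA", 2023, 55.0),
        ("MS", "District E", "ELA", 2023, 66.0),
        ("MS", "District F", "ELA", 2023, 39.0),
        ("MS", "District C", "Math", 2023, 80.0),  # wrong subject, filtered out
        ("AL", "District Z", "ELA", 2023, 90.0),   # wrong state, filtered out
        ("MS", "District G", "ELA", 2022, 95.0),   # wrong year, filtered out
    ],
)

# Rank districts by proficiency rate and keep the five highest.
TOP_DISTRICTS_SQL = """
    SELECT district_name, proficient_pct
    FROM district_scores
    WHERE state = ? AND subject = ? AND year = ?
      AND proficient_pct IS NOT NULL
    ORDER BY proficient_pct DESC
    LIMIT 5
"""
top5 = conn.execute(TOP_DISTRICTS_SQL, ("MS", "ELA", 2023)).fetchall()
```

The query can only return rows that exist in the table, which is the point of restricting the tool to SQL over the test score database: a district with no row simply does not appear.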

Please note that states often need to change their state assessments, or refrain from publishing results in a given year and subject if they are field testing a new test. In these cases, data are not available and you will see gaps in the visualizations. We have done our best to capture these changes, which affect comparability over time. You’ll see them listed below each figure under “Notable Events.”

As we are all learning with AI prompts, it is important to be as specific as possible when asking Zelma to return results to you. Are you interested in a specific year or in trends over time? What subject? What state? Would you prefer results in a bar graph, line graph, or table? We hope Zelma will have answers to your questions, so please try! If you have any trouble with your query, feel free to email us at zelmadata@gmail.com.

Limitations

We’d like to share some of the limitations of this database and the Zelma AI tool.

There are still data gaps! While we have done everything we can to get as much data as possible (and are doing so at this very moment!), some states may continue to have missing information by race/ethnicity or grade level. One state told us that it couldn’t give us the counts of students who took its test (even the total for the state) due to privacy concerns. Some of these limitations mean that the queries may not be able to produce a figure for all subjects for all years. If your state is missing data, please email us at zelmadata@gmail.com. We welcome help in pushing states to provide it!

Data suppression. For privacy reasons, all states suppress numbers for some small groups, and the degree of this varies. We cannot avoid this when working with public data, but we try to make this as evident as possible in our data files by using separate conventions for data that are suppressed (*) versus data that are just not available or missing (--).

Grades 3-8 only. The database covers assessment data for grades 3-8. It does not (currently) include additional variables, such as high school outcomes or SAT scores. So queries for this information will unfortunately not return any usable results!
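If you are working with the downloadable files directly, the suppression convention described above matters in practice: a suppressed value (*) and a missing value (--) are both non-numeric, but they mean different things. Here is a minimal sketch of one way to keep them apart when reading a column of values. The `Cell` type and function names are hypothetical, and treating an empty string as missing is our assumption for this example, not a documented rule.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

SUPPRESSED = "*"  # value withheld by the state for privacy
MISSING = "--"    # value not available or missing


@dataclass(frozen=True)
class Cell:
    value: Optional[float]
    status: str  # "reported", "suppressed", or "missing"


def parse_cell(raw: str) -> Cell:
    """Turn one raw cell into a number plus a status, keeping * and -- distinct."""
    text = raw.strip()
    if text == SUPPRESSED:
        return Cell(None, "suppressed")
    if text == MISSING or text == "":  # empty-as-missing is an assumption
        return Cell(None, "missing")
    return Cell(float(text), "reported")


def status_counts(raw_values):
    """Count how many cells were reported, suppressed, or missing."""
    return Counter(parse_cell(v).status for v in raw_values)


# Made-up column of values for illustration.
counts = status_counts(["40.0", "*", "--", "60.5"])
```

Keeping the status alongside the value lets you report, for example, how many student groups were suppressed for privacy versus never published at all, instead of lumping both into a single blank.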

Do you want to help?

We are always looking to improve and enhance our data and our Zelma AI tool (the API feature is our newest development!). As we continue to gather ideas and plan for the future, we would like to hear from you! What would you like to see from Zelma? Please share your feedback, the good and the bad, here or by email at zelmadata@gmail.com.

And if you’d like help working with the data, or would like to connect with our team, please email us at zelmadata@gmail.com. Our goal is to make this data work for everyone.

Also, if you are a Delaware resident and are interested in this project, please email us at zelmadata@gmail.com. We could use your help submitting a data request!

Team

Our Zelma team consists of Emily Oster as the PI/Executive Director, Clare Halloran as the Project Director, and a team of wonderful research assistants at Brown University.

We are also grateful for financial support from Brown University, the Walton Family Foundation, Novy and OpenAI.

EDC State Assessment Data Substack
